In Part 1 of my FMA Quick Note, I covered the FMA architecture, event types, event names and some examples. This part covers the CLI commands and how to use them.
FMA Quick Note (Part 2) CLI Commands
fmd (1M) – daemon that receives telemetry information relating to problems, diagnoses them, and initiates proactive self-healing activities
The fmd daemon also works with diagnosis engine plug-ins. A plug-in diagnoses a particular piece of hardware that has previously sent an error report, to find out whether the hardware is really faulty. If it is, a fault event is raised and a case (a set of correlated events) is opened. The diagnosis engine plug-ins are located in these directories:
/usr/platform/`uname -i`/lib/fm/fmd/plugins
/usr/platform/`uname -m`/lib/fm/fmd/plugins
/usr/lib/fm/fmd/plugins (generic)
The directories are searched in the order listed above, so platform-specific modules are loaded before generic modules.
fmadm (1M) – utility to view and modify system configuration parameters maintained by the fmd daemon
fmstat (1M) – utility to report statistics for the fmd daemon and its associated modules
fmdump (1M) – utility to display the contents of the fault and error logs
Diagnosis engine plug-ins example for Sun Fire T2000
The Sun Fire T2000 does not have the plug-ins directory /usr/platform/`uname -i`/lib/fm/fmd/plugins
Instead, it has /usr/platform/`uname -m`/lib/fm/fmd/plugins (specific plug-ins) and /usr/lib/fm/fmd/plugins (generic plug-ins)
Note:
uname -i shows SUNW,Sun-Fire-T2000
uname -m shows sun4v
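To make the search order concrete, here is a minimal sketch that prints the three plug-in directories in the order described earlier. The function name fmd_plugin_dirs is my own and is not part of FMA; it only builds the path names and does not ask fmd anything.

```shell
# Print the fmd plug-in directories in search order:
# platform-specific (uname -i), machine-specific (uname -m), then generic.
# Both arguments are optional and default to the uname output of this host.
# fmd_plugin_dirs is a hypothetical helper name, not an FMA command.
fmd_plugin_dirs() {
    platform=${1:-$(uname -i)}   # e.g. SUNW,Sun-Fire-T2000
    machine=${2:-$(uname -m)}    # e.g. sun4v
    printf '%s\n' \
        "/usr/platform/$platform/lib/fm/fmd/plugins" \
        "/usr/platform/$machine/lib/fm/fmd/plugins" \
        "/usr/lib/fm/fmd/plugins"
}

# Example: show only the directories that exist on this machine.
# fmd_plugin_dirs | while read -r d; do [ -d "$d" ] && echo "$d"; done
```

On the T2000 the commented-out loop would be expected to skip the first directory, since it does not exist there.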
fmd plug-ins on T2000 (specific)
# ls -l /usr/platform/sun4v/lib/fm/fmd/plugins
total 456
-rw-r--r-- 1 root bin 195 Aug 23 2006 cpumem-diagnosis.conf
-r-xr-xr-x 1 root bin 109480 Oct 14 2006 cpumem-diagnosis.so
-rw-r--r-- 1 root bin 473 Oct 4 2006 cpumem-retire.conf
-r-xr-xr-x 1 root bin 31584 Oct 14 2006 cpumem-retire.so
-rw-r--r-- 1 root bin 240 Oct 4 2006 etm.conf
-r-xr-xr-x 1 root bin 75368 Oct 14 2006 etm.so
fmd plug-ins on T2000 (generic)
# ls -l /usr/lib/fm/fmd/plugins
total 1008
-rw-r--r-- 1 root bin 411 Jan 22 2005 cpumem-retire.conf
-r-xr-xr-x 1 root bin 30772 Oct 14 2006 cpumem-retire.so
-rw-r--r-- 1 root bin 1679 Aug 23 2006 eft.conf
-r-xr-xr-x 1 root bin 311588 Sep 6 2006 eft.so
-rw-r--r-- 1 root bin 315 Jan 22 2005 io-retire.conf
-r-xr-xr-x 1 root bin 15924 Sep 6 2006 io-retire.so
-rw-r--r-- 1 root bin 149 Aug 23 2006 ip-transport.conf
-r-xr-xr-x 1 root bin 36768 Sep 6 2006 ip-transport.so
-rw-r--r-- 1 root bin 190 Jul 11 2006 snmp-trapgen.conf
-r-xr-xr-x 1 root bin 32356 Jul 22 2006 snmp-trapgen.so
-rw-r--r-- 1 root bin 864 Jan 22 2005 syslog-msgs.conf
-r-xr-xr-x 1 root bin 21356 Sep 6 2006 syslog-msgs.so
-rw-r--r-- 1 root bin 309 Aug 23 2006 zfs-diagnosis.conf
-r-xr-xr-x 1 root bin 19984 Sep 6 2006 zfs-diagnosis.so
-rw-r--r-- 1 root bin 234 Oct 17 2006 zfs-retire.conf
-r-xr-xr-x 1 root bin 18844 Oct 19 2006 zfs-retire.so
Note that the cpumem-retire plug-in appears in both the specific and the generic plug-ins directories. Because specific plug-ins are loaded before generic ones, in this case ONLY the cpumem-retire module in /usr/platform/sun4v/lib/fm/fmd/plugins is loaded.
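The "first directory wins" rule can be modelled with a few lines of shell. This is only a sketch of the behaviour described above, not how fmd itself is implemented; first_plugin is a made-up helper name, and it simply checks the directories in the order you pass them.

```shell
# Print the path of the first <module>.so found in the given directories.
# Directories are checked in the order they are passed, which mirrors the
# specific-before-generic search order. first_plugin is a hypothetical name.
first_plugin() {
    module=$1
    shift
    for dir in "$@"; do
        if [ -f "$dir/$module.so" ]; then
            printf '%s\n' "$dir/$module.so"
            return 0
        fi
    done
    return 1   # module not present in any of the directories
}

# Example on a T2000:
# first_plugin cpumem-retire /usr/platform/sun4v/lib/fm/fmd/plugins /usr/lib/fm/fmd/plugins
```

With the two listings above, the example call would report the copy under /usr/platform/sun4v, while a module such as eft would resolve to the generic directory.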
Viewing the FMA data using fmadm
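fmadm subcommands (FM stands for Fault Manager)
config – show the Fault Manager configuration (the loaded modules)
faulty – list the resources currently considered faulty
flush – flush the cached information for a resource
load – load a Fault Manager module
repair – record that a resource has been repaired
reset [-s serd] – reset a module, or one of its SERD engines with -s
rotate – schedule rotation of a log file (errlog or fltlog)
unload – unload a module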
Show all modules that are currently loaded
# fmadm config
MODULE VERSION STATUS DESCRIPTION
cpumem-diagnosis 1.5 active CPU/Memory Diagnosis
cpumem-retire 1.1 active CPU/Memory Retire Agent
eft 1.16 active eft diagnosis engine
etm 1.0 active FMA Event Transport Module
fmd-self-diagnosis 1.0 active Fault Manager Self-Diagnosis
io-retire 1.0 active I/O Retire Agent
snmp-trapgen 1.0 active SNMP Trap Generation Agent
sysevent-transport 1.0 active SysEvent Transport Agent
syslog-msgs 1.0 active Syslog Messaging Agent
zfs-diagnosis 1.0 active ZFS Diagnosis Engine
zfs-retire 1.0 active ZFS Retire Agent
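When the module list gets long, it can help to filter it. The sketch below reads `fmadm config` output on stdin and prints any module whose STATUS column is something other than "active". It assumes the four-column layout shown in the listing above; inactive_modules is my own helper name.

```shell
# Read `fmadm config` output on stdin and print "module status" for every
# module whose STATUS (third column) is not "active". The header is skipped.
# Assumes the MODULE VERSION STATUS DESCRIPTION layout shown above.
inactive_modules() {
    awk 'NR > 1 && $3 != "active" { print $1, $3 }'
}

# Example:
# fmadm config | inactive_modules
```

For the listing above it prints nothing, because every module is active.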
Show faulty components
# fmadm faulty
STATE RESOURCE / UUID
------------------------------------------------------------------
faulted mem:///component=Slot%20B%3A%20J3000
dbdc7f15-848c-cbdc-b47f-deb9d9fff5c9
------------------------------------------------------------------
Mark the faulty component as repaired (after the physical repair), then check the faulty list again
# fmadm repair mem:///component=Slot%20B%3A%20J3000
fmadm: recorded repair to mem:///component=Slot%20B%3A%20J3000
# fmadm faulty
STATE RESOURCE / UUID
------------------------------------------------------------------
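If you want just the resource names, for example to feed them to `fmadm repair` one at a time, a small filter is enough. This sketch assumes the `fmadm faulty` layout shown above (a state word followed by the resource FMRI, with the UUID on its own line); other Solaris releases may print this differently. faulty_resources is a hypothetical helper name.

```shell
# Read `fmadm faulty` output on stdin and print one resource FMRI per line.
# A resource line has exactly two fields: a state (e.g. "faulted") and an
# FMRI that starts with a scheme such as "mem:" or "cpu:". Header, separator
# and UUID lines do not match and are ignored.
faulty_resources() {
    awk 'NF == 2 && $2 ~ /^[a-z]+:/ { print $2 }'
}

# Example:
# fmadm faulty | faulty_resources
```

An empty result means no resource is currently listed as faulty, as in the second `fmadm faulty` run above.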
Show the FMA statistics
# fmstat
module ev_recv ev_acpt wait svc_t %w %b open solve memsz bufsz
cpumem-diagnosis 4 4 0.0 262.8 0 0 1 1 308b 356b
cpumem-retire 1 0 0.0 66.6 0 0 0 0 0 0
fmd-self-diagnosis 0 0 0.0 0.0 0 0 0 0 0 0
syslog-msgs 1 0 0.0 0.1 0 0 0 0 32b 0
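The "open" column is the one I usually look at first, since it counts the cases a module still has open. Here is a sketch that picks out the modules with open cases. It finds the column by its header name rather than by position, and it assumes the header line is the first line of input, as in the listing above. modules_with_open_cases is my own name for it.

```shell
# Read `fmstat` output on stdin and print "module open" for each module
# whose "open" column is greater than zero. The column is located by name
# in the header line, so the filter does not depend on its position.
modules_with_open_cases() {
    awk '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "open") col = i; next }
        col && $col + 0 > 0 { print $1, $col }
    '
}

# Example:
# fmstat | modules_with_open_cases
```

For the sample output above, this prints `cpumem-diagnosis 1`.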
Show module statistics
# fmstat -m cpumem-retire
NAME VALUE DESCRIPTION
auto_flts 0 auto-close faults received
bad_flts 0 invalid fault events received
cpu_blfails 0 failed cpu blacklists
cpu_blsupp 0 cpu blacklists suppressed
cpu_fails 0 cpu faults unresolveable
cpu_flts 0 cpu faults resolved
cpu_supp 0 cpu offlines suppressed
nop_flts 0 inapplicable fault events received
page_fails 0 page faults unresolveable
page_flts 0 page faults resolved
page_nonent 0 retires for non-existent fmris
page_supp 0 page retires suppressed
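Most of these counters are zero on a healthy system, so a filter that hides the zeros makes the interesting ones stand out. A minimal sketch, assuming the NAME VALUE DESCRIPTION layout shown above (nonzero_stats is a made-up helper name):

```shell
# Read `fmstat -m <module>` output on stdin and print "name value" for
# every statistic whose VALUE (second column) is not zero.
# The header line is skipped.
nonzero_stats() {
    awk 'NR > 1 && $2 != 0 { print $1, $2 }'
}

# Example:
# fmstat -m cpumem-retire | nonzero_stats
```

With the listing above the filter prints nothing, because every counter is 0.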
FMA Logging Explained
When the fmd daemon receives an error event, it writes the event to the error log before acknowledging receipt to the FMA event transport. Initially the event is logged with a header indicating that it has NOT yet been processed by a module. When a module accepts the error event, that header is changed to indicate that the event does not need to be replayed after a failure.
The CLI commands for viewing the error and fault logs are listed below:
fmdump (display fault log)
fmdump -e (display error log)
fmdump -eV (display error log in verbose mode)
================================================================================
Part 3 will show how to identify system errors and faults.