Sunday, June 21, 2009

Solaris FMA Quick Note - Part 1 (Architecture)

Since the introduction of Solaris 10, you may have heard about the 'Predictive Self Healing' features, which greatly increase system reliability, availability and serviceability (RAS).

Predictive Self Healing consists of two major parts: FMA (Fault Management Architecture) and SMF (Service Management Facility). This note is an introduction to FMA.

FMA Quick Note (Part 1) Architecture

FMA has 3 main components that work together to make system self-healing possible.

(1) FMA Components

1.1) “Fault manager”
is the software component that multiplexes between error reports (produced by system components) and the companion software designed to diagnose and respond to those errors, in order to facilitate self-healing and improve availability.


1.2) “Diagnosis engines”
are the fault manager's clients: software components that automatically diagnose problems using error reports.

1.3) “Agents” or “FMA Agents”
automatically respond to problems by disabling faulty components, alerting the administrator, and providing more information to higher-level management software.
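The way these three components cooperate can be sketched as a small publish/subscribe loop. This is only a toy model in Python, not the real fault manager or its API: the class `FaultManager`, the functions `cpu_diagnosis_engine` and `retire_agent`, the rule "every CPU error report produces a CPU fault", and the event name `fault.cpu` are all assumptions made for illustration.

```python
from collections import defaultdict


class FaultManager:
    """Toy dispatcher: routes events to subscribers by event class prefix."""

    def __init__(self):
        # Maps a class prefix such as "ereport.cpu" to a list of callbacks.
        self.subscribers = defaultdict(list)

    def subscribe(self, prefix, callback):
        self.subscribers[prefix].append(callback)

    def post(self, event_class, payload):
        # Iterate over snapshots so a callback may post further events safely.
        for prefix, callbacks in list(self.subscribers.items()):
            if event_class == prefix or event_class.startswith(prefix + "."):
                for callback in list(callbacks):
                    callback(event_class, payload)


log = []
fm = FaultManager()


def cpu_diagnosis_engine(event_class, payload):
    # Hypothetical diagnosis rule: any CPU error report yields a CPU fault.
    log.append(("diagnosed", event_class))
    fm.post("fault.cpu", payload)


def retire_agent(event_class, payload):
    # Hypothetical response: record which resource would be disabled.
    log.append(("responded", event_class, payload["resource"]))


fm.subscribe("ereport.cpu", cpu_diagnosis_engine)  # diagnosis engine
fm.subscribe("fault", retire_agent)                # agent

fm.post("ereport.cpu.ultraSPARC-III.CE", {"resource": "cpu=8"})
print(log)
```

The point of the sketch is only the division of labour: the fault manager knows nothing about CPUs, the diagnosis engine never touches the hardware, and the agent acts only on diagnosed faults.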

(2) FMA Features

FMA is a new software architecture and methodology for fault management across Sun's product line. It provides 3 self-healing activities: error handling, fault diagnosis and response.


2.1.1) Error handling

- Detect an error
- Capture data to diagnose the underlying problem
- Isolate the effects of the error
- Generate the appropriate error report
- Send the error report to other diagnosis engines (if needed)

2.1.2) Fault diagnosis

- Analyze the problem or defect in the system


2.1.3) Response (Self Healing)

- Perform isolation and self-healing tasks, i.e. reconfigure or disable faulty components until they are repaired

(3) FMA Event Naming Schemes

When an event occurs, whether it is an 'error' or a 'fault' event, FMA names it in a standard way using a hierarchical tree and/or the FMRI format.

(3.1) Hierarchical trees
- FMA uses 3 root schemes:
ereport (error events)
fault (fault events)
list (events related to lists)

Each root scheme can have subcategories that further specify the component that generated the error or the faulty component.


Subcategories for ereport and fault:
asic – ASIC events
cpu – CPU and memory subsystem events
io – I/O device events


Subcategory for list (only one):
list.suspect (lists all fault suspects for a set of related events, known as a case)

Events can be passed among the various FMA components, for example:
ereport.cpu.ultraSPARC-III.CE
fault.asic.schizo.*
list.suspect
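Because these names are just dot-separated paths, splitting and matching them is straightforward. The Python sketch below is illustrative only: the helper names `parse_event_class` and `matches` are made up for this note, the event name `fault.asic.schizo.example` is invented, and shell-style wildcard matching through the standard `fnmatch` module is an assumption about how a pattern such as `fault.asic.schizo.*` might be interpreted.

```python
from fnmatch import fnmatchcase

# The three root schemes described above.
ROOT_SCHEMES = {"ereport", "fault", "list"}


def parse_event_class(name):
    """Split a dotted event class into root scheme, subcategory and the rest."""
    parts = name.split(".")
    root = parts[0]
    if root not in ROOT_SCHEMES:
        raise ValueError(f"unknown root scheme: {root!r}")
    return {
        "root": root,
        "category": parts[1] if len(parts) > 1 else None,
        "detail": parts[2:],
    }


def matches(event_class, pattern):
    """Case-sensitive wildcard match; '*' matches any run of characters."""
    return fnmatchcase(event_class, pattern)


print(parse_event_class("ereport.cpu.ultraSPARC-III.CE"))
print(matches("fault.asic.schizo.example", "fault.asic.schizo.*"))
```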

(3.2) FMRI (Fault Management Resource Identifier)

- A URL-like format used to identify error and fault events
- An FMRI is a set of name-value pairs that identifies each event

An FMRI identifies resources that
- detected an error
- are affected by an error
- have changed state following fault diagnosis

An event's resource may be reported in name-value pair format or as a hierarchical text string (derived from the FMRI), i.e. fmri-scheme://[authority]/path

FMRI format description (fmri-scheme://[authority]/path)
fmri-scheme – specifies the format of the string and the owner of the resource repository
authority – provides a means by which the resource may be identified
path – specifies the resource

FMRI Examples


hc:///product-id=SunFire15000,chassis-id=123456789,domain-id=A/chassis/system-board=0/cpu-module=2/cpu=8
diag-engine://product-id=E450,chassis-id=123456/de=eversholt/:version=1.0
dev:///devices/pci@8,700000/scsi@6/sd@6,0:g/:devid=1234
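To make the fmri-scheme://[authority]/path layout concrete, here is a small Python sketch that splits the string form into its three parts. It is not the parser FMA itself uses: `parse_fmri` is a hypothetical helper, and treating the authority as a comma-separated list of name=value pairs is an assumption read off the diag-engine example above.

```python
def parse_fmri(fmri):
    """Split 'fmri-scheme://[authority]/path' into (scheme, authority, path)."""
    scheme, sep, rest = fmri.partition("://")
    if not sep:
        raise ValueError("not an FMRI: missing '://'")

    # Everything up to the first '/' is the (possibly empty) authority.
    authority_text, _, path = rest.partition("/")

    authority = {}
    if authority_text:
        # Assumption: the authority is a comma-separated name=value list.
        for pair in authority_text.split(","):
            name, _, value = pair.partition("=")
            authority[name.strip()] = value.strip()

    return scheme, authority, "/" + path


print(parse_fmri("diag-engine://product-id=E450,chassis-id=123456/de=eversholt/:version=1.0"))
print(parse_fmri("dev:///devices/pci@8,700000/scsi@6/sd@6,0:g/:devid=1234"))
```

In the dev example the authority is empty, which is why three slashes appear in a row; the sketch returns an empty dictionary for it and leaves the path untouched.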

References


(1) BigAdmin Predictive Self healing : http://www.sun.com/bigadmin/content/selfheal/

(2) FMA Search http://onesearch.sun.com/search/onesearch/index.jsp?col=community-bigadmin&qt=FMA

(3) OpenSolaris.org Fault Management : http://www.opensolaris.org/os/community/fm/

==================END====OF====PART====1======================================

Part 2 will discuss the CLI commands for FMA. Bye for now, Peerapong.K@Sun.COM.
