Monitor REC_HOST_BOARD_FAULT (105)

IBMStorageSubsystem.FailureID_0105_Monitor (UnitMonitor)

Monitor Description for (105)

Knowledge Base article:

Host Switch Card Problem

What Caused the Problem?

The host switch card in one of the controllers is not functioning properly. The Recovery Guru Details area provides specific information you will need as you follow the recovery steps.

Caution: Electrostatic discharge can damage sensitive components. Always use proper antistatic protection when handling components. Touching components without proper grounding may damage the equipment.

Important Notes

If this procedure determines that the host switch card has failed, you must replace the controller canister that contains the faulty switch card.

Recovery Steps

1

Select the View Event Log option to determine the initial cause of the problem. The host switch card either:

  • Failed its diagnostic tests during power-up (Event 2843),

  • Experienced a runtime failure because of a communication problem or a specific critical fault (Event 2844), or

  • Is running in a degraded mode or reporting a warning such as an over-temperature condition (Event 2845)

Note: Event 2844 indicates a problem that may be temporary or intermittent, so the host switch card may recover on its own. Even so, use the SAN Switch Management application described in step 2 to obtain more details.

2

To further diagnose and possibly fix the problem, start the separate SAN Switch Management application for the host switch card.

Note: Make sure that the SAN Switch Management application is connected to the IP address of the host switch card associated with the affected controller and storage subsystem.

  • If you were NOT able to fix the problem using the Switch Management application, go to step 3 to prepare the controller for replacement.

  • If you were able to fix the problem using the Switch Management application, then go to step 5.

3

Place the affected controller offline using the following steps. The affected controller is listed in the Details area.

a

Select the controller in the Subsystem Management Window.

b

Select Advanced >> Recovery >> Place Controller >> Offline.

c

Complete the instructions in the dialog, then select Yes.

4

Select Recheck to rerun the Recovery Guru. An Offline Controller problem should be reported in the Summary area. Follow that procedure to remove and replace the controller and then go to step 5.

5

Select Recheck to rerun the Recovery Guru. The failure should no longer appear in the Summary area. If the failure appears again, contact your technical support representative.
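The branching in the recovery steps above can be summarized as a small lookup. The following is only an illustrative sketch: the event numbers and step numbers come from this article, while the names `EVENT_CAUSES`, `describe_event`, and `next_step` are hypothetical and are not part of the Recovery Guru or any IBM tool.

```python
from typing import Optional

# Event numbers and their meanings as listed in step 1 of this article.
# Stored as strings because the article gives them only as labels.
EVENT_CAUSES = {
    "2843": "failed diagnostic tests during power-up",
    "2844": "runtime failure (communication problem or specific critical fault)",
    "2845": "degraded mode or warning (for example, over-temperature)",
}


def describe_event(event_id: str) -> str:
    """Return the cause this article associates with an event number."""
    return EVENT_CAUSES.get(event_id, "not an event listed in this article")


def next_step(current_step: int, fixed_in_switch_management: bool = False) -> Optional[int]:
    """Return the step that follows current_step, or None after step 5.

    fixed_in_switch_management only matters at step 2, where the
    procedure branches to step 5 (fixed) or step 3 (not fixed).
    """
    if current_step == 1:
        return 2
    if current_step == 2:
        return 5 if fixed_in_switch_management else 3
    if current_step == 3:
        return 4
    if current_step == 4:
        return 5
    if current_step == 5:
        return None
    raise ValueError(f"unknown step: {current_step}")


if __name__ == "__main__":
    print(describe_event("2844"))
    print(next_step(2, fixed_in_switch_management=False))
```

The sketch makes one point explicit: only step 2 branches, and both branches end at the Recheck in step 5.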

Element properties:

Target: IBMStorageSubsystem.StorageSubsystem
Parent Monitor: IBMStorageSubsystem.StorageSubsystemAvailability
Category: Custom
Enabled: True
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: IBMStorageSubsystem.FailureUnitMonitorType
Remotable: True
Accessibility: Internal
Alert Message:
  Alert: REC_HOST_BOARD_FAULT
  Alert Value: {0}
RunAs: Default
Comment: Machine generated entity

Source Code:

<UnitMonitor ID="IBMStorageSubsystem.FailureID_0105_Monitor" Accessibility="Internal" Enabled="true" Target="IBMStorageSubsystem.StorageSubsystem" ParentMonitorID="IBMStorageSubsystem.StorageSubsystemAvailability" Remotable="true" Priority="Normal" TypeID="IBMStorageSubsystem.FailureUnitMonitorType" ConfirmDelivery="true" Comment="Machine generated entity">
  <Category>Custom</Category>
  <AlertSettings AlertMessage="IBMStorageSubsystem.REC_HOST_BOARD_FAULT_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Data/Context/Property[@Name='FailureDescription']$</AlertParameter1>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="IBMStorageSubsystem.StateId7B0996D051F17BCC99AFE9086E6849FE" MonitorTypeStateID="NoIssue" HealthState="Success"/>
    <OperationalState ID="IBMStorageSubsystem.StateId869D21EA3B8C6594C0AB935585D97DBC" MonitorTypeStateID="IssueFound" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <FailureID>105</FailureID>
    <IntervalSeconds>59</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <Trace>0</Trace>
  </Configuration>
</UnitMonitor>
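
To check the monitor's settings without reading the XML by hand, the definition can be parsed with Python's standard `xml.etree.ElementTree` module. This is a minimal sketch, not part of the management pack: the helper name `read_monitor_config` is hypothetical, and the embedded XML is an abbreviated copy of the listing above with the alert settings and operational states left out.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the UnitMonitor definition shown above
# (AlertSettings and OperationalStates omitted for brevity).
MONITOR_XML = """
<UnitMonitor ID="IBMStorageSubsystem.FailureID_0105_Monitor" Accessibility="Internal" Enabled="true" Target="IBMStorageSubsystem.StorageSubsystem">
  <Category>Custom</Category>
  <Configuration>
    <FailureID>105</FailureID>
    <IntervalSeconds>59</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <Trace>0</Trace>
  </Configuration>
</UnitMonitor>
"""


def read_monitor_config(xml_text: str) -> dict:
    """Return the monitor ID, enabled flag, and integer Configuration values."""
    root = ET.fromstring(xml_text.strip())
    configuration = root.find("Configuration")
    if configuration is None:
        raise ValueError("UnitMonitor has no Configuration element")
    # Every Configuration child in this monitor holds an integer value.
    settings = {child.tag: int(child.text) for child in configuration}
    return {
        "id": root.get("ID"),
        "enabled": root.get("Enabled") == "true",
        **settings,
    }


config = read_monitor_config(MONITOR_XML)

if __name__ == "__main__":
    for key, value in config.items():
        print(f"{key}: {value}")
```

The returned dictionary exposes the same values as the Configuration element: failure ID 105, an interval of 59 seconds, and a timeout of 300 seconds.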