Dell OM : Hardware Controller is in critical state

Dell.ManagedServer.Alert.2329 (Rule)

Knowledge Base article:

Summary

This rule raises an alert when a hardware controller reports a critical state.

Causes

The hardware controller has generated a critical alert. Probable causes and the corresponding resolutions for this condition are:

Cause

Resolutions

The <name> is absent.

Re-install or reconnect the hardware.

The storage adapter is absent.

Install the storage adapter.

The backplane is absent.

If removal was unintended, check presence, then re-install or reconnect.

The USB cable is absent.

If removal was unintended, check presence, then re-install or reconnect.

The mezzanine card <number> is absent.

If removal was unintended, check presence, then re-install or reconnect.

The <name> cable or interconnect is not connected or is improperly connected.

Check presence, then re-install or reconnect.

The storage <name> cable is not connected, or is improperly connected.

Check presence, then re-install or reconnect.

The system board <name> cable or interconnect is not connected, or is improperly connected.

Check presence, then re-install or reconnect.

The <name> is not installed correctly.

Check presence, then re-install or reconnect.

A fabric mismatch detected for mezzanine card <number>.

Check the chassis fabric type in the CMC GUI and compare it with the type of the IOM or mezzanine card.

The riser board cable or interconnect is not connected, or is improperly connected.

Check presence, then re-install or reconnect.

The <name> is removed.

If removal was unintended, check presence, then re-install or reconnect.

A hardware incompatibility detected between BMC/iDRAC firmware and CPU.

Update the BMC/iDRAC firmware. If the problem persists, contact support.

A hardware incompatibility detected between BMC/iDRAC firmware and other hardware.

Update the BMC/iDRAC firmware. If the problem persists, contact support.

Hardware unsuccessfully updated for mezzanine card <number>.

Check presence, re-install or reconnect, then re-attempt the update. If the problem persists, contact support.

Link Tuning error detected.

Update the CMC firmware. If the problem persists, contact support.

Hardware incompatibility detected with mezzanine card <number>.

  • Review system documentation for supported mezzanine cards.

  • Replace mezzanine card with a supported mezzanine card.

A fabric mismatch detected on fabric <name> with server in slot <number>.

Check the chassis fabric type in the CMC GUI and compare it with the type of the IOM or mezzanine card.

A hardware misconfiguration detected on <name>.

Make sure the hardware is installed correctly. Refer to the product documentation for correct configuration and installation procedures.

Server <number> is removed.

If removal was unintended, check presence, then re-insert.

IO module <number> is removed.

If removal was unintended, check presence, then re-insert.

A hardware incompatibility is detected between <first component name><first component location> and <second component name><second component location>.

Do the following: 1) Review system documentation for the hardware components identified in the message. 2) Replace unsupported components with supported components.

Unable to control the fan speed because a sled mismatch or hardware incompatibility is detected.

Remove the sled in which a hardware incompatibility is detected and replace with a compatible working sled. For more information about hardware compatibility, see the Platform Owner's Manual available on the support site.

A failure is detected on <name>.

Contact technical support. Refer to your product documentation to choose a convenient contact method.

An over-temperature event detected on I/O module <number>.

Do the following: 1) Make sure fans are installed and working correctly. 2) Check the temperature sensor status for the chassis and make sure it is within the chassis operating temperature range. 3) Reseat the I/O module to clear the over-temperature condition. If the problem persists, contact the service provider.

A failure is detected on IO module <number>.

Contact technical support. Refer to your product documentation to choose a convenient contact method.

I/O module <number> failed to boot.

Reseat the I/O module. If the problem persists, contact the service provider.

The <name> controller is stuck in boot mode.

Remove and reapply input power. If the problem persists, contact technical support. Refer to the product documentation to choose a convenient contact method.

Cannot communicate with <name> controller.

Remove and reapply input power. If the problem persists, contact technical support. Refer to the product documentation to choose a convenient contact method.

Server <number> health changed to a critical state from either a normal or warning state.

Review the System Log or the front panel for additional information.

Server <number> health changed to a non-recoverable state from a less severe state.

Review the System Log or the front panel for additional information.

Server <number> health changed to a critical state from a non-recoverable state.

Review the System Log or the front panel for additional information.

Server <number> health changed to a non-recoverable state.

Review the System Log or the front panel for additional information.

Unable to complete the operation because of an issue with the I/O panel cable.

Do one of the following and retry the operation: 1) Connect the I/O panel cable properly. 2) Replace the I/O panel cable.

The Chassis Management Controller (CMC) cannot communicate with the control panel.

Remove and reapply chassis input power. If the problem persists, contact your service provider.

Unable to synchronize control panel firmware due to internal error.

Remove and reapply chassis input power. If the problem persists, contact your service provider.

One or more PCIe switch heatsinks are not properly attached.

Immediately turn off the chassis by pressing the chassis power button or by removing chassis input power. The chassis can also be turned off remotely by running the following RACADM command: "racadm chassisaction -m chassis nongraceshutdown". After turning off the chassis, contact your service provider.

The sled <sled name> is removed from slot <slot number>.

No response action is required.

The <name> is removed from slot <number>.

No response action is required.
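
The cause and resolution pairs above use placeholders such as <name> and <number>. As a minimal sketch, assuming the event description text follows those templates literally, a lookup could turn each template into a regular expression and return the matching resolution. The helper names (template_to_regex, find_resolution) and the three-entry RESOLUTIONS subset are hypothetical and are not part of the management pack.

```python
import re
from typing import Optional

# Illustrative subset of the cause templates listed above, with their resolutions.
RESOLUTIONS = {
    "The mezzanine card <number> is absent.":
        "If removal was unintended, check presence, then re-install or reconnect.",
    "Link Tuning error detected.":
        "Update the CMC firmware. If the problem persists, contact support.",
    "The <name> is not installed correctly.":
        "Check presence, then re-install or reconnect.",
}


def template_to_regex(template: str) -> re.Pattern:
    """Convert a cause template into a compiled regex.

    Placeholders written as <...> match any non-empty text; every other
    part of the template is matched literally.
    """
    parts = re.split(r"(<[^<>]+>)", template)
    pattern = "".join(
        "(.+?)" if part.startswith("<") and part.endswith(">") else re.escape(part)
        for part in parts
    )
    return re.compile(pattern)


# Compile once, preserving the order of the table.
COMPILED = [(template_to_regex(t), r) for t, r in RESOLUTIONS.items()]


def find_resolution(description: str) -> Optional[str]:
    """Return the resolution for the first template that fully matches."""
    text = description.strip()
    for pattern, resolution in COMPILED:
        if pattern.fullmatch(text):
            return resolution
    return None
```

Generic templates such as "The <name> is absent." would also match more specific messages, so a fuller table would need the specific templates listed before the generic ones.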

Resolutions

Additional information on this issue may be available. Launch the iDRAC Console to debug further.

Element properties:

Target: Dell.ManagedServer
Category: Alert
Enabled: True
Event_ID: 2329
Event Source: LifeCycle Controller Log
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Remotable: True
Alert Message:
    Dell OM : Hardware Controller is in critical state
    Event Description: {0}
Event Log: System
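
The element properties above identify the triggering event by log, event ID, and event source. The following is a minimal sketch of that matching logic, not the Operations Manager implementation; the event dictionary and its keys ("log", "event_id", "source", "description") are hypothetical names chosen for illustration.

```python
# Values taken from the element properties above.
EVENT_LOG = "System"
EVENT_ID = 2329
EVENT_SOURCE = "LifeCycle Controller Log"
ALERT_TITLE = "Dell OM : Hardware Controller is in critical state"


def rule_matches(event: dict) -> bool:
    """Return True when the event satisfies all three listed criteria."""
    return (
        event.get("log") == EVENT_LOG
        and event.get("event_id") == EVENT_ID
        and event.get("source") == EVENT_SOURCE
    )


def alert_message(event: dict) -> str:
    """Build the alert text; {0} is filled with the event description."""
    body = "Event Description: {0}".format(event.get("description", ""))
    return ALERT_TITLE + "\n" + body
```

An event from another source with the same ID would not match, because every criterion has to hold at once.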

Member Modules:

ID         Module Type  TypeId                                                 RunAs
DS         DataSource   Microsoft.Windows.EventProvider                        Default
Alert      WriteAction  System.Health.GenerateAlert                            Default
WriteToDW  WriteAction  Microsoft.SystemCenter.DataWarehouse.PublishEventData  Default

Source Code:

<Rule ID="Dell.ManagedServer.Alert.2329" Enabled="true" Target="DellManagedServer!Dell.ManagedServer" ConfirmDelivery="false" Remotable="true" Priority="Normal" DiscardLevel="100">
  <Category>Alert</Category>
  <DataSources>
    <DataSource ID="DS" TypeID="Windows!Microsoft.Windows.EventProvider">
      <ComputerName>$Target/Property[Type="DellManagedServer!Dell.ManagedServer"]/HostName$</ComputerName>
      <LogName>System</LogName>
      <Expression>
        <And>
          <Expression>
            <SimpleExpression>
              <ValueExpression>
                <XPathQuery Type="UnsignedInteger">EventDisplayNumber</XPathQuery>
              </ValueExpression>
              <Operator>Equal</Operator>
              <ValueExpression>
                <Value Type="UnsignedInteger">2329</Value>
              </ValueExpression>
            </SimpleExpression>
          </Expression>
          <Expression>
            <SimpleExpression>
              <ValueExpression>
                <XPathQuery Type="String">PublisherName</XPathQuery>
              </ValueExpression>
              <Operator>Equal</Operator>
              <ValueExpression>
                <Value Type="String">LifeCycle Controller Log</Value>
              </ValueExpression>
            </SimpleExpression>
          </Expression>
        </And>
      </Expression>
    </DataSource>
  </DataSources>
  <WriteActions>
    <WriteAction ID="Alert" TypeID="Health!System.Health.GenerateAlert">
      <Priority>1</Priority>
      <Severity>2</Severity>
      <AlertMessageId>$MPElement[Name="Dell.ManagedServer.Alert.2329.Rule"]$</AlertMessageId>
      <AlertParameters>
        <AlertParameter1>$Data/EventDescription$</AlertParameter1>
      </AlertParameters>
      <Suppression>
        <SuppressionValue>$Data/EventDisplayNumber$</SuppressionValue>
        <SuppressionValue>$Data/Channel$</SuppressionValue>
        <SuppressionValue>$Data/PublisherName$</SuppressionValue>
        <SuppressionValue>$Data/LoggingComputer$</SuppressionValue>
        <SuppressionValue>$Data/EventCategory$</SuppressionValue>
        <SuppressionValue>$Data/EventLevel$</SuppressionValue>
        <SuppressionValue>$Data/UserName$</SuppressionValue>
        <SuppressionValue>$Data/EventNumber$</SuppressionValue>
        <SuppressionValue>$Data/EventDescription$</SuppressionValue>
      </Suppression>
      <Custom1/>
      <Custom2/>
      <Custom3/>
      <Custom4/>
      <Custom5/>
      <Custom6/>
      <Custom7/>
      <Custom8/>
      <Custom9/>
      <Custom10/>
    </WriteAction>
    <WriteAction ID="WriteToDW" TypeID="SCDW!Microsoft.SystemCenter.DataWarehouse.PublishEventData"/>
  </WriteActions>
</Rule>
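
The Suppression element in the rule lists nine event fields. As generally understood, events whose suppression values are all identical are folded into a single alert with an incremented repeat count instead of producing a new alert each time. The sketch below illustrates that grouping under this assumption; the event dictionaries and the function names (suppression_key, group_alerts) are hypothetical and do not reflect how Operations Manager stores alerts internally.

```python
from collections import Counter

# Field names taken from the SuppressionValue entries in the rule.
SUPPRESSION_FIELDS = (
    "EventDisplayNumber",
    "Channel",
    "PublisherName",
    "LoggingComputer",
    "EventCategory",
    "EventLevel",
    "UserName",
    "EventNumber",
    "EventDescription",
)


def suppression_key(event: dict) -> tuple:
    """Build the tuple of values that decides whether two events are duplicates."""
    return tuple(event.get(field) for field in SUPPRESSION_FIELDS)


def group_alerts(events) -> Counter:
    """Count how many events fall under each distinct suppression key."""
    return Counter(suppression_key(event) for event in events)
```

Because EventDescription is one of the suppression values, two events that differ only in their description text would be treated as separate alerts under this model.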