Lenovo HW PRO Pack: Component Health

IBM.HWPRO.ComponentHealth.RegularCheckup (UnitMonitor)

Checks the health of the components.

Knowledge Base article:

Summary

This monitor regularly checks the health of all components and reports critical and warning health problems from the SCVMM PRO perspective. More details on this event are available through the Lenovo Hardware Management Pack in SCOM. NOTE: If you dismiss this PRO Tip, you must manually clear the monitor state of the affected machine in the Lenovo HW PRO MP in SCOM. If you implement this PRO Tip, the machine that generated this event is placed into Maintenance Mode in SCVMM and any VMs on it are migrated. You must manually remove it from Maintenance Mode once the problem is resolved.

Configuration

You can enable or disable this monitor or configure it to run with a different monitoring interval by changing the "override-controlled" parameters of this monitor. See the Operations Manager documentation about "Override" for more information.
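As a rough illustration of the overrides described above, the fragment below sketches how enabling the monitor and shortening its interval might look inside an unsealed override management pack. The `LenovoPRO` reference alias, the override IDs, and the parameter name `IntervalSeconds` are assumptions made for this example; the actual overridable parameter names are defined by the monitor type, which is not shown on this page.

```xml
<!-- Sketch only: the alias "LenovoPRO", the override IDs, and the
     parameter name "IntervalSeconds" are hypothetical. -->
<Overrides>
  <!-- Enable the monitor for all instances of the target class -->
  <MonitorPropertyOverride ID="Example.Override.RegularCheckup.Enabled"
                           Context="LenovoPRO!IBM.HWPRO.VMHost.DirAgent.5.x"
                           Enforced="false"
                           Monitor="LenovoPRO!IBM.HWPRO.ComponentHealth.RegularCheckup"
                           Property="Enabled">
    <Value>true</Value>
  </MonitorPropertyOverride>
  <!-- Run the checkup every hour instead of every two hours -->
  <MonitorConfigurationOverride ID="Example.Override.RegularCheckup.Interval"
                                Context="LenovoPRO!IBM.HWPRO.VMHost.DirAgent.5.x"
                                Enforced="false"
                                Monitor="LenovoPRO!IBM.HWPRO.ComponentHealth.RegularCheckup"
                                Parameter="IntervalSeconds">
    <Value>3600</Value>
  </MonitorConfigurationOverride>
</Overrides>
```

In practice, overrides like these are usually created from the Operations Manager console rather than written by hand; the XML is shown only to make the two kinds of override (property versus configuration parameter) concrete.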

Causes

When one or more hardware components have health problems, an alert is raised and the health state is set to indicate an error, according to the severity of the impact on the virtual machines on the host. Note that the health checkup report shown in "State Change Events" is based on the severity level as seen from the hardware component's perspective. This monitor applies further filtering and reports the severity from the SCVMM PRO perspective. For example, a critical error on a system cooling fan is not considered critical from the SCVMM PRO perspective and is therefore reported as a warning.

A component health problem can have a variety of causes. Exact details of the problem are available in the "State Change Events" tab in Operations Manager's Health Explorer.

State View: SCVMM-Managed Hosts on Lenovo Servers

Resolutions

For critical PRO errors, migrate all virtual machines off this host immediately and put the host into maintenance mode until all the problems are resolved. For PRO warnings, the same steps can be deferred, but they should still be taken at the earliest possible moment. Once all the errors are cleared, the alert is closed and the health state is reset automatically.

Element properties:

Target: IBM.HWPRO.VMHost.DirAgent.5.x
Parent Monitor: IBM.HWPRO.IBMPRORollupMonitor
Category: Custom
Enabled: False
Alert Generate: True
Alert Severity: MatchMonitorHealth
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: IBM.HWPRO.ComponentHealth.RegularCheckup.MonitorType
Remotable: True
Accessibility: Internal
Alert Message:
Lenovo HW PRO Pack Alert: Component Health
Checks the health of the components. {0}
RunAs: Default

Source Code:

<UnitMonitor Accessibility="Internal" ConfirmDelivery="false" Enabled="true" ID="IBM.HWPRO.ComponentHealth.RegularCheckup" ParentMonitorID="IBM.HWPRO.IBMPRORollupMonitor" Priority="Normal" Remotable="true" Target="IBM.HWPRO.VMHost.DirAgent.5.x" TypeID="IBM.HWPRO.ComponentHealth.RegularCheckup.MonitorType">
  <!-- This is an auto-reset monitor, as defined by RegularCheckup.MonitorType. -->
  <Category>Custom</Category>
  <AlertSettings AlertMessage="IBM.HWPRO.ComponentHealth.RegularCheckup.AlertMessageResourceID">
    <AlertOnState>Warning</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Data/Context/Property[@Name="[List of Critical Errors]"]$</AlertParameter1>
      <AlertParameter2>$Data/Context/Property[@Name="[List of Warnings]"]$</AlertParameter2>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState HealthState="Error" ID="Error" MonitorTypeStateID="State.Error"/>
    <OperationalState HealthState="Warning" ID="Warning" MonitorTypeStateID="State.Warning"/>
    <OperationalState HealthState="Success" ID="Success" MonitorTypeStateID="State.Healthy"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>7200</IntervalSeconds>
    <!-- default to be 7200 secs = 2 hrs -->
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
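
The `{0}` placeholder in the alert message is filled from `AlertParameter1` (the list of critical errors); `AlertParameter2` (the list of warnings) would correspond to `{1}`, which the message shown on this page does not use. The fragment below is a sketch of how the alert message resource referenced by `AlertSettings` is typically declared and given display text in a management pack. The `ENU` language pack and the English wording are assumptions for illustration, not a copy of the pack's actual contents.

```xml
<!-- Sketch only: language pack ID and English text are assumed. -->
<Presentation>
  <StringResources>
    <!-- Resource referenced by the AlertMessage attribute of AlertSettings -->
    <StringResource ID="IBM.HWPRO.ComponentHealth.RegularCheckup.AlertMessageResourceID"/>
  </StringResources>
</Presentation>
<LanguagePacks>
  <LanguagePack ID="ENU" IsDefault="true">
    <DisplayStrings>
      <DisplayString ElementID="IBM.HWPRO.ComponentHealth.RegularCheckup.AlertMessageResourceID">
        <Name>Lenovo HW PRO Pack Alert: Component Health</Name>
        <!-- {0} is replaced by AlertParameter1; {1} would be AlertParameter2 -->
        <Description>Checks the health of the components. {0}</Description>
      </DisplayString>
    </DisplayStrings>
  </LanguagePack>
</LanguagePacks>
```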