Performance Threshold: Compute Cluster - Number of unreachable nodes

Performance_Threshold__Compute_Cluster___Number_of_unreachable_nodes_1_Rule (Rule)

Knowledge Base article:

Management Pack
Summary

This rule monitors the number of compute nodes that have a status of Unreachable.

By default, the head node sends out one heartbeat each minute to compute nodes in the cluster. If a compute node does not respond to a heartbeat signal for three consecutive minutes, that node is considered unreachable.
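
The heartbeat rule described above can be modelled in a few lines. This is only an illustrative sketch, not the head node's actual implementation: the class `NodeHeartbeat` and the function `count_unreachable` are hypothetical names, and the assumption that a single successful response clears the unreachable state is not stated in this article.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_MINUTES = 1  # from the text: one heartbeat each minute
MISSED_LIMIT = 3                # from the text: three consecutive minutes


@dataclass
class NodeHeartbeat:
    """Tracks consecutive missed heartbeats for one node (hypothetical helper)."""
    name: str
    missed: int = 0

    def record(self, responded: bool) -> None:
        # Assumption: any response resets the streak; a miss extends it.
        self.missed = 0 if responded else self.missed + 1

    @property
    def unreachable(self) -> bool:
        # Unreachable once the node has missed MISSED_LIMIT heartbeats in a row.
        return self.missed >= MISSED_LIMIT


def count_unreachable(nodes):
    """Value the 'Number of unreachable nodes' counter would have in this model."""
    return sum(1 for node in nodes if node.unreachable)
```

In this model a node that misses two heartbeats and then responds is never counted, which matches the "consecutive" wording above.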

 
Causes

A node can be unreachable for the following reasons:

  • The Microsoft Compute Cluster Node Manager Service is not running on the compute node.
  • Name resolution (DNS) has failed.
  • The node is disconnected from the network.
  • The node has been powered off.
 
Resolutions

To resolve this problem:

  1. From the MOM console, click the Compute Cluster 2003 view, then click the Compute Node column to find the unreachable node. Check the status of the Node Manager service on that node.
  2. If the service is stopped, restart the Microsoft Compute Cluster Node Manager Service on that compute node. If the service cannot be started, uninstall and reinstall Compute Cluster Pack on that node.
  3. Determine whether name resolution is failing by using Remote Desktop to connect to the node by its IP address. If the connection succeeds by IP address but the node cannot be reached by name, the problem is name resolution. Check the DNS server.
  4. If name resolution is not the problem, ping the target node.
  5. Check the physical network connection between the head node and the compute node.
  6. Verify that the node is powered on.
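
Steps 3 to 6 can be partly scripted from the head node. The following is a minimal sketch under stated assumptions, not a supported tool: `diagnose_node` and `tcp_probe` are hypothetical helpers, the reachability check is a TCP connection to the standard Remote Desktop port (3389) rather than an ICMP ping, and comparing the resolved address to the expected one can mislead on multi-homed nodes.

```python
import socket


def tcp_probe(ip_address, port=3389, timeout=3.0):
    """Return True if a TCP connection to the Remote Desktop port succeeds."""
    try:
        with socket.create_connection((ip_address, port), timeout=timeout):
            return True
    except OSError:
        return False


def diagnose_node(hostname, ip_address, resolve=socket.gethostbyname,
                  can_connect=tcp_probe):
    """Return a short diagnosis string for one compute node.

    hostname and ip_address are supplied by the operator. resolve and
    can_connect are injectable so the logic can be exercised offline.
    """
    try:
        resolved = resolve(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        resolved = None

    reachable_by_ip = can_connect(ip_address)

    if reachable_by_ip and resolved != ip_address:
        # Reachable by address but not by name: step 3 points at DNS.
        return "name resolution problem: check the DNS server"
    if reachable_by_ip:
        return "node reachable: check the Node Manager service"
    # Steps 5 and 6: cabling, switch ports, and power need a physical check.
    return "node not reachable by IP: check network connection and power"
```

A call such as `diagnose_node("node01", "10.0.0.5")` (both values are placeholders) returns one of the three strings, which map back to steps 1 to 3 and 5 to 6 above.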
 
© 2006 Microsoft Corporation, all rights reserved.

Element properties:

Target: Microsoft.Windows.Server.ComputeCluster.2003.Microsoft_Windows_Compute_Cluster_Server_2003_Head_Nodes_Installation
Category: PerformanceHealth
Enabled: True
Instance Name: Compute Cluster
Counter Name: Number of unreachable nodes
Frequency: 900
Alert Generate: True
Alert Severity: Warning
Alert Priority: Low
Remotable: True
Alert Message:
Performance Threshold: Compute Cluster - Number of unreachable nodes

$Data/ObjectName$ : $Data/CounterName$ : $Data/InstanceName$ value = $Data/Value$
Comment: Mom2005ID='{7BCF8646-1BBF-4B91-93E9-AFADE9C8E558}';MOM2005ComputerGroupID={3326B24C-3FCA-4A20-A242-10F6836AF625}
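
The alert message above is a template whose `$Data/...$` placeholders are filled in from the performance sample that triggered the alert. The sketch below only illustrates that substitution and is not the management server's actual rendering code. The function `render_alert` and the sample value `2` are hypothetical, and leaving unknown placeholders untouched is an assumption made for the example.

```python
import re

# Alert description template, joined onto one line for readability.
TEMPLATE = ("$Data/ObjectName$ : $Data/CounterName$ : "
            "$Data/InstanceName$ value = $Data/Value$")


def render_alert(template, data):
    """Replace each $Data/<Key>$ placeholder with data[<Key>] when present."""
    def substitute(match):
        key = match.group(1)
        # Assumption: placeholders without a value are left as written.
        return str(data.get(key, match.group(0)))

    return re.sub(r"\$Data/(\w+)\$", substitute, template)


# Object and counter names come from the rule; the value is made up.
sample = {
    "ObjectName": "Compute Cluster",
    "CounterName": "Number of unreachable nodes",
    "Value": 2,
}
message = render_alert(TEMPLATE, sample)
```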

Member Modules:

ID                                      Module Type         TypeId                                                             RunAs
_FEDC0239_8729_44F5_BF9B_F002680B5EC0_  DataSource          System.Mom.BackwardCompatibility.Performance.FilteredDataProvider  Default
SimpleThresholdFilter                   ConditionDetection  System.Performance.SimpleThresholdCondition                        Default
GenerateAlert                           WriteAction         System.Mom.BackwardCompatibility.AlertResponse                     Default

Source Code:

<Rule ID="Performance_Threshold__Compute_Cluster___Number_of_unreachable_nodes_1_Rule" Target="Microsoft.Windows.Server.ComputeCluster.2003.Microsoft_Windows_Compute_Cluster_Server_2003_Head_Nodes_Installation" Enabled="true" ConfirmDelivery="false" Comment="Mom2005ID='{7BCF8646-1BBF-4B91-93E9-AFADE9C8E558}';MOM2005ComputerGroupID={3326B24C-3FCA-4A20-A242-10F6836AF625}">
  <Category>PerformanceHealth</Category>
  <DataSources>
    <DataSource ID="_FEDC0239_8729_44F5_BF9B_F002680B5EC0_" Comment="{FEDC0239-8729-44F5-BF9B-F002680B5EC0}" TypeID="MomBackwardCompatibility!System.Mom.BackwardCompatibility.Performance.FilteredDataProvider">
      <ComputerName>$Target/Host/Property[Type="WindowsLibrary!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
      <CounterName>Number of unreachable nodes</CounterName>
      <ObjectName>Compute Cluster</ObjectName>
      <Frequency>900</Frequency>
      <Expression/>
    </DataSource>
  </DataSources>
  <ConditionDetection ID="SimpleThresholdFilter" TypeID="PerformanceLibrary!System.Performance.SimpleThresholdCondition">
    <Threshold>0.0</Threshold>
    <Operator>Greater</Operator>
  </ConditionDetection>
  <WriteActions>
    <WriteAction ID="GenerateAlert" TypeID="MomBackwardCompatibility!System.Mom.BackwardCompatibility.AlertResponse">
      <AlertGeneration>
        <GenerateAlert>true</GenerateAlert>
        <Owner/>
        <Description>
          $Data/ObjectName$
          :
          $Data/CounterName$
          :
          $Data/InstanceName$
          value =
          $Data/Value$
        </Description>
        <AlertLevel>30</AlertLevel>
        <ResolutionState/>
        <Source>
          $Data/ObjectName$
          :
          $Data/CounterName$
          :
          $Data/InstanceName$
        </Source>
        <Name>Performance Threshold: Compute Cluster - Number of unreachable nodes</Name>
      </AlertGeneration>
      <InvokerType>0</InvokerType>
    </WriteAction>
  </WriteActions>
</Rule>
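
The condition detection module in the rule above passes a sample on to the alert action only when the counter value is greater than 0.0, so a single unreachable node is enough to raise the alert. As a rough illustration, assuming nothing about how the agent evaluates the module internally, the same check can be reproduced by parsing that element. The names `load_condition` and `should_alert` are hypothetical, and only the `Greater` operator is handled because it is the only one that appears in this rule.

```python
import operator
import xml.etree.ElementTree as ET

# Copy of the ConditionDetection element from the rule source.
CONDITION_XML = """
<ConditionDetection ID="SimpleThresholdFilter" TypeID="PerformanceLibrary!System.Performance.SimpleThresholdCondition">
  <Threshold>0.0</Threshold>
  <Operator>Greater</Operator>
</ConditionDetection>
"""

# Only the operator used by this rule is mapped.
OPERATORS = {"Greater": operator.gt}


def load_condition(xml_text):
    """Return (threshold, comparison function) parsed from the element."""
    element = ET.fromstring(xml_text)
    threshold = float(element.findtext("Threshold"))
    op_name = element.findtext("Operator")
    if op_name not in OPERATORS:
        raise ValueError(f"unsupported operator: {op_name!r}")
    return threshold, OPERATORS[op_name]


def should_alert(value, xml_text=CONDITION_XML):
    """True when a counter sample would pass the threshold filter."""
    threshold, compare = load_condition(xml_text)
    return compare(value, threshold)
```

With this sketch, `should_alert(0)` is false and `should_alert(1)` is true, which mirrors the intent that any nonzero count of unreachable nodes produces a Warning alert.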