Performance Threshold: Compute Cluster - Number of queued jobs

Performance_Threshold__Compute_Cluster___Number_of_queued_jobs_1_Rule (Rule)

Knowledge Base article:

Management Pack
Summary

This rule generates an alert when the number of queued jobs in the cluster exceeds the threshold value. Queued jobs are jobs that have not yet been sent to cluster nodes for processing.


The default threshold value is 15. You can configure the threshold value for this rule on the Threshold tab of the rule properties.
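
As an alternative to the Threshold tab, the threshold can in principle be changed through an override stored in a separate management pack. The fragment below is a minimal sketch and is not taken from this management pack: the override ID, the CCS reference alias, and the value 25 are hypothetical, and it assumes that the module type behind SimpleThresholdFilter exposes Threshold as an overridable parameter.

```xml
<!-- Sketch only: an override fragment for a separate, unsealed management pack. -->
<!-- "CCS" is a hypothetical alias for a reference to the Compute Cluster management pack. -->
<Overrides>
  <!-- The ID is a hypothetical name; 25 is an arbitrary example value (the rule default is 15). -->
  <RuleConfigurationOverride ID="Override.QueuedJobs.Threshold" Context="CCS!Microsoft.Windows.Server.ComputeCluster.2003.Microsoft_Windows_Compute_Cluster_Server_2003_Head_Nodes_Installation" Enforced="false" Rule="CCS!Performance_Threshold__Compute_Cluster___Number_of_queued_jobs_1_Rule" Module="SimpleThresholdFilter" Parameter="Threshold">
    <Value>25</Value>
  </RuleConfigurationOverride>
</Overrides>
```

Module and Parameter name the condition detection module and the Threshold setting of this rule. If the parameter turns out not to be overridable, the Threshold tab remains the documented way to change the value.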

 
Causes
Queued jobs are waiting for cluster resources in order to run. The most probable cause for an excessive number of queued jobs is that insufficient resources (such as number of nodes or processors) are available to run these jobs. Numerous queued jobs can also result if the CCS job scheduler is down.
 
Resolutions

To resolve this problem, verify the health of the cluster by doing the following:

  1. In the State view, open the Compute Cluster 2003 view and click the Head Node column to make sure the Scheduler service on the head node is running.
  2. Check the Alert view to make sure that no errors have been raised about unreachable nodes.

Make sure you use the best scheduling practices recommended in http://technet2.microsoft.com/WindowsServer/en/library/899d300f-80cd-4b49-83ef-05e2b418a6371033.mspx?mfr=true

 
© 2006 Microsoft Corporation, all rights reserved.

Element properties:

Target: Microsoft.Windows.Server.ComputeCluster.2003.Microsoft_Windows_Compute_Cluster_Server_2003_Head_Nodes_Installation
Category: PerformanceHealth
Enabled: True
Instance Name: Compute Cluster
Counter Name: Number of queued jobs
Frequency: 900
Alert Generate: True
Alert Severity: Warning
Alert Priority: Low
Remotable: True
Alert Message: Performance Threshold: Compute Cluster - Number of queued jobs
$Data/ObjectName$: $Data/CounterName$: $Data/InstanceName$ value = $Data/Value$
Comment: Mom2005ID='{69E5B257-0359-44B9-8D8B-EBA84FC9EA69}';MOM2005ComputerGroupID={3326B24C-3FCA-4A20-A242-10F6836AF625}

Member Modules:

ID                                      Module Type         TypeId                                                             RunAs
_AE86C424_F7F8_4091_93D1_B36E4C5EB9FD_  DataSource          System.Mom.BackwardCompatibility.Performance.FilteredDataProvider  Default
SimpleThresholdFilter                   ConditionDetection  System.Performance.SimpleThresholdCondition                        Default
GenerateAlert                           WriteAction         System.Mom.BackwardCompatibility.AlertResponse                     Default

Source Code:

<Rule ID="Performance_Threshold__Compute_Cluster___Number_of_queued_jobs_1_Rule" Target="Microsoft.Windows.Server.ComputeCluster.2003.Microsoft_Windows_Compute_Cluster_Server_2003_Head_Nodes_Installation" Enabled="true" ConfirmDelivery="false" Comment="Mom2005ID='{69E5B257-0359-44B9-8D8B-EBA84FC9EA69}';MOM2005ComputerGroupID={3326B24C-3FCA-4A20-A242-10F6836AF625}">
  <Category>PerformanceHealth</Category>
  <DataSources>
    <DataSource ID="_AE86C424_F7F8_4091_93D1_B36E4C5EB9FD_" Comment="{AE86C424-F7F8-4091-93D1-B36E4C5EB9FD}" TypeID="MomBackwardCompatibility!System.Mom.BackwardCompatibility.Performance.FilteredDataProvider">
      <ComputerName>$Target/Host/Property[Type="WindowsLibrary!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
      <CounterName>Number of queued jobs</CounterName>
      <ObjectName>Compute Cluster</ObjectName>
      <Frequency>900</Frequency>
      <Expression/>
    </DataSource>
  </DataSources>
  <ConditionDetection ID="SimpleThresholdFilter" TypeID="PerformanceLibrary!System.Performance.SimpleThresholdCondition">
    <Threshold>15</Threshold>
    <Operator>Greater</Operator>
  </ConditionDetection>
  <WriteActions>
    <WriteAction ID="GenerateAlert" TypeID="MomBackwardCompatibility!System.Mom.BackwardCompatibility.AlertResponse">
      <AlertGeneration>
        <GenerateAlert>true</GenerateAlert>
        <Owner/>
        <Description>$Data/ObjectName$: $Data/CounterName$: $Data/InstanceName$ value = $Data/Value$</Description>
        <AlertLevel>30</AlertLevel>
        <ResolutionState/>
        <Source>$Data/ObjectName$: $Data/CounterName$: $Data/InstanceName$</Source>
        <Name>Performance Threshold: Compute Cluster - Number of queued jobs</Name>
      </AlertGeneration>
      <InvokerType>0</InvokerType>
    </WriteAction>
  </WriteActions>
</Rule>