This monitor tracks the average job queue time, which is one indicator of whether the cluster is congested. The monitor is disabled by default because typical job queue times vary widely between organizations.
This alert can be caused by any of the following:
Large jobs require more nodes than are currently available, so they wait in the queue and increase the average wait time.
The cluster is busy. In HPC Cluster Manager, in Charts and Reports, review charts such as “Cluster CPU Usage” to determine whether the cluster is exhibiting high CPU usage. Alternatively, in Node Management, add the “Running Jobs” metric to the heat map to determine whether most nodes are occupied with jobs.
Job configurations are not optimized. Some job configurations, such as those that give a job exclusive access to a node, can slow down other jobs. Configurations that better match the requirements of the application can help jobs finish faster.
To troubleshoot and fix this problem:
If “Charts and Reports” shows that the cluster load is consistently high, consider adding resources to the cluster, such as more compute nodes or more CPU and memory on the existing nodes.
Improve job configurations to increase cluster efficiency, for example by checking whether jobs actually need exclusive access to nodes.
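To illustrate the exclusive-access check, here is a minimal sketch of a job description file in which the job shares nodes with other jobs instead of reserving them. It assumes the HPC Pack job XML format, with an IsExclusive attribute on the Job element and the namespace shown. The job name, core counts, and command line are hypothetical, and all attribute names should be verified against a job exported from your own cluster.

```xml
<?xml version="1.0" encoding="utf-8"?>
<!-- Hypothetical job description; attribute names are assumed from the HPC Pack job XML format. -->
<Job xmlns="http://schemas.microsoft.com/HPCS2008/scheduler/"
     Name="SampleSolverJob"
     IsExclusive="false"
     UnitType="Core"
     MinCores="1"
     MaxCores="4">
  <!-- IsExclusive="false" lets the scheduler place other jobs on the same nodes. -->
  <Tasks>
    <!-- solver.exe is a placeholder command line. -->
    <Task Name="SolveStep" MinCores="1" MaxCores="4" CommandLine="solver.exe" />
  </Tasks>
</Job>
```

Requesting cores rather than whole nodes, and only as many as the application can use, leaves the remaining capacity on each node available to queued jobs.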
Target | Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler
Parent Monitor | System.Health.PerformanceState
Category | PerformanceHealth
Enabled | False
Instance Name | HPC Scheduler
Counter Name | Daily job queue time
Frequency | 60
Alert Generate | True
Alert Severity | MatchMonitorHealth
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | System.Performance.DoubleThreshold
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.JobScheduler.WaitTime" Accessibility="Public" Enabled="false" Target="Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" TypeID="SystemPerf!System.Performance.DoubleThreshold" ConfirmDelivery="false">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitor.JobScheduler.WaitTime_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UnderThreshold1" MonitorTypeStateID="UnderThreshold1" HealthState="Error"/>
    <OperationalState ID="OverThreshold1UnderThreshold2" MonitorTypeStateID="OverThreshold1UnderThreshold2" HealthState="Warning"/>
    <OperationalState ID="OverThreshold2" MonitorTypeStateID="OverThreshold2" HealthState="Success"/>
  </OperationalStates>
  <Configuration>
    <ComputerName>$Target/Host/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
    <CounterName>Daily job queue time</CounterName>
    <ObjectName>HPC Scheduler</ObjectName>
    <InstanceName/>
    <AllInstances>false</AllInstances>
    <Frequency>60</Frequency>
    <Threshold1>-2</Threshold1>
    <Threshold2>-1</Threshold2>
  </Configuration>
</UnitMonitor>
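Because the monitor above ships with Enabled="false", it only takes effect once an override enables it. The fragment below is a minimal sketch of such an override as it might appear in a separate, unsealed management pack. The override ID and the HPC! alias are hypothetical; the alias assumes a matching Reference to the HPC management pack in that pack's Manifest.

```xml
<Monitoring>
  <Overrides>
    <!-- Hypothetical override ID; "HPC!" is an assumed alias for the HPC management pack reference. -->
    <MonitorPropertyOverride ID="Example.Override.JobScheduler.WaitTime.Enabled"
                             Context="HPC!Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler"
                             Enforced="false"
                             Monitor="HPC!Microsoft.HPC.2008.Monitor.JobScheduler.WaitTime"
                             Property="Enabled">
      <!-- Turns the monitor on for every instance of the target class. -->
      <Value>true</Value>
    </MonitorPropertyOverride>
  </Overrides>
</Monitoring>
```

The shipped Threshold1 and Threshold2 values (-2 and -1) look like placeholders, so enabling the monitor is likely to be useful only together with thresholds chosen to reflect your organization's normal queue times.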