Daily Job Queue Time

Summary

This monitor tracks the average job queue time, that is, how long jobs wait in the queue before they start running. Wait time is one indicator of whether the cluster is congested. The monitor is disabled by default because typical job queue times vary widely between organizations.
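As a rough illustration of the metric, the sketch below averages queue time per day from submit and start timestamps. How the monitor itself computes the counter is not described here, so the grouping by start date, the function name, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def daily_average_queue_seconds(jobs):
    """Group jobs by the date they started and average their queue time in seconds.

    `jobs` is an iterable of (submit_time, start_time) pairs. This layout is
    hypothetical and is not an HPC Pack data format.
    """
    waits_by_day = defaultdict(list)
    for submit_time, start_time in jobs:
        wait = (start_time - submit_time).total_seconds()
        waits_by_day[start_time.date()].append(wait)
    return {day: mean(waits) for day, waits in waits_by_day.items()}


# Hypothetical sample data: (submit time, start time) pairs.
sample_jobs = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 9, 10)),    # waited 600 s
    (datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),  # waited 1800 s
    (datetime(2024, 1, 2, 8, 0), datetime(2024, 1, 2, 8, 5)),     # waited 300 s
]
averages = daily_average_queue_seconds(sample_jobs)
```

Comparing such daily averages against your organization's normal range is one way to choose thresholds before enabling the monitor.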

Causes

A high average job queue time can be caused by any of the following:

Some large jobs require many nodes, and not enough nodes are available to run them. This can increase average wait times.
The cluster is busy. In HPC Cluster Manager, in Charts and Reports, review charts such as “Cluster CPU Usage” to determine whether the cluster is exhibiting high CPU usage. Alternatively, in HPC Cluster Manager, in Node Management, you can add the “Running Jobs” metric to the heat map and determine whether most nodes are occupied with jobs.
Job configurations are not optimized. Some job configurations, such as those that give a job exclusive access to a node, can slow down other jobs. Configurations that are better suited to the requirements of the application can help jobs finish faster.

Resolutions

To troubleshoot and fix this problem:

If “Charts and Reports” shows that the cluster load is consistently high, we suggest adding resources to the cluster, such as more compute nodes or more CPU and memory on the existing nodes.
Improve job configurations to increase cluster efficiency. For example, check whether jobs really need exclusive access to nodes.
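The exclusive-access check could be sketched as follows. The `JobRecord` type, its field names, and the "fewer cores than a full node" rule are hypothetical and chosen only for illustration; they are not an HPC Pack API, and real job data would need to be exported from the scheduler first.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    """Hypothetical job summary, not an HPC Pack type."""
    name: str
    is_exclusive: bool
    cores_requested: int


def exclusive_jobs_to_review(jobs, cores_per_node):
    """Return names of exclusive jobs that request fewer cores than one node provides.

    Such jobs hold a whole node while using only part of it, so they are
    candidates for running without exclusive access.
    """
    return [
        job.name
        for job in jobs
        if job.is_exclusive and job.cores_requested < cores_per_node
    ]


# Hypothetical sample data.
jobs = [
    JobRecord("render", True, 2),
    JobRecord("solver", True, 16),
    JobRecord("post", False, 1),
]
to_review = exclusive_jobs_to_review(jobs, cores_per_node=16)
```

With the sample data above, only the "render" job is flagged, because it reserves a 16-core node while asking for 2 cores.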

Target

Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler

Parent Monitor

System.Health.PerformanceState

Category

PerformanceHealth

Enabled

False

Instance Name

HPC Scheduler

Counter Name

Daily job queue time

Frequency

Alert Generate

True

Alert Severity

MatchMonitorHealth

Alert Priority

Normal

Alert Auto Resolve

True

Monitor Type

System.Performance.DoubleThreshold

Remotable

True

Accessibility

Public

Alert Message

Daily Job Queue Time has exceeded the upper threshold

RunAs

Default
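The Monitor Type listed above, System.Performance.DoubleThreshold, evaluates the counter against two thresholds. A minimal sketch of that kind of three-state check is shown below. The state names, their mapping to the thresholds, and the threshold values are assumptions; this article only states that an alert is raised when the upper threshold is exceeded.

```python
def classify(value, lower, upper):
    """Classify a counter value against a lower and an upper threshold.

    The state names and their mapping are assumptions for illustration,
    not the management pack's actual configuration.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if value > upper:
        return "critical"
    if value > lower:
        return "warning"
    return "healthy"


# Hypothetical thresholds, in seconds of average queue time.
LOWER, UPPER = 600, 3600
states = [classify(v, LOWER, UPPER) for v in (120, 900, 7200)]
```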

Microsoft.HPC.2008.Monitor.JobScheduler.WaitTime (UnitMonitor)

Source Code: