This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are jobs in the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is at a warning or critical level.
The health levels are defined as below:
1. Healthy – The number of failed jobs is less than or equal to 20% of the total number of finished jobs.
2. Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.
3. Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.
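The thresholds above can be sketched as a small classification function. This is an illustrative sketch only: the function name and structure are assumptions, not part of the actual HPC Pack monitor implementation, but the 20% and 70% boundaries match the health levels defined above.

```python
def classify_job_scheduler_health(failed_jobs: int, finished_jobs: int) -> str:
    """Map the failed-job percentage to a monitor health state.

    Finished jobs are jobs in the Finished, Canceled, or Failed state.
    NOTE: this is a hypothetical sketch of the documented thresholds,
    not the real HPC Job Scheduler monitor code.
    """
    if finished_jobs == 0:
        return "Healthy"  # no finished jobs, so nothing to flag
    failed_pct = 100.0 * failed_jobs / finished_jobs
    if failed_pct <= 20:
        return "Healthy"   # <= 20% failed
    if failed_pct <= 70:
        return "Warning"   # > 20% and <= 70% failed
    return "Critical"      # > 70% failed
```

For example, 25 failed jobs out of 100 finished jobs falls in the Warning band, while 71 out of 100 is Critical.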
Failed jobs can be caused by any of the following:
- Application failures. These failures are indicated by applications that return a non-zero exit code, and can have a variety of causes.
- Node failures.
- Network failures.
- Storage failures (often caused by network failures).
- Submission errors, such as bad file or directory names for tasks.
To troubleshoot and fix this problem:
1. Check the reason for the job failure by using HPC Cluster Manager.
2. If the job failed because one or more of its tasks failed, check the output of the failed tasks for the reason for the failure.
3. If the job failed because of a node failure, check that your nodes are online and that you have network connectivity to them.
4. Check the health state of the nodes that the failed jobs ran on. Click the State view in the Compute Node folder and check the nodes that these jobs ran on.
| Property | Value |
| --- | --- |
| Target | Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler |
| Parent Monitor | System.Health.PerformanceState |
| Category | PerformanceHealth |
| Enabled | True |
| Alert Generate | True |
| Alert Severity | MatchMonitorHealth |
| Alert Priority | Normal |
| Alert Auto Resolve | True |
| Monitor Type | Microsoft.HPC.2008.MonitorType.JobScheduler.FailedJobs |
| Remotable | True |
| Accessibility | Public |
| Alert Message | |
| RunAs | Default |
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.JobScheduler.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" TypeID="Microsoft.HPC.2008.MonitorType.JobScheduler.FailedJobs" ConfirmDelivery="false">
<Category>PerformanceHealth</Category>
<AlertSettings AlertMessage="Microsoft.HPC.2008.Monitor.JobScheduler.FailedJobs_AlertMessageResourceID">
<AlertOnState>Error</AlertOnState>
<AutoResolve>true</AutoResolve>
<AlertPriority>Normal</AlertPriority>
<AlertSeverity>MatchMonitorHealth</AlertSeverity>
</AlertSettings>
<OperationalStates>
<OperationalState ID="UIGeneratedOpStateIdb4439e4e122e47f1be04555087dd71a1" MonitorTypeStateID="Success" HealthState="Success"/>
<OperationalState ID="UIGeneratedOpStateId4f57332dacda44bba2042159b663e23d" MonitorTypeStateID="Warning" HealthState="Warning"/>
<OperationalState ID="UIGeneratedOpStateIdd90e2329131347df9ff3c7d412516d9e" MonitorTypeStateID="Error" HealthState="Error"/>
</OperationalStates>
<Configuration>
<IntervalSeconds>300</IntervalSeconds>
<SyncTime/>
<TimeoutSeconds>300</TimeoutSeconds>
<ClusterName>$Target/Host/Property[Type="Microsoft.HPC.2008.HeadNode.HPCPack"]/ClusterName$</ClusterName>
</Configuration>
</UnitMonitor>