Failed Jobs Proportion

Microsoft.HPC.2008.Monitor.JobScheduler.FailedJobs (UnitMonitor)

Knowledge Base article:

Summary

This monitor tracks the percentage of failed jobs out of the total number of finished jobs. A job counts as finished when it is in the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is at a warning or critical level.

The health levels are defined as follows:

1. Healthy – The number of failed jobs is less than or equal to 20% of the total number of finished jobs.

2. Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.

3. Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.
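
The health levels above can be sketched as a small calculation. This is only an illustration of the thresholds stated in this article, not the monitor's actual implementation: the function names, the enum, and the decision to treat "no finished jobs" as 0% are assumptions made for the example.

```python
from enum import Enum


class HealthState(Enum):
    HEALTHY = "Healthy"
    WARNING = "Warning"
    CRITICAL = "Critical"


# Thresholds taken from the health-level definitions in this article (percent).
WARNING_THRESHOLD = 20.0
CRITICAL_THRESHOLD = 70.0


def failed_job_percentage(finished: int, canceled: int, failed: int) -> float:
    """Percentage of failed jobs among all jobs in a finished state.

    The three arguments are job counts per state (hypothetical inputs).
    """
    total = finished + canceled + failed
    if total == 0:
        # Assumption: with no finished jobs, report 0% rather than failing.
        return 0.0
    return 100.0 * failed / total


def classify(percentage: float) -> HealthState:
    """Map a failed-job percentage to a health level."""
    if percentage > CRITICAL_THRESHOLD:
        return HealthState.CRITICAL
    if percentage > WARNING_THRESHOLD:
        return HealthState.WARNING
    return HealthState.HEALTHY
```

Both boundaries are inclusive on the lower level: exactly 20% is still Healthy, and exactly 70% is still Warning.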

Causes

Failed jobs can be caused by any of the following:

Resolutions

To troubleshoot and fix this problem:

Element properties:

Target: Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler
Parent Monitor: System.Health.PerformanceState
Category: PerformanceHealth
Enabled: True
Alert Generate: True
Alert Severity: MatchMonitorHealth
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HPC.2008.MonitorType.JobScheduler.FailedJobs
Remotable: True
Accessibility: Public
Alert Message: Failed Jobs Proportion has exceeded the upper threashold
RunAs: Default

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008.Monitor.JobScheduler.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" TypeID="Microsoft.HPC.2008.MonitorType.JobScheduler.FailedJobs" ConfirmDelivery="false">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitor.JobScheduler.FailedJobs_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateIdb4439e4e122e47f1be04555087dd71a1" MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateId4f57332dacda44bba2042159b663e23d" MonitorTypeStateID="Warning" HealthState="Warning"/>
    <OperationalState ID="UIGeneratedOpStateIdd90e2329131347df9ff3c7d412516d9e" MonitorTypeStateID="Error" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>300</IntervalSeconds>
    <SyncTime/>
    <TimeoutSeconds>300</TimeoutSeconds>
    <ClusterName>$Target/Host/Property[Type="Microsoft.HPC.2008.HeadNode.HPCPack"]/ClusterName$</ClusterName>
  </Configuration>
</UnitMonitor>