Failed Job Proportion

Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs (UnitMonitor)

Failed job proportion performance monitor for HPC 2008 R2 Job Scheduler

Knowledge Base article:


This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are jobs in the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is at a warning or critical level.
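As a worked illustration of the proportion described above, the following minimal sketch computes the failed-job percentage from three counts. The function name `Get-FailedJobPercent` and its parameters are hypothetical and not part of the management pack. It assumes the three counts are disjoint, so that their sum is the total number of finished jobs.

```powershell
# Hypothetical helper for illustration only; not part of the management pack.
function Get-FailedJobPercent {
    param(
        [double]$Failed,     # jobs in the Failed state
        [double]$Finished,   # jobs in the Finished state
        [double]$Canceled    # jobs in the Canceled state
    )

    # Assumption: the three states are disjoint, so the sum is the total of ended jobs.
    $total = $Failed + $Finished + $Canceled

    # Report 0 rather than dividing by zero when no job has ended yet.
    if ($total -eq 0) { return [double]0 }

    return [double]($Failed * 100 / $total)
}

# 5 failed jobs out of 100 ended jobs gives 5 percent.
Get-FailedJobPercent -Failed 5 -Finished 90 -Canceled 5
```

Returning 0 for an empty total keeps the monitor in its lowest state until at least one job has ended.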

The health levels are defined as below:


Failed jobs can be caused by any of the following:


To troubleshoot and fix this problem:

Element properties:

Parent Monitor: System.Health.PerformanceState
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates
Alert Message:
Failed Jobs Proportion has exceeded the upper threshold
Please see the alert context for details.
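
The monitor type has three threshold states, and the monitor's operational states map Low, Medium, and High to the Success, Warning, and Error health states. The sketch below shows one way such a classification could work. The function name `Get-ThresholdState`, the threshold values 10 and 30, and the inclusive comparison are all assumptions made for illustration; this page does not state the configured thresholds.

```powershell
# Hypothetical helper for illustration only; thresholds and comparison are assumed.
function Get-ThresholdState {
    param(
        [double]$Value,
        [double]$WarningThreshold,   # assumed lower threshold
        [double]$ErrorThreshold      # assumed upper threshold
    )

    if ($Value -ge $ErrorThreshold)   { return 'High' }
    if ($Value -ge $WarningThreshold) { return 'Medium' }
    return 'Low'
}

# State-to-health mapping as given by the monitor's OperationalState elements.
$healthByState = @{ Low = 'Success'; Medium = 'Warning'; High = 'Error' }

# Example with made-up thresholds of 10 and 30 percent.
$state = Get-ThresholdState -Value 12 -WarningThreshold 10 -ErrorThreshold 30
"$state -> $($healthByState[$state])"
```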

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008R2.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates" ConfirmDelivery="true">
<AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs_AlertMessageResourceID">
<OperationalState ID="UIGeneratedOpStateId89d87fa084344a598acc097cad7de57d" MonitorTypeStateID="Low" HealthState="Success"/>
<OperationalState ID="UIGeneratedOpStateId14505226e15e4d51b7645a6f990dbc3c" MonitorTypeStateID="Medium" HealthState="Warning"/>
<OperationalState ID="UIGeneratedOpStateIdab667860b3a8412993c122c76f0d7627" MonitorTypeStateID="High" HealthState="Error"/>

param ($clusterName)

Add-PSSnapin Microsoft.HPC

# Property bag that carries the computed value back to the monitor.
$api = New-Object -ComObject "MOM.ScriptAPI"
$bag = $api.CreatePropertyBag()

# Build the Get-HpcMetricValue arguments; name a scheduler only when one is given.
$Parameters = "-Name HPCSchedulerJobs -Counter 'Number of failed jobs','Number of finished jobs','Number of canceled jobs' "
if ($clusterName -ne "") {
    $Parameters = $Parameters + "-Scheduler " + $clusterName + " "
}

$results = Invoke-Expression "Get-HpcMetricValue $Parameters"
$failed = 0
$total = 0

# The total is the sum of all three counters; the failed count is also kept separately.
foreach ($value in $results) {
    if ($value.Counter -eq 'Number of failed jobs') {
        $failed = $value.Value
    }

    $total += $value.Value
}

# Avoid dividing by zero when no job has ended yet.
$percent = [double]0
if ($total -ne 0) {
    $percent = $failed * 100 / $total
}

if (-not [double]::IsNaN($percent)) {
    $bag.AddValue("Value", [double]$percent)
}
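
The script above assembles its cmdlet arguments as a string and runs them through Invoke-Expression. As an alternative sketch, the same arguments can be held in a hashtable and passed by splatting, which avoids evaluating the cluster name as code. The function name `New-MetricQueryParameters` and the example name `HEADNODE01` are hypothetical. The parameter names `Name`, `Counter`, and `Scheduler` are taken from the script above.

```powershell
# Hypothetical helper for illustration only; not part of the management pack.
function New-MetricQueryParameters {
    param([string]$ClusterName)

    # Same metric name and counters as the original argument string.
    $params = @{
        Name    = 'HPCSchedulerJobs'
        Counter = 'Number of failed jobs', 'Number of finished jobs', 'Number of canceled jobs'
    }

    # Add the scheduler only when a cluster name was supplied.
    if (-not [string]::IsNullOrEmpty($ClusterName)) {
        $params['Scheduler'] = $ClusterName
    }

    return $params
}

$query = New-MetricQueryParameters -ClusterName 'HEADNODE01'

# On a machine with the Microsoft.HPC snap-in loaded, the call would then be:
#   $results = Get-HpcMetricValue @query
$query.Keys | Sort-Object
```

Building the hashtable in a separate function makes the argument logic checkable on a machine without the HPC snap-in.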
