Failed Job Proportion

Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs (UnitMonitor)

Failed job proportion performance monitor for HPC 2008 R2 Job Scheduler

Knowledge Base article:

Summary

This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are those jobs that are at the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is in a warning or critical level.

The health levels are defined as below:

Healthy – The number of failed jobs is less than or equal to 20% of total number of finished jobs.
Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.
Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.

Causes

Failed jobs can be caused by any of the following:

Application failures. These failures are indicated by applications that return a non-zero exit code, and could have a variety of causes.
Node failures.
Network failures.
Storage failures (often caused by network failures).
Submission errors, such as bad file or directory names for tasks.

Resolutions

To troubleshoot and fix this problem:

Check for the reason of the job failure. The reason of the job failure can be determined by using HPC Cluster Manager.
If the job failed because of a failure of one or more tasks in the job, check for the reasons of the failure in the output for the failed tasks within the job.
If the job failure is because of a node failure, check that your nodes are online and that you have network connectivity to your nodes.
Check the health state of nodes that the failed jobs ran on. Click the State view in the Compute Node folder and check the nodes that these jobs ran on.

Element properties:

Target

Microsoft.HPC.2008R2.JobScheduler

Parent Monitor

System.Health.PerformanceState

Category

PerformanceHealth

Enabled

True

Alert Generate

True

Alert Severity

Error

Alert Priority

Normal

Alert Auto Resolve

True

Monitor Type

Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates

Remotable

True

Accessibility

Public

Alert Message

Failed Jobs Proportion has exceeded the upper threshold

Please see the alert context for details.

RunAs

Microsoft.HPC.RunAsProfile.AdminActionAccount

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008R2.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates" ConfirmDelivery="true">

  <Category>PerformanceHealth</Category>

  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs_AlertMessageResourceID">

    <AlertOnState>Error</AlertOnState>

    <AutoResolve>true</AutoResolve>

    <AlertPriority>Normal</AlertPriority>

    <AlertSeverity>Error</AlertSeverity>

  </AlertSettings>

  <OperationalStates>

    <OperationalState ID="UIGeneratedOpStateId89d87fa084344a598acc097cad7de57d" MonitorTypeStateID="Low" HealthState="Success"/>

    <OperationalState ID="UIGeneratedOpStateId14505226e15e4d51b7645a6f990dbc3c" MonitorTypeStateID="Medium" HealthState="Warning"/>

    <OperationalState ID="UIGeneratedOpStateIdab667860b3a8412993c122c76f0d7627" MonitorTypeStateID="High" HealthState="Error"/>

  </OperationalStates>

  <Configuration>

    <ScriptName>GetFailedJobs</ScriptName>

    <ScriptBody><Script>



param ($clusterName)



Add-PSSnapin Microsoft.HPC



$api = New-Object -ComObject "MOM.ScriptAPI"

$bag = $api.CreatePropertyBag()



$Parameters = "-Name HPCSchedulerJobs -Counter 'Number of failed jobs','Number of finished jobs','Number of canceled jobs' "

if ($clusterName -ne "")

{

	$Parameters = $Parameters + "-Scheduler " + $clusterName + " "

}



$results = Invoke-Expression "Get-HpcMetricValue $Parameters"

$failed = 0

$total = 0



foreach ($value in $results) 

{  

    if ($value.Counter -eq 'Number of failed jobs')

	{ 

		$failed = $value.Value

	}

    

	$total += $value.Value

}



$percent = [double]0

if ($total -ne 0)

{

	$percent = $failed * 100 / $total

}



if (-not [double]::IsNaN($percent))

{

    $bag.AddValue("Value", [double]$percent)

}

$api.Return($bag)



  </Script></ScriptBody>

    <Parameters>'$Target/Host/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$'</Parameters>

    <LowThreshold>20</LowThreshold>

    <HighThreshold>70</HighThreshold>

    <TimeoutSeconds>300</TimeoutSeconds>

    <IntervalSeconds>900</IntervalSeconds>

  </Configuration>

</UnitMonitor>