Failed Job Proportion

Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs (UnitMonitor)

Failed job proportion performance monitor for HPC 2008 R2 Job Scheduler

Knowledge Base article:

Summary

This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are jobs in the Finished, Canceled, or Failed state. A high percentage of failed jobs may indicate that the HPC Job Scheduler Service is in a warning or critical health state.
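
The calculation described above can be tried without an HPC cluster. The following is a minimal sketch that mirrors the arithmetic in the monitor script listed under Source Code. The function name Get-FailedJobPercent and the sample counter values are hypothetical and are not part of the management pack.

```powershell
# Hypothetical helper (not part of the management pack): reproduces the
# monitor's arithmetic on plain numbers instead of live scheduler counters.
function Get-FailedJobPercent {
    param (
        [double]$Failed,
        [double]$Finished,
        [double]$Canceled
    )

    # The monitor adds the three counters to get the number of completed jobs.
    $total = $Failed + $Finished + $Canceled

    # Guard against division by zero when no job has completed yet.
    if ($total -eq 0) {
        return [double]0
    }

    return $Failed * 100 / $total
}

# Made-up counter values: 5 failed, 40 finished, 5 canceled.
Get-FailedJobPercent -Failed 5 -Finished 40 -Canceled 5   # 10
```

With these sample values, 5 of 50 completed jobs failed, so the reported proportion is 10 percent.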

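The health levels are defined as below, based on the thresholds in the monitor configuration:

Low (Success): the failed job proportion is below the low threshold (20 percent by default).
Medium (Warning): the failed job proportion is between the low and high thresholds.
High (Error): the failed job proportion is above the high threshold (70 percent by default).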
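
To illustrate how a proportion maps to a health level, here is a minimal sketch that uses the LowThreshold (20) and HighThreshold (70) values from the monitor configuration. The function name Get-FailedJobHealthState is hypothetical. How the monitor type treats a value exactly equal to a threshold is not shown in this document, so the comparison used here is an assumption.

```powershell
# Hypothetical helper (not part of the management pack): classifies a failed
# job percentage into the three health states used by the monitor.
function Get-FailedJobHealthState {
    param (
        [double]$Percent,
        [double]$LowThreshold = 20,   # default from the monitor configuration
        [double]$HighThreshold = 70   # default from the monitor configuration
    )

    # Assumption: a value exactly on a threshold stays in the lower state.
    if ($Percent -gt $HighThreshold) { return 'Error' }
    if ($Percent -gt $LowThreshold)  { return 'Warning' }
    return 'Success'
}

# Sample percentages, one per health state.
foreach ($p in 10, 45, 85) {
    '{0}% -> {1}' -f $p, (Get-FailedJobHealthState -Percent $p)
}
```

Because the monitor alerts on the Error state only, the Warning state changes the health rollup but does not raise an alert in this sketch's terms.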

Causes

Failed jobs can be caused by any of the following:

Resolutions

To troubleshoot and fix this problem:

Element properties:

Target: Microsoft.HPC.2008R2.JobScheduler
Parent Monitor: System.Health.PerformanceState
Category: PerformanceHealth
Enabled: True
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates
Remotable: True
Accessibility: Public
Alert Message:
    Failed Jobs Proportion has exceeded the upper threshold
    Please see the alert context for details.
RunAs: Microsoft.HPC.RunAsProfile.AdminActionAccount

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008R2.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates" ConfirmDelivery="true">
<Category>PerformanceHealth</Category>
<AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs_AlertMessageResourceID">
<AlertOnState>Error</AlertOnState>
<AutoResolve>true</AutoResolve>
<AlertPriority>Normal</AlertPriority>
<AlertSeverity>Error</AlertSeverity>
</AlertSettings>
<OperationalStates>
<OperationalState ID="UIGeneratedOpStateId89d87fa084344a598acc097cad7de57d" MonitorTypeStateID="Low" HealthState="Success"/>
<OperationalState ID="UIGeneratedOpStateId14505226e15e4d51b7645a6f990dbc3c" MonitorTypeStateID="Medium" HealthState="Warning"/>
<OperationalState ID="UIGeneratedOpStateIdab667860b3a8412993c122c76f0d7627" MonitorTypeStateID="High" HealthState="Error"/>
</OperationalStates>
<Configuration>
<ScriptName>GetFailedJobs</ScriptName>
<ScriptBody><Script>

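param ($clusterName)

# Load the HPC snap-in that provides Get-HpcMetricValue.
Add-PSSnapin Microsoft.HPC

# Property bag used to return the result to Operations Manager.
$api = New-Object -ComObject "MOM.ScriptAPI"
$bag = $api.CreatePropertyBag()

# Build the argument string; -Scheduler is added only when a cluster name was passed in.
$Parameters = "-Name HPCSchedulerJobs -Counter 'Number of failed jobs','Number of finished jobs','Number of canceled jobs' "
if ($clusterName -ne "")
{
    $Parameters = $Parameters + "-Scheduler " + $clusterName + " "
}

$results = Invoke-Expression "Get-HpcMetricValue $Parameters"
$failed = 0
$total = 0

# Sum all three counters into the total and remember the failed count.
foreach ($value in $results)
{
    if ($value.Counter -eq 'Number of failed jobs')
    {
        $failed = $value.Value
    }

    $total += $value.Value
}

# Percentage of failed jobs; stays 0 when no job has completed.
$percent = [double]0
if ($total -ne 0)
{
    $percent = $failed * 100 / $total
}

# Report the value only when it is a valid number.
if (-not [double]::IsNaN($percent))
{
    $bag.AddValue("Value", [double]$percent)
}
$api.Return($bag)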

</Script></ScriptBody>
<Parameters>'$Target/Host/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$'</Parameters>
<LowThreshold>20</LowThreshold>
<HighThreshold>70</HighThreshold>
<TimeoutSeconds>300</TimeoutSeconds>
<IntervalSeconds>900</IntervalSeconds>
</Configuration>
</UnitMonitor>