Failed job proportion performance monitor for HPC 2008 R2 Job Scheduler
This monitor tracks the percentage of failed jobs out of the total number of finished jobs. Finished jobs are those jobs that are at the Finished, Canceled, or Failed state. A large percentage of failed jobs may indicate that the health of the HPC Job Scheduler Service is in a warning or critical level.
The health levels are defined as below:
Healthy – The number of failed jobs is less than or equal to 20% of total number of finished jobs.
Warning – The number of failed jobs is greater than 20% and less than or equal to 70% of the total number of finished jobs.
Critical – The number of failed jobs is greater than 70% of the total number of finished jobs.
Failed jobs can be caused by any of the following:
Application failures. These failures are indicated by applications that return a non-zero exit code, and could have a variety of causes.
Node failures.
Network failures.
Storage failures (often caused by network failures).
Submission errors, such as bad file or directory names for tasks.
To troubleshoot and fix this problem:
Check for the reason of the job failure. The reason of the job failure can be determined by using HPC Cluster Manager.
If the job failed because of a failure of one or more tasks in the job, check for the reasons of the failure in the output for the failed tasks within the job.
If the job failure is because of a node failure, check that your nodes are online and that you have network connectivity to your nodes.
Check the health state of nodes that the failed jobs ran on. Click the State view in the Compute Node folder and check the nodes that these jobs ran on.
Target | Microsoft.HPC.2008R2.JobScheduler | ||
Parent Monitor | System.Health.PerformanceState | ||
Category | PerformanceHealth | ||
Enabled | True | ||
Alert Generate | True | ||
Alert Severity | Error | ||
Alert Priority | Normal | ||
Alert Auto Resolve | True | ||
Monitor Type | Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates | ||
Remotable | True | ||
Accessibility | Public | ||
Alert Message |
| ||
RunAs | Microsoft.HPC.RunAsProfile.AdminActionAccount |
<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008R2.JobScheduler" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.ThreeThresholdStates" ConfirmDelivery="true">
<Category>PerformanceHealth</Category>
<AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.JobScheduler.Performance.FailedJobs_AlertMessageResourceID">
<AlertOnState>Error</AlertOnState>
<AutoResolve>true</AutoResolve>
<AlertPriority>Normal</AlertPriority>
<AlertSeverity>Error</AlertSeverity>
</AlertSettings>
<OperationalStates>
<OperationalState ID="UIGeneratedOpStateId89d87fa084344a598acc097cad7de57d" MonitorTypeStateID="Low" HealthState="Success"/>
<OperationalState ID="UIGeneratedOpStateId14505226e15e4d51b7645a6f990dbc3c" MonitorTypeStateID="Medium" HealthState="Warning"/>
<OperationalState ID="UIGeneratedOpStateIdab667860b3a8412993c122c76f0d7627" MonitorTypeStateID="High" HealthState="Error"/>
</OperationalStates>
<Configuration>
<ScriptName>GetFailedJobs</ScriptName>
<ScriptBody>
param ($clusterName)
Add-PSSnapin Microsoft.HPC
$api = New-Object -ComObject "MOM.ScriptAPI"
$bag = $api.CreatePropertyBag()
$Parameters = "-Name HPCSchedulerJobs -Counter 'Number of failed jobs','Number of finished jobs','Number of canceled jobs' "
if ($clusterName -ne "")
{
$Parameters = $Parameters + "-Scheduler " + $clusterName + " "
}
$results = Invoke-Expression "Get-HpcMetricValue $Parameters"
$failed = 0
$total = 0
foreach ($value in $results)
{
if ($value.Counter -eq 'Number of failed jobs')
{
$failed = $value.Value
}
$total += $value.Value
}
$percent = [double]0
if ($total -ne 0)
{
$percent = $failed * 100 / $total
}
if (-not [double]::IsNaN($percent))
{
$bag.AddValue("Value", [double]$percent)
}
$api.Return($bag)
</ScriptBody>
<Parameters>'$Target/Host/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$'</Parameters>
<LowThreshold>20</LowThreshold>
<HighThreshold>70</HighThreshold>
<TimeoutSeconds>300</TimeoutSeconds>
<IntervalSeconds>900</IntervalSeconds>
</Configuration>
</UnitMonitor>