This monitor runs a diagnostic test of the HPC Job Scheduler Service once a day. The Job Submission Test submits a simple job to all nodes in the cluster. It tests the ability of the HPC Job Scheduler Service to accept and run a job on the cluster.
This monitor will enter the Critical state if the Job Submission Test fails on the cluster.
The Job Submission Test can fail for several reasons. The most common are:
The HPC Job Scheduler Service is not running.
There are no available compute nodes in the cluster.
There are network issues on the cluster.
The HPC Job Scheduler Service has encountered an error.
To troubleshoot and fix this problem:
Check whether the HPC Job Scheduler Service health monitor reports a healthy state. If it does not, follow the resolution steps for the HPC Job Scheduler Service health monitor.
In HPC Cluster Manager, in Diagnostics, click Test Results to view detailed error information about the diagnostics failure.
Target | Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler
Parent Monitor | System.Health.AvailabilityState
Category | AvailabilityHealth
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HPC.2008.MonitorType.RunDiagnosticResult
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.JobSubmissionTestResult" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.HeadNode.HPCPack.JobScheduler" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Microsoft.HPC.2008.MonitorType.RunDiagnosticResult" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitor.JobSubmissionTestResult_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateIde51dafbb58294d4b8b35f49c51d37074" MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateIde60a0185d56c4731b2a0b71eb2957002" MonitorTypeStateID="Failed" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>86400</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <NodeName>Null</NodeName>
    <TestName>SimpleSchedulerTest</TestName>
  </Configuration>
</UnitMonitor>
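The configuration values in the monitor definition above can be cross-checked against the description with a short script. The following is a minimal sketch in Python, not part of the management pack: it parses a trimmed copy of the definition (only the elements it reads, with the OperationalState IDs omitted) and converts the interval and timeout into readable units. The names `MONITOR_XML` and `summarize_monitor` are hypothetical and exist only for this illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the UnitMonitor definition above. Only the elements this
# sketch reads are kept; the OperationalState IDs are omitted for brevity.
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.JobSubmissionTestResult" Enabled="true">
  <OperationalStates>
    <OperationalState MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState MonitorTypeStateID="Failed" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>86400</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <NodeName>Null</NodeName>
    <TestName>SimpleSchedulerTest</TestName>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text):
    """Return the schedule and state mapping of a UnitMonitor element (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    config = root.find("Configuration")
    interval_seconds = int(config.findtext("IntervalSeconds"))
    timeout_seconds = int(config.findtext("TimeoutSeconds"))
    # Map each monitor-type state to the health state it produces.
    states = {
        state.get("MonitorTypeStateID"): state.get("HealthState")
        for state in root.iter("OperationalState")
    }
    return {
        "id": root.get("ID"),
        "test_name": config.findtext("TestName"),
        "interval_hours": interval_seconds / 3600,
        "timeout_minutes": timeout_seconds / 60,
        "states": states,
    }


summary = summarize_monitor(MONITOR_XML)
print(summary)
```

The 24-hour interval matches the "once a day" schedule in the description, and the Failed state maps to the Error health state, which is the state that raises the alert.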