Job scheduler service availability monitor for HPC 2008 R2
This monitor tracks the status of the HPC Job Scheduler Service. When this service is stopped, no new jobs can be submitted, and no queued jobs or new tasks are started. Tasks that are already running will complete.
In a cluster configured for high availability of the head node, the HPC Job Scheduler Service is not configured to start automatically on either head node, and the “Alert only if service startup type is automatic” option is set to True by default in the management pack. With these defaults, the monitor does not raise an alert when the service stops. To monitor the HPC Job Scheduler Service on a failover cluster by using the management pack, you must manually change the “Alert only if service startup type is automatic” option to False on the current active head node. The monitoring tools in Failover Cluster Manager can also be used to monitor the service.
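The interaction between the service state and the “Alert only if service startup type is automatic” option can be sketched as follows. This is a minimal illustration in Python, not the management pack's implementation: the function name `monitor_state` and its parameters are hypothetical, and the logic assumes that the option simply suppresses the NotRunning state for any service whose startup type is not automatic.

```python
def monitor_state(service_running: bool, startup_type: str,
                  alert_only_if_automatic: bool = True) -> str:
    """Return "Running" or "NotRunning" for a hypothetical service check.

    Assumption: when alert_only_if_automatic is True, a service whose
    startup type is not "Automatic" is never reported as NotRunning.
    """
    if alert_only_if_automatic and startup_type.lower() != "automatic":
        # The state check is skipped, so the monitor stays healthy.
        return "Running"
    return "Running" if service_running else "NotRunning"
```

Under these assumptions, a stopped service with a startup type of Manual (used here only as an example of a non-automatic type) yields `monitor_state(False, "Manual")` equal to `"Running"`, so no alert is raised. Passing `alert_only_if_automatic=False` yields `"NotRunning"`, which corresponds to the Error health state.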
This error can be caused by any of the following:
The HPC Job Scheduler Service encountered an error and had to stop running.
The HPC Job Scheduler Service is disabled.
Group Policy does not allow this service to start.
The Remote Procedure Call (RPC) Service is stopped or disabled. The HPC Job Scheduler Service depends on the RPC Service and cannot run without it.
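To narrow down which of these causes applies, it can help to check the state of both the HPC Job Scheduler Service and the RPC Service. The following is a minimal sketch that wraps the Windows `sc query` command. The helper names are hypothetical, the service name `HpcScheduler` is taken from the monitor configuration, `RpcSs` is the standard service name of the RPC Service, and the parser assumes English `sc` output.

```python
import re
import subprocess


def parse_sc_state(sc_output: str) -> str:
    """Extract the state name (for example RUNNING or STOPPED) from `sc query` text."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else "UNKNOWN"


def query_service_state(service_name: str) -> str:
    """Run `sc query <service_name>` on the local Windows computer and parse the result."""
    result = subprocess.run(["sc", "query", service_name],
                            capture_output=True, text=True, check=False)
    return parse_sc_state(result.stdout)


def report(names=("RpcSs", "HpcScheduler")) -> dict:
    """Return a {service name: state} mapping, with the dependency listed first."""
    return {name: query_service_state(name) for name in names}
```

Calling `report()` on the head node returns the state of each service. If `RpcSs` is not reported as `RUNNING`, resolve that first, because the HPC Job Scheduler Service cannot start without it.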
To troubleshoot and fix this problem:
Restart the service on the target head node, for example by using the Start Job Scheduler Service task. In a cluster configured for high availability of the head node, the service runs only on the active head node, so use Failover Cluster Manager to restart the service in that environment.
If the service cannot be started, resolve the errors that are reported by the Service Control Manager. The Service Control Manager will produce an error event if the service is terminated unexpectedly. Start the Event Viewer on the target head node and check for any system events from the Service Control Manager or application events from the HPC Job Scheduler Service. Resolve any errors that are reported by these events.
If the service still cannot be restarted, contact the domain administrator to make sure that this service is not disabled by the domain group policy.
If the preceding steps do not resolve the problem, uninstall and reinstall the HPC Pack on the head node.
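As an alternative to browsing Event Viewer, the Service Control Manager events mentioned in the steps above can be listed from a command prompt with `wevtutil`. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, and the query filters the System log on the provider name `Service Control Manager`.

```python
import subprocess


def scm_event_query(max_events: int = 10) -> list:
    """Build a wevtutil command that lists the newest Service Control Manager events."""
    xpath = "*[System[Provider[@Name='Service Control Manager']]]"
    return ["wevtutil", "qe", "System",
            f"/q:{xpath}",        # XPath filter on the event provider
            f"/c:{max_events}",   # maximum number of events to return
            "/rd:true",           # newest events first
            "/f:text"]            # plain-text output


def recent_scm_events(max_events: int = 10) -> str:
    """Run the query on the local Windows computer and return the text output."""
    result = subprocess.run(scm_event_query(max_events),
                            capture_output=True, text=True, check=False)
    return result.stdout
```

Running `print(recent_scm_events())` on the target head node prints the most recent events. Application events from the HPC Job Scheduler Service are not covered by this sketch and still need to be reviewed separately.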
A recovery task runs automatically to restart the service, so the service may keep restarting when you intentionally try to stop it. To avoid this, do one of the following:
Disable the recovery task.
Change the service startup type to manual.
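The second option can be applied from an elevated command prompt with `sc config`. The following is a minimal sketch, assuming the service name `HpcScheduler` from the monitor configuration; the helper names are hypothetical. Disabling the recovery task is done in the monitoring console and is not shown here.

```python
import subprocess


def set_manual_start_command(service_name: str = "HpcScheduler") -> list:
    """Build the `sc config` command that sets a service to manual (demand) start."""
    # sc expects "start=" and its value as separate tokens.
    return ["sc", "config", service_name, "start=", "demand"]


def set_manual_start(service_name: str = "HpcScheduler") -> int:
    """Run the command on the local Windows computer and return its exit code."""
    return subprocess.run(set_manual_start_command(service_name),
                          check=False).returncode
```

An exit code of 0 from `set_manual_start()` indicates that `sc` accepted the change. To restore the original behavior later, the same command can be run with `auto` in place of `demand`.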
For more information about high availability head nodes, see “Configuring Windows HPC Server 2008 R2 for High Availability of the Head Node” (http://go.microsoft.com/fwlink/?LinkId=198285).
Target | Microsoft.HPC.2008R2.JobScheduler
Parent Monitor | System.Health.AvailabilityState
Category | AvailabilityHealth
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.Windows.CheckNTServiceStateMonitorType
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Microsoft.HPC.RunAsProfile.AdminActionAccount
<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Availability.JobSchedulerService"
             Accessibility="Public"
             Enabled="true"
             Target="Microsoft.HPC.2008R2.JobScheduler"
             ParentMonitorID="Health!System.Health.AvailabilityState"
             Remotable="true"
             Priority="Normal"
             RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount"
             TypeID="Windows!Microsoft.Windows.CheckNTServiceStateMonitorType"
             ConfirmDelivery="false">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.JobScheduler.Availability.JobSchedulerService_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Running" MonitorTypeStateID="Running" HealthState="Success"/>
    <OperationalState ID="NotRunning" MonitorTypeStateID="NotRunning" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <ComputerName>$Target/Host/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
    <ServiceName>HpcScheduler</ServiceName>
    <CheckStartupType/>
  </Configuration>
</UnitMonitor>
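To relate the definition above to the monitor's behavior, the following minimal sketch reads a trimmed copy of it with Python's standard `xml.etree.ElementTree` module and summarizes the monitored service and the mapping from monitor type states to health states. The function name `summarize_monitor` is hypothetical, and most attributes are omitted from the fragment for brevity.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the monitor definition; most attributes are omitted.
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.JobScheduler.Availability.JobSchedulerService">
  <OperationalStates>
    <OperationalState ID="Running" MonitorTypeStateID="Running" HealthState="Success"/>
    <OperationalState ID="NotRunning" MonitorTypeStateID="NotRunning" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <ServiceName>HpcScheduler</ServiceName>
    <CheckStartupType/>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text: str) -> dict:
    """Return the service name, state mapping, and explicit CheckStartupType value."""
    root = ET.fromstring(xml_text)
    states = {state.get("MonitorTypeStateID"): state.get("HealthState")
              for state in root.iter("OperationalState")}
    return {
        "service": root.findtext("Configuration/ServiceName"),
        "states": states,
        # An empty element yields "", which is reported as None (no explicit value).
        "check_startup_type": root.findtext("Configuration/CheckStartupType") or None,
    }
```

For this fragment, `summarize_monitor(MONITOR_XML)` reports the service `HpcScheduler`, maps `NotRunning` to the `Error` health state, and shows that `CheckStartupType` carries no explicit value in the definition.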