This monitor tracks the status of the HPC Node Manager Service. When this service is stopped, no jobs will be able to run on this node.
This error can be caused by any of the following:
The HPC Node Manager Service encountered an error and had to stop running.
The HPC Node Manager Service is disabled.
Group policy does not allow this service to start.
To troubleshoot and fix this problem:
Restart the service on the target compute node.
If the service cannot be started, resolve the errors that are reported by the Service Control Manager. The Service Control Manager will produce an error event if the service is terminated unexpectedly. Start the Event Viewer on the target compute node and check for any system events from the Service Control Manager or application events from the HPC Node Manager Service. Resolve any errors that are reported by these events.
If the service still cannot be restarted, contact the domain administrator to make sure that this service is not disabled by the domain group policy.
If the preceding steps do not resolve the problem, uninstall and reinstall the HPC Pack on the compute node.
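The first troubleshooting steps can be sketched as a small script run on the compute node. This is a minimal sketch, not part of the management pack: it assumes the built-in `sc` command is on the path, that its output is in English, and that the script runs from an elevated prompt. The function names are hypothetical.

```python
import re
import subprocess

SERVICE_NAME = "HpcNodeManager"  # service name from the monitor configuration


def parse_service_state(sc_output: str) -> str:
    """Extract the state word (e.g. RUNNING, STOPPED) from `sc query` output."""
    match = re.search(r"STATE\s*:\s*\d+\s+(\w+)", sc_output)
    return match.group(1) if match else "UNKNOWN"


def query_service_state(service: str = SERVICE_NAME) -> str:
    """Run `sc query <service>` and return the parsed state."""
    result = subprocess.run(["sc", "query", service], capture_output=True, text=True)
    return parse_service_state(result.stdout)


def restart_if_stopped(service: str = SERVICE_NAME) -> str:
    """Start the service if it is stopped and return the state seen afterwards."""
    state = query_service_state(service)
    if state == "STOPPED":
        subprocess.run(["sc", "start", service], check=False)
        # The state may still be START_PENDING immediately after the call.
        state = query_service_state(service)
    return state


if __name__ == "__main__":
    print(restart_if_stopped())
```

If the reported state stays `STOPPED` or comes back as `UNKNOWN`, continue with the Event Viewer and group policy checks described above.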
A recovery task runs automatically to restart the service, so you may find that the service keeps restarting while you are trying to stop it. There are a couple of ways to avoid this:
Disable the recovery task.
Change the service startup type to Manual.
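The second option can be sketched as follows. This is an illustrative sketch only: it assumes the built-in `sc` command and an elevated prompt, and the helper names are hypothetical. In `sc config`, the value `demand` corresponds to the Manual startup type.

```python
import subprocess
from typing import List

SERVICE_NAME = "HpcNodeManager"  # service name from the monitor configuration


def manual_start_command(service: str = SERVICE_NAME) -> List[str]:
    """Build the `sc config` command that sets the startup type to Manual."""
    # sc.exe expects "start=" and its value as separate tokens.
    return ["sc", "config", service, "start=", "demand"]


def set_manual_start(service: str = SERVICE_NAME) -> int:
    """Run the command and return the exit code (0 on success)."""
    return subprocess.run(manual_start_command(service), check=False).returncode


if __name__ == "__main__":
    raise SystemExit(set_manual_start())
```

To return the service to its usual startup type afterwards, the same command can be run with `auto` in place of `demand`.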
Target | Microsoft.HPC.2008.ComputeNode
Parent Monitor | System.Health.AvailabilityState
Category | AvailabilityHealth
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.Windows.CheckNTServiceStateMonitorType
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.ComputeNode.NodeManager" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.ComputeNode" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" TypeID="Windows!Microsoft.Windows.CheckNTServiceStateMonitorType" ConfirmDelivery="false">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitor.ComputeNode.NodeManager_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Running" MonitorTypeStateID="Running" HealthState="Success"/>
    <OperationalState ID="NotRunning" MonitorTypeStateID="NotRunning" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/NetworkName$</ComputerName>
    <ServiceName>HpcNodeManager</ServiceName>
    <CheckStartupType/>
  </Configuration>
</UnitMonitor>
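To see at a glance which service the monitor watches and how its states map to health, the definition can be read with a short script. This is a sketch for illustration only: it uses Python's standard `xml.etree.ElementTree` module on an abbreviated copy of the monitor above, and the `summarize_monitor` helper is hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the UnitMonitor definition shown above.
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HPC.2008.Monitor.ComputeNode.NodeManager"
             Target="Microsoft.HPC.2008.ComputeNode">
  <OperationalStates>
    <OperationalState ID="Running" MonitorTypeStateID="Running" HealthState="Success"/>
    <OperationalState ID="NotRunning" MonitorTypeStateID="NotRunning" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <ServiceName>HpcNodeManager</ServiceName>
    <CheckStartupType/>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text: str) -> dict:
    """Return the monitored service name and the state-to-health mapping."""
    root = ET.fromstring(xml_text)
    states = {
        state.get("MonitorTypeStateID"): state.get("HealthState")
        for state in root.findall("OperationalStates/OperationalState")
    }
    return {
        "monitor_id": root.get("ID"),
        "service": root.findtext("Configuration/ServiceName"),
        "states": states,
    }


if __name__ == "__main__":
    print(summarize_monitor(MONITOR_XML))
```

The mapping shows that a `NotRunning` service state puts the compute node into the `Error` health state, which is the state the alert settings fire on.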