NodeManager Component State

Microsoft.HDInsight.UnitMonitor.NodeManagerComponentHealthState (UnitMonitor)

Monitors the health state of the NodeManager process.

Knowledge Base article:

Summary

This monitor checks the health state of the NodeManager process on the host. If the NodeManager on a particular host is down, parallelism is reduced and overall system performance suffers. If a critical number of NodeManagers are down, the result is either unacceptable processing times or an inability to process the application.

HDInsight Appliance

The monitor is active and reports the actual component state.

HDInsight Azure

This monitor is not available in HDInsight clusters on Azure, so the diagnostic and resolution steps below do not apply to that environment.

Causes

The NodeManager service can go offline for various reasons:

Resolutions

If NodeManager was not stopped on purpose, use the following steps to diagnose the issue:

Connecting remotely to the virtual machine that owns the failed NodeManager is a two-step operation:

To resolve the issue:

Element properties:

Target: Microsoft.HDInsight.HostComponent.NodeManager
Parent Monitor: System.Health.AvailabilityState
Category: AvailabilityHealth
Enabled: True
Alert Generate: False
Alert Auto Resolve: True
Monitor Type: Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState
Remotable: True
Accessibility: Public
RunAs: Default

Source Code:

<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.NodeManagerComponentHealthState" TypeID="Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState" Target="Microsoft.HDInsight.HostComponent.NodeManager" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Warning"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
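
The definition above polls every 900 seconds, so a stopped NodeManager can go unreported for up to 15 minutes, and an unhealthy result surfaces as a Warning rather than a Critical state. Where faster detection matters, the interval could be shortened with a configuration override in a separate, unsealed management pack. The fragment below is a minimal sketch, not part of this management pack. It assumes that the monitor type exposes IntervalSeconds as an overrideable parameter (not confirmed by this page), that the override pack references the HDInsight management pack under the hypothetical alias HDI, and that the override ID shown is free to use. The value 600 is illustrative and is kept above the 300-second timeout.

```xml
<!-- Sketch only: belongs inside the <Monitoring> section of a custom override management pack. -->
<!-- "HDI" is a hypothetical reference alias; the override ID is a made-up name. -->
<Overrides>
  <MonitorConfigurationOverride
      ID="Custom.HDInsight.Override.NodeManagerComponentHealthState.IntervalSeconds"
      Context="HDI!Microsoft.HDInsight.HostComponent.NodeManager"
      Enforced="false"
      Monitor="HDI!Microsoft.HDInsight.UnitMonitor.NodeManagerComponentHealthState"
      Parameter="IntervalSeconds">
    <!-- Poll every 10 minutes instead of every 15 (assumed to be overrideable). -->
    <Value>600</Value>
  </MonitorConfigurationOverride>
</Overrides>
```

If the parameter is not overrideable in the monitor type, the import of such an override pack would be expected to fail validation, in which case the interval cannot be changed this way.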