NodeManagers Down

Microsoft.HDInsight.UnitMonitor.YarnDeadNodeManagers (UnitMonitor)

Monitors the percentage of dead NodeManagers in the cluster.

Knowledge Base article:

Summary

This monitor checks the percentage of dead NodeManagers in the cluster. A NodeManager can go offline for several reasons: network failures, problems with the virtual or physical host machine, or a malfunction of the NodeManager component itself. A single NodeManager failure can happen from time to time, but NodeManagers failing in groups indicate a more serious issue in the cluster infrastructure.
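As a rough illustration of the metric, the percentage can be computed from two counts. This is a minimal sketch: the function name and its inputs are hypothetical, and this article does not show how the management pack itself derives the value.

```python
def dead_nodemanagers_percent(dead: int, total: int) -> float:
    """Return the share of dead NodeManagers as a percentage (0-100)."""
    if total <= 0:
        # Assumption: a cluster with no registered NodeManagers reports 0
        # instead of raising a division error.
        return 0.0
    if dead < 0 or dead > total:
        raise ValueError("dead must be between 0 and total")
    return 100.0 * dead / total


if __name__ == "__main__":
    # Hypothetical cluster: 3 of 12 NodeManagers are dead.
    print(dead_nodemanagers_percent(3, 12))
```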

Default monitor thresholds: warning threshold 0%, critical threshold 25%, evaluated every 900 seconds on a single sample.

Causes

The NodeManager service can go offline for various reasons:

Resolutions

HDInsight Appliance

To diagnose the issue:

Connecting remotely to the virtual machine that owns the failed NodeManager is a two-step operation:

If you are unable to diagnose the issue, contact the Microsoft Support team and provide the alert details. Be aware that diagnostic actions may require administrator permissions on the HDInsight cluster.

To resolve the issue:

HDInsight Azure

To resolve the issue:

Element properties:

Target: Microsoft.HDInsight.ClusterService.Yarn
Parent Monitor: System.Health.AvailabilityState
Category: AvailabilityHealth
Enabled: True
Alert Generate: True
Alert Severity: MatchMonitorHealth
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HDInsight.UnitMonitorType.YarnServiceThreeStateConsecutiveThreshold
Remotable: True
Accessibility: Public
Alert Message: A significant number of NodeManagers are down in the cluster.
There is {1} % of NodeManagers down in the cluster "{0}".
RunAs: Default

Source Code:

<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.YarnDeadNodeManagers" TypeID="Microsoft.HDInsight.UnitMonitorType.YarnServiceThreeStateConsecutiveThreshold" Target="Microsoft.HDInsight.ClusterService.Yarn" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HDInsight.UnitMonitor.YarnDeadNodeManagers.AlertMessage">
    <AlertOnState>Warning</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Target/Host/Property[Type="Microsoft.HDInsight.ClusterService.Private"]/ClusterName$</AlertParameter1>
      <AlertParameter2>$Data/Context/SampleValue$</AlertParameter2>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Warning" MonitorTypeStateID="Warning" HealthState="Warning"/>
    <OperationalState ID="Critical" MonitorTypeStateID="Critical" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <PropertyName>deadnodemanagerspercent</PropertyName>
    <WarningThreshold>0</WarningThreshold>
    <CriticalThreshold>25</CriticalThreshold>
    <NumSamples>1</NumSamples>
  </Configuration>
</UnitMonitor>
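
The Configuration element above defines a warning threshold of 0 and a critical threshold of 25 for the deadnodemanagerspercent property, with NumSamples set to 1. The following sketch shows one way to read those values as a three-state check. The function name is hypothetical, and the strictly-greater-than comparison is an assumption: the actual comparison lives in the YarnServiceThreeStateConsecutiveThreshold monitor type, which is not shown here.

```python
def health_state(sample: float, warning: float = 0.0, critical: float = 25.0) -> str:
    """Map a deadnodemanagerspercent sample to a HealthState value.

    Assumption: a threshold counts as breached only when the sample is
    strictly greater than it. With NumSamples = 1, a single breaching
    sample is taken to be enough to change state.
    """
    if sample > critical:
        return "Error"    # OperationalState ID "Critical"
    if sample > warning:
        return "Warning"  # OperationalState ID "Warning"
    return "Success"      # OperationalState ID "Healthy"


if __name__ == "__main__":
    for value in (0, 10, 25, 40):
        print(value, health_state(value))
```

Under this reading, any nonzero percentage of dead NodeManagers puts the monitor into the Warning state, which matches the AlertOnState setting of Warning in the definition.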