Invalid NodeManagers

Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers (UnitMonitor)

Monitors for the existence of lost and unhealthy NodeManagers.

Knowledge Base article:

Summary

This monitor checks for lost and unhealthy NodeManagers and raises an alert if either category is not empty.

Causes

NodeManagers can fail to execute application tasks for a number of reasons, including improper YARN configuration and infrastructure issues.

Resolutions

HDInsight Appliance

To diagnose the issue:

Connecting remotely to the virtual machine that owns the failed NodeManager is a two-step operation:

To resolve the issue:

HDInsight Azure

To resolve the issue:

Element properties:

Target: Microsoft.HDInsight.ClusterService.Yarn
Parent Monitor: System.Health.PerformanceState
Category: PerformanceHealth
Enabled: True
Alert Generate: True
Alert Severity: MatchMonitorHealth
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers
Remotable: True
Accessibility: Public
Alert Message:
There are NodeManager nodes which are in the invalid state.
There are {1} lost and {2} unhealthy NodeManagers in cluster "{0}".
RunAs: Default
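
The numbered placeholders in the alert message correspond to the alert parameters by position: {0} is AlertParameter1, {1} is AlertParameter2, and {2} is AlertParameter3. The fragment below is a sketch of how such a message is typically declared in a management pack language pack. The ElementID is taken from the AlertMessage attribute in the monitor source; the placement and surrounding elements are an assumption, since the actual language pack section of this management pack is not shown here.

```xml
<!-- Sketch only: assumed to sit inside <LanguagePacks><LanguagePack ID="ENU" ...><DisplayStrings>. -->
<DisplayString ElementID="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage">
  <!-- Alert title -->
  <Name>There are NodeManager nodes which are in the invalid state.</Name>
  <!-- {0} = AlertParameter1 (ClusterName)
       {1} = AlertParameter2 (yarn.numlostnms)
       {2} = AlertParameter3 (yarn.numunhealthynms) -->
  <Description>There are {1} lost and {2} unhealthy NodeManagers in cluster "{0}".</Description>
</DisplayString>
```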

Source Code:

<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers" TypeID="Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers" Target="Microsoft.HDInsight.ClusterService.Yarn" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage">
    <AlertOnState>Warning</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Target/Host/Property[Type="Microsoft.HDInsight.ClusterService.Private"]/ClusterName$</AlertParameter1>
      <AlertParameter2>$Data/Context/Property[@Name='yarn.numlostnms']$</AlertParameter2>
      <AlertParameter3>$Data/Context/Property[@Name='yarn.numunhealthynms']$</AlertParameter3>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Warning" MonitorTypeStateID="Warning" HealthState="Warning"/>
    <OperationalState ID="Critical" MonitorTypeStateID="Critical" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <WarningLostNMCount>1</WarningLostNMCount>
    <CriticalUnhealthyNMCount>1</CriticalUnhealthyNMCount>
  </Configuration>
</UnitMonitor>
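
The values in the Configuration element are the monitor defaults. In most deployments they would be adjusted through an override stored in a separate, unsealed management pack rather than by editing the monitor itself. The fragment below is a minimal sketch under several assumptions: that the monitor type declares IntervalSeconds as an overridable parameter (the source shown here does not confirm this), that the custom pack references the HDInsight management pack under the hypothetical alias HDI, and that the override ID and the value 600 are illustrative only.

```xml
<!-- Sketch only: belongs inside the <Monitoring> section of a custom, unsealed management pack. -->
<!-- Assumes the manifest contains a <Reference Alias="HDI"> pointing at the HDInsight management pack. -->
<Overrides>
  <MonitorConfigurationOverride ID="Custom.HDInsight.Override.YarnInvalidNodeManagers.IntervalSeconds"
                                Context="HDI!Microsoft.HDInsight.ClusterService.Yarn"
                                Enforced="false"
                                Monitor="HDI!Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers"
                                Parameter="IntervalSeconds">
    <!-- Hypothetical value: check every 600 seconds instead of the default 900. -->
    <!-- Kept above TimeoutSeconds (300) so one run can finish before the next starts. -->
    <Value>600</Value>
  </MonitorConfigurationOverride>
</Overrides>
```

The same pattern would apply to WarningLostNMCount and CriticalUnhealthyNMCount if the monitor type exposes them as overridable, for example to tolerate a single lost NodeManager on a large cluster.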