Monitors the existence of lost and unhealthy NodeManagers.
This monitor checks for lost and unhealthy NodeManagers and raises an alert if either category is not empty.
NodeManagers can fail to execute application tasks for a variety of reasons, including improper Yarn configuration as well as infrastructure issues.
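The counts this monitor evaluates correspond to the lost and unhealthy node totals tracked by the Yarn ResourceManager. As a quick manual cross-check, the same numbers are exposed by the ResourceManager REST API; the Python sketch below is a minimal example that assumes the ResourceManager web service is reachable on the default port 8088 and uses a placeholder host name you must replace.

import json
import urllib.request

# Assumption: ResourceManager web service on the head node, default port 8088.
RM_METRICS_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/metrics"

def nodemanager_counts(url=RM_METRICS_URL):
    """Return (lost, unhealthy) NodeManager counts from the Yarn cluster metrics API."""
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)["clusterMetrics"]
    return metrics["lostNodes"], metrics["unhealthyNodes"]

if __name__ == "__main__":
    lost, unhealthy = nodemanager_counts()
    print(f"Lost NodeManagers: {lost}, unhealthy NodeManagers: {unhealthy}")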
HDInsight Appliance
To diagnose the issue:
Connect remotely to the virtual machines that own the failed NodeManagers (the sketch after these steps shows one way to identify the affected nodes). Connecting is a two-step operation:
Use Remote Desktop Connection to log in to the secure node of the HDInsight cluster.
Use another Remote Desktop Connection from the secure node to connect to the virtual machine that owns the failed NodeManager service.
Review the NodeManager logs on that virtual machine. Log files are located at <OS disk>:\hadoop\hadoop-<HDP version>\logs.
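To identify which virtual machines own the failed NodeManagers, and why Yarn flagged them, you can list the LOST and UNHEALTHY nodes together with their health reports. A minimal sketch, again assuming the ResourceManager REST API on the default port 8088 with a placeholder host name:

import json
import urllib.request

# Assumption: ResourceManager web service on the head node, default port 8088.
RM_NODES_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/nodes"

def problem_nodes(url=RM_NODES_URL):
    """List NodeManagers in LOST or UNHEALTHY state, with the health report Yarn recorded."""
    with urllib.request.urlopen(url) as resp:
        nodes = json.load(resp)["nodes"]["node"]
    return [(n["id"], n["state"], n.get("healthReport", ""))
            for n in nodes
            if n["state"] in ("LOST", "UNHEALTHY")]

if __name__ == "__main__":
    for node_id, state, report in problem_nodes():
        print(f"{node_id} [{state}]: {report or 'no health report'}")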
To resolve the issue:
Based on the findings from the diagnostic steps, fix the problems that caused the NodeManager to fail and start it again using the "Start HDInsight Host Component" action available on the task pane. This may include fixing defects in Yarn job code (such as improper runtime error handling). The sketch after these steps shows one way to verify that the restart succeeded.
If the procedure above does not solve the issue, contact the Microsoft Support team and provide them with the alert name and details. Be aware that the diagnostic actions may require administrator permissions on the HDInsight cluster.
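After restarting the NodeManager, one way to confirm that it re-registered with the ResourceManager is to query the nodes API and check its reported state. A minimal sketch, assuming the default ResourceManager port 8088; the host names are placeholders:

import json
import urllib.request

# Placeholders: replace with the ResourceManager host and the worker node that was restarted.
RM_NODES_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/nodes"
WORKER_HOST = "<worker-host>"

def node_state(worker_host=WORKER_HOST, url=RM_NODES_URL):
    """Return the Yarn state reported for the given worker node, or None if it is not registered."""
    with urllib.request.urlopen(url) as resp:
        nodes = json.load(resp)["nodes"]["node"]
    for node in nodes:
        if node["nodeHostName"].lower().startswith(worker_host.lower()):
            return node["state"]
    return None

if __name__ == "__main__":
    print(f"{WORKER_HOST}: {node_state()}")  # Expect RUNNING after a successful restart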
HDInsight Azure
To resolve the issue:
Remotely connect to the worker node virtual machines and inspect the NodeManager logs. Log files are located at <OS disk>:\apps\dist\hadoop-<HDP version>\logs (a log-scanning sketch follows these steps). Connecting to a worker node is a two-step operation:
Use Remote Desktop Connection to log in to the HDInsight cluster on Azure.
Use another Remote Desktop Connection from the HDInsight cluster to connect to the virtual machine that owns the failed NodeManager service (worker node virtual machines are named workernode0, workernode1, and so on).
If the issue persists, contact the Microsoft Support team and provide them with the alert name and details.
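Rather than paging through the log files by hand, you can scan the directory quoted above for recent ERROR and FATAL entries. The sketch below is a minimal example; the directory path (with its drive letter and HDP version) and the yarn-nodemanager file-name pattern are assumptions you should adjust to your cluster.

import glob
import os

# Assumption: substitute the actual OS disk letter and HDP version from the path above.
LOG_DIR = r"C:\apps\dist\hadoop-<HDP version>\logs"

def error_lines(log_dir=LOG_DIR, pattern="*nodemanager*.log", max_lines=50):
    """Collect ERROR/FATAL lines from NodeManager log files in log_dir (last max_lines only)."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path, errors="replace") as f:
            for line in f:
                if " ERROR " in line or " FATAL " in line:
                    hits.append(f"{os.path.basename(path)}: {line.rstrip()}")
    return hits[-max_lines:]

if __name__ == "__main__":
    for hit in error_lines():
        print(hit)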
Target | Microsoft.HDInsight.ClusterService.Yarn
Parent Monitor | System.Health.PerformanceState
Category | PerformanceHealth
Enabled | True
Alert Generate | True
Alert Severity | MatchMonitorHealth
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers
Remotable | True
Accessibility | Public
Alert Message | Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage
RunAs | Default
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers" TypeID="Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers" Target="Microsoft.HDInsight.ClusterService.Yarn" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
<Category>PerformanceHealth</Category>
<AlertSettings AlertMessage="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage">
<AlertOnState>Warning</AlertOnState>
<AutoResolve>true</AutoResolve>
<AlertPriority>Normal</AlertPriority>
<AlertSeverity>MatchMonitorHealth</AlertSeverity>
<AlertParameters>
<AlertParameter1>$Target/Host/Property[Type="Microsoft.HDInsight.ClusterService.Private"]/ClusterName$</AlertParameter1>
<AlertParameter2>$Data/Context/Property[@Name='yarn.numlostnms']$</AlertParameter2>
<AlertParameter3>$Data/Context/Property[@Name='yarn.numunhealthynms']$</AlertParameter3>
</AlertParameters>
</AlertSettings>
<OperationalStates>
<OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
<OperationalState ID="Warning" MonitorTypeStateID="Warning" HealthState="Warning"/>
<OperationalState ID="Critical" MonitorTypeStateID="Critical" HealthState="Error"/>
</OperationalStates>
<Configuration>
<IntervalSeconds>900</IntervalSeconds>
<TimeoutSeconds>300</TimeoutSeconds>
<WarningLostNMCount>1</WarningLostNMCount>
<CriticalUnhealthyNMCount>1</CriticalUnhealthyNMCount>
</Configuration>
</UnitMonitor>
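For reference, the Configuration element above runs the check every 900 seconds and uses WarningLostNMCount and CriticalUnhealthyNMCount as thresholds. The Python sketch below is only an illustration of how those thresholds plausibly map the lost and unhealthy counts onto the monitor's three operational states; the actual evaluation is implemented inside the Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers monitor type and may differ.

def evaluate_state(lost_nms, unhealthy_nms,
                   warning_lost_nm_count=1, critical_unhealthy_nm_count=1):
    """Illustrative (assumed) mapping of NodeManager counts to the monitor's operational states."""
    if unhealthy_nms >= critical_unhealthy_nm_count:
        return "Critical"  # HealthState="Error"
    if lost_nms >= warning_lost_nm_count:
        return "Warning"   # HealthState="Warning"; the monitor alerts starting at this state
    return "Healthy"       # HealthState="Success"

# Example: one lost NodeManager, none unhealthy -> "Warning"
print(evaluate_state(lost_nms=1, unhealthy_nms=0))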