Monitors the existence of lost and unhealthy NodeManagers.
This monitor checks for lost and unhealthy NodeManagers and raises an alert if either category is not empty.
NodeManagers can fail to execute application tasks for a variety of reasons, including improper Yarn configuration as well as infrastructure issues.
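The counts this monitor evaluates correspond to the lost and unhealthy node totals tracked by the Yarn ResourceManager. As a quick manual cross-check, the same numbers are exposed by the ResourceManager REST API; the Python sketch below is a minimal example that assumes the ResourceManager web service is reachable on the default port 8088 and uses a placeholder host name you must replace.

import json
import urllib.request

# Assumption: ResourceManager web service on the head node, default port 8088.
RM_METRICS_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/metrics"

def nodemanager_counts(url=RM_METRICS_URL):
    """Return (lost, unhealthy) NodeManager counts from the Yarn cluster metrics API."""
    with urllib.request.urlopen(url) as resp:
        metrics = json.load(resp)["clusterMetrics"]
    return metrics["lostNodes"], metrics["unhealthyNodes"]

if __name__ == "__main__":
    lost, unhealthy = nodemanager_counts()
    print(f"Lost NodeManagers: {lost}, unhealthy NodeManagers: {unhealthy}")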
HDInsight Appliance
To diagnose the issue:
Connect remotely to the virtual machines that own the failed NodeManagers (the sketch after these steps shows one way to identify the affected nodes). Connecting is a two-step operation:
Use Remote Desktop Connection to log in to the secure node of the HDInsight cluster.
Use another Remote Desktop Connection from the secure node to connect to the virtual machine that owns the failed NodeManager service.
Review the NodeManager logs on that virtual machine. Log files are located at <OS disk>:\hadoop\hadoop-<HDP version>\logs.
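To identify which virtual machines own the failed NodeManagers, and why Yarn flagged them, you can list the LOST and UNHEALTHY nodes together with their health reports. A minimal sketch, again assuming the ResourceManager REST API on the default port 8088 with a placeholder host name:

import json
import urllib.request

# Assumption: ResourceManager web service on the head node, default port 8088.
RM_NODES_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/nodes"

def problem_nodes(url=RM_NODES_URL):
    """List NodeManagers in LOST or UNHEALTHY state, with the health report Yarn recorded."""
    with urllib.request.urlopen(url) as resp:
        nodes = json.load(resp)["nodes"]["node"]
    return [(n["id"], n["state"], n.get("healthReport", ""))
            for n in nodes
            if n["state"] in ("LOST", "UNHEALTHY")]

if __name__ == "__main__":
    for node_id, state, report in problem_nodes():
        print(f"{node_id} [{state}]: {report or 'no health report'}")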
To resolve the issue:
Based on the findings from the diagnostic steps, fix the problems that caused the NodeManager to fail and start it again using the "Start HDInsight Host Component" action available on the task pane. This may include fixing defects in Yarn job code (such as improper runtime error handling). The sketch after these steps shows one way to verify that the restart succeeded.
If the procedure above does not solve the issue, contact the Microsoft Support team and provide them with the alert name and details. Be aware that the diagnostic actions may require administrator permissions on the HDInsight cluster.
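After restarting the NodeManager, one way to confirm that it re-registered with the ResourceManager is to query the nodes API and check its reported state. A minimal sketch, assuming the default ResourceManager port 8088; the host names are placeholders:

import json
import urllib.request

# Placeholders: replace with the ResourceManager host and the worker node that was restarted.
RM_NODES_URL = "http://<resourcemanager-host>:8088/ws/v1/cluster/nodes"
WORKER_HOST = "<worker-host>"

def node_state(worker_host=WORKER_HOST, url=RM_NODES_URL):
    """Return the Yarn state reported for the given worker node, or None if it is not registered."""
    with urllib.request.urlopen(url) as resp:
        nodes = json.load(resp)["nodes"]["node"]
    for node in nodes:
        if node["nodeHostName"].lower().startswith(worker_host.lower()):
            return node["state"]
    return None

if __name__ == "__main__":
    print(f"{WORKER_HOST}: {node_state()}")  # Expect RUNNING after a successful restart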
HDInsight Azure
To resolve the issue:
Remotely connect to the worker node virtual machines and inspect the NodeManager logs. Log files are located at <OS disk>:\apps\dist\hadoop-<HDP version>\logs (a log-scanning sketch follows these steps). Connecting to a worker node is a two-step operation:
Use Remote Desktop Connection to log in to the HDInsight cluster on Azure.
Use another Remote Desktop Connection from the HDInsight cluster to connect to the virtual machine that owns the failed NodeManager service (worker node virtual machines are named workernode0, workernode1, and so on).
If the issue persists, contact the Microsoft Support team and provide them with the alert name and details.
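Rather than paging through the log files by hand, you can scan the directory quoted above for recent ERROR and FATAL entries. The sketch below is a minimal example; the directory path (with its drive letter and HDP version) and the yarn-nodemanager file-name pattern are assumptions you should adjust to your cluster.

import glob
import os

# Assumption: substitute the actual OS disk letter and HDP version from the path above.
LOG_DIR = r"C:\apps\dist\hadoop-<HDP version>\logs"

def error_lines(log_dir=LOG_DIR, pattern="*nodemanager*.log", max_lines=50):
    """Collect ERROR/FATAL lines from NodeManager log files in log_dir (last max_lines only)."""
    hits = []
    for path in sorted(glob.glob(os.path.join(log_dir, pattern))):
        with open(path, errors="replace") as f:
            for line in f:
                if " ERROR " in line or " FATAL " in line:
                    hits.append(f"{os.path.basename(path)}: {line.rstrip()}")
    return hits[-max_lines:]

if __name__ == "__main__":
    for hit in error_lines():
        print(hit)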
Target | Microsoft.HDInsight.ClusterService.Yarn
Parent Monitor | System.Health.PerformanceState
Category | PerformanceHealth
Enabled | True
Alert Generate | True
Alert Severity | MatchMonitorHealth
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers
Remotable | True
Accessibility | Public
Alert Message | Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage
RunAs | Default
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers" TypeID="Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers" Target="Microsoft.HDInsight.ClusterService.Yarn" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
<Category>PerformanceHealth</Category>
<AlertSettings AlertMessage="Microsoft.HDInsight.UnitMonitor.YarnInvalidNodeManagers.AlertMessage">
<AlertOnState>Warning</AlertOnState>
<AutoResolve>true</AutoResolve>
<AlertPriority>Normal</AlertPriority>
<AlertSeverity>MatchMonitorHealth</AlertSeverity>
<AlertParameters>
<AlertParameter1>$Target/Host/Property[Type="Microsoft.HDInsight.ClusterService.Private"]/ClusterName$</AlertParameter1>
<AlertParameter2>$Data/Context/Property[@Name='yarn.numlostnms']$</AlertParameter2>
<AlertParameter3>$Data/Context/Property[@Name='yarn.numunhealthynms']$</AlertParameter3>
</AlertParameters>
</AlertSettings>
<OperationalStates>
<OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
<OperationalState ID="Warning" MonitorTypeStateID="Warning" HealthState="Warning"/>
<OperationalState ID="Critical" MonitorTypeStateID="Critical" HealthState="Error"/>
</OperationalStates>
<Configuration>
<IntervalSeconds>900</IntervalSeconds>
<TimeoutSeconds>300</TimeoutSeconds>
<WarningLostNMCount>1</WarningLostNMCount>
<CriticalUnhealthyNMCount>1</CriticalUnhealthyNMCount>
</Configuration>
</UnitMonitor>
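For reference, the Configuration element above runs the check every 900 seconds and uses WarningLostNMCount and CriticalUnhealthyNMCount as thresholds. The Python sketch below is only an illustration of how those thresholds plausibly map the lost and unhealthy counts onto the monitor's three operational states; the actual evaluation is implemented inside the Microsoft.HDInsight.UnitMonitorType.YarnInvalidNodeManagers monitor type and may differ.

def evaluate_state(lost_nms, unhealthy_nms,
                   warning_lost_nm_count=1, critical_unhealthy_nm_count=1):
    """Illustrative (assumed) mapping of NodeManager counts to the monitor's operational states."""
    if unhealthy_nms >= critical_unhealthy_nm_count:
        return "Critical"  # HealthState="Error"
    if lost_nms >= warning_lost_nm_count:
        return "Warning"   # HealthState="Warning"; the monitor alerts starting at this state
    return "Healthy"       # HealthState="Success"

# Example: one lost NodeManager, none unhealthy -> "Warning"
print(evaluate_state(lost_nms=1, unhealthy_nms=0))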