Monitors the health state of the NodeManager process.
This monitor checks the health state of the NodeManager process on the host. When NodeManager is down on a particular host, the cluster loses that host's processing capacity, which reduces parallelism and degrades overall system performance. If a critical number of NodeManagers are down, processing times become unacceptable or applications cannot be processed at all.
HDInsight Appliance
The monitor is active and reports the actual component state.
HDInsight Azure
This monitor is not available in HDInsight clusters on Azure, so the diagnostic and resolution steps below do not apply to that environment.
The NodeManager service can go offline for several reasons:
A maintenance action performed by the HDInsight cluster administrator is in progress. Consider switching the cluster to maintenance mode to avoid alerts during regular maintenance procedures.
ResourceManager has been down for an extended period of time.
There are problems with the physical host machine or the virtual machine that hosts the failed NodeManager node.
If NodeManager was not stopped on purpose, use the following steps to diagnose the issue:
Check whether ResourceManager is running. If it is not, go to the corresponding monitor and follow its KB instructions to resolve that issue.
Review the NodeManager logs by connecting remotely to the virtual machine that hosts the failed component. Log files are located at <OS disk>:\hadoop\hadoop-<HDP version>\logs.
Connecting remotely to the virtual machine that hosts the failed NodeManager is a two-step operation:
Use Remote Desktop Connection to log in to the secure node of the HDInsight cluster.
Use another Remote Desktop Connection from the secure node to connect to the target virtual machine.
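Once connected to the target virtual machine, a first pass over the log directory can be automated. The following is a minimal Python sketch and is not part of the management pack. It assumes that Python is available on the machine, that the log files are plain text with a `.log` extension, and that severe entries carry the usual log4j level tokens `ERROR` or `FATAL`. The function name `find_severe_lines` and the default filter on file names containing "nodemanager" are hypothetical choices made for illustration.

```python
from pathlib import Path

# Assumption: the Hadoop daemons log through log4j, so severe entries
# contain one of these level tokens surrounded by spaces.
SEVERE_LEVELS = (" ERROR ", " FATAL ")


def find_severe_lines(log_dir, name_filter="nodemanager"):
    """Return (file name, line number, text) for every ERROR/FATAL line.

    Only *.log files whose name contains name_filter (case-insensitive)
    are scanned. The filter is a guess at how NodeManager logs are named.
    """
    hits = []
    for path in sorted(Path(log_dir).glob("*.log")):
        if name_filter.lower() not in path.name.lower():
            continue
        # errors="replace" keeps the scan going if a log contains odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(level in line for level in SEVERE_LEVELS):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Call `find_severe_lines` with the actual log directory from the step above and start reading at the last few entries it returns, since the most recent severe messages are the most likely to explain why the component stopped.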
To resolve the issue:
Based on the findings from the diagnostic steps, fix all problems that caused NodeManager to fail, then start it again using the Start HDInsight Host Component action available in the Tasks pane.
If the procedure above does not solve the issue, contact the Microsoft Support team and provide the alert name and details. This monitor does not have its own alert, but it triggers the alert of the "NodeManagers down" Yarn monitor when a significant number of NodeManagers are down. Be aware that the diagnostic actions may require administrator permissions on the HDInsight cluster.
Target | Microsoft.HDInsight.HostComponent.NodeManager |
Parent Monitor | System.Health.AvailabilityState |
Category | AvailabilityHealth |
Enabled | True |
Alert Generate | False |
Alert Auto Resolve | True |
Monitor Type | Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState |
Remotable | True |
Accessibility | Public |
RunAs | Default |
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.NodeManagerComponentHealthState" TypeID="Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState" Target="Microsoft.HDInsight.HostComponent.NodeManager" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Warning"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
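The definition above can be summarized programmatically to see how the monitor behaves. The following is a minimal Python sketch using the standard `xml.etree.ElementTree` module. The helper name `summarize_monitor` is hypothetical, and the embedded fragment is an abbreviated copy of the definition that keeps only the elements being read. The sketch assumes that `IntervalSeconds` is the time between two checks and that `TimeoutSeconds` limits a single check, which is how these element names are conventionally used.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the unit monitor definition; only the parts read below are kept.
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.NodeManagerComponentHealthState" Enabled="true">
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Warning"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text):
    """Return the state mapping and timing settings of a unit monitor fragment."""
    root = ET.fromstring(xml_text)
    # Map each monitor type state to the health state it produces.
    states = {
        state.get("MonitorTypeStateID"): state.get("HealthState")
        for state in root.iter("OperationalState")
    }
    return {
        "states": states,
        "interval_minutes": int(root.findtext("Configuration/IntervalSeconds")) / 60,
        "timeout_minutes": int(root.findtext("Configuration/TimeoutSeconds")) / 60,
    }


summary = summarize_monitor(MONITOR_XML)
print(summary["states"])            # {'Healthy': 'Success', 'Unhealthy': 'Warning'}
print(summary["interval_minutes"])  # 15.0
print(summary["timeout_minutes"])   # 5.0
```

Under those assumptions, an unhealthy NodeManager puts the monitor into the Warning state rather than Critical, and a failure may take up to about 15 minutes to show in the health state.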