Monitors the percentage of dead NodeManagers in the cluster.
This monitor checks the percentage of dead NodeManagers in the cluster. NodeManagers can go offline for various reasons: network failures, problems with the virtual or physical host machine, or a malfunction of the NodeManager component itself. Although a single NodeManager can fail from time to time, NodeManagers failing in groups indicate a more serious issue in the cluster infrastructure.
Default monitor thresholds:
Warning: when the percentage of live NodeManagers is less than 100% but greater than or equal to 75%.
Error: when the percentage of live NodeManagers is less than 75%.
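The thresholds above can be sketched as a small classification function. This is a minimal illustration, not the management pack's implementation: the function name `classify_dead_percent` is hypothetical, and the default values 0 and 25 are the dead-NodeManager equivalents of the live percentages listed above (100% live corresponds to 0% dead, 75% live to 25% dead).

```python
def classify_dead_percent(dead_percent: float,
                          warning_threshold: float = 0.0,
                          critical_threshold: float = 25.0) -> str:
    """Map a dead-NodeManager percentage to a health state.

    Hypothetical helper for illustration only. The defaults mirror the
    documented thresholds expressed as dead percentages.
    """
    if not 0.0 <= dead_percent <= 100.0:
        raise ValueError("dead_percent must be between 0 and 100")
    # More than 25% dead means fewer than 75% live: Error state.
    if dead_percent > critical_threshold:
        return "Critical"
    # Any dead NodeManager means fewer than 100% live: Warning state.
    if dead_percent > warning_threshold:
        return "Warning"
    return "Healthy"


# Example: 2 dead NodeManagers out of 16 is 12.5% dead (87.5% live).
state = classify_dead_percent(100.0 * 2 / 16)
```

Under these assumptions, a cluster with exactly 75% live NodeManagers is still in the Warning state, and it moves to Error only when the live share drops below 75%.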
The NodeManager service can go offline for various reasons:
A maintenance action performed by the HDInsight cluster administrator is in progress. Consider switching your cluster to maintenance mode to avoid alerts during regular maintenance procedures.
ResourceManager has been down for an extended period of time.
There are problems with the physical host machine or virtual machine that owns the failed NodeManager.
HDInsight Appliance
To diagnose the issue:
Check whether ResourceManager is running. If it is not running, go to the corresponding monitor and follow its KB instructions to resolve that issue.
Use the Cluster diagram to identify the cluster node(s) where NodeManagers are not running, and review their logs by connecting remotely to the virtual machine that owns the failed component. Log files are located at <OS disk>:\hadoop\hadoop-<HDP version>\logs.
Connecting remotely to the virtual machine that owns the failed NodeManager is a two-step operation:
Use Remote Desktop Connection to log in to the secure node of the HDInsight cluster.
Use another Remote Desktop Connection from the secure node to connect to the target virtual machine.
If you are unable to diagnose the issue, contact the Microsoft Support team and provide the alert details. Be aware that diagnostic actions may require administrator permissions on the HDInsight cluster.
To resolve the issue:
Based on the findings from the diagnosis steps, fix all problems that caused the NodeManagers to fail, and start them again using the "Start HDInsight Host Component" action available in the Tasks pane.
If the procedure above doesn’t solve the issue, contact the Microsoft Support team and provide the alert name and details.
HDInsight Azure
To resolve the issue:
Connect remotely to the cluster and check whether ResourceManager is running. Start the service if it is stopped (run "View local services" and look for the "Apache Hadoop resourcemanager" service).
Connect remotely to the worker node virtual machines and check whether the NodeManager service is running. Start the service if it is stopped (run "View local services" and look for the "Apache Hadoop nodemanager" service). Connecting to worker nodes is a two-step operation:
Use Remote Desktop Connection to log in to the HDInsight cluster on Azure.
Use another Remote Desktop Connection from the HDInsight cluster to connect to the virtual machine that owns the failed nodemanager service (worker node virtual machines are named workernode0, workernode1, and so on).
If the procedure above doesn’t help, contact the Microsoft Support team and provide the alert name and details.
Target | Microsoft.HDInsight.ClusterService.Yarn
Parent Monitor | System.Health.AvailabilityState
Category | AvailabilityHealth
Enabled | True
Alert Generate | True
Alert Severity | MatchMonitorHealth
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HDInsight.UnitMonitorType.YarnServiceThreeStateConsecutiveThreshold
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.YarnDeadNodeManagers" TypeID="Microsoft.HDInsight.UnitMonitorType.YarnServiceThreeStateConsecutiveThreshold" Target="Microsoft.HDInsight.ClusterService.Yarn" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HDInsight.UnitMonitor.YarnDeadNodeManagers.AlertMessage">
    <AlertOnState>Warning</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>MatchMonitorHealth</AlertSeverity>
    <AlertParameters>
      <AlertParameter1>$Target/Host/Property[Type="Microsoft.HDInsight.ClusterService.Private"]/ClusterName$</AlertParameter1>
      <AlertParameter2>$Data/Context/SampleValue$</AlertParameter2>
    </AlertParameters>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Warning" MonitorTypeStateID="Warning" HealthState="Warning"/>
    <OperationalState ID="Critical" MonitorTypeStateID="Critical" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <PropertyName>deadnodemanagerspercent</PropertyName>
    <WarningThreshold>0</WarningThreshold>
    <CriticalThreshold>25</CriticalThreshold>
    <NumSamples>1</NumSamples>
  </Configuration>
</UnitMonitor>
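As an illustration of how the Configuration values in this monitor definition might be inspected, the following Python sketch parses a copy of the Configuration element using only the standard library. The helper name `read_monitor_config` is hypothetical and not part of the management pack; the values are copied from the definition above.

```python
import xml.etree.ElementTree as ET

# Copy of the <Configuration> element from the monitor definition.
CONFIG_XML = """
<Configuration>
  <IntervalSeconds>900</IntervalSeconds>
  <TimeoutSeconds>300</TimeoutSeconds>
  <PropertyName>deadnodemanagerspercent</PropertyName>
  <WarningThreshold>0</WarningThreshold>
  <CriticalThreshold>25</CriticalThreshold>
  <NumSamples>1</NumSamples>
</Configuration>
"""


def read_monitor_config(xml_text: str) -> dict:
    """Return the child elements of <Configuration> as a tag-to-text dict.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text.strip())
    return {child.tag: child.text for child in root}


config = read_monitor_config(CONFIG_XML)

# The monitor samples the property every 900 seconds, i.e. every 15 minutes.
interval_minutes = int(config["IntervalSeconds"]) / 60
```

Read this way, the configuration shows that the monitor evaluates the `deadnodemanagerspercent` property every 15 minutes against a warning threshold of 0 and a critical threshold of 25, with `NumSamples` set to 1.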