Monitors the health state of the DataNode process.
This monitor checks the health state of the DataNode process on the host virtual machine. A DataNode that is down on a host reduces data availability and overall system performance, and it increases the number of under-replicated blocks in the system. If a critical number of DataNodes are down, some data blocks can become completely unavailable to the rest of the system and to end users.
HDInsight Appliance
The monitor is active and reports the actual component state.
HDInsight Azure
This monitor is not available in HDInsight clusters on Azure, so the diagnostic and resolution steps below do not apply to this type of environment.
The DataNode service can go offline for various reasons:
A maintenance action is in progress, performed by the HDInsight cluster administrator. Consider switching your cluster to maintenance mode to avoid alerts during regular maintenance procedures.
The NameNode is down when the DataNode is restarted. In this case, the DataNode fails at startup and ends up in the stopped state.
There are problems with the physical host machine or the virtual machine that owns the failed DataNode.
There are issues with the physical or virtual disk hosting the worker node OS.
There are issues with the InfiniBand network adapters.
If the DataNode was not stopped on purpose, use the following steps to diagnose the issue:
Check whether the NameNode is running. If it is not, go to the corresponding monitor and follow its KB instructions to resolve that issue.
Review the DataNode logs by connecting remotely to the virtual machine that owns the failed component. Log files are located at <OS disk>:\hadoop\hadoop-<HDP version>\logs.
Connecting remotely to the virtual machine that owns the failed DataNode is a two-step operation:
Use Remote Desktop Connection to log in to the secure node of the HDInsight cluster.
Use another Remote Desktop Connection from the secure node to connect to the target virtual machine.
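Once connected to the target virtual machine, the log review can be narrowed to the lines that matter. The following is a minimal sketch, not part of the management pack: the function name `find_datanode_errors` is hypothetical, and it assumes that DataNode log file names contain "datanode" and that entries use the common log4j layout of date, time, level, logger. Adjust both assumptions to match the files actually present in the logs folder.

```python
import re
from pathlib import Path

# Assumption: each entry starts with "<date> <time> <LEVEL> ...", as in the
# common log4j layout. Only ERROR and FATAL entries are reported.
LEVEL_PATTERN = re.compile(r"^\S+ \S+ (ERROR|FATAL) ")


def find_datanode_errors(log_dir, name_fragment="datanode"):
    """Return (file name, line number, text) for ERROR/FATAL lines.

    Only files directly inside log_dir whose name contains name_fragment
    (case-insensitive) are scanned. Both defaults are assumptions.
    """
    hits = []
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file() or name_fragment not in path.name.lower():
            continue
        # errors="replace" keeps the scan going if a log has odd bytes.
        with path.open(encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if LEVEL_PATTERN.match(line):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Call the function with the logs folder named in the step above; it returns an empty list when no matching lines are found, which suggests looking at host-level causes (disk, network adapters) instead.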
To resolve the issue:
Based on the findings from the diagnostic steps, fix all problems that caused the DataNode to fail, then start it again using the Start HDInsight Host Component action available in the Tasks pane.
If the procedure above doesn't solve the issue, contact the Microsoft Support team and provide the alert name and details. This monitor doesn't have its own alert, but it triggers the alert for the "DataNodes Down" HDFS monitor when a significant number of nodes are down.
Target | Microsoft.HDInsight.HostComponent.DataNode |
Parent Monitor | System.Health.AvailabilityState |
Category | AvailabilityHealth |
Enabled | True |
Alert Generate | False |
Alert Auto Resolve | True |
Monitor Type | Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState |
Remotable | True |
Accessibility | Public |
RunAs | Default |
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.DataNodeComponentHealthState" TypeID="Microsoft.HDInsight.UnitMonitorType.HostComponentHealthState" Target="Microsoft.HDInsight.HostComponent.DataNode" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" Accessibility="Public" Enabled="true" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Warning"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
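To read the definition above at a glance, a short script can pull out the state mapping and timing values. This is an illustrative sketch only: `summarize_monitor` is a hypothetical helper, the embedded XML is a trimmed copy of the definition (several attributes omitted for brevity), and reading `IntervalSeconds` as the time between checks is an assumption based on the element name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the monitor definition; omitted attributes are not needed here.
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HDInsight.UnitMonitor.DataNodeComponentHealthState"
             Target="Microsoft.HDInsight.HostComponent.DataNode" Enabled="true">
  <Category>AvailabilityHealth</Category>
  <OperationalStates>
    <OperationalState ID="Healthy" MonitorTypeStateID="Healthy" HealthState="Success"/>
    <OperationalState ID="Unhealthy" MonitorTypeStateID="Unhealthy" HealthState="Warning"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>900</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text):
    """Extract the ID, state-to-health mapping, and timing values."""
    root = ET.fromstring(xml_text)
    states = {
        state.get("ID"): state.get("HealthState")
        for state in root.iter("OperationalState")
    }
    config = root.find("Configuration")
    return {
        "id": root.get("ID"),
        "states": states,
        "interval_seconds": int(config.findtext("IntervalSeconds")),
        "timeout_seconds": int(config.findtext("TimeoutSeconds")),
    }


summary = summarize_monitor(MONITOR_XML)
print(summary["states"])
print(summary["interval_seconds"] // 60, "minutes between checks (assumed meaning)")
```

The summary makes two points visible: an unhealthy DataNode maps to a Warning state rather than an Error state, and the configured interval of 900 seconds corresponds to 15 minutes.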