This monitor checks the NetworkDirect registration with the Windows operating system. The monitor will enter the Critical state if:
• NetworkDirect is disabled on an RDMA-capable network when NetworkDirect-capable drivers have been installed on the cluster.
• NetworkDirect is enabled on a non-RDMA-capable network.
NetworkDirect is a Remote Direct Memory Access (RDMA) networking interface that offers low latency and high throughput performance for Message Passing Interface (MPI) traffic, but the use of this interface requires RDMA-capable networking hardware and drivers. The usual configuration is to enable NetworkDirect on the application network for MPI traffic.
If NetworkDirect is disabled on an RDMA-capable network, the most likely cause is that the NetworkDirect driver (also called a “provider”) has not been registered with the Windows operating system. Registration adds the provider to the Winsock Catalog.
If NetworkDirect is enabled on a non-RDMA-capable network, the network cards in the system were most likely replaced with non-RDMA-capable cards after the drivers for the previous cards had been registered with the Winsock Catalog.
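The two Critical conditions described above can be sketched as a small decision function. This is only an illustration of the documented logic, not the monitor's actual implementation; the `NetworkStatus` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NetworkStatus:
    """Hypothetical snapshot of one cluster network (field names are illustrative)."""
    rdma_capable: bool          # the network hardware supports RDMA
    nd_drivers_installed: bool  # NetworkDirect-capable drivers are installed on the cluster
    nd_enabled: bool            # NetworkDirect is currently enabled on this network


def monitor_state(status: NetworkStatus) -> str:
    """Return "Error" for either documented Critical condition, otherwise "Success"."""
    # Condition 1: RDMA-capable network with drivers installed, but NetworkDirect disabled.
    if status.rdma_capable and status.nd_drivers_installed and not status.nd_enabled:
        return "Error"
    # Condition 2: NetworkDirect enabled on a network that is not RDMA-capable.
    if status.nd_enabled and not status.rdma_capable:
        return "Error"
    return "Success"
```

For example, `monitor_state(NetworkStatus(rdma_capable=False, nd_drivers_installed=False, nd_enabled=True))` falls into the second condition.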
The registration of the NetworkDirect provider with the operating system is specific to each hardware vendor: some vendors include this step in their driver installer (.msi), and others require a separate step. The InfiniBand providers accomplish this step using a separate utility that can register (ndinstall -i), de-register (ndinstall -r), and list (ndinstall -l) the networking providers on a system. For detailed instructions about using NetworkDirect-enabled drivers, refer to the instructions provided by your hardware vendor.
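As a rough sketch, the three ndinstall operations mentioned above could be wrapped in a script like the following. It assumes that `ndinstall` is on the PATH of the node and that it takes the flags exactly as listed in the text; the action names and helper functions are hypothetical, and your vendor's utility may behave differently.

```python
import subprocess
from typing import List

# Flags as listed in the text above; the action names are illustrative only.
NDINSTALL_FLAGS = {"register": "-i", "deregister": "-r", "list": "-l"}


def ndinstall_command(action: str) -> List[str]:
    """Build the ndinstall command line for one of the documented actions."""
    if action not in NDINSTALL_FLAGS:
        raise ValueError(f"unknown action: {action!r}")
    return ["ndinstall", NDINSTALL_FLAGS[action]]


def run_ndinstall(action: str) -> str:
    """Run ndinstall and return its standard output (assumes ndinstall is on PATH)."""
    result = subprocess.run(
        ndinstall_command(action), capture_output=True, text=True, check=True
    )
    return result.stdout
```

Calling `run_ndinstall("list")` before and after registration is one way to confirm that the provider now appears among the registered providers.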
InfiniBand device drivers with NetworkDirect support can be installed and registered at the same time that you deploy the compute nodes in your HPC cluster. This is accomplished by deploying the compute nodes using a node template that has been specially configured for this purpose. For more information, see http://go.microsoft.com/fwlink/?LinkId=130612.
Target | Microsoft.HPC.2008.HeadNode.Network
Parent Monitor | System.Health.ConfigurationState
Category | ConfigurationHealth
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HPC.2008.MonitorType.HeadNode.NetworkDirect
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.HeadNode.Network" ParentMonitorID="Health!System.Health.ConfigurationState" Remotable="true" Priority="Normal" TypeID="Microsoft.HPC.2008.MonitorType.HeadNode.NetworkDirect" ConfirmDelivery="false">
  <Category>ConfigurationHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Success" MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState ID="Error" MonitorTypeStateID="Error" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>300</IntervalSeconds>
    <SyncTime/>
    <TimeoutSeconds>300</TimeoutSeconds>
    <ClusterName>$Target/Host/Property[Type="Microsoft.HPC.2008.HeadNode"]/ClusterName$</ClusterName>
    <NetType>$Target/Property[Type="Microsoft.HPC.2008.HeadNode.Network"]/NetworkType$</NetType>
  </Configuration>
</UnitMonitor>
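To check the schedule and alert settings of this monitor programmatically, the definition can be read with a standard XML parser. The following is a minimal sketch that uses an abbreviated copy of the definition above (most attributes and elements are trimmed for brevity); the `summarize_monitor` helper is hypothetical and not part of the management pack.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the UnitMonitor definition (trimmed for brevity).
MONITOR_XML = """
<UnitMonitor ID="Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect" Enabled="true">
  <AlertSettings>
    <AutoResolve>true</AutoResolve>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <Configuration>
    <IntervalSeconds>300</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
  </Configuration>
</UnitMonitor>
"""


def summarize_monitor(xml_text: str) -> dict:
    """Extract the monitor ID, enabled flag, schedule, and alert severity."""
    root = ET.fromstring(xml_text)
    return {
        "id": root.get("ID"),
        "enabled": root.get("Enabled") == "true",
        "interval_seconds": int(root.findtext("Configuration/IntervalSeconds")),
        "timeout_seconds": int(root.findtext("Configuration/TimeoutSeconds")),
        "alert_severity": root.findtext("AlertSettings/AlertSeverity"),
    }


summary = summarize_monitor(MONITOR_XML)
```

With the values shown in the definition, the summary reports that the check runs every 300 seconds with a 300-second timeout and raises alerts with Error severity.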