Network Direct Configuration

Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect (UnitMonitor)

Knowledge Base article:

Summary

This monitor checks the NetworkDirect registration with the Windows operating system. The monitor enters the Critical state if either of the following is true:

• NetworkDirect is disabled on an RDMA-capable network even though NetworkDirect-capable drivers have been installed on the cluster.

• NetworkDirect is enabled on a non-RDMA-capable network.

NetworkDirect is a Remote Direct Memory Access (RDMA) networking interface that offers low latency and high throughput performance for Message Passing Interface (MPI) traffic, but the use of this interface requires RDMA-capable networking hardware and drivers. The usual configuration is to enable NetworkDirect on the application network for MPI traffic.
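
The two Critical conditions listed in this summary can be restated as a small decision function. This is only an illustrative sketch: the function name, the parameter names, and the `HealthState` enum are hypothetical and are not part of the management pack, which evaluates the state through its own monitor type.

```python
from enum import Enum


class HealthState(Enum):
    # Values mirror the OperationalState IDs used by the monitor definition.
    SUCCESS = "Success"
    ERROR = "Error"  # shown as Critical in the console


def network_direct_health(rdma_capable: bool,
                          nd_enabled: bool,
                          nd_drivers_installed: bool) -> HealthState:
    """Hypothetical restatement of the two Critical conditions in the summary."""
    # Condition 1: NetworkDirect is disabled on an RDMA-capable network
    # although NetworkDirect-capable drivers are installed.
    if rdma_capable and nd_drivers_installed and not nd_enabled:
        return HealthState.ERROR
    # Condition 2: NetworkDirect is enabled on a network that is not RDMA-capable.
    if nd_enabled and not rdma_capable:
        return HealthState.ERROR
    return HealthState.SUCCESS
```

Under these assumptions, an RDMA-capable network without NetworkDirect-capable drivers installed stays healthy while NetworkDirect is disabled, because neither condition applies.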

Causes

If NetworkDirect is disabled on an RDMA-capable network, the most likely cause is that the NetworkDirect driver (also called a “provider”) has not been registered with the Windows operating system. Registration adds the provider to the Winsock Catalog.

If NetworkDirect is enabled on a non-RDMA-capable network, the network adapters in a system were likely replaced with non-RDMA-capable adapters after the drivers for the previous adapters were registered with the Winsock Catalog.

Resolutions

The registration of the NetworkDirect provider with the operating system is specific to each hardware vendor: some vendors include this step in their driver installer (.msi), and others require a separate step. The InfiniBand providers perform this step with a separate utility that can register (ndinstall -i), de-register (ndinstall -r), and list (ndinstall -l) the networking providers on a system. For detailed instructions about using NetworkDirect-enabled drivers, refer to the instructions provided by your hardware vendor.
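
The three ndinstall operations named above could be wrapped in a small helper when scripting checks across nodes. This is a sketch under several assumptions: the helper names are hypothetical, ndinstall is assumed to be on the PATH of the node where the script runs, and the vendor's build of the utility is assumed to accept the flags exactly as written here (some versions may expect additional arguments, such as a provider name).

```python
import subprocess
from typing import List

# Flags as listed in the resolution text; nothing else is assumed about the utility.
NDINSTALL_FLAGS = {
    "register": "-i",
    "deregister": "-r",
    "list": "-l",
}


def ndinstall_command(action: str) -> List[str]:
    """Build the ndinstall argument list for a named action (hypothetical helper)."""
    if action not in NDINSTALL_FLAGS:
        raise ValueError(f"unknown ndinstall action: {action!r}")
    return ["ndinstall", NDINSTALL_FLAGS[action]]


def run_ndinstall(action: str) -> str:
    """Run ndinstall and return its standard output.

    Assumes ndinstall is on PATH; raises CalledProcessError on a non-zero exit code.
    """
    result = subprocess.run(
        ndinstall_command(action),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

For example, `run_ndinstall("list")` would return the provider listing so that you can confirm whether the NetworkDirect provider appears before or after a registration step.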

External

InfiniBand device drivers with NetworkDirect support can be installed and registered at the same time that you deploy the compute nodes in your HPC cluster. This is accomplished by deploying the compute nodes using a node template that has been specially configured for this purpose. For more information, see http://go.microsoft.com/fwlink/?LinkId=130612.

Element properties:

Target: Microsoft.HPC.2008.HeadNode.Network
Parent Monitor: System.Health.ConfigurationState
Category: ConfigurationHealth
Enabled: True
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HPC.2008.MonitorType.HeadNode.NetworkDirect
Remotable: True
Accessibility: Public
Alert Message: Network Direct Configuration is not correct
RunAs: Default

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008.HeadNode.Network" ParentMonitorID="Health!System.Health.ConfigurationState" Remotable="true" Priority="Normal" TypeID="Microsoft.HPC.2008.MonitorType.HeadNode.NetworkDirect" ConfirmDelivery="false">
  <Category>ConfigurationHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008.Monitors.HeadNode.Network.NetworkDirect_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="Success" MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState ID="Error" MonitorTypeStateID="Error" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>300</IntervalSeconds>
    <SyncTime/>
    <TimeoutSeconds>300</TimeoutSeconds>
    <ClusterName>$Target/Host/Property[Type="Microsoft.HPC.2008.HeadNode"]/ClusterName$</ClusterName>
    <NetType>$Target/Property[Type="Microsoft.HPC.2008.HeadNode.Network"]/NetworkType$</NetType>
  </Configuration>
</UnitMonitor>