Cluster Network Usage

Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage (UnitMonitor)

Cluster Network Usage performance monitor for HPC 2008 R2 cluster

Knowledge Base article:

Summary

This monitor tracks the aggregate network usage of the entire cluster. The value is calculated by summing the network throughput (in bytes/sec) of all nodes.

Configuration

This monitor is disabled by default. You can enable it and configure the network usage threshold for Healthy, Warning and Critical states. (For information about how to override the threshold values, see the management pack guide.)
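As a rough illustration of the overrides described above, the following fragment sketches how the monitor might be enabled and its thresholds changed from an unsealed override management pack. Several names here are assumptions rather than values taken from this page: the reference alias `HPC`, the override IDs, and the threshold values are all hypothetical. The fragment also assumes that the monitor type exposes `LowThreshold` and `HighThreshold` as overridable parameters, which the management pack guide should confirm.

```xml
<!-- Sketch only. Goes inside <Monitoring> of an unsealed override management pack. -->
<!-- "HPC" is a hypothetical alias for a <Reference> to the sealed HPC 2008 R2 management pack. -->
<Overrides>
  <!-- Enable the monitor, which is disabled by default. -->
  <MonitorPropertyOverride ID="Example.Override.ClusterNetworkUsage.Enabled"
                           Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                           Enforced="false"
                           Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                           Property="Enabled">
    <Value>true</Value>
  </MonitorPropertyOverride>
  <!-- Illustrative threshold in bytes/sec. Choose a value that suits your network. -->
  <MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.LowThreshold"
                                Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                                Enforced="false"
                                Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                                Parameter="LowThreshold">
    <Value>100000000</Value>
  </MonitorConfigurationOverride>
  <!-- Illustrative threshold in bytes/sec. Must be higher than LowThreshold. -->
  <MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.HighThreshold"
                                Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                                Enforced="false"
                                Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                                Parameter="HighThreshold">
    <Value>200000000</Value>
  </MonitorConfigurationOverride>
</Overrides>
```

The same overrides can usually be created from the Operations console instead of by hand-editing XML. In either case, the thresholds should be sized against the aggregate capacity of the cluster network, because the monitored value is the sum across all nodes.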

Causes

Sustained high network usage is usually caused by jobs that send an excessive number of messages between nodes. If network throughput is close to the maximum capacity of the network, jobs (especially MPI jobs) may take longer to complete.

Resolutions

Use job policies to manage the load. You can also increase the capacity of the cluster by using a network with higher throughput.

Element properties:

Target: Microsoft.HPC.2008R2.ActiveHeadNode
Parent Monitor: System.Health.PerformanceState
Category: PerformanceHealth
Enabled: False
Alert Generate: True
Alert Severity: Error
Alert Priority: Normal
Alert Auto Resolve: True
Monitor Type: Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates
Remotable: True
Accessibility: Public
Alert Message:
Cluster Network Usage has exceeded the upper threshold
Please see the alert context for details.
RunAs: Microsoft.HPC.RunAsProfile.AdminActionAccount

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Accessibility="Public" Enabled="false" Target="Microsoft.HPC.2008R2.ActiveHeadNode" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates" ConfirmDelivery="true">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateId9d63df9cb26445559c2b1481cb25b88b" MonitorTypeStateID="Low" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateIdac9c64dbc0d4489e81f65c05188285f6" MonitorTypeStateID="Medium" HealthState="Warning"/>
    <OperationalState ID="UIGeneratedOpStateIdcf10be749c164e8eb8278a4387a1a439" MonitorTypeStateID="High" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <LowThreshold>10000000000000000</LowThreshold>
    <HighThreshold>20000000000000000</HighThreshold>
    <TimeoutSeconds>300</TimeoutSeconds>
    <IntervalSeconds>300</IntervalSeconds>
    <ClusterName>$Target/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$</ClusterName>
    <MetricName>HPCClusterNetwork</MetricName>
  </Configuration>
</UnitMonitor>