Cluster Network Usage - Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage (UnitMonitor)

Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage (UnitMonitor)

Cluster Network Usage performance monitor for HPC 2008 R2 cluster

Knowledge Base article:

Summary

This monitor tracks the aggregate network usage of the whole cluster. The value is calculated by summing the network throughput (in bytes/sec) of all nodes.

Configuration

This monitor is disabled by default. You can enable it and configure the network usage threshold for Healthy, Warning and Critical states. (For information about how to override the threshold values, see the management pack guide.)
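As an illustration of the configuration described above, the following is a minimal sketch of override elements that enable the monitor and lower its thresholds. It assumes an unsealed override management pack that references the HPC management pack under the hypothetical alias `HPC`. The override IDs and threshold values are placeholders, and it assumes the monitor type exposes `LowThreshold` and `HighThreshold` as overridable parameters. Consult the management pack guide for the supported procedure.

```xml
<!-- Sketch only: place inside <Monitoring><Overrides> of an unsealed override management pack. -->
<!-- "HPC" is a hypothetical reference alias; the override IDs below are also hypothetical. -->

<!-- Enable the monitor, which is disabled by default. -->
<MonitorPropertyOverride ID="Example.Override.ClusterNetworkUsage.Enabled"
                         Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                         Enforced="false"
                         Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                         Property="Enabled">
  <Value>true</Value>
</MonitorPropertyOverride>

<!-- Placeholder lower threshold, assuming the metric is reported in bytes/sec. -->
<MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.LowThreshold"
                              Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                              Enforced="false"
                              Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                              Parameter="LowThreshold">
  <Value>500000000</Value>
</MonitorConfigurationOverride>

<!-- Placeholder upper threshold, assuming the metric is reported in bytes/sec. -->
<MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.HighThreshold"
                              Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode"
                              Enforced="false"
                              Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage"
                              Parameter="HighThreshold">
  <Value>1000000000</Value>
</MonitorConfigurationOverride>
```

Choose threshold values that reflect the aggregate capacity of your own cluster network. The numbers here are illustrative and are not recommendations.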

Causes

Sustained high network usage is usually caused by jobs that send an excessive number of messages between nodes. If network throughput approaches the maximum network capacity, jobs, especially MPI jobs, may take longer to complete.

Resolutions

Use job policies to manage the load. You can also increase the capacity of the cluster by using a network with greater throughput.

Element properties:

Target

Microsoft.HPC.2008R2.ActiveHeadNode

Parent Monitor

System.Health.PerformanceState

Category

PerformanceHealth

Enabled

False

Alert Generate

True

Alert Severity

Error

Alert Priority

Normal

Alert Auto Resolve

True

Monitor Type

Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates

Remotable

True

Accessibility

Public

Alert Message

Cluster Network Usage has exceeded the upper threshold

Please see the alert context for details.

RunAs

Microsoft.HPC.RunAsProfile.AdminActionAccount

Source Code:

<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Accessibility="Public" Enabled="false" Target="Microsoft.HPC.2008R2.ActiveHeadNode" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates" ConfirmDelivery="true">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateId9d63df9cb26445559c2b1481cb25b88b" MonitorTypeStateID="Low" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateIdac9c64dbc0d4489e81f65c05188285f6" MonitorTypeStateID="Medium" HealthState="Warning"/>
    <OperationalState ID="UIGeneratedOpStateIdcf10be749c164e8eb8278a4387a1a439" MonitorTypeStateID="High" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <LowThreshold>10000000000000000</LowThreshold>
    <HighThreshold>20000000000000000</HighThreshold>
    <TimeoutSeconds>300</TimeoutSeconds>
    <IntervalSeconds>300</IntervalSeconds>
    <ClusterName>$Target/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$</ClusterName>
    <MetricName>HPCClusterNetwork</MetricName>
  </Configuration>
</UnitMonitor>