Cluster Network Usage performance monitor for HPC 2008 R2 cluster
This monitor tracks the aggregate network usage of the whole cluster. The value is the sum of the network throughput values (in bytes/sec) of all nodes.
This monitor is disabled by default. You can enable it and configure the network usage threshold for Healthy, Warning and Critical states. (For information about how to override the threshold values, see the management pack guide.)
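As a rough illustration of what such an override could look like in management pack XML, the fragment below enables the monitor and sets both thresholds. It is a sketch under assumptions, not the procedure from the management pack guide: the `HPC` reference alias, the override IDs, and the threshold values are hypothetical, and it assumes the monitor type exposes `LowThreshold` and `HighThreshold` as overridable parameters.

```xml
<!-- Sketch only: intended for the Monitoring section of a separate, unsealed override management pack. -->
<!-- "HPC" is an assumed reference alias for the HPC 2008 R2 management pack; the override IDs are made up. -->
<Overrides>
  <!-- Enable the monitor for the active head node class. -->
  <MonitorPropertyOverride ID="Example.Override.ClusterNetworkUsage.Enabled" Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode" Enforced="false" Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Property="Enabled">
    <Value>true</Value>
  </MonitorPropertyOverride>
  <!-- Illustrative threshold in bytes/sec; assumes LowThreshold is an overridable parameter. -->
  <MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.LowThreshold" Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode" Enforced="false" Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Parameter="LowThreshold">
    <Value>2800000000</Value>
  </MonitorConfigurationOverride>
  <!-- Illustrative threshold in bytes/sec; assumes HighThreshold is an overridable parameter. -->
  <MonitorConfigurationOverride ID="Example.Override.ClusterNetworkUsage.HighThreshold" Context="HPC!Microsoft.HPC.2008R2.ActiveHeadNode" Enforced="false" Monitor="HPC!Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Parameter="HighThreshold">
    <Value>3600000000</Value>
  </MonitorConfigurationOverride>
</Overrides>
```

The default threshold values in the monitor definition appear to be far above what a cluster network could carry in bytes/sec, so the monitor is unlikely to leave the Healthy state until the thresholds are overridden with values that match the actual network.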
Sustained high network usage is usually caused by jobs that send an excessive number of messages between nodes. If network throughput approaches the maximum network capacity, jobs, especially MPI jobs, may take longer to complete.
Use job policies to manage the load. You can also increase the capacity of the cluster by using a network with greater throughput.
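To judge how close the aggregated value is to the maximum network capacity, it can help to estimate that capacity first. The arithmetic below is a sketch with assumed numbers: a hypothetical cluster of N = 32 nodes, each with one 1 Gbit/s link, throughput counted in one direction only, and warning and critical levels chosen at 70% and 90% of capacity. None of these figures come from the management pack.

```latex
% Illustrative numbers only: N, B, and the 70%/90% levels are assumptions.
\begin{align*}
C_{\text{node}} &= \frac{B}{8} = \frac{10^{9}\ \text{bit/s}}{8\ \text{bit/byte}} = 1.25 \times 10^{8}\ \text{bytes/sec} \\
C_{\text{cluster}} &= N \cdot C_{\text{node}} = 32 \times 1.25 \times 10^{8} = 4 \times 10^{9}\ \text{bytes/sec} \\
T_{\text{warning}} &= 0.7 \, C_{\text{cluster}} = 2.8 \times 10^{9}\ \text{bytes/sec} \\
T_{\text{critical}} &= 0.9 \, C_{\text{cluster}} = 3.6 \times 10^{9}\ \text{bytes/sec}
\end{align*}
```

Moving the same hypothetical cluster to 10 Gbit/s links would raise the estimated capacity tenfold, which is the effect the second recommendation above relies on.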
Target | Microsoft.HPC.2008R2.ActiveHeadNode
Parent Monitor | System.Health.PerformanceState
Category | PerformanceHealth
Enabled | False
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Microsoft.HPC.RunAsProfile.AdminActionAccount
<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage" Accessibility="Public" Enabled="false" Target="Microsoft.HPC.2008R2.ActiveHeadNode" ParentMonitorID="Health!System.Health.PerformanceState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PSPerfCounterMonitor.ThreeStates" ConfirmDelivery="true">
  <Category>PerformanceHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.HeadNode.Performance.ClusterNetworkUsage_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateId9d63df9cb26445559c2b1481cb25b88b" MonitorTypeStateID="Low" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateIdac9c64dbc0d4489e81f65c05188285f6" MonitorTypeStateID="Medium" HealthState="Warning"/>
    <OperationalState ID="UIGeneratedOpStateIdcf10be749c164e8eb8278a4387a1a439" MonitorTypeStateID="High" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <LowThreshold>10000000000000000</LowThreshold>
    <HighThreshold>20000000000000000</HighThreshold>
    <TimeoutSeconds>300</TimeoutSeconds>
    <IntervalSeconds>300</IntervalSeconds>
    <ClusterName>$Target/Property[Type="Microsoft.HPC.2008R2.ActiveHeadNode"]/ClusterName$</ClusterName>
    <MetricName>HPCClusterNetwork</MetricName>
  </Configuration>
</UnitMonitor>
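The definition above maps the monitor type states Low, Medium, and High to the health states Success, Warning, and Error, and supplies two thresholds. One plausible reading of how a sampled value v is classified is sketched below. The comparison directions and the handling of values exactly equal to a threshold are assumptions, since the monitor type implementation is not shown here.

```latex
% Assumed semantics of the three-state monitor type; boundary handling is a guess.
\[
\text{state}(v) =
\begin{cases}
\text{Low} \rightarrow \text{Success (Healthy)} & \text{if } v < T_{\text{low}} \\
\text{Medium} \rightarrow \text{Warning} & \text{if } T_{\text{low}} \le v < T_{\text{high}} \\
\text{High} \rightarrow \text{Error (Critical)} & \text{if } v \ge T_{\text{high}}
\end{cases}
\]
% Example with hypothetical override values (not the defaults above):
% T_low = 2.8e9, T_high = 3.6e9, v = 3.0e9 bytes/sec gives Medium, that is, Warning.
```

Because AlertOnState is Error, an alert would be raised only in the High state under this reading. The value is sampled every 300 seconds, as set by IntervalSeconds.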