Monitor of connectivity to job scheduler for HPC 2008 R2 Node
This monitor tracks whether a node can connect to the job scheduler port on the head node over TCP/IP (port 5800 in the default configuration). A node that cannot connect cannot communicate with the job scheduler and therefore cannot run jobs. By default, each node attempts the connection every 15 minutes.
This error can be caused by any of the following:
A problem with the HPC Job Scheduler Service on the head node
A problem with the firewall configuration or rules on the head node
To resolve this error, try the following:
Restart the HPC Job Scheduler Service on the head node.
On the head node, check the firewall settings using HPC Cluster Manager (in Configuration, click Network) or use the following HPC PowerShell command: Get-HpcNetworkInterface | Format-List
On the head node, check the configuration of the inbound firewall rules for the HPC Job Scheduler Service to ensure that ports required for communication with the nodes are open.
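The resolution steps above can be sketched in PowerShell on the head node. This is a minimal sketch, not an official procedure: the service name HpcScheduler and the "HPC" rule-name filter are assumptions that should be verified on your cluster, and the session must be elevated.

```powershell
# Run in an elevated PowerShell session on the head node.
# Assumption: the HPC Job Scheduler Service is registered under the
# service name "HpcScheduler"; confirm with: Get-Service -DisplayName "HPC*"
$serviceName = "HpcScheduler"

# 1. Show the current state of the service, then restart it.
#    (Add -Force if dependent services block the restart.)
Get-Service -Name $serviceName | Format-List Name, DisplayName, Status
Restart-Service -Name $serviceName

# 2. Review the cluster network and firewall settings.
#    Get-HpcNetworkInterface is the HPC PowerShell command named above;
#    the snap-in must be loaded in a plain PowerShell session.
Add-PSSnapin Microsoft.HPC -ErrorAction SilentlyContinue
Get-HpcNetworkInterface | Format-List

# 3. List the Windows Firewall rules whose output lines mention HPC.
#    Assumption: the rules created by HPC setup contain "HPC" in their names.
netsh advfirewall firewall show rule name=all | Select-String -Pattern "HPC"
```

Restarting the service interrupts scheduling on the cluster, so it is usually worth checking the service state and firewall rules first.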
The HPC Job Scheduler Service monitor provides detailed information for troubleshooting the HPC Job Scheduler Service.
The Firewall monitor provides additional information about the firewall configuration on the HPC cluster networks.
For more information about the Windows firewall configuration on the head node, see “Appendix 1: HPC Cluster Networking” in the Design and Deployment Guide for Windows HPC Server 2008 R2 (http://go.microsoft.com/fwlink/?LinkID=194689).
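To confirm a fix without waiting for the next 15-minute cycle, the monitor's check can be approximated by hand from a node. The following is a sketch rather than the monitor's own script: the function name Test-SchedulerPort and the five-second timeout are illustrative choices, while the default port 5800 and the ClusterName registry value are taken from the monitor configuration on this page.

```powershell
# Hypothetical helper (not part of HPC Pack): returns "Success" if a TCP
# connection to Server:Port opens within TimeoutMs, otherwise "Failure".
function Test-SchedulerPort {
    param(
        [string]$Server,
        [int]$Port = 5800,
        [int]$TimeoutMs = 5000
    )
    $client = New-Object System.Net.Sockets.TcpClient
    try {
        # Start the connection asynchronously so the wait can be bounded.
        $async = $client.BeginConnect($Server, $Port, $null, $null)
        if (-not $async.AsyncWaitHandle.WaitOne($TimeoutMs, $false)) {
            return "Failure"    # no answer within the timeout
        }
        $client.EndConnect($async)    # throws if the connection was refused
        return "Success"
    }
    catch {
        return "Failure"
    }
    finally {
        $client.Close()
    }
}

# Example (run on a node): read the head node name from the registry value
# that the monitor uses, then test the scheduler port.
#   $headNode = (Get-ItemProperty hklm:\SOFTWARE\Microsoft\HPC).ClusterName
#   Test-SchedulerPort -Server $headNode -Port 5800
```

The explicit timeout is a design choice of this sketch: a blocked port often drops packets silently, and a bounded wait reports "Failure" quickly instead of hanging until the operating system gives up.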
Target | Microsoft.HPC.2008R2.NodeRole
Parent Monitor | System.Health.AvailabilityState
Category | AvailabilityHealth
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.TwoStates
Remotable | True
Accessibility | Public
Alert Message |
RunAs | Microsoft.HPC.RunAsProfile.AdminActionAccount
<UnitMonitor ID="Microsoft.HPC.2008R2.Monitor.NodeRole.Availability.Connectivity" Accessibility="Public" Enabled="true" Target="Microsoft.HPC.2008R2.NodeRole" ParentMonitorID="Health!System.Health.AvailabilityState" Remotable="true" Priority="Normal" RunAs="HPCLibrary!Microsoft.HPC.RunAsProfile.AdminActionAccount" TypeID="Microsoft.HPC.2008R2.MonitorType.PowershellScriptMonitor.TwoStates" ConfirmDelivery="true">
  <Category>AvailabilityHealth</Category>
  <AlertSettings AlertMessage="Microsoft.HPC.2008R2.Monitor.NodeRole.Availability.Connectivity_AlertMessageResourceID">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="UIGeneratedOpStateId00f669790fe541dd80803872986dcbb5" MonitorTypeStateID="Success" HealthState="Success"/>
    <OperationalState ID="UIGeneratedOpStateId578767281f37481db1b682c81ac9845a" MonitorTypeStateID="Failure" HealthState="Error"/>
  </OperationalStates>
  <Configuration>
    <ScriptName>GetConnectivityToScheduler</ScriptName>
    <ScriptBody>
      param([int]$schedulerPort)
      Function TryConnect
      {
          param([string]$server, [int]$port)
          &amp;{ #TRY
              $client = New-Object System.Net.Sockets.TcpClient $server,$port
              $client.Close()
              return "Success"
          }
          trap [Exception]
          {
              return "Failure"
          }
      }
      $api = New-Object -ComObject "MOM.ScriptAPI"
      $bag = $api.CreatePropertyBag()
      $clusterName = (Get-ItemProperty hklm:\SOFTWARE\Microsoft\HPC).ClusterName
      if (($clusterName -ne $null) -and ($clusterName -ne ""))
      {
          $result = TryConnect $clusterName $schedulerPort
          $bag.AddValue("Result", $result)
      }
      else
      {
          $bag.AddValue("Result", "Failure")
      }
      $api.Return($bag)
    </ScriptBody>
    <Parameters>5800</Parameters>
    <TimeoutSeconds>300</TimeoutSeconds>
    <IntervalSeconds>900</IntervalSeconds>
  </Configuration>
</UnitMonitor>