Monitor all agent processes to identify potential issues with the agent using too much processor time.
This monitor calculates the total CPU utilization of the Operations Manager agent and its related processes, and generates an alert when CPU utilization exceeds a specified threshold for a specified number of consecutive samples.
The monitor’s underlying script locates and samples the CPU utilization of the Operations Manager agent process (HealthService.exe), its child monitoring host processes (MonitoringHost.exe), and the child processes of those monitoring host processes (cscript.exe, PowerShell.exe, etc.). The script runs the calculation three times and outputs the average of the three consecutive samples, which this monitor then uses to determine a critical or healthy state.
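The averaging described above can be sketched as follows. This is a minimal illustration, not the monitor's actual script: the function names and the per-process readings are hypothetical, and it assumes each sampling pass yields one CPU percentage per agent-related process.

```python
from statistics import mean


def total_agent_cpu(readings):
    """Sum the CPU percentage of every agent-related process in one pass."""
    return sum(readings.values())


def averaged_utilization(passes):
    """Average the per-pass totals, mirroring the three-sample average."""
    if not passes:
        raise ValueError("at least one sampling pass is required")
    return mean(total_agent_cpu(p) for p in passes)


# Hypothetical readings (percent) from three consecutive sampling passes.
passes = [
    {"HealthService.exe": 4.0, "MonitoringHost.exe": 10.0, "cscript.exe": 1.0},
    {"HealthService.exe": 6.0, "MonitoringHost.exe": 12.0, "cscript.exe": 0.0},
    {"HealthService.exe": 5.0, "MonitoringHost.exe": 11.0, "cscript.exe": 2.0},
]
print(averaged_utilization(passes))
```

The single averaged value is what would be compared against the threshold; a short spike in one pass is smoothed out by the other two.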
You can use overrides to customize the following parameters to alter the default behavior of this monitor:
Frequency (seconds). This is the frequency at which the monitor samples agent processor utilization. By default, the monitor evaluates the agent processor utilization every 300 seconds (5 minutes).
Number of consecutive samples for critical state. By default, this monitor reports a critical state when 6 consecutive samples exceed the specified threshold.
Number of consecutive samples for healthy state. By default, this monitor returns a healthy state when 3 consecutive samples are under the specified threshold.
Threshold. By default, the threshold for CPU utilization is 25%.
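The way the threshold and the two consecutive-sample counts interact can be illustrated with a small state tracker. This is a sketch under assumptions, not the monitor type's implementation: the class and field names are hypothetical, and a sample exactly equal to the threshold is assumed to count as not exceeding it.

```python
from dataclasses import dataclass


@dataclass
class ConsecutiveSampleMonitor:
    threshold: float = 25.0   # default threshold (percent)
    critical_count: int = 6   # consecutive samples over threshold
    healthy_count: int = 3    # consecutive samples under threshold
    state: str = "healthy"
    _over: int = 0
    _under: int = 0

    def observe(self, cpu_percent):
        """Record one sample and return the resulting state."""
        if cpu_percent > self.threshold:
            self._over += 1
            self._under = 0
        else:
            # Assumption: a sample equal to the threshold does not exceed it.
            self._under += 1
            self._over = 0
        if self._over >= self.critical_count:
            self.state = "critical"
        elif self._under >= self.healthy_count:
            self.state = "healthy"
        return self.state


monitor = ConsecutiveSampleMonitor()
for sample in [30, 31, 40, 35, 28, 33, 10, 12, 9]:
    print(sample, monitor.observe(sample))
```

With the defaults, the state turns critical only on the sixth consecutive sample over 25%, and returns to healthy on the third consecutive sample at or below it. Raising either count through an override makes the monitor slower to change state in that direction.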
This monitor is disabled by default for all management servers.
Excessive CPU utilization by the Operations Manager agent processes may indicate that the agent or one of its underlying dependencies is not operating properly. If the agent and its underlying dependencies are up to date, then the agent is likely being over-utilized on the monitored system. This may be short-lived, caused by a recent change in the management group such as the deployment of a new management pack, or the agent may truly be under excessive load, in which case tuning may be required.
To ensure that the agent and its underlying dependencies are operating properly, check the following:
Verify that the most recent version of the Operations Manager agent is installed on the system.
Verify that the update for MSXML 6.0 provided in Knowledge Base article 968967 (http://go.microsoft.com/fwlink/?LinkId=181885) is installed.
If the system's operating system is Windows XP, Windows 2000 Server, or Windows Server 2003, ensure that the system is running Windows Script Host 5.7 or later. The download locations for Windows Script Host 5.7 are available at http://go.microsoft.com/fwlink/?LinkId=181884.
If the condition persists after those configurations are verified, then deeper investigation is required to understand what is driving CPU utilization. Investigate further using any combination of the following steps:
Review the recent history of agent processor utilization, workflow count, and module count in the Agent Performance View. The processor utilization data shows whether the issue is recent or has been occurring for a longer period. The workflow and module count data indicate the workload that the various rules, monitors, and discoveries place on the agent. Compare this data against healthy agents to establish a baseline.
Use a tool such as the Effective Configuration Viewer (http://go.microsoft.com/fwlink/?LinkId=182300) to understand the number of class instances discovered on the agent. More class instances can lead to higher workflow and module counts, which can result in more workload.
Using Performance Monitor, collect more detailed % Processor Time measurements from the Process object. This shows which processes contribute most to overall processor utilization.
Review any recent management pack updates or changes to see if they correspond with the increase in CPU utilization.
When the cause or causes are identified, any one of the following steps may be taken to address the issue:
If a management pack change was made recently or a new management pack was deployed, monitor the situation to see if the problem continues.
Reduce the frequency of discoveries via overrides to spread their CPU utilization across the day. The trade-off is that discovery may take longer to occur.
Reduce the frequency of rules or monitors that run on a schedule to spread their CPU utilization across the day. The trade-off is that monitoring data is collected less often, so state changes may be detected later.
If the agent is managed by multiple management groups (a configuration referred to as “multi-homed”), that will contribute to higher processor utilization as well. Consider reducing the number of management groups that the agent is managed by.
If none of the steps above resolves the issue, contact Microsoft Customer Service and Support (http://support.microsoft.com/).
This monitor has a related diagnostic task, “Collect agent processor utilization diagnostic”, which reruns the sampling of CPU utilization. The diagnostic task is disabled by default.
There is also a task in the Operations console, “Get the agent processor utilization”, which reruns the sampling of CPU utilization. When you run the “Get the agent processor utilization” task, you can set the time-out and number of samples parameters. The task returns a table of results.
Target | Microsoft.SystemCenter.HealthService
Parent Monitor | Microsoft.SystemCenter.HealthService.PerformanceHealthRollup
Category | Custom
Enabled | True
Alert Generate | True
Alert Severity | Error
Alert Priority | Normal
Alert Auto Resolve | True
Monitor Type | Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeCounterMonitorType
Remotable | False
Accessibility | Public
Alert Message |
RunAs | Default
<UnitMonitor ID="Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeMonitor" Accessibility="Public" Enabled="onEssentialMonitoring" Target="SCLibrary!Microsoft.SystemCenter.HealthService" ParentMonitorID="Microsoft.SystemCenter.HealthService.PerformanceHealthRollup" Remotable="false" Priority="Normal" TypeID="Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeCounterMonitorType" ConfirmDelivery="true">
  <Category>Custom</Category>
  <AlertSettings AlertMessage="Microsoft.SystemCenter.HealthService.SCOMpercentageCPUTimeMonitor.AlertMessage">
    <AlertOnState>Error</AlertOnState>
    <AutoResolve>true</AutoResolve>
    <AlertPriority>Normal</AlertPriority>
    <AlertSeverity>Error</AlertSeverity>
  </AlertSettings>
  <OperationalStates>
    <OperationalState ID="CPUTimeOverThreshold" MonitorTypeStateID="OverThreshold" HealthState="Error"/>
    <OperationalState ID="CPUTimeUnderThreshold" MonitorTypeStateID="UnderThreshold" HealthState="Success"/>
  </OperationalStates>
  <Configuration>
    <IntervalSeconds>321</IntervalSeconds>
    <TimeoutSeconds>300</TimeoutSeconds>
    <SyncTime>00:00</SyncTime>
    <ComputerName>$Target/Host/Property[Type="Windows!Microsoft.Windows.Computer"]/PrincipalName$</ComputerName>
    <Threshold>25</Threshold>
    <ConsecutiveSampleCountCritical>6</ConsecutiveSampleCountCritical>
    <ConsecutiveSampleCountHealthy>3</ConsecutiveSampleCountHealthy>
  </Configuration>
</UnitMonitor>