Veeam VMware Collector: Health Service state change flow stalled

Summary

This rule is one of a set in the Veeam MP for VMware that monitors the status of the Ops Mgr Health Service running on Veeam Collector servers.

Specific events indicating data processing issues, workflow failures and resource bottlenecks are monitored.

Source: HealthService

Event ID: 5300 OR 5302 OR 5304

Level: Error

Description: Local health service is not healthy. Monitor state change flow is stalled with pending acknowledgement.
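The matching criteria above can be sketched as a simple filter. This is only an illustration of the rule's logic, not the management pack's actual implementation; the event record layout (a dict with 'source', 'event_id' and 'level' keys) and the function name are assumptions made for the example.

```python
# Criteria taken from the rule summary above.
RULE_SOURCE = "HealthService"
RULE_EVENT_IDS = frozenset({5300, 5302, 5304})
RULE_LEVEL = "Error"


def matches_rule(event):
    """Return True if an event record meets all three rule criteria.

    `event` is a hypothetical dict with 'source', 'event_id' and 'level'
    keys; real event log records are structured differently.
    """
    return (
        event.get("source") == RULE_SOURCE
        and event.get("event_id") in RULE_EVENT_IDS
        and event.get("level") == RULE_LEVEL
    )


# Made-up sample records: only the first satisfies every criterion.
sample_events = [
    {"source": "HealthService", "event_id": 5300, "level": "Error"},
    {"source": "HealthService", "event_id": 5301, "level": "Error"},
    {"source": "Health Service Modules", "event_id": 5304, "level": "Error"},
]
matching = [e for e in sample_events if matches_rule(e)]
print(len(matching))  # prints 1
```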

Causes

The Veeam Collector publishes data gathered from VMware vCenter for the local Ops Mgr Health Service to consume. This agentless monitoring method can place a high load on the Health Service. In the following situations the Ops Mgr Health Service may enter an unstable state:

1. The Collector server has not been correctly configured, or lacks sufficient resources (such as CPU, memory, disk speed).
2. A large VMware environment is being monitored and this Collector is overloading the Health Service by publishing more data than the Health Service can consume.
3. The Ops Mgr backend systems (Management Servers and Databases) lack sufficient resources (CPU, memory, disk speed) or are otherwise unresponsive.

Resolutions

Review the alert description for more detail on the cause. Check the repeat count: a very high value indicates multiple sustained failures to process data and requires investigation.

Note that some sporadic errors may occur during a short period of very heavy load, such as initial discovery processing (when a new vCenter target is added to the Veeam Virtualization Extensions UI, or when a large number of monitoring Jobs are moved to this Collector). If no new errors are logged after initial discovery, and the Health Service has stabilized (review the Operations Manager event log for errors), then this alert can be safely closed. However, review the guidance below to see whether the Veeam MP for VMware configuration can be further optimized.
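The first of these two conditions (no new errors after initial discovery) can be checked with a sketch like the following. The function names, and the idea of supplying the end of discovery as a timestamp, are assumptions made for illustration. The check does not cover the second condition: the Operations Manager event log still has to be reviewed to confirm that the Health Service has stabilized.

```python
from datetime import datetime


def errors_after(event_times, discovery_end):
    """Return the error timestamps logged after discovery finished, oldest first."""
    return sorted(t for t in event_times if t > discovery_end)


def no_new_errors(event_times, discovery_end):
    """True when every error was logged at or before the end of discovery."""
    return not errors_after(event_times, discovery_end)


# Hypothetical timestamps: discovery ended at 12:00 and the last error was at 11:40.
discovery_end = datetime(2024, 1, 1, 12, 0)
error_times = [datetime(2024, 1, 1, 11, 15), datetime(2024, 1, 1, 11, 40)]
print(no_new_errors(error_times, discovery_end))  # prints True
```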

Further details on the possible causes and troubleshooting steps for each are given below.

Collector server is not correctly configured or is under-resourced.
Check that the built-in MP Task 'Configure Health Service' has been applied to this server. This task applies certain registry changes that optimize the Health Service to handle the high data loads from the Veeam Collector.
The configuration of the Health Service is monitored and set using the Veeam MP for VMware Monitor 'Veeam Collector: Health Service recommended configuration monitor'. This monitor will check the registry settings for the Health Service where Veeam Collector is installed, and if the settings are incorrect will raise an alert. This monitor should be enabled to correctly track the required Health Service configuration. Check the status in Health Explorer for the Veeam Collector.
For Ops Mgr agents the Recovery Action of this monitor will run automatically. For Ops Mgr Management Servers the Recovery Action will not run automatically. This is by design, as the setting change requires a restart of the Health Service. For Management Servers, use the Health Explorer to run the Recovery Action manually, or run the Task 'Configure Ops Mgr Health Service' which will be visible in the Tasks pane in the context of the Veeam Collector.
You can add more CPU and memory to this server, or use faster disks. However, the benefit of this approach is limited by the internal architecture of the Ops Mgr Health Service when it acts as a proxy for large data volumes, so extra resources will not allow the recommended workflow limit to be greatly exceeded.
An Ops Mgr Management Server will scale and perform better than an Ops Mgr proxy Agent (although the recommended maximum for workflows is the same). Consider relocating the Collector role to a Management Server, or upgrading this agent to a Management Server.

Collector is overloading the local Ops Mgr Health Service.
The Collector is capable of publishing a large amount of data when many vSphere clusters, hosts and VMs are being monitored. However, the Ops Mgr Health Service will become overloaded before the Collector capabilities are exhausted.
Review the performance counter 'workflow count' - this is a native Ops Mgr counter, also gathered by the Veeam MP for VMware. Historical data is available in the view Veeam for VMware / Veeam Collectors / Performance Views / Workflow count.
There is a built-in Veeam MP for VMware monitor for workflow count, 'Veeam Collector: Health Service recommended workflow load monitor'. The default threshold is 75000 workflows. Note, however, that other environmental factors (such as the resource issues and bottlenecks described under Causes) may make the real-world workflow limit lower.
There are additional performance monitors for the Health Service, for CPU and memory usage of the Health Service and related Monitoring Host processes. Performance history for these metrics can also be seen in the Veeam Collectors view folder. Monitor history can be reviewed in the Health Explorer for the Veeam VMware Collector.
If any of the above performance monitors are also firing, in addition to the events monitored by this rule, it is a clear indication that a performance bottleneck is causing knock-on issues with data processing.
The recommended approach is to re-distribute monitoring jobs using the Veeam Virtualization Extensions UI, so that the monitoring load is better balanced across all Collectors. If all Collectors are fully loaded then install and register a new Collector. The Load-balance feature of the Veeam Extensions service can be used to automatically distribute the monitoring jobs optimally.
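When deciding which Collectors to offload, it can help to compare each Collector's workflow count with the default threshold of 75000. The sketch below assumes the workflow counts have already been read from the performance view and are supplied as a plain mapping; the Collector names, the function names, and the choice to flag only counts strictly above the threshold are assumptions for illustration. It does not reproduce the Load-balance feature of the Veeam Extensions service.

```python
WORKFLOW_THRESHOLD = 75000  # default threshold of the workflow load monitor


def workflow_headroom(workflow_counts, threshold=WORKFLOW_THRESHOLD):
    """Map each Collector to (threshold - count); a negative value means it is over."""
    return {name: threshold - count for name, count in workflow_counts.items()}


def collectors_over_threshold(workflow_counts, threshold=WORKFLOW_THRESHOLD):
    """Return the names of Collectors whose workflow count is above the threshold."""
    return sorted(
        name for name, count in workflow_counts.items() if count > threshold
    )


# Hypothetical workflow counts per Collector.
counts = {"collector-a": 82000, "collector-b": 41000, "collector-c": 75000}
print(collectors_over_threshold(counts))  # prints ['collector-a']
print(workflow_headroom(counts)["collector-b"])  # prints 34000
```

In this made-up example, collector-a is the candidate to offload and collector-b has the most spare capacity to receive monitoring jobs. Because the real-world limit may be lower than the default, a lower value can be passed as `threshold`.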

Ops Mgr backend systems are unresponsive or under-resourced.
Bottlenecks in the Ops Mgr backend, such as response times from the Ops database and Data Warehouse database, can greatly impact the stability of the local Health Service.
Review the Operations Manager event log on Management Servers and on the Database Server(s) to understand if there are delays or errors in database communications.

Review this Microsoft KB article to see more detail on troubleshooting unresponsive Health Service issues.

Use the Alerts View to see all current open issues for this object. Use the Events View to review any error and warning events for this object. Open a Performance View to see the performance metrics for this object and all contained objects. Open a Diagram View to analyze the relationships of this object to other components.

External

See the Help Center for more information, including reference lists of all Rules and Monitors and the full set of User Guides for the Veeam MP for VMware.

See the VMware Online Documentation for more information on VMware vSphere.

ID	Module Type	TypeId	RunAs
DS	DataSource	Microsoft.Windows.EventProvider	Default
Alert	WriteAction	System.Health.GenerateAlert	Default

Veeam.Virt.Extensions.VMware.HealthService.StateChangeFlowStalled (Rule)
