What is one of the most common situations IT environment owners face? One of them occurs when some HP Operations agent nodes do not send messages at the interval configured in the policy, or do not work as expected.
These problems can have any of the causes below (and more):
So what’s the impact of nodes not communicating properly?
Agent failure might affect the monitoring of critical services. In enterprise environments, it takes time to detect agent nodes that may have failed, troubleshoot them, and then fix the issues. While this is happening, the very reason the HP Operations Manager infrastructure monitoring agent was installed on the node (application and systems monitoring) is not being served!
How do you fix this?
For every customer there is a need to reduce MTTR (Mean Time to Repair) and increase MTBF (Mean Time Between Failures).
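To see why both numbers matter, one common way to combine them is the steady-state availability ratio MTBF / (MTBF + MTTR). The snippet below is only an illustrative sketch: the function name and the sample figures are assumptions chosen for the example, not values from HP documentation.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Return steady-state availability, the fraction of time a service is up.

    Uses the standard ratio MTBF / (MTBF + MTTR).
    """
    if mtbf_hours < 0 or mttr_hours < 0:
        raise ValueError("MTBF and MTTR must be non-negative")
    total = mtbf_hours + mttr_hours
    if total == 0:
        raise ValueError("MTBF and MTTR cannot both be zero")
    return mtbf_hours / total


# Hypothetical figures: same failure rate, but repairs take 2 hours instead of 8.
baseline = availability(mtbf_hours=500, mttr_hours=8)
faster_repair = availability(mtbf_hours=500, mttr_hours=2)
print(f"baseline: {baseline:.4%}, faster repair: {faster_repair:.4%}")
```

With the failure rate unchanged, shortening the repair time alone raises availability, which is why faster detection and troubleshooting of failed agents pays off.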
To fix such Operations agent health issues, we ideally need an intuitive central dashboard that shows the overall health of the Operations Manager agents.
This dashboard should let you drill down into each agent node, and then into sub-agent health, to quickly identify, troubleshoot, and fix monitoring issues. As an add-on, it should also provide meaningful logs and events, with enough detail to act on and fix the problem wherever possible.
Figure 1: How to identify and fix agent problems
Here comes the Health Dashboard with the new HP Operations agent 12.0
With the latest release of HP Operations Agent, version 12.0, the approach described above for identifying agent health issues has been implemented, and a Health Monitor Dashboard is now available!
Start with the dashboard, see which nodes need attention, and drill down to the node.
Figure 2: Agent Health Dashboard
Figure 3: Node and Process views
Then drill down further to the sub-agent and, voilà, there is the reason for the agent failure!
By using the Health View Dashboard, you can detect:
Operations Agent runtime issues such as sub-agent hangs and aborts, with meaningful information about why they happen
System resource and performance issues, such as the current resource utilization of agent processes (CPU, memory, disk, threads, semaphores/handles, agent disk space utilization, growth patterns, etc.)
Overall System Resource Usage and availability
Agent runtime configuration issues, such as policy runtime state (Enabled/Disabled, Collection state, Last Run, Missed intervals, etc.)
Runtime Configuration (Variables) state
Errors in agent logs
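To make the "Missed intervals" idea from the list above concrete, here is a minimal sketch of how missed policy runs could be derived from a policy's last run time and its configured interval. It is purely illustrative: `PolicyState`, its fields, and the grace period are hypothetical names and assumptions, not the agent's actual data model or API, and the post does not describe how the Health View computes this value.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PolicyState:
    """Hypothetical snapshot of a policy's runtime state (not the agent's real schema)."""
    name: str
    enabled: bool
    interval: timedelta
    last_run: datetime


def missed_intervals(state: PolicyState, now: datetime,
                     grace: timedelta = timedelta(seconds=30)) -> int:
    """Count scheduled runs that should have happened since last_run but did not.

    A disabled policy never counts as missing a run. The grace period
    (an assumption) avoids flagging a run that is only slightly late.
    """
    if not state.enabled or state.interval <= timedelta(0):
        return 0
    overdue = (now - state.last_run) - grace
    if overdue < state.interval:
        return 0
    # Dividing one timedelta by another with // yields a whole number of intervals.
    return overdue // state.interval


# Usage with made-up values: a 5-minute policy that last ran 12 minutes ago.
t0 = datetime(2024, 1, 1, 12, 0, 0)
policy = PolicyState("cpu_check", enabled=True,
                     interval=timedelta(minutes=5), last_run=t0)
print(missed_intervals(policy, t0 + timedelta(minutes=12)))
```

A non-zero count for an enabled policy is the kind of signal that would prompt a drill-down into the node and its sub-agents.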
The Health View is designed for both the Operations and the Performance personas. It can coexist with existing agent health monitoring solutions (SelfMon, HBP, HP OM Agent Health Check, HP OM Health Monitoring component, etc.) and uses the existing communication channel (BBC), so no additional port needs to be opened.
One more thing: you can install the Dashboard on an OM server or on a non-OM server; the latter is better suited to an OMi environment.