This article is written with inputs from Tobias Mauch, a very senior and much-respected engineer on the HP OM team. We worked on this technical problem together.
A few weeks back I was asked to attend to an escalation in which the customer was facing message buffering/event processing delays in HP Operations Manager. Basically this means that alerts from monitored endpoints were not hitting the OM event browser and were not being acted upon in time. As a result, the OM server's message buffering queues fill up and, as we all know, that creates havoc in the NOC!
The customer said that the message buffering happened once every few hours, especially when there was a large spurt of messages from a telecom switch with hundreds of devices connected to it. After a while, with no other options left, the customer would restart the OM services and flush the queues.
By the time I was on the case, some investigation had already been done by the HP support team, and I was presented with this information.
Total number of OM nodes: 4200
OM external nodes: 3400
On a side note, it was also observed that memory utilization on the server was consistently high (over 90 percent), as was swap utilization. This led us to suspect that the server might not be sized correctly.
Also, the customer was not using DNS - they were using /etc/hosts for name lookup.
Basically, the above meant that the customer had a large number of external nodes in their OM node configuration - 3400 out of 4200, or almost 81 percent. There's nothing wrong with that per se, except that we need to keep in mind how OM performs message assignment when external nodes are used.
But what's an OM external node?
External nodes represent nodes that are located outside the HPOM domain. These nodes, which include all kinds of nodes (that is, not just IP nodes), have only some of the functionality of normal HPOM nodes. Usually, no HP Operations agent runs on these nodes. An external node represents a range of nodes from which external events are integrated into HPOM. External nodes allow HPOM to receive messages from such objects as gateways, connectors, networks, hubs, and other IP devices.
(from HPOM online help documentation)
HPOM 9.x has two kinds of external nodes:
* Node name pattern. For example, to match messages for all nodes at HP in Germany, you could use the node name pattern ^<*>.deu.hp.com$
* IP address pattern. For example, to match all messages from network segment 1.2.3.*, you could add an external node with the pattern ^1.2.3.<*>$
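To get a feel for how such patterns behave, here is a rough Python sketch that translates an HPOM-style node pattern into a regular expression and tries it against the two examples above. The helper `hpom_pattern_to_regex` is purely illustrative - HPOM's real pattern language is not standard regex (see the tips at the end of this article), and this sketch only handles the `<*>` wildcard and the `^`/`$` anchors.

```python
import re

def hpom_pattern_to_regex(pattern: str) -> re.Pattern:
    """Illustrative only: approximate an HPOM-style node pattern
    as a Python regex. '<*>' matches any string; '^' and '$' are
    anchors; everything else is treated as a literal character."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("<*>", i):
            out.append(".*")                   # HPOM '<*>' wildcard
            i += 3
        elif pattern[i] in "^$":
            out.append(pattern[i])             # keep anchors as-is
            i += 1
        else:
            out.append(re.escape(pattern[i]))  # literal character
            i += 1
    return re.compile("".join(out))

# The two example patterns from the text:
name_pat = hpom_pattern_to_regex("^<*>.deu.hp.com$")
ip_pat = hpom_pattern_to_regex("^1.2.3.<*>$")

print(bool(name_pat.match("server01.deu.hp.com")))  # True
print(bool(ip_pat.match("1.2.3.42")))               # True
print(bool(ip_pat.match("1.2.4.42")))               # False
```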
To verify whether an alert received by OM's message receiver daemon is for an OM node, OM checks the message's originating IP against a set of known IPs. Along with this, OM also attempts to match the message's IP address against the list of patterns specified in OM external nodes.
While matching against known (node) IPs can happen quite fast using an optimized search, matching against the IP patterns in external node configurations is a sequential (and therefore time-consuming) process. Matching across thousands of IP patterns can delay message processing significantly.
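The asymmetry between the two lookups can be sketched in a few lines of Python. The names and addresses here are made up; the point is that a known-IP check is one hash lookup, while the external-node check is a loop over every pattern, paid in full by every message whose source is not a known node.

```python
import re

# Known managed-node IPs: membership test is a single hash lookup.
known_node_ips = {"10.0.0.1", "10.0.0.2", "10.0.0.3"}

# External-node patterns must be tried one by one, in sequence.
external_patterns = [re.compile(r"^14\.100\.%d\..*$" % i)
                     for i in range(200, 255)]

def accepts_message(source_ip: str) -> bool:
    if source_ip in known_node_ips:
        return True                # fast path: optimized search
    # Slow path: sequential scan over every external-node pattern.
    # With thousands of patterns, every message from an unknown IP
    # pays the full cost of this loop.
    return any(p.match(source_ip) for p in external_patterns)

print(accepts_message("10.0.0.2"))      # True, via the set
print(accepts_message("14.100.201.7"))  # True, after scanning patterns
print(accepts_message("9.9.9.9"))       # False, after scanning ALL patterns
```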
Additionally, not having DNS may increase the time taken by the name lookups that OM performs.
It became clear very quickly that the customer was, in effect, driving a car with the parking brake engaged. What is the best way to address this situation? Gently ask the customer to release the handbrake, so that the car moves faster and hits top speed.
Here are the details.
- We asked the customer to move away from /etc/hosts lookup to using DNS, to reduce lookup times.
- We asked the customer to resize their server, as the performance records indicated high memory and swap usage. Also, swap was equal to memory size; the general rule of thumb here is that if there's not much physical memory configured, swap should be configured at 2X memory.
- Finally, we started looking at the node IP patterns the customer had set. Surprisingly, the patterns were complete literal IP addresses with no wildcard at all: 14.100.201.1, 14.100.201.2, ...
This means that each node IP pattern actually matches only one valid IP address. All in all, the customer had for some reason decided to use an external node configuration for each of their devices, rather than adding them as 'message-allowed' nodes (type=other/other). This was the real parking brake slowing down their OM server's message processing.
NOTE: IP addresses mentioned above are not the real IP addresses. These are only examples intended to explain the pattern used.
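To illustrate (again with made-up addresses, per the note above): a pattern that contains nothing but a complete literal IP matches exactly one address, so it can be promoted to a plain message-allowed node entry. The helper `pattern_is_single_ip` below is hypothetical - OM has no such function - but it shows the idea of turning a pattern scan back into a hash lookup.

```python
import re

# External-node "patterns" that each match exactly one IP
# (example addresses only):
single_ip_patterns = ["^14.100.201.1$", "^14.100.201.2$", "^14.100.201.3$"]

def pattern_is_single_ip(pattern):
    """Return the literal IP if the pattern matches exactly one
    address, else None. A pattern with no '<*>' wildcard and a
    dotted-quad body is just a literal."""
    body = pattern.strip("^$")
    if "<" not in body and re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", body):
        return body
    return None

# Promote such patterns to plain message-allowed entries: one hash
# lookup then replaces a scan over the whole pattern list.
message_allowed_ips = {ip for p in single_ip_patterns
                       if (ip := pattern_is_single_ip(p))}
print(sorted(message_allowed_ips))
# ['14.100.201.1', '14.100.201.2', '14.100.201.3']
```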
Once we saw this, we suggested that the customer move away from their external node configuration and use the SNMP node option (message-allowed nodes).
After the above move, the message buffering problems are no longer occurring. The customer may see further improvement by moving to DNS and increasing their memory/swap allocation, but those are really 'bad road' conditions - not the handbrake that was preventing the car from achieving high speeds.
It always feels good to solve a customer problem.. I love my job!
Tips for better use of OM external nodes -
For the record, here are some tips that you may follow for more effective use of OM external nodes.
If you are testing/debugging event integrations and feel you are missing events, you can use the regex '^<*>$' as a 'catch-all' pattern. This allows you to obtain events from any device in your network, and also catches events forwarded between your OM servers when nodes are not in sync. However, this might cause many unwanted messages getting to OM.
If you have multiple OM external nodes with similar patterns, try to merge the patterns together.
An exception to the point above: in some cases there are reasons to have multiple external nodes with similar IP patterns, for the sake of splitting assignments/responsibilities between departments or individuals in the operations team.
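As a quick sketch of what merging buys you, the snippet below shows two neighbouring-subnet patterns (regex approximations of the HPOM patterns ^1.2.3.<*>$ and ^1.2.4.<*>$) collapsed into one. In HPOM's own pattern language the merge would use its alternation syntax rather than a regex character class; the point here is only that one match attempt replaces two.

```python
import re

# Two similar external-node patterns, as separate entries:
separate = [re.compile(r"^1\.2\.3\..*$"), re.compile(r"^1\.2\.4\..*$")]

# Merged into a single pattern covering both subnets:
merged = re.compile(r"^1\.2\.[34]\..*$")

# The merged pattern accepts and rejects exactly the same IPs,
# with one match attempt per message instead of two.
for ip in ("1.2.3.10", "1.2.4.10", "1.2.5.10"):
    assert any(p.match(ip) for p in separate) == bool(merged.match(ip))
print("merged pattern is equivalent")
```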
If you are using hostname patterns, then ensure that name lookups (whether DNS or file based) are not slow.
The pattern match expression setting is NOT standard regex syntax - beware of assuming it works like a general regex pattern match.
Use OM external nodes only where they make organization easier - there's always a chance that nodes covered by an OM external node pattern will need to become full-fledged managed nodes, if a need is felt to manage or monitor those servers.
Ramkumar Devanathan (twitter: @rdevanathan) is Product Manager for HPE Cloud Optimizer (formerly vPV). He was previously a member of the IOM-Customer Assist Team (CAT), providing technical assistance to HP Software pre-sales and support teams on Operations Management products including vPV, SHO, and VISPI. He has more than 14 years of experience in this product line, working in roles ranging from developer to product architect.