Co-written by Tobias Mauch, a very senior and much respected engineer on the HP OM team.
Generally, when a storm hits, you simply have to weather it and hope it does not damage your property. In the context of HP Operations Manager (HP OM), these storms consist of a huge number of messages or events that hit HP OM in a short period of time. The source of these messages or events is the HP Operations agent, which is part of the infrastructure monitoring software. In many cases, these storms are triggered by events that all report the same failure.
Any customer with a large installation of agents has potentially faced message storms or floods. As you know, handling and weathering such floods takes considerable time and effort.
Here are three easy methods to detect and prevent these storms. The first two approaches work on the HP OM server and the last one is provided by the HP Operations agent.
In this blog post, I will introduce the first approach; in the next two posts, I will explain the other options.
Event Correlation Services based message storm detection.
In this method, Event Correlation Services (ECS) circuits are used to prevent message storms (either message-based or policy-based). This approach has been around the longest.
Message storm detection and suppression is done on the management server by an ECS policy. For this, you need to enable output of all messages to the Message Stream Interface (MSI) in Divert mode, and you need to assign the ECS policy to the management server itself. The configuration, including the rate of incoming events and the interval over which it is measured, is performed by changing lines in the ECS fact store file for the ECS policy.
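To make the idea of a "rate of incoming events" within an "interval" concrete, here is a minimal Python sketch of threshold-based storm detection. It is not the ECS circuit and does not reflect the fact store file format; the class name `StormDetector` and the parameters `limit` and `interval` are hypothetical names chosen for illustration, and the sliding-window logic is an assumption about how such a check could work.

```python
from collections import deque


class StormDetector:
    """Illustrative only: flags a storm when more than `limit` messages
    arrive within the last `interval` seconds (hypothetical parameters)."""

    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.timestamps = deque()

    def observe(self, now):
        """Record one message received at time `now` (in seconds).

        Returns True if the message rate currently exceeds the limit.
        """
        self.timestamps.append(now)
        # Discard messages that have fallen out of the measuring interval.
        while self.timestamps and now - self.timestamps[0] >= self.interval:
            self.timestamps.popleft()
        return len(self.timestamps) > self.limit


detector = StormDetector(limit=3, interval=10)
results = [detector.observe(t) for t in (0, 1, 2, 3, 20)]
```

In this sketch, the fourth message within ten seconds trips the limit, and a single message arriving much later is treated as normal traffic again.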
Message flow scenarios:
Figure A : Message flow when suppression is enabled.
Possible message flows:
• Normal flow 1 -> 2 -> 3
• Flow when detecting a message storm 1 -> 2 -> 4 -> 5 -> 6 -> 7
• Flow after a message storm 1 -> 2 -> 3 & 3 -> 8 -> 9
Figure B: Message flow when suppression is disabled.
• Flow after a message storm 1 -> 2 -> 3 & 3 -> 7 -> 8
In addition to the steps described for "Suppression enabled", step 10 is performed: messages are sent to the message browser even when a message storm has been detected.
You can configure the circuit so that it does not send the messages received by the management server to the message browser until the message storm has stopped. (Note that for policy-based message storm detection, it is also possible to create exceptions so that some policies, nodes, or combinations of both are never disabled.)
There are two ECS circuits to choose from:
a) MsgStorm_Dectect : This ECS policy suppresses messages if the number of messages from a particular node crosses the configured limit.
By default, the ECS policy creates an automatic action that stops the agent on the affected managed node, but you can configure the action to do nothing.
b) PolicyStorm_Dectect : This ECS policy suppresses messages if the number of messages from a particular policy on a managed node crosses the configured limit.
By default, this ECS policy creates an automatic action that disables the affected policy on the managed node, but you can configure the action to do nothing.
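The difference between the two circuits comes down to what is counted: messages per node, or messages per policy on a node. The following Python sketch illustrates that distinction only. The function `detect_storms`, its parameters, the sample node and policy names, and the action strings are all hypothetical, and "crosses the configured limit" is assumed here to mean "strictly greater than the limit".

```python
from collections import Counter


def detect_storms(messages, limit, per_policy=False):
    """Illustrative only: find storm sources among the messages seen in
    one measuring interval.

    messages   -- iterable of (node, policy) tuples
    limit      -- message count that must be exceeded (assumption)
    per_policy -- False: count per node; True: count per (node, policy)
    """
    counts = Counter(
        (node, policy) if per_policy else node
        for node, policy in messages
    )
    # Default reactions described in the text, expressed as plain labels.
    action = "disable policy" if per_policy else "stop agent"
    return {key: action for key, count in counts.items() if count > limit}


# Hypothetical sample traffic for one interval.
messages = (
    [("web01", "disk_mon")] * 5
    + [("web01", "cpu_mon")] * 2
    + [("db01", "disk_mon")] * 3
)

node_storms = detect_storms(messages, limit=6)
policy_storms = detect_storms(messages, limit=4, per_policy=True)
```

With this sample data, the node-based check flags only web01 (seven messages in total), while the policy-based check flags only the disk_mon policy on web01 and leaves the rest of that node's monitoring untouched.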