Most organizations use events to detect issues and decide how to act on them. In some cases, the number of events can be overwhelming: you are not sure which one to act on first, or which ones relate to the problem you are experiencing right now. This is where HP Operations Analytics can help.
Operations Analytics uses machine learning to add context from the relevant log files to the alerts generated by every part of the environment. It applies the same HP Labs algorithms used for log files to evaluate event streams, and it overlays event information with log data so the two can be correlated for better problem analysis and investigation.
What does this mean for the event management system you have today? It means you can make your rule-based event management more agile and productive by further reducing event noise, by not missing events that are relevant to the situation at hand, and by improving event prioritization and qualification through analysis of even those events that have no corresponding rules in the system.
How does event analytics work?
Once configured, events from HP Operations Bridge or any other event management tool flow into Operations Analytics. Operations Bridge manages and remediates based on data received from sensors, while Operations Analytics powers your root cause analysis and troubleshooting, helping you understand historical trends and identify the future performance and behavior of your systems.
Figure 1 - Operations Bridge works together with Operations Analytics
Operations Analytics then creates a list of the most significant suspected log messages and events and displays them visually in a pane or chart.
This algorithm runs over a user-defined time range for a host or a user-defined group of hosts (a service). The Log and Event Analytics algorithms use a number of different parameters to calculate message and event significance, such as:
Grouping according to text patterns
Specific keywords, both out-of-the-box and user-defined, in text fields and in other data fields such as priority (for example: Exception)
Abnormal behavior (taking seasonality into account)
Repetition and seasonality (to identify insignificant messages)
Distance from problem time (user defined)
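To make these parameters concrete, here is a minimal Python sketch of how such a significance score could be combined. It is not the HP Labs algorithm: the keyword list, the equal weighting of the three signals, the 30-minute window, and the names Message, pattern_of and score are all assumptions made for illustration, and seasonality is left out.

```python
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical keyword list; the product's actual out-of-the-box keywords may differ.
KEYWORDS = ("exception", "error", "failed", "timeout")


@dataclass
class Message:
    timestamp: datetime
    text: str


def pattern_of(text: str) -> str:
    """Collapse standalone numbers and hex ids so similar messages share one text pattern."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<*>", text.lower())


def score(messages, problem_time, window=timedelta(minutes=30)):
    """Return (score, message) pairs sorted from most to least significant."""
    counts = Counter(pattern_of(m.text) for m in messages)
    scored = []
    for m in messages:
        # Keyword signal: 1.0 if any keyword appears in the text.
        keyword_hit = 1.0 if any(k in m.text.lower() for k in KEYWORDS) else 0.0
        # Repetition signal: patterns that repeat often are treated as likely noise.
        rarity = 1.0 / counts[pattern_of(m.text)]
        # Distance signal: closer to the problem time scores higher, 0 outside the window.
        distance = abs(m.timestamp - problem_time)
        proximity = max(0.0, 1.0 - distance / window)
        scored.append((keyword_hit + rarity + proximity, m))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In this sketch a rare message that contains a keyword and was logged close to the problem time rises to the top, while a heartbeat line repeated many times sinks to the bottom.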
The results can be viewed as a graph or in a list format. In both forms you can see the most significant events side by side with the most significant logs.
Next, Operations Analytics looks at the behavior of events; anything abnormal gets a higher significance score. Out of potentially thousands of events and millions of log messages, the user is presented with only the 20 most significant events and/or the 20 most significant log messages from a similar analysis.
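The idea of scoring abnormal behavior and keeping only the top 20 can be sketched as follows. This is an illustrative assumption, not the product's method: it compares the event count in the analyzed hour against counts for the same hour on previous days (a crude stand-in for seasonality) using a z-score. The function names and data shapes are hypothetical; only the limit of 20 comes from the text.

```python
from statistics import mean, pstdev


def abnormality(current_count, baseline_counts):
    """How far the current count sits above its seasonal baseline, in standard deviations."""
    # An event type with no history is compared against a baseline of zero.
    baseline = baseline_counts or [0]
    mu = mean(baseline)
    # Floor the spread at 1.0 so a perfectly flat baseline does not divide by zero.
    sigma = max(pstdev(baseline), 1.0)
    return max(0.0, (current_count - mu) / sigma)


def top_events(current, history, limit=20):
    """Rank event types by abnormality and keep the `limit` most abnormal ones.

    current: {event_type: count in the analyzed hour}
    history: {event_type: [counts for the same hour on previous days]}
    """
    ranked = sorted(
        current,
        key=lambda e: abnormality(current[e], history.get(e, [])),
        reverse=True,
    )
    return ranked[:limit]
```

With this sketch, an event type that normally fires once or twice an hour and suddenly fires 50 times outranks one that fires 100 times every day at that hour.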
Mary, who works in the central operations team, uses an event console to ensure that anything impacting her time system service gets remediated as fast as possible.
For known problems there are well-defined procedures, mostly automated. But for the many remaining events, Mary needs to understand which ones are impacting her service and what needs to be done to correct the situation quickly.
Mary receives many events in a short period of time and starts by focusing on the ‘most significant’ events, as determined by event analytics in Operations Analytics (based on clustering, keywords, prior user classification, and so on).
The first event indicates a problem in several branches with users accessing the ‘time tracking system’.
Looking at the logs that are automatically correlated with the most significant events, Mary understands the context of the event much better and sees that it is related to an automatic update of the Chrome browser. From the log message that Operations Analytics identified as significant along with the relevant events, she can tell which browser versions are in conflict, and she gets additional information that was logged at the time of the detected condition.
Mary goes back to OMi (Operations Bridge), adds a corresponding short note, and creates a ticket that automatically includes all the relevant logs, breaches, etc.
Chris, the expert for the time system, gets the ticket and, while investigating, finds that the software update they rolled out was not the right version and needs to be updated to a newer version.
He downloads and successfully validates the new version and then rolls it out to all users.
This shows how problems outside the universe of already known problems can be found and corrected. It also shows how much faster you get to the root cause when your management software has enough smarts and power to correlate very different types of data, such as events and logs.
About the authors
Alina Gicqueau is the Product Line Manager for Analytics.
Alina Gicqueau is a seasoned product manager who comes from a performance analytics background. For the last 15 years, she has focused on building products for IT Operations that leverage machine learning, dynamic baselining, and correlation algorithms.
Nurit Peres is a functional architect for HP Operations Analytics.
Nurit has many years of experience as an architect and user experience expert in the APM and analytics areas. She specializes in products for enterprise users. Follow her on Twitter @nuritps.
Join the Operations Analytics team at HP Discover London to see a demonstration, get questions answered or just say hello.
Operations Analytics is a Big Data analytics solution that helps IT use all the insights hidden in system silos of monitoring data to resolve the root cause of failures faster and improve future operational performance.