By Udi Shagal, Analytics Product Manager, and Eran Samuni, Analytics R&D Manager, HP Performance Anywhere
When monitoring the health of business applications, IT operations staff need to be able to detect issues quickly and respond efficiently. Identifying and isolating anomalies—and then qualifying probable causes—are important processes that lead to faster resolution and better performance.
One of the compelling features of the HP Performance Anywhere cloud service is its advanced analytics. It uses HP-patented technology for machine learning and predictive analytics and is designed to help users isolate problems quickly using advanced correlations.
The advanced algorithms of Performance Anywhere analytics provide early detection of developing issues and alert an application owner before the business is impacted. An intuitive summary of each detected anomaly shows the business impact, the most probable causes and similar historical anomalies. For further analysis of the anomaly, a drag-and-drop user interface allows you to correlate end-user experience metrics with application metrics so you can isolate problems quickly.
This blog post will take you through the analytics features and simple UI of Performance Anywhere and show you how to automate the anomaly isolation process.
Algorithms that learn
Performance Anywhere collects and monitors many types of metrics, including latency for different layers in the application, CPU and disk utilization, thread counts, open sessions and many other application metrics. The richness of the data can be overwhelming when you are trying to isolate a problem, and that’s where analytics comes to the rescue.
Performance Anywhere uses self-learning algorithms to learn the normal behavior of the different metrics over time. Metric trend and seasonality are automatically detected and an accurate baseline is established for each metric. A baseline represents a “sleeve” of normal behavior for each metric (Figure 1).
Fig. 1: Transaction response time (pink) and its baseline sleeve (gray)
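The idea of a seasonal baseline sleeve can be illustrated with a minimal sketch. This is not the patented Performance Anywhere algorithm; it assumes hourly samples, a fixed daily season, and a sleeve defined as the per-hour mean plus or minus a few standard deviations. The function names `build_sleeve` and `is_breach`, and the `period` and `width` parameters, are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def build_sleeve(history, period=24, width=3.0):
    """Learn a per-slot sleeve of normal behavior (illustrative only).

    history: iterable of (time_index, value) pairs, e.g. hourly samples.
    period:  assumed season length in samples (24 = daily pattern).
    width:   assumed sleeve half-width in standard deviations.
    Returns {slot: (low, high)}.
    """
    buckets = defaultdict(list)
    for t, value in history:
        buckets[t % period].append(value)  # group samples by position in the season

    sleeve = {}
    for slot, values in buckets.items():
        center = mean(values)
        spread = pstdev(values)
        sleeve[slot] = (center - width * spread, center + width * spread)
    return sleeve


def is_breach(sleeve, t, value, period=24):
    """Return True when a sample falls outside the learned sleeve."""
    low, high = sleeve[t % period]
    return value < low or value > high


if __name__ == "__main__":
    # Two weeks of synthetic hourly response times: slower during business hours.
    history = []
    for day in range(14):
        for hour in range(24):
            base = 150 if 9 <= hour <= 17 else 100
            jitter = 5 if day % 2 else -5
            history.append((day * 24 + hour, base + jitter))

    sleeve = build_sleeve(history)
    print(is_breach(sleeve, 10, 152))  # busy-hour value checked against the 10:00 sleeve
    print(is_breach(sleeve, 3, 150))   # the same kind of value checked against the 03:00 sleeve
```

Because the sleeve is learned per time slot, a value that is normal at mid-morning can still count as a breach in the middle of the night, which is something a single static threshold cannot express.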
Detecting and understanding anomalies
Performance Anywhere watches for any behavior outside the baseline sleeve. If the baseline breach is significant enough (typically across multiple metrics), Performance Anywhere sends an anomaly alert. This predictive event allows you to respond before your static thresholds are crossed and, ideally, before the business is impacted.
An Analytics Overview screen (Figure 2) can help you triage and troubleshoot an anomaly. It displays the transactions and locations impacted by the anomaly, the top possible causes as ranked by the system, and a list of similar anomalies. The possible causes include code change events, known issues, a specific layer in the application model, or specific configuration items (CIs) with abnormal metrics.
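One simple way to think about a breach being "significant enough" is to require that several metrics stay outside their sleeves for several consecutive samples. The sketch below works under that assumption only; the product's actual significance test is not described in this post, and `detect_anomaly`, `min_metrics` and `min_consecutive` are hypothetical names.

```python
def detect_anomaly(breaches, min_metrics=2, min_consecutive=3):
    """Find the first sample at which enough metrics show a sustained breach.

    breaches: {metric_name: [bool, ...]} where True means the sample was
              outside that metric's baseline sleeve.
    Returns (sample_index, sorted_metric_names) or None if no anomaly is found.
    """
    if not breaches:
        return None

    n = min(len(flags) for flags in breaches.values())
    streak = {metric: 0 for metric in breaches}

    for i in range(n):
        for metric, flags in breaches.items():
            # Count consecutive out-of-sleeve samples; reset on a normal sample.
            streak[metric] = streak[metric] + 1 if flags[i] else 0

        sustained = [m for m, s in streak.items() if s >= min_consecutive]
        if len(sustained) >= min_metrics:
            return i, sorted(sustained)
    return None


if __name__ == "__main__":
    breaches = {
        "response_time": [False, False, True, True, True, True],
        "jdbc_wait_time": [False, False, False, True, True, True],
        "cpu_utilization": [False] * 6,
    }
    print(detect_anomaly(breaches))
```

Requiring both persistence and agreement across metrics is a common way to avoid alerting on a single noisy spike.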
Fig. 2: Analytics Overview
From this screen, you can drill down to different reports to further investigate the business impact and the root cause. For example, you can see the list of changes that went into a build and which developers made them, or you can drill down to the Analytics Investigation screen to view all abnormal metrics.
The Analytics Investigation view allows you to probe the potential causes of an anomaly, understand abnormal metric behavior and correlate metrics automatically. In the example below (Figure 3), the response time of a synthetic transaction has breached its baseline. At about the same time, a few other metrics demonstrated abnormal behavior. Performance Anywhere uses topological knowledge about the application infrastructure to correlate only relevant metrics and group them into a single anomaly.
Fig. 3: Analytics Investigation. The dark gray area in the chart represents the normal behavior for the selected metric. The light gray bar starting at about 6:30 a.m. indicates the time period when Performance Anywhere detected an active anomaly.
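To make the role of topology concrete, here is a rough sketch of how abnormal metrics could be limited to the components an affected transaction actually depends on. It assumes the topology is available as a simple dependency map, which is an illustration rather than the product's data model; `related_cis`, `group_anomaly` and the CI names are hypothetical.

```python
from collections import deque


def related_cis(topology, start):
    """Collect every CI reachable from `start` by following dependencies.

    topology: {ci_name: [ci_names it depends on]} (assumed representation).
    """
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dependency in topology.get(node, ()):
            if dependency not in seen:
                seen.add(dependency)
                queue.append(dependency)
    return seen


def group_anomaly(topology, start, abnormal_metrics):
    """Keep only abnormal metrics that belong to CIs related to `start`.

    abnormal_metrics: list of (ci_name, metric_name) pairs.
    """
    scope = related_cis(topology, start)
    return [(ci, metric) for ci, metric in abnormal_metrics if ci in scope]


if __name__ == "__main__":
    topology = {
        "checkout_transaction": ["app_server"],
        "app_server": ["db_host"],
        "mail_server": [],
    }
    abnormal = [
        ("app_server", "JDBC Wait Time"),
        ("db_host", "CPU Utilization"),
        ("mail_server", "Queue Length"),
    ]
    print(group_anomaly(topology, "checkout_transaction", abnormal))
```

In this example the mail server metric is excluded even though it is abnormal at the same time, because nothing in the transaction's dependency chain touches that server.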
The user can see mini-trends for each metric and can chart them side by side using a drag-and-drop UI. The user can also check the correlation between different metrics during the anomaly timeframe. In the table in the bottom right of Figure 3, the user can see that the abnormal response time is highly correlated with the “JDBC Wait Time” metric on a Java application server that supports the application and with the system metrics on the database host. This correlation suggests that the problem is most likely a database issue.
To investigate further, the user can drill down to the Diagnostics views in Performance Anywhere to see in which application layer the latency occurs, down to the specific component or line of code.
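A correlation table like the one described above can be approximated with a plain Pearson correlation computed over the anomaly window. The sketch below assumes equally spaced, aligned samples and uses Pearson as a stand-in, since the post does not say which correlation measure the product uses; `pearson`, `rank_by_correlation` and the metric names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("series must have the same length (at least 2)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_by_correlation(target, candidates, start, end):
    """Rank candidate metrics by |correlation| with the target in [start, end)."""
    window = target[start:end]
    scores = {
        name: pearson(window, series[start:end])
        for name, series in candidates.items()
    }
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


if __name__ == "__main__":
    response_time = [1, 1, 2, 4, 8, 9]
    candidates = {
        "jdbc_wait_time": [10, 10, 20, 40, 80, 90],
        "thread_count": [5, 5, 5, 5, 5, 5],
    }
    for name, score in rank_by_correlation(response_time, candidates, 0, 6):
        print(f"{name}: {score:.2f}")
```

Ranking by absolute value keeps strongly negative correlations near the top as well, since a metric that drops while response time climbs can be just as informative.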
Making sense of metrics
When you encounter an application issue, collecting and seeing the data is not enough. You need a simple and quick process to make sense of all the different data points.
Performance Anywhere’s predictive analytics, machine learning and advanced correlation capabilities work together to help automate anomaly isolation processes, so you can address issues quickly and minimize any impact on the business.
Discover how you can monitor the end-user performance and availability of web- and mobile-based apps with the new cloud service (SaaS) offering, HP Performance Anywhere. Sign up for our free public trial program and experience it for yourself.