About a year ago, while on a customer visit, we were discussing IT issues and solutions and the following story came up.
The customer told us that one of their primary applications had begun running slowly a month earlier. They received complaints from users and also saw performance numbers drop in their monitoring tools. The slowdown was inconsistent: one minute response time was fine, and a minute later it took forever… and it continued like this. The pace kept changing even at night, when there are far fewer users. The application owner knew there had been no recent updates or patches for this application, and none of the incoming events seemed to be relevant.
After a couple of days (and many hours of investigation) they accidentally discovered that a server running in debug mode had caused all this trouble.
What would have happened if they had Operations Analytics? What could they have done to shorten the time to resolution? What could they have done to reduce the number of hours invested in solving the issue? Well… a lot!
With HP Operations Analytics they could have viewed a dashboard for this problematic application (you can prepare one for each application up front, or ad hoc as you need it). Operations Analytics collects data all the time, so you can view it and use it whenever you need it.
So when the problem was reported, they could have simply opened the dashboard. There they would have easily seen the rises in response time, just as they saw them in their own monitoring tools. But in Operations Analytics they don’t see only response time; they can also see availability changes, server metrics, event counts, log messages and more.
But that’s not all. By using the time slider they can easily focus on the time when response time started increasing:
The time slider affects all the dashboard panes. This allows the user to look for changes in other metrics and log messages that happened at the same time.
In our case, they would have found that disk IO was higher on one of the application servers and that log message rates went up at about the same time. The playback feature can help pinpoint the exact time when the issue started:
They can then select the time window when the issue started:
For the selected time window, they can now review the log messages that were written. If there are any issues, there is a good chance you can find one or more log messages that explain the root cause. The time-based correlation improves your chances of finding these relevant messages.
Looking at the log messages, it is immediately clear that there are more than a few messages with the word “Debug” on the same server that shows high disk IO. Transactions use this server inconsistently, and therefore performance was intermittent. The cause is now clear, and it took minutes instead of days to figure it out.
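To make the idea of time-based correlation concrete, here is a minimal sketch in plain Python. It is not the Operations Analytics API; the record layout, the server names, the timestamps and the helper functions `messages_in_window` and `keyword_counts_by_server` are all hypothetical and exist only for illustration. It narrows a set of log records to a selected time window and then counts, per server, the messages that contain a keyword such as "debug".

```python
from collections import Counter
from datetime import datetime

# Hypothetical, hand-made records: (timestamp, server, message).
# Real log data would come from whatever log collection is in place.
SAMPLE_LOGS = [
    (datetime(2000, 1, 1, 9, 58), "app01", "INFO request served"),
    (datetime(2000, 1, 1, 10, 1), "app02", "DEBUG entering handler"),
    (datetime(2000, 1, 1, 10, 2), "app02", "DEBUG cache state dumped"),
    (datetime(2000, 1, 1, 10, 3), "app01", "INFO request served"),
    (datetime(2000, 1, 1, 10, 20), "app02", "DEBUG entering handler"),
]


def messages_in_window(logs, start, end):
    """Return the records whose timestamp falls in the half-open range [start, end)."""
    return [record for record in logs if start <= record[0] < end]


def keyword_counts_by_server(logs, keyword):
    """Count, per server, the messages that contain keyword (case-insensitive)."""
    needle = keyword.lower()
    counts = Counter()
    for _timestamp, server, message in logs:
        if needle in message.lower():
            counts[server] += 1
    return counts


if __name__ == "__main__":
    # Assumed window: the ten minutes in which response time started to rise.
    window = messages_in_window(
        SAMPLE_LOGS,
        datetime(2000, 1, 1, 10, 0),
        datetime(2000, 1, 1, 10, 10),
    )
    for server, count in keyword_counts_by_server(window, "debug").most_common():
        print(server, count)
```

A server that stands out in this count and also shows unusual system metrics in the same window is a natural first suspect, which is the reasoning the dashboard supports visually.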
Metrics and Log messages for the application
Metrics and Log messages – Focus on the start time
Log messages (with DEBUG) – Focus on the start time
HP Operations Analytics shortens the time to resolve business issues with a single-pane-of-glass view. It presents application metrics, system metrics and log messages in one dashboard with a time-based focus, letting you drill down from a performance issue to the log messages that explain it.
Architect and User Experience expert with more than 10 years of experience in designing complex applications for all platforms. Currently in Operations Analytics - Big data and Analytics for IT organisations. Follow me on Twitter @nuritps