Guest post by Nagendra Solanky, Distinguished Technologist, HP IT Global Data Services
Like most IT shops, HP IT Global Data Services (GDS) operates with quite limited resources.
We are, however, very good at what we do: we manage 25,000 databases and nearly 50,000 servers across six fully redundant data centers, delivering five-nines availability and scalability for the database application servers we provide to the business. Not surprisingly, our operations are highly automated, with monitoring and management tools continuously looking for issues in the environment and correcting them before they become big problems.
Despite all the automation and other tech goodies we have deployed, outages and performance problems do sometimes occur. That’s just life in IT, right? And when they do occur, that’s when the real test begins. How quickly you troubleshoot and get back to normal operations can define your reputation as an IT professional.
We have recently been evaluating a new tool that we think will help us dramatically improve our Mean Time to Recovery (MTTR). We’re pretty excited about its potential, and I wanted to share with you why we think it will make a big difference to how we troubleshoot.
Anatomy of an outage
Last fall a critical, highly visible customer-facing online application called Order Status Suite (OSS) experienced a severe performance degradation. OSS provides customers and partners with up-to-date information about products they have ordered, taking information from multiple systems such as order management, shipping, picking, and fulfillment, and creating a central hub.
When an outage like this occurs, we are essentially out of business: we're losing money and credibility. Naturally, our first priority is to bring the app back up.
We do not always devote resources to root cause analysis. Everyone is 100 percent busy, and combing through log information can be very time consuming. A DBA assigned to the task might not be able to access certain servers, because they belong to a different group. Then: which logs should you look at? Which directory? Sometimes you can't even open them, because the server is busy and doesn't have the memory. You must surmount many hurdles; the hunt is rarely easy.
In this case, we did perform root cause analysis, assembling a team of experts from five towers: Application, Database, Unix, Network, and Storage. It took us 36 hours to identify the root cause of the outage — a change to how the database parallelism parameter was configured — and get the application back up. And it took a full two weeks to clear the transaction backlog of orders that were not processed during the outage.
How Operations Analytics improves MTTR
Since then, we have been evaluating HP Operations Analytics 2.1, and the type of outage I described above is one of 12 use cases we evaluated. The results were very promising: based on our evaluation, we determined that we could have identified the root cause of the outage in less than 30 minutes. Here's how it works:
In Ops Analytics, we collect transaction performance metrics from the application server, such as throughput and response times. From the database server, we collect database performance metrics such as wait chains and user I/O, and from the operating system we collect OS performance metrics such as run queue length, average I/O, and CPU consumption (Figure 1).
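The post doesn't expose Ops Analytics' internal data model, so here is a minimal Python sketch of that collection step; the record shape, tier names, metric names, and values are illustrative assumptions, not the product's actual schema or API:

```python
import time
from dataclasses import dataclass

# Hypothetical record shape; Ops Analytics' real schema is assumed, not documented here.
@dataclass
class MetricSample:
    timestamp: float   # epoch seconds
    tier: str          # "app", "db", or "os"
    name: str          # e.g. "throughput_tps", "wait_chains", "run_queue"
    value: float

def collect_samples(now=None):
    """Stand-in collectors for each tier; real agents would query the
    application server, database, and OS instead of returning fixed values."""
    now = now or time.time()
    return [
        MetricSample(now, "app", "throughput_tps", 420.0),
        MetricSample(now, "app", "response_time_ms", 180.0),
        MetricSample(now, "db", "wait_chains", 3.0),
        MetricSample(now, "db", "user_io_ms", 12.0),
        MetricSample(now, "os", "run_queue", 5.0),
        MetricSample(now, "os", "cpu_pct", 62.0),
    ]

samples = collect_samples(now=1_700_000_000.0)
by_tier = {}
for s in samples:
    by_tier.setdefault(s.tier, []).append(s.name)
print(by_tier)
```

Timestamping every sample with a common clock is what makes the later cross-tier correlation possible.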
Fig. 1: Operations Analytics data collection
When OSS experienced the severe performance degradation, application throughput dropped suddenly. System metrics also showed the impact: transaction response times spiked across the board, and the OS blocked queue, DB user I/O response times, and wait chain counts all spiked at the same time.
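A minimal sketch of how such spikes can be flagged programmatically; the trailing-baseline z-score rule and the sample latency series below are my own assumptions for illustration, not how Operations Analytics actually scores anomalies:

```python
import statistics

def flag_spikes(series, baseline_len=10, n_sigmas=3.0):
    """Flag points that deviate more than n_sigmas from a leading baseline.
    A toy stand-in for the anomaly highlighting a monitoring tool performs."""
    baseline = series[:baseline_len]
    mean = statistics.mean(baseline)
    sd = statistics.pstdev(baseline) or 1e-9  # avoid divide-by-zero on flat data
    return [i for i, v in enumerate(series[baseline_len:], start=baseline_len)
            if abs(v - mean) / sd > n_sigmas]

# Steady DB user-I/O latency (ms), then a sudden spike like the one we saw:
user_io_ms = [12, 11, 13, 12, 12, 11, 13, 12, 12, 11, 12, 95, 110, 120]
print(flag_spikes(user_io_ms))  # → [11, 12, 13]
```

Running the same rule over every collected metric is one simple way that simultaneous spikes across tiers would all light up at the same timestamps.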
In Operations Analytics, all of this information is correlated (Figure 2).
Fig. 2: Operations Analytics provides a single pane view to IT operations
The lower left pane of Figure 2 shows the relevant logs from the GDS Database Alert log, displaying a parameter change and clearly identifying it as the root cause (Figure 3).
Fig. 3: Database Alert Log identifies the parameter change as the root cause
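That correlation of a metric anomaly with nearby alert-log entries can be sketched as follows; the log lines, the parameter name `parallel_max_servers`, the error text, and the 15-minute window are invented for illustration and are not taken from the actual GDS Database Alert log:

```python
from datetime import datetime, timedelta

# Hypothetical alert-log lines; the real GDS Database Alert log format is assumed.
ALERT_LOG = [
    "2013-10-14 02:10:05 checkpoint complete",
    "2013-10-14 02:31:47 ALTER SYSTEM SET parallel_max_servers ...",  # parameter change
    "2013-10-14 02:32:10 sessions queuing on parallel execution servers",
]

def log_entries_near(log_lines, anomaly_time, window_minutes=15):
    """Return log entries within +/- window of the metric anomaly —
    the correlation step the tool automates across all collected logs."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in log_lines:
        ts = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        if abs(ts - anomaly_time) <= window:
            hits.append(line)
    return hits

anomaly = datetime(2013, 10, 14, 2, 33)  # when the metric spikes were flagged
for line in log_entries_near(ALERT_LOG, anomaly):
    print(line)
```

Narrowing thousands of log lines down to the handful written around the anomaly is what surfaces the parameter change without anyone manually browsing directories.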
One dashboard for fast troubleshooting
With Ops Analytics, all the relevant information is brought into one dashboard in near real time. Database, storage, application, and network metrics: we can collect as much as we want, and it can all be analyzed with the "time machine" replay feature, which lets you scan your operations performance with a click of your mouse.
This means we don't have to assemble a team of experts from each group, because the tool does the correlation for us. We no longer have to browse through log files, and no one has to log on to some server, find the right directory, and scroll to the right point in a log. It's all on the dashboard. Even someone who isn't an expert can spot an outlier metric. We can then call in just the right resources to go to the next level and troubleshoot accurately, cutting downtime significantly.
Reduce outages and improve service by delivering fast analytics of operational Big Data. Click here to learn how Operations Analytics helps you combine all your operational data, gathering metrics, events, topology, and log file data from all your IT systems into a comprehensive view, so you can make use of your existing investments. We have an upcoming webinar that you won't want to miss! Join Gary Brandt, HP Global IT Functional Architect, to learn how HP IT incorporates best operational practices to collect and analyze structured and unstructured data using big data analytics at enterprise scale.