Editor’s note: This is a guest post by Eli Eyal, Operational Support Services Manager at Playtech, the world’s largest supplier of online gaming and sports betting software and an HP Software customer.
IT environments experience a very high rate of change today, much more than they ever have before. Rapid change requires rapid adaptation, and that is true of monitoring systems as well. In my role as the Operational Support Services (OSS) manager at Playtech, I need to make smart decisions about how to keep pace. I see myself as both a vendor and a customer—the investments I make in technology platforms and tools must provide value to the service that I offer the business.
Monitoring is the eyes and ears of the business and a top priority. We need to maintain the monitoring capabilities that already exist in Playtech’s IT environment and find ways to add monitors for the new services and applications that are introduced almost daily. We always need more from our monitoring system: more information with greater automation that will allow us to detect and remediate faults more quickly.
Dynamic monitoring for dynamic IT
Our analysis revealed that our monitoring system needed to adapt more to changing conditions, and that required us to focus on two key areas: dynamic monitoring and predictions.
Creating such an adaptive monitoring system using the standard agent or agentless monitoring tools is very hard. Monitoring is usually performed on static sites, where we check whether the application or a server is running correctly or at fault, usually based on traffic-light status indicators. But in our constantly changing IT environment, standard monitoring methods that report on static objects like server CPU usage or memory consumption of our Java application will miss many things that could help us prevent the next downtime.
Monitor your business baseline
Dynamic monitoring and predictions are both achieved by looking a bit beyond our static IT and instead considering what it is used for: our business. If you examine how your IT environment is used, you will probably find, as we did, that there is a seasonal pattern. This usage pattern enables monitoring capabilities that standard static monitoring cannot provide.
Let’s take a simple metric like business usage, which is the number of users that are using the service. The performance of your IT application can be measured not only by whether it is up or down but also by how many users are using it. Depending on the service that your application provides, you will find that it is used more at some points in time and less at others, in a pattern that repeats over the course of a week, month or year. For example, you might find that usage is at its lowest point every Saturday, but on Monday it’s at its highest.
By harnessing this information, you can establish a baseline of known behavior and monitor subsequent behavior for anomalies. These anomalies can indicate problems in places you never thought you could monitor before: your ISP, your external links to other vendors and even changes in your own business service.
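To make the baseline idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a description of the tooling we run: the function names, the hourly (weekday, hour) slots, the three-standard-deviation tolerance and the sample numbers are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, stdev


def build_baseline(samples):
    """Group historical (timestamp, active_users) samples by (weekday, hour)
    and return {slot: (mean, standard deviation)} for each slot."""
    buckets = defaultdict(list)
    for ts, users in samples:
        buckets[(ts.weekday(), ts.hour)].append(users)
    return {
        slot: (mean(values), stdev(values))
        for slot, values in buckets.items()
        if len(values) >= 2  # stdev needs at least two observations
    }


def is_anomaly(baseline, ts, users, tolerance=3.0):
    """Return True when the observed value is more than `tolerance`
    standard deviations away from the baseline for that weekday and hour.
    The tolerance of 3.0 is an assumed default, not a recommendation."""
    slot = (ts.weekday(), ts.hour)
    if slot not in baseline:
        return False  # no history for this slot, so nothing to compare with
    expected, spread = baseline[slot]
    spread = max(spread, 1.0)  # avoid a zero-width band on perfectly flat history
    return abs(users - expected) > tolerance * spread


# Made-up history: four weeks of 10:00 samples, busy Mondays and quiet Saturdays.
history = []
start = datetime(2024, 1, 1, 10)  # a Monday
for week in range(4):
    monday = start + timedelta(weeks=week)
    saturday = monday + timedelta(days=5)
    history.append((monday, 1000 + 10 * week))
    history.append((saturday, 200 + 5 * week))

baseline = build_baseline(history)
next_monday = start + timedelta(weeks=4)
print(is_anomaly(baseline, next_monday, 1020))  # close to a typical Monday
print(is_anomaly(baseline, next_monday, 400))   # sharp drop for a Monday
```

Because the expected value is looked up per weekday and hour, the same reading can be flagged on one day and accepted on another, which a single static threshold cannot do.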
In my case, monitoring from different angles gives me the ability to understand whether there is a problem and determine its priority. It also helps me identify a root cause twice as quickly, especially when it’s not in my environment. For example, my business relies on an ISP to connect users to our systems. When there is a problem with our ISP, standard monitoring provides no indication of it, because it monitors the environment from the inside. But with baseline monitoring in place, I can see that there is a drop in my business activity, which is obviously a high-priority issue. Since there are no accompanying infrastructure alerts, I can start to search for the root cause outside of my environment, including checking external dependencies such as my ISP.
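The reasoning in this example can be sketched as a small decision function. This is a hypothetical illustration only: the `Snapshot` fields, the `triage` function and the wording of the returned labels are assumptions made for the sketch, and the branch for internal alerts without a business drop goes slightly beyond what is described above.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    business_drop: bool         # baseline monitor sees activity below the expected range
    infrastructure_alerts: int  # open alerts from the internal monitors


def triage(snapshot):
    """Suggest where to start looking, combining the business view
    with the internal infrastructure view."""
    if not snapshot.business_drop:
        if snapshot.infrastructure_alerts:
            # Assumed label: something is wrong inside, but users are not affected yet.
            return "internal issue, no business impact yet"
        return "normal"
    if snapshot.infrastructure_alerts:
        return "high priority: start with internal alerts"
    # Business activity dropped but nothing inside is alerting,
    # so the cause is probably outside the environment.
    return "high priority: check external dependencies such as the ISP"


print(triage(Snapshot(business_drop=True, infrastructure_alerts=0)))
```

The last branch is the ISP case: the business metric is the only signal, and the absence of internal alerts is what points the search outward.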
Prediction is the name of the game
The benefits of having your dynamic usage KPIs monitored are enormous. Not only do you no longer have to tune any static thresholds, which reduces your maintenance to almost nothing, but the baseline business activity statistics also let you know what should happen tomorrow, next week and next month. You can also see the real-time behavior in front of you and compare it to your predicted metrics. If actual behavior departs from the prediction because of a fault anywhere, you can move as quickly as possible to find that fault.
In IT, knowledge is power. Being able to monitor things that I never could before, and to provide more information, makes monitoring a more powerful tool for my business.