IT Operations Management (ITOM)

Monitor the job queue, ascertain system bottlenecks - best practice #1

Ramkumar Devana

Let's start the new year with a re-reading of the basics :).


How many times have you had a red alert that indicated nothing more than high CPU usage? Don't you wish the alert were 'intelligent' enough to tell you that the problem is only transient and not really a bottleneck?


In the normal course of a server's operation, spikes in usage levels are quite common and are in fact a good use of the money spent on the horsepower :) In this blog article I will discuss how to differentiate these 'spikes' from a real bottleneck situation.




Before I go further - what really is a bottleneck?

"bottleneck is a phenomenon where the performance or capacity of an entire system is limited by a single or limited number of components or resources."

Source: Wikipedia


OK, now that we have that out of the way, why is a bottleneck different from a high usage situation? A high resource usage situation indicates potential problems in the future. A bottleneck is a current problem, and there is no easy way to restore the system to equilibrium unless a few 'offending' processes are terminated, which amounts to an unplanned outage.


For example, if you are just starting up MS Outlook on your laptop, there is a high CPU usage situation, but it will pass, since only the initial loading takes time. If, however, you are stuck with a Windows 'hang', that could be due to a bottleneck - the limiting factor there being the number of CPU cores on your laptop.



Why does the number of CPU cores matter? The more cores you have, the more processes can be executed concurrently on your server. So if you have only a few cores on your server, you can run only proportionately few processes at once. If there are too many active processes, they are queued for processing in the job queue, also known as the CPU run-queue.


If you have a long run-queue exceeding the number of cores on the system, then potentially there's a problem.

If the run-queue is short, near zero, the problem is somewhat different: you spent thousands of dollars on a system and it is lying idle.
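
As a rough illustration of these two rules of thumb, here is a minimal Python sketch. The function name `classify_run_queue`, the labels it returns and the `idle_threshold` default of 0.5 are assumptions made for illustration, not values taken from any product, so tune them for your own systems.

```python
def classify_run_queue(run_queue: float, num_cores: int,
                       idle_threshold: float = 0.5) -> str:
    """Compare the average run-queue length with the number of CPU cores."""
    if num_cores < 1:
        raise ValueError("num_cores must be at least 1")
    if run_queue > num_cores:
        # More runnable processes than cores: work is waiting for a CPU.
        return "potential bottleneck"
    if run_queue <= idle_threshold:
        # Hardly anything is ever waiting: capacity is sitting unused.
        return "underused"
    return "healthy"


# Example: a 4-core server observed with three different queue lengths.
for queue in (9.0, 2.0, 0.1):
    print(queue, classify_run_queue(queue, num_cores=4))
```

A single reading says little on its own; the comparison is more meaningful when the run-queue value is an average over a collection interval.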


There's lots and lots of white paper material that will tell you this -


If you use HP Operations Manager / Agents, you can use the GBL_RUN_QUEUE, GBL_CPU_TOTAL_UTIL and GBL_NUM_CPU metrics to monitor for and detect a CPU bottleneck. You will find an example implementation in the HP OM InfraSPI policies (SI-CPUBottleneckDiagnosis policy).
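
To show how these three metrics could be combined, here is a minimal Python sketch. It does not reproduce the logic of the SI-CPUBottleneckDiagnosis policy; the `Sample` class, the function name, the 90% utilization threshold and the requirement of three consecutive samples are all assumptions chosen for illustration. The fields only mirror the kind of values that GBL_CPU_TOTAL_UTIL and GBL_RUN_QUEUE report, and `num_cpu` stands in for GBL_NUM_CPU.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Sample:
    cpu_util: float   # total CPU utilization in percent (cf. GBL_CPU_TOTAL_UTIL)
    run_queue: float  # run-queue length (cf. GBL_RUN_QUEUE)


def is_cpu_bottleneck(samples: Sequence[Sample], num_cpu: int,
                      util_threshold: float = 90.0,
                      min_consecutive: int = 3) -> bool:
    """True only if the latest `min_consecutive` samples are all saturated.

    A sample counts as saturated when utilization is at or above the
    threshold AND the run queue is longer than the number of CPUs.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(samples) < min_consecutive:
        return False  # not enough history to call the condition sustained
    recent = samples[-min_consecutive:]
    return all(s.cpu_util >= util_threshold and s.run_queue > num_cpu
               for s in recent)


# A transient spike versus a sustained condition on a 4-CPU system.
spike = [Sample(35, 1), Sample(98, 6), Sample(40, 1)]
sustained = [Sample(95, 7), Sample(97, 9), Sample(99, 12)]
print(is_cpu_bottleneck(spike, num_cpu=4))      # False
print(is_cpu_bottleneck(sustained, num_cpu=4))  # True
```

Requiring several consecutive saturated samples is what separates a passing spike from a bottleneck in this sketch: one busy interval on its own never raises the flag.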


Can we do something similar with memory monitoring? Should we? The answer is 'yes' to both. However, we look at this a bit differently: alongside memory usage levels, we use pageouts to ascertain how persistent the resource constraint is. If memory usage is high and the number of pageouts is high, that is a good indication of a memory bottleneck (SI-MemoryBottleneckDiagnosis policy).
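
The same idea for memory can be sketched in a few lines of Python. Again, this is only an illustration and not the logic of the SI-MemoryBottleneckDiagnosis policy; the function name, the labels and both threshold defaults (90% usage, 10 pageouts per second) are assumptions that you would need to adjust for your platform.

```python
def classify_memory(mem_util_pct: float, pageouts_per_sec: float,
                    util_threshold: float = 90.0,
                    pageout_threshold: float = 10.0) -> str:
    """Combine memory usage with the pageout rate to classify a reading."""
    high_util = mem_util_pct >= util_threshold
    high_pageouts = pageouts_per_sec >= pageout_threshold
    if high_util and high_pageouts:
        # Memory is full AND the system is actively pushing pages out.
        return "memory bottleneck"
    if high_util:
        # Memory is full but nothing is being forced out: just busy.
        return "high usage"
    return "ok"


print(classify_memory(95.0, 40.0))  # memory bottleneck
print(classify_memory(95.0, 0.0))   # high usage
print(classify_memory(60.0, 0.0))   # ok
```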


Let me know if you like these kinds of tips for monitoring system performance. While this might be common knowledge, I do get asked this question quite often by customers or people new to system monitoring.


Feel free to reach out to me in the comments section below if you have experienced how a bottleneck is different from a high usage situation. I would also love to hear from you if you have any questions related to the topic.


Learn more

If you are interested in finding out how the HP System Management suite of products can help you monitor your systems infrastructure, visit the HP Operations Manager i software site or the HP System Management sites.

HPE Software Rocks!
About the Author

Ramkumar Devana

Ramkumar Devanathan (twitter: @rdevanathan) is Product Manager for HPE Cloud Optimizer (formerly vPV). He was previously a member of the IOM-Customer Assist Team (CAT) providing technical assistance to HP Software pre-sales and support teams with Operations Management products including vPV, SHO, VISPI. He has experience of more than 14 years in this product line, working in various roles ranging from developer to product architect.

Super Collector

This really helps!  Looking for more like this :)

HPE Blogger

Thank you, Ramkumar. Do you have any further tips for looking at bottlenecks on virtual machines?

HPE Expert

Hi Stefan, yes there are lots of cases where virtualization can throw a curve ball in performance monitoring. I will write a blog article on this soon - thanks for the idea.

HPE Expert

Hi Ram,


thanks a lot for your refreshing blog on CPU/memory monitoring.


Are there any additional considerations to care about, if we talk about virtualized environments - meaning we talk about vCPUs and vMEM instead of physical?

Does HP provide corresponding monitoring policies / tools to cover that as well?


Best regards,



HPE Expert

Hi Patrik, appreciate your reading through this blog post. Thanks for bringing up the point about virtual machines.


I plan to cover some detail around monitoring / detecting bottlenecks in virtual systems, in a different article. However the important thing to keep in mind is that metrics like CPU-ready time and other virtualization-induced wait times need to be taken into consideration. We must not forget that there are many problems at the physical layer (the host running the virtual machines) as well as right-sizing of the VMs. For a start, the readers can have a look at related points from one of my earlier blog posts here.


Hi Ram,


Good article!!!


Just wondering: what you pointed out is for OM Agents. How about SiteScope? Are there any metrics to look at for CPU and memory bottlenecks if we only have SiteScope?




HPE Expert

Hi, yes, it is possible to instrument this using something like a custom script monitor in SiteScope. The default CPU monitor looks only at CPU usage. For detailed monitoring like this, it is better to go with agent-based collection.


There's a trade-off that you typically make between remote monitoring and agent-based monitoring in general. Remember that even if you did a remote check with SiteScope once every 10 minutes, you would get only the data collected at that instant, so you might miss a spike that happened earlier within the 10-minute period.


While you can reduce the polling interval, that in turn increases the logins and logouts on the system, especially on UNIX/Linux. There is definitely a network latency tax as well.
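
The point about polling intervals can be illustrated with a small Python simulation. The numbers are made up for the example (a 20% baseline, a one-minute spike to 100%, one reading per second), and `poll` is a hypothetical helper, not a SiteScope function.

```python
def poll(series, interval):
    """Return the values a remote check would see when polling every `interval` seconds."""
    return series[::interval]


# One reading per second for 30 minutes: 20% baseline CPU usage...
cpu = [20.0] * 1800
# ...with a one-minute spike to 100% starting at second 290.
for t in range(290, 350):
    cpu[t] = 100.0

ten_minute_view = poll(cpu, 600)  # checks at seconds 0, 600 and 1200
one_minute_view = poll(cpu, 60)   # checks at seconds 0, 60, 120, ...

print(max(cpu), max(ten_minute_view), max(one_minute_view))
```

In this made-up series the 10-minute check never sees anything above the baseline, while the 1-minute check catches the spike at the price of ten times as many remote logins.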


Hi Ram, nice post!


Thanks for your guidelines. Your link is very helpful.


Hi Ram Kumar,


It really makes this clear to understand.

Thanks for your post.


But I would like more like this in the forum.




