There are blogs, white papers and KBs from VMware and VMware enthusiasts all across the web providing steps to resolve high CPU ready utilization in VMware guests. The problems they cover generally fall into the following categories:
I have searched the web for solutions to the above problems but have not found the answers I need, especially for the case where everything appears to be normal yet ready utilization is really high.
So here is what I have found up until now...
Before I proceed, here is a quick definition of ready utilization: this VMware counter is the percentage of time in the last interval that the VM was in a ‘ready-to-run’ state but did not actually run, because it was not given CPU time by the host. Ready time is a kind of wait time for the VM, but it must not be confused with wait time caused by I/O waits inside the guest OS – this is primarily a scheduling problem. As a rule of thumb, keep ready utilization at no more than 5 percent: anything between 0 and 5 percent is a warning, and anything above 5 percent should be looked into immediately.
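Tools like esxtop and vPV show %RDY directly, but raw vCenter stats expose ready time as a summation counter in milliseconds per sample interval. VMware documents the conversion: ready % = summation (ms) / (interval length in ms) × 100, where real-time charts use a 20-second interval. Here is a minimal sketch of that conversion (the function name and sample values are my own, for illustration):

```python
def ready_percent(ready_summation_ms: float, interval_s: float = 20.0) -> float:
    """Convert a vCenter cpu.ready.summation value (milliseconds of ready
    time accumulated over one sample interval) to a ready percentage.
    Real-time charts use a 20 s interval; historical roll-ups use longer
    intervals (e.g. 300 s), so pass the interval that matches your data."""
    return ready_summation_ms / (interval_s * 1000.0) * 100.0

# 1000 ms of ready time within a 20 s real-time sample is exactly the
# 5 percent threshold discussed above.
print(ready_percent(1000))  # 5.0
```

Note that for a multi-vCPU VM the VM-level summation aggregates all vCPUs, so per-vCPU figures (as esxtop shows them) can differ from the VM-level number.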
CPU Ready Util (%RDY for esxtop fans) can be high for the following reasons –
Resource pool limits that work against the guest OS’s need for processor time – for example, the VM admin places your VMs into a resource pool with a very low CPU limit.
vCPU allocations that do not map well onto the core count of the host processors – for example, running 3 or more 8-vCPU VMs on a host with 6-core physical CPUs.
VMs of varying sizes (1-vCPU, 2-vCPU, 4-vCPU, 8-vCPU and combinations thereof) all running on one host, when CPU over-commitment is also in place. If the server has 2 4-core processors and you allocate a mix of 1-vCPU, 2-vCPU and 4-vCPU VMs beyond the 8 available cores, there is a good chance you will see VMs with high ready utilization. With over-commitment alone, but no mix of differently sized VMs, ready utilization due to core contention will still be present but can be expected to stay within reasonable limits.
Excess allocation of vCPUs to VMs, beyond what the guest OS/application needs.
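The over-commitment level behind the second and third points is just the ratio of allocated vCPUs to physical cores. A quick sketch of that arithmetic, using a hypothetical VM mix on the 2 × 4-core host from the list above (the VM sizes here are made up for illustration):

```python
def overcommit_percent(vcpus_allocated: int, physical_cores: int) -> float:
    """vCPU over-commitment as a percentage of available physical cores.
    100 means allocation exactly matches the core count."""
    return vcpus_allocated / physical_cores * 100.0

# Hypothetical mix of 1-, 2- and 4-vCPU VMs on an 8-core host.
vm_vcpus = [1, 1, 2, 2, 4, 4]  # 14 vCPUs in total
ratio = overcommit_percent(sum(vm_vcpus), physical_cores=8)
print(f"{ratio:.0f}%")  # 175%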
Here’s a case study. I am running a 2-host VMware cluster and find a few VMs in my cluster constantly running at high CPU-ready utilization (>5 percent). These show up in HP’s free tool Virtualization Performance Viewer (vPV) as seen below – for more details about this free tool go to http://www.hp.com/go/vpv.
NOTE: See the VMs I have marked out below in the graphic (click to enlarge) and note their ready utilization (%RDY).
TIP: Do you notice how the VMs with ready utilization appear in bright yellow above? The standard vPV settings mark all VMs with less than 5 percent ready utilization in green. This can be changed so that anything above 0 percent ready utilization falls on the green/yellow/red colour spectrum (with VMs nearing 10 percent ready utilization shown in deep red). The setting is in /opt/OV/newconfig/OVPM/VCENTER_GC_Integration.xml. Just comment out line 95 in this file as shown below (no need to restart vPV; reloading the page is enough):
Ok, back to the problem – of the boxes highlighted above, note that 2 VMs have 8 vCPUs allocated each, and the VM with the highest ready-CPU has 2 vCPUs allocated. Also note that these VMs do not actually have high CPU utilization, as can be seen from the tree-map visual below showing the same cluster (in expanded form) – mostly green; the lightest green box (indicating the busiest VM) shows 15.74 percent CPU utilization.
The first step is to check the configuration of the host running this VM. The vPV report gives us an indication of the level of over-commitment.
Here’s another vPV tip – to find out the host a VM is running on, open up the VM->Status report – the host name is mentioned in the location details.
The over-allocation of CPU at the host level is only 161 percent of the available CPUs – while this definitely over-commits the available CPUs, the level is not particularly high, considering that with multi-vCPU VMs running it is considered acceptable to go up to and beyond 200 percent.
Also note that CPU utilization for this host is low to moderate (<20 percent), as seen in the vPV workbench. An interesting side observation is the high IOPS (especially writes) on the host.
As a next step I used the vPV host-configuration report, which, among other information, also shows the vCPU allocation of each guest. Sure enough, I found a lot of VMs with varied vCPU allocations in this setup. However, this was confirmed not to be causing much of a problem, because core contention is very low or zero on most occasions on this server (as ascertained from the vSphere client).
NOTE: core contention is a situation wherein a VM is unable to run due to co-scheduling constraints – essentially, an 8-vCPU or 4-vCPU VM does not always get all of its CPUs to run at once. It has been documented and expounded in several blogs that VMware has addressed this with ‘relaxed co-scheduling’.
So we’ve ruled out host-level over-commitment and core-count contention as causes of the high ready time. I am also not setting any limits on my VMs, either at VM level or at resource-pool level, so my VMs cannot be constrained by such limits. However, there’s still one thing to check – co-stop (%CSTP).
NOTE: co-stop is a counter that applies to SMP virtual machines; it measures the amount of time a vCPU is deliberately stopped so that the VM’s other vCPUs can catch up. Unlike in the physical world, vCPUs can be descheduled and drift out of sync (skew). On VMs that have been allocated multiple vCPUs which are not all used, the running vCPU may move ahead while the idle vCPUs fall behind – so they need to catch up at some point.
In this case – and this is where I had to turn to esxtop to confirm the %CSTP values – the VMs with high ready utilization turned out to have really high co-stop values too. The simple recommendation here is to reduce the number of vCPUs allocated to these VMs, so that their ready time drops accordingly.
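The triage logic here is simple enough to sketch as a rule: an SMP VM showing both high %RDY and high %CSTP is a candidate for vCPU reduction. The VM names, counter values and the 3 percent co-stop threshold below are illustrative assumptions (the 5 percent ready threshold is the one used throughout this post):

```python
RDY_THRESHOLD = 5.0   # percent ready time worth investigating (per this post)
CSTP_THRESHOLD = 3.0  # assumed co-stop threshold, a common rule of thumb

# Hypothetical esxtop-style readings for three VMs.
vms = [
    {"name": "vm-a", "vcpus": 8, "rdy": 9.2, "cstp": 7.5},
    {"name": "vm-b", "vcpus": 2, "rdy": 1.1, "cstp": 0.2},
    {"name": "vm-c", "vcpus": 8, "rdy": 6.8, "cstp": 5.9},
]

for vm in vms:
    if vm["rdy"] > RDY_THRESHOLD and vm["cstp"] > CSTP_THRESHOLD:
        # High ready time *and* high co-stop on an SMP VM points at
        # over-allocation: fewer vCPUs are easier to co-schedule.
        print(f'{vm["name"]}: consider reducing vCPUs below {vm["vcpus"]}')
```

Here vm-a and vm-c would be flagged, while vm-b (low on both counters) is left alone.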
vPV is the tool that helped to identify high-ready time on the VMs in the cluster and triage the problem, setting the ball rolling for me.
My next step, therefore, is to work with the owner of these VMs to reduce their allocated vCPUs – which is actually the reason I wrote this blog. I plan on using another tool from the HP software arsenal, Service Health Optimizer (SHO), to suggest a right-size for the vCPUs of these VMs based on their demand trend. In my setup, SHO already marks these VMs as ‘over-sized’, which supports my case study – but more on this later.
Ramkumar Devanathan (twitter: @rdevanathan) is Product Manager for HPE Cloud Optimizer (formerly vPV). He was previously a member of the IOM Customer Assist Team (CAT), providing technical assistance to HP Software pre-sales and support teams for Operations Management products including vPV, SHO and VISPI. He has more than 14 years of experience in this product line, working in roles ranging from developer to product architect.