IT Operations Management (ITOM)

Right and tight, makes your cloud fight!

Ramkumar Devana

A monitoring team typically watches the 'base' parameters of a server: CPU usage, memory usage, disk space used, syslog/event logs, and any SNMP traps. This was the norm and continues to be so today. The idea is that monitoring these parameters gives an overall picture of the server. This worked well until virtualization came along. In a virtual world, we cannot stop at monitoring the guest OS or the virtual machines. The hypervisors are the source of the power, so they must be monitored too, along with the distribution of resources and the latencies in data communication.


For instance, disk and network IO in a VM rely on the virtual CPU, whereas on physical x86 boxes this is not really the case: IO can happen separately. This means that if a VM does not have enough CPU and memory, its data IO can suffer.

Take the case of a host with 10 VMs, each allocated 2 vCPUs. CPU is typically over-allocated, so even though the host has only 16 cores, 20 vCPUs are allocated to the VMs in total. In this case, if a few VMs end up using all the vCPUs allotted to them, the other VMs will not get enough CPU time and these 'machines' will run slowly, which shows up as application slowness. This can be worsened by rules that prioritize certain VMs over others (like high-priority processes running on a single OS kernel). Yet with standard monitoring, none of this is visible from the parameters discussed above: CPU, memory, and disk IO will all seem normal.
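To make the over-allocation in this scenario concrete, here is a minimal sketch that computes the ratio of allocated vCPUs to physical cores for a host. The VM names and the `cpu_overcommit_ratio` helper are made up for illustration; only the counts (ten VMs, 2 vCPUs each, 16 cores) come from the example above.

```python
# Hypothetical inventory for illustration: ten VMs with 2 vCPUs each on a 16-core host.
HOST_CORES = 16
vm_vcpus = {f"vm{i:02d}": 2 for i in range(1, 11)}


def cpu_overcommit_ratio(vcpus_per_vm, physical_cores):
    """Return allocated vCPUs divided by physical cores.

    A value above 1.0 means more vCPUs are allocated than the host has cores.
    """
    if physical_cores <= 0:
        raise ValueError("physical_cores must be positive")
    return sum(vcpus_per_vm.values()) / physical_cores


total_vcpus = sum(vm_vcpus.values())
ratio = cpu_overcommit_ratio(vm_vcpus, HOST_CORES)
print(f"allocated vCPUs: {total_vcpus}, cores: {HOST_CORES}, ratio: {ratio:.2f}")
```

A ratio above 1.0 is not a problem in itself. It only says that contention is possible, which is why the host-level view matters in addition to the per-guest metrics.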

Take the case of a host with 10 VMs, 9 of which are allocated 2 vCPUs and 1 of which is allocated 8 vCPUs. Can you hazard a guess as to which VM will give the least compute throughput? With standard scheduling, the VM with more vCPUs will run slowly amid the other VMs, which need only 2 vCPUs to be available before they can run. The ready time, or ready-util %, of the 8 vCPU VM will be the highest in this case. Again, this is an example that is not immediately visible from basic operations monitoring.

The solution to this problem, however, is not to break the 8 vCPU VM down into multiple smaller VMs, since the application running on it might demand all 8 vCPUs. The solution, believe it or not, is to run the 8 vCPU VM alongside other large VMs on a relatively bulky host with more cores. The reason is that when the large VM contends with other large VMs, there is a greater chance that it gets a fair share of scheduled time and is not kept waiting in a typical priority-inversion conundrum.

Another use case is storage. If a bunch of VMs on a datastore are assigned thin-provisioned disks, monitoring the individual volumes within the guest OS will not suffice. Even though no individual disk volume is nearing capacity, the sum total of space used could exceed the total space on the storage and cause all the VMs to hang with no apparent indication. So the underlying storage is another key aspect to monitor, beyond the guest OS.
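The thin-provisioning risk can be sketched as a simple datastore-level check. All names and numbers below (`ThinDisk`, `datastore_report`, the 80% warning level, the three sample VMs) are assumptions made for illustration and are not taken from any product. The point is only that the sums are compared against the datastore capacity, not against each guest volume.

```python
from dataclasses import dataclass


@dataclass
class ThinDisk:
    vm: str
    provisioned_gb: float  # size the guest OS believes the disk has
    used_gb: float         # space actually consumed on the datastore


def datastore_report(disks, capacity_gb, warn_used_fraction=0.8):
    """Summarise thin-provisioned disks against the capacity of their shared datastore."""
    provisioned = sum(d.provisioned_gb for d in disks)
    used = sum(d.used_gb for d in disks)
    return {
        "provisioned_gb": provisioned,
        "used_gb": used,
        "overcommit_ratio": provisioned / capacity_gb,
        "used_fraction": used / capacity_gb,
        "at_risk": used / capacity_gb >= warn_used_fraction,
    }


# Hypothetical sample: each guest sees its 500 GB volume as only about 60% full.
disks = [
    ThinDisk("vm-a", 500, 300),
    ThinDisk("vm-b", 500, 280),
    ThinDisk("vm-c", 500, 290),
]
report = datastore_report(disks, capacity_gb=1000)
print(report)
```

In this made-up sample no guest volume looks close to full, yet the datastore is 87% used and 1.5 times over-provisioned, which is exactly the situation that guest-only monitoring misses.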

Which brings me to the final point: a lot of monitoring teams look for options to monitor only part of the virtual estate. If you ever want to do this, make sure that no resources are shared at the lower layers. For instance, if the monitored VMs share storage with a set of unmonitored VMs, it will be difficult to pinpoint which VM is the offender that used up the most storage.

Some best practices in monitoring virtual environments

- monitor end-to-end - from virtual compute to the physical, storage and networks

- ensure that associations between entities are discovered and kept up-to-date

- monitor at all levels, set up correlations, and aim to find the root cause of problems

- uptime is best calculated from the guest OS perspective; ensure that the guest system clock is time-synchronized

- regularly resize VMs; setting resource limits does not help as much

- suspend idle VMs, that is, VMs that are kept running without doing anything important

- monitor the ready-util % metric for VMs and ensure that it is not too high (above 10% is already high)
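
The last practice in the list can be sketched as a small check. This assumes the hypervisor reports ready time as milliseconds accumulated over a sampling interval (as vSphere's CPU ready summation counter does) and that the value for a VM is summed over its vCPUs; how a given tool derives its own ready-util % may differ. The function name, sample values and 20-second interval are hypothetical, and the 10% threshold is the one quoted above.

```python
READY_THRESHOLD_PCT = 10.0  # "above 10% is already high", per the list above


def ready_percent(ready_ms, interval_s, num_vcpus=1):
    """Convert ready time accumulated over one interval into a per-vCPU percentage.

    Assumes ready_ms is summed across all vCPUs of the VM.
    """
    if interval_s <= 0 or num_vcpus <= 0:
        raise ValueError("interval_s and num_vcpus must be positive")
    return ready_ms / (interval_s * 1000.0 * num_vcpus) * 100.0


# Hypothetical samples: (ready milliseconds, vCPU count) over a 20-second interval.
INTERVAL_S = 20
samples = {"small-vm": (400, 2), "big-vm": (24000, 8)}

ready = {
    name: ready_percent(ms, INTERVAL_S, vcpus)
    for name, (ms, vcpus) in samples.items()
}
too_high = [name for name, pct in ready.items() if pct > READY_THRESHOLD_PCT]

for name, pct in ready.items():
    print(f"{name}: {pct:.1f}% ready")
print("above threshold:", too_high)
```

With these invented samples the large VM spends far more of each interval waiting to be scheduled than the small one, which is the pattern described in the 8 vCPU example earlier.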

 

Take a look at HPE Cloud Optimizer (www.hpe.com/software/cloudoptimizer), which offers you exactly these capabilities. What's more, we offer integration with HPE Operations Bridge Manager, Operations Bridge Analytics and Operations Bridge Reporter.

HPE Software Rocks!
  • operational intelligence
  • operations bridge
About the Author

Ramkumar Devana

Ramkumar Devanathan (Twitter: @rdevanathan) is Product Manager for HPE Cloud Optimizer (formerly vPV). He was previously a member of the IOM-Customer Assist Team (CAT), providing technical assistance to HP Software pre-sales and support teams with Operations Management products including vPV, SHO, and VISPI. He has more than 14 years of experience in this product line, working in roles ranging from developer to product architect.

Comments
N/A

Ram,

This is very helpful. I just installed CO in my lab and am looking forward to monitoring our VM environment.

Occasional Contributor

Our VM admin has his own monitors. We now own Cloud Optimizer as part of our OpBridge. It will be fun to compare tools.
