Have you ever seen large service models of computers showing 32 CPUs or so? Does having 32 CPU elements attached to the server in a service model make sense from a business service monitoring standpoint?
Thinking about it, it makes sense to keep an inventory of the CPUs (or cores) from an asset management standpoint. From a service monitoring standpoint, however, an inventory of individual CPUs is not required. With multi-tasking the norm and processes moving freely from one processor to another, a spike on one CPU or a few CPUs does not affect the working of the applications. Modern operating systems automatically balance load across processors.
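To make the point concrete, here is a minimal sketch (names and thresholds are illustrative, not from any monitoring product) of how per-core readings collapse into the single system-level metric the service model actually needs:

```python
def system_cpu_health(per_core_util, busy_threshold=90.0):
    """Collapse per-core CPU utilization (percentages) into one
    system-level view. A single busy core is not service-affecting;
    what matters is overall headroom across all cores."""
    avg = sum(per_core_util) / len(per_core_util)
    return {"avg_util": avg, "saturated": avg >= busy_threshold}

# One core spiking to 100% barely moves the system-level view:
print(system_cpu_health([100.0, 5.0, 7.0, 4.0]))
```

Even with one core fully pegged, the aggregate utilization stays low, so the system element in the service model remains healthy.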
So it boils down to this - would the operations bridge team be interested in monitoring each CPU core? Is there a situation where an app is affected by one or a few CPUs having high utilization or maybe electrical faults?
The service model is required to show that a business service is performing optimally and is able to handle the number of transactions that the SLA states. If there's any shortfall from the SLAs, the root-cause may come down to the network, the app or the system. That level of abstraction is sufficient. Coupled with the health indicators or some such state representation labels for the element in the service model, a system could be shown to have a CPU or memory bottleneck or even some other 'system-fault' state altogether.
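A hedged sketch of what such a state-representation label could look like in practice (the state names and 90% thresholds are my own illustrative choices, not a product's health-indicator definitions):

```python
def system_state(cpu_util, mem_util, threshold=90.0):
    """Derive one coarse state label for the system element in the
    service model from aggregate metrics. This level of abstraction
    is enough for the operations bridge to start root-cause triage."""
    if cpu_util >= threshold and mem_util >= threshold:
        return "system-fault"       # multiple resources exhausted
    if cpu_util >= threshold:
        return "cpu-bottleneck"
    if mem_util >= threshold:
        return "memory-bottleneck"
    return "normal"

print(system_state(cpu_util=95.0, mem_util=40.0))  # cpu-bottleneck
```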
If I may dare to generalize this, the rule is clearly: 'Present only what is relevant to the service model.'
NOTE: With CPU affinity settings, it is possible for instance to restrict certain apps/processes/virtual machines to just a few cores. In such cases, it becomes 'relevant' to present these cores to the service model.
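As a sketch of that exception (on Linux, affinity is set via mechanisms such as `taskset` or `os.sched_setaffinity`; the function and app names below are hypothetical), you could derive which cores become model-relevant from the apps' affinity masks:

```python
def relevant_cores(affinity_by_app, total_cores):
    """Cores that a monitored app is pinned to become 'relevant'
    to the service model; apps free to run on all cores keep the
    cores abstracted behind the system element."""
    all_cores = set(range(total_cores))
    pinned = set()
    for cores in affinity_by_app.values():
        if set(cores) != all_cores:   # genuinely restricted
            pinned |= set(cores)
    return sorted(pinned)

# 'db' is pinned to cores 0-1; 'web' may use any of the 8 cores.
print(relevant_cores({"db": [0, 1], "web": list(range(8))}, 8))
```

Only the cores the pinned app depends on (here 0 and 1) would be surfaced in the model; the rest stay abstracted.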
On the same basis, you would not model the memory elements, nor things so intrinsic to the operating system as the process table or the kernel. These are abstracted into the system element itself.
The rule above might lead you to ask - should we show individual NICs and storage volumes on the system?
It is prudent to note here that NICs and storage disks/volumes are unlike CPU or memory, both from a resource-usage standpoint and, consequently, from a modelling standpoint. Here's the low-down.
For most applications the disk is the fundamental unit of storage, and the binding between the app and its storage is tight. On local storage, an app will not simply start writing its data to another volume because the disk it uses runs out of space. So it is important to model each disk, and each NIC complete with its IP address (and MAC address), as part of the computer system model.
IO problems occurring on one disk or one network card can cause applications to fail entirely. So it is important to present the potential root causes (here, that one lousy disk or NIC) in the service model.
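A minimal sketch of such a computer system model, with each disk and NIC as an individually addressable element (class names, hostnames, and state values here are all hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class Disk:
    mount: str            # e.g. '/var/data'
    state: str = "normal"

@dataclass
class Nic:
    name: str             # e.g. 'eth0'
    ip: str
    mac: str
    state: str = "normal"

@dataclass
class ComputerSystem:
    hostname: str
    disks: list = field(default_factory=list)
    nics: list = field(default_factory=list)

    def root_causes(self):
        # Surface each faulty disk or NIC individually - these are
        # the potential root causes the bridge needs to see.
        return ([d.mount for d in self.disks if d.state != "normal"]
                + [n.name for n in self.nics if n.state != "normal"])

srv = ComputerSystem(
    "app01",
    disks=[Disk("/"), Disk("/var/data", state="io-error")],
    nics=[Nic("eth0", "10.0.0.5", "00:11:22:33:44:55")],
)
print(srv.root_causes())  # ['/var/data']
```

Because the faulty volume is a first-class element, the event on it can be correlated straight up to the affected application.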
While on this topic, we must also discuss some special cases such as teaming and aggregation.
It is common nowadays to do something called NIC teaming (a.k.a. link aggregation). Teaming combines two or more NICs under a single interface name, so instead of four 1 Gb NICs the system appears to have one 4 Gb NIC, thereby providing greater bandwidth.
Again, I suggest keeping the abstraction at the right level - we must show the 'teamed' NIC interface first and foremost in the computer system model. Unless there are really compelling business reasons, we should not attempt to show a drill-down of the bonded NICs in the service model.
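The teamed interface is then the single element the service model sees; a sketch (interface names and speeds are illustrative), assuming the members simply sum into the aggregate bandwidth:

```python
class TeamedNic:
    """Model a link-aggregated interface as one element; the member
    NICs stay below the abstraction layer, managed elsewhere."""

    def __init__(self, name, member_speeds_gbps):
        self.name = name
        self._members = list(member_speeds_gbps)  # hidden detail

    @property
    def speed_gbps(self):
        # The aggregate bandwidth is what the service model presents.
        return sum(self._members)

bond0 = TeamedNic("bond0", [1, 1, 1, 1])
print(bond0.name, bond0.speed_gbps)  # bond0 4
```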
For the same reason, you would not show that data is striped across multiple disks in a RAID configuration (or try to show associations to each member disk/volume) - again, not relevant for the business service models that enterprises typically use in their bridge.
That leads to an interesting question - who monitors the lower-level elements below the abstraction layer? This is where the element managers (HP SIM and OneView, Cisco UCS Manager) and the domain managers (HP NNMi, HP OM, HP Storage Essentials) come into the picture. Here's a picture just to clarify.
It must be noted here that the operations bridge would also be served events, 'enough' topology, and performance data from the element managers, to allow advanced event correlation reaching up to the business service layer and other interesting BSM use cases.
Then there are other cases, such as clustering with failover and load-balanced configurations. In these cases, it is important to show the health of the cluster as affecting the clustered application, rather than the health of the individual nodes in the cluster. It is quite possible (thanks to the whole redundancy plan behind clustering) that some nodes are not 'healthy' while the cluster itself remains healthy, and the apps running on it continue to perform optimally. Use propagation and calculation rules when dealing with aggregated elements such as cluster systems to ensure that the health state of the clustered nodes is rolled up to the cluster level. The idea is to not call out a fault merely because one node malfunctions. Maybe that's a warning alert: 'Redundancy affected. Node X down'. But it definitely does not warrant the critical alert that would be raised in a similar situation without any clustering implemented.
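Such a propagation rule can be sketched as follows (the state names and the quorum-of-one assumption are illustrative, not from any specific event-correlation product):

```python
def cluster_health(node_states, quorum=1):
    """Roll node states up to the cluster level. A single failed
    node only degrades redundancy (warning); the cluster - and so
    the clustered app - goes critical only when fewer than `quorum`
    nodes remain up."""
    up = sum(1 for s in node_states if s == "up")
    if up == len(node_states):
        return "normal"
    if up >= quorum:
        return "warning"    # e.g. 'Redundancy affected. Node X down'
    return "critical"

print(cluster_health(["up", "down"]))  # warning, not critical
```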
When talking of system models, one cannot escape the stark reality - what you see is not what you really get :) Virtualization is a great leveler. It allows you to think that you have a disk for reads and writes, but what you really have is a file to which all reads and writes are happening. It allows you to think you have CPUs and memory slots, but again these are only 'threads' in the OS kernel and regions in memory.
So how really should one model virtual systems? I will cover this part in my next blog article.
Ramkumar Devanathan (twitter: @rdevanathan) is Product Manager for HPE Cloud Optimizer (formerly vPV). He was previously a member of the IOM-Customer Assist Team (CAT) providing technical assistance to HP Software pre-sales and support teams with Operations Management products including vPV, SHO, VISPI. He has experience of more than 14 years in this product line, working in various roles ranging from developer to product architect.