Written by Omer Shem Tov, HPSW Automation Center of Excellence
The new buzzword is 'cloud'. Everybody is talking about the cloud.
What is this “cloud” anyway?
How do we use it in our day-to-day life at R&D and DevOps?
Why and how did Operations Orchestration (OO) help us easily achieve a robust connection between our DevOps use case and the cloud?
I work for HP Software on the Automation Center of Excellence (ACoE) team. Our team is responsible for improving processes across the site by automating manual work.
Defining the problem statement
In 2009 (when clouds were only in the sky) we were called in to solve the following problem. When an R&D person got a new "product build", he first needed to uninstall the previous version, which usually left traces and caused problems when reinstalling. Alternatively, he could use Symantec Altiris to deploy a new image. This took a couple of hours and worked directly with the machine rather than with VMs. Unfortunately, the process was not stable: it was error prone and hard to automate. Moreover, the user had no control over the hardware (CPU, memory, disk).
In retrospect, this was merely a symptom of a problem that lay elsewhere. The real problem was: how can we get a product build, provision a new VM, install the build automatically, and then run automatic or manual tests on it? In other words, this process of continuously deploying product builds is what we know today as a DevOps problem.
Arriving at a resolution
To solve this problem, my team investigated a new approach. We would use an internal Java library called "Slick" that connected to a VMware API to provision a new VM from existing templates (Windows, Linux) on a vSphere server, and to set the hardware and network settings (IP, domain, computer name) for that new computer.
Once the computer had been successfully provisioned and was ready to use, the Slick library would take a product and install it on the new computer. This process was somewhat complicated, as we needed to support different operating systems (Linux and Windows), different databases, and different install configurations (every configuration the product installation allows), all at the same time. The product version was not necessarily the final version; usually it was a fresh build that needed to be deployed and checked immediately.
Product Type x OS x Database x Configuration x Product Build = Complex Matrix
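To get a feel for how quickly this matrix grows, here is a minimal sketch in Python. All the values are made-up placeholders (not our real product list, databases, or build names); the point is only that each extra dimension multiplies the number of combinations.

```python
from itertools import product

# Illustrative placeholders only; these are assumptions, not the real lists.
PRODUCTS = ["product-a", "product-b"]
OPERATING_SYSTEMS = ["windows", "linux"]
DATABASES = ["oracle", "mssql", "embedded"]
CONFIGURATIONS = ["typical", "custom"]
BUILDS = ["build-101", "build-102"]


def deployment_matrix():
    """Return every combination an installer would have to support."""
    return list(
        product(PRODUCTS, OPERATING_SYSTEMS, DATABASES, CONFIGURATIONS, BUILDS)
    )


matrix = deployment_matrix()
print(len(matrix))  # 2 * 2 * 3 * 2 * 2 = 48 combinations
```

Even with two or three options per dimension, the installer already faces 48 distinct cases, which is why doing this by hand did not scale.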
As the library turned out to be really useful, we built a server called RDE (R&D Enablement) that would use this library to deploy new VMs and do the required automatic installation of any HP product and version needed during our work.
The RDE allows user interactions in two ways:
Interactively – The user logs in to the RDE web UI and requests a machine with the required template, plus the installation of a product.
Automatically – A REST request to RDE automates the provisioning and installation process. This allows a CI engine (e.g. Jenkins, TeamCity) to do the build, provision a new VM, and deploy a new product installation from that build in order to run automation tests.
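As a rough illustration of the automatic mode, the sketch below builds the kind of request a CI job might send. The URL, the JSON field names, and the `build_deploy_request` helper are all hypothetical, since the actual RDE REST interface is not described in this post; the request is only constructed, not sent.

```python
import json
from urllib.request import Request

# Hypothetical endpoint; the real RDE URL and schema are assumptions here.
RDE_URL = "http://rde.example.com/api/deployments"


def build_deploy_request(template, product, build, cpu=2, memory_gb=4):
    """Build (but do not send) a POST asking for a new VM plus an install."""
    payload = {
        "template": template,                              # e.g. a Windows or Linux template
        "hardware": {"cpu": cpu, "memory_gb": memory_gb},  # user-controlled hardware
        "product": product,
        "build": build,
    }
    return Request(
        RDE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_deploy_request("windows", "product-a", "build-101")
print(request.get_method(), request.full_url)
```

A CI step would then pass the request to `urllib.request.urlopen` and poll for the result before starting its automation tests.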
RDE essentially became our own private cloud (private since it runs on our own resources and not on Amazon or Google resources). Since we introduced RDE, it has been responsible for 100 deployments each day and has essentially become the only way to install new builds.
Recognizing RDE 1.0 Limitations
But RDE 1.0 had key issues, such as memory leaks. It had only partial support for Linux templates, and it could not handle a restart of the remote server during automatic installation, because the connection to the machine was lost. Moreover, its design had a major drawback. We had effectively cut the IT team out of the loop, even though we still needed them to handle the vSphere infrastructure. This would not work in the long run.
It was this IT team that knew how to manage resources (CPU, memory, IP addresses) and VM templates (where to store them, when to apply OS updates), how to delete machines (how to release the IP, how to remove the machine from Active Directory), and that had the VMware insight into machine creation (why did a machine provision fail?). This same IT team had the responsibility for dealing with provisioning problems but not the tools to do so. They may not have carried great power, but they did own great responsibility.
Unfortunately, we also could not hand them responsibility for the entire RDE. RDE installs products that R&D wants to check and that IT might not be aware of. Essentially, we needed an R&D person to handle the various installation flows, which may change across versions.
Introducing RDE 2.0, revamped!
This is where we introduced RDE 2.0, which divides the two functions, provisioning and installation, into two separate processes:
Provisioning is done using the HP Cloud Service Automation (CSA) product, which is used to manage a private cloud and is based on OO flows. IT is responsible for building these services and maintaining their runs. In the event of a problem, IT provides the required response and a fix so that it does not recur.
Installation is done using HP Operations Orchestration. Our products involve only a single tier (with optional use of an existing DB), so we did not require a complex deployment model with multiple servers. My automation team simply maintains the installation flows and takes care of any problems or changes during the installation process. If needed, they can then involve the person who is responsible for the product installation itself.
To call these two processes (CSA and OO), RDE 2.0 uses REST to trigger the CSA provisioning and the OO installation, and to track these runs. The benefit of this design is that RDE 2.0 does not change for its clients. Because the incoming REST interface stays the same, all RDE automation clients stay the same as well.
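The shape of this design can be sketched as a small facade: one unchanged entry point for clients, with provisioning and installation delegated to two separately owned backends. This is only an illustration under stated assumptions. The class and function names are invented, and the real CSA and OO REST calls are replaced by plain Python callables so the sketch runs without any server.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-ins: nothing here is the real CSA or OO API.


@dataclass
class DeploymentResult:
    machine: str
    provision_run: str
    install_run: str


class Rde2Facade:
    """Keeps one client-facing entry point while delegating to two backends."""

    def __init__(
        self,
        provision: Callable[[str], Dict[str, str]],
        install: Callable[[str, str, str], str],
    ):
        self._provision = provision  # would trigger a CSA service over REST
        self._install = install      # would trigger an OO flow over REST

    def deploy(self, template: str, product: str, build: str) -> DeploymentResult:
        # Step 1: provisioning, owned by IT.
        provisioned = self._provision(template)
        # Step 2: installation, owned by the automation team.
        install_run = self._install(provisioned["machine"], product, build)
        return DeploymentResult(
            provisioned["machine"], provisioned["run_id"], install_run
        )


def fake_csa_provision(template):
    """Pretend provisioning backend; returns a machine name and a run id."""
    return {"machine": f"vm-{template}-01", "run_id": "csa-run-1"}


def fake_oo_install(machine, product, build):
    """Pretend installation backend; returns an identifier for the run."""
    return f"oo-run-{product}-{build}-on-{machine}"


rde = Rde2Facade(fake_csa_provision, fake_oo_install)
result = rde.deploy("linux", "product-a", "build-101")
print(result)
```

Because clients only ever call `deploy`, either backend can be replaced or fixed by its owning team without the clients noticing.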
Quite elegantly, OO and RDE 2.0 solved the problems that we had with RDE 1.0. CSA gave us support across the Linux and Windows versions. OO flows handle flow failures gracefully, which covers the case of a remote machine restart. The split also fixes the poor initial design of RDE 1.0: provisioning problems are now handled by the IT team, and installation problems by the ACoE team. The OO server provides us with a much more stable remote installation with minimal maintenance.
And now, even after six months of usage, we are still happy. With the OO server installed on a VM with 8 GB of RAM running Windows Server 2008, we did not need to restart it even once, and there wasn't a single failed installation owing to any real problem or defect in the OO product. Now, how good is that!