
Using ‘R’ to develop statistical analytics in Operations Analytics!


Bob Bethke, Technical Lead; and Santanu Dey, HP Software R&D Project Manager

 

Why ‘R’?

 

“R” is the leading tool for statistics, data analysis, and machine learning.  It is more than a statistical package; it’s a programming language. With it you can create your own objects, functions, and packages.
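For example, here is a small, self-contained sketch of defining your own function in an R session (the function and sample values are purely illustrative):

# A user-defined function: standardize a numeric vector to z-scores
zscore <- function(x) (x - mean(x)) / sd(x)

zscore(c(10, 12, 9, 14, 11))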

These are just a few of the reasons R is being used by a growing number of data analysts in corporations and academia. It is becoming the lingua franca of data mining and big data analysis, and it is in use at a wide range of companies.

 

R integrates with other languages (C/C++, Java, and Python) and can interact with many data sources: ODBC-compliant databases (Excel, Access) and other statistical packages (SAS, Stata, SPSS, Minitab).
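As a small sketch of the data-source side, the RODBC package pulls rows from any ODBC data source into an R data frame. The DSN name, table, and columns below are assumptions for illustration only:

# Query an ODBC-compliant data source into an R data frame
# install.packages("RODBC")           # if the package is not already installed
library(RODBC)
ch <- odbcConnect("myDSN")            # a DSN configured on the system (assumed name)
perf <- sqlQuery(ch, "SELECT host_name, cpu_util FROM perf_data")
odbcClose(ch)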

 

Finally, and most importantly, the Comprehensive R Archive Network (CRAN) contains over 5,000 R analytic packages that can be freely used. Leveraging existing packages from the CRAN archive can give IT Management a big head start in the world of statistical operations analysis.
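For instance, a moving average over a series of CPU readings takes only a couple of lines once a package such as ‘zoo’ is pulled from CRAN (the sample values here are made up):

# Install and use an analytic package from CRAN
install.packages("zoo")                # fetch the package from a CRAN mirror
library(zoo)

cpu <- c(35, 40, 38, 90, 42, 39, 41)   # sample hourly CPU utilization readings
rollmean(cpu, k = 3)                   # 3-point moving average of the series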

‘R’ and Big Data

 

HP Operations Analytics’ underlying database (Vertica) supports distributed ‘R’ execution for scalable statistical analytics. The programming model, known as Presto, is built around distributed data elements, parallel execution commands, and session management to control concurrency. For more information on Vertica’s (version 7) distributed ‘R’ programming model, see: http://www.vertica.com/distributedr/distributedrprogrammingmodel/
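As a rough sketch of that model, assuming HP’s distributedR package is installed (the array dimensions and the per-partition computation below are purely illustrative):

# Distributed data elements (darray) plus a parallel execution command (foreach)
library(distributedR)
distributedR_start()                                   # open the distributed session

da <- darray(dim = c(400, 10), blocks = c(100, 10))    # 4 partitions of 100x10
foreach(i, 1:npartitions(da), function(x = splits(da, i)) {
  x <- matrix(runif(length(x)), nrow = nrow(x))        # compute on one partition
  update(x)                                            # commit the partition changes
})

head(getpartition(da, 1))                              # fetch one partition on the master
distributedR_shutdown()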

 

Operations Analytics’ AQL with ‘R’

 

AQL is Operations Analytics’ analytic query language. It is an abstraction of SQL in which many of the SQL details and utilities have been encapsulated into functions. We follow the same functional approach for integrating R with AQL. For example, if there is an ‘R’ function that performs multivariate correlation on time series data samples, the AQL syntax for calling the multivariate ‘R’ function is as follows:

mvCorr([ts1, ts2, …])

 

Here, each tsX is a function that returns a time series sequence, or a list of time series sequences, to be correlated.

Before an ‘R’ function can be used in AQL, it must be registered with the Operations Analytics system as an ‘R’ function; a command-line tool is provided for this registration. Operations Analytics also provides ‘R’ convenience functions for transforming AQL results into the time series format mvCorr expects. Finally, Operations Analytics provides the factory function for defining the input and output types.
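Conceptually, that transformation turns a long AQL-style result (timestamp, host, value rows) into one column of measurements per host. The shipped convenience functions handle this for you; the base-R sketch below, with made-up column names and values, only illustrates the idea:

# Illustrative only -- not the shipped OpsA helper functions
aql_result <- data.frame(
  timestamp = rep(1:4, times = 2),
  host_name = rep(c("host1", "host2"), each = 4),
  mem_util  = c(40, 42, 41, 45, 70, 72, 69, 75)
)

# One column of measurements per host (the "wide" time series layout)
wide <- reshape(aql_result, idvar = "timestamp", timevar = "host_name",
                direction = "wide")
cor(wide[, -1])                        # pairwise correlation between the hosts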

 

All the details of configuring ‘R’ for Operations Analytics are captured in a white paper that is delivered with the product.

  • Technical white paper ‘Using R-Functions to Integrate Custom Analytics into OpsA’

Also included with the white paper is an example statistical analytic, ‘multivariate correlation’.

 

The Basic Concepts

 

Statistical packages operate on data samples, also known as observations or, in technical terms, random variables. An example would be yearly rainfall amounts in Hawaii for the last 20 years. For consistency’s sake, I will refer to these observations as random variables.

In terms of Operations Analytics, a random variable is a particular measurement on a particular managed entity over a specific time period. An example random variable in Operations Analytics is:

Average cpu_utilization on host1 over the last day by every hour.

So in Operations Analytics terms, a random variable is represented by a metric on a managed entity with a set of measurements. The Operations Analytics query language presents these measurements as data frames indexed by time. Many of Operations Analytics’ basic analytics produce random variables as their results; essentially, these are time series data.

The AQL expression [metricQuery(oa_sysperf_global, {i.host_name ilike "*"}, {i.host_name}, {moving_avg(i.mem_util)})] is a query that returns a time series of memory utilization values for all hosts in the collection oa_sysperf_global. In other words, it returns a set of random variables by host, where the observation is memory utilization.

To see the correlation of memory utilization across all hosts with our mvCorr function, we only need to pass this query as a parameter to mvCorr, as follows:

[mvCorr([metricQuery(oa_sysperf_global, {i.host_name ilike "*"}, {i.host_name}, {moving_avg(i.mem_util)})])]

 

The results will appear as follows:

 

[Image: correlation between memory utilizations for hosts]

How to get started

 

First, go to the CRAN web site (http://cran.us.r-project.org/) and read up on ‘R’ if you are unfamiliar with the language. Download the ‘R’ interpreter and start with some of the examples from the ‘An Introduction to R’ manual.
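A first session might look something like this (the simulated data is just a placeholder):

# A first look at R: simulate some measurements and summarize them
x <- rnorm(100, mean = 50, sd = 10)    # 100 simulated utilization readings
summary(x)                             # basic descriptive statistics
hist(x)                                # quick look at the distribution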

Operations Analytics ships the multivariate correlation package ($OPSA_HOME/inventory/lib/hp/r-udx-examples/MVCorr.R). Load this file into the R interpreter and get familiar with some of the functions. The package contains a ‘genDataFrame’ function that generates data in a format similar to an Operations Analytics AQL result, so you can experiment with the correlation functions.
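A session for exploring the shipped example might look like the following; the exact arguments of genDataFrame are an assumption here, so check MVCorr.R itself:

# Load the shipped example and inspect the generated sample data
source("MVCorr.R")       # adjust the path to $OPSA_HOME/inventory/lib/hp/r-udx-examples
ls()                     # list the functions defined by MVCorr.R
df <- genDataFrame()     # sample data shaped like an AQL result (arguments assumed)
head(df)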

 

Also, examine the mvCorrOutType and mvCorrFactory functions, which are necessary for the integration with Vertica.
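For orientation, a Vertica R UDx factory is an R function that returns a list describing the analytic and its input and output column types. The sketch below shows the general shape only; the function name and the types are assumptions, not the definitions in MVCorr.R:

# Generic shape of a Vertica R UDx factory function (types are illustrative)
myCorrFactory <- function() {
  list(name    = mvCorr,                  # the R function Vertica will invoke
       udxtype = c("transform"),          # a transform UDx operates on sets of rows
       intype  = c("float", "float"),
       outtype = c("varchar", "varchar", "float"),
       outtypecallback = mvCorrOutType)   # computes output types/names at query time
}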

If you are feeling adventurous, you may want to develop a package of your own, or download a package from the CRAN archive and integrate it with Operations Analytics.

 
