Guest post by Eli Revach, Database expert, HP Software
I am very excited to write my first post, which is part of a series about the HP Operations Analytics backend database and how we leverage its capabilities. Before I start, I want to say a few words about HP Operations Analytics.
Everyone has a big data horror story: an organization discovers that data it purged in the past is now critical to a current decision and could have driven revenue growth.
HP Operations Analytics is a new capability of HP Business Service Management that delivers advanced analytic intelligence for structured and unstructured data, including machine data, logs, events, topologies, and performance statistics. You can read more about it here.
But as promised, this post focuses on the backend database and its special capabilities within Operations Analytics.
In the recent past, a classic business intelligence reporting solution consisted mainly of fact and dimension tables. To deal with large data sets, such a solution had to maintain time-slice pre-aggregation tables, populated by some kind of batch process.
Maintaining these aggregation tables has several drawbacks: the batch process is not transparent to schema changes, the aggregates are asynchronous with the raw data, and they require extra storage and extra maintenance. These drawbacks, together with the fact that our solution has to handle unlimited scale, led us to choose the HP Vertica database as our structured backend data store, with no need for time-slice pre-aggregation tables. We simply run our queries on top of the raw data.
Vertica is an analytic database with linear scalability and fantastic performance. It includes a rich set of built-in analytic functions, and it also lets you write your own analytic functions in C++ or R. Vertica is a columnar, shared-nothing database in which data is distributed among the nodes of the cluster.
Running on top of raw data and still fast?
Vertica is a columnar database: each column of a table is stored in a separate file. Data in Vertica is compressed, and Vertica can scan the compressed data in parallel, postponing decompression to a later stage of execution, so scan time is very short. Vertica calls this "late materialization" (one of several techniques it uses, and of course only part of the story). Consider this query:
WHERE SALE_DATE < '10/01/2013'
Both the SALE_DATE and TRANSACTION_ID columns are scanned in parallel, and the data is filtered directly on the compressed files (two separate files in this case). The output of each scan filter is a vector of data positions that match the SQL condition in that file. These position vectors are mapped into a binary representation: 1 for a match, 0 for no match. A binary AND is then executed on top of them, producing a final positions vector that is used to access the remaining files and fetch the final select-list columns. With this method some data files are accessed twice, but the overall I/O improves.
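To make the two-predicate scan concrete, a query of the shape described above might look like the following sketch. Only the SALE_DATE filter appears in the original post; the SALES table, the select-list columns, and the second predicate on TRANSACTION_ID are illustrative assumptions:

```sql
-- Hypothetical SALES table; each column lives in its own file.
-- Vertica scans the SALE_DATE and TRANSACTION_ID files in parallel,
-- produces a position vector per predicate, ANDs the two vectors,
-- and only then fetches CUSTOMER_ID and AMOUNT for the surviving rows.
SELECT CUSTOMER_ID, AMOUNT
FROM SALES
WHERE SALE_DATE < '10/01/2013'
  AND TRANSACTION_ID > 5000000;
```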
Below is a high-level illustration of late materialization in a columnar database.
On-the-fly dynamic time-slice aggregation
In Operations Analytics, end users can define their own dashboards in addition to the dashboards that come with the product. Behind the scenes, each dashboard executes SQL on top of the Vertica database with a different time-slice aggregation. Since the data is not pre-aggregated, the time-slice aggregation runs as part of the query itself, on top of the raw data. Vertica provides a unique analytic function that I have not seen in other databases: TIME_SLICE. The query accepts the slice length as a parameter, and the function does the job for you.
Below is an example of a five-minute time-slice aggregation:
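A sketch of such a query using Vertica's built-in TIME_SLICE function is shown here; the METRICS table and its columns are illustrative assumptions, not from the original post:

```sql
-- Hypothetical METRICS table (metric_time TIMESTAMP, cpu_util FLOAT).
-- TIME_SLICE(metric_time, 5, 'MINUTE') assigns each row to the start
-- of its five-minute slice; grouping by the slice aggregates the raw
-- data on the fly, with no pre-aggregation table.
SELECT TIME_SLICE(metric_time, 5, 'MINUTE') AS slice_start,
       AVG(cpu_util) AS avg_cpu,
       MAX(cpu_util) AS max_cpu
FROM METRICS
GROUP BY TIME_SLICE(metric_time, 5, 'MINUTE')
ORDER BY slice_start;
```

Changing the slice length or unit (for example, 1, 'HOUR') changes the aggregation granularity without touching any schema.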
The same query on other databases would require a join to a time dimension table containing the different combinations of time units. The fact that no extra join is required improves query performance dramatically.
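For comparison, on a database without a TIME_SLICE-style function, the classic alternative looks roughly like the following; the TIME_DIM table and all column names are illustrative assumptions:

```sql
-- Classic approach: join raw rows to a pre-built time dimension
-- whose rows mark the boundaries of each five-minute slice.
-- The dimension must be maintained, and the join adds cost.
SELECT d.slice_start,
       AVG(m.cpu_util) AS avg_cpu
FROM METRICS m
JOIN TIME_DIM d
  ON m.metric_time >= d.slice_start
 AND m.metric_time <  d.slice_end
GROUP BY d.slice_start
ORDER BY d.slice_start;
```

Eliminating this join, and the batch process that keeps TIME_DIM populated, is exactly what the built-in function buys you.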
I hope you have enjoyed this in-depth view of the backend database of HP Operations Analytics. Stay tuned for additional posts that discuss the different aspects of HP Operations Analytics.
If you have any questions, feel free to reach out to us in the comments section below. You can also find more information on the Operations Analytics homepage.