IT Operations Management (ITOM)

The backend database and HP Operations Analytics Part 2



Guest post by Eli Revach, database expert, HP Software


The next generation of HP Operations Analytics will be available soon, and it will include amazing new capabilities; you can read more about it here. While developing this version we faced many database challenges, and I would like to share one of them with you. In this second blog post I will focus on how database joins are executed on Big Data platforms. You can read my first blog post here.


One of the major challenges on Big Data platforms is joining two different data sets (two different tables). For example, Apache Hadoop will perform a join by distributing the small table through a distributed cache, by using an external in-memory database, or by using a Bloom filter. All of these options deliver reasonable performance for offline analytics, but they are not good enough for online analytics.
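To illustrate the Bloom filter approach, here is a minimal Python sketch (this is not Hadoop code; the table contents, bit-array size, and hash scheme are all invented for the example). The filter is built from the small table and used to discard non-matching rows of the big table cheaply, before the actual join:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter for illustration; sizes and hashes are arbitrary."""
    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # integer used as a bit array

    def _positions(self, key):
        # Derive several bit positions from the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # May return a false positive, never a false negative.
        return all(self.bits & (1 << pos) for pos in self._positions(key))

# Build the filter from the small (dimension-like) table ...
small = {"host1": "eu", "host2": "us"}
bf = BloomFilter()
for key in small:
    bf.add(key)

# ... then stream the big (fact-like) table, skipping rows that cannot match;
# the final join removes any Bloom filter false positives.
big = [("host1", 10), ("host3", 20), ("host2", 30)]
candidates = [(k, v) for k, v in big if bf.might_contain(k)]
joined = [(k, v, small[k]) for k, v in candidates if k in small]
# joined == [("host1", 10, "eu"), ("host2", 30, "us")]
```

The payoff is that rows like `("host3", 20)` are usually dropped during the scan, so far fewer rows reach the join itself.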



Many Big Data customers use Hadoop as their backend data store for processing low-level raw data and normalizing it, and then run online analytics on top of analytic databases. HP Operations Analytics is an online analytics platform that leverages parts of the Hadoop ecosystem and uses HP Vertica as its analytic database. We don't use Hadoop as a data store, so how are JOINs executed on HP Vertica?


HP Vertica supports SQL joins:

HP Vertica supports two major join methods, Merge and HASH. A nested loop JOIN is not supported, as it is not suitable for processing large joins.
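For reference, a merge join can be sketched in a few lines of Python; both inputs must already be sorted on the join key, and this simplified version assumes unique keys on the right-hand (dimension) side. The sample rows are made up:

```python
def merge_join(left, right):
    """Merge join over two inputs sorted on the join key.
    Assumes unique keys on the right side, for simplicity."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, lv = left[i]
        rk, rv = right[j]
        if lk < rk:
            i += 1          # left key too small, advance left
        elif lk > rk:
            j += 1          # right key too small, advance right
        else:
            out.append((lk, lv, rv))
            i += 1          # keep j: the next left row may share this key
    return out

fact = [("a", 1), ("a", 2), ("b", 3), ("d", 4)]   # sorted on key
dim  = [("a", "x"), ("b", "y"), ("c", "z")]       # sorted on key
result = merge_join(fact, dim)
# result == [("a", 1, "x"), ("a", 2, "x"), ("b", 3, "y")]
```

Because both sides are consumed in order, a merge join needs no hash table at all; its cost is the pre-existing sort order, which is exactly what a sorted column store can provide.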


A classic HASH join in a relational row store database is performed using the following steps:

  • The relatively small DIMENSION table is hashed in memory (step 1, 2).
  • The big FACT table is scanned (full table scan) to probe the hash (build) table for matches (step 3, 4).
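The two phases above can be sketched in Python as follows (a simplified illustration; real databases work in pages and batches, and the sample rows are invented):

```python
def hash_join(dimension, fact):
    # Build phase (steps 1-2): hash the small DIMENSION table in memory.
    build = {key: attrs for key, attrs in dimension}
    # Probe phase (steps 3-4): one full scan of the big FACT table,
    # probing the hash table for each row.
    return [(key, measure, build[key])
            for key, measure in fact
            if key in build]

dimension = [(1, "cpu"), (2, "memory")]
fact = [(1, 0.93), (3, 0.40), (2, 0.71)]
result = hash_join(dimension, fact)
# result == [(1, 0.93, "cpu"), (2, 0.71, "memory")]
```

Note that every FACT row is read and probed, even rows (like key 3 here) that can never match, which is exactly the cost discussed below.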



Figure 1: HASH join in row store databases


The scan of the FACT table is done in parallel; however, it is still a heavy and time-consuming operation. The main reasons for this are:

  1. In row store databases, a full table scan reads the entire FACT table (a lot of I/O).
  2. Every FACT record must be probed against the DIMENSION hash table, which results in a very big join set.


How is this join performed in Vertica?

Vertica uses the method of Sideways Information Passing, a.k.a. SIP, a method that was developed in academia and is used by several database vendors to reduce the size of a join set. To simplify, SIP is a method for pushing predicates down in order to reduce the size of the join between the FACT and the DIMENSION table (less memory and less temporary disk space during the join).


During query optimization, the optimizer rewrites the query by pushing the predicate down, which adds a Runtime filter attribute to the FACT table scanning phase; this Runtime filter includes values from the DIMENSION table. As a result, a very compact FACT table row set is created in this step, which leads to a very small join set (a small FACT subset joined with a DIMENSION table that is by nature relatively small). We can think of this Runtime filter as a SQL INLIST filter that contains a list of keys from our DIMENSION table (Figure 2, step 3). This INLIST is pushed down to the FACT scanning phase and leads to an optimally sized FACT row set and an optimal HASH join (Figure 3).
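To make the idea concrete, here is a hedged Python sketch of the behavior described above. It models only the effect of the runtime INLIST filter, not Vertica's actual implementation; the table contents and the predicate are invented:

```python
def sip_hash_join(dimension, fact, dim_predicate):
    # Step 1: evaluate the predicate on the DIMENSION and keep the
    # qualifying rows and their keys.
    dim_rows = {key: attrs for key, attrs in dimension if dim_predicate(attrs)}
    runtime_filter = set(dim_rows)  # the "INLIST" of DIMENSION keys
    # Step 2: the FACT scan applies the runtime filter, so only a
    # compact row set ever reaches the join operator.
    compact_fact = [(key, measure) for key, measure in fact
                    if key in runtime_filter]
    # Step 3: a small FACT subset joined with a small DIMENSION subset.
    return [(key, measure, dim_rows[key]) for key, measure in compact_fact]

dimension = [(1, "eu"), (2, "us"), (3, "eu")]
fact = [(1, 10), (2, 20), (3, 30), (2, 40)]
joined = sip_hash_join(dimension, fact, lambda region: region == "eu")
# joined == [(1, 10, "eu"), (3, 30, "eu")]
```

Compare this with the classic hash join earlier: there, all four FACT rows would be probed; here, the two "us" rows are dropped during the scan and never participate in the join at all.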



Figure 2: Vertica HASH join




Figure 3: Execution plan, including SIP




If you are interested in Vertica, you can download the community version here.


If you have any questions, feel free to reach out to us in the comments section below. You can also find more information on the Operations Analytics homepage.



About the Author


This account is for guest bloggers. The blog post will identify the blogger.


I would like to know the set of best practices to follow while implementing a DW at a scale of hundreds of TB.

I haven't seen any documentation regarding such systems from Vertica.

I would like to know more about

1. Capacity planning and loading

2. Query performance optimization at such scale

3. Resource allocation


HPE Blogger

Hi Vivek,


Below are some answers:

Vertica has a capacity planning document on their site; if you like, I can send it to you.

Performance: there is no magic. You need to design projections to answer your use cases, otherwise performance can be bad; to get the best performance, avoid joins as much as you can.

Data load: very fast. The best practice is to load in parallel, so that each loader connects to a different Vertica server.

Resource allocation: Vertica has the concept of a resource manager; you can manage your memory and CPU resources per resource pool.




