Hive, Impala and Presto – The War on SQL over Hadoop

I feel the logo of an infant elephant for Hadoop is not opt now. It is well established and growing faster and stronger. Some people getting along up to the speed and some find it hard to grow faster. To bridge that gap, there is enormous activity going on to bring traditional SQL over the Hadoop. Facebook started to develop Hive around 2007 and opensource it in the end of 2008. Ever since the popularity of SQL over Hadoop is growing. On October 2012, Cloudera announced Impala which claim to be near real time Adhoc bigdata query processing engine faster than Hive. Facebook again jump in to the picture and announced Presto last month. There is an open source project called Apache Drill also focusing on Adhoc analysis.

Lets take a look at the bigger picture how these system interacting with the larger Hadoop ecosystem.

Overall architecture of Hadoop, Hive and Impala

In short, Hive converts the HiveQL query language in to sequence of MapReduce jobs to achieve the results, while Presto and Impala follow the distributed query engine processor inspired by Google Dremel paper.

HiveQL:

One of the common thing one could found among all three systems are, it all support on common standard called HiveQL (need a better common name soon?). Though HiveQL is based on SQL, it’s not strictly support the SQL-92 specification.

How hive works?

Hive maintain it’s own metadata storage where it keep metadata information about schema definition, table definition, name node that contains the respective date etc.

There is Hive meta data storage client, that expose all meta data information as a service. It can be accessed by thrift, that make Hive Meta store is inter operable with external systems. This gave an advantage for impala and Presto to use the existing infrastructure and build on top of it.

Hive gets the query in the format of HiveQL, parse it and convert that in to series of Map / Reduce Job.

How Impala & Presto works?

Both Presto and Impala leverages the Hive meta store engine and get the name node information. It then talk directly to the name node and hdfs file system, and execute the queries in parallel. It then merges and stream the result back to the user. The entire process happen on memory, there by it eliminate the latency of Disk IO that happen extensively during MapReduce job.

The comparison:

Hive

Advantage	Disadvantage
It’s been around 5 years. You could say it is matured and proven solution.	Since it is using MapReduce, It’s carrying all the drawbacks which MapReduce has such as expensive shuffle phase as well as huge IO operations
Runs on proven MapReduce framework	Hive still not support multiple reducers that make queries like Group By and Order By lot slower
Good support for user defined functions	Lot slower compare to other competitors.
It can be mapped to HBase and other systems easily

Cloudera Impala:

Advantage	Disadvantage
Lighting speed and promise near real time adhoc query processing.	No fault tolerance for running queries. If a query failed on a node, the query has to be reissued, It can’t resume from where it fails.
The computation happen in memory, that reduce enormous amount of latency and Disk IO	Still no UDF support
Open source, Apache licensed	custom SerDes not yet supported.

PrestoDB:

Advantage	Disadvantage
Lighting fast and promise near real time interactive querying.	It’s a new born baby. Need to wait and watch since there were some interesting active developments going on.
Used extensively in Facebook. So it is proven and stable.	As of now support only Hive managed tables. Though the website claim one can query hbase also, the feature still under development.
Open Source and there is a strong momentum behind it ever since it’s been open sourced.	Still no UDF support yet. This is the most requested feature to be added.
It is also using Distributed query processing engine. So it eliminates all the latency and DiskIO issues with traditional MapReduce.
Well documented. Perhaps this is the first open source software from Facebook that got a dedicated website from day 1.

what to watch next?

This is the most happening field in Big data analytic field as now. This blog contents may not be relevant after one month, since the amount of activity going on all these platforms. Some of the interesting stuff we can watch over is,

1. Hortonworks Stinger project : Hortonwork put their bet on Hive and they started an initiative to improve Hive 100X faster. They already delivered two milestones and working on their final phase. They aim to integrate Hive in to another opensource project called Apache Tez, which is again a distributed query engine.

2. Cloudera is also contributing much on Stinger project. It will be interesting to see their approach over Impala on it.

3. What will happen to Drill project, if Presto getting in to Apache Incubator (I’m sure it will be soon)

4. How popular Presto will grow.

Lets watch and see 🙂

Edit

Thanks Greg and Justin. Yes I was wrong about Impala License. I found it in their blog answer here and in the quora answer as well.

6 thoughts on “Hive, Impala and Presto – The War on SQL over Hadoop”

Justin Kestelyn (@kestelyn) says:

November 19, 2013 at 7:39 pm

Ananth,

You are misinformed about Impala licensing: It is not “proprietary software.” Rather, Impala is 100% open source and Apache Licensed.

Greg Rahn (@GregRahn) says:

November 20, 2013 at 3:08 am

For the record, Cloudera Impala is not “proprietary software”, it is 100% open source (already), Apache licensed and can be run without Cloudera Manager. Hopefully that is easier to accept =)
Source code: https://github.com/cloudera/impala

Big Data and related technologies | Parroty's Blog says:

January 16, 2014 at 11:01 am

[…] https://bigdatanerd.wordpress.com/2013/11/19/war-on-sql-over-hadoop/ […]

Jim Tommaney (@jtommaney) says:

April 10, 2014 at 12:44 pm

Hi Ananth,

There is one more combatant in the War On SQL over Hadoop, i.e. InfiniDB. We had a third party (Radiant Advisors) execute an independent benchmark comparing Impala, Hive, Presto, and InfiniDB. Summary chart is available here: http://bit.ly/1n93muT

A quick summary
InfiniDB supported syntax for all 10 queries, with queries running between 1.28 and 17.62 seconds.
Impala supported syntax for 7 of 10 queries, with queries running between 3.1 and 69.38 seconds.
Presto supported syntax for 9 of 10 queries, with queries running between 18.89 and 506.84 seconds.
Hive 0.11 supported syntax for 7/10 queries, with queries running between 102.59 and 277.18 seconds.
Hive 0.12 supported syntax for 7/10 queries, with queries running between 91.39 and 325.68 seconds.
Note that 3 of the 7 queries supported with Hive did not complete due to resource issues.

There is a webinar coming up on April 23rd, and the report itself can be downloaded (with registration) at http://www.infinidb.co/blog/open-source-sql-hadoop-query-engines-benchmark

Cheers,
Jim

- bigdatanerd says:
  
  April 10, 2014 at 1:43 pm
  
  Thanks Jim, Sure I will run through some samples and update more information here
  
kritika kapoor says:

March 11, 2016 at 4:30 am

I came across your post and found exact information which I was looking for. Thanks for such a useful article!! I found one more information rich and interesting tutorial while doing searches which I am sharing with you just go through this link –
http://bit.ly/24UArSh