I feel the logo of an infant elephant for Hadoop is not opt now. It is well established and growing faster and stronger. Some people getting along up to the speed and some find it hard to grow faster. To bridge that gap, there is enormous activity going on to bring traditional SQL over the Hadoop. Facebook started to develop Hive around 2007 and opensource it in the end of 2008. Ever since the popularity of SQL over Hadoop is growing. On October 2012, Cloudera announced Impala which claim to be near real time Adhoc bigdata query processing engine faster than Hive. Facebook again jump in to the picture and announced Presto last month. There is an open source project called Apache Drill also focusing on Adhoc analysis.
Lets take a look at the bigger picture how these system interacting with the larger Hadoop ecosystem.
In short, Hive converts the HiveQL query language in to sequence of MapReduce jobs to achieve the results, while Presto and Impala follow the distributed query engine processor inspired by Google Dremel paper.
One of the common thing one could found among all three systems are, it all support on common standard called HiveQL (need a better common name soon?). Though HiveQL is based on SQL, it’s not strictly support the SQL-92 specification.
How hive works?
Hive maintain it’s own metadata storage where it keep metadata information about schema definition, table definition, name node that contains the respective date etc.
There is Hive meta data storage client, that expose all meta data information as a service. It can be accessed by thrift, that make Hive Meta store is inter operable with external systems. This gave an advantage for impala and Presto to use the existing infrastructure and build on top of it.
Hive gets the query in the format of HiveQL, parse it and convert that in to series of Map / Reduce Job.
How Impala & Presto works?
Both Presto and Impala leverages the Hive meta store engine and get the name node information. It then talk directly to the name node and hdfs file system, and execute the queries in parallel. It then merges and stream the result back to the user. The entire process happen on memory, there by it eliminate the latency of Disk IO that happen extensively during MapReduce job.
|It’s been around 5 years. You could say it is matured and proven solution.||Since it is using MapReduce, It’s carrying all the drawbacks which MapReduce has such as expensive shuffle phase as well as huge IO operations|
|Runs on proven MapReduce framework||Hive still not support multiple reducers that make queries like Group By and Order By lot slower|
|Good support for user defined functions||Lot slower compare to other competitors.|
|It can be mapped to HBase and other systems easily|
|Lighting speed and promise near real time adhoc query processing.||No fault tolerance for running queries. If a query failed on a node, the query has to be reissued, It can’t resume from where it fails.|
|The computation happen in memory, that reduce enormous amount of latency and Disk IO||Still no UDF support|
|Open source, Apache licensed||custom SerDes not yet supported.|
|Lighting fast and promise near real time interactive querying.||It’s a new born baby. Need to wait and watch since there were some interesting active developments going on.|
|Used extensively in Facebook. So it is proven and stable.||As of now support only Hive managed tables. Though the website claim one can query hbase also, the feature still under development.|
|Open Source and there is a strong momentum behind it ever since it’s been open sourced.||Still no UDF support yet. This is the most requested feature to be added.|
|It is also using Distributed query processing engine. So it eliminates all the latency and DiskIO issues with traditional MapReduce.|
|Well documented. Perhaps this is the first open source software from Facebook that got a dedicated website from day 1.|
what to watch next?
This is the most happening field in Big data analytic field as now. This blog contents may not be relevant after one month, since the amount of activity going on all these platforms. Some of the interesting stuff we can watch over is,
1. Hortonworks Stinger project : Hortonwork put their bet on Hive and they started an initiative to improve Hive 100X faster. They already delivered two milestones and working on their final phase. They aim to integrate Hive in to another opensource project called Apache Tez, which is again a distributed query engine.
2. Cloudera is also contributing much on Stinger project. It will be interesting to see their approach over Impala on it.
3. What will happen to Drill project, if Presto getting in to Apache Incubator (I’m sure it will be soon)
4. How popular Presto will grow.
Lets watch and see 🙂