Hive, Impala and Presto – The War on SQL over Hadoop

I feel the infant-elephant logo no longer suits Hadoop. It is well established and growing faster and stronger, yet while some people are keeping up with that pace, many still find the learning curve steep. To bridge that gap, there is enormous activity around bringing traditional SQL to Hadoop. Facebook started developing Hive around 2007 and open-sourced it at the end of 2008, and ever since, the popularity of SQL over Hadoop has kept growing. In October 2012, Cloudera announced Impala, which claims to be a near-real-time ad hoc big data query engine faster than Hive. Facebook jumped back into the picture and announced Presto last month. There is also an open source project called Apache Drill focusing on ad hoc analysis.

Let's take a look at the bigger picture of how these systems interact with the larger Hadoop ecosystem.

[Figure: Overall architecture of Hadoop, Hive and Impala]

In short, Hive converts a HiveQL query into a sequence of MapReduce jobs to produce the result, while Presto and Impala follow the distributed query engine approach inspired by Google's Dremel paper.

HiveQL:

One thing common to all three systems is that they all support the same query language, HiveQL (which may need a better, vendor-neutral name soon). Though HiveQL is based on SQL, it does not strictly follow the SQL-92 specification.
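As a concrete illustration, here is a minimal sketch of submitting a HiveQL query from Java over JDBC. It is a sketch only, assuming a HiveServer2 endpoint on localhost:10000, the Hive JDBC driver on the classpath, and a hypothetical page_views table; adjust the connection details, credentials and table names to your own setup.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQLExample {

    public static void main(String[] args) throws Exception {
        // Load the HiveServer2 JDBC driver (assumed to be on the classpath).
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "", "")) {
            Statement stmt = con.createStatement();
            // The query reads like plain SQL, but Hive compiles it into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}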

How does Hive work?

Hive maintains its own metadata store, where it keeps information such as schema definitions, table definitions, and the NameNode locations of the data behind each table.

There is a Hive metastore service that exposes all of this metadata. It can be accessed over Thrift, which makes the Hive metastore interoperable with external systems. This gave Impala and Presto the advantage of reusing the existing infrastructure and building on top of it.
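As a rough illustration of that interoperability, the sketch below uses the Hive metastore client to list tables and look up where a table's data lives. It assumes the Hive metastore client libraries and a hive-site.xml pointing at your metastore are on the classpath; the database and table names are placeholders.

import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class MetastoreLookup {

    public static void main(String[] args) throws Exception {
        // HiveConf picks up hive-site.xml (and hive.metastore.uris) from the classpath.
        HiveMetaStoreClient client = new HiveMetaStoreClient(new HiveConf());
        try {
            // List the tables registered in the "default" database.
            for (String table : client.getAllTables("default")) {
                System.out.println(table);
            }
            // Fetch one table's definition and print where its data lives in HDFS,
            // the same metadata that Impala and Presto read before talking to HDFS.
            Table pageViews = client.getTable("default", "page_views");
            System.out.println(pageViews.getSd().getLocation());
        } finally {
            client.close();
        }
    }
}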

Hive receives a query written in HiveQL, parses it, and converts it into a series of MapReduce jobs.

How do Impala and Presto work?

Both Presto and Impala leverage the Hive metastore to obtain table and NameNode information. They then talk directly to the NameNode and the HDFS file system, execute the query fragments in parallel, and merge and stream the results back to the user. The entire process happens in memory, thereby eliminating the disk I/O latency that MapReduce jobs incur so heavily.
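From a client's point of view the engines look much the same: in principle, the same statement can be sent to Presto through its JDBC driver. The sketch below is an assumption-heavy illustration rather than a recipe; it presumes a Presto coordinator on localhost:8080, the Presto JDBC driver (com.facebook.presto.jdbc.PrestoDriver) on the classpath, and the same hypothetical page_views table registered in the Hive catalog.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PrestoExample {

    public static void main(String[] args) throws Exception {
        // Assumed driver class and URL format: jdbc:presto://host:port/catalog/schema
        Class.forName("com.facebook.presto.jdbc.PrestoDriver");
        try (Connection con = DriverManager.getConnection(
                "jdbc:presto://localhost:8080/hive/default", "analyst", null)) {
            Statement stmt = con.createStatement();
            // Same query as before, but executed by Presto's in-memory distributed
            // engine instead of being compiled into MapReduce jobs.
            ResultSet rs = stmt.executeQuery(
                "SELECT page, COUNT(*) AS hits FROM page_views GROUP BY page");
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("hits"));
            }
        }
    }
}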

The comparison:

Hive

Advantages:
- It has been around for about five years, so you could call it a mature and proven solution.
- It runs on the proven MapReduce framework.
- Good support for user-defined functions (see the sketch after this list).
- It can be mapped to HBase and other systems easily.

Disadvantages:
- Since it uses MapReduce, it carries all of MapReduce's drawbacks, such as the expensive shuffle phase and heavy disk I/O.
- Hive still does not support multiple reducers for operations such as GROUP BY and ORDER BY, which makes those queries a lot slower.
- It is a lot slower than its newer competitors.
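Since user-defined functions come up on both sides of the comparison (Hive supports them, Impala and Presto do not yet), here is a minimal sketch of what a classic Hive UDF looks like, assuming the hive-exec library is on the classpath; the class name and function are invented for illustration.

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

/**
 * A trivial, hypothetical Hive UDF that lower-cases a string column.
 */
public class LowerCaseUDF extends UDF {

    public Text evaluate(Text input) {
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toLowerCase());
    }
}

Once packaged into a jar, it would typically be registered from the Hive shell with ADD JAR followed by CREATE TEMPORARY FUNCTION my_lower AS 'LowerCaseUDF'; and then used like any built-in function.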

Cloudera Impala:

Advantages:
- Lightning fast, promising near-real-time ad hoc query processing.
- The computation happens in memory, which removes an enormous amount of latency and disk I/O.
- Open source and Apache licensed.

Disadvantages:
- No fault tolerance for running queries: if a query fails on a node it has to be reissued; it cannot resume from where it failed.
- Still no UDF support.
- Custom SerDes are not yet supported.

PrestoDB:

Advantages:
- Lightning fast, promising near-real-time interactive querying.
- Used extensively inside Facebook, so it is proven and stable.
- Open source, with strong momentum behind it ever since it was open-sourced.
- It also uses a distributed query processing engine, so it avoids the latency and disk I/O issues of traditional MapReduce.
- Well documented; it is perhaps the first open source project from Facebook to get a dedicated website from day one.

Disadvantages:
- It is a newborn. We need to wait and watch, though there is plenty of interesting, active development going on.
- As of now it supports only Hive-managed tables. Though the website claims HBase can be queried as well, that feature is still under development.
- Still no UDF support; this is the most requested feature.

What to watch next?

This is one of the most active areas in big data analytics right now. Given the amount of activity across all these platforms, the contents of this post may not stay current for more than a month. Some of the interesting things to watch are:

1. The Hortonworks Stinger initiative: Hortonworks has placed its bet on Hive and started an initiative to make Hive 100x faster. They have already delivered two milestones and are working on the final phase, which aims to run Hive on another open source project, Apache Tez, a distributed DAG execution framework.

2. Cloudera is also contributing to the Stinger project. It will be interesting to see how they position that work relative to Impala.

3. What will happen to the Apache Drill project if Presto enters the Apache Incubator (which I am sure will happen soon)?

4. How popular Presto will become.

Let's watch and see 🙂

Edit

Thanks, Greg and Justin. Yes, I was wrong about the Impala license. I found the clarification in their blog answer here and in a Quora answer as well.

MapReduce – Running MapReduce on the Windows File System – Debugging MapReduce in Eclipse

The distributed nature of the Hadoop MapReduce framework makes debugging a little harder. Often we want to test our MR jobs on a small amount of data before deploying them. There are some good tutorials on configuring Hadoop development with Eclipse, but the major concern is that, because of the nature of HDFS, it is hard to attach a debugger in a Windows environment. This is a little hack that makes Hadoop take its input from the Windows file system and run the MapReduce job locally, which is a faster and more flexible way of developing.

Let's extend LocalFileSystem and override it to work with the Windows file system:


package org.ananth.learning.fs;

import java.io.IOException;

import org.apache.hadoop.fs.LocalFileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

/**
 * A LocalFileSystem variant that tolerates the POSIX permission calls that
 * fail on Windows, so a MapReduce job can read and write the local file system.
 */
public class WindowsLocalFileSystem extends LocalFileSystem {

    public WindowsLocalFileSystem() {
        super();
    }

    @Override
    public boolean mkdirs(final Path path, final FsPermission permission)
            throws IOException {
        // Create the directory first, then apply the permission separately,
        // because it is the permission call that breaks on Windows.
        final boolean result = super.mkdirs(path);
        this.setPermission(path, permission);
        return result;
    }

    @Override
    public void setPermission(final Path path, final FsPermission permission)
            throws IOException {
        try {
            super.setPermission(path, permission);
        } catch (final IOException e) {
            // Can't help it on Windows, hence ignoring the IOException.
            System.err.println("Ignoring IOException while setting permission for path \""
                    + path + "\": " + e.getMessage());
        }
    }
}

Then all you need to do in your driver class is:


package org.ananth.learning.mapper;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MutualfriendsDriver extends Configured implements Tool {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new MutualfriendsDriver(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
        // Point Hadoop at the local (Windows) file system instead of HDFS
        // and run the job with the local job runner, all inside one JVM.
        conf.set("fs.default.name", "file:///");
        conf.set("mapred.job.tracker", "local");
        conf.set("fs.file.impl", "org.ananth.learning.fs.WindowsLocalFileSystem");
        conf.set("io.serializations",
                "org.apache.hadoop.io.serializer.JavaSerialization,"
                + "org.apache.hadoop.io.serializer.WritableSerialization");

        Job job = new Job(conf, "Your Job name");

        // Set your Mapper and Reducer for the job

        // Set your input/output key and value classes

        FileInputFormat.addInputPath(job, new Path("input"));
        FileOutputFormat.setOutputPath(job, new Path("output"));
        return job.waitForCompletion(true) ? 0 : 1;
    }
}

The input and output paths are relative to the project root directory, so place your sample input files in an input folder there. Now you are all set: you can run the MR job on your local Windows machine.
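One related note: the output directory must not already exist when the job starts, since FileOutputFormat refuses to overwrite an existing path, so delete the output folder (or point the job at a fresh path) before re-running the job from Eclipse.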

TDD – An Introduction to Test Driven Development

[Figure: Test Driven Development cycle]

In my previous posts I've covered some automated integration testing frameworks such as Arquillian and Selenium, and explained how important solid automated testing is for continuous code refactoring and for evolving a technology architecture around dynamic business needs. In this post I'm going to share some of my views on, and understanding of, Test Driven Development (TDD).

What is TDD?

Kent Beck, who created JUnit (rated as one of the top five tools ever made for Java technology), also rediscovered and popularised Test Driven Development. According to the Wikipedia definition, TDD is a development process that relies on the repetition of a very short development cycle: the developer writes a test case showing how the system fails, and then writes or refactors the code to make it pass.

 

TDD Process:

Many people misunderstand TDD as being only about writing test cases. TDD is a process that differentiates software engineering from plain programming. It has the following stages.

1. Plan:

Read the requirement and the business use case, then plan what method you are going to implement and how you are going to implement it.

2. Write a test case to fail:

This is very important. Once you have done the planning, don't jump straight into the implementation. Write test cases for the function that show the ways it can fail.

3. Implement the functionality:

Now write (or refactor) the implementation code to satisfy the requirement.

4. Write test cases to pass:

Now the method is fool-proof: we have covered every scenario in which it could fail. Run the test cases and see them all pass.

5. Repeat (1 – 4)

TDD basically enforces one of the very basics of software programming: "code for failure". If you are new to software programming, a typical method should look like this:

int x(int input) {

    // Precondition: what should we do if the input value is undesirable?

    // ... your business logic ...

    // Postcondition: did we get the desired result?

}

Let's take a simple example. We need to implement a divider function that takes two integers as input and produces the quotient as output, with two simple business validations:

1. The denominator should not be zero.

2. The result should not be a negative number (which means neither of the inputs may be negative).

Let's do TDD.

1. Plan:

As we have two business validations, we need a custom exception class.

We will also write a simple method that takes two integer parameters and performs the division.

2. Write test cases to fail:

Let's write the basic classes now.

DataException.java (the custom exception class):


package org.ananth.learning.tdd;

/**
 * The is the custom data exception
 * @author Ananth
 *
 */

public class DataException extends RuntimeException{

 public DataException(String message) {
 super(message);
 }

}

SimpleDivider.java (the initial implementation):
package org.ananth.learning.tdd;

/**
 * Simple divider implementation
 * @author Ananth
 *
 */

public class SimpleDivider {

 /**
  * Takes integers a and b and returns a divided by b.
  * @param a the numerator
  * @param b the denominator
  * @return the quotient
  */
 public Integer divide(Integer a, Integer b) {

return a/ b;

 }

}

Now for the test cases that should fail:


package org.ananth.learning.tdd.test;

import static org.junit.Assert.*;

import org.ananth.learning.tdd.DataException;
import org.ananth.learning.tdd.SimpleDivider;
import org.junit.Test;

/**
 * Test methods for simple divider
 * @author Ananth
 *
 */
public class SimpleDividerTest {

 /**
 * Denominator Zero
 */
 @Test(expected = DataException.class)
 public void testZeroDivisor() {
 new SimpleDivider().divide(10, 0);
 }

 /**
 * Negative denominator and positive Numerator
 */
 @Test(expected = DataException.class)
 public void testNegativeDivisorA() {
 new SimpleDivider().divide(10, -2);
 }

 /**
 * Negative Numerator and positive denominator
 */

 @Test(expected = DataException.class)
 public void testNegativeDivisorB() {
 new SimpleDivider().divide(-10, 2);
 }

 /**
 * Negative Numerator and denominator
 */

 @Test(expected = DataException.class)
 public void testNegativeDivisorAB() {
 new SimpleDivider().divide(-10, -2);
 }


 /**
 * Actual Test to pass
 */
 @Test
 public void testDivisor() {
 assertEquals(new Integer(5),new SimpleDivider().divide(10, 2));

 }

}

Now if you run the test cases, you can see that all of them except the last one fail, because we have not yet built the implementation to handle failure.

3. Refactor the code:

Now I've refactored the implementation method to include preconditions that handle the failure cases:


public Integer divide(Integer a, Integer b) {

 if(b == 0) {
 throw new DataException("Can't allow zero as divisor");
 }

 if(a < 0 || b < 0) {
 throw new DataException("Values can't be in negative");
 }

 return a / b;

 }

Now all the preconditions are properly implemented and the expected exceptions are thrown.

4. See the tests pass:

Now you can rerun the test cases and see everything pass.

5. Repeat:

Take another method and repeat steps 1-4.

Happy TDD!!!!