Here are some of the best practices to follow while developing MapReduce programs. (You can see why we need Hadoop here.)
1. Use Combiner:
A Combiner is a mini-reducer. Most of the time the Combiner code is the same as the Reducer code; a Combiner also extends your Reducer class to implement the reduce functionality. The major advantage of a combiner is that it reduces data on each map node itself, which cuts network I/O and decreases execution time.
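The effect is easy to see with a toy word-count simulation (plain Python, not Hadoop code): pre-aggregating on the map side shrinks the number of records that must cross the network.

```python
from collections import Counter

# One map task's raw output: a (word, 1) pair per occurrence.
map_output = [("hadoop", 1), ("spark", 1), ("hadoop", 1), ("hadoop", 1)]

# Without a combiner: every single pair is shuffled to the reducer.
records_without_combiner = len(map_output)

# With a combiner: pairs are summed locally first, then shuffled.
combined = Counter()
for word, count in map_output:
    combined[word] += count
records_with_combiner = len(combined)

print(records_without_combiner, records_with_combiner)  # 4 vs 2 records shuffled
```

In a real job you would wire this in with `job.setCombinerClass(...)`; the reducer still sees correct totals because summing is associative and commutative.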
2. Use Data Compression:
Hadoop supports common compression codecs like DEFLATE, gzip, bzip2, LZO, and Snappy. Since Hadoop splits files into blocks, it is always best to use a splittable compression format like bzip2. Hadoop also supports the Avro file format, which standardizes object serialization. The SequenceFile format is a good fit for compressed input files. Compression also helps handle the small-files problem in Hadoop.
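As an illustration only, here are two of those codecs applied to a repetitive block of text using the Python standard library (splittability is a property of the file format and is not demonstrated here):

```python
import bz2
import gzip

# Repetitive text compresses well under both codecs.
data = b"hadoop mapreduce best practices " * 1000

gz = gzip.compress(data)
bz = bz2.compress(data)

# Both codecs round-trip losslessly and shrink the input dramatically.
assert gzip.decompress(gz) == data
assert bz2.decompress(bz) == data
print(len(data), len(gz), len(bz))
```

The trade-off in practice: bzip2 compresses harder and is splittable, but costs more CPU; Snappy/LZO are faster but not splittable on their own, which is why they are often paired with SequenceFiles.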
3. Distributed Cache: use it only for small files.
The distributed cache is a way of distributing side data. You often need lookup data to perform a computation: for instance, using an employee ID to write the employee name in the output, or using the IP address in an Apache log to get the city or country name. If the distributed-cache file is large, HDFS will store it across blocks, and it will be time-consuming to read that data just to perform a lookup.
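The lookup pattern looks like this toy map-side join (plain Python; the IPs, countries, and log lines are hypothetical). The small dictionary plays the role of the file you would ship to every node via the distributed cache:

```python
# Small side-data table: the kind of file you'd put in the distributed cache.
ip_to_country = {"10.0.0.1": "US", "10.0.0.2": "IN"}

log_lines = [
    "10.0.0.1 GET /index.html",
    "10.0.0.2 GET /about.html",
]

def mapper(line):
    ip, method, path = line.split()
    # Enrich each record with a local lookup instead of a second MapReduce job.
    return (ip_to_country.get(ip, "unknown"), path)

print([mapper(line) for line in log_lines])
```

Because every map task holds the whole table in memory, this only works when the side data is small; that is exactly why the cache should be reserved for small files.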
4. Choose a larger HDFS block size:
The HDFS specification recommends a block size between 64 MB and 512 MB. Hadoop is designed to work on large amounts of data, reducing disk seek time and increasing computation speed. So always set the HDFS block size large enough to allow Hadoop to compute effectively.
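The arithmetic behind this advice: a larger block size means fewer input splits, hence fewer map tasks and less per-task scheduling overhead. A quick sketch (the 10 GB file size is a made-up example):

```python
# A hypothetical 10 GB input file.
file_size_mb = 10_240

def num_splits(block_size_mb):
    # Ceiling division: each split maps to roughly one HDFS block.
    return -(-file_size_mb // block_size_mb)

# Doubling the block size halves the number of map tasks to schedule.
print(num_splits(64), num_splits(128), num_splits(256))  # 160 80 40
```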
5. Set Reducers to zero if you are not using them:
Sometimes we don't really need reducers, for example when filtering or reducing noise in data. When not using a Reducer, always make sure to set the number of reducers to zero, since sorting and shuffling are expensive operations.
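In Hadoop this is `job.setNumReduceTasks(0)`, which makes the job map-only and writes map output straight to HDFS. A toy simulation of the idea (the `sorted()` call stands in for the sort/shuffle phase you get to skip):

```python
records = ["ok 1", "noise", "ok 2", "noise", "ok 3"]

def run_job(data, num_reducers):
    # Map phase: a pure filter, no aggregation needed.
    mapped = [r for r in data if r.startswith("ok")]
    if num_reducers == 0:
        return mapped          # map output written directly; no sort/shuffle
    return sorted(mapped)      # otherwise every record pays the shuffle cost

print(run_job(records, 0))
```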
6. Chain the Jobs:
The fundamental idea of MapReduce is modularity: modularize your problem and think of solving it the MapReduce way. Chain your jobs so that in a complex problem, if a failure happens midway, you can still resume from the last completed job. Chaining also simplifies the problem and makes it easier to solve.
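The pattern, sketched as two tiny "jobs" in plain Python: each stage reads only the previous stage's output, so if stage 2 fails you can rerun it from stage 1's persisted result rather than starting over.

```python
def job1_tokenize(lines):
    # Stage 1: split lines into words.
    return [word for line in lines for word in line.split()]

def job2_count(words):
    # Stage 2: count word occurrences from stage 1's output.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

stage1_output = job1_tokenize(["a b a", "b c"])  # imagine this written to HDFS
stage2_output = job2_count(stage1_output)        # restartable from stage 1's files
print(stage2_output)
```

In real Hadoop code the same structure is expressed with `JobControl` and `ControlledJob` dependencies, or simply by running dependent jobs in sequence from the driver.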
7. Always write unit test and run in a small data set:
This is a best practice in any kind of programming. Hadoop comes with good support for unit testing your Mapper as well as your Reducer (for example, via MRUnit).
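The idea in miniature (plain Python, not MRUnit): test the map logic as an ordinary function on a tiny, hand-checked input before ever touching a cluster.

```python
def wordcount_mapper(line):
    # The map logic under test: lowercase each word, emit (word, 1).
    return [(word.lower(), 1) for word in line.split()]

# Assertions on a hand-checked input catch logic bugs cheaply and early.
assert wordcount_mapper("Hadoop hadoop rocks") == [
    ("hadoop", 1), ("hadoop", 1), ("rocks", 1)
]
print("mapper unit test passed")
```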
8. Choose Number of Mapper and Reducer wisely:
The general rule of thumb to choose the number of mappers and reducers is:
Total mapper or reducer = Number of Nodes * maximum number of tasks per node
maximum number of tasks per node = number of processors per node – 1 (since the DataNode and TaskTracker will take one processor)
Say we have 50 nodes and each node has 4 processors; then
total number of mappers or reducers = 50 * (4 – 1) = 150.
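The same rule of thumb as a small function, reproducing the worked example:

```python
def max_tasks(nodes, processors_per_node):
    # Reserve one processor per node for the DataNode and TaskTracker daemons.
    tasks_per_node = processors_per_node - 1
    return nodes * tasks_per_node

print(max_tasks(50, 4))  # 50 * (4 - 1) = 150
```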