Over the last couple of years the innovative tools that has emerged around big data technologies were immense.  Each tool has its own merits and demerits. Each tool need fair amount of expertise and infrastructure management since it is going to deal with large amount of data. One architecture philosophy I always like is “Keep it Simple”. The primary motive behind this design is to make sure there should be only one enterprise data hub management software to fit Lambda Architecture in to it. These are my thought process of how we can fit Lambda architecture with in Cloudera enterprise data hub.

For brief introduction about Lambda Architecture, Please see part-1 of Lambda Architecture.

Lets walk through each layers in Lambda Architecture and examine what tool we can use with in Cloudera distribution.







Data Ingestion Layer:

Though Lambda architecture doesn’t speak much about Data Source and Data Ingestion Layer, during my design I found understanding this layer is very important.

Before going to choose the tools for data ingestion, it is important to understand the nature of data sources. We can broadly classify data sources in to four categories.

1. Batch files

Batch files are periodically injected data in to a file system. In practical sense we used to consume it as a large chunk of data periodically (typically once in a day). Example of these files like XML or JSON files from external or internal systems.

DB Data:

 Traditional warehouse and transaction data usually been stored in to a RDBMS. This will be well structured data.

Rotating Log files:

Rotating Log files usually machine generated data which keep appending immutable data in to the file system. In most of the use cases it will be either structured or semi-structured data

Streaming Data:

I would say the modern data source. Streaming data usually accessed by a fire hose API, which keep injecting the data as it comes. A good example would be Twitter fire hose API.

Technology choice:

Apache Flume for Rotating Log files, batch files and streaming data.

Apache Sqoop for getting data from databases.

Speed Layer:

Technology choice: Spark Streaming and spark eco system

Spark is phenomenal with it’s in memory computing engine. One could argue in favor of Apace Storm. Though I’ve not used Apache Storm much, Spark stands out it’s concept of “data local” computing. The amount of innovation with in Spark core context and Spark RDD made Spark is a perfect fit for Speed Layer (Mahout recently announced they going to rewrite Mahout with spark Eco system).

Batch Layer:

Technology Choice:

Master data: Apache Hadoop Yarn – HDFS with Apache Avro & Parquet

Batch View processing: Apache Pig for data analytics. Apache Hive for data warehousing and Cloudera Impala for fast prototyping and adhoc queries. Apache Mahout of machine learning and predictive analysis

Apache Yarn is a step ahead of Hadoop Eco system. Its clear segregation of map reduce programming paradigm and HDFS made other programming paradigms play on top of it. It is important to move to Yarn to keep the innovation open on your big data enterprise data hub as well.

Data serialization is an important aspect when we maintain the big data system. It is important to force a schema validation before storing data. This will reduce surprises when we do analytic on top of it and save lot of development time.

Columnar Storage with Parquet. Hadoop designed to read row by row. The master data design will be a de-normalized data design hence there will be N number of columns in a row. When we do analysis we don’t want all the data to be loaded in the memory. We need those data which we really required. Parquet enables us to load only the data we require in memory to help increase the processing speed and efficient memory utilization.  Parquet has out of the box integration with Avro as well.

Apache Spark, Apache Pig, Apache Hive and Impala having out of the box integration with Parquet as well.

Servicing Layer:

Technology Choice: Apache HBase

The only NoSQL solution on Hadoop eco system. This is a bit of tough choice. Servicing layer need to be highly available since all the external consumer facing application will access it. HBase Master / Slave architecture make it little tough and it need a lot of monitoring. Region Failure, MTTR (Mean time to recover), high availability of Master node are some of the concerns while maintain HBase. There are a lot of activity happening to make HBase master highly available and improve MTTR.


Lambda architecture – Part 1 – An Introduction to Lambda Architecture

                                                   In last couple of year people were trying to conceptualize big data and business impacts of it. Companies like Amazon and Netflix pioneered in this space and delivered some of the best products to its customers. We should thank to Amazon for bringing in data driven business to end consumer market.   The big data paradigm emerged from a conceptual understanding to real world products now. All the major retailers, dot-com companies and enterprise products focus on leveraging big data technologies to produce actionable insights and innovative products out of it. The system emerged to the extends potentially replace traditional data warehousing solutions.

How this big data shift happened?

It is fundamental design thinking of how we store and analyses data. The moment you start to think that the data is,

  1. Immutable in nature
  2. Atomic in nature, that one event log is independent of another events.

Traditional databases were designed to store the current state of an event (with its update nature and data structure in beneath to support it). This made traditional RDBMS systems not fit in to the big data paradigm. There are numerous NoSQL solutions started to flow in to address the problem (See my earlier blog post on HDFS vs RDBMS).

Now we need an architectural pattern to address our big data problem. Nathan Marz  proposed Lambda Architecture for big data. In this two part blog post I’m going to brief overview of Lambda architecture and its layers. In the second post I’m going to walk you through my thought process of designing Lambda Architecture with Cloudera Hadoop Distribution (CDH).

“Lambda” in Lambda Architecture:

I’m not sure the reason behind the name Lambda Architecture. But I feel “Lambda” perfectly fit here because “Lambda” is a  shield pattern used by Spartans to handle large volume, variety and velocity of opponents. (Yeh 300 movie impact 🙂 )














                                                                                 Picture : Lambda Architecture

Layers in Lambda Architecture:

Lambda architecture has three main layers

  1. Batch Layer
    1. The storage engine to store immutable, atomic events
    2. The batch layer is a fault tolerance and replicated storage engine to prevent data lose
    3. The batch layer support running batch jobs on top of it and produce periodic batch views to the serving layer for the end services to consume and query
  2. Speed Layer
    1. This is a real-time processing engine.
    2. Speed layer won’t persist any data or provide any permanent storage engine. If raw data processing via speed layer need to be persisted it will persist in master data.
    3. Speed layer process data as it comes in or with specific short time interval and produce real-time view in to servicing layer
  3. Servicing Layer:
    1. Servicing layer will get updated from batch layer and speed layer either periodic or in real-time
    2. Servicing layer should combine results from both speed layer and batch layer to provide unified result.
    3. Servicing Layer usually a Key / Value storage and in-memory storage engine with high availability.