Over the last couple of years, people have been trying to conceptualize big data and its business impact. Companies like Amazon and Netflix pioneered this space and delivered some of the best products to their customers. We should thank Amazon for bringing data-driven business to the end-consumer market. The big data paradigm has now evolved from a conceptual understanding into real-world products. All the major retailers, dot-com companies, and enterprise products focus on leveraging big data technologies to produce actionable insights and innovative products. These systems have matured to the extent that they can potentially replace traditional data warehousing solutions.
How did this big data shift happen?
It comes down to a fundamental rethinking of how we store and analyze data. The shift starts the moment you treat data as:
- Immutable in nature
- Atomic in nature: one event log entry is independent of other events
Traditional databases were designed to store the current state of an entity (with update-in-place semantics and data structures underneath to support them). This made traditional RDBMS systems a poor fit for the big data paradigm. Numerous NoSQL solutions emerged to address this problem (see my earlier blog post on HDFS vs RDBMS).
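To make the contrast concrete, here is a minimal sketch (the event schema and names are my own illustrative invention, not from any specific system): instead of mutating a "current state" row the way an RDBMS would, we append immutable, atomic events to a log and derive the current state by folding over the full history.

```python
from dataclasses import dataclass
from functools import reduce

# Each event is an immutable, self-contained fact (frozen=True forbids mutation).
@dataclass(frozen=True)
class Event:
    user_id: str
    action: str   # "deposit" or "withdraw"
    amount: int

# An append-only log: events are only ever added, never updated in place.
log = [
    Event("alice", "deposit", 100),
    Event("alice", "withdraw", 30),
    Event("alice", "deposit", 50),
]

def current_balance(events, user_id):
    """Derive the current state by replaying the immutable event history."""
    def step(balance, e):
        if e.user_id != user_id:
            return balance
        return balance + e.amount if e.action == "deposit" else balance - e.amount
    return reduce(step, events, 0)

print(current_balance(log, "alice"))  # 120
```

Because events are never updated or deleted, a bug in the derivation logic can always be fixed by recomputing from the raw log, which is exactly the property the batch layer below relies on.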
Now we need an architectural pattern to address our big data problem. Nathan Marz proposed the Lambda Architecture for big data. In this two-part blog post, I'll give a brief overview of the Lambda Architecture and its layers. In the second post, I'll walk you through my thought process for designing a Lambda Architecture with the Cloudera Hadoop Distribution (CDH).
“Lambda” in Lambda Architecture:
I’m not sure of the reason behind the name Lambda Architecture. But I feel “Lambda” fits perfectly here, because the lambda was the shield emblem the Spartans used while handling a large volume, variety, and velocity of opponents. (Yeah, the 300 movie impact 🙂 )
Picture : Lambda Architecture
Layers in Lambda Architecture:
The Lambda Architecture has three main layers:
- Batch Layer
- The storage engine for the immutable, atomic events (the master data)
- The batch layer is a fault-tolerant, replicated storage engine that prevents data loss
- The batch layer supports running batch jobs on top of it and produces periodic batch views for the serving layer, which end services consume and query
- Speed Layer
- This is a real-time processing engine.
- The speed layer won’t persist any data or provide any permanent storage engine. If raw data processed via the speed layer needs to be persisted, it is persisted in the master data
- The speed layer processes data as it arrives, or at short fixed intervals, and produces real-time views for the serving layer
- Serving Layer:
- The serving layer gets updated from the batch layer and the speed layer, either periodically or in real time
- The serving layer should combine results from both the speed layer and the batch layer to provide a unified result
- The serving layer is usually a key/value store or in-memory storage engine with high availability
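The serving layer's "unified result" can be sketched in a few lines. This is a toy illustration under my own assumptions (page-view counts, in-memory dicts standing in for the key/value views; real deployments would use stores like HBase or an in-memory engine): the batch view is complete but slightly stale, the real-time view covers only events since the last batch run, and a query merges the two.

```python
# Hypothetical views; names and data are illustrative, not from the source.
batch_view = {"page_a": 1000, "page_b": 500}   # periodic batch view over master data
realtime_view = {"page_a": 7, "page_c": 3}     # speed-layer view of recent events

def query_pageviews(page):
    """Unified result = batch result + speed-layer delta for the same key."""
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

print(query_pageviews("page_a"))  # 1007 (batch 1000 + recent 7)
print(query_pageviews("page_c"))  # 3 (seen only by the speed layer so far)
```

When the next batch run finishes, its view absorbs the recent events and the corresponding real-time entries are discarded, so the merge stays correct over time.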