Data lake: A brief history
The term "big data lake" was coined by James Dixon, the CTO of Pentaho. Though initially coined to contrast with the data mart, it soon became a very popular term in the big data world. PwC subsequently suggested that data lakes could end the data silos that are a major concern for enterprises. Given the maturity of the concept and the technology, surprisingly few projects have been successfully deployed as big data lakes. In the rush to get their hands on big data and market themselves as big data companies, many organizations started dumping all their data into HDFS and, over time, forgot about it. The key to success is not dumping all the data, but creating a meaningful data lake that increases the speed of extracting value from it.
A data lake is not just a storage or processing unit; it is a process to unleash the value of data.
Why do we need a big data lake?
Every industry has a potential big data problem. In the digital era, with social media and IoT technologies, customers interact across a variety of channels, and those interactions create what we call big data. Building a 360-degree view and establishing a single source of truth about their clients is a nightmare for most companies. The importance of the data lake can be summarized by the quote below:
Every product and service will go digital, creating vast quantities of data which may be more valuable than the products themselves. - Steve Prentice (Gartner Fellow)
The life cycle of a data lake
The data lake life cycle is iterative in nature. A typical data lake follows a three-step process that is repeated continuously.
1. Data source integration:
The data lake process starts with data ingestion. Ingestion is always done at the very granular level of an event, without any assumptions about the data; the ingestion process is often referred to as an "as-it-happened mirror" of the data source. The nature of big data, with its volume, variety, and velocity, increases the complexity of data integration: we no longer have traditional RDBMSs alone as data sources. Data lake creation starts with a handful of identified business-critical data sources, with more data sources added later. This simplifies the otherwise complex data ingestion process.
Complexities of the data ingestion process:
When we add a new data source, we may not know the business processes that act on it. Optimizing data storage will be a challenge, since we may not know the access patterns up front. Data sources may also include complex data types that are hard to convert to a relational structure up front without knowing the significance of the data.
Iterative data ingestion pattern:
The data ingestion process also includes data de-duplication and data enrichment. Business process identification yields the data access patterns, and those findings are looped back into the ingestion strategy to improve the de-duplication and enrichment steps. The initial output of the ingestion process is a set of loosely coupled, complex entities, which over time are enhanced into denormalized, flattened, enriched, and easily queryable datasets.
Technologies: Apache Kafka, Apache NiFi, Apache Flume, Apache Sqoop, and Druid.
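The de-duplication and enrichment step above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production pipeline: the event fields (`customer_id`, `action`, `ts`) and the `customer_segment` enrichment lookup are hypothetical, and a real ingestion layer would run inside one of the tools listed above.

```python
import hashlib
import json

def event_key(event):
    """Stable fingerprint of a raw event, used for de-duplication."""
    return hashlib.sha256(json.dumps(event, sort_keys=True).encode()).hexdigest()

def ingest(raw_events, enrichment_lookup):
    """De-duplicate raw events, then enrich them from a reference lookup."""
    seen = set()
    enriched = []
    for event in raw_events:
        key = event_key(event)
        if key in seen:           # drop exact duplicates; the "as it happened" record is kept once
            continue
        seen.add(key)
        record = dict(event)      # keep the raw fields untouched
        # Enrichment: attach reference data keyed on a field of the event.
        record["customer_segment"] = enrichment_lookup.get(event.get("customer_id"), "unknown")
        enriched.append(record)
    return enriched

events = [
    {"customer_id": "c1", "action": "click", "ts": 1},
    {"customer_id": "c1", "action": "click", "ts": 1},   # duplicate delivery
    {"customer_id": "c2", "action": "purchase", "ts": 2},
]
segments = {"c1": "retail", "c2": "enterprise"}
print(ingest(events, segments))
```

In the iterative pattern described above, the enrichment lookup itself would be refined each time business process discovery reveals new access patterns.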
2. Business process discovery:
Business process discovery is the most important step in data lake creation. The true value of a data lake is realized only if business process discovery can be achieved without great effort. The discovery process starts with exploratory analysis: querying the data to identify the hidden value in it. Data stewards and business analysts also play a vital role here, exploring the data and both providing and gaining valuable insights. The exploratory analysis tool is often an MPP query engine with a SQL-like abstraction. Exploratory analysis can be performed to achieve the following objectives:
- Validate a business process theory
- Discover a new business process
- Derive business intelligence via descriptive analysis
- Serve as a foundational platform for predictive analytics
Technologies: Impala, Presto, Drill and Apache Pig
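The exploratory pattern is plain SQL over granular events, regardless of engine. The sketch below uses an in-memory SQLite database purely as a stand-in for an MPP engine such as Presto or Impala; the `events` table and its columns are hypothetical, but the descriptive-analysis query would look the same against a real engine.

```python
import sqlite3

# In-memory SQLite stands in for an MPP query engine (Presto, Impala, ...);
# the exploratory SQL pattern is identical. Table and columns are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer_id TEXT, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [("c1", "web", 10.0), ("c1", "mobile", 5.0), ("c2", "web", 7.5)],
)

# Descriptive analysis over granular events: interactions and spend per channel.
rows = conn.execute(
    "SELECT channel, COUNT(*) AS interactions, SUM(amount) AS total "
    "FROM events GROUP BY channel ORDER BY total DESC"
).fetchall()
for channel, interactions, total in rows:
    print(channel, interactions, total)
```

Queries like this are how an analyst validates a business process theory or discovers a new one: the raw, granular events are still available, so any hypothesis can be tested directly against them.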
3. Serving data products with a data insights store:
Once a business process is identified, we need to create a data store that can easily serve as the data layer of an application. The data insights store is closely related to a data mart: it tends to be highly normalized and optimized for the access pattern of a particular business process. Though the data insights store is tightly coupled to a business process, it is important to identify "conformed dimensions" across business processes. This significantly reduces the computation each business process needs to derive its insights. It is also recommended to store the roll-up dimension relationships alongside the data insights store, to avoid duplicate computations.
Technologies: Apache HBase, Elasticsearch, and other NoSQL storage engines.
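The roll-up idea above can be sketched with plain Python dictionaries. This is a toy illustration under assumed data: the `date` and `region` fields play the role of conformed dimensions shared across business processes, and the pre-computed aggregates are what would be persisted next to the insights store.

```python
from collections import defaultdict

# Granular facts; "date" and "region" stand in for conformed dimensions
# shared by several business processes (names and values are hypothetical).
facts = [
    {"date": "2017-01-01", "region": "EU", "sales": 100},
    {"date": "2017-01-01", "region": "US", "sales": 250},
    {"date": "2017-01-02", "region": "EU", "sales": 80},
]

def roll_up(facts, dims, measure):
    """Pre-aggregate a measure along a subset of conformed dimensions."""
    totals = defaultdict(int)
    for f in facts:
        totals[tuple(f[d] for d in dims)] += f[measure]
    return dict(totals)

# These aggregates would be stored alongside the insights store, so each
# business process reuses them instead of recomputing from raw facts.
by_region = roll_up(facts, ["region"], "sales")
by_date = roll_up(facts, ["date"], "sales")
print(by_region)
print(by_date)
```

Because both roll-ups are derived from the same conformed dimensions, any business process that shares those dimensions can reuse the stored aggregates directly.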
Data warehouse vs. data lake:
| No | Data warehouse | Big data lake |
|----|----------------|---------------|
| 1 | The process starts with business process identification, often driven by data stewards and business owners, with certain assumptions about the data and the business. | No assumptions are made about the data. We start collecting data at a granular level, as it happened; business process discovery then happens based on the data, with input from data stewards and business owners. |
| 2 | Database schema evolution is very hard, given the nature of relational systems. | Complex data types are supported, and rebuilding relationships is much easier. |
| 3 | Very static, since the business process drives the design. | Very dynamic, since business processes are identified based on the data. |
| 4 | Roll-up and drill-down analysis is harder, since the design may sacrifice granularity in order to reduce the complexity of the data. | Exploratory analysis is much simpler, since data is collected at a granular level. |
| 5 | Serves predefined business needs. | Ignites innovation and new business opportunities. |
| 6 | Limited support for complex data types. | Supports structured, semi-structured, and unstructured data. |
Will the big data lake replace the traditional data warehouse?
The politically correct answer is that the big data lake is complementary to the data warehouse. This is true to a certain extent: many companies have well-established data warehouse systems, while big data systems are still young but growing rapidly, so the two will grow hand in hand for some time. Sooner or later, though, enterprises will mature enough to handle big data lakes, and maintaining two systems will become redundant. One could argue that the data warehouse should be one of the data sources for the big data lake, but that is a flawed design: assumptions were already made about the data when the warehouse was built. I believe the big data lake will eventually make the data warehouse redundant, while data warehouse concepts like dimensional modeling will be well adopted by big data lake systems. The big data lake is simply another evolution of the data warehouse.