Data engineering and continuous delivery:
We are witnessing the evolution of the web from Web 2.0, with its social engagement, to self-intelligent, data-driven applications. Whether it is a retail app, a CRM or a healthcare system, every application will be driven by data. The quest to provide a personalized experience increases the adoption of big data.
As the adoption of big data grows, so does the complexity of the data pipeline. This change of strategy increases the consumption of data from many sources, produced internally as well as externally. A reliable and agile data pipeline is the backbone that lets an organization move quickly and win clients. Continuous data pipeline principles are more important than ever in data engineering.
The current challenges with data pipelines:
Cascading system failure:
A data pipeline is a continuous delivery system that follows the principles of workflow patterns. Any faulty behaviour of an upstream component can potentially affect every component downstream of it. This leads to cascading system failures and a bad user experience.
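To make the cascading effect concrete, here is a minimal sketch that walks a job graph and lists everything downstream of a failed job. The job names and the DOWNSTREAM mapping are hypothetical and only serve as an illustration; a real pipeline would read this graph from its scheduler.

```python
from collections import deque

# Hypothetical pipeline: each job maps to the jobs that consume its output.
DOWNSTREAM = {
    "ingest_orders": ["clean_orders"],
    "clean_orders": ["daily_revenue", "customer_profile"],
    "daily_revenue": ["revenue_dashboard"],
    "customer_profile": ["recommendations"],
    "ingest_clicks": ["recommendations"],
}


def affected_jobs(failed_job, downstream=DOWNSTREAM):
    """Return every job reachable from failed_job, in breadth-first order."""
    seen = set()
    order = []
    queue = deque(downstream.get(failed_job, []))
    while queue:
        job = queue.popleft()
        if job in seen:
            continue  # a job can be reachable through several paths
        seen.add(job)
        order.append(job)
        queue.extend(downstream.get(job, []))
    return order


print(affected_jobs("clean_orders"))
```

In this made-up graph, a fault in clean_orders reaches four other jobs, which is why a single upstream bug can surface as several unrelated-looking failures.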
High-risk releases:
The lack of a continuous delivery system decreases confidence in the release cycle. One needs to put multiple levels of checks in place before pushing a job into production. The manual process increases bureaucracy and reduces agility.
Delayed time to market:
Cascading failures and high-risk releases add more complexity to the data pipeline engine. This ultimately delays time to market, as even a simple change request needs multiple cautious efforts to deliver.
High cost of maintenance:
The lack of a continuous delivery system results in a few people becoming the experts of the system. This creates technical debt, as knowledge of the system is not spread equally across the team. It also increases the need to hire and retain specialists, which brings high cost as well.
Continuous data pipeline delivery system:
Unit testing, integration testing and code coverage give a high level of confidence in the individual pieces of code that we deliver. The MapReduce framework has MRUnit as a unit testing framework. Cloudera has a very good blog post on unit testing in Apache Spark with Spark Testing Base. The HBase mini cluster provides a comprehensive integration testing utility for HBase. Kafka supports unit and integration testing via a Kafka server. The Jarvis project has complete code examples covering various integration testing utilities. It is a best practice to follow a “two-vote code review” process. This not only reduces the risk but also spreads knowledge across the team, thereby reducing technical debt.
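As a small illustration of unit testing a single pipeline step, the sketch below uses Python's standard unittest module rather than any of the frameworks named above. The normalize_email function is a hypothetical cleaning rule invented for the example.

```python
import unittest


def normalize_email(raw):
    """Hypothetical cleaning rule: trim whitespace and lower-case an email."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    # Values without an "@" are treated as invalid and dropped.
    return cleaned if "@" in cleaned else None


class NormalizeEmailTest(unittest.TestCase):
    def test_trims_and_lowercases(self):
        self.assertEqual(normalize_email("  Alice@Example.COM "), "alice@example.com")

    def test_rejects_value_without_at_sign(self):
        self.assertIsNone(normalize_email("not-an-email"))

    def test_passes_through_missing_value(self):
        self.assertIsNone(normalize_email(None))
```

Saved as a module, the tests can be run with python -m unittest. Keeping each transformation a plain function like this is one way to make it testable without a cluster.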
Automated acceptance testing:
Microcosm testing (a known set of inputs checked against a known set of outputs) is a critical backbone of the continuous data pipeline. A data pipeline typically transforms or cleans data during its lifetime, between consuming the data from a source and sinking it into another data storage engine. Data transformation and data cleanup always depend on business rules. It is important to have a solid microcosm testing system after the build to make sure we are not breaking the business rules anywhere in the data pipeline. A microcosm testing system gives a high degree of confidence in the business functionality and in the pipeline's ability to support the other data applications that depend on it.
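A microcosm test can be sketched as a fixed input sample, a fixed expected output, and a comparison between the expected output and what the transformation actually produces. Everything below (the record fields, the transform function and its business rules) is a hypothetical example, not a real pipeline.

```python
from decimal import Decimal

# Known input: a small, hand-written sample of raw records.
KNOWN_INPUT = [
    {"order_id": 1, "amount": "10.50", "country": "us"},
    {"order_id": 2, "amount": "-3.00", "country": "US"},  # negative amount
    {"order_id": 3, "amount": "7.25", "country": " de "},
]

# Known output: what the (assumed) business rules say must come out.
EXPECTED_OUTPUT = [
    {"order_id": 1, "amount_cents": 1050, "country": "US"},
    {"order_id": 3, "amount_cents": 725, "country": "DE"},
]


def transform(records):
    """Hypothetical rules: drop negative amounts, store cents, upper-case country."""
    out = []
    for rec in records:
        amount = Decimal(rec["amount"])
        if amount < 0:
            continue
        out.append({
            "order_id": rec["order_id"],
            "amount_cents": int(amount * 100),
            "country": rec["country"].strip().upper(),
        })
    return out


def microcosm_diff(expected, actual):
    """Return (missing, unexpected) records; both empty means the rules hold."""
    missing = [r for r in expected if r not in actual]
    unexpected = [r for r in actual if r not in expected]
    return missing, unexpected


missing, unexpected = microcosm_diff(EXPECTED_OUTPUT, transform(KNOWN_INPUT))
print("missing:", missing, "unexpected:", unexpected)
```

Running a check like this after every build, against the real job instead of the toy transform, would flag a broken business rule before the artifact is promoted.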
Automated workflow planning:
It is surprising to see that most of the open source workflow scheduling engines (Oozie, Airflow) are built without any intelligence around them. One of the common problems in a complex data pipeline is that we need to know the complete lineage of the jobs. We need to know the lineage because, when we deploy, we need to time the deployment approximately to make sure all the dependencies are satisfied.
Capacity planning is another major pain point with workflow systems. We do have some very good visualizations that show job start and end times, resource utilization and so on. A continuous data pipeline requires an intelligent workflow scheduling engine that automatically understands the dependency lineage, the capacity of the cluster and the SLA associated with the pipeline, and acts on them.
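The lineage question can be sketched with Python's standard graphlib module (Python 3.9+). The DEPENDS_ON mapping and the job names are hypothetical; the sketch only shows how upstream lineage and a dependency-safe run order could be derived from a declared graph.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs: each job maps to the set of jobs it depends on.
DEPENDS_ON = {
    "clean_orders": {"ingest_orders"},
    "daily_revenue": {"clean_orders"},
    "recommendations": {"clean_orders", "ingest_clicks"},
}


def upstream_lineage(job, depends_on=DEPENDS_ON):
    """Return every job that must finish before `job` can run."""
    lineage = set()
    stack = list(depends_on.get(job, ()))
    while stack:
        parent = stack.pop()
        if parent not in lineage:
            lineage.add(parent)
            stack.extend(depends_on.get(parent, ()))
    return lineage


def run_order(depends_on=DEPENDS_ON):
    """Return one order in which every job runs after its dependencies."""
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(depends_on).static_order())


print(sorted(upstream_lineage("recommendations")))
print(run_order())
```

A scheduler that keeps such a graph could check, before a deployment, which upstream jobs have to complete first instead of relying on approximate timing.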
A data pipeline requires manual testing to make sure the system we build is functioning correctly and that we are fully confident to deploy the job. Manual testing is often very important for new feature development, where we have not yet established comprehensive automated verification, or for important business rule changes on the data pipeline. It is important that we are able to expose any part of the data in the data pipeline to an ad hoc query engine. Ad hoc query engines like Presto, Impala and Drill make information consumption a lot easier without disturbing the data pipeline. Manual testing is often carried out against a sample set of data or a random sample.
The best practice is to treat the staging and production environments as candidate 2 and candidate 1. After a feature has gone through manual testing, the artifact is promoted from snapshot to the staging (candidate 2) environment. If the job runs successfully and satisfies the performance requirements, the artifact gets promoted to the production (candidate 1) environment.
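The promotion rule can be written down as a small decision function. This is only a sketch: the RunResult type, the next_stage function and the one-hour runtime threshold are assumptions made for the example, not part of any particular deployment tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunResult:
    """Hypothetical summary of one job run in the staging environment."""
    succeeded: bool
    runtime_seconds: float


def next_stage(current: str,
               manual_test_passed: bool,
               staging_run: Optional[RunResult] = None,
               max_runtime_seconds: float = 3600.0) -> str:
    """Decide where an artifact goes next: snapshot -> staging -> production."""
    if current == "snapshot":
        # Manual testing gates promotion to staging (candidate 2).
        return "staging" if manual_test_passed else "snapshot"
    if current == "staging":
        # A successful run within the assumed performance budget gates
        # promotion to production (candidate 1).
        ok = (staging_run is not None
              and staging_run.succeeded
              and staging_run.runtime_seconds <= max_runtime_seconds)
        return "production" if ok else "staging"
    return current


print(next_stage("snapshot", manual_test_passed=True))
print(next_stage("staging", True, RunResult(succeeded=True, runtime_seconds=1200.0)))
```

Encoding the gates explicitly keeps the promotion decision repeatable, so the same artifact moves through both candidates without a manual judgment call at each step.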
Continuous monitoring is an important part of a continuous data pipeline. In a typical organization there are multiple teams producing different data sources, and multiple teams consuming that data to power their services. It is important to provide data lineage, data quality, clear ownership of data and a data dictionary. A new artifact can get deployed multiple times a day, which could potentially change this information. It is important to automate these services so that downstream jobs can easily track the changes. These tools bring greater visibility and transparency to the entire system. Twitter has published some very good blog posts about their continuous monitoring system.
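One piece of this that is easy to automate is detecting changes to the data dictionary between two deployments. The sketch below assumes the dictionary is available as a simple mapping from column name to type; the column names and the diff_dictionary helper are hypothetical.

```python
def diff_dictionary(old, new):
    """Compare two data dictionary versions (column name -> type)."""
    added = {c: t for c, t in new.items() if c not in old}
    removed = {c: t for c, t in old.items() if c not in new}
    changed = {c: (old[c], new[c])
               for c in old.keys() & new.keys()
               if old[c] != new[c]}
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical dictionary before and after a new artifact is deployed.
before = {"order_id": "bigint", "amount": "string", "country": "string"}
after = {"order_id": "bigint", "amount": "decimal(10,2)", "currency": "string"}

print(diff_dictionary(before, after))
```

Publishing a diff like this on every deployment would give downstream teams a concrete signal that a column was added, dropped or retyped.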