The emergence of Big Data:
In the Web 2.0 era, the amount of data being generated runs into petabytes. Programmers and business analysts want to analyse this large volume of data to drive the business. Data is key for any business.
There are two important characteristics of big data that cause these challenges:
- Storing data fail-safe
- Processing the data faster
1. Storing data fail-safe
Over the years the storage capacity of a single disk has increased considerably; a 1 TB hard disk is now quite normal. But read speeds have not kept pace with the growth in capacity. On average we get only about 100 MB/s, so reading all the data from a full 1 TB disk takes roughly 10,000 seconds, close to three hours.
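The arithmetic behind that estimate can be checked directly. This short sketch uses the figures from the text (1 TB capacity, 100 MB/s sustained transfer):

```python
# Back-of-the-envelope time for a full sequential scan of one disk.
# Figures come from the text: 1 TB disk, 100 MB/s sustained read rate.
DISK_SIZE_MB = 1_000_000      # 1 TB expressed in MB (decimal units)
TRANSFER_MB_PER_S = 100       # typical sustained read throughput

seconds = DISK_SIZE_MB / TRANSFER_MB_PER_S
print(f"Full scan: {seconds:.0f} s = {seconds / 3600:.1f} hours")
# → Full scan: 10000 s = 2.8 hours
```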
How to reduce the reading time: parallel access:
One way to improve the read process is to read the data from multiple disks in parallel, so the overall computation finishes faster.
The drawback of this approach is that we may end up with more disks than the actual data size requires. But growing storage capacity and falling disk prices have made this approach affordable.
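The idea can be sketched in miniature: scan several "disks" concurrently instead of one after another. Here the disks are just temporary files and the sizes are made up for illustration; one thread per file stands in for one reader per physical disk.

```python
# Sketch: reading several "disks" (temporary files here) in parallel.
# File names and sizes are illustrative only.
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def make_disk(path, size):
    with open(path, "wb") as f:
        f.write(b"x" * size)

def read_disk(path):
    # A full sequential scan of one "disk".
    with open(path, "rb") as f:
        return len(f.read())

with tempfile.TemporaryDirectory() as d:
    paths = [os.path.join(d, f"disk{i}.bin") for i in range(4)]
    for p in paths:
        make_disk(p, 1024)
    # One thread per disk: the scans overlap instead of running back to back.
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        total = sum(pool.map(read_disk, paths))

print(f"Read {total} bytes from {len(paths)} disks in parallel")
# → Read 4096 bytes from 4 disks in parallel
```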
What about hardware failure?
It is inevitable that hardware will fail, so there must be a technique to duplicate data across different storage systems: even if one system fails, another picks up. So we need an effective file system (HDFS) to store data in a distributed way.
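A toy model makes the idea concrete: write every block to several nodes, so losing one node loses nothing. The node names and replication factor below are illustrative, not HDFS internals (HDFS also uses a smarter, rack-aware placement policy).

```python
# Toy replication: each block is written to REPLICATION nodes, so a
# single node failure does not lose data. Names are made up.
REPLICATION = 3
nodes = {name: {} for name in ["node1", "node2", "node3", "node4"]}

def put_block(block_id, data):
    # Naive placement: first REPLICATION nodes get a copy.
    for name in list(nodes)[:REPLICATION]:
        nodes[name][block_id] = data

def get_block(block_id):
    # Any surviving replica can serve the read.
    for store in nodes.values():
        if block_id in store:
            return store[block_id]
    raise KeyError(block_id)

put_block("blk_001", b"some payload")
del nodes["node1"]            # simulate a node failure
print(get_block("blk_001"))   # → b'some payload'
```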
2. Processing the data faster:
Often the analysis needs to combine data from different nodes for computations such as sorting and merging, so we need an effective programming model like MapReduce, on which Hadoop is built. The ability to process unstructured data and the slowness of disk seek time are the biggest challenges in computing over big data.
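The MapReduce model itself is small: a map function emits key-value pairs, the framework groups pairs by key (the shuffle/sort step), and a reduce function combines each group. Real Hadoop jobs are typically written in Java; this Python word-count sketch only illustrates the model.

```python
# Minimal word count in the MapReduce style.
from collections import defaultdict

def map_phase(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word, 1

def reduce_phase(word, counts):
    # Sum all counts emitted for one word.
    return word, sum(counts)

lines = ["big data big cluster", "data node"]

# Shuffle/sort: group intermediate pairs by key, as the framework would.
groups = defaultdict(list)
for line in lines:
    for word, count in map_phase(line):
        groups[word].append(count)

result = dict(reduce_phase(w, c) for w, c in groups.items())
print(result)   # → {'big': 2, 'data': 2, 'cluster': 1, 'node': 1}
```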
Seek time is the time taken to move the disk head to a particular place on the disk to read or write, and data access depends primarily on it. The traditional B-tree structure used by an RDBMS is good for updating and selecting data, but it is not as efficient as MapReduce at sorting and merging. MapReduce suits batch processing, which mostly writes once and reads often, whereas a relational database fits data that is continuously updated.
Processing semi-structured data:
An RDBMS is a good fit when your data is organised in a structured way, such as XML or tables, because the whole data model is built around the relationships between data. Semi-structured data, like a spreadsheet, is organised as rows and cells, but each row and cell can hold any data. Unstructured data such as image files or PDFs won't fit into a relational database at all. MapReduce works well with unstructured and semi-structured data because it interprets the data at processing time, unlike an RDBMS, which enforces structure at storage time (with constraints and data types).
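This "interpret at processing time" idea is often called schema-on-read, and a small sketch shows the contrast: raw lines are stored as-is and only parsed when a job runs. The log format and field names here are made up for illustration.

```python
# Schema-on-read sketch: raw lines are stored untyped and interpreted
# only when processed, unlike an RDBMS that enforces types at insert time.
raw_lines = [
    "2023-01-05 login alice",
    "2023-01-05 purchase bob 19.99",   # extra field is fine; schema is applied on read
]

def parse(line):
    parts = line.split()
    record = {"date": parts[0], "event": parts[1], "user": parts[2]}
    if len(parts) > 3:                 # interpret optional fields at read time
        record["amount"] = float(parts[3])
    return record

records = [parse(line) for line in raw_lines]
print(records[1]["amount"])   # → 19.99
```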
An RDBMS is often normalised to reduce duplication, whereas distributed data processing is built on top of duplicating data across different nodes. Duplication is required so that even if one node goes down, the data is not lost and the computation continues undisturbed. Hadoop's HDFS file system and the MapReduce model are built exactly for this.
With MapReduce, if you increase the input data, the processing time grows; if you increase the number of nodes in the cluster, the processing time shrinks. In other words, runtime scales linearly with the amount of data and inversely with the size of the cluster. This is not true of SQL queries on a single database.
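The scaling relation can be stated as a one-line formula. The per-node throughput figure below is an assumption for illustration, not a Hadoop benchmark:

```python
# Linear scaling sketch: with a fixed per-node throughput, doubling the
# data doubles the runtime, and doubling the nodes halves it.
def batch_runtime_s(data_mb, nodes, mb_per_s_per_node=100):
    # Assumed: work splits evenly and each node scans at mb_per_s_per_node.
    return data_mb / (nodes * mb_per_s_per_node)

print(batch_runtime_s(1_000_000, 10))   # 1 TB on 10 nodes → 1000.0 s
print(batch_runtime_s(2_000_000, 10))   # double the data  → 2000.0 s
print(batch_runtime_s(2_000_000, 20))   # double nodes too → 1000.0 s
```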
An RDBMS is good when you have gigabytes of structured data that is read and written often and needs high integrity.
Hadoop is good when you have petabytes of semi-structured or unstructured data (though it fits structured data too) that is written once and read many times, and needs to be processed in batch mode with linear scaling and lower integrity guarantees. People are now starting to use Hadoop for real-time analytics too.