What exactly is Big Data? Well, it is an electro-pop band from Brooklyn (I am not kidding!). Music aside, the first logical thought that comes to mind is that it is data that is BIG! But how much data is enough to be considered BIG? The threshold for bigness is not constant. Decades ago a gigabyte was big, then a terabyte, then a petabyte, and so on. This will keep changing.
The point is, volume is not the only aspect that defines Big Data.
4 Vs of Big Data
Volume: Big Data involves huge amounts of data, on the order of terabytes and petabytes. Thanks to digitization, data is now captured almost everywhere.
Velocity: The speed at which data is generated by systems such as POS terminals, sensors, and the internet is tremendous. Handling this velocity effectively is another aspect of Big Data.
Variety: Data arrives in many forms, from structured database records to unstructured text, images, and video. Coping with this mix is a third aspect of Big Data.
Veracity: Such a large amount of data surely isn't clean. It contains noise and abnormalities, and this uncertainty is exactly what veracity means in Big Data.
OK. So now we have our data, or rather "Big Data". What do we do with it? The first term that gets thrown around is Predictive Analytics!
Now, what does that cool-sounding term mean?
Predictive Analytics
Let us take the example of Netflix. Users generate data when they register on Netflix, and data about what each user clicks, watches, and bookmarks is captured. And of course, there is the huge catalog of content that Netflix streams to its users. One application of predictive analytics is merging this data to predict what a user will like next and generate a suggestion list!
Sounds simple, doesn't it? But what goes on behind the scenes is a lot of data crunching using complex statistical methods and data models.
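To get a feel for that data crunching, here is a toy sketch in Python. It is not how Netflix actually does it. The user names, titles, and the `recommend` function are all made up for illustration, and the "model" is just a count of how many titles two users have in common.

```python
from collections import Counter

# Hypothetical watch histories: user -> set of titles watched.
watch_history = {
    "asha": {"Space Docs", "Robot Wars", "Deep Sea"},
    "ben": {"Space Docs", "Robot Wars", "Cooking 101"},
    "chen": {"Robot Wars", "Deep Sea", "Mars Diaries"},
    "dana": {"Cooking 101", "Baking Show"},
}


def recommend(user, history, top_n=3):
    """Suggest unseen titles, scored by overlap with other users."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        # Number of shared titles acts as a crude similarity score.
        overlap = len(seen & titles)
        if overlap == 0:
            continue
        # Every title the other user watched that this user has not
        # gets credit proportional to how similar the two users are.
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]


print(recommend("ben", watch_history))
```

For "ben", the top suggestion is "Deep Sea", because the two users most similar to him have both watched it. Real systems use far richer signals and models, but the shape is the same: merge what users did with what is available, then rank.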
So much data and processing surely can't be handled by standalone computers! Enter Distributed Computing.
Distributed Computing
Imagine a complex job that is beyond the capacity of a single person. To get it done, it can be broken down into simple tasks and distributed to different people. The outcome of each task can then be combined to get the desired job done. This analogy applies to distributed computing too. When a complex analysis is to be performed on a large volume of data, the job can be divided and delegated to a grid of connected computers with good computational power. Each computer is called a node. These nodes process the given tasks, and their outcomes are combined. This is an effective way to apply complex predictive analytics algorithms to Big Data.
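The divide, delegate, and combine idea can be sketched in a few lines of Python. This is only a simulation under stated assumptions: threads stand in for the nodes of a real cluster, and the function names (`split_into_chunks`, `node_task`, `distributed_mean`) are invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data, num_nodes):
    """Divide the data into roughly equal chunks, one per node.

    Assumes data is non-empty.
    """
    chunk_size = -(-len(data) // num_nodes)  # ceiling division
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def node_task(chunk):
    """Work done by a single node: a partial sum and a partial count."""
    return sum(chunk), len(chunk)


def distributed_mean(data, num_nodes=4):
    """Compute the mean by combining partial results from each node."""
    chunks = split_into_chunks(data, num_nodes)
    # Threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(node_task, chunks))
    # Combine step: merge the partial sums and counts into one answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


readings = list(range(1, 101))
print(distributed_mean(readings))  # 50.5
```

Note that each node returns a sum and a count rather than its own mean. Averaging the per-node means would give the wrong answer whenever the chunks differ in size, so the combine step needs results that can be merged safely.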
Hadoop
HDFS
Suppose we have 10 terabytes of data. HDFS (the Hadoop Distributed File System) will split such large files across multiple computers. At the same time, these distributed pieces are replicated to achieve good reliability.
MapReduce
MapReduce is an application framework for processing the data in files that have already been split across multiple computers in the cluster. With MapReduce, multiple files can be processed simultaneously, thus minimizing the computation time.
Some more buzzwords that get thrown in with Hadoop are HBase, Hive and Pig. Let's see what they mean.
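Before getting to those, a quick sketch of the MapReduce idea itself may help. The classic illustration is counting words. The code below is plain Python that mimics the map, shuffle, and reduce steps on one machine. It does not use the Hadoop API, and the sample "file splits" are made up.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one file split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(mapped_pairs):
    """Shuffle: group all emitted values by key (the word)."""
    grouped = defaultdict(list)
    for word, count in mapped_pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped):
    """Reduce: collapse each word's list of counts into one total."""
    return {word: sum(counts) for word, counts in grouped.items()}


# Hypothetical file splits, as if each lived on a different computer.
splits = [
    "big data is big",
    "data is everywhere",
    "big clusters process data",
]

# In a real cluster the map calls would run in parallel, one per split.
mapped = [pair for split in splits for pair in map_phase(split)]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts["big"], word_counts["data"])  # 3 3
```

Because each call to `map_phase` looks only at its own split, the splits can be processed at the same time on different computers. That is where the savings in computation time come from.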
HBase
HBase is a database built on top of HDFS that provides fault-tolerant storage. For example, HBase makes it possible to search for 40 large items in a group of 1 billion records.
Pig
Pig is a data analysis platform used to analyze large amounts of data in the Hadoop ecosystem. In simple terms, when you write a program in Pig Latin (a SQL-like language for Pig), the Pig infrastructure breaks it down into several MapReduce programs and executes them in parallel.
Hive
Hive is a data warehouse infrastructure whose main component is a SQL-like language called HiveQL. It was developed at Facebook because the learning curve for Pig was high. Hive also leverages the distributed functionality of MapReduce to help analyze data.
Together, all of the above make up the Hadoop ecosystem. Of course there are many more add-ons, but these are the most basic ones with which a big data infrastructure can be built and analyzed.
Stay tuned for the next post on Data Mining. Happy Learning!
~Slice of BI