Saturday

Data Mining in a Nutshell..!!



Data Mining, in plain terms, is the analysis of large data-sets to discover patterns, trends and relationships that go beyond simple analysis.

A more formal definition would be "a computational process for discovering patterns, trends and behavior in large data-sets using Artificial Intelligence, Machine Learning, Statistics and Database Systems". Phew... a lot of technologies involved, isn't it! Well, let's simplify this and take a bird's eye view of a real-world example of Data Mining.



Data Mining in Recommendation Systems


Whenever you make a purchase on Amazon, or for that matter even just search for products on Amazon, you would have noticed a string of 'Recommended Products' vying for your attention at a corner of the page. These recommendation systems are a classic example of Data Mining. Let's take the example of Amazon's Recommendation System and explore the steps involved in Data Mining.



Amazon maintains a big data-warehouse containing tons of information about transactions, products, users, etc. The first step towards building a recommendation system would be to pull out only the information that is required for the system. In this particular case, it could include information such as customers' transactions, products viewed, customer feedback, product categories, product price, etc. This step is the Selection step - i.e. identifying and capturing only the necessary data from the data-warehouse. The selected data can be referred to as 'Target Data'.
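To make the Selection step concrete, here is a minimal sketch in Python with pandas. The file names and column names are purely illustrative assumptions, not Amazon's actual schema:

```python
import pandas as pd

# Hypothetical extracts from the warehouse: pull only the columns the
# recommendation system actually needs (names are illustrative).
transactions = pd.read_csv("transactions.csv")  # customer_id, product_id, price, purchase_date, ...
feedback = pd.read_csv("feedback.csv")          # customer_id, product_id, feedback_text, ...

# The 'Target Data': just the attributes relevant to recommendations.
target_data = transactions[["customer_id", "product_id", "price", "purchase_date"]].merge(
    feedback[["customer_id", "product_id", "feedback_text"]],
    on=["customer_id", "product_id"],
    how="left",
)
```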


The information in the Target Data may often need some Pre-Processing before it can actually be used to build the system. How should Missing Values be handled? Which attributes do we need? Do we need to create any calculated fields? These are some of the decisions that need to be made at this stage. Textual data tends to require much more cleaning. For example, textual data such as customer feedback would need to be processed to remove articles, pronouns, etc. (commonly referred to as stop words), remove punctuation, convert to lower case, and so on.
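A minimal sketch of this kind of text clean-up, using a tiny hand-rolled stop-word list purely for illustration (a real system would use a much fuller one):

```python
import string

# Illustrative stop-word list; real pipelines use far larger lists.
STOP_WORDS = {"a", "an", "the", "i", "it", "this", "is", "was", "and"}

def clean_feedback(text):
    """Lower-case the text, strip punctuation and drop stop words."""
    if text is None:  # one simple way to handle a missing value
        return ""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(word for word in text.split() if word not in STOP_WORDS)

print(clean_feedback("This is an AMAZING product, I loved it!"))
# -> "amazing product loved"
```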

Once the pre-processing is done, the format of the data may have to be changed according to the algorithm that is to be used. For example, from all the previous transactions made by the customer, it will be of interest to us to capture the products recently bought by the customer, say in the previous 3 months, and treat them as a separate variable. All of this is done in the Transformation step.
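A small illustrative sketch of such a transformation with pandas, using made-up customers, products and dates:

```python
import pandas as pd

# Hypothetical transactions after Selection and Pre-Processing.
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "product_id": ["book", "kettle", "headphones"],
    "purchase_date": pd.to_datetime(["2015-01-10", "2015-03-02", "2014-11-20"]),
})

# Keep only purchases from the previous 3 months (cutoff chosen for the example).
cutoff = pd.Timestamp("2015-03-15") - pd.DateOffset(months=3)
recent = transactions[transactions["purchase_date"] >= cutoff]

# One row per customer, with the list of recently bought products as a new variable.
recent_products = recent.groupby("customer_id")["product_id"].apply(list).rename("recent_products")
print(recent_products)
```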

Once the Selection, Pre-Processing and Transformation of the data is done, we should be ready to actually 'mine' the data! The most critical part of a data mining solution is formulating the question you are trying to answer. The quality of the results will greatly depend on how well you understand the problem you are trying to solve. Based on the task at hand, we can opt for different Data Mining algorithms. For this particular case, applying a Clustering algorithm to the products might be a good choice to identify products similar to the ones the customer has bought. The results of the algorithm should help us identify patterns, trends, etc. Visualization tools and techniques help us greatly in interpreting the results, and we can further use statistical methods to extrapolate and make predictions.
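As a rough illustration of the mining step, here is a toy clustering example with scikit-learn's KMeans on made-up product features; a real system would use far richer features and more careful tuning:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical product features (e.g. price, average rating) after Transformation.
product_features = np.array([
    [10.0, 4.5],   # budget book
    [12.0, 4.2],   # budget book
    [450.0, 4.8],  # premium headphones
    [480.0, 4.6],  # premium headphones
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(product_features)
print(kmeans.labels_)  # products in the same cluster are candidates to recommend together
```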

Friday

Big Data Basics

 


This term has surely got a lot of hype in the past few years, but many people are aware of the hype while remaining unaware of the term's meaning. Let us change that in this blog. Everybody just loves to talk about how Big Data is the future, how the Fortune 500 companies are leveraging it to make better decisions and so on.
What exactly is Big Data? Well, it is an electro-pop band from Brooklyn (I am not kidding! Check here). Music apart, the first logical thought that comes to mind is that it is data that is BIG! But then how much data is enough to be considered BIG? The threshold for the bigness of data is not constant. Decades ago a gigabyte was big, then a terabyte, a petabyte and so on. This will keep on changing.
The point is that volume is not the only aspect that defines Big Data.

4 Vs of Big Data



Volume: Big Data involves huge amounts of data, in magnitudes of Terabytes and Petabytes. This is because, thanks to digitization, data is nowadays captured almost everywhere and anywhere.

Variety: Data comes in all kinds of formats these days. Structured, in the form of the good ol' databases. Unstructured, in the form of text documents, email, video and audio. Merging and managing such different forms is one of the aspects of Big Data.

Velocity: The speed at which data is being generated by different systems like POS terminals, sensors, the internet, etc. is tremendous. Effectively handling this velocity is one more aspect of Big Data.

Veracity: Such a large amount of data surely isn't clean. There is noise and there are abnormalities in the data, and this uncertainty is exactly what veracity in Big Data means.

Ok. So now we have our data, or rather "Big Data". What do we do with it? The first term that can be thrown around is Predictive Analytics!
Now, what does that cool-sounding term mean?


Predictive Analytics


Let us take the example of Netflix. Data is generated by users who register on Netflix: data about what the user clicks, what the user watches and what the user bookmarks is captured. And of course, there is the huge amount of content that Netflix provides the user in the form of video streams. An application of predictive analytics is merging this data to predict what the user will like next and generating a suggestion list!
Sounds simple, doesn't it? But what goes on behind the scenes is a lot of data crunching using complex statistical methods and data models.
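As a toy illustration of the idea (not Netflix's actual method), here is a minimal overlap-based suggestion sketch in Python, using a made-up viewing history:

```python
from collections import Counter

# Hypothetical viewing history: user -> set of titles watched.
history = {
    "alice": {"House of Cards", "Narcos", "Sherlock"},
    "bob": {"House of Cards", "Narcos", "Breaking Bad"},
    "carol": {"Narcos", "Breaking Bad"},
}

def suggest(user, history, top_n=2):
    """Suggest unseen titles watched by users with overlapping taste, weighted by overlap."""
    seen = history[user]
    scores = Counter()
    for other, titles in history.items():
        if other == user:
            continue
        overlap = len(seen & titles)
        for title in titles - seen:
            scores[title] += overlap
    return [title for title, _ in scores.most_common(top_n)]

print(suggest("alice", history))  # -> ['Breaking Bad']
```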

So much data, and so much processing of it, surely can't be done by standalone computers. Enter Distributed Computing.


Distributed Computing

Imagine a complex job that is beyond the capacity of a single person. To get it done, it can be broken down into simple tasks and distributed to different people. The outcome of each task can then be combined to get the desired job done.
This analogy applies to distributed computing too. When a complex analysis is to be performed on a large volume of data, the job can be divided and delegated to a grid of connected computers with good computational power. Each computer is called a node. These nodes process the tasks given to them and their outcomes are combined. This is an effective way to apply complex predictive analytics algorithms to Big Data.
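A rough single-machine analogy of this divide-and-combine idea, using Python's multiprocessing module with worker processes standing in for nodes:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """The 'task' each node would run: here, just a sum over one slice of the data."""
    return sum(chunk)

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]

    # Each worker process plays the role of a node; partial results are combined at the end.
    with Pool(processes=4) as pool:
        partial_sums = pool.map(process_chunk, chunks)
    print(sum(partial_sums))
```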


Hadoop



The word Hadoop is almost synonymous with Big Data because of the number of times the two are mentioned together. But of course, Hadoop is just a technology that was created when the data on the web started exploding and went beyond the ability of traditional systems to handle it. In simple words, Hadoop is a new way to store and process data. It enables distributed computing on huge amounts of data across clusters of inexpensive servers that store and process the data. The storage part of Hadoop is called the Hadoop Distributed File System (HDFS) and the processing part is called MapReduce.


HDFS

Suppose we have 10 terabytes of data. HDFS will split such large files into blocks spread across multiple computers. At the same time, these distributed blocks also get replicated to achieve good reliability.
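A toy simulation of the splitting-and-replication idea, with a tiny block size and made-up node names (real HDFS blocks are tens or hundreds of megabytes):

```python
import itertools

BLOCK_SIZE = 4        # bytes per block, tiny for illustration (HDFS uses 64/128 MB)
REPLICATION = 3       # each block is stored on this many nodes
NODES = ["node1", "node2", "node3", "node4"]

data = b"ten terabytes of data, in miniature"
blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]

# Round-robin placement: every block lands on REPLICATION different nodes.
node_cycle = itertools.cycle(NODES)
placement = {i: [next(node_cycle) for _ in range(REPLICATION)] for i, _ in enumerate(blocks)}
print(placement[0])   # e.g. ['node1', 'node2', 'node3']
```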


MapReduce

MapReduce is an application framework to process the data in the files that have already been split across multiple computers in the cluster. With MapReduce, multiple files can be processed simultaneously, thus minimizing the computation time.
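A minimal single-machine sketch of the map-shuffle-reduce model, counting words across two tiny "documents" (in a real cluster, each mapper would see one file split):

```python
from collections import defaultdict

documents = ["big data is big", "hadoop processes big data"]

# Map: emit (word, 1) pairs from each document.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all emitted values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce: combine the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)  # {'big': 3, 'data': 2, 'is': 1, 'hadoop': 1, 'processes': 1}
```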

Some more buzzwords that get thrown in with Hadoop are HBase, Hive and Pig. Let's see what they mean.
            

HBase

HBase is built on top of HDFS, providing a fault-tolerant capability. For example, searching for 40 large items in a group of 1 billion records is made possible by HBase.
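A minimal sketch of such a row lookup from Python using the happybase client, assuming an HBase Thrift server is running locally and a 'products' table already exists (both are assumptions for the example):

```python
import happybase  # Python client for HBase's Thrift gateway

# Assumes an HBase Thrift server on localhost and an existing 'products' table.
connection = happybase.Connection("localhost")
table = connection.table("products")

# A lookup by row key stays fast even when the table holds billions of rows.
row = table.row(b"product#12345")
print(row)  # dict of column -> value, e.g. {b'info:name': b'...'}
```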


Pig

Pig is a data analysis platform used to analyze large amounts of data in the Hadoop ecosystem. In simple terms, when you write a program in Pig Latin (a SQL-like language for Pig), the Pig infrastructure breaks it down into several MapReduce programs and executes them in parallel.


Hive

Hive is a data warehouse infrastructure whose main component is a SQL-like language, HiveQL. It was developed at Facebook because the learning curve for Pig was high. Hive also leverages the distributed functionality of MapReduce to help analyze data.

All of the above combined make up the Hadoop Ecosystem. Of course there are many more add-ons, but these are the most basic components with which a big data infrastructure can be built and analyzed.


Stay tuned for the next post on Data Mining. Happy Learning!

~Slice of BI