Saturday

Data Mining in a Nutshell..!!



Data Mining, in plain terms, is the analysis of large data-sets to discover patterns, trends and relationships that go beyond simple analysis. 

A more formal definition would be "a computational process of discovering patterns, trends and behavior in large data-sets using Artificial Intelligence, Machine Learning, Statistics and Database Systems". Phew.. lots of technologies involved, isn't it! Well, let's simplify this and take a bird's eye view of a real-world example of Data Mining.



Data Mining in Recommendation Systems


Whenever you make a purchase on Amazon, or for that matter even just search for products on Amazon, you would have noticed a string of 'Recommended Products' pleading for your attention at a corner of the page. These recommendation systems are a classic example of Data Mining. Let's take the example of Amazon's Recommendation System and explore the steps involved in Data Mining.



Amazon maintains a big data-warehouse containing tons of information about transactions, products, users, etc. The first step towards building a recommendation system would be to pull out only the information that is required for the system. In this particular case, it could include information such as customers' transactions, products viewed, customer feedback, product categories, product price, etc. This step is the Selection step - i.e. identifying and capturing only the necessary data from the data-warehouse. The selected data can be referred to as 'Target Data'.
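The Selection step can be sketched in a few lines of Python. The records, field names and values below are purely hypothetical stand-ins for a real warehouse schema, not Amazon's actual data:

```python
# Selection step sketch: from a full warehouse record, keep only the
# fields the recommendation system needs. All names are hypothetical.

warehouse_records = [
    {"txn_id": 1, "customer": "alice", "product": "shoes",
     "category": "footwear", "price": 59.99,
     "shipping_addr": "...", "payment_method": "card"},
    {"txn_id": 2, "customer": "bob", "product": "racket",
     "category": "sports", "price": 89.50,
     "shipping_addr": "...", "payment_method": "cash"},
]

NEEDED_FIELDS = ("customer", "product", "category", "price")

def select_target_data(records, fields=NEEDED_FIELDS):
    """Project each record down to only the attributes the miner needs."""
    return [{f: rec[f] for f in fields} for rec in records]

target_data = select_target_data(warehouse_records)
print(target_data[0])  # only the four selected attributes survive
```

The rest of the pipeline then works off `target_data` instead of the full warehouse rows.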


The information in the Target Data may often need some Pre-Processing before it can actually be used to build the system. How should missing values be handled? Which attributes do we need? Do we need to create any calculated fields? These are some of the decisions that need to be made at this stage. Textual data tends to require much more cleaning. For example, textual data such as customer feedback would need to be processed to remove articles, pronouns, etc. (commonly referred to as stop words), remove punctuation, convert to lower case, and so on.
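A minimal sketch of such text clean-up, assuming a tiny illustrative stop-word list (a real system would use a much larger one):

```python
# Pre-processing sketch for customer feedback text: lower-case it,
# strip punctuation, and drop stop words. The stop-word list here is
# a small illustrative sample, not a production list.
import string

STOP_WORDS = {"a", "an", "the", "i", "it", "is", "was", "this", "and"}

def clean_feedback(text):
    text = text.lower()
    # remove punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # drop stop words, keep meaningful tokens
    return [w for w in text.split() if w not in STOP_WORDS]

tokens = clean_feedback("This is a GREAT product, and it was cheap!")
print(tokens)  # ['great', 'product', 'cheap']
```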

Once the pre-processing is done, the format of the data may have to be changed according to the algorithm that is to be used. For example, from all the previous transactions made by the customer, it will be of interest to us to capture the products recently bought by the customer, say in the previous 3 months, and consider them as a separate variable. All of this is done in the Transformation step.
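This kind of derived variable can be sketched as follows; the products, dates and the 90-day window are made-up illustrations of the "previous 3 months" idea:

```python
# Transformation step sketch: derive a "recently bought" variable from
# a customer's full transaction history (last ~3 months, approximated
# as 90 days). Dates and products are invented sample data.
from datetime import date, timedelta

transactions = [
    {"product": "running shoes", "date": date(2014, 1, 5)},
    {"product": "tennis racket", "date": date(2014, 6, 20)},
    {"product": "water bottle",  "date": date(2014, 7, 1)},
]

def recently_bought(txns, today, window_days=90):
    """Keep only products purchased within the given window."""
    cutoff = today - timedelta(days=window_days)
    return [t["product"] for t in txns if t["date"] >= cutoff]

recent = recently_bought(transactions, today=date(2014, 7, 15))
print(recent)  # ['tennis racket', 'water bottle']
```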

Once the Selection, Pre-Processing and Transformation of the data are done, we should be ready to actually 'mine' the data..! The most critical part of a data mining solution is formulating the question you are trying to answer. The results, and the quality of the results, will greatly depend on how well you understand the problem you are trying to solve. Based on the task at hand, we can opt for different Data Mining algorithms. For this particular case, applying a clustering algorithm to the products might be a good choice to identify products similar to the ones the customer has bought. The results of the algorithm should help us identify patterns, trends, etc. Visualization tools and techniques help us greatly in interpreting the results. We can further use statistical methods to extrapolate and make predictions.
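As an illustration of the clustering idea (not Amazon's actual method), here is a bare-bones k-means loop grouping products by two made-up features, price and average rating; real systems use far richer features and mature libraries:

```python
# Toy k-means clustering of products by (price, avg rating).
# Products, feature values and starting centroids are all invented.
import math

products = {
    "budget shoes":   (20, 3.0),
    "budget sandals": (25, 3.2),
    "premium shoes":  (150, 4.8),
    "premium boots":  (160, 4.6),
}

def kmeans(points, centroids, iterations=10):
    for _ in range(iterations):
        # assignment step: attach each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # update step: move each centroid to the mean of its cluster
        centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else cen
            for cl, cen in zip(clusters, centroids)
        ]
    return centroids

centroids = kmeans(list(products.values()),
                   centroids=[(0.0, 0.0), (200.0, 5.0)])
print(centroids)  # one centroid near the budget items, one near the premium ones
```

The two centroids settle near the "budget" and "premium" groups, which is exactly the kind of similarity structure a recommender can exploit.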

Friday

Big Data Basics

 


This term has surely got the hype in the past few years. But for some time now, everyone has been aware of the hype yet unaware of the term's meaning. Let us change that in this blog. Everybody just loves to talk about how Big Data is the future, how the Fortune 500 companies are leveraging it to make better decisions and so on.
What exactly is Big Data? Well, it is an electropop band from Brooklyn (I am not kidding!). Music apart, the first logical thought that comes to mind is that it is data that is BIG! But then how much data is enough to be considered BIG? The threshold of the bigness of data is not constant. Decades ago a gigabyte was big, then a terabyte, a petabyte and so on. This will keep on changing.
The point is, volume is not the only aspect that defines Big Data.

4 Vs of Big Data



Volume: Big Data involves huge amounts of data, in magnitudes of terabytes and petabytes. This is because, with digitization, data is nowadays captured almost everywhere and anywhere.

Variety: Data comes in all kinds of formats these days. Structured, in the form of the good ol' databases. Unstructured, in the form of text documents, email, video, audio. Merging and managing such different forms is one of the aspects of Big Data.

Velocity: The speed at which data is being generated from different systems like POS terminals, sensors, the internet, etc. is tremendous. Effectively handling this velocity is one more aspect of Big Data.

Veracity: Such a large amount of data surely isn't clean. The data contains noise and abnormalities, and this uncertainty is exactly what veracity in Big Data means.

Ok. So now we have our data, or rather "Big Data". What do we do with it? The first term that can be thrown around is Predictive Analytics!
Now, what does that cool-sounding term mean?


Predictive Analytics


Let us take the example of Netflix. Data is generated by users who register on Netflix: data about what the user clicks, what the user watches and what the user bookmarks is captured. And of course, there is the huge content library that Netflix provides the user in the form of video streams. An application of predictive analytics is merging this data to predict what the user will like next and generate a suggestion list!
Sounds simple, doesn't it? But what goes on behind the stage is a lot of data crunching using complex statistical methods and data models.

So much data, and so much processing of data, surely can't be done by standalone computers. Enter Distributed Computing.


Distributed Computing

Imagine a complex job that is beyond the capacity of a single person. To get it done, it can be broken down into simple tasks and distributed to different people. The outcome of each task can then be combined to get the desired job done.
This analogy applies to distributed computing too. When a complex analysis is to be performed on a large volume of data, the job can be divided and delegated to a grid of connected computers with good computational power. Each computer is called a node. These nodes process the given tasks and their outcomes are combined. This is an effective way to apply complex predictive analytics algorithms to Big Data.
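The divide-and-combine idea can be sketched in Python, with a thread pool standing in for the cluster. Real distributed systems run the tasks on separate machines, but the split / compute / combine shape is the same:

```python
# Toy sketch of distributed computing: each "node" (here, a thread)
# computes a partial sum over its slice of the data, and the partial
# results are combined at the end.
from concurrent.futures import ThreadPoolExecutor

data = list(range(1, 1001))  # pretend this is a huge data set
NUM_NODES = 4

def node_task(slice_of_data):
    """Work delegated to one node: a partial sum over its slice."""
    return sum(slice_of_data)

chunk = len(data) // NUM_NODES
slices = [data[i * chunk:(i + 1) * chunk] for i in range(NUM_NODES)]

with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
    partial_results = list(pool.map(node_task, slices))

total = sum(partial_results)  # combine step
print(total)  # 500500, same answer as sum(data)
```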


Hadoop



The word Hadoop is almost synonymous with Big Data because of the number of times the two are mentioned together. But of course, Hadoop is just a technology that was created when the data on the web started exploding and went beyond the ability of traditional systems to handle it. In simple words, Hadoop is a new way to store and process data. It enables distributed storage and processing of huge amounts of data across clusters of inexpensive servers. The storage part of Hadoop is called the Hadoop Distributed File System (HDFS) and the processing part is called MapReduce.


HDFS

Suppose we have 10 terabytes of data. HDFS splits such large files across multiple computers. At the same time, these distributed files are also replicated to achieve good reliability.


MapReduce

MapReduce is an application framework to process the data in the files that have already been split across multiple computers in the cluster. With MapReduce, multiple files can be processed simultaneously, thus minimizing the computation time.
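A word count, the canonical MapReduce example, can be imitated in a single Python process to show the three phases; Hadoop runs the same phases spread across many machines:

```python
# Miniature word count in the MapReduce style: a map phase that emits
# (word, 1) pairs, a shuffle that groups pairs by key, and a reduce
# phase that sums each group. Everything runs in one process here.
from collections import defaultdict

documents = ["big data is big", "data is everywhere"]

# Map: each document emits (word, 1) for every word it contains
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group the emitted pairs by word
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts in each group
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster, the map tasks run on the nodes that already hold the file splits, which is why HDFS and MapReduce fit together so well.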

Some more buzzwords that get thrown in with Hadoop are HBase, Hive and Pig. Let's see what they mean.
            

HBase

HBase is built on top of HDFS, providing fault-tolerant storage and fast lookups. For example, searching for 40 large items among 1 billion records is made possible by HBase.


Pig

Pig is a data analysis platform used to analyze large amounts of data in the Hadoop ecosystem. In simple terms, when you write a program in Pig Latin (a SQL-like language for Pig), the Pig infrastructure breaks it down into several MapReduce programs and executes them in parallel.


Hive

Hive is a data warehouse infrastructure whose main component is a SQL-like language called HiveQL. It was developed at Facebook because the learning curve for Pig was high. Hive also leverages the distributed functionality of MapReduce to help analyze data.

All of the above combined make up the Hadoop ecosystem. Of course there are many more add-ons, but these are the most basic ones with which a big data infrastructure can be built and analyzed.


Stay tuned for the next post on Data Mining. Happy Learning!

~Slice of BI

           


Monday

Life of DW-BI

What is Business Intelligence?

  

 

Simply put, Business Intelligence is a set of thought, money and actions put towards either saving money or making more money for the business. How do you do this? Obviously, by analyzing some data with some tools.

For example, consider yourself a store manager at Nike who wants to put up a shop in a new city. So, where should you put up your shop? In the downtown area? Near a college? Or near some sports equipment shops? If you are a good store manager who wants to make more money, you would consider analyzing some data about demographics, competitors and property rates to arrive at a magical combination which will cost the least but will drive sales through the roof.

The process of transforming this input data into actionable insights will help you make strategic decisions (like finding a sweet spot for your shop). And this, my friends, is termed Business Intelligence..!! The infographic below will help you visualize this process.

Typical Life Cycle of Business Intelligence Application

It all begins with data! Whether it is structured data like relational tables, semi-structured XML files or unstructured data like Facebook posts, it usually goes through ETL. Here, data is extracted from its source and placed into a staging area where it is cleansed, transformed and loaded into the data-warehouse. The next phase is the reporting phase, which gets its input from data-marts. This information is then visualized using reports or dashboards and is also made available across mobile platforms.
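The ETL flow just described can be compressed into a toy Python sketch; the source rows and cleansing rules are purely illustrative:

```python
# Toy ETL pipeline: extract raw rows from a source, cleanse and
# transform them in a "staging" step, then load the result into an
# (in-memory) warehouse. All names and rules are illustrative.
raw_source = [
    {"customer": " Alice ", "amount": "100.50"},
    {"customer": "BOB",     "amount": "75.25"},
]

def extract(source):
    """Pull rows out of the source system."""
    return list(source)

def transform(rows):
    """Staging area: trim whitespace, normalise names, fix types."""
    return [{"customer": r["customer"].strip().title(),
             "amount": float(r["amount"])} for r in rows]

def load(rows, warehouse):
    """Append the cleansed rows to the warehouse."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(raw_source)), warehouse=[])
print(warehouse)
```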

Now, let's understand some more key concepts by continuing our Nike store example.

Data-Warehouse


Data-Warehouse is like a neatly arranged and organized closet which gives you quick access to your day-to-day clothes, keeps track of your dirty laundry and tells you that it's time to buy some new clothes. In formal words - a data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management's decision-making process.
  • Subject-Oriented is similar to keeping your regular, party and sports wear in different compartments.
  • Integrated is like having multiple sources for your clothes- newly bought, borrowed from friend or passed on to you by your older sibling.
  • Time-Variant means keeping some very old clothes- your first football jersey or your first Halloween costume.
  • Non-Volatile can be described as: once you get clothes, you never alter them. If you are growing or putting on weight, you just get bigger-sized clothes.
All these properties of an ideal closet (or data warehouse) will help you (the manager) take more informed decisions.

OLTP v/s OLAP

Online Transaction Processing a.k.a OLTP

Consider yourself working at the point of sale (typically where you do the billing) of your new Nike store where:
  • You have to process hundreds of billing transactions in a day which get stored in some operational database. At the end of the day you simply query daily sales against this database to verify the cash in hand.
  • The system should be fast, and order info must be inserted, updated and sometimes deleted (in case of an incorrect order entry) within seconds.
  • You just worry about handling the fundamental business process - take cash in and sell goods to customers. This is an Online Transaction Processing (OLTP) system.

Online Analytical Processing a.k.a OLAP

Now some years have passed by and your hard work has paid off.. and now you are the manager of the store where:
  • You do run some complex queries to analyze the state of your business. How are your monthly sales compared to last year? How much discount can you afford to attract more customers for the coming Christmas sale?
  • Obviously these queries will take some time to run because you are consolidating data for a few months (not just doing some inserts, updates and deletes)
  • Now, your aim is to take some informed strategic decisions which can enable you to increase your revenue further. This is an Online Analytical Processing (OLAP) system.
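An OLAP-style consolidation can be illustrated in a few lines; the orders below are made-up sample data:

```python
# OLAP-flavoured query in miniature: consolidate order rows into
# monthly sales totals, instead of touching individual transactions
# one at a time the way an OLTP system would.
from collections import defaultdict

orders = [
    {"month": "2014-01", "amount": 120.0},
    {"month": "2014-01", "amount": 80.0},
    {"month": "2014-02", "amount": 200.0},
]

monthly_sales = defaultdict(float)
for order in orders:
    monthly_sales[order["month"]] += order["amount"]

print(dict(monthly_sales))  # {'2014-01': 200.0, '2014-02': 200.0}
```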

ETL - Extract..Transform..Load

Extract, Transform and Load.. This is how data warehouses are built..!!
  • Incoming goods for your Nike store are first extracted (or received) from one or many of the company's factory locations.
  • These goods are checked for any obvious defects (like a T-shirt having only one sleeve). Goods in good condition are packaged and labelled with the selling price and discount information in the local currency (transformation).
  • And finally, these transformed goods are loaded into appropriate sections of the store (like Men's, Women's, Kids' etc.). These sections can be likened to Data-Marts in data warehousing terminology.

Reporting & Visualization


 

So how do you decide if your store is doing well?
  • First, you decide on some KPIs a.k.a Key Performance Indicators. In the case of a Nike store, these KPIs can be the number of customers, number of orders, total sales, total profit, etc.
  • Once you have the KPIs, you usually would like to see them in nicely formatted reports. These can be a Daily Sales report or a Quarterly Revenue report.
  • If you own several stores in the west coast region, you are more likely to want to see the overall performance of these stores with respect to time and location. That's where Dashboards come in.
  • And if you want to compare your store's sales with last year's sales.. or compare actual sales v/s forecasted sales, then you probably want to have a scorecard.

Let's hope this new role as a Nike store manager helped you refresh your understanding of common BI terms. Stay tuned for more interesting posts on Data Analytics and Big Data! Feel free to express yourself in the comments section. Happy Learning!

~Slice of BI

Sunday

WHAT, WHY and HOW of Analytics and Business Intelligence

Before we start.. How about a famous quote..


We will leave destiny, life and karma aside for this post.. Let's connect some dots to take a deep dive into the world of Analytics and Business Intelligence..


Everything's intentional. It's just filling in the dots. -David Byrne

Here is the journey of Data to Wisdom:

  • Data is just some raw facts. (Similar to collecting . . . . <4 dots in above pic>)
  • Information is when data has some meaning, mostly by defining some relationships within the dots. (Consider this as trying to connect the 4 dots with each other)
  • Knowledge is analyzing and synthesizing the information in order to give it a meaningful purpose. (Can we connect or relate this small piece with a bigger piece?)
  • Wisdom is using knowledge to establish and achieve goals. It usually builds on the past by combining it with gut, judgement and experience to reveal a new understanding. (Making a pattern out of the connected dots for better understanding the bigger picture)

So how do we transform data into information, and use our knowledge to convert it into wisdom? 

Well, it's not that simple, but here is a good infographic which gives a 30,000 ft view of how it's done.

30,000 ft view of Analytics and Business Intelligence
Let's hope that you got some idea about the WHAT and WHY parts of it. To get a clear understanding of HOW these things are done, look out for upcoming posts on Business Intelligence, Data Mining and Big Data.

Feel free to share your feedback, suggestions (or just to say Hi) in the comments section! Happy Learning!

~Slice of BI




Wednesday

Hello World..!!

Sometimes we learn things in our life, but rarely understand them.. We all have been through this phase.. They say that half knowledge is dangerous..!! So is a superficial understanding of some key concepts in technology..

What can "You" get from this blog?

This blog attempts to go beyond the technical jargon and boring textbook definitions to really understand basic business intelligence concepts. After reading this blog you should be able to answer 3 basic questions about common BI terms:
  • WHAT is it?
  • WHY is it needed?
  • HOW to do it?
Sometimes you might not know how to do it; in that case, instead of re-inventing the wheel we will help you with-
  • WHERE to find it?

The initial idea is to cover topics like Business Intelligence, Data Warehouse Concepts, Data Mining, Data Analytics Tools and Technologies, etc. 

If you feel the need for posts on some specific topics within these major areas - we are all ears..!! Just let us know, and we will be on it.. :) 

That is all for the first post.. I hope you find the upcoming sections useful.. Happy Learning..!!

~Slice Of BI