Saturday

Data Mining in a Nutshell..!!



Data Mining, in plain terms, is the analysis of large data-sets to discover patterns, trends and relationships that go beyond simple analysis.

A more formal definition would be "a computational process for discovering patterns, trends and behavior in large data-sets using Artificial Intelligence, Machine Learning, Statistics and Database Systems." Phew, that's a lot of technologies involved, isn't it! Well, let's simplify this and take a bird's eye view of a real-world example of Data Mining.



Data Mining in Recommendation Systems


Whenever you make a purchase on Amazon, or for that matter even just search for products there, you will have noticed a string of 'Recommended Products' vying for your attention in a corner of the page. These recommendation systems are a classic example of Data Mining. Let's take the example of Amazon's Recommendation System and explore the steps involved in Data Mining.



Amazon maintains a big data-warehouse containing tons of information about transactions, products, users, etc. The first step towards building a recommendation system would be to pull out only the information that is required for the system. In this particular case, it could include information such as customers' transactions, products viewed, customer feedback, product categories, product prices, etc. This step is the Selection step, i.e. identifying and capturing only the necessary data from the data-warehouse. The selected data can be referred to as 'Target Data'.
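To make the Selection step concrete, here is a minimal Python sketch. The rows and field names (customer_id, warehouse_bin, and so on) are invented for illustration and say nothing about how Amazon's warehouse is actually organized; the point is only that we keep the attributes we need and leave the rest behind.

```python
# Hypothetical warehouse rows; every field name here is an assumption.
warehouse_rows = [
    {"customer_id": 1, "product_id": "B001", "category": "books",
     "price": 12.5, "warehouse_bin": "A-17", "supplier_id": 90},
    {"customer_id": 2, "product_id": "E042", "category": "electronics",
     "price": 199.0, "warehouse_bin": "C-03", "supplier_id": 12},
]

# Attributes assumed to matter for a recommendation system.
TARGET_FIELDS = ("customer_id", "product_id", "category", "price")


def select_target_data(rows, fields=TARGET_FIELDS):
    """Selection step: keep only the listed fields from each row."""
    return [{field: row[field] for field in fields} for row in rows]


target_data = select_target_data(warehouse_rows)
print(target_data[0])
```

Everything that is not in TARGET_FIELDS (here the storage bin and the supplier) is dropped, so the later steps only see the 'Target Data'.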


The information in the Target Data may often need some Pre-Processing before it can actually be used to build the system. How should Missing Values be handled? Which attributes do we need? Do we need to create any calculated fields? These are some of the decisions that need to be made at this stage. Textual data tends to require much more cleaning. For example, textual data such as customer feedback would need to be processed by removing articles, pronouns, etc. (commonly referred to as stop words), removing punctuation, converting to lower case, etc.
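The text-cleaning part of Pre-Processing could look roughly like the sketch below. The stop-word list is a tiny hand-picked set chosen for this example (real lists are far longer), and clean_feedback is a hypothetical helper name.

```python
import string

# A tiny, hand-picked stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "i", "it", "this", "is", "and", "my"}


def clean_feedback(text):
    """Lower-case the text, strip punctuation, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [word for word in no_punct.split() if word not in STOP_WORDS]


tokens = clean_feedback("I love this phone, and the battery is GREAT!")
print(tokens)  # ['love', 'phone', 'battery', 'great']
```

Only the words that carry some meaning about the product survive, which is what the later mining step cares about.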

Once the pre-processing is done, the format of the data may have to be changed according to the algorithm that is to be used. For example, from all the previous transactions made by the customer, it will be of interest to us to capture the products recently bought by the customer, say in the previous 3 months, and treat them as a separate variable. All of this is done in the Transformation step.
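Here is one way the 'recently bought' variable from the example above might be derived. The transactions are made up, and 'previous 3 months' is approximated as 90 days, which is an assumption of this sketch.

```python
from datetime import date, timedelta

# Invented transactions; the field names are assumptions.
transactions = [
    {"customer_id": 1, "product_id": "B001", "purchased_on": date(2024, 1, 5)},
    {"customer_id": 1, "product_id": "B002", "purchased_on": date(2024, 5, 20)},
    {"customer_id": 2, "product_id": "B003", "purchased_on": date(2024, 6, 1)},
]


def recent_purchases(rows, today, window_days=90):
    """Transformation step: per customer, list products bought in the window."""
    cutoff = today - timedelta(days=window_days)
    recent = {}
    for row in rows:
        if row["purchased_on"] >= cutoff:
            recent.setdefault(row["customer_id"], []).append(row["product_id"])
    return recent


recent = recent_purchases(transactions, today=date(2024, 6, 15))
print(recent)  # {1: ['B002'], 2: ['B003']}
```

Customer 1's January purchase falls outside the window, so only the May purchase ends up in the new variable.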

Once the Selection, Pre-Processing and Transformation of the data are done, we should be ready to actually 'mine' the data! The most critical part of designing a data mining solution is formulating the question you are trying to answer. The results, and their quality, will depend greatly on how well you understand the problem you are trying to solve. Based on the task at hand, we can opt for different Data Mining algorithms. For this particular case, applying a Clustering Algorithm to the products might be a good choice for identifying products similar to the ones the customer has bought. The results of the algorithm should help us identify patterns, trends, etc. Visualization tools and techniques greatly help us interpret the results. We can further use statistical methods to extrapolate and make predictions.
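To give a feel for the clustering idea, here is a bare-bones k-means sketch in plain Python. k-means is just one possible clustering algorithm, chosen here for illustration; the products, their two feature values and the choice of k=2 are all made up, and seeding the centroids with the first k points is a simplification a real system would not rely on.

```python
import math

# Hypothetical products with two already-scaled feature values each.
products = {
    "novel": (1.0, 1.0),
    "cookbook": (1.5, 2.0),
    "atlas": (1.0, 0.5),
    "headphones": (8.0, 8.0),
    "speaker": (9.0, 9.5),
    "earbuds": (8.5, 9.0),
}


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points seed the centroids (a simplification)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return labels


names = list(products)
labels = kmeans([products[name] for name in names], k=2)
clusters = dict(zip(names, labels))


def similar_to(name):
    """Other products that landed in the same cluster as the given one."""
    return [n for n in names if n != name and clusters[n] == clusters[name]]


print(similar_to("novel"))  # ['cookbook', 'atlas']
```

A customer who bought the novel would then be shown the other products in its cluster. How good those suggestions are depends entirely on the features we picked in the earlier steps.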