Introduction to Data Mining


If we compare today's society with the society we had in the year 2000, we see a huge surge in 'Digitization'. Analog machines that were operated by humans have turned into machines of '1's and '0's that largely control themselves. People have attached themselves to various technical devices to make their lives easier, and every industry has tried to adopt 'Technology' in its activities.
As a result, information about many human activities is monitored, identified, reported and stored in various places and in various forms. With the development of the World Wide Web (WWW), people have found many paths to reach those sources. In fact, instead of looking at one source of data, people have moved on to merging several data sources and observing the 'connected data'; the big word: BIG DATA.
With these improvements, 'Data Mining' has become a 'hot' subject among computer scientists. Although we have powerful storage mechanisms that let us save every bit of data without discarding anything, generating Information from that data is not yet nearly as well developed. That makes the field of Data Mining even hotter!
Looking at the uses of the subject 'Data Mining', even if you are not in an IT field, the results of these findings will be beneficial for your field. That is why knowing the basics of Data Mining is important for everyone.

What is Data Mining?
Well, although it contrasts with the name, it is not mining Data; it is mining Information. We are mining for Information. We take those previously mentioned datasets, apply various operations to them, and come up with meaningful statistics and models, which you can call Information.
Data Mining is one step of this complicated process. We will not find information just by mining; there is a substantial process before and after the mining step. But it is the main step of the whole process. Therefore we will interpret the subject as the extraction of interesting patterns of knowledge from sets of data.

The process of Knowledge Discovery in Databases (KDD)
This is the process that was mentioned above.
Image : http://www.ryerson.ca/~rmichon/mkt700/readings/KDD%20Process%20Overview.htm

As the figure shows, the main steps of the process include:
  1. Data Selection
  2. Data Pre-processing
  3. Data Transformation
  4. Data Mining
  5. Pattern Evaluation
To do proper Data Mining, all the preceding steps have to be completed with no margin for error. As you can see, Data Mining is the step that converts Data into meaningful patterns, so it plays the major role in the whole KDD process.

Since we are focusing on Data Mining, let's go through some methods used in that step.
The main 'types' of data mining methods we will consider are:
  • Association Rule Mining (ARM)
  • Classification
  • Clustering
First let's compare and contrast these three types.

By looking at a single bit of data, we usually cannot interpret any meaningful information (there are exceptions). So in this process of Data Mining, what we concentrate on is creating connections between pieces of data. Simply put, if we can match data1 with data2 and say something meaningful, that is considered a successful mining of Data. For this process of 'finding connections' we use algorithms.
In Association Rule Mining, what we search for are the patterns that occur frequently in the dataset, because if there is an actual association between a few attributes, the data relevant to those attributes must show some definite pattern when a set of records is considered. There are a few techniques for finding those patterns.

1. Apriori Approach
First, we scan the database and identify the single items that occur at least as often as a chosen minimum (the frequent 1-itemsets).
Then, combining only those items, we form candidate pairs and keep the pairs that are frequent.
Likewise, in each round we grow the itemsets by one item and identify the relevant patterns, relying on the fact that every subset of a frequent itemset must itself be frequent.
The process stops when no larger frequent itemsets can be formed, or at a size the user chooses. Once those patterns are found, we check their 'support level' and 'confidence level'. You need some knowledge of math to understand support and confidence precisely; since this is only an introduction, you can think of it as checking certain qualities of these patterns: support measures how often a pattern appears in the dataset, and confidence measures how often a rule built from it actually holds. What we want are patterns with high support and high confidence.
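The rounds above can be sketched in a few lines of Python. This is only a minimal, illustrative implementation (function and variable names are my own, and the usual candidate-pruning optimisations are omitted): it grows frequent itemsets one item at a time, scanning the transaction list once per round.

```python
def apriori(transactions, min_support):
    """Find all itemsets appearing in at least min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Round 1: count single items and keep the frequent ones.
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items
               if sum(i in t for t in transactions) >= min_support]
    frequent = list(current)
    k = 2
    while current:
        # Combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # One scan of the database counts each candidate's support.
        current = [c for c in candidates
                   if sum(c <= t for t in transactions) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk", "butter"}]
print(sorted(sorted(s) for s in apriori(baskets, min_support=2)))
```

On the toy baskets this finds all three single items and all three pairs as frequent, but not the triple, which occurs only once.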

Issues in Apriori: The main issue with Apriori is the large number of database scans the process needs, one per itemset size. These scans are very time consuming, since we are working with huge sets of data.

2. Partitioning
Here what we do is partition the database and then follow the Apriori approach in each partition. In each partition we find the frequent patterns with respect to that partition, which we can call local frequent patterns. Then we collect those local frequent patterns and, from them, choose the ones that are also globally frequent; any pattern frequent in the whole database must be locally frequent in at least one partition, so nothing is missed.
Rather than partitioning a single dataset, where this is mostly used is with distributed datasets. Instead of fetching the data from all the locations into one place, we can follow Apriori at each location, come up with local frequent patterns, and then assemble those patterns in a single place. In this way we avoid the huge job of transporting and storing large datasets.
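Assuming a simple two-phase scheme (and restricting the sketch to single items for brevity; names are my own), the partitioning idea looks like this: local results form a candidate set, which one global scan then confirms.

```python
def local_frequent(partition, min_ratio):
    """Items frequent within one partition (itemsets of size > 1 omitted)."""
    counts = {}
    for record in partition:
        for item in record:
            counts[item] = counts.get(item, 0) + 1
    return {i for i, c in counts.items() if c >= min_ratio * len(partition)}

def global_frequent(partitions, min_ratio):
    # Phase 1: every globally frequent item must be locally frequent in at
    # least one partition, so the union of local results is a complete
    # candidate set.
    candidates = set()
    for p in partitions:
        candidates |= local_frequent(p, min_ratio)
    # Phase 2: a single full scan confirms the candidates globally.
    total = sum(len(p) for p in partitions)
    counts = {c: 0 for c in candidates}
    for p in partitions:
        for record in p:
            for c in candidates:
                if c in record:
                    counts[c] += 1
    return {c for c, n in counts.items() if n >= min_ratio * total}

parts = [[{"a", "b"}, {"a"}], [{"b"}, {"b", "c"}]]
print(sorted(global_frequent(parts, 0.5)))  # → ['a', 'b']
```

Here 'c' is locally frequent in the second partition but fails the global check, which is exactly the candidate-then-confirm behaviour described above.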

3. Frequent Pattern Tree (FP-Tree)
Suppose the minimum support we need for a pattern is 3 occurrences. Then here what we do is scan the database and find all the items (let's call this set 'S') which are repeated 3 or more times. Then, for the database, we add another column, and for each record we check which of the items in the record are included in the set 'S'. If there are items common to the record and the set 'S', we write them in the column; if there are multiple items, we write them in descending order of frequency.
Then we scan the database again, and while doing so we build the FP-Tree. The way the tree is built is not included here; if interested, you can google it and find many posts.
The advantage here is that this method does only 2 scans of the database. The tree we build represents all the frequent patterns in the database in a compressed manner, and it will never be larger than the original database. Therefore it is Complete and Compact.
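The two scans can be sketched as follows. This is only an illustrative construction (the node layout and names are my own, and the mining phase of FP-growth is omitted): scan 1 counts item frequencies, scan 2 inserts each record's frequent items, ordered by descending frequency, into a prefix tree whose shared prefixes provide the compression.

```python
from collections import defaultdict

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    # Scan 1: count item frequencies and keep the frequent items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    frequent = {i: c for i, c in freq.items() if c >= min_support}
    # Scan 2: insert each transaction's frequent items, sorted by
    # descending global frequency, into a shared prefix tree.
    root = Node(None)
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root, frequent

root, freq = build_fp_tree([{"a", "b"}, {"b", "c"}, {"a", "b", "c"}], 2)
print(freq)  # item counts after scan 1
```

Because 'b' is the most frequent item, all three toy transactions share the same 'b' prefix node, which carries a count of 3 instead of three separate branches.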

Now let's look at some Classification methodologies.
This is a supervised learning methodology. We divide the dataset we have into two sets: Training data and Testing data. A common split is 75% for Training and 25% for Testing, though it differs from dataset to dataset.
Then we take the training dataset and, using algorithms, try to come up with a model that fits those data. We call it the 'Classifier'. The main algorithms we use for generating classifiers are decision trees, the naive Bayesian classifier, if-then rules, classification by back-propagation (neural networks), support vector machines, the k-nearest neighbour algorithm, case-based reasoning and genetic algorithms.
The model we build will consist of a set of classes. When we input the necessary data, the model claims which class the record belongs to. Once we have built the model, we test its accuracy using the Testing dataset.
Note that classification and prediction are two different things that look similar. In classification, what we do is predict the class, out of the set of classes the model has.
But in Prediction, what we do is predict an attribute's value (continuous or ordered) based on the input data.
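As a concrete illustration of the split-train-test cycle, here is a toy classifier using the k-nearest neighbour algorithm mentioned above, in plain Python on synthetic data (the dataset, labels and names are invented for the example):

```python
import random

def knn_predict(train, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda row: dist(row[0], point))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Toy dataset: 2-D points labelled by which blob they were drawn around.
random.seed(0)
data = [((random.gauss(0, 1), random.gauss(0, 1)), "low") for _ in range(40)] + \
       [((random.gauss(5, 1), random.gauss(5, 1)), "high") for _ in range(40)]
random.shuffle(data)

split = int(0.75 * len(data))          # the common 75/25 split
train, test = data[:split], data[split:]

# Evaluate the classifier on the held-out testing set.
correct = sum(knn_predict(train, x) == y for x, y in test)
print(f"accuracy: {correct / len(test):.2f}")
```

Because the two blobs are well separated, the accuracy on the testing set is close to 1.0; on real data the measured accuracy is what tells you whether the classifier is usable.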

Finally, what is clustering?
Clustering is unsupervised learning: there are no predefined classes as in Classification. Using the characteristics of the data, what we try to do is group the similar data objects into clusters. Most of the time, for grouping the data, what we look at is the spread of the data with respect to its dimensions.
Inter-class similarities are the similarities between clusters; intra-class similarities are those inside a certain cluster. If the clusters we end up with have low inter-class similarity and high intra-class similarity, the clustering is considered good. To form these clusters we use k-means clustering, k-medoid clustering, hierarchical clustering, density-based clustering (DBSCAN, OPTICS), etc.
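The grouping idea can be illustrated with a minimal k-means sketch in plain Python (the toy points and names are my own; real work would use a library implementation). Each iteration assigns every point to its nearest centre, then moves each centre to the mean of its group:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Cluster 2-D (or n-D) tuples into k groups by Lloyd's iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        # Assignment step: attach each point to its nearest centre.
        groups = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k),
                      key=lambda i: sum((a - b) ** 2
                                        for a, b in zip(p, centers[i])))
            groups[idx].append(p)
        # Update step: move each centre to the mean of its group.
        centers = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = kmeans(pts, 2)
print([len(g) for g in groups])
```

On these six points the iteration separates the two obvious blobs of three points each: high similarity inside each group, low similarity between them.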

These are the basic methods used in the Data Mining step. The subject is still so young, with lots of dark areas to be researched! People who are interested can find these things around the internet, dig deeper and serve the community with their findings ;)

This article is a very brief collection of methodologies. What I wanted to give you is the basic idea of Data Mining and the techniques used for Data Mining at present.

References

Han, J. and Kamber, M.: Data Mining: Concepts and Techniques


Based on the things learnt in the lecture series done by Dr. Amal Shehan Perera as a final year module.




