K Means Clustering With Decision Tree Computer Science Essay

The K-means clustering algorithm is commonly used to find clusters because it is simple to implement and fast to execute. After K-means has been applied to a dataset, however, it is difficult to interpret the clusters and extract the required results from them unless another data mining algorithm is used. The decision tree (ID3) is used to interpret the clusters produced by K-means because ID3 is fast to use, generates understandable rules and is simple to explain. In this research paper we integrate the K-means clustering algorithm with the decision tree (ID3) algorithm into a single algorithm using an intelligent agent, called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of classifying and interpreting the given dataset. For visualization of the clusters, 2D scatter graphs are drawn.

Keywords: Classification, LIAgent, Interpretation, Visualization

1. Introduction

Data mining algorithms are applied to discover hidden, new patterns and relations in complex datasets. The use of intelligent mobile agents in data mining further boosts their study. The term ‘intelligent mobile agent’ combines two disciplines: the ‘agent’ comes from Artificial Intelligence and ‘code mobility’ from distributed systems. An agent is an object that has an independent thread of control and can be initiated. The first step is agent initialization. The agent then starts to operate and may stop and start again depending on the environment and the tasks it is trying to accomplish. After the agent has finished all the required tasks, it ends in its complete state. Table 1 lists the states of an agent [1][2][3][4].

Table 1. States of an agent

| Name of Step | Description |
| --- | --- |
| Initialize | Performs one-time setup activity. |
| Start | Starts its job or task. |
| Stop | Stops its jobs or tasks after saving intermediate results. |
| Complete | Performs completion or termination activity. |
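The four states in Table 1 can be read as a small state machine. The following Python sketch is only an illustration of that reading: the class names `AgentState` and `Agent`, the extra `CREATED` state and the transition checks are assumptions made for the example and are not part of the LIAgent described in this paper.

```python
from enum import Enum


class AgentState(Enum):
    CREATED = "created"          # hypothetical state before Initialize
    INITIALIZED = "initialized"
    RUNNING = "running"
    STOPPED = "stopped"
    COMPLETED = "completed"


class Agent:
    """Toy agent that walks through the four lifecycle steps of Table 1."""

    def __init__(self):
        self.state = AgentState.CREATED
        self.saved_results = None

    def initialize(self):
        # Initialize: one-time setup activity.
        if self.state is not AgentState.CREATED:
            raise RuntimeError("initialize() may only run once")
        self.state = AgentState.INITIALIZED

    def start(self):
        # Start: begin the job, or resume it after a stop.
        if self.state not in (AgentState.INITIALIZED, AgentState.STOPPED):
            raise RuntimeError("cannot start from state " + self.state.value)
        self.state = AgentState.RUNNING

    def stop(self, intermediate_results):
        # Stop: save intermediate results, then pause.
        if self.state is not AgentState.RUNNING:
            raise RuntimeError("cannot stop from state " + self.state.value)
        self.saved_results = intermediate_results
        self.state = AgentState.STOPPED

    def complete(self):
        # Complete: termination activity; no further transitions are allowed.
        if self.state not in (AgentState.RUNNING, AgentState.STOPPED):
            raise RuntimeError("cannot complete from state " + self.state.value)
        self.state = AgentState.COMPLETED
```

An agent built this way may alternate between `start()` and `stop()` any number of times before `complete()` is called, which matches the description of an agent that stops and starts again depending on its environment.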

There is a link between Artificial Intelligence (AI) and Intelligent Agents (IA). In Artificial Intelligence, data mining is known as “Machine Learning”. Machine Learning deals with the development of techniques that allow a computer to ‘learn’; it is a method of creating computer programs through the analysis of datasets. Agents must be able to learn to perform classification, clustering and prediction using learning algorithms [5][6][7][8].

The remainder of this paper is organized as follows: Section 2 reviews the relevant data mining algorithms, namely K-means clustering and the decision tree (ID3). Section 3 describes the methodology, a hybrid integration of the two algorithms. Section 4 presents the results and discussion. Finally, Section 5 presents the conclusion.

2. Overview of Data Mining Algorithms

The K-means clustering algorithm is used to classify a dataset by producing clusters of that dataset; it is a form of ‘unsupervised learning’. The decision tree (ID3) algorithm is used to interpret these clusters by producing decision rules in if-then-else form; it is a form of ‘supervised learning’. Both algorithms are combined into one algorithm through an intelligent agent, called the Learning Intelligent Agent (LIAgent). This section discusses both algorithms.

2.1. K-means clustering Algorithm

The following steps explain the K-means clustering algorithm:

Step 1: Enter the number of clusters and number of iterations, which are the required and basic inputs of the K-means clustering algorithm.

Step 2: Compute the initial centroids by using the Range Method shown in equations 1 and 2.

$$c_i = \frac{\max X - \min X}{k} \times n \tag{1}$$

$$c_j = \frac{\max Y - \min Y}{k} \times n \tag{2}$$

The initial centroid is C(ci, cj), where max X, max Y, min X and min Y are the maximum and minimum values of the X and Y attributes respectively, ‘k’ is the number of clusters, and i, j and n are integers that vary from 1 to k. The value (max X – min X) gives the range of the ‘X’ attribute; similarly, (max Y – min Y) gives the range of the ‘Y’ attribute. The initial centroids calculated in this way are the starting point of the algorithm. The number of iterations should be kept small; otherwise the time and space complexity become very high, and the values of the initial centroids may also become very high and fall outside the range of the given dataset. This is a major drawback of the K-means clustering algorithm.

Step 3: Calculate the distances using the Euclidean distance formula in equation 3. On the basis of the distances, generate the partition by assigning each sample to the closest cluster.

Euclidean distance formula:

$$d(x_i, x_j) = \sqrt{\sum_{m=1}^{N} (x_{im} - x_{jm})^2} \tag{3}$$

where d(xi, xj) is the distance between objects xi and xj, x_im and x_jm are the values of the m-th attribute of xi and xj, m is an integer that varies from 1 to N, and N is the total number of attributes of an object.

Step 4: Compute the new cluster centers as the centroids of the clusters, then compute the distances again and regenerate the partition. Repeat this until the cluster memberships stabilize [9][10].
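The four steps above can be sketched in Python for two-attribute data. This is a minimal sketch under stated assumptions, not the authors' implementation: `range_init` assumes that the Range Method places the n-th initial centroid at n times (range / k) for each attribute, with n running from 1 to k, and the function names `range_init`, `euclidean` and `kmeans` are chosen for the example only. Keeping the old centroid when a cluster is empty is also an assumption.

```python
import math


def range_init(points, k):
    """Initial centroids from the attribute ranges (assumed form of the Range Method)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_step = (max(xs) - min(xs)) / k
    y_step = (max(ys) - min(ys)) / k
    return [(x_step * n, y_step * n) for n in range(1, k + 1)]


def euclidean(a, b):
    """Euclidean distance between two objects with the same number of attributes."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def kmeans(points, k, max_iter):
    """Return (labels, centroids) after at most max_iter iterations."""
    centroids = range_init(points, k)           # Step 2
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each sample to the closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break                               # memberships have stabilized
        labels = new_labels
        # Step 4: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:                         # assumption: keep old centroid if empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centroids
```

For example, `kmeans([(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)], 2, 50)` starts from the centroids (4, 4) and (8, 8) and separates the three points near the origin from the three points near (8, 8).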

The strengths and weaknesses of the K-means clustering algorithm are discussed in table 2.

Table 2. Strengths and weaknesses of the K-means clustering algorithm

| Strengths | Weaknesses |
| --- | --- |
| Time complexity is O(nkl), which is linear in the size of the dataset. | Although it is easy to implement, it has the drawback of depending on the initial centres provided. |
| Space complexity is O(k + n). | If a distance measure does not exist, especially in multidimensional spaces, the distance must first be defined, which is not always easy. |
| It is an order-independent algorithm: it generates the same partition of the data irrespective of the order of the samples. | The results obtained from this clustering algorithm can be interpreted in different ways. |
| Not applicable | No clustering technique addresses all the requirements adequately and concurrently. |

The K-means clustering algorithm can be applied in, but is not limited to, the following areas:

Marketing: Finding groups of customers with similar behavior, given a large database of customer profiles and past records.

Biology: Classification of plants and animals given their features.

Libraries: Book ordering.

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost; identifying frauds.

City-planning: Identifying groups of houses according to their house type, value and geographical location.

Earthquake studies: Clustering observed earthquake epicenters to identify dangerous zones.

WWW: Document classification; clustering web log data to discover groups of similar access patterns.

Medical Sciences: Classification of medicines; classification of patient records according to their doses, etc. [11][12].

2.2. Decision Tree (ID3) Algorithm

The decision tree (ID3) produces decision rules as its output. The rules are in if-then-else form and can be used for decision support systems, classification and prediction. They help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice. The function of the decision tree (ID3) is shown in Figure 1.

Figure 1. The Function of Decision Tree (ID3) algorithm

The cluster is the input data for the decision tree (ID3) algorithm, which produces the decision rules for the cluster.

The following steps explain the Decision Tree (ID3) algorithm:

Step 1: Let ‘S’ be a training set. If all instances in ‘S’ are positive, create a ‘YES’ node and halt. If all instances in ‘S’ are negative, create a ‘NO’ node and halt. Otherwise select a feature ‘F’ with values v1, …, vn and create a decision node.

Step 2: Partition the training instances in ‘S’ into subsets S1, S2, …, Sn according to the values of ‘F’.

Step 3: Apply the algorithm recursively to each of the sets Si [13][14].
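The three steps can be sketched as a short recursive function in Python. The steps above do not say how the feature ‘F’ is selected; the sketch uses information gain, which is the standard ID3 criterion. The function names, the dictionary representation of the tree and the majority-class fallback when no features remain are assumptions made for this illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, feature):
    """Reduction in entropy obtained by splitting on one feature."""
    total = len(rows)
    remainder = 0.0
    for value in set(row[feature] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[feature] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def id3(rows, labels, features):
    """Build a decision tree as nested dicts; leaves are class labels."""
    # Step 1: a set whose instances all share one class becomes a leaf.
    if len(set(labels)) == 1:
        return labels[0]
    # Assumption: with no features left, fall back to the majority class.
    if not features:
        return Counter(labels).most_common(1)[0][0]
    best = max(features, key=lambda f: information_gain(rows, labels, f))
    remaining = [f for f in features if f != best]
    tree = {best: {}}
    # Step 2: partition S according to the values of the chosen feature.
    for value in sorted(set(row[best] for row in rows)):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        # Step 3: apply the algorithm recursively to each subset.
        tree[best][value] = id3(
            [rows[i] for i in idx], [labels[i] for i in idx], remaining
        )
    return tree
```

Each path from the root of the returned dictionary to a leaf corresponds to one if-then-else rule of the kind described in this section.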

Table 3 shows the strengths and weaknesses of ID3 algorithm.

Table 3. Strengths and weaknesses of the decision tree (ID3) algorithm

| Strengths | Weaknesses |
| --- | --- |
| It generates understandable rules. | It is less appropriate for a continuous attribute. |
| It performs classification without requiring much computation. | It does not perform well in problems with many classes and a small number of training examples. |
| It is suitable to handle both continuous and categorical variables. | Growing a decision tree is computationally expensive because it sorts each node before finding the best split. |
| It provides an indication for prediction or classification. | It is suitable for a single field and does not handle non-rectangular regions well. |

3. Methodology

We combine two data mining algorithms, K-means clustering and the decision tree (ID3), into a single algorithm using an intelligent agent called the Learning Intelligent Agent (LIAgent). The LIAgent is capable of clustering and interpreting the given dataset. The clusters can also be visualized using 2D scatter graphs. The architecture of this agent system is shown in Figure 2.

Figure 2. The Architecture of LIAgent System

The LIAgent combines two data mining algorithms: the K-means clustering algorithm and the decision tree (ID3) algorithm. K-means produces the clusters of the given dataset, which constitute its classification, and ID3 produces decision rules for each cluster, which are used to interpret the clusters. The user can access both the clusters and the decision rules from the LIAgent. The clusters are further used for visualization with 2D scatter graphs. ID3 is fast to use, generates understandable rules and is simple to explain, since any decision can be understood by following its path through the tree. The decision rules take if-then-else form and can be used for decision support systems, classification and prediction; they also help to form an accurate, balanced picture of the risks and rewards that can result from a particular choice.

A medical dataset, ‘Diabetes’, is used in this research paper. It is a dataset/testbed of 790 records. The data are pre-processed by data standardization, and the interval-scaled data are properly cleansed. The attributes of the ‘Diabetes’ dataset are:

Number of times pregnant (NTP)

Plasma glucose concentration at 2 hours in an oral glucose tolerance test (PGC)

Diastolic blood pressure (mm Hg) (DBP)

Triceps skin fold thickness (mm) (TSFT)

2-hour serum insulin (mu U/ml) (2HIS)

Body mass index (weight in kg/(height in m)^2) (BMI)

Diabetes pedigree function (DPF)

Age (min. age = 21, max. age = 81)

Class (whether diabetes is cat 1 or cat 2) [15].

We create four vertical partitions of the ‘Diabetes’ dataset by selecting an appropriate number of attributes for each. The partitions are illustrated in Tables 4 to 7.

Table 4. First vertical partition of the Diabetes dataset

| NTP | DPF | Class |
| --- | --- | --- |
|  | 0.627 | -ive |
|  | 0.351 | +ive |
|  | 2.288 | -ive |

Table 5. Second vertical partition of the Diabetes dataset

| DBP | AGE | Class |
| --- | --- | --- |
|  |  | -ive |
|  |  | +ive |
|  |  | -ive |

Table 6. Third vertical partition of the Diabetes dataset

| TSFT | BMI | Class |
| --- | --- | --- |
|  | 33.6 | -ive |
|  | 28.1 | +ive |
|  | 43.1 | -ive |

Table 7. Fourth vertical partition of the Diabetes dataset

| PGC | 2HIS | Class |
| --- | --- | --- |
| 148 |  | -ive |
|  |  | +ive |
| 185 | 168 | -ive |

Each partitioned table is a dataset of 790 records; only three sample records are shown in each table. For the LIAgent, the number of clusters ‘k’ is 4 and the number of iterations ‘n’ is 50 in each case. The decision rules of each cluster are obtained. For visualization of the results, 2D scatter graphs of the clusters are also drawn.
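The vertical partitioning can be sketched as a column selection that keeps the class attribute in every partition. The attribute pairs follow the headers of Tables 4 to 7; the function name `vertical_partition`, the dictionary record layout and the single sample record are hypothetical and serve only to show the shape of the result.

```python
def vertical_partition(records, attribute_groups, class_attr="Class"):
    """Split records column-wise; every partition keeps the class attribute."""
    partitions = []
    for group in attribute_groups:
        columns = list(group) + [class_attr]
        partitions.append([{col: rec[col] for col in columns} for rec in records])
    return partitions


# Attribute pairs as they appear in the headers of Tables 4 to 7.
GROUPS = [("NTP", "DPF"), ("DBP", "AGE"), ("TSFT", "BMI"), ("PGC", "2HIS")]

# One made-up record, not taken from the dataset, used only for illustration.
sample = [{"NTP": 1, "PGC": 100, "DBP": 70, "TSFT": 20, "2HIS": 80,
           "BMI": 25.0, "DPF": 0.5, "AGE": 30, "Class": "-ive"}]

parts = vertical_partition(sample, GROUPS)
```

Each element of `parts` has the same number of records as the input, so every partition of the 790-record dataset would again contain 790 records.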

4. Results and Discussion

The results of the LIAgent are discussed in this section. The LIAgent produces two outputs for the given dataset: the clusters and the decision rules. In total, sixteen clusters are obtained, four per partition. Not all of the clusters are good for classification, so only the required and useful clusters are discussed further. A set of decision rules is generated by the LIAgent for each of the sixteen clusters. We present the rule sets of three different clusters. The number of decision rules varies from cluster to cluster, depending on the number of records in the cluster.

The Decision Rules of the 4th partition of the dataset ‘Diabetes’:

Rule: 1
if PGC = “165” then
    Class = “Cat2”
else
Rule: 2
if PGC = “153” then
    Class = “Cat2”
else
Rule: 3
if PGC = “157” then
    Class = “Cat2”
else
Rule: 4
if PGC = “139” then
    Class = “Cat2”
else
Rule: 5
if 2HIS = “545” then
    Class = “Cat2”
else
Rule: 6
if 2HIS = “744” then
    Class = “Cat2”
else
    Class = “Cat1”

There are only six decision rules for the 4th partition of the dataset, so it is easy for anyone to interpret the results of this cluster and take a decision.
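Because the six rules form a single else-if chain, they can be evaluated as an ordered list with a default class. The Python sketch below transcribes the rules of the 4th partition; the names `RULES`, `DEFAULT_CLASS` and `classify` are assumptions for the example, and the insulin attribute is keyed as `2HIS`, the abbreviation used in Table 7.

```python
# Ordered (attribute, value, class) triples transcribed from the six rules.
RULES = [
    ("PGC", "165", "Cat2"),
    ("PGC", "153", "Cat2"),
    ("PGC", "157", "Cat2"),
    ("PGC", "139", "Cat2"),
    ("2HIS", "545", "Cat2"),
    ("2HIS", "744", "Cat2"),
]
DEFAULT_CLASS = "Cat1"  # the final "else" branch of the chain


def classify(record, rules=RULES, default=DEFAULT_CLASS):
    """Return the class of the first rule whose attribute value matches."""
    for attribute, value, label in rules:
        if record.get(attribute) == value:
            return label
    return default
```

A record that matches none of the listed values falls through to the default class, exactly as the last `else` does in the rule listing.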

The Decision Rules of the 1st partition of the dataset ‘Diabetes’:

Rule: 1
if DPF = “1.32” then
    Class = “Cat1”
else
Rule: 2
if DPF = “2.29” then
    Class = “Cat1”
else
Rule: 3
if NTP = “2” then
    Class = “Cat2”
else
Rule: 4
if DPF = “2.42” then
    Class = “Cat1”
else
Rule: 5
if DPF = “2.14” then
    Class = “Cat1”
else
Rule: 6
if DPF = “1.39” then
    Class = “Cat1”
else
Rule: 7
if DPF = “1.29” then
    Class = “Cat1”
else
Rule: 8
if DPF = “1.26” then
    Class = “Cat1”
else
    Class = “Cat2”

There are eight decision rules for the 1st partition of the dataset. The decision rules make the cluster easy to interpret and also help in taking a decision.

The Decision Rules of the 3rd partition of the dataset ‘Diabetes’:

Rule: 1
if BMI = “29.9” then
    Class = “Cat1”
else
Rule: 2
if BMI = “32.9” then
    Class = “Cat1”
else
Rule: 3
if TSFT = “23” then
    Rule: 4
    if BMI = “25.5” then
        Class = “Cat1”
    else
    Rule: 5
    if BMI = “30.1” then
        Class = “Cat1”
    else
    Rule: 6
    if BMI = “28.4” then
        Class = “Cat1”
    else
        Class = “Cat2”
else
Rule: 7
if BMI = “22.9” then
    Class = “Cat1”
else
Rule: 8
if BMI = “27.6” then
    Class = “Cat1”
else
Rule: 9
if BMI = “29.7” then
    Class = “Cat1”
else
Rule: 10
if BMI = “27.1” then
    Class = “Cat1”
else
Rule: 11
if BMI = “25.8” then
    Class = “Cat1”
else
Rule: 12
if BMI = “28.9” then
    Class = “Cat1”
else
Rule: 13
if BMI = “23.4” then
    Class = “Cat1”
else
Rule: 14
if BMI = “30.5” then
    Rule: 15
    if TSFT = “18” then
        Class = “Cat2”
    else
        Class = “Cat1”
else
Rule: 16
if BMI = “26.6” then
    Rule: 17
    if TSFT = “18” then
        Class = “Cat2”
    else
        Class = “Cat1”
else
Rule: 18
if BMI = “32” then
    Rule: 19
    if TSFT = “15” then
        Class = “Cat2”
    else
        Class = “Cat1”
else
Rule: 20
if BMI = “31.6” then
    Class = “Cat2” , “Cat1”
else
    Class = “Cat2”

There are twenty decision rules for the 3rd partition of the dataset. The number of rules for this cluster is higher than for the other two clusters discussed.

Visualization is an important tool that provides a better understanding of the data and illustrates the relationships among its attributes. 2D scatter graphs are drawn for all the clusters; we present four of them, from four different clusters of different partitions.

Figure 3. 2D scatter graph of the ‘NTP’ and ‘DPF’ attributes of the ‘Diabetes’ dataset

The distance between the ‘NTP’ and ‘DPF’ attributes of the ‘Diabetes’ dataset varies at the beginning of the graph, but after some interval it becomes constant.

Figure 4. 2D scatter graph of the ‘DBP’ and ‘AGE’ attributes of the ‘Diabetes’ dataset

The distance between the ‘DBP’ and ‘AGE’ attributes of the dataset is variable and remains so throughout the graph.

Figure 5. 2D scatter graph of the ‘TSFT’ and ‘BMI’ attributes of the ‘Diabetes’ dataset

The graph shows an almost constant distance between the ‘TSFT’ and ‘BMI’ attributes of the dataset, which holds throughout the graph.

Figure 6. 2D scatter graph of the ‘PGC’ and ‘2HIS’ attributes of the ‘Diabetes’ dataset

The distance between the ‘PGC’ and ‘2HIS’ attributes of the dataset is variable, although in the middle of the graph it is roughly constant. The structure of this graph is similar to that of Figure 5.

5. Conclusion

It is not simple for all users to interpret the clusters produced by the K-means algorithm and to extract the required results from them unless some other data mining algorithm or tool is used. In this research paper we have tried to address this issue by integrating the K-means clustering algorithm with the decision tree (ID3) algorithm. ID3 was chosen because its output is decision rules in if-then-else form, which are easy to understand and help in taking decisions. The result is a hybrid combination of supervised and unsupervised machine learning using an intelligent agent, called the LIAgent. The LIAgent is helpful in the classification and prediction of the given dataset. Furthermore, 2D scatter graphs of the clusters are drawn for visualization.
