Comparison On Classification Techniques Using Weka Computer Science Essay

Computers have brought tremendous improvements in technology, especially in processing speed and in reduced data storage costs, which has led to the creation of huge volumes of data. Data by itself has no value unless it is turned into useful information. Over the past two decades, data mining has been developed to generate knowledge from databases. The field of bioinformatics has now created many databases, which are accumulating rapidly and are no longer restricted to numeric or character data. Database management systems allow the integration of various high-dimensional multimedia data under the same umbrella in different areas of bioinformatics.

WEKA includes several machine learning algorithms for data mining. It provides a general-purpose environment with tools for data pre-processing, regression, classification, association rules, clustering, feature selection and visualization. It also contains an extensive collection of data pre-processing methods and machine learning algorithms, complemented by graphical user interfaces for data exploration and for the experimental comparison of different machine learning techniques on the same problem. The main features of WEKA are 49 data pre-processing tools, 76 classification/regression algorithms, 8 clustering algorithms, 3 algorithms for finding association rules, and 15 attribute/subset evaluators plus 10 search algorithms for feature selection. The main objectives of WEKA are to extract useful information from data and to help identify a suitable algorithm for generating an accurate predictive model from it.

This paper presents short notes on data mining and the basic principles of data mining techniques, a comparison of classification techniques using WEKA, a discussion of data mining in bioinformatics, and an overview of WEKA.

Introduction

Computers have brought tremendous improvements in technology, especially in processing speed and in data storage costs, which has led to the creation of huge volumes of data. Data by itself has no value unless it can be turned into useful information. Over the past two decades, data mining has been developed to generate knowledge from databases. Data mining is the process of finding patterns, associations or correlations among data and presenting them as useful information or knowledge[1]. The advancement of healthcare database management systems has created a huge number of databases. Creating knowledge discovery methodologies and managing large amounts of heterogeneous data has become a major research priority. Data mining remains a promising and rich field of scientific study. It is concerned with making sense of large amounts of mostly unsupervised data in some domain[2].

Data mining techniques

Data mining techniques can be either unsupervised or supervised.

An unsupervised learning technique is not guided by a target variable or class label and does not create a model or hypothesis before the analysis. Instead, a model is built based on the results. A common unsupervised technique is clustering.

In supervised learning, a model is specified prior to the analysis, and the algorithm is then applied to the data to estimate the parameters of the model. The biomedical literature focuses on applications of supervised learning techniques. Common supervised techniques used in medical and clinical research are classification, statistical regression and association rules. These learning techniques are briefly described below.

Clustering

Clustering is a dynamic field of research in data mining. It is an unsupervised learning technique: the process of partitioning a set of data objects into a set of meaningful subclasses called clusters, thereby revealing natural groupings in the data. A cluster is a group of data objects that are similar to each other and dissimilar to the objects in other clusters. Clustering algorithms can be categorized into partitioning, hierarchical, density-based, and model-based methods. Because there are no predefined classes, clustering is also called unsupervised classification.
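As a rough illustration of a partitioning method, the following Python sketch runs a simple k-means loop on one-dimensional points. The data values, the starting centroids and the function name kmeans_1d are assumptions made for this example only; it is not WEKA's implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Simple k-means on 1-D points, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

# Hypothetical toy data with two obvious natural groupings.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids, clusters = kmeans_1d(points, centroids=[0.0, 5.0])
print(centroids, clusters)
```

No class labels are supplied anywhere: the two groups emerge only from the distances between points, which is what makes the technique unsupervised.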

Association Rule

Association rule mining finds relationships among the items in a database.

Let I be the set of all items. An itemset is a set of items, and a transaction t contains an itemset X drawn from I if X ⊆ t.

E.g., X = {milk, bread, cereal} is an itemset.

An association rule is an implication of the form:

X → Y, where X, Y ⊂ I, and X ∩ Y = ∅

Association rules do not represent any sort of causality or correlation between the two itemsets.

X → Y does not mean that X causes Y, so there is no causality.

X → Y can be different from Y → X, unlike correlation.
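The asymmetry between a rule from X to Y and the rule from Y to X can be seen with a small Python sketch. It uses support and confidence, two standard rule measures that the text above does not introduce, and a made-up list of transactions; all names and values here are assumptions for illustration.

```python
# Hypothetical market-basket transactions (each one is a set of items).
transactions = [
    {"milk", "bread", "cereal"},
    {"milk", "bread"},
    {"milk", "cereal"},
    {"bread"},
    {"milk"},
]

def count(itemset, transactions):
    """Number of transactions t that contain the itemset (itemset is a subset of t)."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(x, y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return count(x | y, transactions) / count(x, transactions)

x, y = {"bread"}, {"milk"}
conf_xy = confidence(x, y, transactions)  # rule: bread -> milk
conf_yx = confidence(y, x, transactions)  # rule: milk -> bread
print(support(x | y, transactions), conf_xy, conf_yx)
```

Both rules share the same support, because they involve the same combined itemset, yet their confidences differ, so the direction of a rule matters.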

Association rules assist in marketing, targeted advertising, floor planning, inventory control, churn management, homeland security, etc.

Classification

Classification is a supervised learning method. Its goal is to accurately predict the target class for each case in the data by developing an accurate description of each class. As a data mining function, classification consists of assigning class labels to a set of unclassified cases.

Classification is a two-step process, as shown in Figure 4.

Data mining classification mechanisms include decision trees, K-Nearest Neighbor (KNN), Bayesian networks, neural networks, fuzzy logic, support vector machines, etc. The methods compared in this paper are described below:

Decision tree: Decision trees are powerful classification algorithms. Popular decision tree algorithms include Quinlan’s ID3, C4.5 and C5, and Breiman et al.’s CART. As the name implies, this technique recursively separates observations into branches to construct a tree for the purpose of improving prediction accuracy. Decision trees are widely used because they are easy to interpret, although they are restricted to functions that can be represented by “if-then-else” rules.

Most decision tree classifiers perform classification in two phases: tree growing (or building) and tree pruning. Tree building is done in a top-down manner. During this phase the data is recursively partitioned until all the data items in a partition belong to the same class label. In the tree pruning phase, the fully grown tree is cut back in a bottom-up fashion to prevent over-fitting, which improves the prediction and classification accuracy of the algorithm. Compared to other data mining techniques, decision trees are widely applied in various areas because they are robust to data scales and distributions.
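One step of the tree-growing phase can be sketched in Python as choosing the attribute to split on. The sketch below uses information gain, the entropy-based criterion associated with ID3; the four rows and their attribute values are invented for illustration and are not taken from any dataset used in this paper.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy reduction obtained by partitioning rows on one attribute."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r[target] for r in rows if r[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical toy rows; the values are made up for this example.
rows = [
    {"pension": "none", "vacation": "average", "class": "bad"},
    {"pension": "none", "vacation": "generous", "class": "bad"},
    {"pension": "full", "vacation": "average", "class": "good"},
    {"pension": "full", "vacation": "generous", "class": "good"},
]
gains = {a: information_gain(rows, a, "class") for a in ("pension", "vacation")}
best = max(gains, key=gains.get)  # attribute chosen for the split
print(gains, best)
```

In this toy data, splitting on the chosen attribute leaves every partition with a single class label, which is the stopping condition described above; a real tree builder would repeat the same choice recursively on each partition and prune afterwards.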

Nearest-neighbor:

K-Nearest Neighbor is one of the best-known distance-based algorithms. In the literature it appears in different versions, such as closest point, single link, complete link and K-Most Similar Neighbor. Nearest-neighbor algorithms are considered statistical learning algorithms; they are extremely simple to implement and are open to a wide variety of variations. Nearest-neighbor is a data mining technique that performs prediction by finding the prediction values of the records (near neighbors) most similar to the record to be predicted. The K-Nearest Neighbors algorithm is easy to understand: first the list of nearest neighbors is obtained, and then the test object is classified according to the majority class in that list. KNN has a wide variety of applications in fields such as pattern recognition, image databases, Internet marketing and cluster analysis.
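The two steps just described (build the neighbor list, then take a majority vote) can be sketched in a few lines of Python. Euclidean distance, the value k = 3 and the training points are assumptions chosen for this example.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (feature_tuple, label) pairs.
    """
    # Step 1: obtain the nearest-neighbor list.
    neighbors = sorted(train, key=lambda item: dist(item[0], query))[:k]
    # Step 2: return the majority class in that list.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical two-dimensional training data.
train = [
    ((1.0, 1.0), "bad"),
    ((1.5, 2.0), "bad"),
    ((2.0, 1.5), "bad"),
    ((6.0, 6.0), "good"),
    ((6.5, 7.0), "good"),
    ((7.0, 6.5), "good"),
]
prediction = knn_predict(train, (6.2, 6.4), k=3)
print(prediction)
```

There is no training phase beyond storing the data, which is why the method is simple to implement; the cost is paid at prediction time, when distances to all stored records are computed.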

Probabilistic (Bayesian Network) models:

Bayesian networks are a powerful probabilistic representation, and their use for classification has received considerable attention. Bayesian algorithms predict the class based on the probability of belonging to that class. A Bayesian network (BN) is a graphical model that consists of two components. The first component is a directed acyclic graph (DAG) in which the nodes represent random variables and the edges between the nodes represent the probabilistic dependencies among the corresponding random variables. The second component is a set of parameters that describe the conditional probability of each variable given its parents. The conditional dependencies in the graph are estimated by statistical and computational methods. Thus, BNs combine properties of computer science and statistics.

Probabilistic models predict multiple hypotheses, weighted by their probabilities[3].
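A minimal Python sketch of this idea follows, assuming the simplest possible network structure: a class node that is the only parent of two attribute nodes. The prior and the conditional probability tables are invented numbers, not estimates from any dataset in this paper.

```python
# Assumed DAG: class -> pension, class -> vacation.
# Component 2 of the network: the parameters (all values are hypothetical).
prior = {"good": 0.6, "bad": 0.4}
cpt = {
    "pension": {
        "good": {"full": 0.7, "none": 0.3},
        "bad": {"full": 0.2, "none": 0.8},
    },
    "vacation": {
        "good": {"generous": 0.5, "average": 0.5},
        "bad": {"generous": 0.1, "average": 0.9},
    },
}

def posterior(evidence):
    """P(class | evidence): multiply each node's probability given its parent,
    then normalize over the classes."""
    joint = {}
    for c, p in prior.items():
        for attr, value in evidence.items():
            p *= cpt[attr][c][value]
        joint[c] = p
    total = sum(joint.values())
    return {c: p / total for c, p in joint.items()}

post = posterior({"pension": "full", "vacation": "generous"})
predicted = max(post, key=post.get)
print(post, predicted)
```

The result is a probability for every class rather than a single answer, which matches the description of predicting multiple hypotheses weighted by their probabilities; the predicted class is simply the most probable one.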

Table 1 gives a theoretical comparison of the classification techniques.

Data mining is used in surveillance, artificial intelligence, marketing, fraud detection and scientific discovery, and is now gaining ground in other fields as well.

Experimental Work

The experimental comparison of the classification techniques was carried out in WEKA. We used the labor dataset for all three techniques, so that their parameters can easily be compared on a single dataset. The labor dataset has 17 attributes (duration, wage-increase-first-year, wage-increase-second-year, wage-increase-third-year, cost-of-living-adjustment, working-hours, pension, standby-pay, shift-differential, education-allowance, statutory-holiday, vacation, longterm-disability-assistance, contribution-to-dental-plan, bereavement-assistance, contribution-to-health-plan, class) and 57 instances.

Figure 5: WEKA 3.6.9 – Explorer window

Figure 5 shows the Explorer window of the WEKA tool with the labor dataset loaded; the data can also be analyzed graphically, as in the visualization section of the window, where the classes are color-coded in blue and red. In WEKA, data is represented as instances described by features (attributes). For easier analysis and evaluation, the simulation results are partitioned into several sub-items. First, the correctly and incorrectly classified instances are reported as both counts and percentages; subsequently, the Kappa statistic, mean absolute error and root mean squared error are reported as numeric values only.
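To make two of these measures concrete, the Python sketch below computes the fraction of correctly classified instances and the Kappa statistic from a confusion matrix. The matrix is hypothetical: it merely sums to 57 instances, and its cell values are not results reported in this paper. The error measures are not covered here.

```python
def accuracy_and_kappa(matrix):
    """Return (accuracy, Cohen's kappa) for a square confusion matrix.

    matrix[i][j] counts instances of actual class i predicted as class j.
    """
    total = sum(sum(row) for row in matrix)
    # Observed agreement: the diagonal holds the correct predictions.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / total
    # Agreement expected by chance, from the row and column totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical two-class confusion matrix (illustrative values only).
matrix = [[15, 5],
          [1, 36]]
acc, kappa = accuracy_and_kappa(matrix)
print(f"accuracy = {acc:.4%}, kappa = {kappa:.4f}")
```

Kappa corrects accuracy for the agreement that would be expected by chance alone, so it is lower than the raw accuracy whenever one class dominates the data.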

Figure 6: Classifier Result

The dataset was evaluated using 10-fold cross-validation with each specified classifier, as shown in Figure 6. WEKA computes all the required measures on the given instances, including each classifier's accuracy and prediction rate. Based on Table 2, the highest accuracy is 89.4737% for the Bayesian classifier, followed by 82.4561% for KNN, and the lowest is 73.6842% for the decision tree. From this experimental comparison we can say that the Bayesian classifier is the best of the three, as it is more accurate and less time consuming.

Table 2: Simulation Result of each Algorithm
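The reported percentages can be related back to instance counts with a little arithmetic, sketched here in Python. The fold sizes assume a plain, as-even-as-possible split of the 57 instances; how WEKA actually assigns instances to folds is not modeled, and the helper name fold_sizes is an assumption of this sketch.

```python
def fold_sizes(n_instances, n_folds):
    """Sizes of the test folds when instances are split as evenly as possible."""
    base, extra = divmod(n_instances, n_folds)
    return [base + 1 if i < extra else base for i in range(n_folds)]

n = 57  # number of instances in the labor dataset
sizes = fold_sizes(n, 10)

# Accuracies quoted in the text, in percent.
reported = {"Bayesian": 89.4737, "KNN": 82.4561, "Decision tree": 73.6842}

# Each percentage corresponds to a whole number of correctly classified instances.
correct = {name: round(pct / 100 * n) for name, pct in reported.items()}
print(sizes, correct)
```

With only 57 instances, each test fold holds five or six instances, and a single additional misclassification changes the accuracy by about 1.75 percentage points, so differences between classifiers on a dataset of this size should be read with some caution.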

DATA MINING IN BIOINFORMATICS

Bioinformatics and data mining provide challenging and exciting research problems for computation. Bioinformatics is conceptualizing biology in terms of molecules and then applying “informatics” techniques to understand and organize the information associated with these molecules on a large scale. In short, it is a management information system (MIS) for molecular biology information. It is the science of managing, mining, and interpreting information from biological sequences and structures. Advances such as genome-sequencing initiatives, microarrays, proteomics, and functional and structural genomics have pushed the frontiers of human knowledge. Data mining and machine learning have been advancing with high-impact applications from marketing to science. Although researchers have spent much effort on data mining for bioinformatics, the two areas have largely been developing separately. In classification or regression, the task is to predict the outcome associated with a particular individual given a feature vector describing that individual; in clustering, individuals are grouped together because they share certain properties; and in feature selection, the task is to select those features that are important in predicting the outcome for an individual.

We believe that data mining will provide the necessary tools for a better understanding of gene expression, drug design, and other emerging problems in genomics and proteomics. Novel data mining techniques can be proposed for tasks such as:

Gene expression analysis,

Searching and understanding of protein mass spectroscopy data,

3D structural and functional analysis and mining of DNA and protein sequences for structural and functional motifs, drug design, and understanding of the origins of life, and

Text mining for biological knowledge discovery.

In today’s world, large quantities of data are being accumulated, and seeking knowledge from massive data is one of the most fundamental aims of data mining. Data mining involves more than just collecting and managing data; it also includes analysis and prediction. Data can be large both in size and in dimension, and there is a huge gap between the stored data and the knowledge that could be construed from it. This is where classification and its sub-mechanisms come in, placing each data item in its appropriate class for ease of identification and searching. Classification can thus be regarded as an essential part of data mining, and it is gaining popularity.

WEKA data mining software

WEKA is data mining software developed by the University of Waikato in New Zealand. It includes several machine learning algorithms for data mining tasks. Since WEKA implements its algorithms in the Java language, they can either be called from your own Java code or be applied directly to a dataset. WEKA contains general-purpose tools for data pre-processing, regression, classification, association rules, clustering, feature selection and visualization.

In the bioinformatics arena, the Weka data mining suite has been used for probe selection for gene expression arrays[14], automated protein annotation[7][9], experiments with automatic cancer diagnosis[10], plant genotype discrimination[13], classifying gene expression profiles[11], developing a computational model for frame-shifting sites[8] and extracting rules from them[12]. Most of the algorithms in Weka are described in[15].

WEKA includes algorithms for learning different types of models (e.g. decision trees, rule sets, linear discriminants), feature selection schemes (fast filtering as well as wrapper approaches) and pre-processing methods (e.g. discretization, arbitrary mathematical transformations and combinations of attributes). Weka makes it easy to compare different solution strategies based on the same evaluation method and identify the one that is most appropriate for the problem at hand. It is implemented in Java and runs on almost any computing platform.

The Weka Explorer

Explorer is the main interface in Weka, shown in figure 1. The “Open file…” button loads data in various formats: ARFF, CSV, C4.5, and Library.

The WEKA Explorer has six tabs, each of which is used to perform a certain task. The tabs are shown in figure 2.

Preprocess: Preprocessing tools in WEKA are called “filters”. The Preprocess tab retrieves data from a file, SQL database or URL (for very large datasets, sub-sampling may be required, since all the data is stored in main memory). The data can then be preprocessed using one of Weka’s preprocessing tools. The Preprocess tab shows a histogram with statistics for the currently selected attribute, and histograms for all attributes can be viewed simultaneously in a separate window. Some filters behave differently depending on whether a class attribute has been set or not. The Filter box is used for setting up the required filter. WEKA contains filters for discretization, normalization, resampling, attribute selection and attribute combination, among others.

Classify: The Classify tools can be used to perform further analysis on preprocessed data. If the data poses a classification or regression problem, it can be processed in the Classify tab. This tab provides an interface to learning algorithms for classification and regression models (both are called “classifiers” in Weka) and to evaluation tools for analyzing the outcome of the learning process. The classification model is produced from the full training data. WEKA includes all major learning techniques for classification and regression: Bayesian classifiers, decision trees, rule sets, support vector machines, logistic and multi-layer perceptrons, linear regression, and nearest-neighbor methods. It also contains “metalearners” such as bagging, stacking and boosting, and schemes that perform automatic parameter tuning using cross-validation, cost-sensitive classification, etc. Learning algorithms can be evaluated using cross-validation or a hold-out set, and Weka provides standard numeric performance measures (e.g. accuracy, root mean squared error), as well as graphical means for visualizing classifier performance (e.g. ROC curves and precision-recall curves). It is possible to visualize the predictions of a classification or regression model, enabling the identification of outliers, and to load and save models that have been generated.

Cluster: WEKA contains “clusterers” for finding groups of instances in a dataset. The Cluster tools give access to Weka’s clustering algorithms, such as k-means, a heuristic incremental hierarchical clustering scheme, and mixtures of normal distributions with diagonal covariance matrices estimated using EM. Cluster assignments can be visualized and compared to actual clusters defined by one of the attributes in the data.

Associate: The Associate tools provide algorithms for generating association rules, which can be used to identify relationships between groups of attributes in the data.

Select attributes: More interesting in the context of bioinformatics is the fifth tab, which offers methods for identifying those subsets of attributes that are predictive of another (target) attribute in the data. Weka contains several methods for searching through the space of attribute subsets, as well as evaluation measures for attributes and attribute subsets. Search methods include best-first search, genetic algorithms, forward selection, and a simple ranking of attributes. Evaluation measures include correlation- and entropy-based criteria as well as the performance of a selected learning scheme (e.g. a decision tree learner) for a particular subset of attributes. Different search and evaluation methods can be combined, making the system very flexible.

Visualize: The Visualize tools show a matrix of scatter plots for all pairs of attributes in the data. In practice, visualization is very useful for identifying the difficulties of a learning problem. WEKA can visualize single attributes in one dimension (1D) and pairs of attributes in two dimensions (2D); this tab visualizes the current relation in 2D plots. Any matrix element can be selected and enlarged in a separate window, where one can zoom in on subsets of the data and retrieve information about individual data points. A “Jitter” option, which exposes obscured data points and so helps with nominal attributes, is also provided.
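Two of the operations mentioned for the tabs above, preprocessing filters and a simple ranking of attributes, can be sketched in plain Python. This is only an illustration of the underlying ideas, not WEKA's filter or attribute selection code; the function names, the toy values and the 0/1 encoding of the class are all assumptions.

```python
from math import sqrt

def normalize(values):
    """Min-max scale values to [0, 1]; assumes the values are not all equal."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, bins):
    """Equal-width binning: map each value to a bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    # The maximum value would land in bin number `bins`, so clamp it.
    return [min(int((v - lo) / width), bins - 1) for v in values]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical toy columns; the class is encoded as 0 (bad) or 1 (good).
working_hours = [20.0, 30.0, 35.0, 40.0]
wage_increase = [2.0, 2.5, 4.0, 4.5]
target = [0, 0, 1, 1]

scaled = normalize(working_hours)           # a normalization filter
binned = discretize(working_hours, bins=4)  # a discretization filter

# Rank attributes by the absolute correlation with the target.
attributes = {"working_hours": working_hours, "wage_increase": wage_increase}
ranking = sorted(
    attributes,
    key=lambda name: abs(pearson(attributes[name], target)),
    reverse=True,
)
print(scaled, binned, ranking)
```

Ranking scores each attribute on its own, which is cheap; the subset search methods described for the fifth tab evaluate groups of attributes together and can therefore account for redundancy between them.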

Interfaces to Weka

All the learning techniques in Weka can be accessed from the simple command-line interface (CLI), as part of shell scripts, or from within other Java programs using the Weka API. WEKA commands can be executed directly from the CLI.

Weka also contains an alternative graphical user interface, called “Knowledge Flow,” that can be used instead of the Explorer. Knowledge Flow is a drag-and-drop interface and supports incremental learning. It caters for a more process-oriented view of data mining, where individual learning components (represented by Java beans) can be connected graphically to create a “flow” of information.

Finally, there is a third graphical user interface, the “Experimenter”, which is designed for experiments that compare the performance of (multiple) learning schemes on (multiple) datasets. Experiments can be distributed across multiple computers running remote experiment servers, and statistical tests can be conducted between learning schemes.

Conclusion

Classification is one of the most popular techniques in data mining. In this paper we compared algorithms based on their accuracy, learning time and error rate. We observed that there is a direct relationship between the execution time for building the tree model and the volume of data records, and an indirect relationship between the execution time for building the model and the attribute size of the datasets. From our experiment we conclude that Bayesian algorithms have better classification accuracy than the other algorithms compared. To remain a lively research area, bioinformatics should broaden to include new techniques.
