Data warehouse and data mining
Data mining and data warehousing are among the most important issues in the corporate world today. The biggest challenge in a world that is full of information is searching through it to find connections and data that were not previously known. Dramatic advances in data development have made data mining and data warehousing important for improving business operations in organizations. Their importance is seen in the process of accumulating and integrating vast and growing amounts of data held in various formats and various databases. This paper discusses the concepts of data warehousing and data mining, the tools and techniques of data mining, and the benefits of data mining and data warehousing to organizations.
Keywords: Data, Data Warehouse, Data Mining, Data Mart
Organizations tend to grow and prosper as they gain a better understanding of their environment. Typically, business managers must be able to track daily transactions to evaluate how the business is performing. By tapping into the operational database, management can develop strategies to meet organizational goals. Identifying the trends and patterns in the data is what makes that possible. The way operational data are handled is important because the reason for generating, storing, and managing data is to create information that becomes the basis for rational decision making. To facilitate the decision-making process, decision support systems (DSSs) were developed. A DSS is an arrangement of computerized tools used to assist managerial decision making within a business. Decision support is a methodology designed to extract information from data and to use such information as a basis for decision making. However, information requirements have become so complex that it is difficult for a DSS to extract all necessary information from the data structures typically found in an operational database. Therefore, data warehousing and data mining were developed and have become a proactive methodology for supporting managerial decision making in organizations.
Concept of Data Warehouse
A data warehouse is a firm’s repository for storing and periodically updating the historical business data of the organization; the data are transformed into a multidimensional data model for efficient querying and analysis. The stored data are extracted from multiple operational systems in the organization and contain information about relevant activity that occurred in the past, in order to support organizational decision making. A data mart, on the other hand, is a subset of a data warehouse. It holds specialized information that has been grouped to help the business make better decisions. The data used here are usually derived from the data warehouse. The first organized use of such large databases started with OLAP (Online Analytical Processing), which focuses on the analytical processing of the organization’s data. The differences between a data mart and a data warehouse are only the size and scope of the problem being solved.
According to William H. Inmon (2005), a data warehouse is a “subject-oriented, integrated, time-variant, and non-volatile collection of data in support of management’s decision-making process”. To understand that definition, its components are explained in more detail below:
Integrated: The data warehouse provides a unified view of all data elements with a common definition and representation for all business units.
Subject-oriented: Data are stored with a subject orientation that facilitates multiple views of the data and facilitates decision making. For example, sales may be recorded by product, by division, by manager, or by region.
Time-variant: Data are recorded with a historical perspective in mind. Therefore, a time dimension is added to facilitate data analysis and various time comparisons.
Non-volatile: Data cannot be changed. Data are added only periodically from historical systems. Once the data are properly stored, no changes are allowed. Therefore, the data environment is relatively static.
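The subject orientation and time dimension described above can be illustrated with a minimal Python sketch. The sales records, field names, and the helper total_by are hypothetical and exist only for illustration; a real warehouse would hold such facts in database tables rather than in a list.

```python
from collections import defaultdict

# Hypothetical sales facts: each row carries several dimensions
# (product, region, year) and one measure (amount).
SALES = [
    {"product": "TV", "region": "North", "year": 2009, "amount": 1200.0},
    {"product": "TV", "region": "South", "year": 2009, "amount": 800.0},
    {"product": "Radio", "region": "North", "year": 2010, "amount": 150.0},
    {"product": "TV", "region": "North", "year": 2010, "amount": 1300.0},
]


def total_by(rows, dimension):
    """Sum the 'amount' measure for each distinct value of one dimension."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[dimension]] += row["amount"]
    return dict(totals)


# The same facts viewed by subject (product, region) and by time (year).
by_product = total_by(SALES, "product")
by_region = total_by(SALES, "region")
by_year = total_by(SALES, "year")
```

Because the rows are never modified after they are recorded, every view is computed from the same stable history, which is the practical meaning of a non-volatile store.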
In summary, the data warehouse is usually a read-only database optimized for data analysis and query processing. Typically, data are extracted from various sources and are then transformed and integrated, in other words, passed through a data filter, before being loaded into the data warehouse. Users access the data warehouse via front-end tools and end-user application software to extract the data in usable form.
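The extract, transform, and load sequence summarized above might look roughly like the following sketch, which uses Python's built-in sqlite3 module with an in-memory database. The two source extracts, the table name sales_fact, and the cleaning rules in transform are assumptions made for illustration, not a prescribed design.

```python
import sqlite3

# Hypothetical extracts from two operational systems with inconsistent formats.
STORE_ORDERS = [("2010-01-05", "tv", "1200.50"), ("2010-01-06", "Radio ", "150")]
WEB_ORDERS = [{"date": "2010-01-06", "item": "TV", "total": 999.0}]


def transform(date, product, amount):
    """Apply one common representation: trimmed upper-case product, float amount."""
    return (date, product.strip().upper(), float(amount))


def load_warehouse(conn):
    """Integrate both sources into a single fact table."""
    conn.execute(
        "CREATE TABLE sales_fact (sale_date TEXT, product TEXT, amount REAL)"
    )
    rows = [transform(*order) for order in STORE_ORDERS]
    rows += [transform(o["date"], o["item"], o["total"]) for o in WEB_ORDERS]
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_warehouse(conn)

# A front-end tool would issue read-only analytical queries such as this one.
totals = conn.execute(
    "SELECT product, SUM(amount) FROM sales_fact GROUP BY product ORDER BY product"
).fetchall()
```

The transform step plays the role of the data filter: "tv" and "TV" from different systems end up under one product name, so the query can total them together.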
Issues That Arise in Data Warehousing
Although the centralized and integrated data warehouse can be a very attractive proposition that yields many benefits, managers may be reluctant to embrace this strategy. Creating a data warehouse requires time, money, and considerable managerial effort. Therefore, it is not surprising that many companies begin their foray into warehousing by focusing on more manageable data sets that are targeted to meet the special needs of small groups within the organization. These smaller data warehouses are called data marts. A data mart is a small, single-subject data warehouse subset that provides decision support to a small group of people. Some organizations choose to implement data marts not only because of the lower cost and shorter implementation time, but also because of the current technological advances and inevitable “people issues” that make data marts attractive. Powerful computers can provide a customized DSS to small groups in ways that might not be possible with a centralized system. Also, a company’s culture may predispose its employees to resist major changes, but they might quickly embrace relatively minor changes that lead to demonstrably improved decision support. In addition, people at different organizational levels are likely to require data with different summarization, aggregation, and presentation formats. Data marts can serve as a test vehicle for companies exploring the potential benefits of data warehouses. By migrating gradually from data marts to data warehouses, a specific department’s decision support needs can be addressed within a reasonable time frame (six months to one year), as compared to the longer time frame usually required to implement a data warehouse (one to three years). Information Technology (IT) departments also benefit from this approach because their personnel have the opportunity to learn the issues and develop the skills required to create a data warehouse.
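One way to picture a data mart as a single-subject subset with its own level of summarization is a database view over the warehouse. The sketch below uses Python's sqlite3 module; the table warehouse_fact, the view sales_mart, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical warehouse table mixing several subjects.
conn.execute(
    "CREATE TABLE warehouse_fact (subject TEXT, region TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_fact VALUES (?, ?, ?, ?)",
    [
        ("sales", "North", "2010-01", 500.0),
        ("sales", "North", "2010-01", 250.0),
        ("sales", "South", "2010-02", 300.0),
        ("payroll", "North", "2010-01", 900.0),
    ],
)

# The mart keeps one subject only and pre-aggregates to the level its users need.
conn.execute(
    """
    CREATE VIEW sales_mart AS
    SELECT region, month, SUM(amount) AS total_amount
    FROM warehouse_fact
    WHERE subject = 'sales'
    GROUP BY region, month
    """
)

mart_rows = conn.execute(
    "SELECT region, month, total_amount FROM sales_mart ORDER BY region, month"
).fetchall()
```

A department that starts with only the mart could later point the same queries at a full warehouse, which mirrors the gradual migration path described above.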
Concept of Data Mining
Data mining refers to the forecasting techniques and analytical tools that are used extensively in industry to make decision making more effective. Data mining tools analyze the data, uncover problems or opportunities hidden in the data relationships, form computer models based on their findings, and then use the models to predict business behavior, requiring minimal end-user intervention. Data mining works by searching for valuable information in the huge amount of data collected over time and identifying the patterns or relationships present in the data. In business, organizations use data mining to predict customer behaviour. The process starts by analyzing the data from different perspectives and summarizing it into useful information, from which knowledge is then created to address any number of business problems. For example, banks and credit card companies use knowledge-based analysis to detect fraud, thereby decreasing fraudulent transactions. In fact, data mining has proved to be very helpful in finding practical relationships among data that help define customer buying patterns, improve product development and acceptance, reduce healthcare fraud, analyze stock markets, and so on.
Data Mining in Historical Perspective
Over the last 25 years or so, there has been a gradual evolution from data processing to data mining. In the 1960s, businesses routinely collected data and processed it using database management techniques that allowed an orderly listing and tabulation of the data as well as some query activity. OLTP (Online Transaction Processing) became routine, data retrieval from stored data became faster and more efficient because of the availability of new and better storage devices, and data processing became quicker and more efficient because of advances in computer technology. Database management advanced rapidly to include highly sophisticated query systems, and became popular not only in business applications but also in scientific inquiries.
Approaches of Data Mining in Various Industries
With data mining, a retail store may find that certain products sell more in one channel of distribution than in others, that certain products sell more in one geographical location than in others, and that certain products sell when a certain event occurs. A financial analyst would like to know the characteristics of a successful prospective employee; credit card departments would like to know which potential customers are more likely to pay back their debt and, when a credit card is swiped, which transactions are fraudulent and which are legitimate; direct marketers would like to know which customers purchase which types of products; booksellers like Amazon would like to know which customers purchase which types of books (fiction, detective stories, or any other kind), and so on. With this type of information available, decision makers will make better choices. Human resources staff will hire the right individuals. Credit departments will target those prospective customers who are less prone to become delinquent or less likely to be involved in fraudulent activities. Direct marketers will target those customers who are likely to purchase their products. With the insight gained from data mining, businesses may wish to reconfigure their product offering and emphasize specific features of a product. These are not the only uses of data mining. Police use it to determine when and where a crime is likely to occur, and what the nature of that crime would be. Organized stock exchanges detect fraudulent activities with data mining. Pharmaceutical companies mine data to predict the efficacy of compounds as well as to uncover new chemical entities that may be useful for a particular disease. The airline industry uses it to predict which flights are likely to be delayed (well before the flight is scheduled to depart). Weather analysts determine weather patterns with data mining to predict when there will be rain, sunshine, a hurricane, or snow.
Besides that, nonprofit organizations use data mining to predict the likelihood of individuals making a donation for a certain cause. The uses of data mining are far reaching and its benefits may be quite significant.
Data Mining Tools and Techniques
Data mining is a set of tools that learn from the data obtained and then use the resulting information for business forecasting. Data mining tools use and analyze the data that exist in databases, data marts, and data warehouses. Data mining tools can be grouped into four categories: prediction tools, classification tools, clustering analysis tools, and association rules discovery tools. Each category is elaborated below:
Prediction Tools
A prediction tool is a method derived from traditional statistical forecasting for predicting the value of a variable.
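As one simple instance of statistical forecasting, the sketch below fits a straight line to past values by ordinary least squares and extrapolates one period ahead. The monthly sales figures and the function names fit_line and predict are hypothetical; actual prediction tools may rely on far richer models.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Evaluate the fitted line at a new x value."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical monthly sales (month number, units sold).
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 14.0, 16.0]

model = fit_line(months, sales)
forecast = predict(model, 5)  # forecast for the next month
```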
Classification Tools
Classification tools attempt to distinguish between classes of objects or actions. For example, an advertiser may want to know which aspect of its promotion is most appealing to consumers. Is it the price, quality, or reliability of a product? Or maybe it is a special feature that is missing from competing products. These tools help provide such information for all products, making it possible to use the advertising budget in the most effective manner.
Clustering Analysis Tools
These are very powerful tools for clustering products into groups that naturally fall together; the groups are identified by the program. Most of the clusters discovered may not be useful for business decisions. However, one or two may be extremely important, and those are the ones the company can take advantage of. The most common use is market segmentation, in which a company divides its customer base into segments according to characteristics such as income, wealth, and so on. Each segment is then treated with a different marketing approach.
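Market segmentation of this kind can be sketched with k-means, a common clustering algorithm (the text does not name a specific one, so its use here is an assumption). The example clusters customers on a single characteristic, income in thousands; the data, the starting centroids, and the function name kmeans_1d are hypothetical.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Simple k-means on one numeric attribute with fixed starting centroids."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for value in values:
            nearest = min(
                range(len(centroids)), key=lambda i: abs(value - centroids[i])
            )
            clusters[nearest].append(value)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(cluster) / len(cluster) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical customer incomes, in thousands.
incomes = [18, 20, 22, 60, 65, 70]
centroids, segments = kmeans_1d(incomes, centroids=[10, 80])
```

The program, not the analyst, decides where the boundary between the two segments falls; the analyst only chooses how many segments to look for.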
Association Rules Discovery
These tools discover associations, such as what kinds of books certain groups of people read, what products certain groups of people purchase, and so on. Businesses use such information in targeting their markets. For instance, a recommendation service recommends movies based on movies people have watched and rated in the past.
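Association rules are usually scored by support (how often the items occur together) and confidence (how often the consequent appears when the antecedent does). The following sketch computes both measures for a hypothetical set of shopping baskets; the data and function names are illustrative only.

```python
# Hypothetical purchase baskets, one set of items per transaction.
BASKETS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    return sum(1 for basket in baskets if itemset <= basket) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimated probability of the consequent given the antecedent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


bread_support = support({"bread"}, BASKETS)
bread_milk_support = support({"bread", "milk"}, BASKETS)
bread_to_milk = confidence({"bread"}, {"milk"}, BASKETS)
```

In this toy data, the rule "bread implies milk" holds in two of the three baskets that contain bread.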
There are four general phases in data mining: data preparation, data analysis and classification, knowledge acquisition, and prognosis.
In the data preparation phase, the main data sets to be used by the data mining operation are identified and cleaned of any data impurities. Because the data in the data warehouse are already integrated and filtered, the data warehouse usually is the target set for data mining operations.
The data analysis and classification phase studies the data to identify common data characteristics or patterns. During this phase, the data mining tool applies specific algorithms to find:
- Data groupings, classifications, clusters, or sequences.
- Data dependencies, links, or relationships.
- Data patterns, trends, and deviations.
The knowledge-acquisition phase uses the results of the data analysis and classification phase. During the knowledge-acquisition phase, the data mining tool (with possible intervention by the end user) selects the appropriate modeling or knowledge-acquisition algorithms. The most common algorithms used in data mining are based on neural networks, decision trees, rules induction, genetic algorithms, classification and regression trees, memory-based reasoning, nearest neighbor, and data visualization. A data mining tool may use many of these algorithms in any combination to generate a computer model that reflects the behavior of the target data set.
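Of the algorithms listed, nearest neighbor is among the easiest to illustrate. The sketch below classifies a new customer by majority vote among the k closest known customers. The training records (age, income in thousands), the labels, and the function name knn_predict are hypothetical; math.dist requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_predict(training, point, k=3):
    """Label a point by majority vote among its k nearest training examples."""
    nearest = sorted(training, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled customers: (age, income in thousands) -> outcome.
TRAINING = [
    ((25, 20), "cancel"),
    ((30, 22), "cancel"),
    ((28, 25), "cancel"),
    ((50, 80), "stay"),
    ((55, 90), "stay"),
    ((60, 85), "stay"),
]

young_low_income = knn_predict(TRAINING, (27, 21))
older_high_income = knn_predict(TRAINING, (58, 88))
```

The stored examples themselves act as the model, which is why this family of methods is also described as memory-based reasoning.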
Although many data mining tools stop at the knowledge-acquisition phase, others continue to the prognosis phase. In that phase, the data mining findings are used to predict future behavior and forecast business outcomes. Examples of data mining findings can be:
- 65% of customers who did not use a particular credit card in the last six months are 88% likely to cancel that account.
- 82% of customers who bought a 27-inch or larger TV are 90% likely to buy an entertainment center within the next four weeks.
- If age < 30 and income <= 25,000 and credit rating < 3 and credit amount > 25,000, then the minimum loan term is ten years.
The complete set of findings can be represented in a decision tree, a neural net, a forecasting model, or a visual presentation interface that is used to project future events or results. For example, the prognosis phase might project the likely outcome of a new product rollout or a new marketing promotion.
The Benefits and Weaknesses of Data Warehouses to Organizations
The data warehouse is one of the powerful techniques applied in organizations to assist managerial decision making within a business. It has become a crucial asset in the modern business enterprise. It is designed to extract information from data and to use such information as a basis for decision making. An organization benefits from a data warehouse because it is a central repository that stores historical information: even though the data come from different locations and various points in time, all the relevant data are assembled in one location and organized in an efficient manner. Indirectly, this benefits the company financially because it greatly reduces computing costs. Another advantage of a data warehouse is that it makes large volumes of information accessible for solving the problems that arise in a business organization. The data from multiple sources held in the central repository can be analyzed so that decision makers can arrive at a choice of solutions.
However, there are also weaknesses that need to be considered. Building a data warehouse takes a long time because, before the data can be stored in the warehouse, they need to be extracted, cleaned, and loaded. Maintaining the data is another problem because it is not easy to handle. Compatibility may be an issue when implementing a data warehouse because a new transaction system may not work with the systems already in use. Besides that, users must be trained to use the system, because working without proper training may cause problems. Furthermore, if the data warehouse can be accessed via the internet, security may become an issue. The biggest problem with a data warehouse is cost, which must be taken into consideration, especially for maintenance. Any organization that is considering using a data warehouse must decide whether the benefits outweigh the costs.
Successfully supporting managerial decision making depends significantly on the availability of integrated, high-quality information that is organized and presented in a timely and easily understood way. Data mining and data warehousing have emerged to meet this need. They will be a crucial element in organizations, helping managers run operations smoothly and accomplish business goals, because both techniques are the foundation of decision support systems. Today data mining and data warehousing are important tools, and more companies will begin using them in the future.