Application Survey on Data Mining and Data Warehousing

 

Survey Report on Bank-Loan Risk Prediction

  1. Introduction

Data mining has been one of the most explored topics of the past decade and has given rise to several new techniques and enhancements across many industries. One thought-provoking area of high interest is credit risk analysis, or simply bank-loan risk prediction. Banks now have a pressing need to perform credit risk analysis to make sure that the money they lend to customers goes to legitimate applicants who are capable of repaying it, and to avoid fraudulent scenarios. Several data mining techniques have been explored for analyzing a customer's creditworthiness, and a few of them are analyzed and emphasized in the following sections.

  2. Discussion on Selected Papers

In this section, I list the journal and IEEE papers referenced for my study and analysis of bank-loan risk prediction, and categorize several attributes of each in Table 1.

Table 1. Sources on bank-loan risk prediction using different data mining techniques

Reference | Objective & Data Mining Techniques Employed | Authors | Citations
[1] | SAS Enterprise Miner 5.3, logistic regression and decision tree models employed in credit scoring for assessing credit risk | Bee Wah Yap, Seng Huat Ong, Nor Huselina Mohamed Husain | 77
[2] | Decision tree model for credit assessment in a rural bank | I Gusti Ngurah Narindra Mandala, Catharina Badra Nawangpalupi, Fransiscus Rian Praktikto | 15
[3] | Predictive modelling with the Naïve Bayes algorithm for loan risk assessment | Rob Gerritsen | 34
[4] | Multilayer feed-forward neural network, support vector machine, genetic programming, logistic regression, group method of data handling and probabilistic neural network techniques for financial fraud assessment | P. Ravisankar, V. Ravi, G. Raghava Rao, I. Bose | 147

  3. Expert Systems with Applications: Using Data Mining to improve assessment of credit worthiness via credit scoring models

Problem Description:

Bee Wah Yap et al. [1] describe a recreational club that had been facing difficulties in identifying defaulters who do not pay their monthly subscription fee, making it hard for the club to manage its funds effectively and to allocate money for further activities or events. The management decided to evaluate the creditworthiness of the club members by taking past members' data as a data set and analyzing it with three different data mining techniques in order to determine which one fits best [1].

Solution technology:

Bee Wah Yap et al. [1] employed a credit scorecard model, a logistic regression model and a decision tree model using SAS® Enterprise Miner, a versatile tool that supports several data mining techniques, in order to improve the identification of potential defaulters in the club.

Solution Evaluation:

For the credit scorecard model, Bee Wah Yap et al. [1] identified the factors that characterize a defaulter, such as age, number of dependents, number of cars and district of address, together with the classification of members as defaulters or non-defaulters based on their payment status. They then computed the Information Value of each attribute, essentially a weighted sum of the differences between the proportion of good cases (values from the historical data that are useful for prediction) and the proportion of bad cases (values that add no predictive value), and treated attributes with a value greater than 0.02 as admissible for inclusion in the scorecard.
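To make the Information Value step concrete, the sketch below shows the standard IV calculation for one candidate attribute (the difference between the good and bad distributions, weighted by the weight of evidence). The paper itself works in SAS Enterprise Miner, so this Python version and its column names (`age_band`, `defaulter`) are only illustrative assumptions.

```python
import numpy as np
import pandas as pd

def information_value(df: pd.DataFrame, attribute: str, target: str) -> float:
    """Standard IV: sum over bins of (%good - %bad) * ln(%good / %bad)."""
    grouped = df.groupby(attribute)[target].agg(total="count", bad="sum")
    grouped["good"] = grouped["total"] - grouped["bad"]
    # Distribution of good and bad cases across the attribute's bins
    # (a small constant avoids division by zero in empty bins).
    pct_good = (grouped["good"] + 0.5) / grouped["good"].sum()
    pct_bad = (grouped["bad"] + 0.5) / grouped["bad"].sum()
    woe = np.log(pct_good / pct_bad)              # weight of evidence per bin
    return float(((pct_good - pct_bad) * woe).sum())

# Hypothetical usage: keep attributes whose IV exceeds the 0.02 threshold.
# iv = information_value(members, "age_band", "defaulter")
# keep_on_scorecard = iv > 0.02
```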

They then found the stepwise selection method to be the most suitable variant of the logistic regression model and drew a wide range of conclusions about the characteristics of defaulters from it.
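As a rough illustration of this step (the paper performs stepwise selection inside SAS Enterprise Miner), the sketch below approximates it with scikit-learn's forward sequential feature selection followed by a logistic regression fit; the data frame, feature columns and `defaulter` target are hypothetical, and the predictors are assumed to be numeric.

```python
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def fit_forward_selected_logit(X, y, n_features=4):
    """Forward feature selection followed by a logistic regression fit."""
    selector = SequentialFeatureSelector(
        LogisticRegression(max_iter=1000),
        n_features_to_select=n_features,
        direction="forward",          # add one predictor at a time
    )
    model = make_pipeline(StandardScaler(), selector,
                          LogisticRegression(max_iter=1000))
    return model.fit(X, y)

# Hypothetical usage:
# model = fit_forward_selected_logit(members[feature_cols], members["defaulter"])
```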

Finally, they applied a decision tree algorithm, which splits the large data set into smaller segments described by if-then rules, and from it obtained a profile of the defaulters.
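A minimal sketch of this step is shown below, again assuming hypothetical member attributes rather than the paper's actual SAS Enterprise Miner setup; the exported if-then rules correspond to the defaulter profile the authors describe.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical member attributes; the paper's real data lives in SAS EM.
features = ["age", "dependents", "cars", "district_code"]

def profile_defaulters(df):
    """Fit a shallow tree and print its if-then rules as a defaulter profile."""
    tree = DecisionTreeClassifier(max_depth=3, min_samples_leaf=50)
    tree.fit(df[features], df["defaulter"])
    print(export_text(tree, feature_names=features))
    return tree
```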

Based on the results obtained from the above three techniques, they concluded that the decision tree is the better approach for prediction, even though the differences between the three are small, and that a credit scoring model cannot perform well without adequate and representative historical data.

Further Enhancements:

The study employed several techniques in order to justify a better prediction model as a substitute for the credit scoring model, but it overlooks the fact that the data sets used throughout come from past members only, which may not generalize well. Declaring the decision tree superior to credit scoring on this basis is therefore not entirely convincing, and the ranking may change when a large amount of current, real-world data is used to predict future defaulters.

  4. Assessing Credit Risk: an Application of Data Mining in a Rural Bank

Problem Description:

I Gusti Ngurah Narindra Mandala et al. [2] argue that for rural banks to stay healthy, benchmarks have to be set on many factors, among which the non-performing loan (NPL) rate plays an important role: the lower the NPL rate, the healthier the rural bank. To achieve this, they propose that banks should approve only the right applicants, thereby increasing profit and credibility and better serving the local communities where such banks are most used. They state that banks with an NPL rate below 5% are in better condition than those with a higher value.

Solution Technology:

I Gusti Ngurah Narindra Mandala et al. [2] chose the decision tree technique, applied it to a rural bank in Bali, and scrutinized the factors the bank currently considers when lending to a customer.

Solution Evaluation:

I Gusti Ngurah Narindra Mandala et al. [2] found that the current NPL rate of the rural bank in Bali is 11.99%, much higher than the value expected of a well-performing bank. They used 84% of a sample data set of 1,028 records for model building and identified approximately 13 parameters for evaluating NPL customers. They then developed a decision tree on these parameters, with collateral value promoted to the key determining factor, and obtained a projected NPL rate of 3%, well within the range of an efficiently performing bank.
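The sketch below illustrates this kind of evaluation under stated assumptions: an 84/16 train-test split as in the paper, a hypothetical `non_performing` target column (1 = non-performing), and hypothetical applicant features. It estimates the NPL rate that would result if the bank approved only the applicants the tree predicts will repay.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def estimate_npl_after_screening(df, feature_cols, target_col="non_performing"):
    """Train on ~84% of the records and estimate the NPL rate among the
    applicants that the tree would approve on the held-out 16%."""
    X_train, X_test, y_train, y_test = train_test_split(
        df[feature_cols], df[target_col], train_size=0.84, random_state=0)
    tree = DecisionTreeClassifier(max_depth=4).fit(X_train, y_train)
    approved = tree.predict(X_test) == 0          # 0 = predicted to repay
    # NPL rate = non-performing loans / all loans actually granted
    return y_test[approved].mean()
```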

Further Enhancements:

Although the above assessment and the resulting picture of a healthy bank are appealing, the authors could have placed more emphasis on the other factors that also contribute to a healthy bank and its NPL rate, and could have compared the decision tree against other predictive and descriptive modeling techniques that might analyze and solve the given scenario better than the result obtained.

  5. Assessing Loan Risks: A Data Mining Case Study

Problem Description:

Rob Gerritsen [3] observes that if customers who will not be able to repay their loans can be identified before lending, using data mining techniques, that information would be very valuable. The USDA's Rural Housing Service has been lending money to people in rural areas, and the USDA realized that among the large number of approved applicants some may not be capable of repaying the amount. The USDA therefore decided to apply data mining to its data in order to predict which customers are at risk of default [3].

Solution Technology:

Rob Gerritsen [3] used predictive modeling, specifically the Naïve Bayes algorithm, to address the above problem.

Solution Evaluation:

Rob Gerritsen [3] was given a sample of 12,000 records of existing single-family mortgages, on which the model had to be trained before predicting future cases. He first prepared the data set and applied binning ahead of the Naïve Bayes algorithm, dividing the customers into ranges according to the loan amounts each had to repay.

Initially, he found this ineffective: because the bins were of uniform width over a skewed, continuous variable, a huge number of people fell into a single bin, making it difficult to identify the actual defaulters precisely.

He then reorganized the bin boundaries so that the records were spread more evenly, and built a decision tree from the resulting output to identify the major factors behind defaults.
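The effect of the bin boundaries can be illustrated as follows. The paper does not name a specific tool, so this sketch uses pandas binning and scikit-learn's categorical Naïve Bayes with hypothetical `loan_amount` and `defaulted` columns: equal-width bins concentrate most mortgages in one bucket, while quantile bins spread them evenly before the Naïve Bayes fit.

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB

def bin_and_fit(df, amount_col="loan_amount", target_col="defaulted", bins=10):
    """Compare equal-width and quantile binning, then fit Naïve Bayes."""
    equal_width = pd.cut(df[amount_col], bins=bins, labels=False)
    quantile = pd.qcut(df[amount_col], q=bins, labels=False, duplicates="drop")
    print("largest equal-width bin:", equal_width.value_counts().max())
    print("largest quantile bin:  ", quantile.value_counts().max())
    # Fit on the evenly populated quantile bins (integer category codes).
    model = CategoricalNB().fit(quantile.to_frame(), df[target_col])
    return model
```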

Further Enhancements:

Rob Gerritsen [3] himself notes that the data set used was too small to draw firm conclusions, and that a much larger data set, along with additional factors, would have to be considered for the USDA to obtain a validated solution to its problem.

  6. Decision Support Systems: Detection of financial statement fraud and feature selection using data mining techniques

Problem Description:

P. Ravisankar et al. [4] conducted a study of 202 Chinese companies using a variety of data mining techniques, to determine whether the financial statements, income statements, cash flows and various other factors, when combined, could identify fraudulent companies, and whether decisions about lending to such companies could be based on the results.

Solution Technology:

P. Ravisankar et al. [4] employed a variety of data mining techniques, namely Support Vector Machines (SVM), Group Method of Data Handling (GMDH), Genetic Programming (GP), Logistic Regression (LR), Multilayer Feed-Forward Neural Networks (MLFF) and Probabilistic Neural Networks (PNN). They applied all of these techniques to the same data set in order to identify the best solution to the above problem.

Solution Evaluation

Of the 202 Chinese companies in the data set, P. Ravisankar et al. [4] report that 101 were fraudulent and the remaining 101 were non-fraudulent.

They then applied genetic programming with an appropriate fitness function, SVM to obtain the support vectors, GMDH to build a polynomial feed-forward network model, and PNN, each run with and without feature selection, in order to identify the characteristics of fraudulent companies.
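As an illustration of the "with and without feature selection" comparison, the sketch below pairs an SVM with a univariate feature selector; the feature matrix of financial ratios is hypothetical, and the authors' actual feature selection procedure and their other models (GP, GMDH, MLFF, PNN) are not reproduced here.

```python
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def compare_with_and_without_selection(X, y, k=10):
    """Cross-validated accuracy of an SVM with and without feature selection."""
    plain = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
    selected = make_pipeline(StandardScaler(),
                             SelectKBest(f_classif, k=k),
                             SVC(kernel="rbf"))
    return (cross_val_score(plain, X, y, cv=10).mean(),
            cross_val_score(selected, X, y, cv=10).mean())
```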

They observe that, across the techniques compared, the main considerations are that the size of the data set should match the capacity of the chosen technique, and that the time needed to train the model and obtain results should remain low.

Further Enhancements

I agree with P. Ravisankar et al.'s [4] suggestion of classifying the data set with if-then rules and applying other hybrid data mining techniques in order to further enhance the solution.

REFERENCES

  1. Yap, B. W., Ong, S. H., & Husain, N. H. M. (2011). Using data mining to improve assessment of credit worthiness via credit scoring models. Expert Systems with Applications, 38, 13274-13283.
  2. Mandala, I. G. N. N., Nawangpalupi, C. B., & Praktikto, F. R. (2012). Assessing credit risk: an application of data mining in a rural bank. Procedia Economics and Finance, 4, 406-412.
  3. Gerritsen, R. (1999). Assessing loan risks: a data mining case study. IEEE IT Professional, 16-21.
  4. Ravisankar, P., Ravi, V., Rao, G. R., & Bose, I. (2011). Detection of financial statement fraud and feature selection using data mining techniques. Decision Support Systems, 50(2), 491-500.

Questions and Answers

  1. Why are DM and DW technologies becoming important tools for today's business world?

Today's business world is a competitive environment in which the right decisions need to be taken at the right time, by knowing what has happened and by predicting what will happen in the future.

Data warehousing helps us answer questions such as "what", "which" and "how" through aggregations over historical data.

Data mining, also known as knowledge discovery in databases (KDD), helps us predict what can happen in the future by discovering and analyzing hidden patterns.

Both DM and DW derive their results from large sets of data records drawn from the same or different data sources.

  2. What are the main differences between data mining, traditional statistical data analysis, and information retrieval?

Data mining is the process of discovering new information from existing data by observing the data, identifying patterns and deriving meaningful analytics that can be used in business.

Traditional statistical data analysis is a method of testing a proposed hypothesis in order to validate it and provide statistically significant evidence for accepting or rejecting the outcome.

Information retrieval, in simple terms, is the process of locating and retrieving the required items from an existing collection of information, available in any form.

  3. How is the data warehouse model different from the relational database model? Why is DW technology more advanced in supporting business management?

Relational Database Model:

  • Used for Online Transaction Processing (OLTP)
  • Data stored is generally current operational facts held in a single operational database
  • Tables are normalized
  • SQL is used to query

Data Warehouse Model:

  • Used for Online Analytical Processing (OLAP)
  • Data stored in a DW is generally consolidated (aggregated) data from multiple databases or sources
  • Tables are de-normalized
  • OLAP tools are used to query

The key difference between the DW model and the relational database model is that a DW is a layer built on top of other databases, whereas a relational database is a database in itself.

DW technology is more advanced in supporting business management because it provides quick answers to "what", "which" and "how" questions, which helps management make decisions accordingly; in other words, it is much faster at generating the reports that answer management queries.

  4. What are the main differences between using OLAP on a DW and using SQL on a traditional database for supporting business decision making?

The main difference is that complex questions involving multiple aggregations over data from different sources can be answered ad hoc, easily and quickly, using OLAP on a DW, whereas answering the same questions with SQL on a traditional database requires slower, hand-written queries against each source.
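As a small illustration of the kind of multi-dimensional aggregation an OLAP tool answers in one request: assuming a hypothetical consolidated loan fact table with `branch`, `year`, `product` and `amount` columns, the pandas pivot below mimics a roll-up that would otherwise require several separate SQL GROUP BY queries.

```python
import pandas as pd

def rollup(loans: pd.DataFrame) -> pd.DataFrame:
    """Total and average loan amount by branch and year, per product."""
    return pd.pivot_table(loans,
                          values="amount",
                          index=["branch", "year"],      # drill-down levels
                          columns="product",
                          aggfunc=["sum", "mean"],
                          margins=True)                   # grand totals
```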
