Poor Data Quality In Information System Databases Information Technology Essay
The demand for better quality data is increasing. Too often, data are used uncritically, without consideration of the errors they contain. Erroneous results, misleading information, unwise decisions, loss of revenue, duplication of effort, increased processing costs, poor business relationships and loss of lives are some of the results of poor quality data. Attention must be paid to the quality of data going into computer-based information systems. Corporations, government agencies and not-for-profit groups are all inundated with enormous amounts of data in their information systems databases. These data have the potential to generate greater understanding of a country for proper planning, and of an organisation’s customers, its processes, and the organisation itself. There is a need to ensure the quality of the data from which reports are generated and far-reaching decisions made. The data manipulated by information systems may not only be “inaccurate” or “wrong” but may also be missing, out-of-date, inconsistent or otherwise inadequate for the specific purposes of the user. Missing, inaccurate or inconsistent data can lead to inefficiencies, missed opportunities and poor policies. This paper explores factors that control data quality in information systems databases and identifies factors that are critical to the improvement of the quality of data.
Keywords: Data quality, Dimensions, Human factor, Data cleansing
INTRODUCTION
Attention to data quality is a critical issue in all areas of information resources management. Governments analyse data gathered by population census to decide which regions of the country require further investment in infrastructure, such as schools and other educational facilities, because of expected future trends. If the birth rate in one region has increased over the last couple of years, the existing schools and teachers might not be sufficient to handle the number of students expected, so additional schools or teachers will be needed. Inaccuracies in the analysed data can lead to false conclusions and misdirected investments. In this example, an inaccurate birth rate could result in an over- or undersupply of educational facilities. In other words, inaccurate census statistics could result in the wrong allocation of scarce resources. Businesses are not spared by poor quality data either. An article in the Wall Street Journal (13/7/98) relates the domino effect that occurred when erroneous information was typed into a central database. A new airport in Hong Kong suffered catastrophic problems in baggage handling, flight information, and cargo transfer. The ramifications of the dirty data were felt throughout the airport. Flights took off without luggage; airport officials tracked flights with plastic pieces on magnetic boards; and airlines called confused ground staff on cellular phones to let them know where even more confused passengers could find their planes (Arnold, 1998). The airport had been depending on the central database to be accurate. When it was not, the airport paid the price in terms of customer satisfaction and trust. It is imperative that the issue of data quality be addressed if the database is to prove beneficial to an organisation.
In Nigeria, the Central Bank of Nigeria (CBN) in August 2009 published in the Guardian newspaper the loan defaulters of all banks in Nigeria. Many companies and individuals disputed the figures indicated against their names, for reasons which included:
They never applied for, let alone obtained, a loan from the indicated banks.
The amounts they owed were less than the amounts indicated against their names.
Loans obtained had been completely repaid, yet the list claimed they were still debtors.
They were never, or were no longer, directors of the debtor companies as published.
No sector seems to be left out.
Bowen (1993) reports an error in an inventory database that was caused by a manager. The error, which arose from just one minor data entry, resulted in decisions that generated under-stock conditions. Even an error such as a wrongly entered product or service price that goes through an organisation’s information system without appropriate data quality checks could cause losses to the organisation and/or harm its reputation.
The election for the position of Governor of Anambra State (6th February, 2010) was also marred by poor data. A case in point was that of a former Vice President of Nigeria (Dr. Alex Ekwueme), who could not find his name in the voters’ register. Surprisingly, the same register had been used some months earlier in the local government election; he stated that it contained his name then and that he was able to vote.
Poor data quality in health care, for example, could impact on patient safety. A review of the safety and quality of health care in the United States (Institute of Medicine, 2000) found that between 44,000 and 98,000 deaths in the US each year can be attributed to medical errors. The report recommends ‘better access to accurate, timely information’ and to ‘make relevant patient information available at the point of care’ in an effort to improve patient safety. While not all of these errors are wholly attributable to data quality, Strong et al. (1997) note that the share contributed by poor quality data is quite high.
The cost of all this is high. Strong et al. (1997) noted the social impact of poor quality data when governmental organisations fail to ensure their data have sufficient quality to make effective decisions. The cost to organisations is far more than merely financial. Trust is lost from valuable customers (both internal and external), potential customers and sales are missed, operational costs increase, workers lose motivation, long-term business strategy is hindered (Leitheiser, 2001) and business re-engineering is impeded (Bowen, Fuhrer, & Guess, 1998; Redman, 1996a; United States Department of Defence, 2003; Loshin, 2001). Redman also details how poor data quality affects operational, tactical and strategic decisions (Redman, 1996a).
The view of this study is that handling information system data effectively requires in-depth knowledge of the issues affecting data quality. Data quality is a vital issue in both industry and everyday life; case studies of data quality problems are frequently documented, and the problems have been investigated in a substantial body of literature. The results of a number of investigations are summarised in table 1.1:
Table 1.1: Data quality issues
Arnold (1998)
SUMMARY OF DATA QUALITY ISSUES
Olson (2003): Poor data quality management costs more than $1.4 billion annually in 599 surveyed companies. Up to 88% of data-related projects fail, largely due to issues with data quality.
Wang et al. (2001): 70% of manufacturing orders are assessed as being of poor data quality. In 2001, data quality issues accounted for nearly $600M in losses for US companies.
Huang et al. (1999): A telecommunications company lost $3 million because of poor data quality in customer bills.
Redman (1998): It is estimated that poor data quality results in an 8% to 12% loss of revenue in a typical enterprise, and it is informally estimated to be responsible for 40% to 60% of expense in service organisations.
Wand and Wang (1996): More than 60% of 500 medium-size firms suffer data quality problems.
Klein et al. (1997): 1% to 10% of data in organisational databases is inaccurate.
Redman (1996): An industrial data error rate of up to 30% to 75%.
Wang and Strong (1996): 40% of the credit-risk management database in a New York bank is incomplete.
Strong et al. (1997): Between 50% and 80% of computerised U.S. criminal records are estimated to be inaccurate, incomplete and ambiguous.
Standish Group (1998): 74 percent of all data migration projects either overran or failed, resulting in almost $100 billion in unexpected costs.
Information Week: In a survey of 300 IT executives, the majority of the respondents (81 percent) said, “improving (data) quality was their most important post-year 2000 technology priority”.
TDWI survey (2001): Data quality issues lead to 87% of projects requiring extra time to reconcile data. Data quality issues lead to lost credibility within a system (81%). Day-to-day operating costs due to bad data are estimated to be as high as 20% of operating profit.
These figures demonstrate that managing the quality of data really can be beneficial to organisations. Observing the figures above, we are able to conclude that data quality problems are pervasive, costly and even disastrous.
DATA QUALITY DIMENSIONS
Many studies have confirmed that data quality is a multi-dimensional concept (Ballou and Pazer 1985; Redman 1996; Wand and Wang 1996; Wang and Strong 1996; Huang et al. 1999). Over the last two decades, different sets of data quality dimensions have been identified from both database and data management perspectives. Table 1.2 shows the number of data quality dimensions proposed by each group of researchers.
Table 1.2: Number of data quality dimensions proposed by data quality researchers
Researcher(s): Number of dimensions proposed
Ballou et al. (1982, 1985, 1987, 1993): 4
Huang et al. (1999): 16
Pipino et al. (2002): 15
Lee et al. (2002): 15
Wang, Reddy, & Kon (1995): 5
Stvilia et al. (2006): 22
Wang et al. (2001): 15
(4)
The dimensions identified by Ballou et al. (1982, 1985, 1987, 1993) will be adopted in this study because they cover the most important dimensions addressed in the information systems literature and have been reasonably widely accepted in the data quality field. Therefore, quality data in this paper means accurate, timely, complete, and consistent data. In accord with Olson (2003), we regard accuracy as the most important data quality dimension: “The accuracy dimension is the foundation measure of the quality of data. If data is not right, the other dimensions are of little importance.” The four dimensions identified by Ballou et al. (1982, 1985, 1987, 1993) and their meanings are shown in table 1.3.
Table 1.3: Data quality dimensions (Ballou et al. 1982, 1985, 1987, 1993).
DIMENSION: MEANING
ACCURACY: occurs when the recorded value is in conformity with the actual value
TIMELINESS: occurs when the recorded value is not out of date
CONSISTENCY: occurs when the representation of the data values is the same in all cases
COMPLETENESS: occurs when all values for a certain variable are recorded
MEASURING (ASSESSING) DATA QUALITY
To improve the quality of data, Pipino, Lee & Wang (2002) stressed that data managers need to be able to measure each data quality dimension. Measuring quality usually involves comparing recorded data to some independently collected “gold standard”, and it is impractical to measure the quality of every piece of data. A better approach, they proffered, is to sample the data and respond to patterns in quality. The resulting measures are known as metrics. Table 1.4 shows the methods of measurement and formulae for the data quality dimensions adopted for this paper.
Table 1.4: Measurement of data quality (Heinrich, Kaiser & Klier, 2007; Vassiliadis, 2000)
Dimension: Completeness
Method of measurement: performance of statistical checks
Formula: the percentage of stored information detected to be incomplete with respect to the real world values
Dimension: Timeliness
Method of measurement: Heinrich, Kaiser & Klier (2007)
Formula: the percentage of readily generated information (that meets the deadline) from data in the database, in relation to the whole
Dimension: Accuracy
Method of measurement: documentation of the person or machine which entered the information, and performance of statistical checks
Formula: the percentage of stored information detected to be inaccurate with respect to the real world values, due to data entry reasons
Dimension: Consistency
Method of measurement: performance of statistical checks
Formula: the percentage of stored information detected to be inconsistent
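The percentage measures in table 1.4 can be illustrated with a short Python sketch. It is not taken from the cited authors: the record layout, the field names, the verified ("gold standard") values and the cut-off date are all hypothetical, and timeliness is approximated here as the share of records updated on or after a cut-off date, in line with the definition in table 1.3.

```python
from datetime import date

def percentage(part, whole):
    """Return part/whole as a percentage; 0.0 for an empty sample."""
    return 100.0 * part / whole if whole else 0.0

def incomplete_pct(records, fields):
    """Share of records with at least one missing value in `fields`."""
    missing = sum(1 for r in records if any(r.get(f) in (None, "") for f in fields))
    return percentage(missing, len(records))

def inaccurate_pct(records, gold, field):
    """Share of records whose `field` differs from the verified value."""
    wrong = sum(1 for r in records if r[field] != gold[r["id"]])
    return percentage(wrong, len(records))

def inconsistent_pct(records, field, allowed):
    """Share of recorded (non-missing) values not in the agreed representation."""
    present = [r[field] for r in records if r.get(field) not in (None, "")]
    return percentage(sum(1 for v in present if v not in allowed), len(present))

def timely_pct(records, field, cutoff):
    """Share of records last updated on or after `cutoff` (an assumed rule)."""
    return percentage(sum(1 for r in records if r[field] >= cutoff), len(records))

# Hypothetical sample drawn from a database (illustrative values only).
sample = [
    {"id": 1, "name": "Ada Obi", "state": "Anambra", "updated": date(2010, 1, 5)},
    {"id": 2, "name": "John Eze", "state": None, "updated": date(2008, 3, 1)},
    {"id": 3, "name": "Mary Okoro", "state": "anambra", "updated": date(2010, 2, 1)},
    {"id": 4, "name": "Paul Nwosu", "state": "Lagos", "updated": date(2009, 12, 20)},
]
# Independently verified names for the same records (the "gold standard").
gold_names = {1: "Ada Obi", 2: "John Eze", 3: "Mary Okafor", 4: "Paul Nwosu"}

print("incomplete  %:", incomplete_pct(sample, ["name", "state"]))
print("inaccurate  %:", inaccurate_pct(sample, gold_names, "name"))
print("inconsistent %:", round(inconsistent_pct(sample, "state", {"Anambra", "Lagos"}), 2))
print("timely      %:", timely_pct(sample, "updated", date(2009, 6, 1)))
```

Because the measures are computed on a sample, they are estimates of the quality of the whole database, which is the approach Pipino, Lee & Wang (2002) recommend.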
Data Error Patterns (Defects)
The general data content defect (error) patterns proposed by English (1999) are given in table 1.5.
Table 1.5: Error patterns (English, 1999)
ERROR TYPE: MEANING
1. Missing data values: Data that should have a value is missing. This includes both required fields and fields that are not required to be entered at data capture but are needed in downstream processing.
2. Incorrect data values: These may be caused by transposition of key-strokes; entering data in the wrong place; misunderstanding of the meaning of the data captured; or forced values due to fields requiring a value not known to the information producer or data intermediary.
3. Duplicate occurrences: Multiple records that represent one single real world entity.
4. Inconsistent data values: Unmanaged data stored in redundant databases often gets updated inconsistently; that is, data may be updated in one database but not the others. Or it may even be updated inconsistently in the different databases.
5. Information quality contamination: The result of deriving inaccurate data by combining accurate data with inaccurate data.
6. Domain schizophrenia: Fields may be used for different purposes depending on a specific requirement or purpose.
7. Non-atomic data values: Data fields may be mis-defined as non-atomic, or multiple facts may be entered in a single field.
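Some of the patterns in table 1.5 can be detected mechanically. The Python sketch below is an illustration under stated assumptions rather than English's own method: the records, the field names and the separator characters used to spot non-atomic values are hypothetical. It covers pattern 1 (missing values), pattern 3 (duplicate occurrences) and pattern 7 (non-atomic values).

```python
from collections import defaultdict

def missing_values(records, field):
    """Pattern 1: ids of records whose `field` is empty or absent."""
    return [r["id"] for r in records if not (r.get(field) or "").strip()]

def duplicate_groups(records, fields):
    """Pattern 3: groups of ids that share the same normalised key."""
    groups = defaultdict(list)
    for r in records:
        # Normalise case and repeated spaces so trivial variants compare equal.
        key = tuple(" ".join(str(r.get(f, "")).lower().split()) for f in fields)
        groups[key].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]

def non_atomic_values(records, field, separators=("/", ";", ",")):
    """Pattern 7: ids whose `field` appears to hold more than one fact."""
    return [r["id"] for r in records
            if any(sep in (r.get(field) or "") for sep in separators)]

# Hypothetical customer records (illustrative values only).
customers = [
    {"id": 1, "name": "Ada Obi", "phone": "0801234567"},
    {"id": 2, "name": "ada  obi", "phone": "0801234567"},
    {"id": 3, "name": "John Eze", "phone": ""},
    {"id": 4, "name": "Mary Okoro", "phone": "0809876543 / 0807654321"},
]

print("missing phone:", missing_values(customers, "phone"))
print("duplicates:", duplicate_groups(customers, ["name", "phone"]))
print("non-atomic phone:", non_atomic_values(customers, "phone"))
```

Patterns such as incorrect values or information quality contamination generally cannot be found this way, since they require comparison with the real world or with the source of the derived data.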
FRAMEWORK FOR DATA QUALITY ASSESSMENT
Liskin (1990) and Wang (1989) each proposed frameworks for the assessment of data quality. Wang’s framework takes into account the fact that there are different types of data and different consumers and users. The framework also recognizes that data is used for different applications. As such, the needs and quality requirements are different for the different data customers and applications. Figure 1 shows the framework for data quality assessment.
STEP 1: KNOW YOUR DATA CONSUMER. Enumerate the consumers and know the type of data they are using.
STEP 2: SELECT MEASURES. Ensure that the data quality measures are useful and relevant for the majority, if not all, of the data consumers.
STEP 3: SET DATA QUALITY TARGETS. Set targets for each measure based on the data consumers’ needs and applications.
STEP 4: CALCULATE DATA QUALITY FOR UNIQUE DATA. Measure data quality for each version of data. Data quality changes for each version of data; therefore one must calculate data quality for each significantly unique version of data.
STEP 5: IDENTIFY DATA QUALITY DEFICIENCIES. Compare data quality results to targets and identify data quality deficiencies. Identify and programme resources to improve data quality, or lower the targets where finances are constrained.
STEP 6: ASSIGN RESPONSIBILITY. Assign data quality responsibility to data stewards who ensure that data problems get fixed at the root cause. Provide performance incentives based on quality levels.
STEP 7: COMPLETE THE FEEDBACK CYCLE. Periodically reassess your data consumers, how they use your data and the quality targets for your applications.
Figure 1: Data quality assessment framework. Wang (1989)
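Steps 3 to 5 of the framework amount to comparing measured quality against targets. A minimal Python sketch of that comparison follows; the measure names, target values and measured values are hypothetical and are not drawn from Wang (1989).

```python
def find_deficiencies(measured, targets):
    """Return {measure: shortfall} for every measure that falls below its target.

    A measure with no measured value is treated as 0.0, so it is always
    reported as deficient (an assumption made for this sketch).
    """
    deficiencies = {}
    for measure, target in targets.items():
        value = measured.get(measure, 0.0)
        if value < target:
            deficiencies[measure] = round(target - value, 2)
    return deficiencies

# Step 3: targets set from the data consumers' needs (illustrative percentages).
targets = {"accuracy": 98.0, "completeness": 95.0, "timeliness": 90.0, "consistency": 97.0}

# Step 4: quality measured for one version of the data (illustrative percentages).
measured = {"accuracy": 99.1, "completeness": 91.5, "timeliness": 90.0, "consistency": 94.0}

# Step 5: identify the deficiencies to be resourced or re-targeted.
print(find_deficiencies(measured, targets))
```

Repeating the comparison for each significantly unique version of the data, and again whenever consumers or targets are reassessed, corresponds to steps 4 and 7.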
DATA QUALITY STAKEHOLDER
In the data quality and data warehouse fields, researchers (Strong et al., 1997; Wang, 1998; Shanks and Darke, 1998) have identified the following stakeholder groups that are responsible for creating, maintaining, using, and managing data:
data producers/suppliers: create or collect information;
data custodians: design, develop and operate the information system;
data consumers/users: use the information in their work;
data managers: are responsible for managing the data quality.
The interrelationship of the data quality stakeholder groups and the information system is shown in figure 2.
[Figure 2 components: data suppliers (create/collect data); raw data; data custodians (design and develop software, operate the information system); information system; data managers (supervise/manage); processed information; information users (e.g. top management and general users).]
Fig 2: Data quality stakeholder group framework components and their interrelationships
Factors that impact on Data Quality in Information Systems Database
There have been many studies of critical success factors in quality management such as Total Quality Management (TQM) and Just-In-Time (JIT) (Saraph et al 1989; Porter & Parker 1993; Black & Porter 1996; Badri, Davis & Davis 1995; Yusof & Aspinwall 1999). Some of the data quality literature has addressed the critical points and steps for DQ (Firth 1996; Segev 1996; Huang et al 1999; English 1999). Table 1.6 summarises these human factors.
Table 1.6: Summary of the factors affecting the quality of data
Sources compared: Xu (2001); English (1999); Wang (1998, 1999); Firth (1996); Segev (1996); Zhu (1995); Saraph (1989); Johnson (1981); Groomer (1989); Bowen (1993); Nichols (1987)
HUMAN FACTOR
1. Role of top management
2. Data quality policies & standards
3. Role of data quality manager
5. Employee/personnel relations
6. Performance evaluation and rewards (responsibility for DQ)
8. Internal control (systems, process)
9. Input control
10. Continuous improvement
11. Training and communication
12. Manage change
The model for human factors impacting on data quality in information systems is shown in figure 3.
[Figure 3 labels. Factor groups: DQ characteristics; external factors; organisational factors; stakeholders’ related factors. Outcome: information systems (IS) data quality (DQ). Factors shown: relevant DQ policies & standards and their implementation; DQ approaches (control & improvement); role of DQ; internal control; input control; understanding of the systems & DQ; training; organisational structure & culture; performance evaluation & rewards; manage change; evaluate cost/benefit tradeoffs; teamwork; top management’s commitment to DQ; role of DQ manager/manager group; customer focus; employee/personnel relations; information supplier quality management.]
Figure 3: The model for human factors influencing data quality in information systems
IMPROVING DATA QUALITY
English (1999), Redman (1996), Wang et al. (1995b), Ballou & Pazer (2003) and Haan, Adams & Cook (2004) all agree that the quality of a real-world data set depends on a number of issues, but that the source of the data is the crucial factor. Ballou & Pazer (2003) were specific, stating that data entry and acquisition are inherently prone to errors, both simple and complex. Marcus et al. (2001) say that much effort can be given to improving this front-end process in order to reduce entry errors, but the fact often remains that errors in large data sets are common. English (2003) proposed a data quality improvement cycle (figure 4).
[Figure 4 stages: Plan; Implement; Educate; Evaluate; Adapt.]
Fig 4: Data Quality Improvement Cycle. English (2003)
Olson (2003) notes that the mission of any data quality improvement programme should be three-fold: to improve, prevent, and monitor. Data quality practitioners, including English (1999a), Wang et al. (2001), Olson (2003) and Loshin (2001), agree that the cause of poor quality data is often found to be human or process error. Implementing such a programme requires work from many participants in an organisation, often across business units, and Olson (2003) indicated that it requires long-term commitment. Embury (2001) notes that the general principles of quality management as applied to products can also be applied to data. This suggests there should be two basic approaches to the improvement of data quality, namely:
defect detection (and correction) and
defect prevention.
Defect prevention is considered to be far superior to defect detection, since detection is often costly and cannot be guaranteed to be 100% successful at any stage. To prevent data value errors, Redman (1996) gave the tips in table 1.7.
Table 1.7: Database Design Tips to Improve Data Quality (from Redman, 1996)
Database Design Tip, followed by Description of Design Tip
1. Create a data value as few times as possible.
   Inconsistencies between multiple values often go unnoticed until they are the source of a problem.
2. Store data in as few databases as possible.
   Multiple storage makes it difficult to maintain consistency, especially when data change.
3. Put data in machine-readable form as early in the business process as possible.
   Computers and scanners are better than people at activities such as reading and inputting data. However, do not assume that computerized data collection is 100% accurate.
4. Minimize data format changes within the business process.
   If format changes are necessary, use computers, not people, to make format changes.
5. When obtaining data for the first time, obtain them just before they are first needed.
   Existing data values change rapidly. Capture changes to data values as soon as possible after they change.
6. Discontinue gathering and storing data that are no longer useful.
   Plan for periodic review of data needs. When data are no longer useful, they need not be destroyed, simply moved to secondary storage.
7. Employ codes that are easy for data creators and users to understand.
   Avoid long, numeric, meaningless coding conventions in favor of short, meaningful words or abbreviations.
8. Place edits as near as possible to data creation or modification.
   Use edits as input criteria to a database, as opposed to exit criteria from a database to an application.
9. Employ single-fact data wherever possible.
   Single-fact data help reduce code complexity and simplify operators’ jobs.
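Tip 8 can be illustrated with a small Python sketch in which edit checks act as input criteria, so that a faulty entry is rejected before it reaches the database. The field names, the price range and the in-memory list standing in for a database are hypothetical choices made for the illustration, not part of Redman's (1996) text.

```python
def validate_entry(entry):
    """Return a list of problems; an empty list means the entry may be stored."""
    problems = []
    if not str(entry.get("product_code", "")).strip():
        problems.append("product_code is required")
    price = entry.get("unit_price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("unit_price must be a number")
    elif not (0 < price <= 100_000):
        # Assumed plausibility range; a real system would take it from policy.
        problems.append("unit_price is outside the plausible range")
    return problems

def store(database, entry):
    """Apply the edits at the point of data creation, then store the entry."""
    problems = validate_entry(entry)
    if problems:
        raise ValueError("; ".join(problems))
    database.append(entry)

inventory = []  # stands in for a database table
store(inventory, {"product_code": "A1", "unit_price": 250.0})

try:
    store(inventory, {"product_code": "A2", "unit_price": -5})
except ValueError as err:
    print("rejected:", err)

print("stored entries:", len(inventory))
```

A check of this kind at entry is the sort of control that could stop a wrongly entered price from passing through the information system unnoticed.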
DATA CLEANSING
Rahm and Do (2000) state that the terms data cleansing, data improvement and scrubbing are used synonymously to mean detecting and removing errors and inconsistencies from data in order to improve its quality. For Lee (2004), the purpose of cleansing is to improve the quality of data within the existing data structures. Figure 5 shows the process of data cleansing.
[Figure 5 stages: receive source file; data cleansing; quality checks; data delivered.]
Fig 5: The process of data cleansing
Cleansing data of impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy, and thereby the usability, of existing data. Cleansing means standardizing non-standard data values and domains, filling in missing data, correcting incorrect data, and consolidating duplicate occurrences. The general framework for data cleansing is described by Maletic and Marcus (1999) as follows:
Define and determine error types
Search and identify error instances
Correct the errors uncovered
The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid street code) or fuzzy (such as correcting records that partially match existing, known records).
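The difference between strict and fuzzy validation can be sketched in Python using the standard library’s difflib module. The reference list of state names and the similarity cut-off of 0.8 are assumptions chosen for the illustration; a real cleansing exercise would take both from the organisation’s own reference data and tolerance for false corrections.

```python
import difflib

# Hypothetical list of known, valid entities.
KNOWN_STATES = ["Anambra", "Lagos", "Kano", "Rivers"]

def strict_validate(value, known=KNOWN_STATES):
    """Strict validation: accept only an exact match, otherwise reject (None)."""
    return value if value in known else None

def fuzzy_correct(value, known=KNOWN_STATES, cutoff=0.8):
    """Fuzzy validation: return the closest known entity, if any is similar enough."""
    candidate = value.strip().title()
    matches = difflib.get_close_matches(candidate, known, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for raw in ["Anambra", "anambrra", "lagos ", "Xyz"]:
    print(repr(raw), "strict:", strict_validate(raw), "fuzzy:", fuzzy_correct(raw))
```

Strict validation never introduces a wrong correction but rejects more records; fuzzy correction recovers near-misses at the risk of matching the wrong entity, so values it changes are usually worth logging for review.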
Storage of data
The storage of data can have an effect on data quality in a number of ways. Many of these are not obvious, but need to be considered both in the design of the storage vessel (database) and as a unit in the data quality chain.
Backup of data
The regular backup of data helps ensure consistent quality levels. It is essential that organizations maintain current disaster recovery and back-up procedures. Whenever data are lost or corrupted, there is a concomitant loss in quality.
Archiving
Archiving (including obsolescence and disposal) of data is an area of data and risk management that needs more attention. Data archiving, in particular by universities, NGOs and private individuals, should be a priority data management issue. Universities have high turnovers of staff, and research data are often stored in a distributed manner, usually on the researcher’s own PC or in a filing cabinet. If not fully documented, such data can very quickly lose their usability and accessibility. More often than not they are discarded some time after the researcher has left the organisation, as no one knows what they are or cares to put the effort into maintaining them. It is for this reason that universities in particular need sound documenting and archiving strategies.
Individual researchers working outside of a major institution need to ensure that their data are maintained and/or archived after their death, or after they cease to have an interest in them. Similarly, NGOs that may not have long-term funding for the storage of data need to enter into arrangements with appropriate organisations that do have a long-term data management strategy (including for archiving) and that may have an interest in the data. The cleanup, disposal and archiving of data are also issues with data on the World Wide Web. Web sites that are abandoned by their creators, or that contain old and obsolete data, leave cyberspace littered with digital debris. Organisations need a data archiving strategy built into their information management chain.