Poor Data Quality in Information System Databases

The demand for better quality data is increasing. Too often, data are used uncritically, without consideration of the errors they contain. The consequences of poor quality data include erroneous results, misleading information, unwise decisions, loss of revenue, duplication of effort, increased processing costs, poor business relationships and even loss of lives. Attention must therefore be paid to the quality of data going into computer-based information systems. Corporations, government agencies and not-for-profit groups are all inundated with enormous amounts of data in their information systems databases. These data have the potential to generate greater understanding of a country for proper planning, and of an organisation’s customers, its processes and the organisation itself. There is a need to ensure the quality of the data from which reports are generated and far-reaching decisions are made. The data manipulated by information systems may be not only “inaccurate” or “wrong” but also missing, out of date, inconsistent or otherwise inadequate for the specific purposes of the user. Missing, inaccurate or inconsistent data can lead to inefficiencies, missed opportunities and poor policies. This paper explores the factors that control data quality in information systems databases and identifies those that are critical to improving the quality of data.

Keywords: Data quality, Dimensions, Human factor, Data cleansing

INTRODUCTION

Attention to data quality is a critical issue in all areas of information resources management. A government analyses data gathered in a population census to decide which regions of the country require further investment in infrastructure, such as schools and other educational facilities, because of expected future trends. If the birth rate in one region has increased over the last couple of years, the existing schools and teachers might not be sufficient to handle the number of students expected, so additional schools or teachers will be needed. Inaccuracies in the analysed data can lead to false conclusions and misdirected investments. In this example, an inaccurate birth rate could result in an over- or undersupply of educational facilities. In other words, inaccurate census statistics could result in the wrong allocation of scarce resources.

Businesses are not spared by poor quality data either. An article in the Wall Street Journal (13/7/98) relates the domino effect that occurred when erroneous information was typed into a central database. A new airport in Hong Kong suffered catastrophic problems in baggage handling, flight information and cargo transfer. The ramifications of the dirty data were felt throughout the airport. Flights took off without luggage; airport officials tracked flights with plastic pieces on magnetic boards; and airlines called confused ground staff on cellular phones to let them know where even more confused passengers could find their planes (Arnold, 1998). The airport had been depending on the central database to be accurate. When it was not, the airport paid the price in customer satisfaction and trust. It is imperative that the issue of data quality be addressed if a database is to prove beneficial to an organisation.

In Nigeria, the Central Bank of Nigeria (CBN) published in the Guardian newspaper in August 2009 a list of loan defaulters of all banks in Nigeria. Many companies and individuals disputed the figures shown against their names, for reasons that included the following:

They had never applied for, let alone obtained, a loan from the indicated banks.

The amounts they owed were less than the amounts shown against their names.

Loans obtained had been completely repaid, yet the list still showed them as debtors.

They had never been, or were no longer, directors of the debtor companies as published.

No sector seems to be left out.

Bowen (1993) reports an error in an inventory database, caused by a manager, that resulted in decisions which generated under-stock conditions. The error stemmed from just one minor data entry. Even an error such as a wrongly entered product or service price that passes through an organisation’s information system without appropriate data quality checks could cause losses to the organisation and/or harm its reputation.

The election for the position of Governor of Anambra State (6 February 2010) was also troubled by poor data. A case in point was that of a former Vice President of Nigeria, Dr. Alex Ekwueme, who could not find his name in the voters register. Surprisingly, the same register had been used some months earlier in the local government election; he claimed that it had contained his name then and that he had been able to vote.

Poor data quality in health care, for example, could affect patient safety. A review of the safety and quality of health care in the United States (Institute of Medicine, 2000) found that between 44,000 and 98,000 deaths in the US each year can be attributed to medical errors. The report recommends ‘better access to accurate, timely information’ and making ‘relevant patient information available at the point of care’ in an effort to improve patient safety. While not all of these errors are wholly attributable to data quality, Strong et al. (1997) note that the proportion contributed by poor quality data is quite high.

The cost of all these is high. Strong et al. (1997) also note the social impact when governmental organisations fail to ensure that their data are of sufficient quality to support effective decisions. The cost to organisations is far more than merely financial: trust is lost among valuable customers (both internal and external), potential customers and sales are missed, operational costs increase, workers lose motivation, long-term business strategy is hindered (Leitheiser, 2001) and business re-engineering is impeded (Bowen, Fuhrer, & Guess, 1998; Redman, 1996a; United States Department of Defence, 2003; Loshin, 2001). Redman (1996a) also details how poor data quality affects operational, tactical and strategic decisions.

The view of this study is that handling information system data effectively requires an in-depth knowledge of the issues affecting data quality. Data quality is a vital issue in both industry and everyday life, and case studies of data quality problems are frequently documented in a substantial body of literature. The results of a number of investigations are summarised in Table 1.1.

Table 1.1: Data quality issues

SOURCE | SUMMARY OF DATA QUALITY ISSUES
Arnold (1998) |
Olson (2003) | Poor data quality management costs more than $1.4 billion annually in 599 surveyed companies.
Olson (2003) | Up to 88% of data-related projects fail, largely due to issues with data quality.
Wang et al. (2001) | 70% of manufacturing orders are assessed as of poor data quality.
Wang et al. (2001) | In 2001, data quality issues accounted for nearly $600M in losses for US companies.
Huang et al. (1999) | A telecommunications company lost $3 million because of poor data quality in customer bills.
Redman (1998) | Poor data quality is estimated to result in an 8% to 12% loss of revenue in a typical enterprise, and is informally estimated to be responsible for 40% to 60% of expense in service organisations.
Wand and Wang (1996) | More than 60% of 500 medium-size firms suffer data quality problems.
Klein et al. (1997) | 1% to 10% of data in organisational databases is inaccurate.
Redman (1996) | Industrial data error rates are as high as 30% to 75%.
Wang and Strong (1996) | 40% of the credit-risk management database in a New York bank is incomplete.
Strong et al. (1997) | Between 50% and 80% of computerised U.S. criminal records are estimated to be inaccurate, incomplete and ambiguous.
Standish Group (1998) | 74 percent of all data migration projects either overran or failed, resulting in almost $100 billion in unexpected costs.
Information Week | In a survey of 300 IT executives, the majority of the respondents (81 percent) said, “improving (data) quality was their most important post-year 2000 technology priority”.
TDWI survey (2001) | Data quality issues lead to 87% of projects requiring extra time to reconcile data.
TDWI survey (2001) | Data quality issues lead to lost credibility within a system in 81% of cases.
TDWI survey (2001) | Day-to-day operating costs due to bad data are estimated to be as high as 20% of operating profit.

These figures show that data quality problems are pervasive, costly and at times disastrous, and that managing the quality of data can be of real benefit to organisations.

DATA QUALITY DIMENSIONS

Many studies have confirmed that data quality is a multi-dimensional concept (Ballou and Pazer 1985; Redman 1996; Wand and Wang 1996; Wang and Strong 1996; Huang et al. 1999). Over the last two decades, different sets of data quality dimensions have been identified from both database and data management perspectives. Table 1.2 shows the data quality dimensions proposed by each researcher.

Table 1.2: Data quality dimensions proposed by data quality researchers

Researchers: Ballou et al (1982, | Huang et al. (1999) | Pipino et al. (2002) | Lee et al. (2002) | Wang, Reddy, & Kon (1995) | Stvilia et al. (2006) | Wang et al (2001)

Number of Dimensions Proposed: (4)

The dimensions identified by Ballou et al. (1982, 1985, 1987, 1993) are adopted in this study because they cover the most important dimensions addressed in the information systems literature and are reasonably widely accepted in the data quality field. Quality data in this paper therefore means accurate, timely, complete and consistent data. In accord with Olson (2003), we regard accuracy as the most important data quality dimension: “The accuracy dimension is the foundation measure of the quality of data. If data is not right, the other dimensions are of little importance” (Olson, 2003). The four dimensions identified by Ballou et al. (1982, 1985, 1987, 1993) and their meanings are shown in Table 1.3.

Table 1.3: Data quality dimensions (Ballou et al. 1982, 1985, 1987, 1993)

DIMENSION | MEANING
ACCURACY | occurs when the recorded value is in conformity with the actual value
TIMELINESS | occurs when the recorded value is not out of date
CONSISTENCY | occurs when the representation of the data values is the same in all cases
COMPLETENESS | occurs when all values for a certain variable are recorded

MEASURING (ASSESSING) DATA QUALITY

To improve the quality of data, Pipino, Lee & Wang (2002) stress that data managers need to be able to measure each data quality dimension. Measuring quality usually involves comparing recorded data with some independently collected “gold standard”, and it is impractical to measure the quality of every piece of data. A better approach, they suggest, is to sample the data and respond to patterns in quality. The resulting measures are known as metrics. Table 1.4 shows the methods of measurement and formulae for the data quality dimensions adopted in this paper.

Table 1.4: Measurement of data quality (Heinrich, Kaiser & Klier, 2007; Vassiliadis, 2000)

Dimension | Methods of measurement | Formulae
Completeness | performance of statistical checks | the percentage of stored information detected to be incomplete with respect to the real world values
Timeliness | Heinrich, Kaiser & Klier (2007) | the percentage of ready generated information (that meets the deadline) from data in the database in relation to the whole
Accuracy | documentation of the person or machine which entered the information and performance of statistical checks | the percentage of stored information detected to be inaccurate with respect to the real world values, due to data entry reasons
Consistency | performance of statistical checks | the percentage of stored information detected to be inconsistent
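The percentage measures above can be computed over a sample of records. The following Python sketch is illustrative only: the record layout, the field names (`name`, `state`, `updated`), the gold-standard values, the list of valid state spellings and the 365-day age limit are all assumptions made for the example. Timeliness is approximated here by the age of a record (the reading given in Table 1.3) rather than by reporting deadlines.

```python
from datetime import date

# Hypothetical sample of stored records (field names are assumptions).
stored = [
    {"id": 1, "name": "Ada Obi", "state": "Anambra", "updated": date(2010, 1, 5)},
    {"id": 2, "name": None, "state": "Lagos", "updated": date(2009, 12, 1)},
    {"id": 3, "name": "Chidi Eze", "state": "lagos", "updated": date(2007, 3, 9)},
    {"id": 4, "name": "Bola Ade", "state": "Kano", "updated": date(2010, 1, 20)},
]

# Independently collected "gold standard" names for the same records.
gold = {1: "Ada Obi", 2: "Emeka Uba", 3: "Chidi Eze", 4: "Bola Adeyemi"}

VALID_STATES = {"Anambra", "Lagos", "Kano"}  # assumed canonical spellings


def percentage(flags):
    """Return the share of True flags as a percentage."""
    flags = list(flags)
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


def incomplete_pct(records, field):
    """Completeness: percentage of records with no value in the field."""
    return percentage(r[field] is None for r in records)


def inaccurate_pct(records, field, reference):
    """Accuracy: percentage of recorded values that differ from the reference."""
    recorded = [r for r in records if r[field] is not None]
    return percentage(r[field] != reference[r["id"]] for r in recorded)


def inconsistent_pct(records, field, valid):
    """Consistency: percentage of values not in the agreed representation."""
    return percentage(r[field] not in valid for r in records)


def out_of_date_pct(records, field, today, max_age_days):
    """Timeliness: percentage of records older than the allowed age."""
    return percentage((today - r[field]).days > max_age_days for r in records)
```

For the sample above, `incomplete_pct(stored, "name")` and `inconsistent_pct(stored, "state", VALID_STATES)` each give 25%, while `inaccurate_pct(stored, "name", gold)` gives about 33%, because only the three recorded names can be compared with the gold standard.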

Data Error Patterns (Defects)

The general data content defect (error) patterns proposed by English (1999) are given in Table 1.5.

Table 1.5: Error patterns (English, 1999)

ERROR TYPE | MEANING
1. Missing data values | Data that should have a value is missing. This includes both required fields and fields that are not required at data capture but are needed in downstream processing.
2. Incorrect data values | These may be caused by transposition of key-strokes; entering data in the wrong place; misunderstanding of the meaning of the data captured; or forced values due to fields requiring a value not known to the information producer or data intermediary.
3. Duplicate occurrences | Multiple records that represent one single real world entity.
4. Inconsistent data values | Unmanaged data stored in redundant databases often gets updated inconsistently; that is, data may be updated in one database but not the others. Or it may even be updated inconsistently in the different databases.
5. Information quality contamination | The result of deriving inaccurate data by combining accurate data with inaccurate data.
6. Domain schizophrenia | Fields may be used for different purposes depending on a specific requirement or purpose.
7. Non-atomic data values | Data fields may be mis-defined as non-atomic, or multiple facts may be entered in a single field.
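Several of these patterns can be searched for mechanically. The sketch below, in Python, looks for three of them (missing values, duplicate occurrences and non-atomic values) in a small set of hypothetical customer records. The field names, the normalisation rule used to match duplicates and the separator characters are assumptions for illustration, not part of English's scheme.

```python
from collections import defaultdict

# Hypothetical customer records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Ngozi Okeke", "phone": "0803-111-2222"},
    {"id": 2, "name": "ngozi  okeke", "phone": "0803-111-2222"},
    {"id": 3, "name": "Tunde Bello", "phone": ""},
    {"id": 4, "name": "Amina Yusuf", "phone": "0805-333-4444 / 0807-555-6666"},
]


def missing_values(rows, required):
    """Pattern 1: ids of rows where a required field is empty or absent."""
    return [r["id"] for r in rows if any(not r.get(f) for f in required)]


def duplicate_groups(rows, key_fields):
    """Pattern 3: groups of ids that share the same normalised key."""
    groups = defaultdict(list)
    for r in rows:
        # Normalise case and whitespace so trivially different spellings match.
        key = tuple(" ".join(str(r[f]).lower().split()) for f in key_fields)
        groups[key].append(r["id"])
    return [ids for ids in groups.values() if len(ids) > 1]


def non_atomic(rows, field, separators=("/", ";", ",")):
    """Pattern 7: ids of rows whose field appears to hold several facts."""
    return [r["id"] for r in rows if any(s in r[field] for s in separators)]
```

On the sample records, record 3 is reported as missing a phone number, records 1 and 2 are grouped as probable duplicates, and record 4 is flagged as non-atomic because two phone numbers share one field.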

FRAMEWORK FOR DATA QUALITY ASSESSMENT

Liskin (1990) and Wang (1989) each proposed a framework for the assessment of data quality. Wang’s framework takes into account that there are different types of data and different consumers and users, and that data are used for different applications. The needs and quality requirements therefore differ between data customers and applications. Figure 1 shows the framework for data quality assessment.

STEP 1: KNOW YOUR DATA CONSUMER. Enumerate the consumers and know the type of data they are using.

STEP 2: SELECT MEASURES. Ensure that the data quality measures are useful and relevant for the majority, if not all, of the data consumers.

STEP 3: SET DATA QUALITY TARGETS. Set targets for each measure based on the data consumers’ needs and applications.

STEP 4: CALCULATE DATA QUALITY FOR UNIQUE DATA. Measure data quality for each version of data. Data quality changes with each version of data; therefore one must calculate data quality for each significantly unique version of data.

STEP 5: IDENTIFY DATA QUALITY DEFICIENCIES. Compare data quality results with targets and identify data quality deficiencies. Identify and programme resources to improve data quality, or lower the targets where finances are constrained.

STEP 6: ASSIGN RESPONSIBILITY. Assign data quality responsibility to data stewards who ensure that data problems get fixed at the root cause. Provide performance incentives based on quality levels.

STEP 7: COMPLETE THE FEEDBACK CYCLE. Periodically reassess your data consumers, how they use your data and the quality targets for your applications.

Figure 1: Data quality assessment framework (Wang, 1989)
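Steps 3 and 5 of the framework amount to comparing measured quality against targets. A minimal Python sketch of that comparison follows; the measure names, target values and measured scores are invented for illustration, and `find_deficiencies` is a hypothetical helper rather than part of Wang's framework.

```python
# Hypothetical targets and measured scores (percent of records passing each check).
targets = {"accuracy": 98.0, "completeness": 95.0, "consistency": 97.0, "timeliness": 90.0}
measured = {"accuracy": 96.5, "completeness": 99.0, "consistency": 97.0, "timeliness": 82.0}


def find_deficiencies(measured, targets):
    """Return {measure: shortfall} for every measure below its target."""
    return {
        name: round(target - measured[name], 2)
        for name, target in targets.items()
        if measured[name] < target
    }


deficiencies = find_deficiencies(measured, targets)

# List the largest shortfalls first so resources can be directed there.
for name, gap in sorted(deficiencies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {gap} points below target")
```

With these sample figures, timeliness and accuracy fall short of their targets, while completeness and consistency meet them.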

DATA QUALITY STAKEHOLDER

In the data quality and data warehouse fields, Strong et al. (1997), Wang (1998) and Shanks and Darke (1998) identified four stakeholder groups that are responsible for creating, maintaining, using and managing data. They are:

data producers / suppliers: create or collect information;

data custodians: design, develop and operate the information system;

data consumers / users: use the information in their work;

data managers: are responsible for managing the data quality.

The interrelationship of the data quality stakeholder groups and the information system is shown in figure 2.

[Figure 2 components: DATA SUPPLIERS (create / collect data) provide RAW DATA to the INFORMATION SYSTEM; the DATA CUSTODIAN designs, develops software for and operates the information system; DATA MANAGERS supervise / manage; processed information goes to information users, e.g. top management and general users.]

Fig 2: Data quality stakeholder group framework components and their interrelationships

Factors that impact on Data Quality in Information Systems Database

There have been many studies of critical success factors in quality management such as Total Quality Management (TQM) and Just-In-Time (JIT) (Saraph et al 1989; Porter & Parker 1993; Black & Porter 1996; Badri, Davis & Davis 1995; Yusof & Aspinwall 1999). Some of the data quality literature has addressed the critical points and steps for DQ (Firth 1996; Segev 1996; Huang et al 1999; English 1999). Table 1.6 summarises these human factors.

Table 1.6: Summary of the factors affecting the quality of data

Sources compared: Xu (2001); English (1999); Wang (1998, 1999); Firth (1996); Segev (1996); Zhu (1995); Saraph (1989); Johnson (1981); Groomer (1989); Bowen (1993); Nichols (1987)

HUMAN FACTOR
Role of top management
Data quality policies & standards
Role of data quality manager
Employee / personnel relations
Performance evaluation and rewards (responsibility for DQ)
Internal control (systems, process)
Input control
Continuous improvement
Training and communication
Manage change

The model for human factors impacting on data quality in information systems is shown in Figure 3.

[Figure 3 components: Information Systems (IS) data quality (DQ), with the following groups of factors.

DQ characteristics: relevant DQ policies & standards & their implementation; DQ approaches (control & improvement); role of DQ; internal control; input control; understanding of the systems & DQ.

External factors.

Organisational factors: training; organisational structure & culture; performance evaluation & rewards; manage change; evaluate cost/benefit tradeoffs; teamwork.

Stakeholders’ related factors: top management’s commitment to DQ; role of DQ manager / manager group; customer focus; employee / personnel relations; information supplier quality management.]

Figure 3: The model for human factors influencing data quality in information systems

IMPROVING DATA QUALITY

English (1999), Redman (1996), Wang et al. (1995b), Ballou & Pazer (2003) and Haan, Adams & Cook (2004) all agree that the quality of a real-world data set depends on a number of issues, but that the source of the data is the crucial factor. Ballou & Pazer (2003) state specifically that data entry and acquisition are inherently prone to errors, both simple and complex. Marcus et al. (2001) note that much effort can be devoted to improving this front-end process so as to reduce entry errors, but the fact often remains that errors in large data sets are common. English (2003) proposed a data quality improvement cycle (Figure 4).

[Figure 4 stages: Plan, Implement, Educate, Evaluate, Adapt]

Fig 4: Data quality improvement cycle (English, 2003)

Olson (2003) notes that the mission of any data quality improvement programme should be three-fold: to improve, to prevent and to monitor. Data quality practitioners, including English (1999a), Wang et al. (2001), Olson (2003) and Loshin (2001), agree that the cause of poor quality data is often found to be human or process error. Implementing such initiatives requires a programme of work involving many participants in an organisation, often across business units, and Olson (2003) indicates that such a programme requires long-term commitment. Embury (2001) notes that the general principles of quality management as applied to products can also be applied to data. This suggests two basic approaches to the improvement of data quality, namely:

defect detection (and correction) and

defect prevention.

Defect prevention is considered far superior to defect detection, since detection is often costly and cannot be guaranteed to be 100% successful at any stage. To prevent data value errors, Redman (1996) gave the tips in Table 1.7.

Table 1.7: Database Design Tips to Improve Data Quality (from Redman, 1996)

Database Design Tip | Description of Design Tip
Create a data value as few times as possible. | Inconsistencies between multiple values often go unnoticed until they are the source of a problem.
Store data in as few databases as possible. | Multiple storage makes it difficult to maintain consistency, especially when data change.
Put data in machine-readable form as early in the business process as possible. | Computers and scanners are better than people at activities such as reading and inputting data. However, do not assume that computerized data collection is 100% accurate.
Minimize data format changes within the business process. | If format changes are necessary, use computers, not people, to make format changes.
When obtaining data for the first time, obtain them just before they are first needed. | Existing data values change rapidly. Capture changes to data values as soon as possible after they change.
Discontinue gathering and storing data that are no longer useful. | Plan for periodic review of data needs. When data are no longer useful, they need not be destroyed, simply moved to secondary storage.
Employ codes that are easy for data creators and users to understand. | Avoid long, numeric, meaningless coding conventions in favor of short, meaningful words or abbreviations.
Place edits as near as possible to data creation or modification. | Use edits as input criteria to a database, as opposed to exit criteria from a database to an application.
Employ single-fact data wherever possible. | Single-fact data help reduce code complexity and simplify operators’ jobs.
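The tip to place edits as near as possible to data creation can be illustrated with a check that runs before a record is stored, so that an error such as a wrongly entered price is rejected at entry. The Python sketch below is one possible form under stated assumptions: the product code format, the price range and the function names `validate_entry` and `save` are hypothetical and do not come from Redman.

```python
import re

# Assumed rules for a hypothetical product record.
PRICE_RANGE = (0.01, 100000.00)
PRODUCT_CODE = re.compile(r"^[A-Z]{3}-\d{4}$")  # e.g. "PEN-0042" (made-up format)


def validate_entry(entry):
    """Return a list of problems; an empty list means the entry may be stored."""
    problems = []
    code = str(entry.get("code", ""))
    if not PRODUCT_CODE.match(code):
        problems.append("code must look like ABC-1234")
    price = entry.get("price")
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        problems.append("price must be a number")
    elif not PRICE_RANGE[0] <= price <= PRICE_RANGE[1]:
        problems.append("price is outside the allowed range")
    return problems


def save(entry, table):
    """Store the entry only if it passes every edit check (input criteria)."""
    problems = validate_entry(entry)
    if problems:
        raise ValueError("; ".join(problems))
    table.append(entry)
```

Because `save` refuses invalid entries, the checks act as input criteria to the database rather than as exit criteria applied later by each application.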

DATA CLEANSING

Rahm and Do (2000) state that data cleansing, also called data improvement or scrubbing, means detecting and removing errors and inconsistencies from data in order to improve their quality. For Lee (2004), the purpose of cleansing is to improve the quality of data within the existing data structures. Figure 5 shows the process of data cleansing.

[Figure 5 stages: Receive source file, Data cleansing, Quality checks, Data delivered]

Fig 5: The process of data cleansing

Cleansing data of impurities is an integral part of data processing and maintenance. This has led to the development of a broad range of methods intended to enhance the accuracy, and thereby the usability, of existing data. Cleansing means standardizing non-standard data values and domains, filling in missing data, correcting incorrect data and consolidating duplicate occurrences. The general framework for data cleansing is described by Maletic and Marcus (1999) as follows:

Define and determine error types

Search and identify error instances

Correct the errors uncovered

The actual process of data cleansing may involve removing typographical errors or validating and correcting values against a known list of entities. The validation may be strict (such as rejecting any address that does not have a valid street code) or fuzzy (such as correcting records that partially match existing, known records).
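The difference between strict and fuzzy validation can be sketched in Python with the standard library's `difflib.get_close_matches`. The known list (a few Nigerian state names), the similarity cutoff of 0.8 and the function names are assumptions chosen for the example; a real cleansing tool would need a complete reference list and a cutoff tuned to its data.

```python
from difflib import get_close_matches

# Known list of entities (a few Nigerian state names, used as an example).
KNOWN_STATES = ["Anambra", "Lagos", "Kano", "Enugu", "Rivers"]


def strict_validate(value, known=KNOWN_STATES):
    """Strict validation: accept the value only if it is on the known list."""
    return value if value in known else None


def fuzzy_correct(value, known=KNOWN_STATES, cutoff=0.8):
    """Fuzzy validation: return the closest known entry, or None if none is close."""
    lookup = {k.lower(): k for k in known}
    matches = get_close_matches(value.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None
```

Under strict validation the misspelling "Anambar" is rejected, whereas the fuzzy version corrects it to "Anambra". A value with no close match, such as "Abuja" against this short list, is left uncorrected by both.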

Storage of data

The storage of data can have an effect on data quality in a number of ways. Many of these are not obvious, but need to be considered both in the design of the storage vessel (database) and as a unit in the data quality chain.

Backup of data

The regular backup of data helps ensure consistent quality levels. It is essential that organizations maintain current disaster recovery and back-up procedures. Whenever data are lost or corrupted, there is a concomitant loss in quality.

Archiving

Archiving (including obsolescence and disposal) of data is an area of data and risk management that needs more attention. Data archiving, in particular by universities, NGOs and private individuals, should be a priority data management issue. Universities have a high turnover of staff, and research data are often stored in a distributed manner, usually on the researcher’s own PC or in a filing cabinet. If not fully documented, such data can very quickly lose their usability and accessibility. More often than not they are discarded some time after the researcher has left the organisation, as no one knows what they are or cares to put effort into maintaining them. It is for this reason that universities in particular need sound documenting and archiving strategies.

Individual researchers working outside a major institution need to ensure that their data are maintained and/or archived after their death, or after they cease to have an interest in them. Similarly, NGOs that may not have long-term funding for the storage of data need to enter into arrangements with appropriate organisations that do have a long-term data management strategy (including for archiving) and that may have an interest in the data. The cleanup, disposal and archiving of data are also issues for data on the World Wide Web. Web sites that are abandoned by their creators, or that contain old and obsolete data, leave cyberspace littered with digital debris. Organisations need a data archiving strategy built into their information management chain.
