A Guide Into Business Intelligence Studies Information Technology Essay

Data Warehousing: Integration of data from multiple sources into large warehouses and support of on-line analytical processing and business decision making

DW vs. Operational Databases

Data Warehouse

Subject Oriented

Integrated

Nonvolatile

Time variant

Ad hoc retrieval

Operational Databases

Application oriented

Limited integration

Continuously updated

Current data values only

Predictable retrieval

Data Warehouse: a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.

Data Mart

A monothematic data warehouse

Department-oriented or business-line-oriented

Top-Down Approach

Advantages

A truly corporate effort, an enterprise view of data

Inherently architected – not a union of disparate data marts

Single, central storage of data about the content

Centralized rules and control

May see quick results if implemented with iterations

Disadvantages

Takes longer to build even with an iterative method

High exposure/risk to failure

Needs high level of cross-functional skills

High outlay without proof of concept

Bottom-Up Approach

Advantages

Faster and easier implementation of manageable pieces

Favorable return on investment and proof of concept

Less risk of failure

Inherently incremental; can schedule important data marts first

Allows project team to learn and grow

Disadvantages

Each data mart has its own narrow view of data

Spreads redundant data across every data mart

Perpetuates inconsistent and irreconcilable data

Proliferates unmanageable interfaces

Data Staging Component

Three major functions need to be performed for getting the data ready (ETL)

extract the data

transform the data

and then load the data into the data warehouse storage
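The three ETL functions above can be sketched as a small pipeline. This is a minimal illustration, not a production design: the source rows, the field names, and the in-memory list standing in for warehouse storage are all assumptions.

```python
# Hypothetical source rows, as they might arrive from an operational system.
source_rows = [
    {"order_id": 1, "customer": " alice ", "amount": "19.90"},
    {"order_id": 2, "customer": "BOB", "amount": "5.00"},
]

def extract(rows):
    """Extract: read rows from the source (here, simply copy them)."""
    return [dict(r) for r in rows]

def transform(rows):
    """Transform: tidy customer names and convert amounts to numbers."""
    return [
        {
            "order_id": r["order_id"],
            "customer": r["customer"].strip().title(),
            "amount": float(r["amount"]),
        }
        for r in rows
    ]

def load(rows, warehouse):
    """Load: add the prepared rows to warehouse storage (a plain list here)."""
    warehouse.extend(rows)
    return warehouse

warehouse = load(transform(extract(source_rows)), [])
```

Keeping each function separate mirrors the staging idea: the source is never modified, and only transformed rows reach the warehouse.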

Data Warehouse

Subject-Oriented – Data is stored by subjects

Integrated Data – Need to pull together all the relevant data from the various systems

Data from internal operational systems

Data from outside sources

Time-Variant Data – the stored data contains historical values as well as the current values

The user needs data not only about the current purchase but also about past purchases

Nonvolatile Data – Data from the operational systems are moved into the data warehouse at specific intervals

Data Granularity – Data granularity in a data warehouse refers to the level of detail

The lower the level of detail, the finer the data granularity

The lowest level of detail means a lot of data in the data warehouse

Four steps in dimensional modeling

Identify the process being modeled.

Determine the grain at which facts will be stored.

Choose the dimensions.

Identify the numeric measures for the facts.

Components of a star schema

Fact tables contain factual or quantitative data

1:N relationship between dimension tables and fact tables

Dimension tables contain descriptions about the subjects of the business

Dimension tables are denormalized to maximize performance
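A star schema with these components might look like the sketch below, built with Python's standard sqlite3 module. The table names, columns, and sample rows are assumptions chosen for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: denormalized descriptions of business subjects.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, full_date TEXT, month TEXT);

-- Fact table: numeric measures plus one foreign key per dimension (1:N).
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key INTEGER REFERENCES dim_date(date_key),
    units INTEGER,
    revenue REAL
);

INSERT INTO dim_product VALUES (1, 'Treadmill', 'Equipment'), (2, 'Towel', 'Accessories');
INSERT INTO dim_date VALUES (20240101, '2024-01-01', '2024-01'), (20240102, '2024-01-02', '2024-01');
INSERT INTO fact_sales VALUES (1, 20240101, 1, 900.0), (2, 20240101, 3, 30.0), (2, 20240102, 2, 20.0);
""")

# A typical star-join query: facts summed by a dimension attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```

Because the category sits directly in the product dimension, the query needs only one join per dimension, which is the performance argument for denormalizing.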

Slowly changing dimensions

Are the Customer and Product Dim independent of Time Dim?

Changes in names, family status, product district/region

How can these changes be handled without affecting the historical record? E.g., insurance

3 suggestions for slowly changing dimensions

Type 1: overwrite the old values; no accurate tracking of history is needed; easy to implement

Type 2: create a new record at the time of change; this partitions the history into old and new descriptions

Type 3: add new “current” fields when there is a legitimate need to track both old and new states (“original” and “current” values); intermediate values are lost
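The three approaches can be contrasted in a short sketch. The customer rows, column names, and the use of an open-ended end_date to mark the current row are assumptions made for illustration.

```python
from datetime import date

def scd_type1(rows, customer_id, new_city):
    """Type 1: overwrite the old value; no history is kept."""
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city

def scd_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    current = next(
        r for r in rows
        if r["customer_id"] == customer_id and r["end_date"] is None
    )
    current["end_date"] = change_date
    rows.append({
        "customer_key": max(r["customer_key"] for r in rows) + 1,
        "customer_id": customer_id,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,
    })

def scd_type3(row, new_city):
    """Type 3: keep 'original' and 'current' columns; intermediate values are lost."""
    row["current_city"] = new_city

t1 = [{"customer_id": "C1", "city": "Boston"}]
scd_type1(t1, "C1", "Denver")

t2 = [{"customer_key": 1, "customer_id": "C1", "city": "Boston",
       "start_date": date(2020, 1, 1), "end_date": None}]
scd_type2(t2, "C1", "Denver", date(2024, 6, 1))

t3 = {"customer_id": "C1", "original_city": "Boston", "current_city": "Boston"}
scd_type3(t3, "Chicago")   # intermediate value
scd_type3(t3, "Denver")    # Chicago is no longer recoverable
```

With Type 2, older fact rows keep pointing at customer_key 1 and so retain the Boston description, while new facts use key 2.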

Junk Dimensions

Leave the flags in the fact tables

likely sparse data

no real browse entry capability

can significantly increase the size of the fact table

Remove the attributes from the design

potentially critical information will be lost

if they provide no relevance, remove them

Make each flag into its own dimension

may greatly increase the number of dimensions, increasing the size of the fact table

can clutter and confuse the design

Combine all relevant flags, etc. into a single dimension

the number of possibilities remains finite

information is retained
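The last option, combining the flags into a single junk dimension, can be sketched as follows. The three flags and their values are hypothetical; the point is that the number of combinations stays small and each fact row stores one key instead of several flags.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise sit in the fact table.
flag_values = {
    "payment_type": ["cash", "card"],
    "is_gift": [False, True],
    "is_rush": [False, True],
}

# One junk-dimension row per combination of flag values: 2 * 2 * 2 = 8 rows.
junk_dimension = {
    combo: key
    for key, combo in enumerate(product(*flag_values.values()), start=1)
}

def junk_key(payment_type, is_gift, is_rush):
    """Return the single surrogate key a fact row stores in place of three flags."""
    return junk_dimension[(payment_type, is_gift, is_rush)]

fact_row = {"order_id": 42, "amount": 19.9, "junk_key": junk_key("card", True, False)}
```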

The Monster Dimension

It is a compromise

Avoids creating copies of dimension records in a significantly large dimension

Done to manage space and changes efficiently

3 types of multidimensional data

Input multidimensional data, copied in from external sources

Pre-calculated, stored results derived from it

on-the-fly results, calculated as required at run-time, but not stored in a database

Aggregation

The system uses physically stored aggregates as a way to enhance performance of common queries.

These aggregates, like indexes, are chosen silently by the database if they are physically present.

End users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate

At higher levels of aggregation, the sparsity percentage goes down, eventually reaching 100% occupancy
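The effect of aggregation on sparsity can be checked with a toy example. The facts, products, and days below are made up; occupancy is computed as filled cells divided by possible cells at each level.

```python
# Hypothetical base-level facts: (product, day) -> units sold.
facts = {
    ("apple", "2024-01-01"): 5,
    ("apple", "2024-01-03"): 2,
    ("pear", "2024-01-02"): 1,
    ("pear", "2024-02-01"): 4,
}
products = ["apple", "pear"]
days = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-02-01"]
months = sorted({d[:7] for d in days})

def occupancy(cells, n_possible):
    """Fraction of the possible cells that actually hold a value."""
    return len(cells) / n_possible

# Base level: product x day, 4 of 8 cells filled.
base_occupancy = occupancy(facts, len(products) * len(days))

# First aggregate: roll days up to months.
monthly = {}
for (prod, day), units in facts.items():
    key = (prod, day[:7])
    monthly[key] = monthly.get(key, 0) + units
monthly_occupancy = occupancy(monthly, len(products) * len(months))

# Higher aggregate: roll products up to a single total per month.
by_month = {}
for (prod, month), units in monthly.items():
    by_month[month] = by_month.get(month, 0) + units
total_occupancy = occupancy(by_month, len(months))
```

In this sample the occupancy rises from 0.5 to 0.75 to 1.0 as the level of aggregation goes up.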

Data Extraction

Two major types of data extractions from the source operational systems

“as is” (static) data and data of revision

“as is” or static data is the capture of data at a given point in time

For initial load

Data of revision is known as incremental data capture
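The two extraction types can be sketched as below. The updated_at column is an assumption; real source systems may expose changes through timestamps, logs, or triggers instead.

```python
# Hypothetical source table; `updated_at` is an assumed last-modified column.
source = [
    {"id": 1, "name": "Ann", "updated_at": "2024-01-05"},
    {"id": 2, "name": "Raj", "updated_at": "2024-02-10"},
    {"id": 3, "name": "Mei", "updated_at": "2024-03-01"},
]

def static_capture(rows):
    """'As is' capture: a full snapshot at one point in time (initial load)."""
    return [dict(r) for r in rows]

def incremental_capture(rows, last_extract):
    """Revision capture: only rows changed since the previous extract.

    ISO-formatted date strings compare correctly as plain strings.
    """
    return [dict(r) for r in rows if r["updated_at"] > last_extract]

snapshot = static_capture(source)
changes = incremental_capture(source, "2024-02-01")
```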

Data Quality Issues

Dummy values in fields

Missing data

Unofficial use of fields

Cryptic values

Contradicting values

Reused primary keys

Inconsistent values

Incorrect values

Multipurpose fields

Steps in Data Cleansing

Parsing

Correcting

Standardizing

Matching

Consolidating
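The five cleansing steps can be chained as in this sketch. The correction and abbreviation tables, the record layout, and the exact-match rule are simplifying assumptions; real matching is usually fuzzier.

```python
CORRECTIONS = {"smtih": "smith"}                  # hypothetical known misspellings
ABBREVIATIONS = {"St": "Street", "Rd": "Road"}    # hypothetical address abbreviations

def parse(record):
    """Parsing: split a free-form string into name and address parts."""
    name, address = [part.strip() for part in record.split(",", 1)]
    return {"name": name, "address": address}

def correct(rec):
    """Correcting: replace known misspellings in the name."""
    words = [CORRECTIONS.get(w.lower(), w) for w in rec["name"].split()]
    return {**rec, "name": " ".join(words)}

def standardize(rec):
    """Standardizing: uniform capitalization and expanded abbreviations."""
    address = " ".join(
        ABBREVIATIONS.get(w.rstrip(".").title(), w.title())
        for w in rec["address"].split()
    )
    return {"name": rec["name"].title(), "address": address}

def match_key(rec):
    """Matching: records with the same key are treated as the same entity."""
    return (rec["name"], rec["address"])

def consolidate(records):
    """Consolidating: keep one record per matched entity."""
    merged = {}
    for rec in records:
        merged.setdefault(match_key(rec), rec)
    return list(merged.values())

raw = ["john SMTIH, 12 main st.", "John Smith, 12 Main Street", "Ana Lopez, 5 Oak Rd"]
clean = consolidate([standardize(correct(parse(r))) for r in raw])
```

The first two raw strings only match after correcting and standardizing, which is why those steps come before matching.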

DATA TRANSFORMATION

All the extracted data must be made usable in the data warehouse

The quality of the data in many old legacy systems is less likely to be good enough for the data warehouse

Transformation of source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse

Data warehouse practitioners have attempted to classify data transformations in several ways

Basic Tasks

Set of basic tasks

Selection

Splitting/Joining

Conversion

Summarization

Enrichment
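The basic tasks can be illustrated on a few rows. The field names and lookup table are assumptions, and the mapping of each line to a task is one reading of the list above rather than a fixed definition.

```python
from collections import defaultdict

# Hypothetical extracted rows; all field names are assumptions.
source = [
    {"cust": "Ann Lee", "amt_cents": 1250, "region": "E", "internal_note": "x"},
    {"cust": "Raj Patel", "amt_cents": 800, "region": "W", "internal_note": "y"},
    {"cust": "Ann Lee", "amt_cents": 250, "region": "E", "internal_note": "z"},
]
REGION_NAMES = {"E": "East", "W": "West"}  # assumed decode table

def transform_row(row):
    first, last = row["cust"].split(" ", 1)      # splitting one field into two
    amount = row["amt_cents"] / 100              # conversion: cents to currency units
    return {                                     # selection: internal_note is not carried over
        "first_name": first,
        "last_name": last,
        "amount": amount,
        "region": REGION_NAMES[row["region"]],   # conversion: decode a cryptic value
        "order_size": "large" if amount >= 10 else "small",  # enrichment: derived field
    }

rows = [transform_row(r) for r in source]

# Summarization: total amount per region.
totals = defaultdict(float)
for r in rows:
    totals[r["region"]] += r["amount"]
totals = dict(totals)
```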

Initial Load

Load mode

Incremental Loads

Constructive merge mode

Type 1 slowly changing dimension: destructive merge mode

Full Refresh

Load and append modes are applicable
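The four modes named above can be sketched on lists of rows. The definitions used here follow the common textbook reading (load wipes and replaces, append adds, destructive merge overwrites on a key match, constructive merge keeps the old row and marks the new one as superseding it); the key and current fields are assumptions.

```python
def load(target, incoming):
    """Load: wipe the target and apply the incoming rows."""
    return [dict(r) for r in incoming]

def append(target, incoming):
    """Append: keep the existing rows and add the incoming rows after them."""
    return [dict(r) for r in target] + [dict(r) for r in incoming]

def destructive_merge(target, incoming):
    """Destructive merge: overwrite rows whose key matches; add the rest."""
    by_key = {r["key"]: dict(r) for r in target}
    for r in incoming:
        by_key[r["key"]] = dict(r)
    return list(by_key.values())

def constructive_merge(target, incoming):
    """Constructive merge: keep matching old rows, add the incoming rows,
    and flag the incoming rows as the current versions."""
    incoming_keys = {r["key"] for r in incoming}
    result = [
        {**r, "current": r.get("current", True) and r["key"] not in incoming_keys}
        for r in target
    ]
    result += [{**r, "current": True} for r in incoming]
    return result

target = [{"key": 1, "city": "Boston"}, {"key": 2, "city": "Austin"}]
incoming = [{"key": 2, "city": "Denver"}, {"key": 3, "city": "Miami"}]

loaded = load(target, incoming)
appended = append(target, incoming)
overwritten = destructive_merge(target, incoming)
versioned = constructive_merge(target, incoming)
```

Destructive merge behaves like a Type 1 change (Austin is gone), while constructive merge preserves the old row alongside the new one.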

OLAP defined:

On-line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user

Users need the ability to perform multidimensional analysis with complex calculations

The basic virtues of OLAP

Enables analysts, executives, and managers to gain useful insights from the presentation of data

Can reorganize metrics along several dimensions and allow data to be viewed from different perspectives

Supports multidimensional analysis

Is able to drill down or roll up within each dimension
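Roll-up and drill-down amount to aggregating the same facts over fewer or more dimension levels. The sketch below uses made-up sales facts and a hypothetical aggregate helper; an OLAP engine would do the same over stored or pre-calculated cubes.

```python
from collections import defaultdict

# Hypothetical sales facts: (year, quarter, region, amount).
sales = [
    (2023, "Q1", "East", 100),
    (2023, "Q2", "East", 150),
    (2023, "Q1", "West", 80),
    (2024, "Q1", "East", 120),
]
COLUMNS = ("year", "quarter", "region")

def aggregate(facts, dims):
    """Sum the amount over the chosen dimensions; fewer dimensions means rolled up."""
    positions = [COLUMNS.index(d) for d in dims]
    out = defaultdict(int)
    for row in facts:
        out[tuple(row[i] for i in positions)] += row[-1]
    return dict(out)

by_year = aggregate(sales, ["year"])                      # rolled up
by_year_quarter = aggregate(sales, ["year", "quarter"])   # drilled down within time
by_region = aggregate(sales, ["region"])                  # the same metric from another perspective
```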

BUSINESS METADATA

Is like a roadmap or an easy-to-use information directory showing the contents and how to get to them

How can I sign onto and connect with the data warehouse?

Which parts of the data warehouse can I access?

Can I see all the attributes from a specific table?

What are the definitions of the attributes I need in my query?

Are there any queries and reports already predefined to give the results I need?

TECHNICAL METADATA

Technical metadata is meant for the IT staff responsible for the development and administration of the data warehouse

Technical metadata is like a support guide for the IT professionals to build, maintain, and administer the data warehouse

Physical Design Objectives

Improve Performance

In OLTP, response times of 1-2 seconds at most are expected; in a DW, seconds to minutes are acceptable

Ensure scalability

Manage storage

Provide Ease of Administration

Design for Flexibility.

Physical Design Steps

Develop Standards

Create Aggregates Plan

Determine Data Partitioning

Establish Clustering Options

Prepare Indexing Strategy

Assign storage structures

Partitioning

Breaking data into several physical units that can be handled separately

Not a question of whether to do it in data warehouses but how to do it

Granularity and partitioning are key to effective implementation of a warehouse

Partitions are spread across multiple disks to boost performance

Why Partition?

Flexibility in managing data

Smaller physical units allow

easy restructuring

free indexing

sequential scans if needed

easy reorganization

easy recovery

easy monitoring

Improve performance

Criterion for Partitioning

Vertically (groups of selected columns together; more typical in dimension tables)

Horizontally (e.g., recent events and past history; typical in fact tables)
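Both criteria can be shown on a small table held as a list of rows. The column names and the cutoff date are assumptions made for illustration.

```python
rows = [
    {"id": 1, "sale_date": "2022-11-03", "amount": 40.0, "note": "old order"},
    {"id": 2, "sale_date": "2024-05-17", "amount": 15.5, "note": "recent"},
    {"id": 3, "sale_date": "2024-06-01", "amount": 99.0, "note": "recent"},
]

def horizontal_partition(table, cutoff):
    """Split rows into past history and recent events by date."""
    history = [r for r in table if r["sale_date"] < cutoff]
    recent = [r for r in table if r["sale_date"] >= cutoff]
    return history, recent

def vertical_partition(table, key, columns):
    """Split columns into two groups; both keep the key so rows can be rejoined."""
    selected = [{key: r[key], **{c: r[c] for c in columns}} for r in table]
    rest = [
        {k: v for k, v in r.items() if k == key or k not in columns}
        for r in table
    ]
    return selected, rest

history, recent = horizontal_partition(rows, "2024-01-01")
hot, cold = vertical_partition(rows, "id", ["amount"])
```

Each resulting unit can then be indexed, reorganized, or recovered separately, which is the flexibility argument made above.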

Parallelization

The argument goes:

If your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (parallel processing).

Oracle uses this strategy in its warehousing products.

Indexing

Structure separate from the table data it refers to, storing the location of rows in the database based on the column values specified when the index is created.

Indexes are used in a data warehouse to improve throughput

Indexing and loading

Indexing for large tables

B-tree characteristics:

Balanced

Bushy: multi-way tree

Block-oriented

Dynamic

Bitmap Index

Bitmap indices are a special type of index designed for efficient querying on multiple keys

Records in a relation are assumed to be numbered sequentially from, say, 0

Given a number n it must be easy to retrieve record n

Particularly easy if records are of fixed size

Applicable on attributes that take on a relatively small number of distinct values

E.g. gender, country, state, …

E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)

A bitmap is simply an array of bits

In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute

Bitmap has as many bits as records

In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise
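The description above translates into a short sketch. Here each bitmap is held as a Python integer whose bit n stands for record n; that storage choice, the sample records, and the level codes L1 and L2 are assumptions.

```python
def build_bitmap_index(records, attribute):
    """Return {value: bitmap}, where bit n is 1 if record n has that value."""
    index = {}
    for n, record in enumerate(records):
        value = record[attribute]
        index[value] = index.get(value, 0) | (1 << n)
    return index

def matching_records(bitmap):
    """List the record numbers whose bit is set."""
    return [n for n in range(bitmap.bit_length()) if (bitmap >> n) & 1]

# Records are numbered sequentially from 0 by their position in the list.
records = [
    {"gender": "f", "income_level": "L1"},
    {"gender": "m", "income_level": "L2"},
    {"gender": "f", "income_level": "L2"},
    {"gender": "f", "income_level": "L1"},
]
by_gender = build_bitmap_index(records, "gender")
by_income = build_bitmap_index(records, "income_level")

# A query on multiple keys becomes a bitwise AND of the two bitmaps.
female_l1 = matching_records(by_gender["f"] & by_income["L1"])
```

The AND touches only the bitmaps, not the records, which is why this index suits attributes with few distinct values.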

Clustering

The technique involves placing and managing related units of data to be retrieved in the same physical block of storage

This arrangement causes related units of data to be retrieved together in one single operation

In a clustering index, the order of the rows is close to the index order. Close means that physical records containing rows will not have to be accessed more than one time if the index is accessed sequentially

DW Deployment

Major deployment activities

Complete user acceptance

Perform initial loads

Get user desktops ready

Complete initial user training

Institute initial user support

Deploy in stages

DW Growth & Maintenance

Monitoring the DW

Collection of Stats

Usage of Stats

For growth planning

For fine tuning

User training

Data Content

Applications & Tools

Dimensional Modeling Exercise

Exercise: Create a star schema diagram that will enable FIT-WORLD GYM INC. to analyze their revenue.

- The fact table will include: for every instance of revenue taken – attribute(s) useful for analyzing revenue.

- The star schema will include all dimensions that can be useful for analyzing revenue.

- The only data sources available are shown below.

SOURCE 1

“FIT-WORLD GYM” Operational Database: ER-Diagram and the tables based on it (with data)

SOLUTION
