A Guide Into Business Intelligence Studies Information Technology Essay
Data Warehousing: Integration of data from multiple sources into large warehouses and support of on-line analytical processing and business decision making
DW vs. Operational Databases
Data Warehouse
Subject Oriented
Integrated
Nonvolatile
Time variant
Ad hoc retrieval
Operational Databases
Application oriented
Limited integration
Continuously updated
Current data values only
Predictable retrieval
Data Warehouse: a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management’s decision-making process.
Data Mart
A monothematic data warehouse
Department-oriented or business-line-oriented
Top-Down Approach
Advantages
A truly corporate effort, an enterprise view of data
Inherently architected – not a union of disparate data marts
Single, central storage of data about the content
Centralized rules and control
May see quick results if implemented with iterations
Disadvantages
Takes longer to build even with an iterative method
High exposure/risk to failure
Needs high level of cross-functional skills
High outlay without proof of concept
Bottom-Up Approach
Advantages
Faster and easier implementation of manageable pieces
Favorable return on investment and proof of concept
Less risk of failure
Inherently incremental; can schedule important data marts first
Allows project team to learn and grow
Disadvantages
Each data mart has its own narrow view of data
Permeates redundant data in every data mart
Perpetuates inconsistent and irreconcilable data
Proliferates unmanageable interfaces
Data Staging Component
Three major functions need to be performed for getting the data ready (ETL)
extract the data
transform the data
and then load the data into the data warehouse storage
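The three staging functions above can be sketched as a tiny pipeline. This is only an illustration under stated assumptions: the source rows, the `orders` table, and the function names `extract`, `transform`, and `load` are all hypothetical, and an in-memory SQLite database stands in for the data warehouse storage.

```python
import sqlite3

# Hypothetical rows standing in for an extract from an operational system.
SOURCE_ROWS = [
    {"order_id": 1, "customer": " alice ", "amount": "19.90"},
    {"order_id": 2, "customer": "BOB", "amount": "5.00"},
]

def extract():
    """Extract: read the raw rows from the source (here, an in-memory list)."""
    return list(SOURCE_ROWS)

def transform(rows):
    """Transform: trim and normalize names, convert amounts to numbers."""
    return [
        (r["order_id"], r["customer"].strip().title(), float(r["amount"]))
        for r in rows
    ]

def load(conn, rows):
    """Load: write the transformed rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

conn = sqlite3.connect(":memory:")  # assumption: SQLite stands in for the warehouse
load(conn, transform(extract()))
loaded = conn.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
```

Keeping the three functions separate mirrors the staging component: each step can be rerun or replaced without touching the other two.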
Data Warehouse
Subject-Oriented – Data is stored by subjects
Integrated Data – Need to pull together all the relevant data from the various systems
Data from internal operational systems
Data from outside sources
Time-Variant Data – the stored data contains historical values as well as the current values
The user needs data not only about the current purchase but also about past purchases
Nonvolatile Data – Data from the operational systems are moved into the data warehouse at specific intervals
Data Granularity – Data granularity in a data warehouse refers to the level of detail
The lower the level of detail, the finer the data granularity
The lowest level of detail means a lot of data in the data warehouse
Four steps in dimensional modeling
Identify the process being modeled.
Determine the grain at which facts will be stored.
Choose the dimensions.
Identify the numeric measures for the facts.
Components of a star schema
Fact tables contain factual or quantitative data
1:N relationship between dimension tables and fact tables
Dimension tables contain descriptions about the subjects of the business
Dimension tables are denormalized to maximize performance
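A minimal sketch of these components, assuming a hypothetical sales process: the table names (`product_dim`, `date_dim`, `sales_fact`), their columns, and the sample rows are invented for illustration. Each dimension row can be referenced by many fact rows (the 1:N relationship), and the fact table holds only keys and numeric measures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive, denormalized attributes.
CREATE TABLE product_dim (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT
);
CREATE TABLE date_dim (
    date_key  INTEGER PRIMARY KEY,
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);
-- Fact table: foreign keys to the dimensions plus numeric measures.
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    date_key     INTEGER REFERENCES date_dim(date_key),
    units_sold   INTEGER,
    sales_amount REAL
);
""")
conn.executemany(
    "INSERT INTO product_dim VALUES (?, ?, ?)",
    [(1, "Treadmill", "Cardio"), (2, "Dumbbell", "Strength")],
)
conn.executemany(
    "INSERT INTO date_dim VALUES (?, ?, ?, ?)",
    [(20240101, "2024-01-01", "2024-01", 2024),
     (20240201, "2024-02-01", "2024-02", 2024)],
)
conn.executemany(
    "INSERT INTO sales_fact VALUES (?, ?, ?, ?)",
    [(1, 20240101, 2, 1000.0), (1, 20240201, 1, 500.0), (2, 20240101, 10, 200.0)],
)

# A typical star join: constrain or group by dimension attributes, sum the facts.
sales_by_category = conn.execute("""
    SELECT p.category, SUM(f.sales_amount)
    FROM sales_fact f
    JOIN product_dim p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
```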
Slowly changing dimensions
Are the Customer and Product Dim independent of Time Dim?
Changes in names, family status, product district/region
How should these changes be handled so that they do not distort the recorded history? E.g., insurance
3 suggestions for slowly changing dimensions
Type 1: overwrite/erase the old values; no accurate tracking of history is needed; easy to implement
Type 2: create a new record at the time of change, partitioning the history into old and new descriptions
Type 3: add new “current” fields when there is a legitimate need to track both old and new states (“original” and “current” values); intermediate values are lost
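The three suggestions can be sketched on a small customer dimension held as a list of dictionaries. This is an illustrative sketch, not a standard implementation: the field names (`customer_key`, `customer_id`, `city`, `start_date`, `end_date`, `original_city`) and the function names are assumptions made for the example.

```python
from copy import deepcopy

def apply_type1(rows, customer_id, new_city):
    """Type 1: overwrite in place; the old value is lost."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row["city"] = new_city
    return rows

def apply_type2(rows, customer_id, new_city, change_date):
    """Type 2: close the current row and add a new row with a new surrogate key."""
    rows = deepcopy(rows)
    next_key = max(row["customer_key"] for row in rows) + 1
    for row in rows:
        if row["customer_id"] == customer_id and row["end_date"] is None:
            row["end_date"] = change_date
    rows.append({
        "customer_key": next_key,
        "customer_id": customer_id,
        "city": new_city,
        "start_date": change_date,
        "end_date": None,
    })
    return rows

def apply_type3(rows, customer_id, new_city):
    """Type 3: keep the original value in its own field; overwrite the current one."""
    rows = deepcopy(rows)
    for row in rows:
        if row["customer_id"] == customer_id:
            row.setdefault("original_city", row["city"])  # set only on first change
            row["city"] = new_city
    return rows

base = [{
    "customer_key": 1, "customer_id": "C100", "city": "Boston",
    "start_date": "2020-01-01", "end_date": None,
}]
type1 = apply_type1(base, "C100", "Denver")
type2 = apply_type2(base, "C100", "Denver", "2023-06-01")
type3 = apply_type3(apply_type3(base, "C100", "Denver"), "C100", "Austin")
```

After two successive Type 3 changes the row keeps "Boston" as the original and "Austin" as the current value, while the intermediate "Denver" is gone, which is the loss noted above.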
Junk Dimensions
Leave the flags in the fact tables
likely sparse data
no real browse entry capability
can significantly increase the size of the fact table
Remove the attributes from the design
potentially critical information will be lost
if they provide no relevance, remove them
Make each flag into its own dimension
may greatly increase the number of dimensions, increasing the size of the fact table
can clutter and confuse the design
Combine all relevant flags, etc. into a single dimension
the number of possibilities remains finite
information is retained
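The last option, combining the flags into a single dimension, can be sketched as follows. The flag names and values are hypothetical; the point is that the combinations are finite and can be enumerated up front, so each fact row carries one small key instead of several flag columns.

```python
from itertools import product

# Hypothetical low-cardinality flags that would otherwise sit in the fact table.
FLAGS = {
    "payment_type": ["cash", "card"],
    "is_gift": [False, True],
    "is_rush": [False, True],
}

def build_junk_dimension(flags):
    """Enumerate every flag combination and give each one a surrogate key."""
    names = list(flags)
    rows = {}
    for key, combo in enumerate(product(*flags.values()), start=1):
        rows[combo] = key
    return names, rows

names, junk_dim = build_junk_dimension(FLAGS)

def junk_key(**flag_values):
    """Look up the single junk-dimension key for one fact row's flags."""
    return junk_dim[tuple(flag_values[n] for n in names)]

first_key = junk_key(payment_type="cash", is_gift=False, is_rush=False)
last_key = junk_key(payment_type="card", is_gift=True, is_rush=True)
```

With two values per flag and three flags, the dimension has 2 × 2 × 2 = 8 rows regardless of how many fact rows exist.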
The Monster Dimension
It is a compromise
Avoids creating copies of dimension records in a significantly large dimension
Done to manage space and changes efficiently
3 types of multidimensional data
Input data, copied in from external sources
Pre-calculated, stored results derived from the input data
On-the-fly results, calculated as required at run time but not stored in a database
Aggregation
The system uses physically stored aggregates as a way to enhance performance of common queries.
These aggregates, like indexes, are chosen silently by the database if they are physically present.
End users and application developers do not need to know what aggregates are available at any point in time, and applications are not required to explicitly code the name of an aggregate
At higher levels of aggregation, sparsity decreases and occupancy rises, eventually reaching 100%
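The effect of aggregation level on sparsity can be checked with a small calculation. The facts, dimension members, and the `occupancy` helper below are invented for illustration; occupancy is taken here to mean the share of possible dimension combinations that actually hold data.

```python
# Hypothetical facts at the lowest grain: (product, store, day).
facts = {
    ("P1", "S1", 1),
    ("P1", "S2", 3),
    ("P2", "S1", 2),
    ("P2", "S2", 4),
}
PRODUCTS = ["P1", "P2"]
STORES = ["S1", "S2"]
DAYS = [1, 2, 3, 4]

def occupancy(cells, *dimensions):
    """Fraction of possible dimension combinations that are actually populated."""
    possible = 1
    for members in dimensions:
        possible *= len(members)
    return len(cells) / possible

# Lowest grain: 4 populated cells out of 2 * 2 * 4 = 16 possible.
base_occupancy = occupancy(facts, PRODUCTS, STORES, DAYS)

# Aggregate away the store: 4 populated cells out of 2 * 4 = 8 possible.
product_day_occupancy = occupancy({(p, d) for p, _, d in facts}, PRODUCTS, DAYS)

# Aggregate away the day: 4 populated cells out of 2 * 2 = 4 possible.
product_store_occupancy = occupancy({(p, s) for p, s, _ in facts}, PRODUCTS, STORES)
```

In this toy data set occupancy rises from 25% at the lowest grain to 50% and then 100% as dimensions are aggregated away.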
Data Extraction
Two major types of data extractions from the source operational systems
“as is” (static) data and data of revision
“as is” or static data is the capture of data at a given point in time
For initial load
Data of revision is known as incremental data capture
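The two extraction types can be contrasted in a short sketch. It assumes, purely for illustration, that each source row carries an `updated_at` timestamp in ISO format; real systems may instead rely on transaction logs, triggers, or file comparison to find revisions.

```python
# Hypothetical source table; `updated_at` is an assumed last-modified column.
source = [
    {"id": 1, "name": "Ann", "updated_at": "2024-01-05"},
    {"id": 2, "name": "Raj", "updated_at": "2024-02-10"},
    {"id": 3, "name": "Mei", "updated_at": "2024-03-01"},
]

def static_capture(rows):
    """'As is' capture: a full snapshot at one point in time (e.g. initial load)."""
    return [dict(r) for r in rows]

def incremental_capture(rows, last_extract):
    """Data of revision: only rows changed since the previous extraction.

    ISO-formatted dates compare correctly as plain strings.
    """
    return [dict(r) for r in rows if r["updated_at"] > last_extract]

snapshot = static_capture(source)
revisions = incremental_capture(source, last_extract="2024-02-01")
```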
Data Quality Issues
Dummy values in fields
Missing data
Unofficial use of fields
Cryptic values
Contradicting values
Reused primary keys
Inconsistent values
Incorrect values
Multipurpose fields
Steps in Data Cleansing
Parsing
Correcting
Standardizing
Matching
Consolidating
DATA TRANSFORMATION
All the extracted data must be made usable in the data warehouse
The quality of the data in many old legacy systems is less likely to be good enough for the data warehouse
Transformation of source data encompasses a wide variety of manipulations to change all the extracted source data into usable information to be stored in the data warehouse
Data warehouse practitioners have attempted to classify data transformations in several ways
Basic Tasks
Set of basic tasks
Selection
Splitting/Joining
Conversion
Summarization
Enrichment
Loading
Initial Load
Load mode
Incremental Loads
Constructive merge mode
Type 1 slowly changing dimension: destructive merge mode
Full Refresh
Load and append modes are applicable
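The four modes named above can be sketched on in-memory tables. The outline does not define the modes, so the behavior below follows the way they are commonly described and should be read as an assumption; the `key` and `superseded` field names are hypothetical.

```python
def load(target, incoming):
    """Load: discard whatever is in the target and load the incoming rows."""
    return [dict(r) for r in incoming]

def append(target, incoming):
    """Append: keep the existing rows and add all incoming rows."""
    return [dict(r) for r in target] + [dict(r) for r in incoming]

def destructive_merge(target, incoming):
    """Destructive merge: overwrite rows whose key matches; add the rest."""
    merged = {r["key"]: dict(r) for r in target}
    for r in incoming:
        merged[r["key"]] = dict(r)
    return list(merged.values())

def constructive_merge(target, incoming):
    """Constructive merge: keep matching old rows, add the new rows,
    and mark the old rows as superseded."""
    incoming_keys = {r["key"] for r in incoming}
    result = [dict(r, superseded=r["key"] in incoming_keys) for r in target]
    result += [dict(r, superseded=False) for r in incoming]
    return result

target = [{"key": 1, "city": "Boston"}, {"key": 2, "city": "Austin"}]
incoming = [{"key": 2, "city": "Denver"}, {"key": 3, "city": "Miami"}]

loaded = load(target, incoming)
appended = append(target, incoming)
destructive = destructive_merge(target, incoming)
constructive = constructive_merge(target, incoming)
```

Destructive merge leaves one row per key, which matches Type 1 behavior, while constructive merge preserves the old row alongside the new one.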
OLAP defined:
On-line Analytical Processing (OLAP) is a category of software technology that enables analysts, managers and executives to gain insight into data through fast, consistent, interactive access in a wide variety of possible views of information that has been transformed from raw data to reflect the real dimensionality of the enterprise as understood by the user
Users need the ability to perform multidimensional analysis with complex calculations
The basic virtues of OLAP
Enables analysts, executives, and managers to gain useful insights from the presentation of data
Can reorganize metrics along several dimensions and allow data to be viewed from different perspectives
Supports multidimensional analysis
Is able to drill down or roll up within each dimension
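Roll-up, drill-down, and viewing the same metric from another perspective can be illustrated with one small aggregation helper. The sales rows and the `aggregate` function are hypothetical; an OLAP server would do the same work against stored cubes or star schemas.

```python
from collections import defaultdict

# Hypothetical sales facts with a time hierarchy (year > quarter) and a region.
sales = [
    {"year": 2024, "quarter": "Q1", "region": "East", "amount": 100},
    {"year": 2024, "quarter": "Q1", "region": "West", "amount": 150},
    {"year": 2024, "quarter": "Q2", "region": "East", "amount": 120},
    {"year": 2024, "quarter": "Q2", "region": "West", "amount": 80},
]

def aggregate(rows, *dims):
    """Sum the amount metric grouped by the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)

rolled_up = aggregate(sales, "year")                # roll up to the year level
drilled_down = aggregate(sales, "year", "quarter")  # drill down to quarters
by_region = aggregate(sales, "region")              # same metric, other perspective
```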
BUSINESS METADATA
Business metadata is like a roadmap or an easy-to-use information directory showing the contents of the data warehouse and how to get to them
How can I sign onto and connect with the data warehouse?
Which parts of the data warehouse can I access?
Can I see all the attributes from a specific table?
What are the definitions of the attributes I need in my query?
Are there any queries and reports already predefined to give the results I need?
TECHNICAL METADATA
Technical metadata is meant for the IT staff responsible for the development and administration of the data warehouse
Technical metadata is like a support guide for the IT professionals to build, maintain, and administer the data warehouse
Physical Design Objectives
Improve Performance
Response-time expectations: 1-2 seconds at most in OLTP; seconds to minutes in a data warehouse
Ensure scalability
Manage storage
Provide Ease of Administration
Design for Flexibility.
Physical Design Steps
Develop Standards
Create Aggregates Plan
Determine Data Partitioning
Establish Clustering Options
Prepare Indexing Strategy
Assign storage structures
Partitioning
Breaking data into several physical units that can be handled separately
Not a question of whether to do it in data warehouses but how to do it
Granularity and partitioning are key to effective implementation of a warehouse
Partitions are spread across multiple disks to boost performance
Why Partition?
Flexibility in managing data
Smaller physical units allow
easy restructuring
free indexing
sequential scans if needed
easy reorganization
easy recovery
easy monitoring
Improve performance
Criterion for Partitioning
Vertically (groups of selected columns together. More typical in dimension tables)
Horizontally (e.g. recent events and past history. Typical in fact tables)
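Both criteria can be sketched on in-memory rows. The column names, the cutoff date, and the helper functions are assumptions for the example; in a real warehouse the database's own partitioning features would do this work.

```python
def partition_horizontally(rows, cutoff):
    """Split rows into recent events and past history by date (fact tables)."""
    recent = [r for r in rows if r["date"] >= cutoff]
    history = [r for r in rows if r["date"] < cutoff]
    return recent, history

def partition_vertically(rows, key, column_groups):
    """Split columns into groups, repeating the key in each group
    so the pieces can be joined back together (dimension tables)."""
    return [
        [{c: r[c] for c in (key, *group)} for r in rows]
        for group in column_groups
    ]

# Hypothetical fact rows, split at the start of 2024.
facts = [
    {"date": "2023-11-30", "amount": 10},
    {"date": "2024-02-01", "amount": 20},
]
recent, history = partition_horizontally(facts, cutoff="2024-01-01")

# Hypothetical customer dimension, split into two column groups.
customers = [{"customer_key": 1, "name": "Ann", "city": "Boston", "segment": "Gold"}]
contact_part, marketing_part = partition_vertically(
    customers, "customer_key", [["name", "city"], ["segment"]]
)
```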
Parallelization
The argument goes:
if your main problem is that your queries run too slowly, use more than one machine at a time to make them run faster (Parallel Processing).
Oracle uses this strategy in its warehousing products.
Indexing
Structure separate from the table data it refers to, storing the location of rows in the database based on the column values specified when the index is created.
They are used in data warehouse to improve warehouse throughput
Indexing and loading
Indexing for large tables
B-tree characteristics:
Balanced
Bushy: multi-way tree
Block-oriented
Dynamic
Bitmap Index
Bitmap indices are a special type of index designed for efficient querying on multiple keys
Records in a relation are assumed to be numbered sequentially from, say, 0
Given a number n it must be easy to retrieve record n
Particularly easy if records are of fixed size
Applicable on attributes that take on a relatively small number of distinct values
E.g. gender, country, state, …
E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)
A bitmap is simply an array of bits
In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute
Bitmap has as many bits as records
In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise
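A minimal sketch of this simplest form, using Python lists of 0/1 as bitmaps. The records and attribute values are made up; production systems store bitmaps as packed, often compressed, bit arrays. A query on multiple keys becomes a bitwise AND of the relevant bitmaps.

```python
def build_bitmap_index(records, attribute):
    """One bitmap per distinct value; bit i is 1 if record i has that value."""
    index = {}
    for position, record in enumerate(records):
        value = record[attribute]
        index.setdefault(value, [0] * len(records))[position] = 1
    return index

def bitmap_and(a, b):
    """Intersect two bitmaps of equal length."""
    return [x & y for x, y in zip(a, b)]

# Hypothetical records, numbered sequentially from 0.
records = [
    {"gender": "f", "income_level": "L1"},
    {"gender": "m", "income_level": "L2"},
    {"gender": "f", "income_level": "L2"},
    {"gender": "m", "income_level": "L1"},
    {"gender": "f", "income_level": "L2"},
]

gender_index = build_bitmap_index(records, "gender")
income_index = build_bitmap_index(records, "income_level")

# Records with gender = "f" AND income_level = "L2".
matches = bitmap_and(gender_index["f"], income_index["L2"])
matching_record_numbers = [i for i, bit in enumerate(matches) if bit]
```

Because records are numbered from 0, the positions of the 1 bits in the result are exactly the record numbers to fetch.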
Clustering
The technique involves placing and managing related units of data to be retrieved in the same physical block of storage
This arrangement causes related units of data to be retrieved together in one single operation
In a clustering index, the order of the rows is close to the index order. Close means that physical records containing rows will not have to be accessed more than one time if the index is accessed sequentially
DW Deployment
Major deployment activities
Complete user acceptance
Perform initial loads
Get user desktops ready
Complete initial user training
Institute initial user support
Deploy in stages
DW Growth & Maintenance
Monitoring the DW
Collection of Stats
Usage of Stats
For growth planning
For fine tuning
User training
Data Content
Applications & Tools
Dimensional Modeling Exercise
Exercise: Create a star schema diagram that will enable FIT-WORLD GYM INC. to analyze their revenue.
− The fact table will include: for every instance of revenue taken – attribute(s) useful for analyzing revenue.
− The star schema will include all dimensions that can be useful for analyzing revenue.
− The only data sources available are shown below.
SOURCE 1
“FIT-WORLD GYM” Operational Database: ER-Diagram and the tables based on it (with data)
SOLUTION