Data Warehouse Architecture And Implementation Choices Information Technology Essay

Data warehousing is just a product, or collection of products – it is a solution. It is simple and helps us to get a solution for better decision-making, other than that, it is a part of global set (Chuck Ballard, Dirk Herreman 1998). From this point of view, data are extracts from the outside sources, transforms and then loaded into the destination database (data warehouse) repository. As a result, these technologies have been successfully implements in many industries for retail industries they used for user profiling and inventory system management (G.Satyanarayana Reddy et al 2010).

A consistent data store that provides physical implementation and information stores on which a project needs to make strategic decision that supports to decision making data model. Integrating the data from multiple heterogeneous sources, the design should be constructs that supports ad-hoc queries and analytical reports. Data warehouse offer Online Analytical Processing (OLAP) tool to analysis multidimensional data. The performance and functional requirements of OLAP are different, when compared to Online Transaction Processing application (OLTP) it supports operational databases.

The main objectives of this project are to improve the retailer sales more effectively and implement the various features and knowledge into that data warehouse by integrating the Online Analytical Processing technology. While integrating the OLAP technology we generate multidimensional analytical queries efficiently to real time transaction. In this project, OLAP technology plays an important role because it provides valid answers to business and management queries to take better decision making.

1.2 Road map

Chapter 2 Literature review

2.1 Overview

This chapter discusses a framework of the data warehouse technology obtained through researching the various journals and research articles. The important features of the data warehouse technology are Data Warehouse architecture and Implementation choices, Extract Transform Load (ETL), Data Warehouse Design Schema, and Multidimensional Modeling. The gained knowledge is puts into the project and provides a best solution to the today’s organization. The data model will be discuss and analyzed through an investigation study in the researched areas. This literature review will gives the impression and uses of the Online Analytical Processing system in data warehouse.

This section is also involves the existing technologies and tools used effectively for OLAP applications. In this topic, this is not a first project therefore other portion of the research and works done in related topics are studied. Sequentially, understand the problem in detail and look forward into modern emerging technologies used to give a more effective, structure and decision support knowledge for future developers.

2.2 History of data warehouse

In early 1980s, the origin of the data-warehousing concept can be trace, when relational database management system appeared as commercial products. Bill Inmon (1993) the “father of data warehousing” gives definition “A subject-oriented, integrated, non-volatile, time-variant collection of data organized to support management needs”. Ralph Kimball (1996) says, “A copy of transaction data specifically structured for query and analysis”. The ease of relational model, together with query capabilities offer by the SQL language then it is describe as decision support or end-user computing. The surrounding data are extracts from organisation’s database and stored in freshly created database system to supports reporting functions of all types and ad hoc end-user queries.

In the early days, the extracted operational data are snapshot or subsets. A decision support (end – user computing) database system loads the snapshots in regular basis once in a week or month. These snapshots will accumulate in the system with limited number of editions, access will be provides to end user with report tools and query. Decision support system was not much concern in data modelling. Data model for decision support system characteristically coordinated with the operational system data model because snapshots are extracted (Chuck Ballard, 1998).

2.3 Data warehouse & Data mart

A data warehouse is intended for analysis and query, instead of transaction processing. A relational database contains historical data copied from the business data. It allows a business to merge data from other sources and divides analysis workload commencing to transaction workload. The process of collecting and transferring data it into business users it controls a data warehouse frameworks such as ETL, OLAP and other applications. The data warehouse consists of three different types such as data mart, enterprise data warehouse and operational data store. A data warehouse has several dimension tables connects with the fact table that builds a dimension data model.

The main drawbacks in the data warehouse are very expensive system to implement because, it mainly focuses on organizations and data modeling rather than user needs. Therefore, data warehouse needs more source and time to implement because maintenance is currently varying to business situations (Review essays, 2011).

Figure 2.1 Data warehouse and data mart

Data warehouse

Data mart 1

Data mart 2

Data mart n

A data mart is subset of a data warehouse that mainly focuses on particular business sectors such as finance, sales, marketing and managements. A data can be obtained from the data warehouse is known as dependent data mart. A data can be gathered directly from the source is known as independent data mart. A data mart has particular business related function such as evaluating and calculating the sales performance, evaluating the company profits for newly launched product, evaluating the marketing promotions, evaluating and calculating the performance of new company branch (Firestone, 1997).

A data mart has many problems such as data access, size of the data, functionality, scalability and performance similar to any other system. However, in organization data marts can damaged into different departments and mainly concentrate on their individual needs it cause very difficult process in cleansing and data access. In case, organizations has a data marts in each departments consists of production, marketing, promotions, sales, and inventory combines the total profit information’s together in a single data mart can be confusing and irresistible (Review essays, 2011).

2.4 Data warehouse architecture and Implementation choices

There are three common architectural models developed for the implementations of data warehouses.

Top Down Architecture, Top Down Implementation,

Bottom Up Architecture, Bottom Up Implementation, and

Combined Approach. In this section, we look at each of them in detail.

2.4.1 Top down architecture, Top down implementation

Figure 2.2 shows the top down architecture where a data warehouse creates many data marts. These data mart provide faster access to user and user can analysis and generates reports quickly. Users can access the data warehouse without any barrier (Chowdhury & Bikramjit, 2010). At the beginning of the project, a top down implementation requires more designing and planning work. This implementation involves from each organization or business that will participates by the individuals. The data standards, data structures, decision concerning data sources and overall data model need to be complete before the actual implementation begins. This approach involves the most of the enterprise wide or corporate wide data warehouse with higher standard access to the data the business organizations and workgroups. This approach gives better result from the beginning stage such as more consistent data definitions and enforcement of business rules through all the organizations (Chuck Ballard, et al 1998).

Figure 2.2 Top down approach

Data warehouse

Data mart

User analysis and reporting

2.4.2 Bottom up architecture, Bottom up implementation

Figure 2.3 shows the bottom up architecture, union of all the data marts and data warehouse creates the data warehouse. This architecture is mainly use for analysis purpose; specified data can be available if staging process can go through in data warehouse (Chowdhury & Bikramjit, 2010).

Figure 2.3 Bottom up approach

Data warehouse

Data mart

User analysis and reporting

A bottom up implementation requires planning and designing of data marts that should not be waiting for global infrastructure. Initial data mart implementation can expand and builds incrementally. Now, this approach is extensively accept rather than the top down approach, because we get data mart results immediately and used as justification for expanding additional global implementation. Many organizations prefer this approach for faster payback, especially in business management. Data marts have a smaller amount of complex design rather than global data warehouse and it enables faster results. Typically, initial implementation is very costly in source and hardware than organizing the large-scale data warehouse (Chuck Ballard, et al 1998).

2.4.3 Combined Approach

When implementing both the top down or bottom up approach, we notice positive and negative considerations, but in many case combination of the two is best approach. It is difficult to balancing, but a good developer can be done. The main part of this approach is to determine the level of designing and planning is required to support integration for global approach as data marts built in bottom up approach. A data warehouse implementation has many issues to be determined by using the combined approach it can be enable declaration of these issues as they encountered and data mart has less scope rather than global data warehouse. Implementation process could be monitoring carefully and management issues could result the gaining of both the techniques (Chuck Ballard, et al 1998).

2.5 Extract Transform Load (ETL)

According to the industry, approximately 60 – 70% of the data-warehousing project effort is spent for ETL process only the rest of the percentage is spent for analysis and reporting (Vishal et al 2010). ETL is a data integration function that involves extracting the data from original source, quality assuring and cleaning data, conforming the labels and measures, delivering data in a physical format that can be useful for report writers (Kimball 2009). Generally, companies use this process to solve the problem using this extract, transform and load (ETL) technology, which includes reading data from outside source, cleaning it up and formatting it uniformly and then finally loading it into a target repository (data warehouse). The source file of the ETL process comes from any source such as flat file, Excel spreadsheet, Mainframe application, ERP application or a CRM tool (Vishal, et al 2010).

Occasionally, data in staging area is protects to support functionality that requires history while in other times the data is deleted with each process. In staging area history is maintains, it is frequently referred to as a persistent staging area. If each load the data is deletes, the area should be considers as transient (Kimball, 2009).

In ETL process, each system ensures the overall effort efficiently, determines for each component and synchronizes how the tools will be use for all ETL events. Every organization should have ETL expert they can ensure that ETL process have more strength and patience. Three processes can be perform and combined into one tool that will make routine process to pull data out from one database and put it into another database. The database functions should be describes as extract, transform and load shown in the below Figure 2.4. Extraction is the first step; this process reads data from outside source database and extracting a desired subset of data, which includes in this process (Vishal et al 2010). Extracting data different platform is challenging one, as data source can be different DBMS (Kimball 2009). The main steps in ETL system are cleaning and conforming here accurate data can be defined and data should be correct, unambiguous, consistent, and completeness (Kimball 2009).

The second step is transform, here the process of converting the extracted, cleaning, conforming the acquired data from its previous form that can be placed it into another database. Combines with other data, look up tables or rules, transformations will occurs (Vishal et al 2010). In this process, transformation objects can extend and combines to build new, reusable objects, and context independent (Wayne and Colin, 2003). Final step of ETL process is writing the data into target (destination) database (Vishal et al 2010). While implementing the ETL process there are many problems a developer can face such as critical data can be missing in source system, technical challenges like integrating and transforming data from different environments, query performance is poor, difficult to maintain business rule, long load times (Vishal et al 2010).

Figure 2.4 Extract, Transform, Load (ETL)

Sources

Targets

ETL

App 1

App 2

App 3

Flat Files

XML Unstructured

Flat Files

XML

Data warehouse

Data mart

Nowadays organization needs more sources into several of distributed (Business Intelligence) BI applications. To overcome this data flow, ETL needs to develop from batch oriented, extracts and loads data in continuous, capture data in real time. Finally, BI gains a huge part of solutions by integrating data cleansing and profiling capabilities.

2.5.1 Data cleaning

In data warehouse, data cleaning is required when integrating the mixed data sources together with related data transformations. In data warehousing, data cleaning is the very important part of that project and so-called ETL process. Generally, data cleaning is otherwise known as scrubbing or cleansing. In order to improve the data quality, cleansing is used it deals with identifying and removing errors. In files and database, data quality problem are present such as invalid data, duplicate values, data are misspelling, and missing value etc. Multiple data source can be integrate in federated database systems, here cleansing process needs significantly due to source contains redundant data.

Cleansing provide extensive supports for data warehouse in order to improve the quality of data. Huge amount of data can loads and refresh incessantly from a different source, here the source contains some “dirty data”. In case, missing information and duplicate values produces incorrect information. However, data warehouse are use for better decision making, so incorrectness of their data provides wrong conclusions for business and managements (Erhard, 2000).

2.6 Data Warehouse Schema

The next part of this chapter offers strength in significant analysis of the three data warehouse designs. They are Entity relationship schema, Star schema, Snowflake schema; in this section, we can explain in detail. Due to space restriction, other data warehouse designs could not be calculate in the depth. However, in data warehouse designs, there is a lack of research but important works published over the few years back.

2.6.1 Entity Relationship Schema

In entity relationship schema, a data model can generate in the specific areas of interest using the two fundamental concepts in the ER schema are entities and relationships between the individual entities. Attributes also contains in the ER models that could be fits in either the entities or the relationships. However, an entity relationship model is an abstraction tool since it can be use to simplify and understand the unclear data relationship in the difficult system environment and business world.

The key factor of the ER model is describing the unique identifier of an entity is the major critical job. Here candidate key is describes as unique identifier from that select the key which is used to identify the entity it is known as primary key. In the ER diagram, the relationship is denotes with the lines connected between two entities (Chuck, et al 1998).

Figure 2.5 Entity Relationship Schemas

Entity

Relationship

Entity 1

Primary key 1

Entity 2

Primary key 2

Foreign key 1

Entity 4

Primary key 4

Foreign key 4

Primary key 3

Foreign key 2

Foreign key 3

Entity 3

Attribute

Figure 2.5 shows the entity relationship schema it can be use to symbolize entities within an organization. It represents the primary keys and foreign keys relationship between the two entities. In ER schema, the dimension table and fact table do not have any features, which is uniqueness of multidimensional modeling.

The motivation of the ER modeling is to protect the data integrity and store only once; maintenance is easy and provides flexibility, data elements can be stored once, process is achieves by dropping the data redundancy. In ER schema, these models are good for online transaction processing and make it simple transactions as feasible (Yashvardhan, 2011). The major problem in ER model is end user cannot remember or recognized, ad hoc query are not optimized, difficult to browsing, due to complex ER designs many data warehouse have failed.

2.6.2 Star Schema

In Data Warehousing, the most popular design is the star schema it represent multidimensional data. The essential structure for a dimension table is star schema (Chuck, et al 1998). Bringing together the fact table and dimension tables in a star like structure are usually called as the star join schema (Kimball and Margy, 2002). As Figure 2.6 shows, the star schema has only one fact table, which is star like structure connected with group of dimension tables surrounding to that fact table. Each dimension table has one primary key, which is connect to the fact table as foreign key.

Figure 2.6 Star Schema

Dimension table 1

Primary key 1

Dimension table 2

Primary key 2

Dimension table 4

Primary key 4

Dimension table 3

Primary key 3

Fact table

FK 1

FK 2

FK 3

FK 4

The benefit of star schema is simple design, easy to define hierarchy and understand. The star schema is fully denormalized structure when compared to snowflake schema. In snowflake schema, a dimension table have one or more parent table, but in star, there is no parent table. In snowflake schema each attribute will creates a separate table, there is possible for additional joins and performance humiliates. However, it is not possible in star model because hierarchies for the dimension tables are stored itself. The main drawback is huge dimension tables, poor performance in fact table for summary data (Enrico, 2006).

2.6.3 Snowflake Schema

In data warehouse, snowflake is the extremely popular in the data warehouse designs. However, snowflake schema thinks to be cooperated between a star schema and ER schema. Generally, snowflake schema is an extension of the star schema. It is a variation of star schema in which dimension tables structured into a hierarchy by normalizing them. Figures 2.7 shows the snowflake schema in this sketch the star schema is at the center. In snowflake schema, some mechanism in the data warehouse is normalizing up to third normal form. In this diagram, dimension tables hierarchies are splits into one or more sub dimension table (Umashanker, et al 2006).

These sub dimension table can increase the normalized table number and supports different level of hierarchies in the snowflake schema. In 1: M relationship way from the sub dimensional table to the fact table corresponds to the dimensional hierarchy.

Figure 2.7 Snowflake Schema

Dimension 1

Dimension 2

Dimension 5

Fact table

Dimension 3

Dimension 6

Dimension 4

Dimension 8

Dimension 9

Dimension 7

The main advantage of snowflake schema is easy to maintain and increases the flexibility. In several cases, the snowflake schema can increase the performance, as little table are joined. The major drawbacks are end user can work with multiple numbers of tables. Difficult to create the queries, because here we used more tables need to be join (Crag, 2007). In this selection of a modeling technique, most of the database designers do not worried about the space saving to be a most important decision criteria (Chuck, et al 1998).

2.7 Multidimensional Modeling

Kimball established the multidimensional database technology in the year 1998 (). Recently this technology reaches the mass market, currently distributes the multidimensional mechanism along with their relational database with major vendors. This technology has some major advantages in maturity and scalability. The main objective of the multidimensional data model is analyzing, instead of performing the online transactions. This technology is focuses on three major key concepts such as dimensions, cubes and measures and modeling the business rules. Multidimensional model is a main part of decision-making process for analyzing the huge amount of data. A multidimensional database model is establishes on relational factor and dimension relations are developed from the dimensions (Umashanker, et al 2006).

However, in Online Analytical Processing (OLAP) database requires summarize data at several stage of features at different combinations of attributes. In data warehouse, a data cube is creates from a subset of attributes, then it is called as multidimensional modeling. Generally, particular attributes will be select to measure values, while other attributes choose as dimensions (Antoaneta, et al 2004). In business-focused environment, the multidimensional model changes the idea of representation into more details to business sectors. All the key concepts such as cubes, measures and dimensions relate with each other that offer a reporting environment. In multidimensional model, user requires extra calculations rapidly and without any difficult. Typically, multidimensional database systems used for decision support purpose and business analysis process further classified into two divisions. The first process is relational database systems it build multidimensional schemas like snowflake and star schema. The second process is specifically designs for online purpose only (Umashanker, et al 2006).

2.7.1 Goals of Multidimensional Model

Information’s is offering to the end-user that corresponds to usual perceptive business, facts or figures from the different point of views that influence them. It offers physical implementation when the software recognizes the program to access the data easily. In multidimensional database, the basic modeling principles are, first the factual table is analyzed which includes a few primary storage area and facts. Dimension table links with primary code in the primary yards, while in fact data will represents the real measured values (Umashanker, et al 2006).

2.7.2 Application Areas

Data warehouse has huge repositories, which integrates data from various sources for analysis in a project. To find the overall tendency and efficiency of the system, online analytical processing (OLAP) gives best solution for queries that combines huge amount of data. As a result, factual table builds to the smallest levels. Dimension table decrease the fact table size and amount of duplicated data in several cases. The relationship between dimension and fact table is many to many, it requires snowflake form through normal dimensions. In snowflake method, it is use to minimize the dimension table figures, which is connected to other dimension table, which links with the main fact table. In fact table, checking data should be unique and it should not be summarized. Finally, these data should be at the constant level with the essential size. Monitoring data should be guarantee and all the checking information present in the fact table, which appears in dimension table when designing.

In addition, multidimensional model has three significant application regions within the data analysis. Before planning the multidimensional database system, first select the modeling process, which includes the functional requirements. After that verify the numeric values to find the functions. Dimensions are analysis and assembled to the perspectives to verify the checking data and values. At last, database should be defines in logical model, which is determined in the object analysis, perspectives (Umashanker, et al 2006).

2.7.3 Logical Multidimensional Model

The multidimensional database technology is very important in business regions, because it enforces simplicity. The multidimensional model is organizes to logical cubes, dimensions, measures, levels, hierarchies and attributes. This model is very simple because object should be defined that represents to the real-world business units. However, analysts know which dimensions and attributes gives the meaningful data, which industry measures they gives more importance and they know how the dimensions managed into hierarchies and level of their business. The essential logical models for online analytical processing (OLAP) applications are multidimensional data cubes and these OLAP tool offer analysis to the original information. To achieve the main objective, these tools utilize multidimensional database model for the data presentation and storage (Umashanker, et al 2006).

Figure 2.8 Logical Multidimensional Models

Cube

Measures

Dimension

Dimension Attributes

Levels

Hierarchies

The figure 2.8 shows the logical model for cubes that is not self-existing entities, instead of original data set. To organize the measures, those have the same figures that have the exact similar dimensions that provide logical cubes. Similar cube measures have the identical relationship to other objects and it can be displays and analyzed together easily. It allows user to interact with each dimension table and decides which values demonstrate in the destination presentation areas, here dimensional queries established number of steps. In relational model, users to operate all the total components, which guides to unexpected result sets and misunderstanding (Umashanker, et al 2006).

2.8 Online Analytical Processing (OLAP)

Online Analytical Processing (OLAP) allows the project and selects i.e. slice and dice data to the business users. The most important OLAP vendors are Cognos, Business Objects, and Hyperion. The background of the OLAP environment is to perform the analysis that involves large investments in financial resources and time (Mailvaganam, 2007). It uses a multidimensional data format (data cubes) to make easy query and response time. In OLAP technology, data aggregation and data model techniques manage and reviewed huge amount of data, therefore it can be calculated rapidly using graphic tools and online analysis. An OLAP system supports real time analysis that offers the flexibility.

Generally, in organization data are spreads into several data sources, which are mismatches with each other. In retail example: The sales and Point-of-sales data gathered in different place and arrangements made through web or call-center. The marketing team will get the OLAP reports such as customer between the ages 18 to 25, which products they prefer more it would be the time consuming process. Extracting the data from different data storehouse and making them well matched it is also the part of OLAP implementation. Example for incompatible data: Customers purchase made and age can be stored as birth date in age categories between 18 and 25 over the web. In operational system, data are stored such as point-of-sales in database types called as Online Transaction Processing (OLTP) c The major differences between the OLTP and OLAP system design can be discussed in the below table 1.1

Table 1.1 Differences between OLTP and OLAP

Item

OLTP System

Online Transaction Processing

(Operational System)

OLAP System

Online Analytical Processing

(Data Warehouse)

Data source

Operational data: original source of the data

Consolidation data: data comes from the several OLTP database

Purpose of data

To organize and manage fundamental tasks

Helps for decision support, planning and problem solving

User

IT Professionals

Knowledge worker

Database design

Application oriented

Subject oriented

Database size

100 MB-GB

100 GB-TB

Processing speed

Very fast

Depends upon the data involved

Queries

Simple queries and relatively standard

Complex queries

Access

Read and write

Read only

Backup recovery

To run the business, operational data is serious, needs data loss, major financial loss and legal ability

Instead of normal backups, many places might simply reload the data as recovery method.

The OLAP typical operations include roll up (drill-up) it increases the aggregation level, drill down (roll down) it decreases the aggregation level together with one or more dimension hierarchy, slice and dice (select and project), and pivot makes cross chart (Chaudhuri and Dayal, 1997). The key features of the OLAP applications found in different functional areas such as multidimensional data view, calculation-intensive capabilities and time intelligence (Aparajita, 2004).

Figure 2.9 Steps for OLAP Creation Process

Scheduled automated Processes Extract Data from OLTPs

Build cubes

Produce reports

Transform and Standardize data

Import to OLAP Database

Extract Data from OLTPs

When designing the solution, OLAP storage is one of the significant choices to be complete. It has three types we can look each of them in detail.

Multidimensional OLAP

Relational OLAP

Hybrid OLAP

2.8.1 Multidimensional OLAP

In MOLAP, source data can be stored in multidimensional cubes and not in relational database and aggregations also stored in the same format. In MOLAP, data retrieval is very fast, although it needs more disk space. Disk space is fewer amounts to worry in these days with poor storage and processing cost (Mailvaganam, 2007). The main advantages are when the cube is generates and querying data can be apply easily for all the calculations are pre-generated. The major drawback of this model is handle only few amount of data.

2.8.2 Relational OLAP

In ROLAP, all the data sources and aggregations are stored in the relational database. For significant data warehouse implementation, this method is worried for higher usage requires. In ROLAP, data retrieval is slow whether an aggregation is present or not. The relational database should access the data warehouse itself and it is best for minor data warehouse implementations (Mailvaganam, 2007). This model gives the appearance of usual OLAP functionalities such as slicing and dicing. Handle a huge amount of data can control all the functionalities in the database. The ROLAP has poor performance and all the constraints of SQL query have limited functionalities.

2.8.3 Hybrid OLAP

In HOLAP, it is a combination of both the MOLAP and ROLAP storage methods. This method stores the aggregations that survive inside the multidimensional structure. Data cell-level is separated by itself in a relational form here data is pre combined (Mailvaganam, 2007).

When using the MOLAP, cube surfing is fast which is same, no aggregations have been complete. The entire data source can be stored in the compacted multidimensional format and access rapidly than in the relational database. In both ROLAP and HOLAP, surfing is very slow and time processing also is decelerating particularly at higher aggregation levels in ROLAP. More storage space for MOLAP rather than HOLAP because data are copy at low aggregation level when compared to ROLAP. Generally, ROLAP data not are duplicates and no storage space required. However, MOLAP or HOLAP aggregations takes less storage space when compare to ROLAP aggregations.

2.8.4 Why OLAP in data warehouse

Data warehouse store the planned information that gives respond to “who?” and “what?” questions. In these types of question OLAP system, gives answer to these questions, it is their responsibility to give answer that places separately from data warehouse. The main activities for the OLAP systems are decision making for future proceedings. When compared to data warehouse, which is typically build on the relational technology. It provides fast access to planned information’s for further analysis OLAP use aggregate data in multidimensional view.

Both data warehouse and OLAP is correspondingly, hence data warehouse stores and manages the data and OLAP changes the data warehouse data into planned information. It varies from the essential navigation and surfing (slice and dice) for estimations and crucial analysis such as time and modeling is complex (Aparajita, 2004).

2.9 Summary

The main purpose of the literature review is to understand the data warehouse concepts how it is use in the real time business applications. In this chapter, we discussed some of the data warehouse key concepts are history of the data warehouse and implementation choices, ETL process, data warehouse design schema, multidimensional modeling and OLAP types.

Chapter 3 Designing the data warehouse system

3.1 Overview

In this project, we are designing the data warehouse for the Britain’s stationary market company (SMC) is design by many data warehouse system. Nowadays, the business organization often requires the different analytical reports, to learn how the business market went and take suitable measures to manage the sales effectively. From this instance, we can build the data warehouse system with the OLAP applications.

3.2 Design of Star Schema

In data warehouse schema, Kimball (2009), the father of the data mart supported that a star schema is the most important design. From the figure 3.1 shows, the design of star schema appears simple.

Figure 3.1 Star Schema Design Stage

The diagram looks like star shape in structure so it should be naming as star schema. The figure 3.1 shows the star schema it has many dimension tables and one fact table surrounding from its centre. In star schema design, the relationship between any one of its dimension table and the fact table are establishes on only join relation. Other features of the star schema are inside the dimension table the nature of data is denormalised when compared to the snowflake schema. The dimension table has one primary key, which is derive as a foreign key in the fact table i.e. from the figure 3.1 the dimension table attributes are Customer_Id, Product_Id, Order_Id, Time_Id is a primary key which is present in the fact_sales table as foreign key. The remaining attributes of the fact_sales table is price, quantity, and sales (product of price and quantity). From these attribute we can used to maintain the sales analysis of the stationary market company.

3.2.1 Choosing the dimensions

Choose the business process to model. In organization, a normal business activities should called as process that supports by a source system. The most important thing in business process is we are not referring any business division. In our project, business process contains orders, purchasing, inventory, shipments. Instead of building the separate models for the purchasing and sales we can build only single dimensional model orders itself it contains both division data. Select the dimensions that relates to corresponding values in fact table. In SMC system, we are determined on the following expressive dimensions as shown in figure 3.2, as customer, orders, orders line, product, and time. In addition, it contains the price, quantity and sales these attributes are special dimension we will discuss more in the next chapter.

Once the fact tables sale grain has select the time, customer, product and orders dimensions drop out instantaneously. In time dimension, the date attributes are assume as the calendared date values carried us by the fact sales system. In the other hand, our business might also require a fiscal year sales details but in our time dimension the fiscal year does not commence on January 1st as the calendar year.

In dimension tables it mainly focus on loading them with strong attributes. First, we talk about how time dimension design attributes will reflect in our SMC system. The time dimension is present on every data mart because all data warehouse is time sequence. In time dimension, date attributes is very important based on that only we could compare that which customer buys particular product frequently with the help of customers orders_date attributes. Apart from that part we know compare the sales with previous years, month, quarters, week, day etc. Next, we move to customer dimension table design, in this dimension it will maintain the whole customers details the attributes used here as customer id, customer account no, name, address 1, address 2, email id, phone no, fax no and postcode. Here customer account has a relationship with the orders dimension based on that we can choose the customers ordered details.

In orders dimension, it will maintain all the customers ordered information’s like order id, order no, order date, order lineno, orderline product no, quantity, price, status, vat status. In this dimension table, the key attribute is order line product no; because it has relationship between the product tables attributes from that we can get the product details also. In SMC system, most of the attributes are not part of the product dimensions. The general attributes in product dimensions are product id, product account no, product description, product cost, discount, product supplier and vat status.

3.2.2 Identifying the fact

Kimball says that determining a fact table by answering the query, “What are we measuring?” in the business process. Generally, organization users are strongly interested to analyze their business performance measures. The most important thing is to identify the fact with the smallest level that makes sense to analyze the SMC system business operations. We will design the dimension and fact tables regarding to the business users requirements to make right decision for management queries.

A fact table acquires the data that measures for the organization business operations. It might contain an event such as sales for the product were customer bought in last one or more years history. Fact tables develop extremely large, with millions of rows for one or more years of sales history.

In SMC system, the collected information in the fact tables is quantity (i.e. number of items sold in each product), price and sales can be calculated by multiple of both the quantity and price. See the figure 3.1 shows the shape of the fact table. Basically, fact table has some relationship into the corresponding dimension tables like in our SMC system has customer_id, order_id, product_id, time_id is foreign key that should be primary key in the customer dimension, orders dimension, product dimensions and time dimension respectively. After knowing, all these attributes load the data into the fact sales table.

3.3 SMC system design

Generally, a data warehouse has various design schemas but in my project, I am using the star schema because it provides a direct mapping between the businesses entities evaluated by the end users and offers high performance for star queries. Each design has one or more dimension tables and a fact table (sales). In SMC system, the fact table contains the stationary item sales records for the last 8 years. The SMC system has multiple dimension tables such as customer, order, product, time. Each dimension table contains the ample of information. Some of the key attributes in our SMC system are product description, cost, customer account no is used to check whether the customer is eligible for discount, quantity of item purchased and each customer have separate id based on that we store the customer details. Every company keeps the data warehouse system within it, a large amount of sales records can be maintain for each product in the fact sales table and the remaining attributes of the systems are customer id, order id, product id, time id, order no, quantity, price.

Figure 3.2 SMC system design

Data warehouse

Load

Sources

Staging area

Extract

Transform & Clean

ETL

Once the data warehouse builds, next we move into the cube analysis service

Chapter 4 Implementation

4.1 Overview

The designed star schema is implement into the Online Analytical Process (OLAP) applications with the help of ETL tools. On the other hand, it is

4.2 Choice of tool selection

While evaluating data warehouse tools, it is extremely hard to point-out best software application to use, it should be better in all features than in others. Due to market competition, a major leading database seller maintains the ever-growing user requirements. The main intention of this phase is to recognize the candidate tools for expanding and implementing the data warehouse and performs technical support functions where suitable.

To design and built the data warehouse system, I am using the Business Intelligence Development Studio (BIDS) is a new emerging tool in the Microsoft SQL Server 2008. We can get this software through the MSDN account with the help of our CMS Lab. This tool is use for analyzing and reporting the SQL Server data in the different ways. SQL Server is an absolute data analysis stage that includes some tools to work simultaneously and gives difficult data analysis for decision makers. Using the Business Intelligence Development Studio, I have implemented the three different types of solutions to this project.

SQL Server Integration services

SQL Server Analysis Services

SQL Server Reporting Services

In SSIS, it is step for Extraction, Transformation, and Loading ETL provided by Microsoft, which offers the Business Intelligence space. Why I am using the SSIS for the ETL process, because it gives new methods to manage the flow of your package before writing the code, this will made, data integration is standard, simple method, implementation speed is high, most excellent exercise to data warehouse, real time. The most important transformation features is set up loop inside the package by using these two tasks for loop container and foreach loop container. In this project, I am using the SQL Server database it mainly supports all the ETL tools while using in the SSIS it gives high speed implementation

3.3.1 Extraction of source data

This is the first step of design part; the figure 3.2 shows the extraction process step. To import the source files (such as Flat file, Excel file, and MS access file) into the destination (target database) by using the SSIS services. Here, choose the data source file after that select the server name and database that you already created into the system, select the corresponding fields. Next step is choosing the destination location (relational table), once we select the server name and database correct. Subsequently choose the source tables and view; here we can edit the table column names, after that save it into a package and run immediately. Once the package execution was successful, next we start the cleaning process. This same process can be follow to all the source data.

Figure 3.2 Extraction process

Source file

Extraction process

Extracted data

Source table

3.3.2 Transforming and cleaning

In this design chapter, the transform and cleaning is the significant process because the organization has only raw datasets. Before doing the transforming and cleaning step, a staging area is present, it is a temporary place mainly required for timing purpose in the data warehouse architecture. Some basic transformations are carried out i.e. the data has misspellings, null columns and duplicate values in the corresponding attribute (such as customer_account, product_account, order_no, order_lineno) because these fields are important to the related tables, combination of alpha numeric values, load only certain column(not null column), generate the surrogate key, join the multiple source data. These transformation process will be stored for every ETL process iteration it will be repeated. Once the transformation process is finished, the data is cleaned it will ready to load into the data warehouse.

Figure 3.3 Cleaning and transformation process

Extracted data

Cleaning & transformation process

Mapping file

Required data flow transformations

Cleaned data

3.3.3 Loading process

This is final step in the ETL process; once the transformations and cleaning process should be finished now the data is clean. The system will identify the corresponding field to map the cleaned data into destination table (target database) and load it. Once the whole process is finished and stored in the data warehouse database. Now the users generate the cube analysis based on the business requirements and produce the reports. The cube analysis and report can be explained in detail in the implementation part (chapter 4) with the OLAP applications.

Figure 4.3 Loading process

Cleaned data

Target table

Mapping file

Data warehouse

Order Now