Data Integration & Warehousing: March 2014

Monday 10 March 2014

Slowly Changing Dimensions & their types

*********************************************************************************************

Slowly Changing Dimensions

Slowly changing dimensions, referred as SCD, can be modeled basically in 3 different ways based on whether we want to store full histories, partial histories or no history. These different types are called Type 2, Type 3 and Type 1 respectively.

The "Slowly Changing Dimension" problem is a common one particular to data warehousing. In a nutshell, this applies to cases where the attribute for a record varies over time. For example: Consider table Customer. Prasad is a name of customer. He first lived in Mumbai. So, the original entry in the customer lookup table has the following record:

Customer key	Name	State
1	Prasad	Mumbai

Later on, he moved to Pune, on January, 2014. Now how will you modify its customer table to reflect this change? This is the "Slowly Changing Dimension" problem.

Generally there are three ways to solve this type of problem, and they are categorized as follows:

SCD Type 1: The new record replaces the original record. No HISTORY is maintained.

SCD Type 2: A new record is added into the table, therefore both old and new record exist.

SCD Type 3: It stores partial history; In Type 2, whenever the values of the attributes change, we insert new rows to the table. In case of type 3, we add new column to the table to store the history.

We will see all SCD types one by one as below:

SCD Type - 1

In SCD type - 1, the new information simply overwrites the original information. In other words, no history is kept. For example the following table

Customer key	Name	State
1	Prasad	Mumbai

After Prasad moved from Mumbai to Pune, the new information replaces the new record, and we have the following table:

Customer key	Name	State
1	Prasad	Pune

Advantages:

This is the easiest way to handle the Slowly Changing Dimension problem, since there is no need to keep track of the old information.

Disadvantages:

All history is lost. By using this method, it is not possible to trace back in history.

Usage:

About 40% to 50% of the time.

Where to use Type 1:

If it is not necessary for the data warehouse to keep track of historical changes.

SCD Type - 2

In SCD type - 2, a new record is added to the table to represent the new information. Therefore, both the original and the new record will be present. The new record gets its own primary key. Latest record can be tracked by various ways.

In above example given in above type will become

Customer key	Name	State
1	Prasad	Mumbai

After Prasad moved from Mumbai to Pune, we add the new information as a new row into the table

Customer key	Name	State
1	Prasad	Mumbai
3	Prasad	Pune

Latest record can be tracked by various ways.

1) Effective and End date concept:

Customer key	Name	State	Eff_dt	End_Dt
1	Prasad	Mumbai	02-Dec	NULL
2	Nitin	Nagpur	02-Dec	NULL

Customer key	Name	State	Eff_dt	End_Dt
1	Prasad	Mumbai	02-Dec	07-Jan
3	Prasad	Pune	07-Jan	NULL
2	Nitin	Nagpur	02-Dec	NULL

2) Versioning:

We can save the history by using the version method also as below

Customer key	Name	State	Version
1	Prasad	Mumbai	1
2	Nitin	Nagpur	1

Customer key	Name	State	Version
1	Prasad	Mumbai	1
3	Prasad	Pune	2
2	Nitin	Nagpur	1

Advantages:

This allows us to accurately keep all historical information.

Disadvantages:

This will cause the size of the table to grow fast. In cases where the number of rows for the table is very high to start with, storage and performance can become a concern.

This necessarily complicates the ETL process.

Usage:

About 50%-60% of the time.

When to use Type 2:

If it is necessary for the data warehouse to track historical changes.

SCD Type - 3

In SCD type - 3, there will be two columns to indicate the particular attribute of interest, one indicating the original value, and one indicating the current value. There will also be a column that indicates when the current value becomes active. Type 3 design is used to store partial history.

In Type 2 design. whenever the values of the attributes change, we insert new rows to the table. In case of type 3, however, we add new column to the table to store the history.

For example suppose Prasad lived in Mumbai then shifted to Pune then table will be maintained a below:

Customer key	Name	State
1	Prasad	Mumbai
2	Nitin	Nagpur

Customer key	Name	State	State_Old
1	Prasad	Pune	Mumbai
2	Nitin	Nagpur

Prasad again shifted to Kolhapur from Pune the table will be as:

Customer key	Name	State	State_Old	State_Old_1
1	Prasad	Kolhapur	Pune	Mumbai
2	Nitin	Nagpur

Advantages:

This does not increase the size of the table, since new information is updated.

This allows us to keep some part of history.

Disadvantages:

Type 3 will not be able to keep all history where an attribute is changed more than once. For example, if Prasad later moves to Nagpur then the Mumbai information will be lost.

Usage:

Type 3 is rarely used in actual practice.

Where to use Type 3:

If it is necessary for the data warehouse to track historical changes, and when such changes will only occur for a finite number of time.

Thursday 6 March 2014

Data warehousing - Architecture, Fact & Dimension tables

***********************************************************************
Data warehousing

A data warehousing is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management decision making process. This is the standard definition of data warehouse came from Bill Inmon (Father of Data warehouse). Below is the description of definition:

Subject-oriented: Data warehouse is designed to to help you to analyze the data.For Example "Sales" can be a particular subject. This ability to define a data warehouse by subject matter, sales in this case, makes the data warehouse subject oriented.

Integrated: A data warehouse integrates/combine the data form multiple sources. For example source A and source B may have different ways of identifying the products but in DW there will be single way of identifying the product.

Time-variant: The historical data is kept in data warehouse. For example one can get the data of 3 months, 9 months or even older data from data warehouse. This is totally opposite in transactions system in which only most recent data is kept. example to understand is, transactions system may hold most recent address of customer where data warehouse can hold all addresses associated with a customer.

Non-volatile: Once data is entered into data warehouse, it will not change. So, historical data in data warehouse can not be altered. This is because purpose of data warehouse is to enable you to analyze what has occurred.

Data Warehouse Architecture

Different data warehouse systems have different structures. In general all data warehouse systems have following layers:

1. Data source layer

2. Data extraction layer

3. Staging area

4. ETL layer

5. Data storage layer

6. Data logic layer

7. Data presentation layer

8. Metadata layer

9. System operations layer

We will discuss each layer as below:

1. Data source layer:

This indicates that data source that feed data into data warehouse. Data source can be of any format - flat file, relational database, Excel file. All these data sources together form data source layer. Data comes form different types of sources to this layer.

2. Data extraction layer:

Data is extracted from data source into data warehouse system. There is likely some minimal cleansing happens; but not any major data transformation happens.

3. Staging area:

This is the place where data sits prior to being scrubbed and transformed into data warehouse/data mart.

4. ETL layer:

Data gets transformed in this layer. Logic will get applied in this layer to transform the data from transnational nature to analytic nature. In this layer also some data cleansing happens.

5. Data storage layer:

This is the layer where transformed data and cleansed data sits. Depending on scope and functionality, three types of entities can be found here as data warehouse, data mart and operation data store (ODS).

6. Data logic layer:

This is where business logic rules stores. Business rules stored in this layer doesn't affect the underlying data transformation rules, but does affect what the report looks like,

7. Data presentation layer:

This refers to information that reaches the users. This can be in tabular form/ graphical form in browser. Usually OLAP tool and/or Reporting tool is used in this layer.

8. Metadata layer:

This is where information about data stored in data warehouse is stored. A metadata tool is often used to manage metadata.

9. System operations layer:

This layer includes information about how data warehouse system operates, such as ETL job status, system performance and user access history.

Dimension & Fact tables

Dimension table:

Dimension tables are used to describe the dimensions. Now here dimension means category of information, mainly contains descriptive text. Example Time dimension.

Dimension table contains dimension keys, values and attributes.

For example Time dimension would contains each hour, day, week, month, quarter and year.

Dimension tables are typically small, ranging from few to several thousand records. Occasionally, these tables can grow fairly large.

Below is Customer dimension table look like:

Customer_key

Customer_full_name

Customer_city

Customer_state

Customer_country

Attributes: It is nothing but unique level within dimension. For example Month is attribute in Time dimension.

Hierarchy: Specific level that represents relationship between different attributes in dimension tables. For example in Time dimension: Year > Quarter > Month > Day.

Fact table:

Fact table is table that contains measures of interest. Foe example sales amount wold be such a measure. This measure is stored in fact table with appropriate granularity. In this case, fact table would contains three columns such as date, store, sales amount columns.

Fact tables contains primary keys of dimension tables as foreign key as well as measurable facts that data analysts would want to examine. Fact tables can grow very large with millions or billions of rows.

Types of Facts:

There are three types of facts.

1. Additive: Additive facts are facts that can be summed up through all of the dimensions in the fact table.

2. Semi-Additive: Semi-Additive facts are facts that can be summed up for some of the dimensions in the fact tables , but not the others.

3. Non-Additive: Semi-Additive facts are facts that can not be summed up for any of the dimensions present in fact table.

Types of Fact Tables:

Based on above classifications, there are two types of fact tables:

1. Cumulative:

This type of fact tables described what has happened over the period of time. For example this fact table may describe the total sales by product by store by day. The facts of this type of fact tables are mostly additive facts.

2. Snapshot:

This type of fact table describes the state of things in particular instance of time and usually includes more semi additive and non-additive facts.

*********************************************************************************

Data Integration & Warehousing

Labels