*********************************************************************
Data warehousing Interview Questions
1. What is a data warehouse?
A data warehouse is an electronic store of an organization's historical data,
kept for the purpose of analysis and reporting. According to Bill Inmon, a data
warehouse should be subject-oriented, non-volatile, integrated and time-variant.
Non-volatile means that data, once loaded into the warehouse, is not deleted or
overwritten later. Time-variant means that the data is stored with a time
element, so that changes over time can be tracked.
2. What are the benefits of a data warehouse?
Historical data stored in a data warehouse helps to analyze different aspects
of the business, including performance analysis, trend analysis and trend
prediction, which ultimately increases the efficiency of business processes.
3. Why is a data warehouse used?
A data warehouse facilitates reporting on the key measures of different
business processes, known as key performance indicators (KPIs). A data
warehouse can further be used for data mining, which helps with trend
prediction, forecasting, pattern recognition etc.
4. What is the difference between OLTP and OLAP?
OLTP (online transaction processing) is the transaction system that collects
business data, whereas OLAP (online analytical processing) is the reporting and
analysis system on that data.
OLTP systems are optimized for INSERT and UPDATE operations and are therefore
highly normalized. OLAP systems, on the other hand, are deliberately
denormalized for fast data retrieval through SELECT operations.
5. What is a data mart?
Data marts are generally designed for a single subject area. An organization
may have data pertaining to different departments like Finance, HR, Marketing
etc. stored in the data warehouse, and each department may have a separate data
mart. These data marts can be built on top of the data warehouse.
6. What is an ER model?
The ER model is the entity-relationship model, which is designed with the goal
of normalizing the data.
7. What is dimensional modeling?
A dimensional model consists of dimension and fact tables. Fact tables store
different transactional measurements together with the foreign keys from the
dimension tables that qualify the data. The goal of a dimensional model is not
to achieve a high degree of normalization but to facilitate easy and fast data
retrieval.
8. What is a dimension?
A dimension is something that qualifies a quantity (measure).
If I just say "20kg", it does not mean anything. But "20kg of rice (product)
was sold to Ramesh (customer) on 5th April (date)" makes sense. Product,
customer and date are dimensions that qualify the measure. Dimensions are
mutually independent.
Technically speaking, a dimension is a data element that categorizes each item
in a data set into non-overlapping regions.
9. What is a fact?
A fact is something that is quantifiable (or measurable). Facts are typically
(but not always) numerical values that can be aggregated.
10. What are additive, semi-additive and non-additive measures?
Non-additive measures are those which cannot meaningfully be used inside a
numeric aggregation function (e.g. SUM(), AVG() etc.). One example of a
non-additive fact is any kind of ratio or percentage, for example a 5% profit
margin or a revenue to asset ratio. Non-numerical data can also be a
non-additive measure when it is stored in fact tables.
Semi-additive measures are those to which only a subset of aggregation
functions can be applied, or which can be summed across some dimensions but not
others. Take account balance: a sum() of one account's balance over time does
not give a useful result, but the max() or min() balance might be useful.
Similarly, consider a price rate or currency rate: a sum is meaningless on a
rate; however, an average might be useful.
Additive measures can be used with any aggregation function like sum(), avg()
etc. and can be summed across all dimensions. An example is sales quantity.
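The difference between the three kinds of measure can be illustrated with a small sketch. The rows, column layouts and figures below are invented purely for illustration.

```python
# Hypothetical daily account rows: (day, account, deposit, closing_balance).
account_rows = [
    ("Mon", "A", 100, 100),
    ("Mon", "B", 50, 50),
    ("Tue", "A", 20, 120),
    ("Tue", "B", 0, 50),
]

# Additive: deposits can be summed across accounts and across days.
total_deposits = sum(deposit for _, _, deposit, _ in account_rows)

# Semi-additive: balances can be summed across accounts for a single day...
tuesday_balance = sum(bal for day, _, _, bal in account_rows if day == "Tue")
# ...but summing them across days counts the same money more than once.
balance_summed_over_days = sum(bal for _, _, _, bal in account_rows)

# Hypothetical sales rows: (revenue, profit).
sales_rows = [(100, 10), (300, 90)]

# Non-additive: a margin is a ratio, so neither summing nor averaging the
# per-row margins gives the overall margin. Aggregate the additive parts
# (profit and revenue) first, then divide.
row_margins = [profit / revenue for revenue, profit in sales_rows]
average_of_margins = sum(row_margins) / len(row_margins)
overall_margin = sum(p for _, p in sales_rows) / sum(r for r, _ in sales_rows)
```

With these made-up figures the overall margin is 25%, while the average of the two row margins is 20%, which is why ratios are usually recomputed from their additive components.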
11. What is a star-schema?
This schema is used in data warehouse models where one centralized fact table
references a number of dimension tables, so that the keys (primary keys) from
all the dimension tables flow into the fact table (as foreign keys), where the
measures are stored. The entity-relationship diagram looks like a star, hence
the name.
Consider a fact table that stores the sales quantity for each product and
customer at a certain time. Sales quantity is the measure here, and keys from
the customer, product and time dimension tables flow into the fact table. A
star-schema is a special case of the snow-flake schema.
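A minimal sketch of the sales example, using Python's built-in sqlite3 module with an in-memory database. The table names, column names and rows are assumptions made up for illustration, not a prescribed design.

```python
import sqlite3

star = sqlite3.connect(":memory:")
star.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, day  TEXT);

-- The fact table holds the measure plus one foreign key per dimension.
CREATE TABLE fact_sales (
    customer_key   INTEGER REFERENCES dim_customer(customer_key),
    product_key    INTEGER REFERENCES dim_product(product_key),
    date_key       INTEGER REFERENCES dim_date(date_key),
    sales_quantity REAL
);

INSERT INTO dim_customer VALUES (1, 'Ramesh'), (2, 'Sita');
INSERT INTO dim_product  VALUES (1, 'Rice'), (2, 'Wheat');
INSERT INTO dim_date     VALUES (20240405, '2024-04-05');
INSERT INTO fact_sales   VALUES (1, 1, 20240405, 20),
                                (2, 1, 20240405, 5),
                                (1, 2, 20240405, 3);
""")

# A typical star query: join the fact to a dimension and aggregate the measure.
sales_by_product = star.execute("""
    SELECT p.name, SUM(f.sales_quantity)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_key = f.product_key
    GROUP BY p.name
    ORDER BY p.name
""").fetchall()
```

Every dimension is exactly one join away from the fact table, which is what keeps such queries simple.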
12. What is a snow-flake schema?
This is another logical arrangement of tables in dimensional modeling, where a
centralized fact table references a number of dimension tables; however, those
dimension tables are further normalized into multiple related tables.
13. What are the different types of dimension?
In a data warehouse model, a dimension can be of the following types:
o Conformed Dimension
o Junk Dimension
o Degenerated Dimension
o Role Playing Dimension
Based on how frequently the data inside a dimension changes, we can further
classify dimensions as:
o Unchanging or static dimension (UCD)
o Slowly changing dimension (SCD)
o Rapidly changing Dimension (RCD)
14. What is a 'Conformed Dimension'?
A conformed dimension is a dimension that is shared across multiple subject
areas. Consider the 'Customer' dimension. Both the marketing and sales
departments may use the same customer dimension table in their reports.
Similarly, a 'Time' or 'Date' dimension will be shared by different subject
areas. These dimensions are conformed dimensions.
Theoretically, two dimensions which are either identical or strict mathematical
subsets of one another are said to be conformed.
15. What is a degenerated dimension?
A degenerated (or degenerate) dimension is a dimension that is derived from the
fact table and does not have its own dimension table.
A dimension key such as a transaction number, receipt number or invoice number
does not have any further associated attributes, so there is nothing to put in
a separate dimension table; the key is simply stored in the fact table.
16. What is a junk dimension?
A junk dimension is a grouping of typically low-cardinality attributes (flags,
indicators etc.) that are removed from other tables and "junked" into a single
abstract dimension table.
These junk dimension attributes might not be related to each other. The only
purpose of this table is to store all the combinations of the dimensional
attributes which you could not otherwise fit into the other dimension tables.
One may want to read an interesting document, "De-clutter with Junk
(Dimension)".
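As a rough sketch of the idea, a junk dimension can be generated as the set of all combinations of a few unrelated flags. The attribute names and values below are hypothetical.

```python
from itertools import product

# Hypothetical low-cardinality attributes that have no natural home.
flag_values = {
    "payment_type": ["Cash", "Card"],
    "is_gift": ["Y", "N"],
    "is_online": ["Y", "N"],
}
attribute_names = list(flag_values)

# One junk-dimension row per combination, each with its own surrogate key.
junk_dimension = [
    {"junk_key": key, **dict(zip(attribute_names, combo))}
    for key, combo in enumerate(product(*flag_values.values()), start=1)
]

# Lookup used while loading facts: combination of flags -> junk_key.
junk_key_for = {
    tuple(row[name] for name in attribute_names): row["junk_key"]
    for row in junk_dimension
}
```

The fact table then carries a single junk_key column instead of three separate flag columns.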
17. What is a role-playing dimension?
Dimensions are often reused for multiple applications within the same database
with different contextual meanings. For instance, a "Date" dimension can be
used for "Date of Sale" as well as "Date of Delivery" or "Date of Hire". This
is often referred to as a 'role-playing dimension'.
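One common way to realise this is to join the same physical date table several times under different aliases. The sketch below uses Python's sqlite3 module; the table and column names are assumed for illustration.

```python
import sqlite3

roles = sqlite3.connect(":memory:")
roles.executescript("""
CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, day TEXT);
CREATE TABLE fact_order (
    order_id          INTEGER,
    sale_date_key     INTEGER REFERENCES dim_date(date_key),
    delivery_date_key INTEGER REFERENCES dim_date(date_key)
);
INSERT INTO dim_date   VALUES (1, '2024-04-05'), (2, '2024-04-09');
INSERT INTO fact_order VALUES (100, 1, 2);
""")

# The single dim_date table plays two roles: sale date and delivery date.
order_dates = roles.execute("""
    SELECT o.order_id, sale.day, delivery.day
    FROM fact_order AS o
    JOIN dim_date AS sale     ON sale.date_key     = o.sale_date_key
    JOIN dim_date AS delivery ON delivery.date_key = o.delivery_date_key
""").fetchall()
```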
18. What is SCD?
SCD stands for slowly changing dimension, i.e. a dimension whose data changes
slowly. SCDs can be of many types, e.g. Type 0, Type 1, Type 2, Type 3 and Type
6, although Types 1, 2 and 3 are the most common.
19. What is a rapidly changing dimension?
This is a dimension where data changes rapidly.
Describe different types of slowly changing dimension (SCD)
Type 0:
A Type 0 dimension is one where dimensional changes are not considered. This does not mean that the attributes of the dimension do not change in the actual business situation. It just means that, even if the values of the attributes change, the change is not applied: history is not kept and the table continues to hold the original data.
Type 1:
A type 1 dimension is one where history is not maintained and the table always shows the most recent data. This effectively means that such a dimension table is always updated with the latest data whenever there is a change and, because of this update, we lose the previous values.
Type 2:
A type 2 dimension table tracks historical changes by creating separate rows in the table with different surrogate keys. Suppose customer C1 is first under group G1 and is later moved to group G2. Then there will be two separate records in the dimension table, as below:
Key | Customer | Group | Start Date   | End Date
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005
2   | C1       | G2    | 1st Jan 2006 | NULL
Note: Separate surrogate keys are generated for the two records. The NULL end
date in the second row denotes that the record is the current record. Also note
that, instead of start and end dates, one could keep a version number column
(1, 2 ... etc.) to denote different versions of the record.
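The type 2 update for the C1 example could be sketched as follows. The function name and the dictionary layout are hypothetical; a real load would normally run as SQL or inside an ETL tool.

```python
from datetime import date, timedelta

def apply_type2_change(rows, customer, new_group, change_date):
    """Close the customer's current row and append a new current row."""
    for row in rows:
        if row["customer"] == customer and row["end_date"] is None:
            # The old version ends the day before the new one starts.
            row["end_date"] = change_date - timedelta(days=1)
    next_key = max((r["key"] for r in rows), default=0) + 1
    rows.append({
        "key": next_key,          # new surrogate key
        "customer": customer,     # same business key
        "group": new_group,
        "start_date": change_date,
        "end_date": None,         # None marks the current record
    })
    return rows

dim_customer = [{
    "key": 1, "customer": "C1", "group": "G1",
    "start_date": date(2000, 1, 1), "end_date": None,
}]
apply_type2_change(dim_customer, "C1", "G2", date(2006, 1, 1))
```

After the call, the list holds the same two rows as the table above: key 1 closed on 31st Dec 2005 and key 2 open from 1st Jan 2006.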
Type 3:
A type 3 dimension stores the history in a separate column instead of separate
rows. So unlike a type 2 dimension, which grows vertically, a type 3 dimension
grows horizontally. See the example below,
Key | Customer | Previous Group | Current Group
1   | C1       | G1             | G2
This is only suitable when you do not need to store many consecutive versions
of history and when the date of change does not need to be stored.
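A small sketch of a type 3 update, with a hypothetical helper and column names, shows why only one step of history survives:

```python
def apply_type3_change(row, new_group):
    """Return a copy of the row with the old current value moved aside."""
    return {
        **row,
        "previous_group": row["current_group"],
        "current_group": new_group,
    }

customer_row = {
    "key": 1, "customer": "C1",
    "previous_group": None, "current_group": "G1",
}
after_first_change = apply_type3_change(customer_row, "G2")
# A second change overwrites the previous-group column, so G1 is lost.
after_second_change = apply_type3_change(after_first_change, "G3")
```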
Type 6:
A type 6 dimension is a hybrid of types 1, 2 and 3 (1+2+3). It behaves very much like type 2, except that you add one extra column to denote which record is the current record.
Key | Customer | Group | Start Date   | End Date      | Current Flag
1   | C1       | G1    | 1st Jan 2000 | 31st Dec 2005 | N
2   | C1       | G2    | 1st Jan 2006 | NULL          | Y
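The benefit of the extra flag shows up at query time. The sketch below, with hypothetical helper names, looks up the current record through the flag and a historical record through the date range.

```python
from datetime import date

dim_customer = [
    {"key": 1, "customer": "C1", "group": "G1",
     "start_date": date(2000, 1, 1), "end_date": date(2005, 12, 31),
     "current_flag": "N"},
    {"key": 2, "customer": "C1", "group": "G2",
     "start_date": date(2006, 1, 1), "end_date": None,
     "current_flag": "Y"},
]

def current_row(rows, customer):
    """Find the current version using only the flag column."""
    for row in rows:
        if row["customer"] == customer and row["current_flag"] == "Y":
            return row
    return None

def row_as_of(rows, customer, on_date):
    """Find the version that was valid on a given date."""
    for row in rows:
        if row["customer"] != customer:
            continue
        started = row["start_date"] <= on_date
        not_ended = row["end_date"] is None or on_date <= row["end_date"]
        if started and not_ended:
            return row
    return None
```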
20. What is a mini dimension?
Mini dimensions can be used to handle the rapidly changing dimension scenario.
If a dimension has a large number of rapidly changing attributes, it is better
to separate those attributes into a different table, called a mini dimension.
This is done because, if the main dimension table is designed as SCD type 2,
the table will soon grow very large and create performance issues. It is better
to segregate the rapidly changing attributes into a different table, thereby
keeping the main dimension table small and fast.
21. What is a fact-less-fact?
A fact table that does not contain any measure is called a fact-less fact. This
table will only contain keys from different dimension tables. It is often used
to resolve a many-to-many cardinality issue.
For example, in a school a single student may be taught by many teachers and a single teacher may have many students. To model this situation in a dimensional model, one might introduce a fact-less-fact table joining teacher and student keys. Such a fact table will then be able to answer queries like:
o Who are the students taught by a specific teacher?
o Which teacher teaches the maximum number of students?
o Which student has the highest number of teachers?
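The three queries can be sketched against a tiny fact-less fact held as a list of key pairs. The teacher and student keys are invented for illustration.

```python
from collections import Counter

# Fact-less fact: one row per (teacher_key, student_key) pair, no measures.
teaches = [
    ("T1", "S1"), ("T1", "S2"), ("T1", "S3"),
    ("T2", "S1"),
    ("T3", "S1"),
]

# Who are the students taught by a specific teacher?
students_of_t1 = sorted(s for t, s in teaches if t == "T1")

# Which teacher teaches the maximum number of students?
students_per_teacher = Counter(t for t, _ in teaches)
busiest_teacher = students_per_teacher.most_common(1)[0][0]

# Which student has the highest number of teachers?
teachers_per_student = Counter(s for _, s in teaches)
most_taught_student = teachers_per_student.most_common(1)[0][0]
```

Even without a measure column, counting rows gives useful answers.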
22. What is a coverage fact?
A fact-less-fact table can only answer 'optimistic' (positive) queries; it
cannot answer a negative query, because it only stores the positive scenarios.
A coverage fact table attempts to answer negative scenarios as well, often by
adding an extra flag column: flag = 0 indicates a negative condition and flag =
1 indicates a positive condition. To understand this better, consider a class
where there are 100 students and 5 teachers. The coverage fact table will
ideally store 100 x 5 = 500 records (all combinations), and if a certain
teacher is not teaching a certain student, the corresponding flag for that
record will be 0.
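A scaled-down sketch of the same idea, with 2 teachers and 3 students instead of 5 and 100 (all keys are hypothetical):

```python
from itertools import product

teachers = ["T1", "T2"]
students = ["S1", "S2", "S3"]

# Positive scenarios only, as a fact-less fact would store them.
teaches = {("T1", "S1"), ("T1", "S2"), ("T2", "S3")}

# Coverage fact: every combination, flagged 1 (teaches) or 0 (does not).
coverage = [
    (t, s, 1 if (t, s) in teaches else 0)
    for t, s in product(teachers, students)
]

# A negative query: which students are NOT taught by teacher T1?
not_taught_by_t1 = [s for t, s, flag in coverage if t == "T1" and flag == 0]
```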
23. What are incident and snapshot facts?
A fact table stores some kind of measurements. Usually these measurements are
stored (or captured) against a specific time, and they vary with respect to
time. Now it might so happen that the business is not able to capture all of
its measures for every point in time. Those unavailable measurements can then
either be kept empty (NULL) or be filled in with the last available
measurements. The first case is an example of an incident fact and the second
is an example of a snapshot fact.
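The two treatments of missing measurements can be sketched side by side. The days and values are made up, and None stands in for NULL.

```python
days = ["Mon", "Tue", "Wed", "Thu"]
captured = {"Mon": 10, "Wed": 12}   # nothing was captured on Tue or Thu

# Incident fact: missing measurements stay empty.
incident = [(day, captured.get(day)) for day in days]

# Snapshot fact: missing measurements repeat the last available value.
snapshot = []
last_value = None
for day in days:
    if day in captured:
        last_value = captured[day]
    snapshot.append((day, last_value))
```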
24. What is aggregation and what is the benefit of aggregation?
A data warehouse usually captures data at the same degree of detail as is
available in the source. The "degree of detail" is termed granularity. But not
all reporting requirements on that data warehouse need the same degree of
detail.
To understand this, think about the CEO of a retail chain. He does not really care about which salesperson in London sold the highest number of chopsticks or which shop is the best seller of 'brown bread'. All he is interested in is, perhaps, the percentage increase of his revenue margin across Europe, or maybe the year-on-year sales growth in Eastern Europe. Such data is aggregated in nature, because the sales of goods in Eastern Europe are derived by summing up the individual sales data from each shop in Eastern Europe. Therefore, to support different levels of data warehouse users, data aggregation is needed.
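Rolling shop-level rows up to region level, as described above, can be sketched like this. The shop names and sales figures are invented.

```python
from collections import defaultdict

# Detail grain: one row per shop, as (region, shop, sales).
shop_sales = [
    ("East Europe", "Warsaw", 120),
    ("East Europe", "Prague", 80),
    ("West Europe", "London", 200),
]

# Aggregate to the coarser grain a senior user would look at.
region_sales = defaultdict(int)
for region, _shop, amount in shop_sales:
    region_sales[region] += amount
```

Storing such pre-computed totals in an aggregate table means the summary report does not have to scan every detail row each time.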
25. What is slicing-dicing?
Slicing means showing a slice of the data, given a certain dimension (e.g.
Product), a value (e.g. Brown Bread) and measures (e.g. sales).
Dicing means viewing the slice with respect to different dimensions and at
different levels of aggregation. Slicing and dicing operations are part of
pivoting.
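A minimal sketch of both operations on a handful of invented sales rows; the function names slice_rows and dice are hypothetical helpers, not part of any library.

```python
from collections import defaultdict

sales_facts = [
    {"product": "Brown Bread", "region": "East Europe", "year": 2023, "sales": 10},
    {"product": "Brown Bread", "region": "East Europe", "year": 2024, "sales": 12},
    {"product": "Brown Bread", "region": "West Europe", "year": 2024, "sales": 7},
    {"product": "Chopsticks", "region": "West Europe", "year": 2024, "sales": 30},
]

def slice_rows(rows, dimension, value):
    """Slice: fix one dimension to a single value."""
    return [r for r in rows if r[dimension] == value]

def dice(rows, *dimensions):
    """Dice: total the measure by the chosen dimensions."""
    totals = defaultdict(int)
    for r in rows:
        totals[tuple(r[d] for d in dimensions)] += r["sales"]
    return dict(totals)

brown_bread = slice_rows(sales_facts, "product", "Brown Bread")
by_region = dice(brown_bread, "region")
by_region_and_year = dice(brown_bread, "region", "year")
```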
26. What is drill-through?
Drill-through is the process of going from summary data to the detail-level
data.
For example, if the CEO finds out that sales in East Europe have declined this
year compared to last year, he might then want to know the root cause of the
decrease. For this, he may start drilling through his report to a more detailed
level and eventually find out that, even though individual shop sales have
actually increased, the overall sales figure has decreased because a certain
shop in Turkey has stopped operating. The detail-level data, which the CEO was
not much interested in earlier, has this time helped him to pinpoint the root
cause of the declined sales. The method he has followed to obtain the details
from the aggregated data is called drill-through.
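The CEO's investigation can be mimicked with a small sketch that first compares the summary totals and then breaks the change down by shop. The shop names, years and figures are all hypothetical.

```python
# Hypothetical yearly sales per shop in one region: {(shop, year): sales}.
shop_sales = {
    ("Shop A", 2023): 100, ("Shop A", 2024): 110,
    ("Shop B", 2023): 80, ("Shop B", 2024): 90,
    ("Shop in Turkey", 2023): 150, ("Shop in Turkey", 2024): 0,
}

def region_total(year):
    """Summary level: total sales of the whole region for one year."""
    return sum(v for (_, y), v in shop_sales.items() if y == year)

def change_by_shop(year_from, year_to):
    """Detail level: year-on-year change for each individual shop."""
    shops = sorted({shop for shop, _ in shop_sales})
    return {
        s: shop_sales[(s, year_to)] - shop_sales[(s, year_from)]
        for s in shops
    }

summary_change = region_total(2024) - region_total(2023)
detail_change = change_by_shop(2023, 2024)
```

The summary shows a decline, while the detail shows that two shops grew and the whole drop comes from the one shop that stopped trading.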