Data Warehouse Interview Questions

Here are the top 50 commonly asked questions in Data Warehouse interviews. Whether you’re just starting your preparation or need a quick refresher, these questions and answers will boost your confidence for the interview. Ranging from basic to advanced, they cover a wide array of Data Warehouse concepts. Practice these questions for campus and company interviews, for positions from entry to mid-level experience, and for competitive examinations. Practicing them will also strengthen your understanding of data warehousing.

Data Warehouse Interview Questions with Answers

1. What is a data warehouse?

A data warehouse (DW) is a central repository of integrated data from one or more disparate sources. It is a subject-oriented, integrated, time-variant, and non-volatile collection of data that supports management’s decision-making process. Typically it is a relational database designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but it can include data from other sources as well. A data warehouse separates the analysis workload from the transaction workload and enables an organization to consolidate data from several sources.

2. What is the difference between data mart and data warehouse?

Data warehouse:

  • Holds multiple subject areas
  • Holds very detailed information
  • Works to integrate all data sources
  • Does not necessarily use a dimensional model but feeds dimensional models

Data mart:

  • Often holds only one subject area, for example Finance or Sales
  • May hold more summarized data (although many hold full detail)
  • Concentrates on integrating information from a given subject area or set of source systems
  • Is typically built around a dimensional model, using a star schema

3. What is star schema?

A star schema is the simplest form of a dimensional model, in which data is organized into facts and dimensions. The star schema consists of one or more fact tables referencing any number of dimension tables.
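
For illustration, here is a minimal sketch of a star schema using Python with pandas; all table names, column names, and values are hypothetical, and a real warehouse would implement this as relational tables rather than DataFrames.

```python
import pandas as pd

# Dimension tables: descriptive attributes keyed by a surrogate key
dim_product = pd.DataFrame({
    "product_key": [1, 2],
    "product_name": ["Laptop", "Phone"],
    "category": ["Computers", "Mobiles"],
})
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240102],
    "calendar_date": ["2024-01-01", "2024-01-02"],
    "month": ["2024-01", "2024-01"],
})

# Fact table: foreign keys into each dimension plus numeric measures
fact_sales = pd.DataFrame({
    "date_key": [20240101, 20240101, 20240102],
    "product_key": [1, 2, 2],
    "quantity_sold": [3, 5, 2],
    "sales_amount": [3000.0, 2500.0, 1000.0],
})

# A typical star-schema query: join the fact table to its dimensions, then aggregate
report = (fact_sales
          .merge(dim_product, on="product_key")
          .merge(dim_date, on="date_key")
          .groupby(["month", "category"], as_index=False)["sales_amount"].sum())
print(report)
```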

4. What is the difference between star schema and snowflake schema?

In a star schema, each dimension is represented by a single dimensional table, whereas in a snowflake schema, that dimensional table is normalized into multiple lookup tables, each representing a level in the dimensional hierarchy. The star schema is an important special case of the snowflake schema, and is more effective for handling simpler queries.

5. What is fact table and dimension table in star schema?

The fact table contains business facts (or measures), and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables. Contrary to fact tables, dimension tables contain descriptive attributes that are typically textual fields (or discrete numbers that behave like text).


6. What is a factless fact table?

A factless fact table is a fact table that does not contain any facts (measures). It captures the many-to-many relationships between dimensions. Factless fact tables are so called because they contain only keys that refer to the dimension tables.
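
As a rough sketch (with hypothetical table and column names), a factless fact table can be pictured as a table of dimension keys only, from which "facts" are derived by counting rows:

```python
import pandas as pd

# A factless fact table: only foreign keys, no numeric measures.
# Hypothetical example: which student attended which class on which day.
fact_attendance = pd.DataFrame({
    "date_key":    [20240101, 20240101, 20240102],
    "student_key": [10, 11, 10],
    "class_key":   [7, 7, 7],
})

# Measures are derived by counting rows, e.g. attendance per class per day
attendance_counts = (fact_attendance
                     .groupby(["date_key", "class_key"])
                     .size()
                     .rename("attendance_count"))
print(attendance_counts)
```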

7. What are the characteristics of a data warehouse?

Characteristics are as follows:

  • Subject Oriented: Data warehouses are designed to help you analyze data. For example, to learn more about your company’s sales data, you can build a warehouse that concentrates on sales. The ability to define a data warehouse by subject matter makes the data warehouse subject oriented.
  • Integrated: Integration is closely related to subject orientation. Data warehouses must put data from disparate sources into a consistent format. They must resolve such problems as naming conflicts and inconsistencies among units of measure. When they achieve this, they are said to be integrated.
  • Nonvolatile: Nonvolatile means that, once entered into the warehouse, data should not change.
  • Time Variant: In order to discover trends in business, analysts need large amounts of data. This is very much in contrast to online transaction processing (OLTP) systems, where performance requirements demand that historical data be moved to an archive. A data warehouse’s focus on change over time is what is meant by the term time variant.

8. What is OLAP?

OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.

9. What is the difference between OLTP and OLAP?

An OLTP database holds detailed, current data, and the schema used to store transactional data is the entity (normalized, usually 3NF) model; it handles a large number of short transactions. OLAP (Online Analytical Processing), by contrast, is characterized by a relatively low volume of transactions, but the queries are often very complex, involve aggregations, and run against historical data.

10. What is metadata in relation to a data warehouse?

Metadata is, by definition, “data about other data” of any type and in any medium. It is used to facilitate the understanding, management, and use of data. In a data warehouse, metadata defines data elements and their attributes, and it can also describe structures and records (for example, lengths, columns, and fields).

11. Explain different types of facts.

  • Non-additive Measures: measures that cannot be used inside any numeric aggregation function, i.e., they cannot be meaningfully summed across any dimension (for example, ratios or percentages)
  • Semi-additive Measures: measures to which only a subset of aggregation functions or dimensions can be applied (for example, an account balance that can be summed across accounts but not across time)
  • Additive Measures: measures that can be used with any aggregation function such as SUM() or AVG() across all dimensions, for example sales quantity (see the sketch after this list)
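
The sketch below illustrates the additive/semi-additive distinction with pandas; the data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical fact table with an additive measure (deposit_amount)
# and a semi-additive measure (account_balance).
fact = pd.DataFrame({
    "date":    ["2024-01-01", "2024-01-02", "2024-01-01", "2024-01-02"],
    "account": ["A", "A", "B", "B"],
    "deposit_amount":  [100.0, 50.0, 200.0, 0.0],
    "account_balance": [100.0, 150.0, 200.0, 200.0],
})

# Additive: summing across every dimension is meaningful
total_deposits = fact["deposit_amount"].sum()

# Semi-additive: balances can be summed across accounts for a single day...
balance_per_day = fact.groupby("date")["account_balance"].sum()

# ...but summing them across time would be meaningless; take the
# period-end balance per account instead.
closing_balance = (fact.sort_values("date")
                       .groupby("account")["account_balance"].last())

print(total_deposits, balance_per_day.to_dict(), closing_balance.to_dict())
```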

12. What are different types of dimension?

In a data warehouse model, dimensions can be of the following types:

  • Conformed Dimension
  • Junk Dimension
  • Degenerate Dimension
  • Role Playing Dimension

13. What is operational data store?

An operational data store is a database designed to integrate data from multiple sources for additional operations on the data. Unlike a master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

14. What is ETL?

ETL (Extract, Transform, and Load) is the data warehousing process responsible for pulling data out of source systems and placing it into the data warehouse. Extraction reads data from homogeneous or heterogeneous data sources; transformation converts the data into the proper format or structure for querying and analysis; and loading writes the transformed data into the target warehouse.
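
A minimal ETL sketch in Python is shown below; the source file name, column names, and the SQLite target are all assumptions for illustration, standing in for real source systems and a real warehouse.

```python
import sqlite3
import pandas as pd

# Extract: pull raw data from a source system (here, a hypothetical CSV export)
raw = pd.read_csv("orders_export.csv")

# Transform: fix types, standardize codes, derive columns needed for analysis
raw["order_date"] = pd.to_datetime(raw["order_date"])
raw["country"] = raw["country"].str.upper().str.strip()
raw["revenue"] = raw["quantity"] * raw["unit_price"]

# Load: write the transformed rows into the warehouse's fact table
warehouse = sqlite3.connect("warehouse.db")   # stand-in for a real data warehouse
raw.to_sql("fact_orders", warehouse, if_exists="append", index=False)
```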

15. What are Aggregate tables?

Aggregate tables contain existing warehouse data that has been grouped (summarized) to a certain level of the dimensions. It is faster to retrieve data from an aggregate table than from the original table, which has far more records. Aggregate tables therefore reduce the load on the database server and improve query performance.
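
As a small sketch (hypothetical data), an aggregate table is simply the detailed fact data pre-summarized to a coarser grain:

```python
import pandas as pd

# Detailed fact data at transaction grain
fact_sales = pd.DataFrame({
    "month":  ["2024-01", "2024-01", "2024-02"],
    "region": ["North", "North", "South"],
    "amount": [120.0, 80.0, 200.0],
})

# Aggregate table: the same data pre-summarized to month/region grain.
# Reports that only need monthly totals can query this much smaller table.
agg_sales_month_region = (fact_sales
                          .groupby(["month", "region"], as_index=False)["amount"]
                          .sum())
print(agg_sales_month_region)
```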


16. What is dimension modelling?

Dimensional modeling (DM) is the name for a set of techniques and concepts used in data warehouse design. It is considered to be different from entity-relationship (ER) modeling, and it does not necessarily involve a relational database.

17. What is slowly changing dimension?

A Slowly Changing Dimension (SCD) is a dimension that stores and manages both current and historical data over time in a data warehouse. It is considered and implemented as one of the most critical ETL tasks in tracking the history of dimension records.

18. What is data cleaning?

Data cleansing, data cleaning or data scrubbing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Typical data cleaning tasks include record matching, deduplication, and column segmentation.
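
A small sketch of typical cleansing steps in Python follows; the customer data and column names are hypothetical.

```python
import pandas as pd

# Hypothetical customer extract with typical quality problems
customers = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "name":  [" Alice ", "Alice", "BOB", None],
    "email": ["a@x.com", "a@x.com", "bob@x.com", "carol@x.com"],
})

# Standardize formats (trim whitespace, normalize capitalization)
customers["name"] = customers["name"].str.strip().str.title()

# Remove duplicate records (same customer_id and email)
customers = customers.drop_duplicates(subset=["customer_id", "email"])

# Drop (or flag) records that are missing mandatory fields
customers = customers.dropna(subset=["name"])
print(customers)
```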

19. What is real time data-warehousing?

Real-time data warehousing captures business activity data as it occurs. As soon as the business activity is complete and there is data about it, the completed activity data flows into the data warehouse and becomes available instantly. In other words, real-time data warehousing is a framework for deriving information from data as the data becomes available.

20. What is VLDB?

A very large database is a database that contains an extremely high number of tuples (database rows), or occupies an extremely large physical filesystem storage space. The most common definition of VLDB is a database that occupies more than 1 terabyte or contains several billion rows, although naturally this definition changes over time.


21. What is OLAP cube?

An OLAP data cube is a representation of data in multiple dimensions, organized into facts and dimensions. Each cell holds a measure value at the intersection of one member from each dimension, so information can be combined and examined according to its relationships. A cube can have anywhere from zero to many dimensions, each representing a specific aspect of the data.

22. What are the reasons for creating a data mart?

There are various reasons that lead to the creation of a data mart.

  • Easy access to frequently needed data
  • Provides a collective view for a group of users
  • Improves end-user response time
  • Ease of creation
  • Lower cost than implementing a full data warehouse
  • Contains only business essential data

23. What is BUS Schema?

A BUS schema is used to identify the common dimensions across business processes, i.e., to identify conformed dimensions. It contains conformed dimensions and standardized definitions of facts, defined so that they can be shared across all of the enterprise’s data marts.

24. What is a level of Granularity of a fact table?

Granularity is the level of detail represented by the data in a fact or dimension table of a data warehouse. High granularity means a minute, sometimes atomic, grade of detail, often at the level of individual transactions.

25. Explain Kimball’s four-step dimensional design process.

Steps are as follows:

  • Select the business process
  • Declare the grain
  • Identify dimensions
  • Identify the facts.

Advanced Data Warehouse Interview Questions with Answers

26. What are conformed facts?

A conformed fact is a fact that keeps the same name and definition across separate fact tables in the warehouse, so that values from those tables can be compared and combined mathematically. Conformed fact tables are fact tables that use conformed dimensions.

27. What is junk dimension?

A Junk Dimension is a dimension table consisting of attributes that do not belong in the fact table or in any of the existing dimension tables. Junk dimensions are also appropriate for placing attributes like non-generic comments from the fact table. One Junk Dimension can hold several true/false or yes/no indicators that have no correlation with each other.

28. What is type 0 slowly changing dimension?

A type 0 slowly changing dimension is a dimension table whose attribute values never change; the attributes may be labeled “original”. The Type 0 method is passive: when changes occur, no action is performed, and the values remain as they were at the time the dimension record was first inserted. In certain circumstances history is preserved with a Type 0 dimension.

29. What is conformed dimension?

Conformed dimensions are dimensions that can be used across multiple data marts in combination with multiple fact tables. A conformed dimension has a consistent structure and meaning wherever it is used; any dimension table that is shared by multiple fact tables can be a conformed dimension.

30. What are different types of factless fact table?

There are two kinds of fact tables that do not contain any facts at all:

  • The first type of factless fact table records an event, e.g., student attendance; many event-tracking tables in a dimensional data warehouse turn out to be factless
  • The second type of factless fact table is called a coverage table; coverage tables are frequently needed in a dimensional data warehouse when the primary fact table is sparse

31. What is audit dimension?

The audit dimension is a special dimension that is assembled in the back room by the ETL system for each fact table. It records information about each load (such as data quality indicators and version information), which the business can leverage to understand anomalous values and versions and to gain general confidence in the reported numbers.

32. List a few open source and proprietary ETL tools.

Open source tools are as follows:

  • Talend
  • CloverETL
  • Pentaho

Proprietary ETL tools are as follows:

  • SSIS
  • Informatica

33. What is type 1 slowly changing dimension?

In a Type 1 SCD, the new data overwrites the existing data. Thus the existing data is lost, as it is not stored anywhere else. In many tools this is the default type of dimension; you do not need to specify any additional information to create a Type 1 SCD.
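
A minimal Type 1 sketch (hypothetical dimension and incoming change) simply overwrites the attribute in place:

```python
import pandas as pd

# Existing dimension row and an incoming change
dim_customer = pd.DataFrame({
    "customer_key": [1],
    "customer_id":  ["C001"],
    "city":         ["London"],
})
incoming = {"customer_id": "C001", "city": "Paris"}

# Type 1: overwrite the attribute; the old value ("London") is lost
mask = dim_customer["customer_id"] == incoming["customer_id"]
dim_customer.loc[mask, "city"] = incoming["city"]
print(dim_customer)
```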

34. What is linked cube?

A linked cube is an MDStore cube object with a SubClassType of sbclsLinked; its contents are based on another cube that is defined and stored on a different Analysis server. Similarly, a linked measure group is based on another measure group in a different cube, within the same database or a different Analysis Services database.

35. Explain the slicing and dicing operations.

Slicing and dicing refers to the ability to take subsets of a cube and to combine and re-combine its dimensions in order to view the information from different angles. Slicing fixes one dimension to a single value (one “slice” of the cube), while dicing selects a sub-cube by restricting several dimensions at once. Picture slicing a three-dimensional cube of information to see what values are contained in its middle layer; slicing and dicing allows an end user to do the same thing with any number of dimensions.
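
The sketch below treats a tiny fact table (hypothetical data) as a cube and shows a slice and a dice:

```python
import pandas as pd

# A tiny "cube" as a fact table with three dimensions
cube = pd.DataFrame({
    "year":    [2023, 2023, 2024, 2024],
    "region":  ["EU", "US", "EU", "US"],
    "product": ["A", "A", "B", "B"],
    "sales":   [10, 20, 30, 40],
})

# Slice: fix one dimension to a single value (only year 2024)
slice_2024 = cube[cube["year"] == 2024]

# Dice: pick a sub-cube by restricting several dimensions at once
dice = cube[(cube["year"] == 2024) & (cube["region"].isin(["EU"]))]

print(slice_2024)
print(dice)
```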

36. What is a degenerate dimension?

A degenerate dimension is a key that has no attributes and hence does not join to an actual dimension table. Degenerate dimensions are very common when the grain of a fact table represents a single transaction item or line item because the degenerate dimension represents the unique identifier of the parent. Degenerate dimensions often play an integral role in the fact table’s primary key.

37. Explain the drill down and drill up operations.

Drill down and drill up (also known as data drilling) refer to navigating the hierarchical dimensions of the data stored in a data warehouse. There are two opposite directions of data drilling (see the sketch after this list):

  • Drill down is used within OLAP to zoom in to more detailed data by moving down a dimension hierarchy (for example, from year to quarter to month)
  • Drill up (also called roll up) means to zoom out stepwise to more summarized data
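
A small sketch of drilling along a Year → Quarter → Month hierarchy (hypothetical data):

```python
import pandas as pd

# Sales with a Year -> Quarter -> Month hierarchy
sales = pd.DataFrame({
    "year":    [2024, 2024, 2024, 2024],
    "quarter": ["Q1", "Q1", "Q2", "Q2"],
    "month":   ["Jan", "Feb", "Apr", "May"],
    "amount":  [100, 120, 90, 110],
})

# Drill up (roll up): summarize at a higher level of the hierarchy
by_year = sales.groupby("year")["amount"].sum()

# Drill down: zoom in to progressively more detailed levels
by_quarter = sales.groupby(["year", "quarter"])["amount"].sum()
by_month   = sales.groupby(["year", "quarter", "month"])["amount"].sum()

print(by_year, by_quarter, by_month, sep="\n")
```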

38. What is type 2 slowly changing dimension?

A Type 2 SCD retains the full history of values. When the value of a chosen attribute changes, the current record is closed. A new record is created with the changed data values and this new record becomes the current record. Each record contains the effective time and expiration time to identify the time period between which the record was active.
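
A minimal Type 2 sketch follows; the dimension contents, dates, and column names (effective_date, expiration_date, is_current) are hypothetical, but the close-old-row/insert-new-row pattern is the one described above.

```python
import pandas as pd

# Current dimension row with effective/expiration dates
dim_customer = pd.DataFrame({
    "customer_key":    [1],
    "customer_id":     ["C001"],
    "city":            ["London"],
    "effective_date":  ["2020-01-01"],
    "expiration_date": ["9999-12-31"],   # open-ended = current record
    "is_current":      [True],
})

# Incoming change: the customer moved on 2024-06-01
change_date, new_city = "2024-06-01", "Paris"

# 1) Close the current record
mask = (dim_customer["customer_id"] == "C001") & dim_customer["is_current"]
dim_customer.loc[mask, "expiration_date"] = change_date
dim_customer.loc[mask, "is_current"] = False

# 2) Insert a new current record with a new surrogate key
new_row = pd.DataFrame([{
    "customer_key": 2, "customer_id": "C001", "city": new_city,
    "effective_date": change_date, "expiration_date": "9999-12-31",
    "is_current": True,
}])
dim_customer = pd.concat([dim_customer, new_row], ignore_index=True)
print(dim_customer)
```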

39. Explain the pivoting operation in OLAP.

Pivot allows an analyst to rotate the cube in space to see its various faces. For example, cities could be arranged vertically and products horizontally while viewing data for a particular quarter. Pivoting could replace products with time periods to see data across time for a single product.
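
A small sketch of pivoting with pandas (hypothetical data), showing the same numbers viewed with the axes swapped:

```python
import pandas as pd

# Hypothetical quarterly sales by city and product
sales = pd.DataFrame({
    "city":    ["Delhi", "Delhi", "Mumbai", "Mumbai"],
    "product": ["A", "B", "A", "B"],
    "amount":  [100, 150, 200, 250],
})

# One face of the cube: cities down the side, products across the top
by_city_product = sales.pivot_table(index="city", columns="product",
                                    values="amount", aggfunc="sum")

# Rotate (pivot): swap the axes to look at the data the other way round
by_product_city = sales.pivot_table(index="product", columns="city",
                                    values="amount", aggfunc="sum")

print(by_city_product, by_product_city, sep="\n\n")
```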

40. What is role playing dimension?

Dimensions are often recycled for multiple applications within the same database. For instance, a “Date” dimension can be used for “Date of Sale”, as well as “Date of Delivery”, or “Date of Hire”. This is often referred to as a “role-playing dimension”.
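
The sketch below shows one date dimension playing two roles in a fact table; the table and column names are hypothetical.

```python
import pandas as pd

# One physical date dimension
dim_date = pd.DataFrame({
    "date_key": [20240101, 20240105],
    "calendar_date": ["2024-01-01", "2024-01-05"],
})

# Fact table referencing the same dimension in two different roles
fact_orders = pd.DataFrame({
    "order_id": [1],
    "order_date_key":    [20240101],
    "delivery_date_key": [20240105],
})

# Join the single date dimension twice, once per role
report = (fact_orders
          .merge(dim_date.add_prefix("order_"), on="order_date_key")
          .merge(dim_date.add_prefix("delivery_"), on="delivery_date_key"))
print(report[["order_id", "order_calendar_date", "delivery_calendar_date"]])
```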

41. Explain the importance of a data warehouse in BI.

The need for a data warehouse in BI becomes more acute as your organization scales out and OLTP performance becomes an issue. The OLTP workload may be mission-critical, where optimal performance is paramount, so you cannot afford to put the reporting workload on top of the OLTP workload on the same server. In that situation you periodically copy the OLTP data into a data warehouse and offload your reporting needs to the DW.

42. What are different methods of loading dimension table?

The following are the methods of loading dimension tables:

  • Conventional Load: all the table constraints are checked against the data before it is loaded.
  • Direct Load (or Faster Load): all constraints are disabled and the data is loaded directly; the data is checked against the table constraints afterwards, and bad data is not indexed.

43. What are loops in data warehousing?

In data warehousing, loops can exist between tables, i.e., the join paths between tables form a cycle. If there is a loop between tables, query generation takes more time and creates ambiguity, so loops between tables should be avoided.

44. What is type 3 slowly changing dimension?

A Type 3 SCD stores two versions of values for certain selected level attributes. Each record stores the previous value and the current value of the selected attribute. When the value of any of the selected attributes changes, the current value is stored as the old value and the new value becomes the current value.
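
A minimal Type 3 sketch (hypothetical data and column names) keeps the current and previous value side by side:

```python
import pandas as pd

# Type 3: keep the current and the previous value as separate columns
dim_customer = pd.DataFrame({
    "customer_id":   ["C001"],
    "current_city":  ["London"],
    "previous_city": [None],
})

# The customer moves: shift current -> previous, then store the new value
mask = dim_customer["customer_id"] == "C001"
dim_customer.loc[mask, "previous_city"] = dim_customer.loc[mask, "current_city"]
dim_customer.loc[mask, "current_city"] = "Paris"
print(dim_customer)
```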

45. What are the benefits of Dimensional Modeling?

Benefits of the dimensional model are the following:

  • Understandability: Compared to the normalized model, the dimensional model is easier to understand and more intuitive
  • Query performance: Dimensional models are more denormalized and optimized for data querying
  • Extensibility: Dimensional models are scalable and easily accommodate unexpected new data

46. What is the difference between static dimension and rapidly changing dimension?

A static dimension is not extracted from the original data source but is created within the context of the data warehouse. It can be loaded manually (for example, with status codes) or generated by a procedure, such as a Date or Time dimension. A rapidly changing dimension, by contrast, is a dimension whose attribute values change frequently; it is often the result of poor decisions during the requirements analysis and data modeling stages of a data warehousing project.

47. What are different types of OLAP server?

In the OLAP world, there are three different types:

  • MOLAP: This is the more traditional way of OLAP analysis. In MOLAP, data is stored in a multidimensional cube.
  • ROLAP: This methodology relies on manipulating the data stored in the relational database to give the appearance of traditional OLAP’s slicing and dicing functionality.
  • HOLAP: HOLAP technologies attempt to combine the advantages of MOLAP and ROLAP. For summary-type information, HOLAP leverages cube technology for faster performance. When detail information is needed, HOLAP can “drill through” from the cube into the underlying relational data.

48. What is galaxy schema?

A galaxy schema, also known as a fact constellation schema, is a combination of many data marts; it can be seen as a combination of star and snowflake schemas. A fact constellation has multiple fact tables that share dimension tables.

49. Explain core dimension and dirty dimension.

A core dimension is a dimension table that is dedicated to a single fact table (or data mart). If a record occurs more than once in a dimension table, differing only in a non-key attribute, such a dimension is called a dirty dimension. Inaccurate data is often termed dirty data and is characterized as missing data, wrong data, or compound data with different representations.

50. What is the difference between offline and virtual cubes?

Offline cubes are created and managed by end-user applications and generally have little impact on data warehouse or cube design. Maintenance of offline cube data is the responsibility of the end user, who can refresh data from online cubes or update offline cubes created from local databases as necessary. Offline cubes do not interfere with normal data warehouse and cube management and maintenance.
A virtual cube is a combination of multiple cubes in one logical cube, somewhat like a relational database view that combines other views and tables. A virtual cube can include normal or linked cubes as component cubes. A virtual cube can also be based on a single cube to expose only selected subsets of its measures and dimensions.
