# Data Mining Questions and Answers – Data Cleaning and Data Integration – Set 2

This set of Data Mining Multiple Choice Questions & Answers (MCQs) focuses on “Data Cleaning and Data Integration – Set 2”.

1. Which of the following is a factor in data discrepancy?
a) Redundant tuple
b) Reversed data
c) Data decay
d) Copied tuple

Explanation: When stored data is no longer valid or useful, it leads to discrepancies in the data. This is known as data decay. An example of data decay is a deactivated email address in the email-address attribute.

2. Which of the following is not a factor in data discrepancy?
a) Inconsistent data representation
b) Too many attributes
c) Human errors
d) Data integration

Explanation: Data discrepancy can arise from many factors, including inconsistent data representation, human errors in data entry, system errors, and data integration. Having too many attributes is not one of these factors.

3. Data that describes the properties of data is known as _____
a) Metadata
b) Data attributes
c) Data codes
d) Data factors

Explanation: Data that gives information about other data and describes its properties is known as metadata. It is also called data about data.

4. Field overloading refers to _____
a) Using already used bits of an attribute to define new attribute
b) Using unused bits of an attribute to define new attribute
c) Using first three bits of an attribute to define new attribute
d) Using last one bit of an attribute to define new attribute

Explanation: When the unused bit portions of an already defined attribute are used to define new attributes, it is known as field overloading. It may result in errors or discrepancies in the data.
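The idea of field overloading can be sketched in a few lines of Python. The field layout here is hypothetical (an 8-bit field whose low 3 bits hold a status code, with the unused high bit later reused as a gift flag); the names `STATUS_MASK`, `GIFT_FLAG`, `pack`, and `unpack` are assumptions made for illustration.

```python
# Hypothetical layout: an 8-bit field whose low 3 bits store a status code (0-7).
STATUS_MASK = 0b00000111  # bits used by the original attribute
GIFT_FLAG = 0b10000000    # unused high bit, later reused for a new attribute


def pack(status: int, is_gift: bool) -> int:
    """Store the status and the overloaded gift flag in one integer field."""
    value = status & STATUS_MASK
    if is_gift:
        value |= GIFT_FLAG
    return value


def unpack(value: int) -> tuple[int, bool]:
    """Recover both attributes from the overloaded field."""
    return value & STATUS_MASK, bool(value & GIFT_FLAG)


packed = pack(status=5, is_gift=True)
print(packed)          # 133
print(unpack(packed))  # (5, True)
```

A program that is unaware of the overloading would read the whole field as the status and see 133 instead of 5, which is the kind of discrepancy the explanation warns about.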

5. The unique rule for an attribute specifies that _____
a) Each value of the attribute should be unique
b) Most values of the attribute should be unique
c) The first and last values of the attribute should be unique
d) The mean, median and mode values should be unique

Explanation: The unique rule on an attribute specifies that each value of the attribute must differ from every other value of that attribute. Checking this rule during data analysis helps detect discrepancies such as duplicate entries.
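A minimal sketch of a unique-rule check, assuming the attribute values are available as a plain Python list. The function name `violates_unique_rule` and the sample IDs are illustrative only.

```python
from collections import Counter


def violates_unique_rule(values):
    """Return the values that occur more than once, in sorted order."""
    counts = Counter(values)
    return sorted(v for v, n in counts.items() if n > 1)


customer_ids = [101, 102, 103, 102, 104, 101]
print(violates_unique_rule(customer_ids))  # [101, 102]
```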

6. The consecutive rule for an attribute specifies that _____
a) Each value of the attribute should be an integer
b) Each attribute value should be greater than the previous value
c) There are no missing values between the lowest and highest values
d) The mean, median and mode values should be unique

Explanation: The consecutive rule on an attribute specifies that between the highest and lowest values for an attribute, there should be no missing value. Also, all the data values of the attribute should be unique.
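The consecutive rule can be checked with a short sketch like the following, assuming the attribute holds integers (for example check numbers). The helper names are hypothetical.

```python
def missing_between(values):
    """List the integers missing between the lowest and highest values."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def satisfies_consecutive_rule(values):
    """True when the values are unique and leave no gaps between min and max."""
    return len(values) == len(set(values)) and not missing_between(values)


check_numbers = [1001, 1002, 1004, 1005]
print(missing_between(check_numbers))             # [1003]
print(satisfies_consecutive_rule(check_numbers))  # False
print(satisfies_consecutive_rule([3, 1, 2]))      # True
```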

7. The null rule for an attribute specifies _____
a) The data values should not be null
b) All the data values should be null
c) The conditions and handling of null values
d) At most one null value is allowed

Explanation: The null rule on an attribute specifies which values indicate the null condition (for example blanks, question marks, or other special strings) and how such values should be handled. An attribute may contain many null values, so it is important to handle them properly.
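One way to picture a null rule is as a set of markers plus a handling policy. The sketch below assumes a particular marker set (`NULL_MARKERS`) and a policy of replacing nulls with a default; both are assumptions for illustration, not a standard.

```python
# Assumed markers that this hypothetical null rule treats as "null".
NULL_MARKERS = {"", "?", "n/a", "na", "null", "-"}


def apply_null_rule(value, default=None):
    """Map any null marker to `default`; leave other values unchanged."""
    if value is None:
        return default
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return default
    return value


raw = ["42", "?", " N/A ", "", "17"]
cleaned = [apply_null_rule(v) for v in raw]
print(cleaned)  # ['42', None, None, None, '17']
```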

8. Data scrubbing tools do not _____
a) Use domain knowledge
b) Detect errors
c) Make corrections in the data
d) Decrease the size of the metadata

Explanation: Data scrubbing tools use domain knowledge to detect errors and make corrections in the data. They rely on parsing and fuzzy matching techniques to perform this task. Decreasing the size of the metadata is not one of their functions.
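As a rough illustration of scrubbing with domain knowledge and fuzzy matching, the sketch below uses Python's standard `difflib.get_close_matches`. The list `KNOWN_CITIES`, the function `scrub_city`, and the 0.8 cutoff are assumptions chosen for the example; real scrubbing tools are considerably more elaborate.

```python
import difflib

# Assumed domain knowledge: the list of valid city names.
KNOWN_CITIES = ["Chicago", "New York", "Vancouver", "Urbana"]


def scrub_city(value, cutoff=0.8):
    """Replace a misspelled city with its closest known name, if one is close enough."""
    candidate = value.strip().title()
    matches = difflib.get_close_matches(candidate, KNOWN_CITIES, n=1, cutoff=cutoff)
    return matches[0] if matches else value


print(scrub_city("Chicgo"))    # Chicago
print(scrub_city("vancover"))  # Vancouver
print(scrub_city("Paris"))     # Paris (no close match, left unchanged)
```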

9. Data auditing tools do not _____
a) Discover rules and relationships in the data
b) Detect data that violate certain rules
c) Use statistical analysis to find rules in the data
d) Use parsing to find rules in the data

Explanation: Data auditing tools discover rules and relationships in the data and detect the data that violate them. They use statistical analysis to find correlations in the data. Parsing is a technique used by data scrubbing tools, not by data auditing tools.
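A very simple statistical check in the spirit of data auditing is to flag values that lie far from the mean. The sketch below is only one possible check; the function name `flag_outliers` and the threshold of 2 standard deviations are assumptions.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return values more than `threshold` population standard deviations from the mean."""
    mean = statistics.mean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []
    return [v for v in values if abs(v - mean) / spread > threshold]


ages = [23, 25, 22, 24, 26, 23, 240]
print(flag_outliers(ages))  # [240]
```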

10. ETL tools _____
a) Allow the use of graphical user interface
b) Do not allow the use of graphical user interface
c) Do not allow data transformations
d) Are not used for data transformations

Explanation: ETL (Extraction, Transformation, Loading) tools allow users to specify transformations through a graphical user interface. However, these tools support only a limited number of transformations, so user-customized scripts are often used for data transformation as well.

11. Entity identification problem is _____
a) To identify equivalent entities in multiple data sources
b) To identify different entities in multiple data sources
c) To identify non useful entities in multiple data sources
d) To identify nominal entities in multiple data sources

Explanation: The entity identification problem involves identifying equivalent entities in multiple data sources. It is often encountered during data integration from multiple data sources that have different structures.
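A small sketch of entity identification, assuming metadata tells us that `cust_number` in one source corresponds to `customer_id` in another. The mapping `ATTRIBUTE_MAP`, the helper `to_common_schema`, and the sample records are all hypothetical.

```python
# Assumed metadata: how attribute names in source B map to those in source A.
ATTRIBUTE_MAP = {"cust_number": "customer_id", "full_name": "name"}


def to_common_schema(record, mapping):
    """Rename a record's attributes so both sources share one schema."""
    return {mapping.get(key, key): value for key, value in record.items()}


source_a = [{"customer_id": 1, "name": "Ann Lee"}, {"customer_id": 2, "name": "Bob Ray"}]
source_b = [{"cust_number": 2, "full_name": "Bob Ray"}, {"cust_number": 3, "full_name": "Cy Day"}]

normalized_b = [to_common_schema(r, ATTRIBUTE_MAP) for r in source_b]
ids_a = {r["customer_id"] for r in source_a}

# Records in source B that refer to an entity already present in source A.
equivalent = [r for r in normalized_b if r["customer_id"] in ids_a]
print(equivalent)  # [{'customer_id': 2, 'name': 'Bob Ray'}]
```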

12. During data integration, functional dependencies and referential constraints of source and target systems should _____
a) Match
b) Not match
c) Be merged
d) Be ignored

Explanation: During data integration, functional dependencies and referential constraints of source and target systems should be matched. The structure of the data should be carefully analyzed during data integration.
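The two kinds of constraints mentioned above can be checked on sample data before loading it into a target system. This is a minimal sketch with hypothetical helper names (`dangling_references`, `fd_violations`) and made-up records, not a description of any particular integration tool.

```python
def dangling_references(child_rows, fk, parent_rows, pk):
    """Rows whose foreign key has no matching primary key (referential constraint)."""
    parent_keys = {row[pk] for row in parent_rows}
    return [row for row in child_rows if row[fk] not in parent_keys]


def fd_violations(rows, lhs, rhs):
    """Values of `lhs` that map to more than one `rhs` value (breaks lhs -> rhs)."""
    seen = {}
    for row in rows:
        seen.setdefault(row[lhs], set()).add(row[rhs])
    return sorted(key for key, values in seen.items() if len(values) > 1)


customers = [
    {"customer_id": 1, "zip": "60601", "city": "Chicago"},
    {"customer_id": 2, "zip": "60601", "city": "Urbana"},
]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 7}]

print(dangling_references(orders, "customer_id", customers, "customer_id"))
# [{'order_id': 11, 'customer_id': 7}]
print(fd_violations(customers, "zip", "city"))  # ['60601']
```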

13. In the following contingency table, which of the following is the expected frequency of the cell representing (male, football)?

| | Likes Football | Likes Cricket | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |

a) 20
b) 25
c) 30
d) 35

Explanation: The expected frequency for the cell (male, football) is given by:
E = (count (male) * count (football))/Total
Where count (A) gives the total frequency of A
E = (50 * 70)/100 = 35
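The same calculation can be reproduced with a short Python sketch. The dictionary layout and the function name `expected_frequency` are assumptions made for illustration.

```python
# Observed counts from the contingency table, keyed by (gender, sport).
observed = {
    ("male", "football"): 30,
    ("male", "cricket"): 20,
    ("female", "football"): 40,
    ("female", "cricket"): 10,
}


def expected_frequency(observed, row, col):
    """Expected count = (row total * column total) / grand total."""
    total = sum(observed.values())
    row_total = sum(n for (r, _), n in observed.items() if r == row)
    col_total = sum(n for (_, c), n in observed.items() if c == col)
    return row_total * col_total / total


print(expected_frequency(observed, "male", "football"))  # 35.0
```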

14. How many degrees of freedom does the given data possess?

| | Likes Summer | Likes Winter | Total |
|---|---|---|---|
| Male | 30 | 70 | 100 |
| Female | 60 | 40 | 100 |
| Total | 90 | 110 | 200 |

a) 1
b) 2
c) 3
d) 4

Explanation: For a contingency table with r rows and c columns (excluding the totals), the number of degrees of freedom is given by:
D = (r – 1) * (c – 1)
In the above case, r = 2 and c = 2
D = (2 – 1) * (2 – 1) = 1
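As a quick check, the formula can be applied to the observed counts (without the total row and column). The function name `degrees_of_freedom` is hypothetical.

```python
def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a table of observed counts without totals."""
    rows = len(table)
    cols = len(table[0])
    return (rows - 1) * (cols - 1)


seasons = [[30, 70], [60, 40]]  # Male and Female rows; Summer and Winter columns
print(degrees_of_freedom(seasons))  # 1
```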

15. Which of the following is the chi square value for the given data?

| | Likes Football | Likes Cricket | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |

a) 2.2
b) 3.7
c) 5.3
d) 4.8

Explanation: The expected frequency of a cell (A, B) is given by:
E = (count (A) * count (B))/ Total
Where count (A) is the total frequency of A
E (Male, Football) = (50 * 70)/100 = 35
E (Male, Cricket) = (50 * 30)/100 = 15
E (Female, Football) = (50 * 70)/100 = 35
E (Female, Cricket) = (50 * 30)/100 = 15
Chi-square = S = (30 - 35)^2/35 + (20 - 15)^2/15 + (40 - 35)^2/35 + (10 - 15)^2/15
S = 25/35 + 25/15 + 25/35 + 25/15
S ≈ 4.76, which rounds to 4.8
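The full computation can be verified with a small Python sketch that derives the expected counts from the row and column totals. The function name `chi_square` is an assumption; libraries such as SciPy offer ready-made tests, but plain Python is enough to follow the arithmetic.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts (no totals)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic


sports = [[30, 20], [40, 10]]  # Male and Female rows; Football and Cricket columns
print(round(chi_square(sports), 2))  # 4.76
```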

Sanfoundry Global Education & Learning Series – Data Mining.
