This set of Data Mining Multiple Choice Questions & Answers (MCQs) focuses on “Data Cleaning and Data Integration – Set 2”.
1. Which of the following is a factor in data discrepancy?
a) Redundant tuple
b) Reversed data
c) Data decay
d) Copied tuple
Answer: c
Explanation: When stored data is no longer useful or valid, it creates a discrepancy in the data. This is known as data decay. An example of data decay is a deactivated email address in an email-address attribute.
2. Which of the following is not a factor in data discrepancy?
a) Inconsistent data representation
b) Too many attributes
c) Human errors
d) Data integration
Answer: b
Explanation: Data discrepancy has many causes, including inconsistent data representation, human errors during data entry, system errors, and data integration. Having too many attributes is not one of them.
3. Data that describes the properties of other data is known as _____
a) Metadata
b) Data attributes
c) Data codes
d) Data factors
Answer: a
Explanation: Data that gives information about other data and describes its properties is known as metadata. It is also called data about data.
4. Field overloading is _____
a) Using already used bits of an attribute to define a new attribute
b) Using unused bits of an attribute to define a new attribute
c) Using the first three bits of an attribute to define a new attribute
d) Using the last bit of an attribute to define a new attribute
Answer: b
Explanation: When the unused bits of an already defined attribute are used to define a new attribute, it is known as field overloading. It may result in errors or discrepancies in the data.
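As a rough illustration of field overloading, the sketch below packs a hypothetical `is_vip` flag into the unused top bit of an 8-bit `age` field. The field layout, the names `AGE_MASK` and `VIP_BIT`, and the helper functions are assumptions made for this example only.

```python
# Hypothetical layout: an 8-bit "age" field whose values never exceed 127,
# so the top bit (0x80) is unused and gets overloaded with a VIP flag.
AGE_MASK = 0x7F
VIP_BIT = 0x80


def pack(age, is_vip):
    """Store the flag in the unused top bit of the age field."""
    if not 0 <= age <= AGE_MASK:
        raise ValueError("age does not fit in 7 bits")
    return age | (VIP_BIT if is_vip else 0)


def unpack(field):
    """Recover both logical attributes from the overloaded field."""
    return field & AGE_MASK, bool(field & VIP_BIT)


packed = pack(42, True)
age, is_vip = unpack(packed)
```

A reader who does not know about the overloading would interpret `packed` directly as the age and get a wrong value, which is how this practice leads to discrepancies.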
5. The unique rule for an attribute specifies that _____
a) Each value of the attribute should be unique
b) Most values of the attribute should be unique
c) The first and last values of the attribute should be unique
d) The mean, median and mode values should be unique
Answer: a
Explanation: The unique rule on an attribute specifies that each value of the attribute must be different from every other value of that attribute. Checking this rule is an important step when analyzing the data.
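A minimal sketch of a unique-rule check is shown below. The function name `unique_rule_violations` and the sample `customer_ids` are hypothetical.

```python
from collections import Counter


def unique_rule_violations(values):
    """Return the values that occur more than once, in first-seen order."""
    counts = Counter(values)
    return [v for v in counts if counts[v] > 1]


# Hypothetical attribute values; 101 and 102 each appear twice.
customer_ids = [101, 102, 103, 102, 104, 101]
duplicates = unique_rule_violations(customer_ids)
```

An empty result means the attribute satisfies the unique rule.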
6. The consecutive rule for an attribute specifies that _____
a) Each value of the attribute should be an integer
b) Each attribute value should be greater than the previous value
c) There are no missing values between the lowest and highest values
d) The mean, median and mode values should be unique
Answer: c
Explanation: The consecutive rule on an attribute specifies that there should be no missing values between the lowest and highest values of the attribute. In addition, all values of the attribute must be unique.
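The consecutive rule can be checked roughly as follows, assuming the attribute holds integers (for example, invoice numbers). The function names and sample data are hypothetical.

```python
def consecutive_rule_gaps(values):
    """Return the integers missing between the lowest and highest values."""
    present = set(values)
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def satisfies_consecutive_rule(values):
    """True when there are no gaps and no duplicate values."""
    return len(values) == len(set(values)) and not consecutive_rule_gaps(values)


# Hypothetical attribute values; 1003 is missing.
invoice_numbers = [1001, 1002, 1004, 1005]
missing = consecutive_rule_gaps(invoice_numbers)
```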
7. The null rule for an attribute specifies _____
a) The data values should not be null
b) All the data values should be null
c) The conditions and handling of null values
d) At most one null value is allowed
Answer: c
Explanation: The null rule on an attribute specifies the conditions under which a data value is treated as null, and how null values should be handled. An attribute may contain many null values, so it is important to handle them properly.
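One possible way to express a null rule in code is sketched below. The set of null markers and the choice to map them all to a single default are assumptions for illustration; a real null rule depends on the data set.

```python
# Hypothetical null rule: these markers all mean "value not available".
NULL_MARKERS = {"", "?", "n/a", "null", "none"}


def apply_null_rule(value, default=None):
    """Map every agreed null marker to one representation."""
    if value is None:
        return default
    if isinstance(value, str) and value.strip().lower() in NULL_MARKERS:
        return default
    return value


# Hypothetical phone-number attribute with mixed null markers.
raw_phone = ["555-0101", "N/A", "", None, " ? ", "555-0199"]
cleaned = [apply_null_rule(v) for v in raw_phone]
```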
8. Data scrubbing tools do not _____
a) Use domain knowledge
b) Detect errors
c) Make corrections to the data
d) Decrease the size of the metadata
Answer: d
Explanation: Data scrubbing tools use domain knowledge to detect errors and to correct them. They use fuzzy techniques to perform this task. Decreasing the size of the metadata is not one of their functions.
9. Data auditing tools do not _____
a) Discover rules and relationships in the data
b) Detect data that violate certain rules
c) Use statistical analysis to find rules in the data
d) Use parsing to find rules in the data
Answer: d
Explanation: Data auditing tools discover rules and relationships in the data and detect the data that violate these rules. They use statistical techniques, not parsing, to find correlations in the data.
10. ETL (Extraction, Transformation, Loading) tools _____
a) Allow the use of graphical user interface
b) Do not allow the use of graphical user interface
c) Do not allow data transformations
d) Are not used for data transformations
Answer: a
Explanation: ETL (Extraction, Transformation, Loading) tools allow transformations to be specified through a graphical user interface. However, these tools support only a limited number of transformations, so user-customized scripts are often used for data transformation as well.
11. Entity identification problem is _____
a) To identify equivalent entities in multiple data sources
b) To identify different entities in multiple data sources
c) To identify non useful entities in multiple data sources
d) To identify nominal entities in multiple data sources
Answer: a
Explanation: The entity identification problem involves identifying equivalent entities in multiple data sources. It is often encountered when integrating data from multiple sources that have different structures.
12. During data integration, functional dependencies and referential constraints of source and target systems should _____
a) Match
b) Not match
c) Be merged
d) Be ignored
Answer: a
Explanation: During data integration, the functional dependencies and referential constraints of the source system should match those of the target system. The structure of the data should be carefully analyzed during data integration.
13. For the following contingency table, what is the expected frequency of the cell (male, football)?
| | Likes Football | Likes Cricket | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |
a) 20
b) 25
c) 30
d) 35
Answer: d
Explanation: The expected frequency for the cell (male, football) is given by:
E = (count(male) * count(football)) / Total
where count(A) is the total frequency of A.
E = (50 * 70) / 100 = 35
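The same arithmetic can be written as a small function. The function name `expected_frequency` is hypothetical; the numbers are the row, column, and grand totals from the table above.

```python
def expected_frequency(row_total, column_total, grand_total):
    """Expected count of a cell under the independence assumption."""
    return row_total * column_total / grand_total


# Row total for Male = 50, column total for Likes Football = 70, grand total = 100.
e_male_football = expected_frequency(50, 70, 100)
```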
14. How many degrees of freedom does the given data possess?
| | Likes Summer | Likes Winter | Total |
|---|---|---|---|
| Male | 30 | 70 | 100 |
| Female | 60 | 40 | 100 |
| Total | 90 | 110 | 200 |
a) 1
b) 2
c) 3
d) 4
Answer: a
Explanation: For a contingency table with r rows and c columns (excluding the totals), the number of degrees of freedom is given by:
D = (r - 1) * (c - 1)
In the above case, r = 2 and c = 2, so
D = (2 - 1) * (2 - 1) = 1
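As a quick check, the formula can be applied to the observed counts directly. This sketch assumes the table is passed as a list of rows without the Total row and column; the function name is hypothetical.

```python
def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1) for a table of observed counts."""
    rows = len(table)
    cols = len(table[0])
    return (rows - 1) * (cols - 1)


# Observed counts only: Male and Female rows, Summer and Winter columns.
observed = [[30, 70], [60, 40]]
dof = degrees_of_freedom(observed)
```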
15. What is the chi-square value for the given data?
| | Likes Football | Likes Cricket | Total |
|---|---|---|---|
| Male | 30 | 20 | 50 |
| Female | 40 | 10 | 50 |
| Total | 70 | 30 | 100 |
a) 2.2
b) 3.7
c) 5.3
d) 4.8
Answer: d
Explanation: The expected frequency of a cell (A, B) is given by:
E = (count(A) * count(B)) / Total
where count(A) is the total frequency of A.
E(Male, Football) = (50 * 70) / 100 = 35
E(Male, Cricket) = (50 * 30) / 100 = 15
E(Female, Football) = (50 * 70) / 100 = 35
E(Female, Cricket) = (50 * 30) / 100 = 15
The chi-square value S is the sum of (observed - expected)^2 / expected over all cells:
S = (30 - 35)^2/35 + (20 - 15)^2/15 + (40 - 35)^2/35 + (10 - 15)^2/15
S = 25/35 + 25/15 + 25/35 + 25/15
S ≈ 4.76, which rounds to 4.8
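The whole calculation can be reproduced with a short script. This is a minimal sketch in plain Python, assuming the observed counts are given as a list of rows without totals; the function name `chi_square` is hypothetical.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table of observed counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of the two attributes.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    return stat


# Rows: Male, Female. Columns: Likes Football, Likes Cricket.
observed = [[30, 20], [40, 10]]
stat = chi_square(observed)
print(round(stat, 2))  # 4.76
```

The exact value is 100/21, so the quiz answer of 4.8 is the result rounded to one decimal place.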
Sanfoundry Global Education & Learning Series – Data Mining.