# Data Mining Questions and Answers – Data Cleaning and Data Integration

This set of Data Mining Multiple Choice Questions & Answers (MCQs) focuses on “Data Cleaning and Data Integration”.

1. Which of the following methods can be used when the class label is missing for a tuple in the dataset?
a) Ignoring the tuple
b) Generating a copy of the tuple
c) Deleting the label attribute
d) Reversing the tuple

Explanation: A dataset might contain tuples with missing attribute values. Sometimes the value of the class label attribute itself is missing. In such cases, we may ignore the tuple, although this is not very effective because the tuple's remaining attribute values are discarded along with it.
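A minimal Python sketch of ignoring such tuples, assuming each tuple is stored as a dict and a missing value is represented by `None`. The attribute names and sample records are hypothetical, chosen only for illustration.

```python
# Hypothetical records; None marks a missing value.
records = [
    {"age": 25, "income": 40000, "label": "yes"},
    {"age": 31, "income": 52000, "label": None},  # class label missing
    {"age": 47, "income": None, "label": "no"},   # a non-label attribute missing
]


def drop_unlabeled(rows, label_key="label"):
    """Keep only the tuples whose class label is present."""
    return [row for row in rows if row.get(label_key) is not None]


labeled = drop_unlabeled(records)
print(len(labeled))  # 2
```

The second record is dropped entirely, so its `age` and `income` values are lost as well.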

2. Which of the following is a drawback in manual filling of missing values?
a) Incorrect data
b) Redundancy in data
c) Complex data
d) Too many missing values in the data

Explanation: When a data set has missing values for certain attributes, it may be corrected by manual filling of the data in its place. But, when there are too many missing values, manually filling in those values is very time consuming.

3. Which of the following is a drawback of filling in a global constant for the missing value?
a) It increases the data size
b) It decreases the number of missing values
c) It may project wrong trend in data
d) It is difficult to update the data

Explanation: Missing data values can also be handled by filling in a global constant in its place. But when there are many missing values, the mining program may assume that the global constant is an important concept and project a wrong trend.
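The effect can be sketched in Python as follows, assuming missing values are represented by `None` and the constant `"Unknown"` is used as the filler. The attribute values are hypothetical.

```python
from collections import Counter

# Hypothetical values of a categorical attribute; None marks a missing value.
occupations = ["engineer", None, "teacher", None, None, "engineer", None]

# Fill every missing value with the same global constant.
filled = [value if value is not None else "Unknown" for value in occupations]

# A naive frequency analysis now reports the filler as the dominant value.
most_common_value, count = Counter(filled).most_common(1)[0]
print(most_common_value, count)  # Unknown 4
```

Here the filler outnumbers every real value, so a mining program that treats it as an ordinary value would report a pattern that does not exist in the data.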

4. Which of the following is an appropriate measure of central tendency to use to fill in missing values of an attribute with a skewed distribution?
a) Mean
b) Weighted mean
c) Median
d) Geometric mean

Explanation: When a distribution is skewed, the mean may not represent an appropriate picture of the data due to its sensitivity to the outliers and extreme values. So we use median in skewed data as a measure of central tendency.
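The difference can be seen in a small Python sketch that uses the standard `statistics` module. The values are hypothetical, and `None` is assumed to mark the missing entry.

```python
import statistics

# Hypothetical skewed attribute: one extreme value, one missing value.
values = [30, 32, None, 35, 38, 40, 400]

observed = [v for v in values if v is not None]
mean_value = statistics.mean(observed)      # pulled upward by 400
median_value = statistics.median(observed)  # stays near the bulk of the data

# Fill the missing entry with the median.
filled = [v if v is not None else median_value for v in values]
print(median_value)  # 36.5
```

The mean of the observed values is above 95, which is larger than every value except the outlier, while the median of 36.5 sits among the typical values.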

5. Which of the following techniques is often used to generate the most probable value to fill in the missing value?
a) Spooling
b) Decision tree
c) Data cleaning
d) Numerosity reduction

Explanation: The missing values for an attribute can be filled in with the most probable value for that attribute. Decision trees, regression, and Bayesian techniques are some of the methods that can be used to generate the most probable value.

6. Which of the following values are considered in binning for noise smoothing for a data value?
a) Neighborhood
b) Average
c) Farthest
d) Compressed

Explanation: For smoothing data values to remove noise, binning method can be used. In binning, the data values are smoothed by taking into consideration the neighborhood data values.

7. Binning method performs _____
a) Geometric smoothing
b) Average smoothing
c) Local smoothing
d) Global smoothing

Explanation: Binning is performed to smooth out noise from data values. Because the method consults only the neighborhood of each value, that is, the other values in the same bin, it performs local smoothing.

8. In smoothing by bin means, each value in the bin is replaced by _____
a) Mean of the values in the bin
b) Median of the values in the bin
c) Mode of the values in the bin
d) Deviation of the values in the bin

Explanation: For smoothing data values to remove noise, binning method can be used. In smoothing by bin means, each value in the bin is replaced by the mean of the values in the bin.

9. In smoothing by bin boundaries, each value in the bin is replaced by _____
a) Minimum value in the bin
b) Average of the minimum and maximum value in the bin
c) Maximum value in the bin
d) Closest value between the maximum and minimum value

Explanation: For smoothing data values to remove noise, smoothing by bin boundaries method can be used. In this method, the maximum and minimum data values in the bin are taken as bin boundaries. Each value in the bin is replaced by its closest boundary value.

10. Given the data 3, 6, 1, 4, 9, 11, 8, 10, 17, which of the following is the correct representation of bins formed after equal frequency binning?
a) (1, 3, 4), (6, 8, 9), (10, 11, 17)
b) (1, 6, 4), (3, 8, 9), (10, 11, 17)
c) (1, 3, 4), (10, 8, 9), (6, 11, 17)
d) (1, 3, 9), (6, 8, 4), (10, 11, 17)

Explanation: For binning, sorted data values are used. The above data when sorted is: 1, 3, 4, 6, 8, 9, 10, 11, 17
In equal frequency binning, the values are partitioned into bins so that each bin holds an equal number of values. If we take 3 bins, then each bin will have 9/3 = 3 values in the above case.
These bins are: (1, 3, 4), (6, 8, 9), (10, 11, 17)
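The partitioning can be sketched in Python as below. The function name `equal_frequency_bins` is illustrative, and the sketch assumes the number of values divides evenly by the number of bins.

```python
def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of equal size."""
    if len(values) % n_bins != 0:
        raise ValueError("this sketch assumes an even split")
    ordered = sorted(values)
    size = len(ordered) // n_bins
    return [tuple(ordered[i:i + size]) for i in range(0, len(ordered), size)]


bins = equal_frequency_bins([3, 6, 1, 4, 9, 11, 8, 10, 17], 3)
print(bins)  # [(1, 3, 4), (6, 8, 9), (10, 11, 17)]
```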

11. Given the data 2, 4, 1, 3, 9, 13, 8, 10, 16, which of the following is the correct representation of bins formed after smoothing by bin means?
a) (2, 2, 3), (7, 7, 7), (13, 13, 13)
b) (2, 2, 2), (7, 7, 7), (13, 13, 13)
c) (2, 2, 2), (8, 8, 8), (13, 13, 13)
d) (1, 2, 2), (7, 7, 7), (11, 13, 13)

Explanation: The above data when sorted is: 1, 2, 3, 4, 8, 9, 10, 13, 16
On equal frequency binning, if we take 3 bins, each bin will have 9/3 = 3 values in the above case.
These bins are: (1, 2, 3), (4, 8, 9), (10, 13, 16)
The mean of Bin 1 = (1 + 2 + 3)/3 = 2
The mean of Bin 2 = (4 + 8 + 9)/3 = 7
The mean of Bin 3 = (10 + 13 + 16)/3 = 13
So the bins become: (2, 2, 2), (7, 7, 7), (13, 13, 13)
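The same calculation as a short Python sketch. The function name is illustrative, and the sketch assumes the values split evenly into the requested number of bins. The means come out as floats.

```python
def smooth_by_bin_means(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin means."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        mean = sum(bin_values) / len(bin_values)
        # Every value in the bin is replaced by the bin mean.
        smoothed.append(tuple(mean for _ in bin_values))
    return smoothed


result = smooth_by_bin_means([2, 4, 1, 3, 9, 13, 8, 10, 16], 3)
print(result)  # [(2.0, 2.0, 2.0), (7.0, 7.0, 7.0), (13.0, 13.0, 13.0)]
```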

12. Given the data 2, 6, 1, 4, 9, 10, 8, 11, 14, which of the following is the correct representation of bins formed after smoothing by bin boundaries?
a) (1, 1, 4), (6, 9, 9), (10, 10, 14)
b) (1, 1, 6), (6, 9, 9), (10, 10, 14)
c) (1, 1, 4), (9, 9, 9), (10, 10, 14)
d) (1, 1, 4), (6, 9, 9), (10, 14, 14)

Explanation: The above data when sorted is: 1, 2, 4, 6, 8, 9, 10, 11, 14
On equal frequency binning, if we take 3 bins, each bin will have 9/3 = 3 values in the above case.
These bins are: (1, 2, 4), (6, 8, 9), (10, 11, 14)
In smoothing by bin boundaries, the minimum and maximum values in the bin are known as the bin boundaries. Each value in the bin is replaced by the boundary value closest to it.
The boundary values of Bin 1 are 1 and 4, so 2 is replaced by 1
The boundary values of Bin 2 are 6 and 9, so 8 is replaced by 9
The boundary values of Bin 3 are 10 and 14, so 11 is replaced by 10
So the bins become: (1, 1, 4), (6, 9, 9), (10, 10, 14)
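The boundary rule can be checked with a short Python sketch, applied here to the sorted values listed in the explanation. The function name is illustrative, the sketch assumes an even split into bins, and sending ties to the lower boundary is an assumption because the rule above does not say how to break them.

```python
def smooth_by_bin_boundaries(values, n_bins):
    """Equal-frequency binning followed by smoothing by bin boundaries."""
    ordered = sorted(values)
    size = len(ordered) // n_bins  # assumes len(values) is divisible by n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        low, high = bin_values[0], bin_values[-1]
        # Replace each value by its closest boundary; ties go to the lower one.
        smoothed.append(
            tuple(low if v - low <= high - v else high for v in bin_values)
        )
    return smoothed


result = smooth_by_bin_boundaries([1, 2, 4, 6, 8, 9, 10, 11, 14], 3)
print(result)  # [(1, 1, 4), (6, 9, 9), (10, 10, 14)]
```

In the first bin, 2 lies one unit from 1 and two units from 4, so the closest-boundary rule maps it to 1.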

13. Data smoothing can also be performed by regression.
a) True
b) False

Explanation: Regression can also be used to smooth data values and remove noise. In regression, the data values are fitted to a function. Linear regression, for example, finds the best line relating two attributes, so that one attribute can be used to predict the other.
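A minimal sketch of smoothing by linear regression in plain Python, using an ordinary least-squares fit for a line. The function name and the sample points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy observations of an attribute y that depends on x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

slope, intercept = fit_line(xs, ys)

# Smoothing: replace each observed y by the value on the fitted line.
smoothed = [slope * x + intercept for x in xs]
```

For these points the fitted line has a slope of about 1.97 and an intercept of about 0.09, and the smoothed values lie exactly on that line.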

14. Which of the following is a cause of discrepancy in data?
a) Too many optional fields in data forms
b) Large size of data
c) Redundancy in data
d) Complex data

Explanation: When a user is asked to fill in data forms that have too many optional fields, the user may choose to leave the fields blank in order to avoid divulging any personal information. This creates discrepancy in data.

Sanfoundry Global Education & Learning Series – Data Mining.
