Here are the top 50 commonly asked questions in Data Mining interviews. Whether you’re just starting your preparation or need a quick refresher, these questions and answers will boost your confidence for the interview. Ranging from basic to advanced, they cover a wide array of Data Mining concepts. Practice these questions for campus and company interviews, positions from entry to mid-level experience, and competitive examinations. It’s also important to practice them to strengthen your understanding of Data Mining.
Data Mining Interview Questions with Answers
1. What is data mining?
Data mining is the computational process of discovering patterns in large data sets involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.
2. What is the difference between data mining and data warehousing?
Data mining is the process of finding patterns in a given data set. These patterns can often provide meaningful and insightful data to whoever is interested in that data. Data warehousing can be said to be the process of centralizing or aggregating data from multiple sources into one common repository.
3. What is the difference between data mining and data analytics?
Data analytics uses statistics and data science tools to analyze data, whereas data mining is the identification of correlations and patterns within data.
4. What is data purging?
Purging is the process of freeing up space in the database or of deleting obsolete data that is not required by the system. The purge process can be based on the age of the data or the type of data.
5. What are different stages of data mining?
Different stages of data mining are:
- Exploration: This stage involves collecting and preparing the data, including data cleaning and transformation. Depending on the size of the data, different analysis tools may be required. This stage helps identify the variables in the data and understand their behavior.
- Model building and validation: This stage involves applying candidate models to different data sets, comparing their predictive performance, and choosing the best one.
- Deployment: The model selected in the previous stage is applied to new data to generate predictions or estimates of the expected outcome.
6. What is MDX?
Multidimensional Expressions (MDX) is a query language for OLAP databases, much as SQL is a query language for relational databases. It is also a calculation language, with syntax similar to spreadsheet formulas.
7. What is discrete and continuous data in data mining?
Discrete data takes distinct, separate values from a finite or countable set. Example: Passport No, DOB. Continuous data can take any value within a range and is ordered. Example: Salary.
8. What is a decision tree?
A decision tree is a decision support tool that uses a tree-like graph or model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm. Decision trees are commonly used in operations research, specifically in decision analysis, to help identify a strategy most likely to reach a goal, but are also a popular tool in machine learning.
9. What is Naive Bayes Classifier?
Naive Bayes is a simple technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set. It is not a single algorithm for training such classifiers, but a family of algorithms based on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class variable.
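The independence assumption can be sketched in a few lines of Python. This is a minimal illustration for categorical features, not a reference implementation: the weather-style data is made up, and the add-one smoothing is an assumption added here so that unseen values do not produce a zero probability.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    """Count classes, per-class feature values, and distinct values per feature."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            feature_counts[(label, i)][value] += 1
    vocab_sizes = [len({row[i] for row in rows}) for i in range(len(rows[0]))]
    return class_counts, feature_counts, vocab_sizes

def predict_nb(model, row):
    """Return the class with the highest prior times product of per-feature likelihoods."""
    class_counts, feature_counts, vocab_sizes = model
    total = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Naive assumption: features are independent given the class,
            # so the per-feature likelihoods are simply multiplied.
            # Add-one smoothing is an assumption of this sketch.
            seen = feature_counts[(label, i)][value]
            score *= (seen + 1) / (count + vocab_sizes[i])
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical training data: (outlook, temperature) -> play?
rows = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_nb(rows, labels)
```

Calling predict_nb(model, ("rainy", "cool")) scores each class separately and returns the label with the larger score.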
10. What is KMeans clustering?
The KMeans algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields.
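As a rough illustration, the assign-then-update loop usually associated with KMeans (Lloyd's algorithm) can be sketched in plain Python on 1-D data. The function name kmeans_1d, the fixed starting centers, and the sample points are assumptions made for this example; library implementations differ in initialization and stopping rules.

```python
def kmeans_1d(points, centers, iterations=10):
    """Assign each point to its nearest center, then move each center to its cluster mean."""
    centers = list(centers)
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: (p - centers[i]) ** 2)
            clusters[nearest].append(p)
        # Keep the old center if a cluster ended up empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    # Inertia: sum of squared distances from each point to its nearest center.
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, inertia

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
# The number of clusters is fixed up front by the number of starting centers.
centers, inertia = kmeans_1d(points, centers=[0.0, 5.0])
```

On this data the two centers settle on the means of the two obvious groups, and the inertia is the within-cluster sum of squares the answer refers to.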
11. What is Time Series analysis in data mining?
Time series analysis comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data. Time series forecasting is the use of a model to predict future values based on previously observed values.
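One of the simplest forecasting models is a moving average, sketched below. The function name, window size, and sample series are illustrative assumptions; real forecasting work typically uses richer models that account for trend and seasonality.

```python
def moving_average_forecast(series, window=3):
    """Predict the next value as the mean of the last `window` observations."""
    recent = series[-window:]
    return sum(recent) / len(recent)

# Hypothetical monthly observations, oldest first.
observed = [10.0, 12.0, 14.0, 16.0]
next_value = moving_average_forecast(observed, window=3)
```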
12. Explain Association rules in Data mining.
Association rules are created by analyzing data for frequent if/then patterns and using the criteria support and confidence to identify the most important relationships. Support indicates how frequently the items appear in the database. Confidence indicates how often the if/then statement has been found to be true, that is, the share of records containing the "if" part that also contain the "then" part. In data mining, association rules are useful for analyzing and predicting customer behavior.
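Support and confidence are easy to compute by hand on a small example. The sketch below uses made-up market-basket transactions and evaluates the rule "if bread then milk"; the item names and helper functions are assumptions for illustration only.

```python
# Hypothetical market-basket data; item names are made up for illustration.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Among transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in 2 of 4 transactions (support 0.5), and milk appears in 2 of the 3 transactions that contain bread (confidence about 0.67).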
13. What is supervised learning?
Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal).
14. What is unsupervised learning?
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or grouping in data.
15. What is the difference between classification and regression in data mining?
In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. Training from the input data is common in both tasks.
16. What are the different functions of data mining?
Following are the different functions of data mining:
- Characterization
- Association and correlation analysis
- Classification
- Prediction
- Clustering analysis
- Evolution analysis
17. What is Data Aggregation and Generalization?
Aggregation is the technique in which, as the name suggests, summary or aggregation operations are applied to the data. For example, we can construct a data cube to analyze the data at multiple granularities. Generalization is the technique in which low-level or primitive data is replaced by higher-level concepts.
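A small sketch can make the two ideas concrete: summing amounts is the aggregation, and grouping by year instead of by quarter is a generalization to a higher-level concept. The sales records and the aggregate helper are hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records: (year, quarter, region, amount).
sales = [
    ("2023", "Q1", "north", 100),
    ("2023", "Q1", "south", 150),
    ("2023", "Q2", "north", 120),
    ("2024", "Q1", "north", 90),
]

def aggregate(rows, key):
    """Sum the amount column for each group produced by `key`."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[3]
    return dict(totals)

# Finer granularity: one total per (year, quarter).
by_quarter = aggregate(sales, key=lambda r: (r[0], r[1]))
# Coarser granularity: quarters are generalized to their year.
by_year = aggregate(sales, key=lambda r: r[0])
```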
18. What is the difference between bagged trees and random forest?
Bagging has a single parameter, which is the number of trees. All trees are fully grown (unpruned) binary trees, and at each node in the tree one searches over all features to find the feature that best splits the data at that node.
Random forests have two parameters:
- The first parameter is the same as bagging (the number of trees).
- The second parameter (unique to random forests) is mtry, which is how many features to search over at each node to find the best feature.
19. What is Cluster Analysis?
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters).
20. What is OLAP?
OLAP is an acronym for Online Analytical Processing. OLAP performs multidimensional analysis of business data and provides the capability for complex calculations, trend analysis, and sophisticated data modeling.
21. What is bagging?
Bootstrap aggregating, also called bagging, is a machine learning ensemble meta-algorithm designed to improve the stability and accuracy of machine learning algorithms used in statistical classification and regression. It also reduces variance and helps to avoid overfitting. Although it is usually applied to decision tree methods, it can be used with any type of method. Bagging is a special case of the model averaging approach.
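The idea can be sketched as: draw bootstrap samples, fit one base learner per sample, and combine the predictions by majority vote. In the sketch below the base learner is a 1-D nearest-neighbor rule chosen only to keep the code short; the data, function names, and parameters are assumptions, and decision trees would be the more usual base learner.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) records from data with replacement."""
    return [rng.choice(data) for _ in data]

def one_nn(sample, x):
    """Base learner: label of the closest training point in 1-D."""
    return min(sample, key=lambda rec: abs(rec[0] - x))[1]

def bagged_predict(data, x, n_models=25, seed=0):
    """Train one base learner per bootstrap sample and take a majority vote."""
    rng = random.Random(seed)
    votes = [one_nn(bootstrap_sample(data, rng), x) for _ in range(n_models)]
    return Counter(votes).most_common(1)[0][0]

# Hypothetical, well-separated two-class data: (feature, label).
data = [(float(i), "a") for i in range(10)] + [(100.0 + i, "b") for i in range(10)]
prediction = bagged_predict(data, x=1.0)
```

Because each model sees a slightly different resample of the data, averaging their votes smooths out the quirks of any single sample, which is where the variance reduction comes from.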
22. What is boosting?
Boosting is a machine learning ensemble meta-algorithm for primarily reducing bias, and also variance in supervised learning, and a family of machine learning algorithms which convert weak learners to strong ones.
23. Explain importance of data mining in BI.
Online business intelligence software for data mining takes advantage of web data mining and data warehousing to help you gather your information in a timelier and more valuable manner. The business intelligence software will search the trade magazines and newspapers relevant to your business to provide the growth information you need. With web data mining it can help you evaluate your performance in comparison to your competition.
24. What is the difference between clustering and classification?
Classification is a supervised learning technique used to assign a predefined label to an instance on the basis of its features. A classification algorithm therefore requires labeled training data. A classification model is built from the training data and is then used to classify new instances.
Clustering is an unsupervised technique used to group similar instances on the basis of their features. Clustering does not require labeled training data, and it does not assign a predefined label to each group.
25. What is multi-dimensional analysis?
Multi-dimensional Data Analysis (MDDA) refers to the process of summarizing data across multiple levels (called dimensions) and then presenting the results in a multi-dimensional grid format. This process is also referred to as OLAP, Data Pivot, Decision Cube, and Crosstab.
Advanced Data Mining Interview Questions with Answers
26. Explain neural networks in detail.
Neural Networks are analytic techniques modeled after the (hypothesized) processes of learning in the cognitive system and the neurological functions of the brain and capable of predicting new observations (on specific variables) from other observations (on the same or other variables) after executing a process of so-called learning from existing data. Neural Networks is one of the Data Mining techniques. The resulting “network” developed in the process of “learning” represents a pattern detected in the data.
27. What is Backpropagation in neural networks?
Backpropagation, an abbreviation for “backward propagation of errors”, is a common method of training artificial neural networks used in conjunction with an optimization method such as gradient descent. The method calculates the gradient of a loss function with respect to all the weights in the network. The gradient is fed to the optimization method which in turn uses it to update the weights, in an attempt to minimize the loss function. Backpropagation requires a known, desired output for each input value in order to calculate the loss function gradient.
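The forward pass, backward pass, and weight update can be sketched on a deliberately tiny network: one input, one hidden unit, one output, and no activation function. These simplifications, along with the learning rate and starting weights, are assumptions made to keep the chain rule visible; they are not how a practical network is built.

```python
def train(x, target, w1=0.5, w2=0.5, lr=0.1, steps=50):
    """Fit y_hat = w2 * (w1 * x) to a single (x, target) pair with gradient descent."""
    losses = []
    for _ in range(steps):
        # Forward pass.
        h = w1 * x
        y_hat = w2 * h
        error = y_hat - target
        losses.append(0.5 * error ** 2)  # squared-error loss
        # Backward pass: chain rule from the loss back to each weight.
        grad_w2 = error * h
        grad_w1 = error * w2 * x
        # Gradient descent update.
        w1 -= lr * grad_w1
        w2 -= lr * grad_w2
    return w1, w2, losses

# The known, desired output (target) is what makes the loss computable.
w1, w2, losses = train(x=1.0, target=2.0)
```

The loss recorded at each step shrinks as the product w1 * w2 approaches the target.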
28. What is the difference between statistics and data mining?
Statistics is a branch of mathematics concerning the collection and the description of data. Statistics is at the core of data mining – helping to distinguish between random noise and significant findings, and providing a theory for estimating probabilities of predictions, etc. However Data Mining is more than Statistics. DM covers the entire process of data analysis, including data cleaning and preparation and visualization of the results, and how to produce predictions in real-time, etc.
29. What is linear regression?
Linear regression is the most basic and commonly used predictive analysis. Regression estimates are used to describe data and to explain the relationship between one dependent variable and one or more independent variables. There are several linear regression analyses available to the researcher.
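For the simplest case, one dependent and one independent variable, the least-squares line can be computed directly. The sketch below uses the standard closed-form formulas; the function name and the sample data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope is the covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # generated from y = 1 + 2x
slope, intercept = fit_line(xs, ys)
```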
30. Explain nearest neighbor technique?
Nearest neighbor is a prediction technique that is quite similar to clustering. Its essence is that, to predict the value for one record, you look for records with similar predictor values in the historical database and use the prediction value from the record that is “nearest” to the unclassified record.
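A minimal sketch of the idea, assuming numeric predictors and plain Euclidean distance (the historical records and labels below are made up):

```python
import math

def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record closest to `query`."""
    features, label = min(history, key=lambda rec: math.dist(rec[0], query))
    return label

# Hypothetical historical database: (predictor values, known prediction value).
history = [
    ((1.0, 1.0), "low"),
    ((2.0, 1.5), "low"),
    ((5.0, 5.0), "high"),
    ((6.0, 5.5), "high"),
]
prediction = nearest_neighbor_predict(history, (4.0, 4.5))
```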
31. What is the difference between clustering and nearest neighbor prediction?
The main distinction between clustering and the nearest neighbor technique is that clustering is an unsupervised learning technique, while nearest neighbor is generally used for prediction, that is, as a supervised learning technique. Unsupervised learning techniques are unsupervised in the sense that, when they are run, there is no particular target driving the creation of the models the way there is for supervised learning techniques that are trying to perform prediction. In prediction, the patterns that are found in the database and presented in the model are always the most important patterns in the database for performing some particular prediction. In clustering there is no particular sense of why certain records are near to each other or why they all fall into the same cluster.
32. How is the space for clustering and nearest neighbor defined?
For clustering the n-dimensional space is usually defined by assigning one predictor to each dimension. For the nearest neighbor algorithm predictors are also mapped to dimensions but then those dimensions are literally stretched or compressed based on how important the particular predictor is in making the prediction. The stretching of a dimension effectively makes that dimension (and hence predictor) more important than the others in calculating the distance.
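Stretching a dimension can be sketched as a weighted distance. In the example below, giving the second predictor a larger weight changes which record counts as nearest; the weights and records are hypothetical and would normally be chosen from how useful each predictor is.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance where each squared difference is multiplied by its weight."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

query = (0.0, 0.0)
record_a = (3.0, 0.0)  # differs from the query only in the first predictor
record_b = (0.0, 2.0)  # differs from the query only in the second predictor

equal = (1.0, 1.0)      # both dimensions count the same
stretched = (1.0, 4.0)  # second dimension stretched, so it matters more

nearest_equal = min([record_a, record_b], key=lambda r: weighted_distance(query, r, equal))
nearest_stretched = min([record_a, record_b], key=lambda r: weighted_distance(query, r, stretched))
```

With equal weights record_b is nearest; once the second dimension is stretched, its difference costs more and record_a becomes nearest.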
33. What is the difference between Hierarchical and Non-Hierarchical Clustering?
- Hierarchical and Non Hierarchical Clustering have key differences in running time, assumptions, input parameters and resultant clusters. Typically, Non Hierarchical clustering is faster than hierarchical clustering.
- Hierarchical clustering requires only a similarity measure, while Non Hierarchical clustering requires stronger assumptions such as number of clusters and the initial centers.
- Hierarchical clustering does not require any input parameters, while Non Hierarchical clustering algorithms require the number of clusters to start running.
- Hierarchical clustering returns a much more meaningful and subjective division of clusters but Non Hierarchical clustering results in exactly k clusters.
- Hierarchical clustering algorithms are more suitable for categorical data as long as a similarity measure can be defined accordingly.
34. Explain Agglomerative hierarchical clustering?
Agglomerative clustering techniques start with as many clusters as there are records where each cluster contains just one record. The clusters that are nearest each other are merged together to form the next largest cluster. This merging is continued until a hierarchy of clusters is built with just a single cluster containing all the records at the top of the hierarchy.
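The merge loop can be sketched on 1-D data. This version measures the distance between two clusters by their closest pair of points (single linkage), which is one common choice and an assumption here; the function name and sample points are also illustrative.

```python
def agglomerate(points):
    """Single-linkage agglomerative clustering on 1-D points; returns the merge history."""
    clusters = [[p] for p in points]  # start with one cluster per record
    history = []
    while len(clusters) > 1:
        best = None
        # Find the pair of clusters that are nearest each other.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), d))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return history

history = agglomerate([1, 2, 6, 7, 20])
```

Each entry in history is one level of the hierarchy: the merged cluster and the distance at which the merge happened. The last entry contains all the records.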
35. Explain Divisive hierarchical clustering?
Divisive clustering techniques take the opposite approach from agglomerative techniques. These techniques start with all the records in one cluster and then try to split that cluster into smaller pieces and then in turn to try to split those smaller pieces.
36. Explain Non-Hierarchical Clustering in detail?
Non-Hierarchical clustering algorithms generate various partitions and then evaluate them by some criterion. They are also referred to as nonhierarchical as each instance is placed in exactly one of k mutually exclusive clusters. Because only one set of clusters is the output of a typical Non-Hierarchical clustering algorithm, the user is required to input the desired number of clusters (usually called k). There are two main non-hierarchical clustering techniques.
37. Where can decision trees be used?
- The decision tree technology can be used for exploration of the dataset and business problem.
- Decision tree technology has been used for preprocessing data for other prediction algorithms.
- Some forms of decision trees were initially developed as exploratory tools to refine and preprocess data for more standard statistical techniques like logistic regression.
38. When do decision tree algorithms stop growing the tree?
Most decision tree algorithms stop growing the tree when one of three criteria is met:
- The segment contains only one record. (There is no further question that you could ask which could further refine a segment of just one.)
- All the records in the segment have identical characteristics. (There is no reason to continue asking further questions since all the remaining records are the same.)
- The improvement is not substantial enough to warrant making the split.
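The three checks can be sketched as a single function. The improvement measure and the min_improvement threshold are hypothetical placeholders; real algorithms each define their own split quality score and cutoff.

```python
def should_stop(records, best_improvement, min_improvement=0.01):
    """Return the reason to stop splitting this segment, or None to keep growing."""
    if len(records) <= 1:
        return "single record"
    if all(r == records[0] for r in records):
        return "identical records"
    if best_improvement < min_improvement:
        return "improvement too small"
    return None

# Hypothetical segments of (age band, income band) records.
reason_single = should_stop([("young", "high")], best_improvement=0.5)
reason_same = should_stop([("young", "high"), ("young", "high")], best_improvement=0.5)
reason_small = should_stop([("young", "high"), ("old", "low")], best_improvement=0.001)
keep_growing = should_stop([("young", "high"), ("old", "low")], best_improvement=0.2)
```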
39. Explain ID3 algorithm?
ID3 is a non-incremental algorithm, meaning it derives its classes from a fixed set of training instances. An incremental algorithm revises the current concept definition, if necessary, with a new sample. The classes created by ID3 are inductive, that is, given a small set of training instances, the specific classes created by ID3 are expected to work for all future instances. The distribution of the unknowns must be the same as the test cases. Induction classes cannot be proven to work in every case since they may classify an infinite number of instances. Note that ID3 (or any inductive algorithm) may misclassify data.
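ID3 is usually described as choosing, at each node, the attribute with the highest information gain, that is, the largest drop in entropy of the class labels. The sketch below computes that quantity for a made-up training set; it illustrates the selection criterion only and is not a full ID3 implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    total = len(labels)
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

# Hypothetical fixed training set: outlook fully determines the class, windy does not.
rows = [
    {"outlook": "sunny", "windy": False},
    {"outlook": "sunny", "windy": True},
    {"outlook": "rainy", "windy": False},
    {"outlook": "rainy", "windy": True},
]
labels = ["no", "no", "yes", "yes"]
gain_outlook = information_gain(rows, labels, "outlook")
gain_windy = information_gain(rows, labels, "windy")
```

Since outlook separates the classes perfectly and windy tells us nothing, a gain-based learner would split on outlook first.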
40. What is CART?
CART stands for Classification and Regression Trees and is a data exploration and prediction algorithm. In building the CART tree each predictor is picked based on how well it teases apart the records with different predictions. One of the great advantages of CART is that the algorithm has the validation of the model and the discovery of the optimally general model built deeply into the algorithm. The CART algorithm is relatively robust with respect to missing data.
41. Where to use neural networks?
- Neural networks of various kinds can be used for clustering and prototype creation
- Neural networks can be used for outlier analysis
- Neural networks can be used for feature extraction
42. What is CHAID?
Chi-square Automatic Interaction Detector (CHAID) was a technique created by Gordon V. Kass in 1980. CHAID is a tool used to discover the relationship between variables. CHAID analysis builds a predictive model, or tree, to help determine how variables best merge to explain the outcome in the given dependent variable. CHAID is a type of decision tree technique, based upon adjusted significance testing.
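The chi-square statistic behind the name can be sketched for a contingency table of predictor category versus outcome. This shows only the Pearson statistic, under the assumption that every expected count is nonzero; the category merging and the significance adjustment that CHAID applies are not shown, and the tables are made up.

```python
def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(r) for r in table]
    col_totals = [sum(c) for c in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: predictor category; columns: outcome (hypothetical counts).
unrelated = chi_square([[10, 10], [10, 10]])
strongly_related = chi_square([[20, 0], [0, 20]])
```

A larger statistic means the predictor and the outcome are less likely to be independent, which is what makes a predictor attractive for a split.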
43. What is the difference between rule induction and decision tree?
Decision trees also produce rules but in a very different way than rule induction systems. Decision trees produce rules that are mutually exclusive and collectively exhaustive with respect to the training database while rule induction systems produce rules that are not mutually exclusive and might be collectively exhaustive.
44. What is the difference between KDD and data mining?
KDD is the overall process of extracting knowledge from data while Data Mining is a step inside the KDD process, which deals with identifying patterns in data. In other words, Data Mining is only the application of a specific algorithm based on the overall goal of the KDD process.
45. What is predictive modelling?
Predictive modeling is the process of creating, testing and validating a model to best predict the probability of an outcome. A number of modeling methods from machine learning, artificial intelligence, and statistics are available in predictive analytics software solutions for this task.
46. What is the difference between data analysis and data mining?
Data analysis means studying the patterns in the data or finding the similarities between data. Data mining concerns what you make of those similarities and patterns once they are found, and how you present them to the party who wanted the data analyzed in the first place.
47. What is the purpose of EDA?
In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.
48. What is hypothesis testing?
A statistical hypothesis is a hypothesis that is testable on the basis of observing a process that is modeled via a set of random variables. A statistical hypothesis test is a method of statistical inference. Hypothesis testing uses statistics to assess how compatible the observed data are with a given hypothesis. The usual process of hypothesis testing consists of four steps.
49. What is web scraping?
Web scraping is a technique for extracting large amounts of data from websites and saving it to a local file on your computer or to a database in table format. It automates this process, so that instead of manually copying the data from websites, web scraping software performs the same task in a fraction of the time.
50. What are the best data mining tools?
Tools are as follows:
- Salford Systems Tools (CART, Random Forest, MARS, TreeNet)
- SAS Enterprise Miner/Text Miner
- SPSS Clementine
- Megaputer Intelligence PolyAnalyst