Data Science Interview Questions

Here are the top 50 commonly asked questions in Data Science interviews. Whether you’re just starting your preparation or need a quick refresher, these questions and answers will boost your confidence for the interview. Ranging from basic to advanced, they cover a wide array of data science concepts. Practice them for campus and company interviews, entry- to mid-level positions, and competitive examinations; working through them will also strengthen your understanding of data science.

Data Science Interview Questions with Answers

1. What is data science?

Data Science is an interdisciplinary field concerned with processes and systems for extracting knowledge or insights from data in various forms, either structured or unstructured. It is a continuation of data analysis fields such as statistics, data mining, and predictive analytics.

2. What is data analytics?

Analytics is about applying a mechanical or algorithmic process to derive insights, for example by running through various data sets looking for meaningful correlations between them. Firms may apply analytics to business data to describe, predict, and improve business performance.

3. What is text analytics?

Text Analytics refers to the process of deriving high-quality information from text. High-quality information is typically derived through the discovery of patterns and trends by means such as statistical pattern learning.

4. What is machine learning?

Machine learning is a subfield of computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a “Field of study that gives computers the ability to learn without being explicitly programmed”.

5. What is big data analytics?

Big data analytics is the process of examining large data sets containing a variety of data types — i.e., big data — to uncover hidden patterns, unknown correlations, market trends, customer preferences and other useful business information.

6. What is SAS data analytics?

SAS (Statistical Analysis System) is a software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics. It has long been one of the largest market-share holders in advanced analytics.


7. What is unstructured data?

Unstructured Data refers to information that either does not have a pre-defined data model or is not organized in a pre-defined manner. Unstructured information is typically text-heavy, but may contain data such as dates, numbers, and facts as well.

8. What is the purpose of recommender systems?

Recommender systems or recommendation systems (sometimes replacing “system” with a synonym such as platform or engine) are a subclass of information filtering system that seek to predict the ‘rating’ or ‘preference’ that a user would give to an item.

9. What is predictive analytics?

Predictive analytics is used to mean predictive modeling, “scoring” data with predictive models, and forecasting. However, people are increasingly using the term to refer to related analytical disciplines, such as descriptive modeling and decision modeling or optimization.
Sentiment analysis, for instance, is a common type of predictive analytics.

10. What is descriptive analytics?

Descriptive analytics is centered around data presentation, slicing, dicing, and management oversight. It describes what happened in the past, and it carries little risk of overfitting because it reports what actually occurred rather than making predictions.

11. What is prescriptive analytics?

Prescriptive analytics uses optimization and simulation algorithms to advise on possible outcomes. It attempts to quantify the effect of future decisions in order to advise on possible outcomes before the decisions are actually made. At its best, prescriptive analytics predicts not only what will happen, but also why it will happen, providing recommendations regarding actions that will take advantage of the predictions.

12. What is the purpose of data cleaning?

Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database. Data cleansing differs from data validation in that validation almost invariably means data is rejected from the system at entry and is performed at entry time, rather than on batches of data.

13. What is logistic regression?

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. Logistic regression measures the relationship between the categorical dependent variable and one or more independent variables by estimating probabilities using a logistic function, which is the cumulative logistic distribution.
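As a small sketch of the idea (not part of the original answer), the logistic function maps any real-valued score to a probability in (0, 1); the weight `w` and bias `b` below are hypothetical parameters for a one-feature model:

```python
import math

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real number to (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, w, b):
    # Estimated probability that the outcome is 1,
    # given feature x, weight w, and bias b
    return sigmoid(w * x + b)
```

In practice the weight and bias are estimated from data, typically by maximum likelihood.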

14. What is normal distribution?

The normal distributions are a very important class of statistical distributions. All normal distributions are symmetric and have bell-shaped density curves with a single peak. It is a data distribution pattern that occurs in many natural phenomena, such as height, blood pressure, and the lengths of objects produced by machines.
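The classic bell-curve property can be illustrated with Python’s standard library; the height figures below (mean 170 cm, standard deviation 10 cm) are hypothetical:

```python
from statistics import NormalDist

# Hypothetical adult heights: mean 170 cm, standard deviation 10 cm
heights = NormalDist(mu=170, sigma=10)

# About 68% of values fall within one standard deviation of the mean
within_one_sd = heights.cdf(180) - heights.cdf(160)
```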

15. What is the difference between univariate and bivariate analysis?

Univariate analysis is perhaps the simplest form of statistical analysis. Like other forms of statistics, it can be inferential or descriptive. The key fact is that only one variable is involved. Bivariate analysis is a simple (two variable) special case of multivariate analysis.


16. What is power analysis?

Power analysis is an important aspect of experimental design. It allows us to determine the sample size required to detect an effect of a given size with a given degree of confidence.
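A rough sketch of the calculation, using the standard normal-approximation formula for a two-sample comparison (the effect size, significance level, and power defaults below are conventional textbook choices, not from the original answer):

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.8):
    # Approximate n per group for a two-sample z-test
    # (normal approximation; effect size in standard-deviation units)
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided test
    z_beta = z.inv_cdf(power)           # quantile corresponding to power
    return math.ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)
```

Smaller effects require larger samples: halving the effect size roughly quadruples the required n.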

17. What is interpolation and extrapolation?

Interpolation is estimating data points that fall within the range of the data you have, i.e., between your existing data points. Extrapolation is estimating data points beyond the range of your data.
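A minimal sketch of both ideas using a straight line through two known points (the numbers in the usage comments are illustrative):

```python
def linear_estimate(x0, y0, x1, y1, x):
    # Fit a line through (x0, y0) and (x1, y1) and evaluate it at x.
    # If x lies between x0 and x1 this is interpolation;
    # outside that range it is extrapolation.
    slope = (y1 - y0) / (x1 - x0)
    return y0 + slope * (x - x0)

inside = linear_estimate(0, 0, 10, 20, 5)    # interpolation
outside = linear_estimate(0, 0, 10, 20, 15)  # extrapolation
```

Extrapolation is generally riskier, since the linear trend may not hold outside the observed range.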

18. What is Tidy Data?

A dataset is said to be tidy if it satisfies the following conditions:

  • each observation is in its own row
  • each variable is in its own column
  • each type of observational unit is contained in a single dataset.

Tidy data makes it easy to carry out data analysis.
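To illustrate the difference, here is a hypothetical city/population table reshaped from a “wide” (untidy) layout, where years are spread across columns, into a tidy “long” layout with one observation per row:

```python
# Untidy ("wide") layout: one row per city, years spread across columns
wide = [
    {"city": "Delhi", "2020": 19.0, "2021": 19.5},
    {"city": "Mumbai", "2020": 20.0, "2021": 20.4},
]

# Tidy ("long") layout: one observation per row, one variable per column
tidy = [
    {"city": row["city"], "year": year, "population": row[year]}
    for row in wide
    for year in ("2020", "2021")
]
```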

19. What is data wrangling?

Data wrangling is loosely the process of manually converting or mapping data from one “raw” form into another format that allows for more convenient consumption of the data with the help of semi-automated tools.


20. What is a categorical variable?

A categorical variable (sometimes called a nominal variable) is one that has two or more categories, but there is no intrinsic ordering to the categories. For example, gender is a categorical variable having two categories (male and female) with no intrinsic ordering. A purely categorical variable simply allows you to assign categories, but you cannot clearly order the categories.

21. What is null hypothesis?

The null hypothesis starts by assuming that your hypothesis is false: it is the default position that there is no effect or relationship. You then look for evidence that contradicts it, so that rejecting the null hypothesis (the negative of a negative) supports your original idea. By contrast, your hypothesis is the idea you believe is right and for which you seek supporting evidence.

22. What is the difference between supervised and unsupervised learning?

Supervised learning is when the data you feed your algorithm is “tagged” to help your logic make decisions.
Example: Bayes spam filtering, where you have to flag an item as spam to refine the results.

Unsupervised learning are types of algorithms that try to find correlations without any external inputs other than the raw data.
Example: data mining clustering algorithms.

23. What is exploratory data analysis?

In statistics, exploratory data analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond the formal modeling or hypothesis testing task.

24. Enlist some of the data science applications.

Using data science, companies have become intelligent enough to push & sell products as per customer’s purchasing power & interest. Applications and uses are as follows:

  • Internet search
  • Recommender systems
  • Image recognition
  • Speech recognition
  • Price Comparison Websites

25. What is forensic analytics?

Forensic analytics is the procurement and analysis of electronic data to reconstruct, detect, or otherwise support a claim of financial fraud. The main steps in forensic analytics are (a) data collection, (b) data preparation, (c) data analysis, and (d) reporting.

Advanced Data Science Interview Questions with Answers

26. Explain what regularization is and why it is useful?

Regularization is the process of adding a tuning parameter to a model to induce smoothness and prevent overfitting. This is most often done by adding a penalty term based on the norm of the weight vector, multiplied by a constant; the norm is usually L1 (lasso) or L2 (ridge), but can in fact be any norm. The model is then fit to minimize the regularized loss function on the training set.
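A minimal sketch of an L2 (ridge) regularized loss, assuming a mean-squared-error base loss and a hypothetical penalty strength `lam`:

```python
def ridge_loss(y_true, y_pred, weights, lam=0.1):
    # Mean squared error plus an L2 (ridge) penalty on the weights.
    # Larger lam shrinks the weights more aggressively.
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    l2_penalty = lam * sum(w ** 2 for w in weights)
    return mse + l2_penalty
```

Replacing the squared weights with absolute values would give the L1 (lasso) penalty instead.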

27. What is bias-variance trade off?

Bias–variance tradeoff (or dilemma) is the problem of simultaneously minimizing two sources of error that prevent supervised learning algorithms from generalizing beyond their training set.
The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs.
The variance is error from sensitivity to small fluctuations in the training set. High variance can cause overfitting: modeling the random noise in the training data, rather than the intended outputs.

28. What is root cause analysis?

Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable event from recurring; whereas a causal factor is one that affects an event’s outcome, but is not a root cause.

29. What is statistical power?

Statistical power (or sensitivity) of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.
To put it another way, statistical power is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error.

30. What are different resampling methods?

Resampling refers to methods for doing one of the following:

  • Estimating the precision of sample statistics by using subsets of available data or drawing randomly with replacement from a set of data points (bootstrapping)
  • Exchanging labels on data points when performing significance tests
  • Validating models by using random subsets
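The first of these, bootstrapping, can be sketched in a few lines; the sample data and resample count below are illustrative:

```python
import random

def bootstrap_means(data, n_resamples=1000, seed=0):
    # Draw resamples with replacement and record each resample's mean;
    # the spread of these means estimates the precision of the sample mean.
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        resample = [rng.choice(data) for _ in data]
        means.append(sum(resample) / len(resample))
    return means

means = bootstrap_means([1, 2, 3, 4, 5])
```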

31. What is selection bias?

Selection bias is the selection of individuals, groups or data for analysis in such a way that proper randomization is not achieved, thereby ensuring that the sample obtained is not representative of the population intended to be analyzed. It is sometimes referred to as the selection effect.

32. What do you mean by dimensionality reduction?

In machine learning and statistics, dimensionality reduction or dimension reduction is the process of reducing the number of random variables under consideration by obtaining a set of “uncorrelated” principal variables.

33. Describe in brief the data Science Process flowchart?

Here’s the step-by-step flowchart for the data science process:

  1. Data is collected from the real world (for example, from sensors in the environment).
  2. The raw data is cleaned and processed to produce a data set (typically a data table) usable for analysis.
  3. Exploratory data analysis and statistical modeling are performed.
  4. A data product is built, such as a program a retailer uses to inform new purchases based on purchase history; it may also create data and feed it back into the environment.

34. Compare and contrast R and SAS?

SAS is commercial software, whereas R is open source and can be downloaded by anyone for free.
SAS is easy to learn and provides an easy option for people who already know SQL, whereas R is a lower-level programming language, so even simple procedures can require longer code.

35. What is the difference between uniform and skewed distribution?

In probability theory and statistics, the continuous uniform distribution or rectangular distribution is a family of symmetric probability distributions such that for each member of the family, all intervals of the same length on the distribution’s support are equally probable. In probability theory and statistics, skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or even undefined. The qualitative interpretation of the skew is complicated.

36. What is cross-validation?

Cross-validation is a model evaluation method that is better than simple residual analysis. The problem with residual evaluation is that it gives no indication of how well the learner will perform when asked to make new predictions on data it has not already seen.
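The core mechanics of k-fold cross-validation can be sketched as an index-splitting helper (the fold count of 5 is a common default, not something the original answer specifies):

```python
def k_fold_indices(n, k=5):
    # Split indices 0..n-1 into k roughly equal folds;
    # each fold serves once as the held-out validation set
    # while the remaining folds are used for training.
    indices = list(range(n))
    fold_size, remainder = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds
```

In practice the data would be shuffled first, and a model would be trained and scored k times, once per held-out fold.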

37. How is data visualization important in data science?

There are many reasons to use data visualization, but these are a few of the top benefits:

  • Improve Understanding: The data purpose is clear and you can dig into the details directly on the screen to quickly obtain awareness about any problems and discover a new perception.
  • Maintain Focus: Data visualizations can be strikingly appealing. The mixture of data, visuals, and interactivity creates a method that can inform, engage, and influence viewers. Not many people enjoy staring at enormous data sets, or rows and rows of numbers, and fewer can make sense of them without some type of data visualization tool, even if it’s only a pivot table.
  • Tackle the growing volume of data: With huge amounts of incoming data, businesses need to transform the data into simple visuals for effective evaluation. Interaction with large data sets accelerates the analysis process.
  • Improve Decision-Making: By generating and analyzing data visualizations, you can quickly identify dependencies, trends, and abnormalities that otherwise may go unnoticed.

38. What do you mean by Multicollinearity?

Multicollinearity is a problem that you can run into when fitting a regression model or other linear model. It refers to predictors that are correlated with other predictors in the model. Unfortunately, the effects of multicollinearity can feel murky and intangible, which makes it unclear whether it is important to fix.
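A quick first check for collinearity between two predictors is their Pearson correlation; values near +1 or −1 are a warning sign. A minimal pure-Python sketch (real workflows would typically use a library, and would also examine variance inflation factors):

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation between two predictors;
    # values near +1 or -1 signal strong collinearity.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```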

39. What is deep learning?

Deep learning refers to artificial neural networks that are composed of many layers. It’s a growing trend in ML due to some favorable results in applications where the target function is very complex and the datasets are large.

40. Why do we need support vector machines?

A support vector machine (SVM) is a supervised machine learning algorithm that can be used for classification or regression problems. It uses a technique called the kernel trick to transform the data, and based on these transformations it finds an optimal boundary between the possible outputs. Simply put, it performs some complex data transformations, then figures out how to separate the data based on the labels or outputs you have defined. SVM is capable of doing both classification and regression.

41. What is the purpose of PCA method?

Principal component analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset. It’s often used to make data easy to explore and visualize. PCA is mostly used as a tool in exploratory data analysis and for making predictive models. PCA can be done by eigenvalue decomposition of a data covariance (or correlation) matrix or singular value decomposition of a data matrix, usually after mean centering the data matrix for each attribute.
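For two variables, the eigenvalue-decomposition route mentioned above can be carried out by hand, since the covariance matrix is only 2×2. A sketch (assumes the off-diagonal covariance is nonzero; real work would use a linear-algebra library):

```python
import math

def first_principal_component(xs, ys):
    # 2-D PCA: find the direction of maximum variance from the
    # sample covariance matrix of mean-centered data.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    cyy = sum((y - my) ** 2 for y in ys) / (n - 1)
    cxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Largest eigenvalue of the 2x2 covariance matrix [[cxx, cxy], [cxy, cyy]]
    trace, det = cxx + cyy, cxx * cyy - cxy ** 2
    lam = trace / 2 + math.sqrt((trace / 2) ** 2 - det)
    # Corresponding (unnormalized) eigenvector
    return lam, (cxy, lam - cxx)
```

For perfectly correlated data the principal direction comes out along the diagonal, as expected.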

42. What is the purpose of latent semantic indexing?

Latent semantic indexing (LSI) is an indexing and retrieval method that uses a mathematical technique called singular value decomposition (SVD) to identify patterns in the relationships between the terms and concepts contained in an unstructured collection of text.

43. Why is the central limit theorem important?

The central limit theorem concerns the sampling distribution of the sample mean. We may ask about the overall shape of this sampling distribution. The central limit theorem says that it is approximately normal (commonly known as a bell curve) regardless of the shape of the underlying population, provided the sample size is large enough. The central limit theorem is of fundamental importance for inferential statistics.
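The theorem is easy to see by simulation: even though single uniform draws are flat, the means of repeated samples pile up in a bell shape around the population mean (0.5 here). The sample counts below are arbitrary choices for the demonstration:

```python
import random
from statistics import mean

def sample_means(n_samples=2000, sample_size=30, seed=0):
    # Draw many uniform(0, 1) samples; collect each sample's mean.
    # By the CLT these means are approximately normal around 0.5.
    rng = random.Random(seed)
    return [mean(rng.random() for _ in range(sample_size))
            for _ in range(n_samples)]

ms = sample_means()
```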

44. How can you assess a good logistic model?

There are various methods to assess the results of a logistic regression analysis:

  • Using Classification Matrix to look at the true negatives and false positives
  • Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening
  • Lift helps assess the logistic model by comparing it with random selection.
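The first bullet, the classification matrix, reduces to four counts. A minimal sketch for binary labels (0/1):

```python
def confusion_counts(y_true, y_pred):
    # Tally true/false positives and negatives for a binary classifier.
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}
```

Metrics such as accuracy, precision, and recall are all simple ratios of these four counts.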

45. What is the purpose of gradient descent?

Gradient descent is a first-order optimization algorithm. To find a local minimum of a function using gradient descent, one takes steps proportional to the negative of the gradient (or of the approximate gradient) of the function at the current point.
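The update rule can be sketched in a few lines; the learning rate, step count, and the quadratic example function below are illustrative choices:

```python
def gradient_descent(grad, x0, lr=0.1, steps=100):
    # Repeatedly step in the direction opposite the gradient
    # to approach a local minimum of the function.
    x = x0
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2 * (x - 3);
# the minimum is at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
```

The same idea extends to many dimensions, where `x` and the gradient become vectors.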

46. What is the goal of A/B Testing?

It is a statistical hypothesis test for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify changes to a web page that maximize or increase an outcome of interest. An example could be identifying the variant that improves the click-through rate for a banner ad.
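One common way to analyze such an experiment (an assumption here, since the original answer does not name a test) is a two-proportion z-test on the click-through rates of variants A and B:

```python
import math
from statistics import NormalDist

def ab_test_p_value(conv_a, n_a, conv_b, n_b):
    # Two-proportion z-test: are the conversion rates of A and B
    # significantly different?
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

A small p-value (conventionally below 0.05) suggests the difference between the variants is unlikely to be due to chance alone.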

47. What is the difference between Cluster and Systematic Sampling?

Cluster sampling is a sampling technique used when “natural” but relatively heterogeneous groupings are evident in a statistical population. It is often used in marketing research. In this technique, the total population is divided into these groups (or clusters) and a simple random sample of the groups is selected.
Systematic sampling is a type of probability sampling method in which sample members from a larger population are selected according to a random starting point and a fixed, periodic interval.

48. What is the difference between ICA and PCA?

In PCA, the basis you want to find is the one that best explains the variability of your data. The first vector of the PCA basis is the one that best explains the variability of your data (the principal direction); the second vector is the second-best explanation and must be orthogonal to the first one, and so on.
In ICA, the basis you want to find is the one in which each vector is an independent component of your data. You can think of your data as a mix of signals, and the ICA basis will then have a vector for each independent signal.

49. What is the difference between Validation Set and Test set?

Validation set: a set of examples used to tune the parameters (i.e., architecture, not weights) of a classifier, for example to choose the number of hidden units in a neural network.
Test set: a set of examples used only to assess the performance (generalization) of a fully specified classifier.
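A typical way to carve out these sets is a shuffled three-way split; the 60/20/20 fractions below are a common convention, not something the original answer prescribes:

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    # Shuffle, then carve off a test set (touched only for the final
    # assessment) and a validation set (used for tuning); the rest trains.
    items = list(data)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test
```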

50. What are the factors to find the most accurate recommendation algorithms?

Factors to find the most accurate recommendation algorithms are as follows:

  1. Diversity
  2. Recommender Persistence
  3. Privacy
  4. User Demographics
  5. Robustness
  6. Serendipity
  7. Trust
  8. Labeling

Manish Bhojasia - Founder & CTO at Sanfoundry
Manish Bhojasia, a technology veteran with 20+ years @ Cisco & Wipro, is Founder and CTO at Sanfoundry. He lives in Bangalore, and focuses on development of Linux Kernel, SAN Technologies, Advanced C, Data Structures & Algorithms.
