Data Science Interview Questions
Master the most commonly asked interview questions with comprehensive, expert-crafted answers designed to help you succeed.
What is Data Science?
Data Science is a multidisciplinary field focused on extracting meaningful insights and knowledge from structured and unstructured data using various techniques from statistics, computer science, machine learning, and domain expertise.
It involves the complete data lifecycle including:
- Data Collection – Gathering raw data from various sources.
- Data Cleaning – Removing inconsistencies and preparing the data.
- Data Exploration and Analysis – Identifying patterns, trends, and anomalies.
- Modeling – Building predictive models using algorithms and machine learning.
- Interpretation and Decision Making – Drawing actionable insights to guide business or scientific decision-making.
Key Components of Data Science:
- Statistics & Probability
- Data Engineering
- Machine Learning
- Data Visualization
- Big Data Technologies
- Domain Knowledge
Example: A data scientist working in e-commerce might analyze customer behavior to recommend personalized products, improve sales, and reduce churn using predictive analytics.
Define the terms KPI, Lift, Model Fitting, Robustness, and DOE.
Below are definitions of key data science terms commonly used in analytics and model evaluation:
- KPI (Key Performance Indicator): A measurable value that demonstrates how effectively a company or individual is achieving key business objectives. KPIs help evaluate the success of an organization in reaching its targets.
- Lift: A metric that measures the effectiveness of a predictive model compared to random guessing. A lift value greater than 1 indicates that the model performs better than a random model.
- Model Fitting: The process of training a statistical or machine learning model to represent the relationship between input variables and the target output. A good fit minimizes error between predicted and actual outcomes.
- Robustness: The ability of a model or system to perform reliably under varying conditions or when exposed to noisy, missing, or unexpected data.
- DOE (Design of Experiments): A systematic method used to plan experiments in such a way that data collected can effectively be used to evaluate the effects of various factors on a response variable.
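To make the lift definition concrete, here is a minimal sketch that compares a model's hit rate among its top-scoring cases against the overall baseline rate; the scores, outcomes, and cutoff k are made up for illustration.

```python
import numpy as np

# Hypothetical outcomes and model scores (not from any real model).
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0, 0, 0])
y_score = np.array([0.9, 0.2, 0.8, 0.7, 0.3, 0.1, 0.6, 0.4, 0.2, 0.05])

k = 3                                         # size of the targeted (top-scoring) group
top_k = np.argsort(y_score)[::-1][:k]         # indices of the k highest-scoring cases
precision_at_k = y_true[top_k].mean()         # positive rate among targeted cases
baseline_rate = y_true.mean()                 # positive rate under random targeting

lift = precision_at_k / baseline_rate         # > 1 means the model beats random guessing
print(f"Lift at top {k}: {lift:.2f}")
```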
What is the difference between Data Analytics and Data Science?
While Data Analytics and Data Science are closely related fields, they serve different purposes and require different approaches. Here's a comparison to understand the distinction:
| Data Science | Data Analytics |
|---|---|
| Focuses on uncovering patterns and insights using predictive modeling and machine learning. | Focuses on interpreting existing data to inform current decision-making processes. |
| Deals with both structured and unstructured data to solve complex problems. | Primarily deals with structured data and business intelligence tasks. |
| Uses advanced techniques like machine learning, statistical modeling, and algorithms. | Relies more on statistics, data aggregation, and visualization tools. |
| Aims to predict future trends and drive innovation. | Aims to analyze historical data for current insights. |
| Broad field covering data preparation, cleaning, modeling, and insight generation. | Subset of data science focusing on specific, goal-oriented analysis. |
What are some of the techniques used for sampling? What is the main advantage of sampling?
In data science, sampling is essential when working with large datasets, where analyzing the entire population is impractical. Sampling involves selecting a representative subset of the data to perform analysis efficiently while still gaining valuable insights.
Main Advantage of Sampling: Sampling allows analysts to work with manageable amounts of data, reducing computational time and cost while still enabling accurate and meaningful analysis.
Types of Sampling Techniques:
| Probability Sampling | Non-Probability Sampling |
|---|---|
| Simple Random Sampling: Every member has an equal chance of being selected. | Convenience Sampling: Data is collected from sources that are easiest to access. |
| Stratified Sampling: Population is divided into strata and sampled from each. | Quota Sampling: Samples are selected to meet specific quotas or criteria. |
| Cluster Sampling: Entire population is divided into clusters and a few clusters are randomly chosen. | Snowball Sampling: Existing subjects recruit future subjects from among their acquaintances. |
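As a quick illustration, stratified sampling can be done in a few lines of Pandas; the DataFrame, the "segment" column, and the 20% fraction below are hypothetical.

```python
import pandas as pd

# Hypothetical population with three unevenly sized segments.
df = pd.DataFrame({
    "segment": ["A"] * 70 + ["B"] * 25 + ["C"] * 5,
    "value": range(100),
})

# Draw 20% from each stratum so the sample preserves the segment proportions.
stratified = df.groupby("segment").sample(frac=0.2, random_state=42)
print(stratified["segment"].value_counts())
```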
What are Eigenvectors and Eigenvalues?
Eigenvectors are special vectors whose direction remains unchanged when a linear transformation is applied to them; only their magnitude is scaled. They are often normalized to unit length (magnitude 1) and are also called right eigenvectors.
Eigenvalues are scalars (coefficients) associated with eigenvectors. When a matrix transforms an eigenvector, the vector is only scaled by the corresponding eigenvalue, not rotated.
Mathematical representation: A v = λ v
Where:
- A is a square matrix
- v is the eigenvector
- λ (lambda) is the eigenvalue
This decomposition of a matrix into its eigenvectors and eigenvalues is called Eigen Decomposition.
Application: Eigenvectors and eigenvalues are widely used in Machine Learning techniques like Principal Component Analysis (PCA) to reduce the dimensionality of data while preserving the variance, helping extract significant features and patterns.
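A minimal NumPy sketch of eigen decomposition; the 2×2 matrix below is arbitrary and chosen only for illustration.

```python
import numpy as np

# Arbitrary symmetric matrix for the example.
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the eigenvectors

# Verify A v = λ v for the first eigenpair.
v, lam = eigenvectors[:, 0], eigenvalues[0]
print(np.allclose(A @ v, lam * v))             # True
```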
What is Marginal Probability?
Marginal Probability refers to the probability of a single event occurring, independent of the outcomes of any other events. It is called 'marginal' because it is obtained by summing (or integrating) over the probabilities of joint events, effectively focusing only on one variable and ignoring others.
Example: Suppose you're studying the probability of different weather conditions. If you want the probability of it raining tomorrow, regardless of wind or temperature, that is the marginal probability of rain.
Mathematically: If P(A, B) is the joint probability of events A and B, then the marginal probability of A is P(A) = Σ_B P(A, B), i.e., the joint probability summed over all possible values of B.
It is an essential concept in probability theory and statistics, especially in understanding distributions and data behavior across individual variables.
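As a small illustration, the marginal can be computed from a joint probability table by summing over the other variable; the table values below are assumed for the weather example.

```python
import pandas as pd

# Assumed joint distribution P(Weather, Wind); rows sum to the marginal over Wind.
joint = pd.DataFrame(
    {"windy": [0.10, 0.25], "calm": [0.20, 0.45]},
    index=["rain", "no_rain"],
)

marginal_weather = joint.sum(axis=1)   # P(Weather) = sum over Wind of P(Weather, Wind)
print(marginal_weather)                # rain: 0.30, no_rain: 0.70
```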
What are the Probability Axioms?
The Probability Axioms are the foundational rules that define valid probability values and behaviors in probability theory. These axioms ensure that the probability values assigned to events are mathematically consistent and meaningful.
There are three main axioms:
| Axiom | Description |
|---|---|
| 1. Non-Negativity | The probability of any event A is always greater than or equal to 0: P(A) ≥ 0 |
| 2. Normalization | The probability of the sample space (something that is certain to happen) is 1: P(S) = 1 |
| 3. Additivity | If A and B are mutually exclusive events (they cannot happen together), then the probability of A or B occurring is the sum of their individual probabilities: P(A ∪ B) = P(A) + P(B) |
These axioms serve as the basis for all probability calculations and models in statistics and data science.
What is Conditional Probability?
Conditional Probability refers to the likelihood of an event occurring given that another event has already occurred. It is a fundamental concept in probability theory that helps in understanding the dependency between events.
Mathematical Definition:
The conditional probability of event A given that event B has occurred is denoted as P(A | B).
Formula: P(A | B) = P(A ∩ B) / P(B), provided that P(B) > 0
Where:
- P(A | B): Probability of A given B
- P(A ∩ B): Joint probability of both A and B happening
- P(B): Probability of event B occurring
This formula is used when the occurrence of one event affects the probability of another. It's commonly applied in real-world situations like medical testing, risk assessment, and machine learning models.
What is Bayes’ Theorem and when do we use it in Data Science?
Bayes’ Theorem is a mathematical formula used to calculate the probability of a hypothesis based on prior knowledge or evidence. In simpler terms, it helps us update our beliefs or predictions when new data becomes available.
Bayes’ Theorem Formula: P(A | B) = [P(B | A) × P(A)] / P(B)
Where:
- P(A | B) = Probability of event A given that B has occurred (posterior)
- P(B | A) = Probability of event B given that A is true (likelihood)
- P(A) = Probability of event A (prior)
- P(B) = Probability of event B (evidence)
Usage in Data Science:
- Spam filtering: Classifying emails as spam or not based on keywords.
- Medical diagnosis: Updating the probability of a disease given new symptoms.
- Predictive modeling: Estimating the likelihood of outcomes using prior data.
Bayes’ Theorem is foundational in Naïve Bayes classifiers and is widely used in probabilistic machine learning and decision-making models.
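A quick worked example of the formula; the disease prevalence, test sensitivity, and false-positive rate below are hypothetical numbers chosen for illustration.

```python
# Hypothetical numbers for a medical-test example of Bayes' Theorem.
p_disease = 0.01             # P(A): prior probability of having the disease
p_pos_given_disease = 0.95   # P(B | A): test sensitivity (likelihood)
p_pos_given_healthy = 0.05   # false positive rate

# P(B): total probability of a positive test (evidence)
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# P(A | B): probability of disease given a positive test (posterior)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.3f}")   # about 0.161
```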
When is resampling done?
Resampling is a methodology in which repeated samples are drawn from a dataset to improve accuracy estimates and quantify the uncertainty of population parameters. It is done to ensure the model is robust by training it on different subsets of the data so that variations are handled, to validate models using random subsets (as in cross-validation or bootstrapping), and when substituting labels on data points while performing significance tests (as in permutation tests).
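As one concrete illustration, here is a minimal bootstrap sketch (a common resampling technique) that quantifies the uncertainty of a sample mean; the data are synthetic.

```python
import numpy as np

# Synthetic data standing in for an observed sample.
rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=200)

# Resample with replacement many times and collect the mean of each resample.
boot_means = [rng.choice(data, size=len(data), replace=True).mean() for _ in range(1000)]

ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({ci_low:.2f}, {ci_high:.2f})")
```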
What do you understand by Imbalanced Data?
Data is said to be highly imbalanced when it is distributed unequally across different categories. Such datasets degrade model performance because the model tends to favor the majority class, resulting in inaccurate predictions for the minority class.
What is Pandas, and why do we use it in data science?
Pandas is an open-source Python library, built on top of NumPy, that provides the DataFrame and Series data structures for fast, flexible data manipulation. In data science, Pandas is essential for working with large datasets, performing data wrangling, and conducting exploratory data analysis (EDA). Its intuitive syntax and wide range of functions make it an invaluable tool for handling time-series data, missing values, and more.
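A short sketch of typical Pandas usage for wrangling and quick EDA; the columns and values are made up.

```python
import pandas as pd

# Hypothetical order data.
df = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-05", "2024-01-20", "2024-02-03", "2024-02-15"]),
    "revenue": [120.0, None, 95.5, 210.0],
})

df.info()                                   # column types and missing-value counts
df = df.dropna(subset=["revenue"])          # basic cleaning: drop rows with missing revenue
monthly = (df.set_index("order_date")       # time-series aggregation by month (month-start bins)
             .resample("MS")["revenue"]
             .sum())
print(monthly)
```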
Explain the difference between deep copy and shallow copy in Python. Provide examples.
- Shallow Copy: Creates a new outer object, but the nested objects are still shared with the original. Useful when datasets are independent but share immutable parts to save memory.
- Deep Copy: Recursively copies all nested objects, producing an entirely independent object. Necessary for creating fully independent copies of datasets or models, especially during simulation or when manipulating nested structures without altering the original.
Example:
import copy

shallow = copy.copy(original)    # new outer object; nested objects still shared with `original`
deep = copy.deepcopy(original)   # recursively copies nested objects as well
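A quick self-contained demonstration of the behavioral difference on a nested list; the data is made up.

```python
import copy

# Nested (mutable) structure used only for illustration.
original = [[1, 2], [3, 4]]
shallow = copy.copy(original)
deep = copy.deepcopy(original)

original[0][0] = 99
print(shallow[0][0])   # 99 -> the shallow copy shares the inner lists with the original
print(deep[0][0])      # 1  -> the deep copy is fully independent
```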
How do you handle missing data?
After observing that my dataset has missing values, I first figure out how they appear. Are they represented as NaN, None, empty strings, sentinel values like -999, a combination of two or more of these, or something else?
How to handle missing data
Once I make sense of what my missing data looks like, I dig into why these values are missing, and they usually fall into three categories:
- Missing Completely At Random (MCAR): No pattern, just random gaps. These are usually safe to drop, or to impute with the mean/median, especially if there aren't many, since the missingness doesn't introduce bias.
  Example: In a survey dataset, 10% of income entries are missing due to a technical glitch that affected a random subset of responses. There's no pattern based on age, education, employment status, or anything else.
- Missing At Random (MAR): The missing data is related to other observed variables, but not to the missing value itself.
  Example: In the same dataset, 10% of income values are missing, mostly among respondents who are students. Here, the missingness is related to the occupation variable, not the actual income value. Impute based on related features like occupation, education level, or age.
- Missing Not At Random (MNAR): The reason a value is missing is tied to the value itself.
  Example: If high spenders choose not to share income, the probability of missingness increases with the income amount. This is tougher to handle, and imputation is risky here. I'll consider flagging missingness with a binary indicator (income_missing) or using models that can account for MNAR, like EM algorithms or data augmentation techniques.
Approaches to Handle Missing Data:
- Deletion (if safe):
  - Listwise: Drop rows with missing values (only when missingness is random and small).
  - Pairwise: Use available values for calculations, such as correlations.
  - Drop columns: Remove low-value features with lots of missing data.
- Simple imputation:
  - Mean/Median/Mode: Use for numeric or categorical columns, depending on distribution.
  - Arbitrary values: Fill with 0 or "Unknown" if it makes sense contextually.
  - Forward/Backward fill: Best for time series to keep temporal consistency.
- Advanced imputation:
  - KNN imputer: Fills gaps by finding similar rows using distance metrics.
  - Iterative imputer: Builds a model with other columns to estimate missing values.
  - Interpolation: Good for numeric sequences, especially when data follows a trend.
- Use missingness as a feature: If the missing value could carry a signal, I add a binary indicator column (e.g., was_missing = 1).
- Oversampling or undersampling: If missing data causes class imbalance, I use resampling to maintain a fair target distribution.
Common pitfall: Filling in values without understanding the pattern of missingness. For example, using mean imputation on MNAR data can introduce bias and weaken your model's predictive power.
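A minimal sketch of a few of these approaches using Pandas and scikit-learn; the DataFrame, column names, and n_neighbors value are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical data with gaps in both columns.
df = pd.DataFrame({
    "age": [25, np.nan, 40, 35, np.nan],
    "income": [50000, 62000, np.nan, 58000, 45000],
})

# Missingness flag, in case the gap itself carries signal.
df["income_missing"] = df["income"].isna().astype(int)

# Simple imputation: fill age with the column median.
df["age"] = df["age"].fillna(df["age"].median())

# Model-based imputation: KNN uses similar rows to fill the remaining gaps.
df[["age", "income"]] = KNNImputer(n_neighbors=2).fit_transform(df[["age", "income"]])
print(df)
```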
What is the difference between the long format data and wide format data?
The difference between long format data and wide format data comes down to how the data is structured: in wide format, identifier values do not repeat across rows, while in long format they do repeat because each observation gets its own row.
In wide format, you spread data across columns. Each variable (e.g., Jan, Feb, March) gets its own column. You'll usually see this in reports or dashboards.
In long format, data is stacked in rows. One column stores the values, and another column tells you what those values represent. This format is cleaner for grouped summaries and time series analysis.
Use case: Wide format is useful for reporting and making data visualizations. Long format is preferred for time series, grouped summaries, and plotting tools like Seaborn or ggplot.
Common pitfall: Trying to perform group-level analysis on wide-format data without reshaping it first.
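A minimal Pandas sketch of reshaping between the two formats; the store and month data is made up.

```python
import pandas as pd

# Wide format: one column per month (hypothetical sales data).
wide = pd.DataFrame({
    "store": ["North", "South"],
    "Jan": [100, 80],
    "Feb": [120, 95],
})

# Wide -> long: one row per (store, month) observation.
long = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long -> wide: back to one column per month.
wide_again = long.pivot(index="store", columns="month", values="sales")
print(long)
```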
What do you understand by imbalanced data?
Imbalanced data is a situation where the classes, labels, or categories in a dataset are not equally represented. In such datasets, one category has significantly more or fewer samples than the others. This is a common issue in supervised machine learning and deep learning, where the non-uniform distribution of samples can lead to biased outcomes and reduced model reliability.
For example, in a healthcare model, if the dataset contains far fewer samples for a rare disease compared to healthy cases, the imbalance can cause the model to predict the majority class more often, missing critical detections for the minority class.
Example scenario: In an email spam detection model, if 95% of emails are non-spam and only 5% are spam, the dataset is imbalanced towards non-spam. Such a model might achieve high accuracy simply by predicting 'non-spam' most of the time but would fail to effectively detect spam emails.
Impact: This imbalance can lead to poor performance in real-world applications, especially when detecting rare but important events, making it essential to apply techniques like resampling, SMOTE, or adjusting class weights to handle imbalance effectively.
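One possible sketch of handling imbalance with class weights, using synthetic data from scikit-learn; resampling techniques such as SMOTE would be an alternative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic dataset with roughly 5% positives to mimic an imbalanced problem.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# class_weight="balanced" penalizes mistakes on the rare class more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```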
How do you combine data from multiple sources with inconsistent formats?
When combining data from different sources with inconsistent formats, the first step is to standardize formats such as dates, column names, booleans, and numbers so all datasets align in structure and meaning.
Steps:
- Align schemas: Match corresponding columns across datasets. Drop or keep extra columns depending on their relevance.
- Unify categories: Resolve inconsistencies like "Y" vs. "Yes" to avoid analysis issues.
- Tag the source: Add a column identifying the data source for tracking and debugging.
- Merge or stack: Use concatenation if datasets share the same structure, or merging/joining if matching keys like customer IDs.
- Final clean-up: Check for duplicates, mismatched types, and broken values after combining.
Common pitfall: Merging without validating data types or keys can lead to lost or duplicated rows.
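A rough sketch of this workflow in Pandas; the two sources, column names, and value mappings below are hypothetical.

```python
import pandas as pd

# Two hypothetical sources with inconsistent column names, date formats, and booleans.
crm = pd.DataFrame({"Customer_ID": [1, 2], "Signup Date": ["2024-01-05", "2024-02-10"], "active": ["Y", "N"]})
web = pd.DataFrame({"customer_id": [2, 3], "signup_date": ["2024/02/10", "2024/03/01"], "active": ["Yes", "No"]})

def standardize(df, source):
    # Align schemas: normalize column names to snake_case.
    df = df.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))
    # Standardize formats: parse dates and unify boolean categories.
    df["signup_date"] = pd.to_datetime(df["signup_date"])
    df["active"] = df["active"].map({"Y": True, "Yes": True, "N": False, "No": False})
    # Tag the source for tracking and debugging.
    df["source"] = source
    return df

combined = pd.concat([standardize(crm, "crm"), standardize(web, "web")], ignore_index=True)
# Final clean-up: drop records duplicated across sources.
combined = combined.drop_duplicates(subset=["customer_id", "signup_date"])
print(combined)
```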
How do you build a random forest model?
A random forest is an ensemble model composed of multiple decision trees, each trained on random subsets of data and features. The final prediction is typically made by aggregating (averaging for regression or voting for classification) the outputs of all trees.
Steps to build a random forest model:
1. Randomly select k features from the total m features, where k << m.
2. Among these k features, calculate the best split for the current node.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 2 and 3 until leaf nodes are formed.
5. Repeat steps 1 to 4 n times to create n decision trees.
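In practice this is usually done with a library rather than by building trees from scratch; here is a minimal scikit-learn sketch, with the dataset and hyperparameters chosen only for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Built-in dataset used purely as an example.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators = number of trees; max_features controls the k features tried per split.
model = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```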
You are given a data set consisting of variables with more than 30 percent missing values. How will you deal with them?
When dealing with variables that have more than 30% missing values, the handling strategy depends on dataset size and the importance of the features.
Approaches:
- Large datasets: Remove rows with missing values to retain only complete records, as sufficient data remains for training.
- Small datasets: Impute missing values with statistical measures such as the mean, using Pandas functions like df.mean() and df.fillna() in Python (e.g., df.fillna(df.mean())).
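A tiny example of the imputation option with Pandas, on a hypothetical column.

```python
import pandas as pd

# Hypothetical column with gaps.
df = pd.DataFrame({"salary": [45000, None, 52000, None, 61000]})
df["salary"] = df["salary"].fillna(df["salary"].mean())   # fill gaps with the column mean
print(df)
```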
How can outlier values be treated?
Outlier treatment depends on the nature of the data and the reason for the outliers. The following approaches can be used:
- Dropping garbage values: Remove outliers if they are clearly invalid, such as non-numeric values in numeric fields. Example: Height of an adult recorded as "abc ft" is invalid and can be removed.
- Removing extreme values: If the data range is tightly clustered and a point lies far outside this range, consider removing it. Example: If most values are between 0 and 10 but one value is 100, that value can be removed.
- Model-based adjustments: Use models less sensitive to outliers, such as random forests, or try non-linear models if outliers impact linear models.
- Normalization: Scale the data so that extreme points are brought closer to the main range, reducing their influence.
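As an illustration of removing extreme values, here is a short sketch using the common 1.5×IQR rule; the data and the threshold are assumptions, not part of the original answer.

```python
import pandas as pd

# Hypothetical heights, with one clearly erroneous entry.
df = pd.DataFrame({"height_cm": [160, 165, 170, 172, 168, 540]})

q1, q3 = df["height_cm"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr   # standard 1.5 * IQR fences

df_clean = df[df["height_cm"].between(lower, upper)]   # drop the extreme value
print(df_clean)
```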