Data Analytics - bluemixtechnologies.in

Data Analytics Interview Questions and Answers | Common and Technical Questions Asked in Top IT MNCs (such as TCS, Infosys, Accenture, Zoho, Wipro, IBM, and Deloitte)

Basic Level (1–15)

1. What is Data Analytics?

Data Analytics is the process of examining datasets to draw conclusions about the
information they contain, often with the help of specialised tools and software.

2. What are the main types of Data Analytics?

Descriptive: What happened
Diagnostic: Why it happened
Predictive: What will happen
Prescriptive: What should be done

3. What are the key steps in a data analysis project?

Define objective
Collect data
Clean and preprocess data
Analyse data
Visualise results
Conclude and make recommendations

4. What is the difference between structured and unstructured data?

Structured: Organised in rows/columns (e.g., SQL tables)
Unstructured: Raw data like images, emails, videos, and social media text

5. What is data cleaning, and why is it important?

It’s the process of correcting or removing inaccurate, incomplete, or irrelevant data to
ensure quality analysis.

6. What is the role of SQL in data analytics?

SQL is used to query, filter, aggregate, and manage structured data stored in databases.

7. Explain the difference between INNER JOIN and LEFT JOIN in SQL.

INNER JOIN: Returns only matching records from both tables
LEFT JOIN: Returns all records from the left table and matching ones from the right
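
To see the difference concretely, here is a minimal sketch using Python's built-in sqlite3 module. The customers and orders tables, their columns, and the sample rows are all made up for illustration; one customer deliberately has no orders.

```python
import sqlite3

# Hypothetical tables and rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meena');
    INSERT INTO orders VALUES (10, 1, 250.0), (11, 1, 90.0), (12, 2, 40.0);
""")

# INNER JOIN keeps only customers that have at least one matching order.
inner_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

# LEFT JOIN keeps every customer; missing order columns come back as NULL (None).
left_rows = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

print(inner_rows)  # Meena is absent
print(left_rows)   # Meena appears with amount None
```

Meena has no orders, so she is missing from the INNER JOIN result but shows up in the LEFT JOIN result with a NULL amount.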

8. How do you find duplicate records in SQL?

SELECT column_name, COUNT(*)
FROM table_name
GROUP BY column_name
HAVING COUNT(*) > 1;
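
The same query can be tried end to end with Python's built-in sqlite3 module. This is only a sketch: the signups table and its email values are hypothetical stand-ins for table_name and column_name.

```python
import sqlite3

# Hypothetical table standing in for table_name / column_name above.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE signups (email TEXT);
    INSERT INTO signups VALUES
        ('a@example.com'), ('b@example.com'), ('a@example.com'),
        ('c@example.com'), ('a@example.com'), ('b@example.com');
""")

# Group identical values together and keep only groups with more than one row.
duplicates = conn.execute("""
    SELECT email, COUNT(*) AS occurrences
    FROM signups
    GROUP BY email
    HAVING COUNT(*) > 1
    ORDER BY email
""").fetchall()

print(duplicates)
```

To check for duplicates across several columns, list all of them in both SELECT and GROUP BY.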

9. What libraries in Python are used for data analysis?

Pandas – Data manipulation
NumPy – Numerical operations
Matplotlib / Seaborn – Visualisation
Scikit-learn – Machine learning

10. What is a DataFrame in Pandas?

A two-dimensional, labelled data structure, similar to a spreadsheet, in which each
column can hold a different data type. It is used to handle and analyse tabular data in Python.

11. How do you handle missing data in Pandas?

Missing values (NaN) can be detected with isna() and then either removed or filled in:

dropna() → Remove missing rows
fillna() → Fill missing values with mean/median/mode
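
A minimal sketch of both approaches, assuming pandas is installed. The DataFrame, its columns, and the choice of mean for numbers and mode for text are illustrative assumptions, not a fixed rule.

```python
import pandas as pd

# Hypothetical data with one missing age and one missing city.
df = pd.DataFrame({
    "age": [25.0, None, 31.0, 40.0],
    "city": ["Pune", "Chennai", None, "Pune"],
})

# Count missing values per column before deciding what to do.
missing_per_column = df.isna().sum()

# Option 1: drop every row that contains at least one missing value.
dropped = df.dropna()

# Option 2: fill numeric gaps with the mean and text gaps with the mode.
filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].mean())
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])

print(missing_per_column)
print(dropped)
print(filled)
```

Dropping is simplest but loses data; filling keeps every row at the cost of introducing estimated values.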

12. What is the difference between variance and standard deviation?

Variance is the average of the squared deviations from the mean; standard deviation is the
square root of variance, so it is expressed in the same units as the data.
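
A short worked example with Python's standard statistics module. The numbers are made up; note that the population and sample versions divide by n and n - 1 respectively.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # illustrative data, mean is 5

# Population variance: mean of squared deviations from the mean (divide by n).
pop_variance = statistics.pvariance(values)

# Population standard deviation: square root of the population variance.
pop_std = statistics.pstdev(values)

# Sample variance divides by n - 1 instead of n.
sample_variance = statistics.variance(values)

print(pop_variance, pop_std, sample_variance)
```

Here the squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the standard deviation is 2.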

13. What is correlation in data analytics?

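Correlation measures the strength and direction of the relationship between two variables.
The Pearson coefficient ranges from -1 (perfect negative linear relationship) to +1
(perfect positive linear relationship); 0 means no linear relationship.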
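
As a sketch, the Pearson coefficient can be computed by hand in a few lines. The function name pearson_r and the sample data are assumptions made for this example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

hours = [1, 2, 3, 4, 5]
score_up = [52, 54, 56, 58, 60]    # rises steadily with hours
score_down = [60, 58, 56, 54, 52]  # falls steadily with hours

print(pearson_r(hours, score_up))    # close to +1
print(pearson_r(hours, score_down))  # close to -1
```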

14. What is normalisation?

It’s the process of scaling numeric data into a specific range (e.g., 0–1) so that
features measured on different scales become comparable.
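
One common way to do this is min-max scaling, sketched below. The helper name min_max_scale and the salary figures are illustrative assumptions.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

salaries = [30000, 45000, 60000, 90000]
scaled = min_max_scale(salaries)
print(scaled)  # [0.0, 0.25, 0.5, 1.0]
```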

15. What is outlier detection, and how do you handle it?

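Outliers are extreme data points that differ markedly from the rest. They can be detected
using methods like IQR (Interquartile Range) or z-score analysis, and then removed, capped,
or transformed depending on whether they are errors or genuine values.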
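
A minimal sketch of the IQR rule using the standard statistics module. The function name iqr_outliers, the conventional multiplier of 1.5, and the data are assumptions for illustration.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the first or third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [12, 13, 14, 15, 15, 16, 17, 18, 95]
print(iqr_outliers(data))  # [95]
```

Whether to drop the flagged value or keep it depends on the business context: a data-entry error should go, a genuine large order probably should not.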

Intermediate Level (16–35)

16. What is Exploratory Data Analysis (EDA)?

EDA involves summarising main data characteristics using visualisation and statistics to
identify patterns and anomalies.

17. What are some common visualisation tools used in analytics?

Tableau, Power BI, Matplotlib, Seaborn, Looker.

18. What is a dashboard in Power BI?

A single-page visualisation showing KPIs and business metrics from datasets and
reports.

19. Explain the difference between OLAP and OLTP.

OLAP: Analytical, used for reporting (read-heavy)
OLTP: Transactional, used for daily operations (write-heavy)

20. What is the difference between supervised and unsupervised learning?

Supervised: Trained on labelled data (e.g., regression, classification)
Unsupervised: Finds hidden patterns (e.g., clustering)

21. What is regression analysis?

A statistical method to predict a dependent variable based on one or more independent
variables.
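
For a single independent variable, the least-squares line can be fitted by hand. This is only a sketch: fit_line and the experience and salary figures are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

years_experience = [1, 2, 3, 4, 5]  # independent variable
salary_lakhs = [3, 5, 7, 9, 11]     # dependent variable

slope, intercept = fit_line(years_experience, salary_lakhs)
predicted_for_6_years = slope * 6 + intercept
print(slope, intercept, predicted_for_6_years)
```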

22. What are categorical and numerical variables?

Categorical: Qualitative (e.g., gender, city)
Numerical: Quantitative (e.g., salary, age)

23. What is feature engineering?

The process of creating, transforming, or selecting features to improve model
performance.

24. What is multicollinearity?

When independent variables in a model are highly correlated with each other. It makes coefficient estimates unstable and hard to interpret, even if overall predictions remain reasonable.

25. What is a confusion matrix?

A table that compares predicted classes with actual classes, counting True Positives, True
Negatives, False Positives, and False Negatives.

26. Define precision and recall.

Precision: True Positives / (True Positives + False Positives), the share of positive predictions that are correct
Recall: True Positives / (True Positives + False Negatives), the share of actual positives that are found
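
Both metrics fall straight out of the confusion-matrix counts. The sketch below uses made-up labels and a hypothetical helper, confusion_counts.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN, TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive and a != positive:
            fp += 1
        elif p != positive and a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision = tp / (tp + fp)  # of everything predicted positive, how much was right
recall = tp / (tp + fn)     # of everything actually positive, how much was found
print(tp, fp, fn, tn, precision, recall)
```

With 3 true positives, 2 false positives, and 1 false negative, precision is 3/5 = 0.6 and recall is 3/4 = 0.75.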

27. What is cross-validation?

A technique to test model performance on unseen data by splitting the data into training
and testing subsets multiple times.
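
The splitting step of k-fold cross-validation can be sketched without any library. The generator k_fold_indices is a hypothetical helper, and real projects usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Each sample is used for testing exactly once; the model is trained k times and the k scores are averaged.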

28. Explain the p-value in hypothesis testing.

It is the probability of observing results at least as extreme as those actually seen,
assuming the null hypothesis is true. A p-value < 0.05 is a common threshold for
statistical significance.
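
For a small experiment the p-value can be computed exactly. The sketch below assumes a hypothetical coin that lands heads 9 times in 10 flips and tests the null hypothesis that it is fair; the function name is made up.

```python
from math import comb

def binomial_two_sided_p(successes, trials, p_null=0.5):
    """Exact two-sided p-value: total probability, under the null hypothesis,
    of every outcome that is no more likely than the observed one."""
    def prob(k):
        return comb(trials, k) * p_null ** k * (1 - p_null) ** (trials - k)

    observed = prob(successes)
    # The tiny tolerance guards against floating-point ties.
    return sum(prob(k) for k in range(trials + 1) if prob(k) <= observed + 1e-12)

p_value = binomial_two_sided_p(successes=9, trials=10)
print(p_value)  # 22 / 1024, about 0.0215
```

The outcomes at least as extreme as 9 heads are 0, 1, 9, and 10 heads, which together cover 22 of the 1024 equally likely sequences, so the result would count as significant at the 0.05 level.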

29. What is the Central Limit Theorem (CLT)?

The CLT states that the sampling distribution of the sample mean will be approximately
normal, regardless of the population’s distribution, provided the sample size is large
enough and the population has finite variance.
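
A quick simulation makes this tangible. The sketch below draws from a strongly skewed exponential population; the sample size, number of samples, and seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

SAMPLE_SIZE = 50
N_SAMPLES = 2000

# Exponential population with mean 1 and standard deviation 1 (right-skewed).
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(SAMPLE_SIZE))
    for _ in range(N_SAMPLES)
]

mean_of_means = statistics.fmean(sample_means)
std_of_means = statistics.stdev(sample_means)
expected_std = 1.0 / SAMPLE_SIZE ** 0.5  # population std / sqrt(sample size)

# For a normal distribution roughly 68% of values lie within one std of the mean.
within_one_std = sum(
    abs(m - mean_of_means) <= std_of_means for m in sample_means
) / N_SAMPLES

print(mean_of_means, std_of_means, expected_std, within_one_std)
```

Even though individual values are far from normal, the sample means cluster around the population mean with a spread close to the population standard deviation divided by the square root of the sample size.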

30. What are time series data and their applications?

Data recorded over time intervals (e.g., sales, temperature, stock prices). Used for
forecasting trends.

31. What is the difference between correlation and causation?

Correlation shows a relationship between variables; causation shows that one variable
causes the other.

32. What is data wrangling?

Transforming raw data into a clean and usable format for analysis.

33. What is dimensionality reduction?

Reducing the number of features while preserving key information (e.g., using PCA).

34. What is ETL in data analytics?

Extract, Transform, Load: the process of collecting data from source systems, cleaning
and reshaping it, and loading it into a data warehouse.
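
The three stages can be sketched with the standard library alone. Everything here is hypothetical: the CSV text stands in for a source system, an in-memory SQLite database stands in for the warehouse, and the cleaning rules are examples only.

```python
import csv
import io
import sqlite3

# Hypothetical raw export with messy names and one missing amount.
RAW_CSV = """order_id,customer,amount
1, asha ,250.50
2,RAVI,
3,Meena,99.90
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy names, convert types, drop rows with no amount."""
    cleaned = []
    for row in rows:
        amount = row["amount"].strip()
        if not amount:
            continue
        cleaned.append(
            (int(row["order_id"]), row["customer"].strip().title(), float(amount))
        )
    return cleaned

def load(rows, conn):
    """Load: write the cleaned rows into the target table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), warehouse)
loaded = warehouse.execute("SELECT * FROM orders ORDER BY order_id").fetchall()
print(loaded)
```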

35. What is a KPI in analytics?

A Key Performance Indicator used to measure the success of a process (e.g., sales
growth, churn rate).

Advanced & Scenario-Based (36–50)

36. How do you deal with imbalanced datasets?

Techniques like SMOTE (Synthetic Minority Over-sampling Technique),
undersampling, or class weighting.
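
Class weighting is the easiest of these to show by hand. The sketch below uses a common inverse-frequency heuristic; the helper name balanced_class_weights and the labels are assumptions for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical fraud labels: 8 legitimate (0) and 2 fraudulent (1) transactions.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
weights = balanced_class_weights(labels)
print(weights)  # {0: 0.625, 1: 2.5}
```

Errors on the rare class are penalised four times as heavily as errors on the common class, which discourages the model from simply predicting the majority label.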

37. Explain the difference between batch and real-time analytics.

Batch: Processed periodically (e.g., daily sales reports)
Real-time: Processed instantly (e.g., fraud detection)

38. What is a data warehouse?

A central repository for storing large volumes of structured data for analysis and
reporting.

39. What is big data analytics?

The process of analysing large, complex data sets, typically with distributed tools such
as Hadoop and Spark, to uncover hidden patterns.

40. What are some common big data tools?

Hadoop, Apache Spark, Hive, Kafka, Flink.

41. What is A/B Testing?

A method to compare two versions (A and B) of a process or product to determine
which performs better.
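
One common way to decide whether B really beats A on a conversion rate is a two-proportion z-test. The sketch below uses made-up visitor and conversion counts, and the function name is hypothetical.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical experiment: 10% conversion for A, 13% for B, 2000 visitors each.
z, p_value = two_proportion_z_test(conv_a=200, n_a=2000, conv_b=260, n_b=2000)
print(z, p_value)
```

A p-value below the chosen threshold (often 0.05) suggests the difference is unlikely to be explained by random variation alone.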

42. What is logistic regression used for?

Predicting categorical outcomes (e.g., Yes/No, Pass/Fail).

43. What is clustering?

An unsupervised learning method to group similar data points (e.g., K-Means
clustering).
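
A bare-bones sketch of the K-Means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points). The points and starting centroids are invented; library implementations choose starting centroids more carefully.

```python
def k_means(points, centroids, iterations=10):
    """Plain K-Means on tuples of coordinates, starting from given centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                new_centroids.append(
                    tuple(sum(dim) / len(cluster) for dim in zip(*cluster))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 9), (8.5, 9.5)]
final_centroids, final_clusters = k_means(points, centroids=[(1, 1), (9, 9)])
print(final_centroids)
print(final_clusters)
```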

44. What is a data pipeline?

A series of processes that extract, transform, and load (ETL) data from source to
destination systems.

45. Explain anomaly detection.

Identifying rare or unusual data points that differ from the majority.

46. How do you measure model accuracy?

For classification, metrics such as accuracy, precision, recall, F1-score, and ROC-AUC are used; for regression, metrics such as MAE, RMSE, and R-squared.

47. What is feature scaling, and why is it important?

Scaling puts features on comparable ranges so that no feature dominates simply because of
its units. This matters most for distance-based models such as k-NN and K-Means.

48. What are some challenges faced by data analysts?

Poor data quality
Incomplete data sources
Handling large datasets
Misinterpretation of insights

49. How do you present analytical results to non-technical stakeholders?

Use data visualisation, dashboards, and storytelling to make insights
understandable.

50. What’s the difference between a Data Analyst, Data Engineer, and Data Scientist?

Data Analyst: Interprets and visualises data
Data Engineer: Builds pipelines and manages data flow
Data Scientist: Builds predictive models using machine learning