Data Analysis Q&A

  1. What is data analysis?
    The process of inspecting, cleaning, and modeling data to discover useful information and support decision-making.
  2. What are the types of data?
    Quantitative (numerical) and qualitative (categorical) data.
  3. What is the difference between structured and unstructured data?
    Structured data is organized in a fixed format (like databases), while unstructured data lacks a predefined structure (like text or images).
  4. What is data cleaning?
    The process of detecting and correcting or removing inaccurate, incomplete, or duplicate records in a dataset.
  5. What is exploratory data analysis (EDA)?
    Analyzing datasets to summarize their main characteristics, often using visual methods.
  6. What are the common tools used in data analysis?
    Excel, R, Python (Pandas, NumPy), SQL, Tableau, and Power BI.
  7. What is data visualization?
    The graphical representation of information and data to make analysis easier to understand.
  8. What is the purpose of a histogram?
    To show the distribution of a dataset by displaying the frequency of data points in specified ranges.
  9. What is a box plot?
    A graphical representation that shows the distribution of a dataset based on five summary statistics (minimum, first quartile, median, third quartile, maximum).
  10. What is correlation?
    A statistical measure that expresses the extent to which two variables are linearly related.
  11. What is the mean?
    The average of a set of values, calculated by dividing the sum of all values by the number of values.
  12. What is the median?
    The middle value in a dataset when it is ordered from least to greatest; with an even number of values, the average of the two middle values.
  13. What is the mode?
    The value that appears most frequently in a dataset.
  14. What is standard deviation?
    A measure of how far values in a set typically spread around the mean, calculated as the square root of the variance.
  15. What is a p-value?
    The probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true.
  16. What is hypothesis testing?
    A statistical method that uses sample data to evaluate a hypothesis about a population parameter.
  17. What is a confidence interval?
    A range of values, derived from sample statistics, that is expected to contain an unknown population parameter at a stated confidence level (for example, 95%).
  18. What is regression analysis?
    A statistical method for modeling the relationship between a dependent variable and one or more independent variables.
  19. What is linear regression?
    A method for predicting a quantitative response based on the linear relationship between the dependent and independent variables.
  20. What is logistic regression?
    A regression method for predicting a categorical outcome, most often a binary one, by modeling the probability of each class.
  21. What is data wrangling?
    The process of transforming and mapping raw data into a more useful format for analysis.
  22. What is normalization?
    A technique for rescaling data to fit within a specified range, often [0, 1].
  23. What is standardization?
    The process of rescaling data to have a mean of 0 and a standard deviation of 1.
  24. What is data aggregation?
    The process of summarizing or consolidating data to provide a more concise view.
  25. What is a pivot table?
    A data processing tool used to summarize, sort, reorganize, group, count, and total data stored in a table.
  26. What is data mining?
    The practice of examining large datasets to uncover patterns and extract valuable information.
  27. What are outliers?
    Data points that differ significantly from other observations in a dataset.
  28. What is feature engineering?
    The process of using domain knowledge to create features that make machine learning algorithms work better.
  29. What is supervised learning?
    A type of machine learning where the model is trained on labeled data.
  30. What is unsupervised learning?
    A type of machine learning where the model is trained on unlabeled data to identify patterns.
  31. What is overfitting?
    A modeling error that occurs when a model is too complex and captures noise in the data rather than the underlying pattern.
  32. What is underfitting?
    A modeling error that occurs when a model is too simple to capture the underlying trend of the data.
  33. What are decision trees?
    Models that make predictions by answering a series of questions about the input features, with each answer leading down a branch to a final outcome.
  34. What is cross-validation?
    A technique for assessing how well a model will generalize to an independent dataset, typically by repeatedly training on part of the data and evaluating on the held-out remainder.
  35. What is a confusion matrix?
    A table used to evaluate the performance of a classification model by comparing predicted and actual values.
  36. What are precision and recall?
    Precision is the ratio of true positives to the total predicted positives, while recall is the ratio of true positives to the actual positives.
  37. What is clustering?
    A machine learning technique used to group similar data points together.
  38. What is a neural network?
    A computational model inspired by the way biological neural networks in the human brain process information.
  39. What libraries are commonly used in Python for data analysis?
    Pandas, NumPy, Matplotlib, Seaborn, and SciPy.
  40. What is R?
    A programming language and software environment for statistical computing and graphics.
  41. What is Tableau?
    A data visualization tool that allows users to create interactive and shareable dashboards.
  42. What is Power BI?
    A business analytics tool by Microsoft that provides interactive visualizations and business intelligence capabilities.
  43. What is Hadoop?
    An open-source framework for processing and storing large datasets across clusters of computers.
  44. What is ETL?
    Extract, Transform, Load; a process used to integrate data from multiple sources.
  45. What is big data?
    Extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations.
  46. What is cloud computing?
    The delivery of computing services over the internet, including storage, databases, and software.
  47. What is the importance of data analysis in business?
    It helps organizations make informed decisions, improve operations, and increase profitability.
  48. What is data storytelling?
    The practice of building a narrative around data to make it more understandable and engaging.
  49. How do you handle missing data?
    By using techniques such as imputation, deletion, or interpolation.
  50. What is A/B testing?
    A randomized experiment comparing two versions of a variable to determine which one performs better.
  51. What is a KPI?
    Key Performance Indicator; a measurable value that demonstrates how effectively an organization is achieving its key business objectives.
  52. What is data governance?
    A framework for managing data availability, usability, integrity, and security.
  53. What is the difference between descriptive and inferential statistics?
    Descriptive statistics summarize data, while inferential statistics make predictions or inferences about a population based on a sample.
  54. What is a data analyst’s role?
    To collect, process, and analyze data to help organizations make informed decisions.
  55. What is the difference between a data analyst and a data scientist?
    Data analysts focus on interpreting existing data, while data scientists build models and algorithms to predict future outcomes.
  56. What is a data pipeline?
    A set of tools and processes that move data from one system to another for analysis.
  57. What is data scalability?
    The ability to handle growth in data volume without sacrificing performance.
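
Several of the definitions above (mean, median, mode, standard deviation, normalization, standardization, precision and recall) reduce to short formulas, so a minimal Python sketch may help make them concrete. It uses only the standard library. The sample values, the labels, and the helper names min_max_normalize, standardize, and precision_recall are illustrative assumptions, not part of the original list, and the population form of the standard deviation is one choice among several.

```python
import statistics

# Made-up sample values, chosen only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

mean = statistics.mean(values)      # sum of values divided by their count
median = statistics.median(values)  # middle value of the sorted data
mode = statistics.mode(values)      # most frequent value
std = statistics.pstdev(values)     # population standard deviation


def min_max_normalize(xs):
    """Rescale values to the range [0, 1] (hypothetical helper).

    Assumes the values are not all identical, otherwise hi - lo is zero.
    """
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]


def standardize(xs):
    """Rescale values to mean 0 and standard deviation 1 (hypothetical helper).

    Assumes the population standard deviation is non-zero.
    """
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)
    return [(x - mu) / sigma for x in xs]


def precision_recall(actual, predicted, positive=1):
    """Return (precision, recall) for one positive class (hypothetical helper).

    Assumes at least one predicted positive and one actual positive.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp / (tp + fp), tp / (tp + fn)


normalized = min_max_normalize(values)
standardized = standardize(values)

# Made-up labels: 1 marks the positive class.
actual = [1, 1, 1, 1, 0, 0]
predicted = [1, 1, 0, 0, 1, 0]
precision, recall = precision_recall(actual, predicted)
```

In practice the same quantities are usually computed with Pandas or NumPy; the plain-Python version is used here only to keep each formula visible.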

 

 
