Data Analysis Q&A
- What is data analysis?
The process of inspecting, cleaning, and modeling data to discover useful information and support decision-making.
- What are the types of data?
Quantitative (numerical) and qualitative (categorical) data.
- What is the difference between structured and unstructured data?
Structured data is organized in a fixed format (like databases), while unstructured data lacks a predefined structure (like text or images).
- What is data cleaning?
The process of correcting or removing inaccurate records from a dataset.
- What is exploratory data analysis (EDA)?
Analyzing datasets to summarize their main characteristics, often using visual methods.
- What are the common tools used in data analysis?
Excel, R, Python (Pandas, NumPy), SQL, Tableau, and Power BI.
- What is data visualization?
The graphical representation of information and data to make analysis easier to understand.
- What is the purpose of a histogram?
To show the distribution of a dataset by displaying the frequency of data points in specified ranges.
- What is a box plot?
A graphical representation that shows the distribution of a dataset based on five summary statistics (minimum, first quartile, median, third quartile, maximum).
- What is correlation?
A statistical measure that expresses the extent to which two variables are linearly related.
- What is the mean?
The average of a set of values, calculated by dividing the sum of all values by the number of values.
- What is the median?
The middle value in a dataset when it is ordered from least to greatest.
- What is the mode?
The value that appears most frequently in a dataset.
- What is standard deviation?
A measure of the amount of variation or dispersion in a set of values.
- What is a p-value?
The probability of obtaining test results at least as extreme as the observed results, assuming that the null hypothesis is true.
- What is hypothesis testing?
A statistical method that uses sample data to evaluate a hypothesis about a population parameter.
- What is a confidence interval?
A range of values, derived from sample statistics, that is likely to contain the value of an unknown population parameter.
- What is regression analysis?
A statistical method for modeling the relationship between a dependent variable and one or more independent variables.
- What is linear regression?
A method for predicting a quantitative response based on the linear relationship between the dependent and independent variables.
- What is logistic regression?
A regression method for predicting categorical outcomes, most commonly binary ones.
- What is data wrangling?
The process of transforming and mapping raw data into a more useful format for analysis.
- What is normalization?
A technique to scale data to fit within a specified range, often [0,1].
- What is standardization?
The process of rescaling data to have a mean of 0 and a standard deviation of 1.
- What is data aggregation?
The process of summarizing or consolidating data to provide a more concise view.
- What is a pivot table?
A data processing tool used to summarize, sort, reorganize, group, count, and total data stored in a table.
- What is data mining?
The practice of examining large datasets to uncover patterns and extract valuable information.
- What are outliers?
Data points that differ significantly from other observations in a dataset.
- What is feature engineering?
The process of using domain knowledge to create features that make machine learning algorithms work better.
- What is supervised learning?
A type of machine learning where the model is trained on labeled data.
- What is unsupervised learning?
A type of machine learning where the model is trained on unlabeled data to identify patterns.
- What is overfitting?
A modeling error that occurs when a model is too complex and captures noise in the data rather than the underlying pattern.
- What is underfitting?
A modeling error that occurs when a model is too simple to capture the underlying trend of the data.
- What are decision trees?
Models that make predictions by asking a sequence of questions about the input features, following one branch per answer until they reach a leaf that holds the result.
- What is cross-validation?
A technique for assessing how the results of a statistical analysis will generalize to an independent dataset.
- What is a confusion matrix?
A table used to evaluate the performance of a classification model by comparing predicted and actual values.
- What are precision and recall?
Precision is the ratio of true positives to the total predicted positives, while recall is the ratio of true positives to the actual positives.
- What is clustering?
A machine learning technique used to group similar data points together.
- What is a neural network?
A computational model inspired by the way biological neural networks in the human brain process information.
- What libraries are commonly used in Python for data analysis?
Pandas, NumPy, Matplotlib, Seaborn, and SciPy.
- What is R?
A programming language and software environment for statistical computing and graphics.
- What is Tableau?
A data visualization tool that allows users to create interactive and shareable dashboards.
- What is Power BI?
A business analytics tool by Microsoft that provides interactive visualizations and business intelligence capabilities.
- What is Hadoop?
An open-source framework for processing and storing large datasets across clusters of computers.
- What is ETL?
Extract, Transform, Load; a process used to integrate data from multiple sources.
- What is big data?
Extremely large datasets that may be analyzed computationally to reveal patterns, trends, and associations.
- What is cloud computing?
The delivery of computing services over the internet, including storage, databases, and software.
- What is the importance of data analysis in business?
It helps organizations make informed decisions, improve operations, and increase profitability.
- What is data storytelling?
The practice of building a narrative around data to make it more understandable and engaging.
- How do you handle missing data?
By using techniques such as imputation, deletion, or interpolation.
- What is A/B testing?
A randomized experiment comparing two versions of a variable to determine which one performs better.
- What is a KPI?
Key Performance Indicator; a measurable value that demonstrates how effectively an organization is achieving its key business objectives.
- What is data governance?
A framework for managing data availability, usability, integrity, and security.
- What is the difference between descriptive and inferential statistics?
Descriptive statistics summarize data, while inferential statistics make predictions or inferences about a population based on a sample.
- What is a data analyst’s role?
To collect, process, and analyze data to help organizations make informed decisions.
- What is the difference between a data analyst and a data scientist?
Data analysts focus on interpreting existing data, while data scientists build models and algorithms to predict future outcomes.
- What is a data pipeline?
A set of tools and processes that move data from one system to another for analysis.
- What is data scalability?
The ability to handle growth in data volume without sacrificing performance.
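
To make the answers on mean, median, mode, and standard deviation above concrete, here is a minimal sketch using Python's standard-library `statistics` module. The sample values are hypothetical and chosen only for illustration; in practice Pandas or NumPy, both listed above, offer equivalent functions.

```python
import statistics

# Hypothetical sample values, used only for illustration.
values = [2, 4, 4, 5, 7, 9, 11]

mean_value = statistics.mean(values)        # sum of values / number of values
median_value = statistics.median(values)    # middle value of the sorted data
mode_value = statistics.mode(values)        # most frequently occurring value
sample_std = statistics.stdev(values)       # sample standard deviation (divides by n - 1)
population_std = statistics.pstdev(values)  # population standard deviation (divides by n)

print(mean_value, median_value, mode_value)
print(round(sample_std, 4), round(population_std, 4))
```

Note that the sample and population standard deviations differ; which one applies depends on whether the values are a sample or the whole population.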
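
The answers on normalization and standardization above can be sketched in a few lines of plain Python. The function names `min_max_normalize` and `standardize` and the sample data are assumptions made for this example, not part of any library; this version standardizes with the population standard deviation, which is one common convention.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the smallest maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot normalize constant data")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Rescale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        raise ValueError("cannot standardize constant data")
    return [(v - mean) / std for v in values]


# Hypothetical sample values, used only for illustration.
data = [10, 20, 30, 40, 50]
normalized = min_max_normalize(data)
standardized = standardize(data)
print(normalized)
print(standardized)
```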
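
As an illustration of the confusion matrix, precision, and recall answers above, the following sketch counts the four confusion-matrix cells for a binary classifier and derives both ratios from them. The labels and the helper names `confusion_counts` and `precision_recall` are hypothetical and exist only for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # predicted positive, actually positive
        elif p == positive:
            fp += 1  # predicted positive, actually negative
        elif a == positive:
            fn += 1  # predicted negative, actually positive
        else:
            tn += 1  # predicted negative, actually negative
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision = tp / predicted positives; recall = tp / actual positives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, used only for illustration.
actual = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 1, 1, 0, 1, 0]

tp, fp, fn, tn = confusion_counts(actual, predicted)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```

Returning 0.0 when a denominator is zero is a design choice for this sketch; some tools raise a warning or return an undefined value instead.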
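
The three missing-data techniques named above (deletion, imputation, and interpolation) can be compared on a small example. This is a minimal sketch in plain Python under stated assumptions: missing entries are represented as `None`, the readings are hypothetical, and the helper names are invented for this example. Pandas provides comparable operations for real datasets.

```python
import statistics

# Hypothetical evenly spaced readings; None marks a missing value.
readings = [10.0, None, 14.0, None, None, 20.0]


def drop_missing(values):
    """Deletion: keep only the observed values."""
    return [v for v in values if v is not None]


def impute_mean(values):
    """Imputation: replace each missing value with the mean of the observed ones."""
    fill = statistics.fmean(drop_missing(values))
    return [fill if v is None else v for v in values]


def interpolate_linear(values):
    """Interpolation: fill interior gaps on a straight line between the
    nearest observed neighbors. Leading or trailing gaps stay None."""
    result = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            result[i] = values[left] + step * (i - left)
    return result


deleted = drop_missing(readings)
imputed = impute_mean(readings)
interpolated = interpolate_linear(readings)
print(deleted)
print(imputed)
print(interpolated)
```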