Day 12: Exploratory Data Analysis (EDA) Techniques

Welcome to Day 12! Today, we’re diving into Exploratory Data Analysis (EDA), a critical step in any data science project. EDA helps us uncover patterns, relationships, and distributions within our data, providing insights to guide further analysis and feature selection.


What is EDA?

EDA involves summarizing and visualizing datasets to understand their structure and relationships. This step often includes:

  1. Descriptive Statistics: Understanding the central tendency, variability, and distribution.

  2. Data Visualization: Plotting data to reveal trends, patterns, and anomalies.


Key EDA Techniques

1. Descriptive Statistics

Descriptive statistics summarize the central tendency, dispersion, and shape of a dataset’s distribution.

Key Metrics:

  • Mean: Average value.

  • Median: Middle value.

  • Standard Deviation: Measures spread or variability.

  • Percentiles: Cut points in the data (e.g., 25th, 50th, 75th percentiles).

How to Apply:

  • Quickly generate statistics for numerical columns using:

      df.describe()
    
  • This outputs metrics like count, mean, standard deviation, min, max, and percentiles for each numerical feature.

Benefits:

  • Provides a quick overview of dataset distribution.

  • Helps identify potential issues like extreme values or missing data.
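The steps above can be tried on a tiny, hypothetical dataset (the column names and values here are made up for illustration):

```python
import pandas as pd

# Hypothetical salary data for illustration only
df = pd.DataFrame({
    "salary": [48000, 52000, 61000, 75000, 90000, 250000],
    "experience": [1, 2, 4, 7, 10, 20],
})

summary = df.describe()
print(summary.loc["mean", "salary"])  # average salary: 96000.0
print(summary.loc["50%", "salary"])   # median salary: 68000.0
```

Note how the single extreme salary (250000) pulls the mean well above the median — exactly the kind of issue describe() surfaces at a glance.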


2. Univariate Analysis

Univariate analysis examines individual features to understand their distribution, central tendency, and variability.

Numerical Features:

  1. Histograms:

    • Show the frequency distribution of a numerical variable.

      df['salary'].hist()

  2. Box Plots:

    • Visualize the spread of data and identify outliers.

      df.boxplot(column='salary')

Categorical Features:

  1. Bar Charts:

    • Illustrate the frequency of categories.

      df['job_title'].value_counts().plot(kind='bar')

  2. Pie Charts:

    • Represent category proportions.

      df['job_title'].value_counts().plot(kind='pie', autopct='%1.1f%%')

Benefits:

  • Detects patterns like skewed distributions, uniformity, or frequent values.

  • Highlights outliers in numerical data and imbalances in categorical data.
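As a quick sketch of the categorical side, value_counts() computes the frequencies that the bar and pie charts above visualize (the job titles below are hypothetical):

```python
import pandas as pd

# Hypothetical job-title column for illustration
jobs = pd.Series(["Analyst", "Engineer", "Analyst", "Manager", "Analyst", "Engineer"])

counts = jobs.value_counts()  # frequencies, sorted most common first
print(counts["Analyst"])      # 3 — the most frequent category
print(counts.idxmax())        # 'Analyst'
```

A large gap between the top count and the rest signals exactly the kind of class imbalance univariate analysis is meant to surface.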


3. Multivariate Analysis

Multivariate analysis examines relationships between two or more features.

Scatter Plots:

  • Explore correlations between numerical features.

      import seaborn as sns
      sns.scatterplot(x='experience', y='salary', data=df)
    
    • Reveals trends (e.g., salary increases with experience).

Heatmaps:

  • Visualize correlation across all numerical variables.

      sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
    
    • Highlights strong positive or negative relationships between features.

Pair Plots:

  • Plot pairwise relationships between numerical features.

      sns.pairplot(df)
    
    • Useful for spotting patterns across multiple variable combinations.

Benefits:

  • Identifies correlated features that might be redundant.

  • Detects nonlinear relationships or clusters in data.
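The correlation matrix behind such a heatmap can be computed with pandas alone. This minimal sketch uses made-up columns, one of which is a perfect linear function of another:

```python
import pandas as pd

# Hypothetical data: salary is perfectly linear in experience
df = pd.DataFrame({
    "experience": [1, 2, 3, 4, 5],
    "salary": [40, 50, 60, 70, 80],
    "noise": [3, 1, 4, 1, 5],
})

corr = df.corr(numeric_only=True)           # pairwise Pearson correlations
print(corr.loc["experience", "salary"])     # 1.0 — perfect positive correlation
```

A coefficient near +1 or -1 flags a pair of features that may be redundant for modeling.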


4. Identifying Outliers

Outliers are data points that deviate significantly from other observations. They can distort analysis if not handled properly.

Techniques:

  1. Box Plots:

    • Visualize potential outliers directly.

      sns.boxplot(x=df['salary'])

  2. IQR Method:

    • Flag values more than 1.5 times the interquartile range (IQR) beyond the quartiles.

      Q1 = df['salary'].quantile(0.25)
      Q3 = df['salary'].quantile(0.75)
      IQR = Q3 - Q1
      outliers = df[(df['salary'] < Q1 - 1.5 * IQR) | (df['salary'] > Q3 + 1.5 * IQR)]

  3. Z-Scores:

    • Detect outliers as points more than three standard deviations from the mean.

      from scipy.stats import zscore
      z_scores = zscore(df['salary'])
      outliers = df[(z_scores < -3) | (z_scores > 3)]

Benefits:

  • Helps address potential sources of bias or error.

  • Provides insights into extreme cases or anomalies.
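The IQR rule described above can be verified end-to-end on a small, made-up series containing one obvious outlier:

```python
import pandas as pd

# Hypothetical salaries (in thousands); 300 is an obvious outlier
salary = pd.Series([48, 50, 52, 55, 57, 60, 62, 300])

Q1 = salary.quantile(0.25)
Q3 = salary.quantile(0.75)
IQR = Q3 - Q1

# Keep only points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
outliers = salary[(salary < Q1 - 1.5 * IQR) | (salary > Q3 + 1.5 * IQR)]
print(outliers.tolist())  # [300]
```

Here Q1 = 51.5 and Q3 = 60.5, so the acceptance band is [38, 74] and only the extreme value 300 is flagged.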


5. Distribution Analysis

Understanding the distribution of numerical features helps decide the right preprocessing techniques, such as transformations or scaling.

Techniques:

  1. Histogram with Kernel Density Estimation (KDE):

    • A KDE curve overlays the histogram to show the estimated probability density.

      sns.histplot(df['experience'], kde=True)

  2. Q-Q Plots:

    • Compare the distribution of a variable to a normal distribution.

      import scipy.stats as stats
      import matplotlib.pyplot as plt
      stats.probplot(df['experience'], dist="norm", plot=plt)

  3. Skewness and Kurtosis:

    • Quantify the asymmetry (skewness) and tailedness (kurtosis) of a distribution. Note that pandas reports excess kurtosis, so a normal distribution scores 0.

      skewness = df['experience'].skew()
      kurtosis = df['experience'].kurt()

Benefits:

  • Identifies if numerical features follow normal distribution assumptions.

  • Highlights skewed or heavy-tailed distributions that may require transformations (e.g., log transformation).
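To see skew() in action, compare a made-up right-skewed sample against a symmetric one:

```python
import pandas as pd

# Hypothetical samples for illustration
skewed = pd.Series([1, 1, 2, 2, 3, 3, 4, 50])    # heavy right tail
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])  # evenly spaced, symmetric

print(skewed.skew() > 1)   # True — strong positive (right) skew
print(symmetric.skew())    # 0.0 — no asymmetry
```

A strongly positive skew like this is a typical cue that a log transformation may help before modeling.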


Summary of Benefits

  • Descriptive Statistics: Summarize data at a glance.

  • Univariate Analysis: Understand the behavior of individual features.

  • Multivariate Analysis: Explore relationships between features.

  • Outlier Detection: Handle extreme values systematically.

  • Distribution Analysis: Validate assumptions for statistical and machine learning models.

By systematically applying these EDA techniques, you can uncover valuable insights and prepare your data effectively for further analysis and modeling.


Reflection on Day 12

EDA is like detective work—it reveals hidden insights that might otherwise go unnoticed. Through visualizations and summaries, we’re now better equipped to refine our hypotheses and prepare our data for modeling.


What’s Next?

Tomorrow, we’ll explore advanced EDA techniques, including deeper correlation analysis, to gain even more insights into our data.

Thank you for joining Day 12! Let’s keep uncovering the stories hidden in our data.