What are missing values in data analysis?

Missing values are empty data points, like blanks in an attendance sheet. They occur when data wasn't recorded, doesn't exist, or equipment failed.

What are outliers in EDA?

Outliers are data points that don't fit with the rest, like measuring 50 feet tall in a classroom. They may be errors or rare real values.

What is target leakage in machine learning?

Target leakage is when your data contains the answer you're predicting. It's like having test answers on your pencil - your model cheats.

EDA Patterns | Data Science Detective Guide

🔍 EDA Patterns: Becoming a Data Detective

Imagine you’re a detective with a magnifying glass. Before solving any mystery, you must examine every clue carefully. That’s exactly what Exploratory Data Analysis (EDA) is—examining your data before making big decisions!

🎯 The Big Picture

Think of your data like a brand new puzzle box you just received. Before you start solving it, you want to:

Check if any pieces are missing
Find pieces that don’t belong
See if someone cheated by putting the answer on the box
Notice if pieces are lopsided or uneven
Spot patterns like stripes or waves

That’s EDA! Let’s explore each detective skill.

1️⃣ Missing Value Analysis

What Are Missing Values?

Imagine you have an attendance sheet for your class:

Name	Monday	Tuesday	Wednesday
Emma	✓	✓	✓
Jack	✓	?	✓
Lily	✓	✓	?

Those question marks are missing values! We don’t know if Jack came on Tuesday or if Lily came on Wednesday.

Why Do Values Go Missing?

Someone forgot to write it down (like forgetting homework!)
The information doesn’t exist (asking a fish its favorite shoe)
Equipment broke (thermometer stopped working)

How to Find Missing Values

# Count missing values
data.isnull().sum()

# See percentage missing
(data.isnull().sum() / len(data)) * 100
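
To see what these two checks report, here is a small sketch that rebuilds the attendance sheet as a DataFrame. The name `attendance` and the use of `None` for the question marks are assumptions made for this illustration.

```python
import pandas as pd

# Hypothetical attendance sheet; None stands in for the "?" entries
attendance = pd.DataFrame(
    {
        "Monday": [True, True, True],
        "Tuesday": [True, None, True],
        "Wednesday": [True, True, None],
    },
    index=["Emma", "Jack", "Lily"],
)

missing_count = attendance.isnull().sum()        # missing values per column
missing_pct = attendance.isnull().mean() * 100   # share of rows missing, in %

print(missing_count)
print(missing_pct.round(1))
```

Taking the mean of the True/False mask gives the same percentage as dividing the count by `len(data)`, just in one step.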

What Can We Do About Them?

graph TD
    A["Found Missing Values!"] --> B{How many?}
    B -->|Very few| C["Fill with average/middle value"]
    B -->|Some| D["Use smart guessing"]
    B -->|Too many| E["Remove that column"]
    C --> F["Clean Data Ready!"]
    D --> F
    E --> F

Simple Example: If 3 out of 100 students forgot to write their age, we could:

Use the average age of the class
Use the most common age
Ask them again!
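
The first two options above can be sketched in pandas roughly like this. The `ages` values are made up for the example, and `np.nan` marks the students who left the field blank.

```python
import numpy as np
import pandas as pd

# Hypothetical class ages; np.nan marks a missing answer
ages = pd.Series([10, 11, 10, np.nan, 12, 10, np.nan, 11])

# Option 1: fill the blanks with the average age
mean_filled = ages.fillna(ages.mean())

# Option 2: fill the blanks with the most common age
# (mode() can return several values, so take the first)
mode_filled = ages.fillna(ages.mode().iloc[0])

print(mean_filled.tolist())
print(mode_filled.tolist())
```

Which option fits best depends on the column: the average can produce values nobody actually has (like 10.67 years), while the most common value always stays realistic.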

2️⃣ Outlier Detection in EDA

What Are Outliers?

Imagine measuring everyone’s height in your class:

Most kids: 4-5 feet tall
One measurement: 50 feet tall 🦕

That 50 feet is an outlier—it doesn’t fit with the rest! Either someone made a mistake, or there’s a dinosaur in your class.

The Neighborhood Rule

Think of your data living in a neighborhood:

Most houses are similar sizes
One house is a giant castle 🏰
That castle is the outlier!

How to Spot Outliers

Method 1: The Box Plot

graph LR
    A["Small values"] --- B["Most data lives here"]
    B --- C["Large values"]
    D["🔴 Outlier!"] -.-> A
    E["🔴 Outlier!"] -.-> C

Method 2: The 3-Sigma Rule

Calculate the average (mean)
Calculate the spread (standard deviation)
Anything more than 3 standard deviations from the mean = outlier!

# Find outliers using the IQR method (the rule box plots use to draw their whiskers)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
outliers = data[(data < Q1 - 1.5*IQR) |
               (data > Q3 + 1.5*IQR)]
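
The 3-sigma rule from Method 2 can be sketched in a similar way. The `heights` values below are invented for the illustration: 19 ordinary readings plus one suspicious 50-foot entry.

```python
import pandas as pd

# Hypothetical class heights in feet: 19 ordinary values and one bad reading
heights = pd.Series([4.2, 4.5, 4.8, 4.4, 4.6, 4.3, 4.7, 4.5, 4.9, 4.1,
                     4.6, 4.4, 4.5, 4.8, 4.2, 4.7, 4.3, 4.6, 4.5, 50.0])

mean = heights.mean()
std = heights.std()                 # sample standard deviation
z_scores = (heights - mean) / std   # how many "spreads" from the mean

# Flag anything more than 3 standard deviations from the mean
outliers = heights[z_scores.abs() > 3]
print(outliers)
```

One caveat: in very small samples a single extreme value inflates the standard deviation so much that it may never land more than 3 standard deviations away. That is why this sketch uses 20 values, and why the IQR method is often the safer first check.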

What To Do With Outliers?

Situation	Action
Typo/Error	Fix it!
Real but rare	Keep it, note it
Breaks your model	Consider removing

3️⃣ Target Leakage Detection

The Cheating Problem 🎮

Imagine you’re taking a test, but someone wrote the answers on your pencil. That’s cheating! Your score would be perfect, but you didn’t really learn anything.

Target leakage is when your data accidentally contains the answer you’re trying to predict!

Real Example

You want to predict: “Will this student pass the exam?”

Your data includes:

Hours studied ✅ (Good! Helps predict)
Attendance ✅ (Good!)
Final grade ❌ (CHEATING! This IS the answer!)

How Leakage Sneaks In

graph TD
    A["Want to Predict: Will customer buy?"] --> B{Check your features}
    B --> C["✅ Age - OK!"]
    B --> D["✅ Past purchases - OK!"]
    B --> E["❌ Purchase confirmation email - LEAK!"]
    B --> F["❌ Receipt number - LEAK!"]

Spotting Leakage

Warning signs:

Your model is too accurate (99%+ on first try)
A feature is suspiciously perfect
A feature was recorded or created after the event you are predicting

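# Check for suspicious correlations
# If a feature's correlation with the target is above 0.95
# in absolute value, investigate it for leakage!
numeric = data.select_dtypes(include='number')  # corr() needs numeric columns
correlation = numeric.corr()['target'].drop('target')  # skip target vs. itself
suspicious = correlation[correlation.abs() > 0.95]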
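
Here is a small, self-contained sketch of that check on made-up student data. All column names and numbers are hypothetical, and `certificate_issued` plays the role of a feature that only exists after the exam.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: study hours drive the exam score, score decides pass/fail
hours_studied = rng.uniform(0, 10, n)
passed = (40 + 5 * hours_studied + rng.normal(0, 8, n) >= 60).astype(int)

# Leaky feature: a certificate is issued AFTER passing (with two clerical errors)
certificate_issued = passed.copy()
certificate_issued[:2] = 1 - certificate_issued[:2]

data = pd.DataFrame({
    "hours_studied": hours_studied,
    "certificate_issued": certificate_issued,
    "target": passed,
})

# Correlation of each feature with the target, excluding the target itself
correlation = data.corr()["target"].drop("target")
suspicious = correlation[correlation.abs() > 0.95]
print(suspicious)
```

A high correlation is only a hint, not proof. An honest feature can be strongly predictive, and a leaky one can hide behind a weaker correlation, so the check works best alongside the question of when each feature became known.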

The Time Travel Test

Ask yourself: “Would I know this information BEFORE the event happens?”

Customer’s birthday: ✅ Yes, I’d know this before they buy
Shipping address: ❌ No, they give this AFTER buying
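
If you happen to know when each feature becomes available, the time travel test can be written down directly. The sketch below is only an illustration: the timestamps and the `feature_known_at` dictionary are assumptions, not something a library gives you.

```python
from datetime import datetime

# Hypothetical moment of the event we want to predict
purchase_time = datetime(2024, 3, 15, 14, 30)

# Hypothetical record of when each feature became known
feature_known_at = {
    "customer_birthday": datetime(2024, 1, 2, 9, 0),
    "past_purchases": datetime(2024, 3, 1, 0, 0),
    "shipping_address": datetime(2024, 3, 15, 14, 35),
    "receipt_number": datetime(2024, 3, 15, 14, 31),
}

# Keep only features that were known strictly before the event
safe = [name for name, t in feature_known_at.items() if t < purchase_time]
leaky = [name for name, t in feature_known_at.items() if t >= purchase_time]

print("Safe:", safe)
print("Leaky:", leaky)
```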

4️⃣ Skewness and Transformations

What is Skewness?

Imagine a seesaw at the playground:

Balanced: Equal kids on both sides
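Skewed Right: Most kids bunch up on the left, but one sits far out on the right end, so a long tail stretches to the right
Skewed Left: Most kids bunch up on the right, with a long tail stretching to the left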

Data can be skewed too!

Seeing Skewness

Normal (Balanced) Data:

    ████
   ██████
  ████████
 ██████████

Right-Skewed Data (Long tail right):

████
██████
████████
██████████████████████→

Most people earn average money, but a few billionaires pull the tail right!

Why Does Skewness Matter?

Many formulas assume roughly balanced (symmetric) data
A few extreme values can drag the average and trick your model
Predictions for typical cases become less accurate

Fixing Skewness with Transformations

Think of transformations like magic glasses 🥽 that make lopsided things look balanced!

Common Transformations:

Transformation	When to Use	Effect
Log	Right-skewed	Shrinks big values
Square Root	Right-skewed	Gentler shrinking
Square	Left-skewed	Expands big values

# Log transformation for right-skewed data
import numpy as np
data['salary_log'] = np.log(data['salary'] + 1)

# Check if it helped
print(f"Before: {data['salary'].skew():.2f}")
print(f"After: {data['salary_log'].skew():.2f}")

Example:

Salaries: $30K, $40K, $50K, $1,000,000
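After log (base 10): 4.5, 4.6, 4.7, 6.0 (Much more balanced!)
With the natural log that np.log uses: 10.3, 10.6, 10.8, 13.8 (same idea, different scale)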
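
To watch the effect on a bigger sample, here is a sketch that compares the log and square root transformations from the table. The salaries are randomly generated from a log-normal distribution, which is an assumption chosen only because it produces a long right tail.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical right-skewed salaries (log-normal gives a long right tail)
salary = pd.Series(rng.lognormal(mean=10.5, sigma=0.8, size=1000))

salary_log = np.log1p(salary)   # same as np.log(salary + 1)
salary_sqrt = np.sqrt(salary)   # gentler shrinking

skew_before = salary.skew()
skew_log = salary_log.skew()
skew_sqrt = salary_sqrt.skew()

print(f"Before:      {skew_before:.2f}")
print(f"Square root: {skew_sqrt:.2f}")
print(f"Log:         {skew_log:.2f}")
```

A skewness near 0 means the data looks balanced. On data like this the square root should help somewhat and the log should help more, matching the "gentler shrinking" note in the table.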

5️⃣ Seasonality and Trends

What Are Trends?

A trend is like watching a plant grow:

Every day, it gets a little taller
Over weeks, you see it going UP

That upward movement is a trend!

graph LR
    A["📈 Upward Trend"] --> B["Things getting bigger over time"]
    C["📉 Downward Trend"] --> D["Things getting smaller over time"]
    E["➡️ No Trend"] --> F["Staying about the same"]

What is Seasonality?

Seasonality is like the weather—it repeats in patterns!

Ice cream sales: 🍦 High in summer, low in winter
Umbrella sales: 🌂 High in rainy season, low when sunny
Toy sales: 🎁 High in December, lower other months

The Wave Pattern

Ice Cream Sales:

Summer  ████████████
Fall    ████████
Winter  ████
Spring  ████████
Summer  ████████████  ← Pattern repeats!

Why This Matters

If you predict toy sales for December using July data, you’ll be VERY wrong!

How to Find These Patterns

Step 1: Plot over time

# Plot your data over time
data.plot(x='date', y='sales')

Step 2: Look for:

Overall direction (trend)
Repeating waves (seasonality)

Step 3: Decompose

from statsmodels.tsa.seasonal import seasonal_decompose

# Break data into parts
result = seasonal_decompose(
    data['sales'],
    period=12  # Monthly seasons
)
result.plot()

This shows you:

Trend line: The overall direction
Seasonal pattern: The repeating waves
Residual: What’s left (random noise)
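
To get a feel for what those three parts mean, here is a simplified, hand-rolled version using only pandas. It is a rough sketch of the idea and not a reproduction of what `seasonal_decompose` does internally. The sales numbers are synthetic and every name is made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical monthly sales: upward trend + yearly wave + a little noise
months = pd.date_range("2020-01-01", periods=48, freq="MS")
month_num = months.month.to_numpy()
values = (np.linspace(100, 195, 48)                       # trend
          + 20 * np.sin(2 * np.pi * (month_num - 1) / 12)  # seasonality
          + rng.normal(0, 0.5, 48))                        # noise
sales = pd.Series(values, index=months)

# Trend: a 12-month rolling mean smooths out the yearly wave
trend = sales.rolling(window=12, center=True).mean()

# Seasonality: average what is left over for each calendar month
detrended = sales - trend
seasonal = detrended.groupby(detrended.index.month).mean()

# Residual: what remains after removing trend and seasonal pattern
residual = detrended - seasonal.reindex(detrended.index.month).to_numpy()

print(seasonal.round(1))
```

The rolling mean leaves a few empty values at the start and end of the series, because a full 12-month window does not fit there.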

🎓 Putting It All Together

You’re now a Data Detective! Here’s your checklist:

graph TD
    A["🔍 Start EDA"] --> B["Check for Missing Values"]
    B --> C["Hunt for Outliers"]
    C --> D["Detect Target Leakage"]
    D --> E["Check Skewness"]
    E --> F["Look for Seasonality & Trends"]
    F --> G["✅ Data Ready for Analysis!"]

Remember:

Pattern	Think Of It As…
Missing Values	Empty puzzle pieces
Outliers	Dinosaur in the classroom
Target Leakage	Answers on your pencil
Skewness	Unbalanced seesaw
Seasonality	Weather patterns
Trends	Plant growing

🚀 Your Data Detective Toolkit

Always look at your data first - Never trust blindly!
Ask questions - Why is this missing? Why is this huge?
Visualize everything - Pictures reveal secrets numbers hide
Think about time - When was this recorded? Does it repeat?
Check for cheating - Could this feature know the future?

“The goal of EDA is not to find answers—it’s to find the right questions!”

You’ve got this, Detective! 🕵️‍♀️🔍📊
