EDA Patterns: Becoming a Data Detective
Imagine you're a detective with a magnifying glass. Before solving any mystery, you must examine every clue carefully. That's exactly what Exploratory Data Analysis (EDA) is: examining your data before making big decisions!
The Big Picture
Think of your data like a brand new puzzle box you just received. Before you start solving it, you want to:
- Check if any pieces are missing
- Find pieces that don't belong
- See if someone cheated by putting the answer on the box
- Notice if pieces are lopsided or uneven
- Spot patterns like stripes or waves
That's EDA! Let's explore each detective skill.
1. Missing Value Analysis
What Are Missing Values?
Imagine you have an attendance sheet for your class:
| Name | Monday | Tuesday | Wednesday |
|---|---|---|---|
| Emma | ✓ | ✓ | ✓ |
| Jack | ✓ | ? | ✓ |
| Lily | ✓ | ✓ | ? |
Those question marks are missing values! We don't know if Jack came on Tuesday or if Lily came on Wednesday.
Why Do Values Go Missing?
- Someone forgot to write it down (like forgetting homework!)
- The information doesn't exist (asking a fish its favorite shoe)
- Equipment broke (thermometer stopped working)
How to Find Missing Values
# Count missing values
data.isnull().sum()
# See percentage missing
(data.isnull().sum() / len(data)) * 100
What Can We Do About Them?
graph TD
    A[Found Missing Values!] --> B{How many?}
    B -->|Very few| C[Fill with average/middle value]
    B -->|Some| D[Use smart guessing]
    B -->|Too many| E[Remove that column]
    C --> F[Clean Data Ready!]
    D --> F
    E --> F
Simple Example: If 3 out of 100 students forgot to write their age, we could:
- Use the average age of the class
- Use the most common age
- Ask them again!
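The first two options can be sketched in pandas. This is only a minimal illustration: the `ages` values below are made up for the example, and the right fill strategy depends on your data.

```python
import numpy as np
import pandas as pd

# Hypothetical class ages; NaN marks students who left the field blank
ages = pd.Series([10, 11, 10, np.nan, 12, 10, np.nan, 11], name="age")

# How many values are missing, and what share of the column is that?
n_missing = ages.isnull().sum()
pct_missing = ages.isnull().mean() * 100

# Option 1: fill the gaps with the average age
filled_mean = ages.fillna(ages.mean())

# Option 2: fill the gaps with the most common age
# (mode() can return several values on a tie, so take the first)
filled_mode = ages.fillna(ages.mode().iloc[0])

print(f"Missing: {n_missing} ({pct_missing:.0f}%)")
print(filled_mean.tolist())
print(filled_mode.tolist())
```

Filling with the mean keeps the column average unchanged, while filling with the mode keeps every value a real, observed age.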
2. Outlier Detection in EDA
What Are Outliers?
Imagine measuring everyone's height in your class:
- Most kids: 4-5 feet tall
- One measurement: 50 feet tall
That 50 feet is an outlier: it doesn't fit with the rest! Either someone made a mistake, or there's a dinosaur in your class.
The Neighborhood Rule
Think of your data living in a neighborhood:
- Most houses are similar sizes
- One house is a giant castle
- That castle is the outlier!
How to Spot Outliers
Method 1: The Box Plot
graph LR
    A[Small values] --- B[Most data lives here]
    B --- C[Large values]
    D[Outlier!] -.-> A
    E[Outlier!] -.-> C
Method 2: The 3-Sigma Rule
- Calculate the average (mean)
- Calculate the spread (standard deviation)
- Anything 3 spreads away = outlier!
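# Find outliers using the IQR rule (the rule a box plot uses to draw its whiskers)
Q1 = data.quantile(0.25)
Q3 = data.quantile(0.75)
IQR = Q3 - Q1
# Anything more than 1.5 * IQR below Q1 or above Q3 is flagged
outliers = data[(data < Q1 - 1.5*IQR) |
                (data > Q3 + 1.5*IQR)]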
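The 3-sigma rule from Method 2 can be sketched the same way. This is a rough illustration, assuming a made-up `heights` series with one impossible entry; the cutoff of 3 is the one named in the rule above.

```python
import pandas as pd

# Hypothetical heights in feet: 30 ordinary values plus one impossible entry
heights = pd.Series([4.0 + 0.03 * i for i in range(30)] + [50.0])

mean = heights.mean()   # the average
std = heights.std()     # the spread (sample standard deviation)

# z-score: how many "spreads" each value sits away from the average
z_scores = (heights - mean) / std

# Flag anything more than 3 spreads away, in either direction
outliers = heights[z_scores.abs() > 3]
print(outliers.tolist())
```

One caveat worth knowing: the outlier itself pulls the mean and inflates the standard deviation, so in very small datasets this rule may not flag anything. The IQR rule is less sensitive to that.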
What To Do With Outliers?
| Situation | Action |
|---|---|
| Typo/Error | Fix it! |
| Real but rare | Keep it, note it |
| Breaks your model | Consider removing |
3. Target Leakage Detection
The Cheating Problem
Imagine you're taking a test, but someone wrote the answers on your pencil. That's cheating! Your score would be perfect, but you didn't really learn anything.
Target leakage is when your data accidentally contains the answer you're trying to predict!
Real Example
You want to predict: "Will this student pass the exam?"
Your data includes:
- Hours studied ✓ (Good! Helps predict)
- Attendance ✓ (Good!)
- Final grade ✗ (CHEATING! This IS the answer!)
How Leakage Sneaks In
graph TD
    A[Want to Predict: Will customer buy?] --> B{Check your features}
    B --> C[✓ Age - OK!]
    B --> D[✓ Past purchases - OK!]
    B --> E[✗ Purchase confirmation email - LEAK!]
    B --> F[✗ Receipt number - LEAK!]
Spotting Leakage
Warning signs:
- Your model is too accurate (99%+ on first try)
- A feature is suspiciously perfect
- Feature was created after the event
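# Check for suspicious correlations with the target
# (numeric columns only, and leave out the target's own correlation of 1.0)
correlation = data.select_dtypes('number').corr()['target'].drop('target')
# If a feature correlates above 0.95 (positive OR negative), investigate for leakage!
suspicious = correlation[correlation.abs() > 0.95]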
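Here is a small self-contained run of that check. The customer table is invented for illustration, and `confirmation_email` plays the role of the leaky feature from the diagram above.

```python
import pandas as pd

# Hypothetical customer data; 'target' is 1 when the customer bought
data = pd.DataFrame({
    "age":                [25, 40, 31, 52, 46, 29, 38, 60],
    "past_purchases":     [0, 3, 1, 0, 5, 2, 0, 4],
    "confirmation_email": [0, 1, 0, 0, 1, 1, 0, 1],  # only sent AFTER a purchase
    "target":             [0, 1, 0, 0, 1, 1, 0, 1],
})

# Correlation of each feature with the target, excluding the target itself
correlation = data.corr()["target"].drop("target")

# Flag features that track the target almost perfectly, in either direction
suspicious = correlation[correlation.abs() > 0.95]
print(suspicious)
```

Only `confirmation_email` is flagged here. A high correlation is a reason to investigate, not proof of leakage, and some leaks hide behind weaker correlations.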
The Time Travel Test
Ask yourself: "Would I know this information BEFORE the event happens?"
- Customer's birthday: ✓ Yes, I'd know this before they buy
- Shipping address: ✗ No, they give this AFTER buying
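One way to make the Time Travel Test routine is to write down, for every feature, whether it is known before the event, and then filter on that. The catalogue below is a hypothetical sketch; the feature names and the `known_before_event` column are assumptions for illustration.

```python
import pandas as pd

# Hypothetical feature catalogue: is each value known BEFORE the purchase happens?
features = pd.DataFrame({
    "feature": ["birthday", "past_purchases", "shipping_address", "receipt_number"],
    "known_before_event": [True, True, False, False],
})

# Keep only features that pass the time travel test
safe_features = features.loc[features["known_before_event"], "feature"].tolist()
leaky_features = features.loc[~features["known_before_event"], "feature"].tolist()

print("Keep:", safe_features)
print("Drop:", leaky_features)
```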
4. Skewness and Transformations
What is Skewness?
Imagine a seesaw at the playground:
- Balanced: Equal kids on both sides
- Skewed Right: Most kids bunch up on the left, with a few stretched far out to the right
- Skewed Left: Most kids bunch up on the right, with a few stretched far out to the left
Data can be skewed too!
Seeing Skewness
Normal (Balanced) Data:
   ████
  ██████
 ████████
██████████
Right-Skewed Data (Long tail right):
████
██████
████████
███████████████████████
Most people earn average money, but a few billionaires pull the tail right!
Why Does Skewness Matter?
- Many formulas assume balanced data
- Skewed data can trick your model
- Predictions become unfair
Fixing Skewness with Transformations
Think of transformations like magic glasses that make lopsided things look balanced!
Common Transformations:
| Transformation | When to Use | Effect |
|---|---|---|
| Log | Right-skewed | Shrinks big values |
| Square Root | Right-skewed | Gentler shrinking |
| Square | Left-skewed | Expands big values |
import numpy as np
# Log transformation for right-skewed data
# log1p computes log(1 + x), so a salary of 0 does not break it
data['salary_log'] = np.log1p(data['salary'])
# Check if it helped (skew closer to 0 means more balanced)
print(f"Before: {data['salary'].skew():.2f}")
print(f"After: {data['salary_log'].skew():.2f}")
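To see the "gentler" versus "stronger" shrinking from the table, the sketch below compares square root and log on the same made-up right-skewed values. The numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values: mostly small, with a few very large ones
values = pd.Series([1, 1, 2, 2, 3, 4, 9, 25, 100])

skew_raw = values.skew()
skew_sqrt = np.sqrt(values).skew()   # gentler shrinking
skew_log = np.log1p(values).skew()   # stronger shrinking

print(f"Raw:  {skew_raw:.2f}")
print(f"Sqrt: {skew_sqrt:.2f}")
print(f"Log:  {skew_log:.2f}")
```

On this sample the skew drops with the square root and drops further with the log, though neither reaches zero. Which one is "enough" depends on the data.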
Example:
- Salaries: $30K, $40K, $50K, $1,000,000
- After log (base 10): 4.5, 4.6, 4.7, 6.0 (The huge salary now sits much closer to the rest!)
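The arithmetic in this example can be checked directly. Note that these figures are base-10 logs, while the snippet earlier uses the natural log; the base changes the scale of the numbers but not the squeezing effect.

```python
import numpy as np
import pandas as pd

salaries = pd.Series([30_000, 40_000, 50_000, 1_000_000])

# Base-10 log: each whole step means "ten times bigger"
log_salaries = np.log10(salaries)
print(log_salaries.round(1).tolist())  # [4.5, 4.6, 4.7, 6.0]

# How far apart are the biggest and smallest values, before and after?
ratio_before = salaries.max() / salaries.min()
ratio_after = log_salaries.max() / log_salaries.min()
print(f"Largest / smallest before: {ratio_before:.1f}")
print(f"Largest / smallest after:  {ratio_after:.2f}")
```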
5. Seasonality and Trends
What Are Trends?
A trend is like watching a plant grow:
- Every day, it gets a little taller
- Over weeks, you see it going UP
That upward movement is a trend!
graph LR
    A[Upward Trend] --> B[Things getting bigger over time]
    C[Downward Trend] --> D[Things getting smaller over time]
    E[No Trend] --> F[Staying about the same]
What is Seasonality?
Seasonality is like the weather: it repeats in patterns!
- Ice cream sales: High in summer, low in winter
- Umbrella sales: High in rainy season, low when sunny
- Toy sales: High in December, lower other months
The Wave Pattern
Ice Cream Sales:
Summer ████████████
Fall   ████████
Winter ████
Spring ████████
Summer ████████████ ← Pattern repeats!
Why This Matters
If you predict toy sales for December using July data, you'll be VERY wrong!
How to Find These Patterns
Step 1: Plot over time
# Plot your data over time
data.plot(x='date', y='sales')
Step 2: Look for:
- Overall direction (trend)
- Repeating waves (seasonality)
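Before reaching for a library, both patterns can be pulled out with plain pandas. The sketch below is a rough approximation on synthetic data: the `sales` series (steady growth plus a yearly wave) is invented, and a 12-month rolling mean stands in for a proper trend estimate.

```python
import numpy as np
import pandas as pd

# Hypothetical monthly sales: steady growth plus a yearly wave (4 years of data)
months = pd.date_range("2020-01-01", periods=48, freq="MS")
t = np.arange(48)
sales = pd.Series(100 + 2 * t + 20 * np.sin(2 * np.pi * t / 12), index=months)

# Overall direction: a 12-month rolling mean averages out the yearly wave
trend = sales.rolling(window=12, center=True).mean()

# Repeating waves: average what is left over, month by month
detrended = sales - trend
seasonal = detrended.groupby(detrended.index.month).mean()

print(trend.dropna().head())
print(seasonal.round(1))
```

In this synthetic series the trend climbs steadily, and the monthly averages peak in April and bottom out in October, which is exactly the wave that was built in.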
Step 3: Decompose
from statsmodels.tsa.seasonal import seasonal_decompose
# Break data into parts
result = seasonal_decompose(
    data['sales'],
    period=12  # 12 observations per cycle (monthly data, yearly pattern)
)
result.plot()
This shows you:
- Trend line: The overall direction
- Seasonal pattern: The repeating waves
- Residual: What's left (random noise)
Putting It All Together
You're now a Data Detective! Here's your checklist:
graph TD
    A[Start EDA] --> B[Check for Missing Values]
    B --> C[Hunt for Outliers]
    C --> D[Detect Target Leakage]
    D --> E[Check Skewness]
    E --> F[Look for Seasonality & Trends]
    F --> G[Data Ready for Analysis!]
Remember:
| Pattern | Think Of It As... |
|---|---|
| Missing Values | Empty puzzle pieces |
| Outliers | Dinosaur in the classroom |
| Target Leakage | Answers on your pencil |
| Skewness | Unbalanced seesaw |
| Seasonality | Weather patterns |
| Trends | Plant growing |
Your Data Detective Toolkit
- Always look at your data first - Never trust blindly!
- Ask questions - Why is this missing? Why is this huge?
- Visualize everything - Pictures reveal secrets numbers hide
- Think about time - When was this recorded? Does it repeat?
- Check for cheating - Could this feature know the future?
"The goal of EDA is not to find answers; it's to find the right questions!"
You've got this, Detective!