Data Preprocessing

🧹 Data Preprocessing: The Art of Cleaning Your Data Kitchen

The Story: Your Data is a Messy Kitchen

Imagine you want to bake the most delicious cake ever. But your kitchen is a mess! Some ingredients are missing, some are spoiled, and you don’t have enough of certain things. Before you can bake, you need to clean and prepare everything.

Machine Learning is just like baking. Your data is the kitchen, and preprocessing is the cleanup before cooking. Bad ingredients = bad cake. Messy data = bad predictions!


🕳️ Handling Missing Values

What Are Missing Values?

Think of a puzzle with some pieces gone. Missing values are those empty spots in your data.

Real Example:

Student | Age | Score
--------|-----|------
Emma    | 12  | 95
Jack    | ??  | 88
Lily    | 11  | ??

Jack’s age and Lily’s score are missing!

Why Do Values Go Missing?

  • 📋 Someone forgot to fill a form
  • 💻 A computer glitch lost the data
  • 🙅 A person skipped a question
  • 📡 A sensor stopped working

Three Ways to Handle Missing Pieces

1. Remove the Row (Throw it away)

Before: [Emma, 12, 95], [Jack, ??, 88]
After:  [Emma, 12, 95]

Good when: You have lots of data and few missing values

2. Fill with a Guess (Imputation)

More on this next!

3. Use a Special Marker

Mark it as “unknown” and let the model learn to handle it.
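Here is a minimal sketch of the three options in plain Python, using the student table from above. The dictionary layout, the `age_missing` flag name, and the choice to fill with the mean age are assumptions made for illustration, not the only way to do it.

```python
# Toy table from the example above; None stands for a missing value.
rows = [
    {"student": "Emma", "age": 12, "score": 95},
    {"student": "Jack", "age": None, "score": 88},
    {"student": "Lily", "age": 11, "score": None},
]

# 1. Remove every row that has at least one missing value.
complete_rows = [r for r in rows if None not in r.values()]

# 2. Fill with a guess (here: the mean of the known ages, an assumed choice).
known_ages = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known_ages) / len(known_ages)
filled_rows = [
    {**r, "age": mean_age if r["age"] is None else r["age"]}
    for r in rows
]

# 3. Keep the gap, but add a flag column so a model can see it was missing.
flagged_rows = [{**r, "age_missing": r["age"] is None} for r in rows]

print([r["student"] for r in complete_rows])
print(filled_rows[1]["age"])
print(flagged_rows[1]["age_missing"])
```

Notice that option 1 keeps only Emma here: with a tiny table, throwing rows away costs a lot of data.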


🔧 Imputation Techniques

What is Imputation?

Imputation = Filling in the blanks with smart guesses.

Like when you lose one sock, you pick another similar one!

The Main Techniques

graph TD
  A[Missing Value Found!] --> B{What type of data?}
  B -->|Numbers| C[Mean/Median Fill]
  B -->|Categories| D[Mode Fill]
  C --> E[Use Average or Middle Value]
  D --> F[Use Most Common Value]

1. Mean Imputation (Average Fill)

You have test scores: 80, 90, ??, 100

Mean = (80 + 90 + 100) ÷ 3 = 90

Fill the blank with 90!

2. Median Imputation (Middle Value)

Ages: 10, 12, ??, 50, 11

Sorted: 10, 11, 12, 50 → Middle = 11.5

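Better for data with outliers! The mean here would be 20.75, dragged up by the single 50, while the median stays close to the typical ages.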

3. Mode Imputation (Most Common)

Favorite colors: Red, Blue, Red, ??, Red

Mode = Red (appears most often)

Fill with Red!
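The three fills above can be sketched with Python's built-in `statistics` module. The helper name `impute` and the use of `None` for a blank are assumptions for this example.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries using a statistic of the known entries."""
    known = [v for v in values if v is not None]
    fill = strategy(known)
    return [fill if v is None else v for v in values]

scores = [80, 90, None, 100]
ages = [10, 12, None, 50, 11]
colors = ["Red", "Blue", "Red", None, "Red"]

print(impute(scores, mean))    # blank becomes the average of 80, 90, 100
print(impute(ages, median))    # blank becomes the middle of the sorted ages
print(impute(colors, mode))    # blank becomes the most common color
```

The filled values match the worked examples: 90 for the scores, 11.5 for the ages, and Red for the colors.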

4. Smart Imputation (KNN)

Look at similar students. If students with similar ages and grades have score X, use X!
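A rough sketch of the idea, assuming each student has an `age` and a `grade` (school year) and that "similar" means a small straight-line distance on those two numbers. The student records and the function name `knn_impute_score` are made up for illustration.

```python
def knn_impute_score(target, others, k=2):
    """Estimate a missing score from the k students closest in age and grade."""
    def distance(a, b):
        # Euclidean distance over the two known features.
        return ((a["age"] - b["age"]) ** 2 + (a["grade"] - b["grade"]) ** 2) ** 0.5

    nearest = sorted(others, key=lambda s: distance(target, s))[:k]
    return sum(s["score"] for s in nearest) / k

# Hypothetical students with known scores.
students = [
    {"age": 11, "grade": 6, "score": 90},
    {"age": 12, "grade": 6, "score": 94},
    {"age": 16, "grade": 10, "score": 60},
]
lily = {"age": 11, "grade": 6, "score": None}

print(knn_impute_score(lily, students))  # average of the two closest students
```

The far-away 16-year-old is ignored, so the guess comes only from students who look like Lily. In real data the features usually need to be put on a similar scale first, otherwise the one with the biggest numbers dominates the distance.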


🔍 Outlier Detection and Treatment

What is an Outlier?

An outlier is like finding a giraffe in a group of cats. It doesn’t fit!

Example:

Heights: 5ft, 5.2ft, 5.1ft, 15ft, 4.9ft
                            ^^^^
                          OUTLIER!

How to Spot Outliers

The IQR Method (Box Plot Thinking)

graph TD
  A[Sort Your Data] --> B[Find Q1 - 25% mark]
  B --> C[Find Q3 - 75% mark]
  C --> D[Calculate IQR = Q3 - Q1]
  D --> E[Lower Fence = Q1 - 1.5×IQR]
  D --> F[Upper Fence = Q3 + 1.5×IQR]
  E --> G[Anything outside = Outlier!]
  F --> G
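The IQR steps can be sketched with `statistics.quantiles` from the standard library. The function name `iqr_outliers` is an assumption, and so is the `method="inclusive"` setting: tools compute quartiles in slightly different ways, so the exact fences can differ a little from one library to another.

```python
from statistics import quantiles

def iqr_outliers(data):
    """Return (lower_fence, upper_fence, outliers) using the 1.5 x IQR rule."""
    # quantiles() sorts the data itself and returns [Q1, Q2, Q3] for n=4.
    q1, _, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return lower, upper, [x for x in data if x < lower or x > upper]

heights_ft = [5, 5.2, 5.1, 15, 4.9]
lower, upper, outliers = iqr_outliers(heights_ft)
print(outliers)  # only the 15 ft "giraffe" falls outside the fences
```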

The Z-Score Method

How many “steps” (standard deviations) away from the average is a value?

Z = (value - mean) ÷ standard deviation

  • Z > 3 or Z < -3 = Likely an outlier!
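A small sketch of the Z-score check, assuming the population standard deviation (`pstdev`) and the usual cutoff of 3. The sample heights and the name `zscore_outliers` are invented for the example.

```python
from statistics import mean, pstdev

def zscore_outliers(data, threshold=3.0):
    """Return the values whose z-score magnitude exceeds the threshold."""
    mu, sigma = mean(data), pstdev(data)
    if sigma == 0:
        return []  # every value is identical, so nothing stands out
    return [x for x in data if abs((x - mu) / sigma) > threshold]

# Hypothetical sample: fourteen ordinary heights and one giraffe.
heights_ft = [5.0, 5.2, 5.1, 4.9, 5.3, 5.0, 5.1, 4.8,
              5.2, 5.0, 4.9, 5.1, 5.0, 5.2, 15]

print(zscore_outliers(heights_ft))
print(zscore_outliers([5, 5.2, 5.1, 15, 4.9]))
```

The second call returns nothing, even though 15 is clearly strange. With only five values, a single point cannot get 3 standard deviations away from the mean, because the outlier itself inflates the standard deviation. The Z-score method works best with larger samples; for tiny ones the IQR method is the safer pick.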

What to Do with Outliers?

Approach  | When to Use
----------|--------------------------
Remove    | Clearly a mistake
Cap       | Real but extreme
Keep      | Important info!
Transform | Log/sqrt to reduce effect

Example: Capping

Before: [10, 12, 11, 100, 13]
After:  [10, 12, 11, 20, 13]
(Cap at reasonable maximum)
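Capping is just "never go above (or below) this limit". A minimal sketch, where the function name `cap` and the limit of 20 are assumptions taken from the example:

```python
def cap(values, lower=None, upper=None):
    """Clip each value into [lower, upper]; a bound of None is ignored."""
    capped = []
    for v in values:
        if upper is not None and v > upper:
            v = upper
        if lower is not None and v < lower:
            v = lower
        capped.append(v)
    return capped

print(cap([10, 12, 11, 100, 13], upper=20))  # [10, 12, 11, 20, 13]
```

In practice the limit often comes from the data itself, for example the IQR upper fence or a high percentile, rather than a hand-picked number.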

📈 Data Augmentation

What is Data Augmentation?

You have 10 photos of cats. But you need 100!

Data augmentation = Creating new data from existing data.

Like making copies with small changes!

For Images

graph TD
  A[Original Cat Photo] --> B[Flip Horizontal]
  A --> C[Rotate Slightly]
  A --> D[Zoom In/Out]
  A --> E[Change Brightness]
  A --> F[Crop Differently]
  B --> G[Now you have 6 cats!]
  C --> G
  D --> G
  E --> G
  F --> G
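To see what a few of these changes do to the actual pixels, here is a toy sketch that treats an image as a small grid of brightness numbers. The 2 x 3 "image" and the function names are assumptions for illustration; real projects use an image library, but the idea is the same.

```python
# A tiny 2 x 3 grayscale "image": rows of pixel intensities from 0 to 255.
image = [
    [0, 50, 100],
    [150, 200, 250],
]

def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]

def adjust_brightness(img, delta):
    """Add delta to every pixel, keeping values inside 0..255."""
    return [[max(0, min(255, p + delta)) for p in row] for row in img]

def crop(img, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in img[top:top + height]]

# One original plus three variations: four training examples from one photo.
augmented = [
    image,
    flip_horizontal(image),
    adjust_brightness(image, 20),
    crop(image, 0, 1, 2, 2),
]
print(len(augmented))
```

Every variation still shows the same cat, so the label stays the same while the pixels change.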

For Text

Original: “The dog is happy”

Augmented versions:

  • “The puppy is happy” (synonym swap)
  • “The dog is joyful” (synonym swap)
  • “A happy dog” (paraphrase)
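A synonym swap can be sketched with a small lookup table. The `SYNONYMS` dictionary below is hand-written and hypothetical; a real project would need a much larger word list and some care that the swap keeps the meaning.

```python
import random

# Hypothetical, hand-written synonym table for this example only.
SYNONYMS = {"dog": ["puppy"], "happy": ["joyful"]}

def synonym_swap(sentence, rng):
    """Replace one randomly chosen word that has a known synonym."""
    words = sentence.split()
    candidates = [i for i, w in enumerate(words) if w.lower() in SYNONYMS]
    if not candidates:
        return sentence  # nothing we know how to swap
    i = rng.choice(candidates)
    words[i] = rng.choice(SYNONYMS[words[i].lower()])
    return " ".join(words)

rng = random.Random(0)  # fixed seed so the run is repeatable
print(synonym_swap("The dog is happy", rng))
```

Each call produces one of the two synonym-swap versions listed above. Paraphrasing is harder and is not covered by this sketch.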

For Numbers (Tabular Data)

SMOTE Technique

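  1. Find a data point from the rare class
  2. Find one of its nearest neighbors in the same class
  3. Create a new point somewhere on the line between them!

Point A: [2, 4]
Point B: [4, 6]
New Point: [3, 5] (the middle here; SMOTE picks a random spot between A and B)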
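The "point between two points" step is a one-line formula. This sketch shows only that interpolation step, not the full SMOTE algorithm (which also searches for nearest neighbors); the function name `interpolate` is an assumption.

```python
import random

def interpolate(a, b, t):
    """Return the point a + t * (b - a); t = 0 gives a, t = 1 gives b."""
    return [x + t * (y - x) for x, y in zip(a, b)]

point_a, point_b = [2, 4], [4, 6]

# t = 0.5 lands exactly in the middle.
print(interpolate(point_a, point_b, 0.5))  # [3.0, 5.0]

# A random t between 0 and 1 gives a different synthetic point each time.
rng = random.Random(42)
synthetic = interpolate(point_a, point_b, rng.random())
print(synthetic)
```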

⚖️ Class Imbalance Handling

What is Class Imbalance?

Imagine a classroom with:

  • 95 boys
  • 5 girls

If you guess “boy” every time, you’re right 95% of the time!

But you never learn to identify girls. That’s unfair!
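A quick sketch makes the trap visible. The "model" here is just a list of guesses that always says "boy"; recall (how many of the real girls were found) is one common way to expose the problem.

```python
labels = ["boy"] * 95 + ["girl"] * 5
predictions = ["boy"] * 100  # a "model" that always guesses the majority

# Accuracy: fraction of all guesses that were right.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for girls: of the actual girls, how many did we find?
girls_found = sum(
    p == "girl" and y == "girl" for p, y in zip(predictions, labels)
)
girl_recall = girls_found / labels.count("girl")

print(accuracy)     # 0.95
print(girl_recall)  # 0.0
```

High accuracy, zero girls found. That is why accuracy alone is a poor score for imbalanced data.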

Real Examples

Problem           | Common Class   | Rare Class
------------------|----------------|-------------
Fraud Detection   | Normal (99.9%) | Fraud (0.1%)
Disease Diagnosis | Healthy (95%)  | Sick (5%)
Spam Email        | Normal (80%)   | Spam (20%)

Solutions

graph TD
  A[Imbalanced Data] --> B{Choose Strategy}
  B --> C[Oversample Minority]
  B --> D[Undersample Majority]
  B --> E[Generate Synthetic Data]
  B --> F[Use Class Weights]
  C --> G[Copy rare examples]
  D --> H[Remove common examples]
  E --> I[Create new rare examples]
  F --> J[Penalize errors on rare class more]

1. Oversampling (Copy the Minority)

Before: [Cat, Dog, Dog, Dog, Dog]
After:  [Cat, Cat, Cat, Dog, Dog, Dog, Dog]

2. Undersampling (Reduce the Majority)

Before: [Cat, Dog, Dog, Dog, Dog]
After:  [Cat, Dog, Dog]
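Both resampling ideas can be sketched in a few lines. The pet rows and function names below are made up for illustration. Note one difference from the small examples above: this sketch balances the classes fully (equal counts), which is a common choice but not the only one.

```python
import random
from collections import Counter

def split_by_class(rows):
    """Group (name, label) rows by their label."""
    groups = {}
    for row in rows:
        groups.setdefault(row[1], []).append(row)
    return groups

def random_oversample(rows, rng):
    """Copy random rows of smaller classes until all classes match the largest."""
    groups = split_by_class(rows)
    target = max(len(g) for g in groups.values())
    result = list(rows)
    for group in groups.values():
        result += [rng.choice(group) for _ in range(target - len(group))]
    return result

def random_undersample(rows, rng):
    """Keep a random subset of each class, sized to the smallest class."""
    groups = split_by_class(rows)
    target = min(len(g) for g in groups.values())
    result = []
    for group in groups.values():
        result += rng.sample(group, target)
    return result

# Hypothetical data: one cat and four dogs.
pets = [("tom", "Cat"), ("rex", "Dog"), ("fido", "Dog"),
        ("spot", "Dog"), ("buddy", "Dog")]
rng = random.Random(0)
over = random_oversample(pets, rng)
under = random_undersample(pets, rng)

print(Counter(label for _, label in over))
print(Counter(label for _, label in under))
```

Oversampling repeats the same cat several times, so the model may memorize it. Undersampling throws away real dogs. Resample only the training data, never the test data.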

3. SMOTE (Smart Synthetic Data)

Create fake examples of the rare class that look realistic!

4. Class Weights

Tell the model: “Mistakes on rare items cost more!”

Weight for Dog: 1
Weight for Cat: 10 (10x more important!)
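Instead of picking weights by hand, a common heuristic is to make each weight inversely proportional to how often the class appears. The sketch below uses the formula n_samples / (n_classes × class_count), which is the same formula scikit-learn applies for its `class_weight="balanced"` option; the function name here is an assumption.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

weights = balanced_class_weights(["Cat", "Dog", "Dog", "Dog", "Dog"])
print(weights)  # {'Cat': 2.5, 'Dog': 0.625}
```

With one cat and four dogs, a mistake on the cat counts 4 times as much as a mistake on a dog, which exactly offsets the 1-to-4 imbalance.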

🎯 Putting It All Together

The Preprocessing Pipeline

graph TD
  A[Raw Messy Data] --> B[Handle Missing Values]
  B --> C[Detect & Treat Outliers]
  C --> D[Augment if Needed]
  D --> E[Balance Classes]
  E --> F[Clean Data Ready!]
  F --> G[Train Your Model]

Quick Decision Guide

Problem         | Solution
----------------|----------------------
Empty cells     | Imputation or removal
Weird values    | Outlier treatment
Too little data | Augmentation
Uneven classes  | Balancing techniques

🚀 Key Takeaways

  1. Missing Values = Empty puzzle pieces. Fill or remove them!

  2. Imputation = Smart guessing. Use mean, median, or mode.

  3. Outliers = Giraffes among cats. Detect and decide: keep, remove, or cap.

  4. Augmentation = Making copies with changes. More varied data usually means better learning!

  5. Class Imbalance = Unfair teams. Balance them for fair predictions.


💡 Remember This Forever

“Garbage in, garbage out!”

Your model can only be as good as your data. Clean data = Smart predictions. Messy data = Confused model.

Data preprocessing is like brushing your teeth before a date. It’s not glamorous, but skip it, and everything goes wrong! 🦷✨
