What are missing values in data?

Missing values are empty spots in your data, like puzzle pieces that are gone. They occur when forms are incomplete or sensors fail.

What is imputation in data preprocessing?

Imputation fills in missing data with smart guesses using mean (average), median (middle value), or mode (most common value).

What is an outlier and how do you handle it?

An outlier is a data point that doesn't fit with others. You can remove it, cap it at a limit, keep it, or transform it.

What is class imbalance in machine learning?

Class imbalance occurs when one category vastly outnumbers another. Fix it with oversampling, undersampling, SMOTE, or class weights.

Data Preprocessing | Machine Learning Guide

🧹 Data Preprocessing: The Art of Cleaning Your Data Kitchen

The Story: Your Data is a Messy Kitchen

Imagine you want to bake the most delicious cake ever. But your kitchen is a mess! Some ingredients are missing, some are spoiled, and you don’t have enough of certain things. Before you can bake, you need to clean and prepare everything.

Machine Learning is just like baking. Your data is the kitchen, and preprocessing is the cleanup before cooking. Bad ingredients = bad cake. Messy data = bad predictions!

🕳️ Handling Missing Values

What Are Missing Values?

Think of a puzzle with some pieces gone. Missing values are those empty spots in your data.

Real Example:

Student | Age | Score
--------|-----|------
Emma    | 12  | 95
Jack    | ??  | 88
Lily    | 11  | ??

Jack’s age and Lily’s score are missing!

Why Do Values Go Missing?

📋 Someone forgot to fill a form
💻 A computer glitch lost the data
🙅 A person skipped a question
📡 A sensor stopped working

Three Ways to Handle Missing Pieces

1. Remove the Row (Throw it away)

Before: [Emma, 12, 95], [Jack, ??, 88]
After:  [Emma, 12, 95]

Good when: You have lots of data and few missing values

2. Fill with a Guess (Imputation)

More on this next!

3. Use a Special Marker

Mark it as “unknown” and let the model learn to handle it.
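Options 1 and 3 can be sketched in a few lines of plain Python. This is only a minimal illustration: `None` stands in for the "??" cells, and the helper names `drop_incomplete` and `mark_missing` are made up for this example.

```python
# Toy table from the example above; None stands in for a missing value ("??").
students = [
    {"name": "Emma", "age": 12, "score": 95},
    {"name": "Jack", "age": None, "score": 88},
    {"name": "Lily", "age": 11, "score": None},
]

def drop_incomplete(rows):
    """Option 1: keep only rows where every field is present."""
    return [row for row in rows if all(value is not None for value in row.values())]

def mark_missing(rows, column):
    """Option 3: add a 0/1 flag column so a model can see the value was missing."""
    flagged = []
    for row in rows:
        new_row = dict(row)  # copy, so the original table is left untouched
        new_row[column + "_missing"] = 1 if row[column] is None else 0
        flagged.append(new_row)
    return flagged

complete_rows = drop_incomplete(students)
flagged_rows = mark_missing(students, "age")

print([row["name"] for row in complete_rows])       # ['Emma']
print([row["age_missing"] for row in flagged_rows])  # [0, 1, 0]
```

Notice that dropping rows leaves only Emma. With a table this small, removal throws away most of the data, which is why it suits large datasets with few gaps.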

🔧 Imputation Techniques

What is Imputation?

Imputation = Filling in the blanks with smart guesses.

Like when you lose one sock, you pick another similar one!

The Main Techniques

graph TD
    A["Missing Value Found!"] --> B{What type of data?}
    B -->|Numbers| C["Mean/Median Fill"]
    B -->|Categories| D["Mode Fill"]
    C --> E["Use Average or Middle Value"]
    D --> F["Use Most Common Value"]

1. Mean Imputation (Average Fill)

You have test scores: 80, 90, ??, 100

Mean = (80 + 90 + 100) ÷ 3 = 90

Fill the blank with 90!

2. Median Imputation (Middle Value)

Ages: 10, 12, ??, 50, 11

Sorted: 10, 11, 12, 50 → Middle = (11 + 12) ÷ 2 = 11.5

Better for data with outliers! The mean here would be 20.75, dragged up by the 50.

3. Mode Imputation (Most Common)

Favorite colors: Red, Blue, Red, ??, Red

Mode = Red (appears most often)

Fill with Red!
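The three fills above can be tried with Python's built-in `statistics` module. This is a small sketch, not a production imputer: `None` marks the blank, and the `impute` helper is a name invented for this example.

```python
from statistics import mean, median, mode

def impute(values, strategy):
    """Replace None entries with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    fill = {"mean": mean, "median": median, "mode": mode}[strategy](known)
    return [fill if v is None else v for v in values]

scores = impute([80, 90, None, 100], "mean")
ages = impute([10, 12, None, 50, 11], "median")
colors = impute(["Red", "Blue", "Red", None, "Red"], "mode")

print(scores)  # [80, 90, 90, 100]
print(ages)    # [10, 12, 11.5, 50, 11]
print(colors)  # ['Red', 'Blue', 'Red', 'Red', 'Red']
```

The fill value is always computed from the known values only, so the blank itself never affects the guess.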

4. Smart Imputation (KNN)

Look at similar students. If students with similar ages and grades have score X, use X!
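Here is a rough sketch of that idea. The classmates, their numbers, and the `knn_impute_score` helper are all hypothetical, and "similar" is assumed to mean a small straight-line distance on age and grade level.

```python
import math

def knn_impute_score(target, others, k=2):
    """Estimate a missing score as the average score of the k most similar rows.

    Similarity is plain Euclidean distance on (age, grade).
    """
    def distance(row):
        return math.dist((row["age"], row["grade"]), (target["age"], target["grade"]))

    nearest = sorted(others, key=distance)[:k]
    return sum(row["score"] for row in nearest) / len(nearest)

# Hypothetical classmates with known scores.
classmates = [
    {"age": 11, "grade": 6, "score": 90},
    {"age": 12, "grade": 6, "score": 94},
    {"age": 15, "grade": 9, "score": 70},
    {"age": 16, "grade": 10, "score": 60},
]

lily = {"age": 11, "grade": 6, "score": None}
lily["score"] = knn_impute_score(lily, classmates, k=2)
print(lily["score"])  # 92.0
```

In real data the columns usually sit on very different scales, so they are typically rescaled before measuring distance. That step is left out here to keep the sketch short.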

🔍 Outlier Detection and Treatment

What is an Outlier?

An outlier is like finding a giraffe in a group of cats. It doesn’t fit!

Example:

Heights: 5ft, 5.2ft, 5.1ft, 15ft, 4.9ft
                            ^^^^
                          OUTLIER!

How to Spot Outliers

The IQR Method (Box Plot Thinking)

graph TD
    A["Sort Your Data"] --> B["Find Q1 - 25% mark"]
    B --> C["Find Q3 - 75% mark"]
    C --> D["Calculate IQR = Q3 - Q1"]
    D --> E["Lower Fence = Q1 - 1.5×IQR"]
    D --> F["Upper Fence = Q3 + 1.5×IQR"]
    E --> G["Anything outside = Outlier!"]
    F --> G
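The fence steps in the diagram can be sketched with the standard library. One assumption to flag: there are several ways to compute quartiles, and this sketch uses the "inclusive" convention from `statistics.quantiles`. Other conventions can move the fences, especially on tiny lists. The `iqr_outliers` helper is a name made up for this example.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return (lower_fence, upper_fence, outliers) using the IQR rule."""
    # quantiles() sorts the data itself; "inclusive" is one of the two
    # quartile conventions the statistics module offers.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    outliers = [v for v in values if v < lower or v > upper]
    return lower, upper, outliers

heights = [5, 5.2, 5.1, 15, 4.9]
lower_fence, upper_fence, outliers = iqr_outliers(heights)
print(outliers)  # [15]
```

Here Q1 is 5.0 and Q3 is 5.2, so the fences land at roughly 4.7 and 5.5, and only the 15 falls outside.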

The Z-Score Method

How many “steps” (standard deviations) away from the average is a value?

Z = (value - average) ÷ standard deviation

Z > 3 or Z < -3 = Likely an outlier!
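A minimal sketch of the Z-score check follows. The `readings` list is made-up data, and the helper names are invented for this example. A longer list is used on purpose: with only five values, no point can ever reach a Z of 3 (the largest possible Z is 2 when using the population standard deviation), so the rule needs a reasonable amount of data to work.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return how many standard deviations each value sits from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    if sigma == 0:
        return [0.0 for _ in values]  # all values identical: nothing stands out
    return [(v - mu) / sigma for v in values]

def z_outliers(values, threshold=3.0):
    """Return the values whose Z-score is beyond +/- threshold."""
    return [v for v, z in zip(values, z_scores(values)) if abs(z) > threshold]

# Hypothetical readings: nineteen ordinary values and one extreme one.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
            12, 10, 11, 13, 12, 11, 12, 10, 13, 100]

flagged = z_outliers(readings)
print(flagged)  # [100]

# The five heights from earlier are too few for any Z to pass 3.
print(z_outliers([5, 5.2, 5.1, 15, 4.9]))  # []
```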

What to Do with Outliers?

Approach  | When to Use
----------|---------------------------
Remove    | Clearly a mistake
Cap       | Real but extreme
Keep      | Important info!
Transform | Log/sqrt to reduce effect

Example: Capping

Before: [10, 12, 11, 100, 13]
After:  [10, 12, 11, 20, 13]
(Cap at reasonable maximum)
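Capping is simple enough to sketch in a few lines. The `cap` helper is a name made up here, and the limit of 20 is just the "reasonable maximum" from the example, not a rule.

```python
def cap(values, lower=None, upper=None):
    """Pull values outside [lower, upper] back to the nearest limit."""
    capped = []
    for v in values:
        if upper is not None and v > upper:
            v = upper
        if lower is not None and v < lower:
            v = lower
        capped.append(v)
    return capped

print(cap([10, 12, 11, 100, 13], upper=20))  # [10, 12, 11, 20, 13]
```

In practice the limits often come from the data itself, such as the IQR fences, rather than from a hand-picked number.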

📈 Data Augmentation

What is Data Augmentation?

You have 10 photos of cats. But you need 100!

Data augmentation = Creating new data from existing data.

Like making copies with small changes!

For Images

graph TD
    A["Original Cat Photo"] --> B["Flip Horizontal"]
    A --> C["Rotate Slightly"]
    A --> D["Zoom In/Out"]
    A --> E["Change Brightness"]
    A --> F["Crop Differently"]
    B --> G["Now you have 6 cats!"]
    C --> G
    D --> G
    E --> G
    F --> G
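To see what a few of these changes do to the actual pixels, here is a toy sketch. It assumes a grayscale image stored as a plain grid of numbers from 0 to 255. Real projects use an image library instead, and the function names below are invented for this example.

```python
def flip_horizontal(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]

def change_brightness(image, delta):
    """Add delta to every pixel, keeping values inside 0..255."""
    return [[max(0, min(255, pixel + delta)) for pixel in row] for row in image]

def crop(image, top, left, height, width):
    """Cut out a height x width window starting at (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

# A tiny 2x3 "photo" made of grayscale pixel values.
photo = [
    [0, 50, 100],
    [150, 200, 250],
]

augmented = [
    photo,
    flip_horizontal(photo),
    change_brightness(photo, 30),
    crop(photo, 0, 1, 2, 2),
]
print(len(augmented))  # 4 training images from 1 original
```

Each function returns a new grid and leaves the original untouched, so one photo can feed many variations.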

For Text

Original: “The dog is happy”

Augmented versions:

“The puppy is happy” (synonym swap)
“The dog is joyful” (synonym swap)
“A happy dog” (paraphrase)
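Synonym swapping can be sketched with a tiny lookup table. The `SYNONYMS` dictionary below is a hypothetical mini-thesaurus written just for this example. Paraphrasing is a harder problem and is not covered here.

```python
# Hypothetical mini-thesaurus; a real one would be far larger.
SYNONYMS = {"dog": ["puppy"], "happy": ["joyful"]}

def synonym_variants(sentence, synonyms):
    """Return every sentence made by swapping exactly one word for a synonym."""
    words = sentence.split()
    variants = []
    for i, word in enumerate(words):
        for replacement in synonyms.get(word.lower(), []):
            variants.append(" ".join(words[:i] + [replacement] + words[i + 1:]))
    return variants

variants = synonym_variants("The dog is happy", SYNONYMS)
print(variants)  # ['The puppy is happy', 'The dog is joyful']
```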

For Numbers (Tabular Data)

SMOTE Technique

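Find a data point in the rare class
Find one of its nearest neighbors in the same class
Create a new point somewhere on the line between them!

Point A: [2, 4]
Point B: [4, 6]
New Point: [3, 5] (the middle here; SMOTE picks a random spot between A and B)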
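The "new point between two neighbors" step can be sketched like this. It covers only the interpolation, not the neighbor search, and the function names are invented for this example.

```python
import random

def interpolate(point_a, point_b, t):
    """Return the point a fraction t of the way from point_a to point_b."""
    return [a + t * (b - a) for a, b in zip(point_a, point_b)]

def synthetic_point(point_a, point_b, rng):
    """SMOTE-style step: pick a random spot on the segment between two points."""
    return interpolate(point_a, point_b, rng.random())

# t = 0.5 gives the exact middle.
midpoint = interpolate([2, 4], [4, 6], 0.5)
print(midpoint)  # [3.0, 5.0]

# A random t gives some other point on the same line segment.
rng = random.Random(0)  # seeded so the run is repeatable
new_point = synthetic_point([2, 4], [4, 6], rng)
print(new_point)
```

Using a random fraction instead of always taking the middle means repeated calls produce different synthetic points from the same pair.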

⚖️ Class Imbalance Handling

What is Class Imbalance?

Imagine a classroom with:

95 boys
5 girls

If you guess “boy” every time, you’re right 95% of the time!

But you never learn to identify girls. That’s unfair!
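A quick calculation shows why accuracy alone is misleading here. This sketch builds the classroom as a list of labels and scores a "model" that always answers "boy".

```python
labels = ["boy"] * 95 + ["girl"] * 5
predictions = ["boy"] * len(labels)  # always guess the common class

# Accuracy: share of all guesses that were correct.
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall for girls: share of actual girls the model found.
girls_found = sum(p == "girl" and y == "girl" for p, y in zip(predictions, labels))
girl_recall = girls_found / labels.count("girl")

print(accuracy)     # 0.95
print(girl_recall)  # 0.0
```

The 95% accuracy looks great, yet the recall for the rare class is zero. Checking a per-class measure like recall is one way to catch this.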

Real Examples

Problem           | Common Class   | Rare Class
------------------|----------------|-------------
Fraud Detection   | Normal (99.9%) | Fraud (0.1%)
Disease Diagnosis | Healthy (95%)  | Sick (5%)
Spam Email        | Normal (80%)   | Spam (20%)

Solutions

graph TD
    A["Imbalanced Data"] --> B{Choose Strategy}
    B --> C["Oversample Minority"]
    B --> D["Undersample Majority"]
    B --> E["Generate Synthetic Data"]
    B --> F["Use Class Weights"]
    C --> G["Copy rare examples"]
    D --> H["Remove common examples"]
    E --> I["Create new rare examples"]
    F --> J["Penalize errors on rare class more"]

1. Oversampling (Copy the Minority)

Before: [Cat, Dog, Dog, Dog, Dog]
After:  [Cat, Cat, Cat, Dog, Dog, Dog, Dog]

2. Undersampling (Reduce the Majority)

Before: [Cat, Dog, Dog, Dog, Dog]
After:  [Cat, Dog, Dog]
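Random oversampling and undersampling can both be sketched with the `random` module. For simplicity the "dataset" here is just a list of labels, as in the examples above. Real rows would carry features too. The helper names and target counts are choices made for this example.

```python
import random
from collections import Counter

def random_oversample(labels, minority, target, rng):
    """Duplicate randomly chosen minority items until there are `target` of them."""
    minority_items = [x for x in labels if x == minority]
    missing = max(0, target - len(minority_items))
    extra = [rng.choice(minority_items) for _ in range(missing)]
    return labels + extra

def random_undersample(labels, majority, target, rng):
    """Keep only `target` randomly chosen majority items; keep everything else."""
    majority_items = [x for x in labels if x == majority]
    kept = rng.sample(majority_items, min(target, len(majority_items)))
    return [x for x in labels if x != majority] + kept

rng = random.Random(42)  # seeded so the run is repeatable
pets = ["Cat", "Dog", "Dog", "Dog", "Dog"]

oversampled = random_oversample(pets, "Cat", target=3, rng=rng)
undersampled = random_undersample(pets, "Dog", target=2, rng=rng)

print(Counter(oversampled))   # 3 cats, 4 dogs
print(Counter(undersampled))  # 1 cat, 2 dogs
```

Oversampling keeps every original row but repeats some, while undersampling throws rows away. Which trade-off is acceptable depends on how much data there is to begin with.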

3. SMOTE (Smart Synthetic Data)

Create fake examples of the rare class that look realistic!

4. Class Weights

Tell the model: “Mistakes on rare items cost more!”

Weight for Dog: 1
Weight for Cat: 10 (10x more important!)
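Here is a small sketch of how weights change the way mistakes are counted. Both helpers are invented for this example, and picking weights as "largest class count divided by this class's count" is just one simple heuristic among several.

```python
from collections import Counter

def inverse_frequency_weights(labels):
    """Give each class a weight of (largest class count) / (its own count)."""
    counts = Counter(labels)
    largest = max(counts.values())
    return {label: largest / count for label, count in counts.items()}

def weighted_error(y_true, y_pred, weights):
    """Share of the total weight that sits on misclassified examples."""
    total = sum(weights[t] for t in y_true)
    wrong = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return wrong / total

y_true = ["Dog"] * 10 + ["Cat"]
y_pred = ["Dog"] * 11  # a lazy model that always says Dog

weights = inverse_frequency_weights(y_true)
print(weights)  # {'Dog': 1.0, 'Cat': 10.0}

plain = weighted_error(y_true, y_pred, {"Dog": 1, "Cat": 1})
weighted = weighted_error(y_true, y_pred, weights)
print(round(plain, 3))  # 0.091
print(weighted)         # 0.5
```

With equal weights the lazy model looks about 91% right. With class weights, the single missed cat counts as half of the total error, so the model can no longer ignore it.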

🎯 Putting It All Together

The Preprocessing Pipeline

graph TD
    A["Raw Messy Data"] --> B["Handle Missing Values"]
    B --> C["Detect & Treat Outliers"]
    C --> D["Augment if Needed"]
    D --> E["Balance Classes"]
    E --> F["Clean Data Ready!"]
    F --> G["Train Your Model"]

Quick Decision Guide

Problem         | Solution
----------------|----------------------
Empty cells     | Imputation or removal
Weird values    | Outlier treatment
Too little data | Augmentation
Uneven classes  | Balancing techniques

🚀 Key Takeaways

Missing Values = Empty puzzle pieces. Fill or remove them!
Imputation = Smart guessing. Use mean, median, or mode.
Outliers = Giraffes among cats. Detect and decide: keep, remove, or cap.
Augmentation = Making copies with changes. More data = better learning!
Class Imbalance = Unfair teams. Balance them for fair predictions.

💡 Remember This Forever

“Garbage in, garbage out!”

Your model can only be as good as your data. Clean data = Smart predictions. Messy data = Confused model.

Data preprocessing is like brushing your teeth before a date. It’s not glamorous, but skip it, and everything goes wrong! 🦷✨
