🍳 Data Preparation: Getting Your Ingredients Ready

Imagine you’re making a delicious cake. Before you can bake, you need to prepare your ingredients—wash the fruits, measure the flour, remove any bad eggs. Data Preparation is exactly like this! It’s how we get messy, raw data ready so computers can learn from it properly.


The Kitchen Analogy 🧑‍🍳

Think of your data as grocery bags full of ingredients. Some bags have:

  • Labels like “Red” or “Large” instead of numbers
  • Ingredients of wildly different sizes (a watermelon vs. a grape)
  • Duplicate items (oops, bought milk twice!)
  • Ingredients that need to be chopped or transformed

Our job? Clean, organize, and prepare everything so our AI chef can cook up amazing predictions!


🏷️ Categorical Encoding Methods

What’s the Problem?

Computers are like calculators—they only understand numbers! But our data often has words:

Fruit    Color
Apple    Red
Banana   Yellow
Grape    Purple

How do we tell the computer about “Red” or “Yellow”? We encode them into numbers!

Method 1: Label Encoding 🔢

The Simple Way: Give each category a number.

Red    → 0
Yellow → 1
Purple → 2

Example:

Fruit    Color (Before)   Color (After)
Apple    Red              0
Banana   Yellow           1
Grape    Purple           2

⚠️ Warning: The computer might think Purple (2) is “bigger” than Red (0). This can cause problems!

When to use: For target labels, or with models (like decision trees) that don’t read the numbers as sizes. If your categories do have a natural order (like Small < Medium < Large), Ordinal Encoding (Method 3 below) lets you set that order yourself.
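A minimal sketch of label encoding with pandas (scikit-learn’s LabelEncoder does the same job). Note that pandas assigns codes in alphabetical order, so the exact numbers differ from the 0/1/2 shown above:

```python
import pandas as pd

# Toy data mirroring the fruit/color table above
df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Grape"],
                   "Color": ["Red", "Yellow", "Purple"]})

# Label encoding: each category becomes an integer code.
# pandas orders categories alphabetically: Purple=0, Red=1, Yellow=2.
df["Color_encoded"] = df["Color"].astype("category").cat.codes
print(df["Color_encoded"].tolist())  # [1, 2, 0]
```

The arbitrary alphabetical ordering is exactly the warning above in action: the computer now “thinks” Yellow (2) > Red (1) > Purple (0) for no meaningful reason.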


Method 2: One-Hot Encoding 🎯

The Smart Way: Create a separate column for each category with 0 or 1.

Think of it like a checklist:

  • Is it Red? ✓ or ✗
  • Is it Yellow? ✓ or ✗
  • Is it Purple? ✓ or ✗
Red    → [1, 0, 0]
Yellow → [0, 1, 0]
Purple → [0, 0, 1]

Example:

Fruit    Is_Red   Is_Yellow   Is_Purple
Apple    1        0           0
Banana   0        1           0
Grape    0        0           1

When to use: When categories have no natural order (colors, countries, names).
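The same checklist idea in pandas, where `get_dummies` builds one 0/1 column per category:

```python
import pandas as pd

df = pd.DataFrame({"Fruit": ["Apple", "Banana", "Grape"],
                   "Color": ["Red", "Yellow", "Purple"]})

# One-hot encoding: one 0/1 column per color (the "checklist")
onehot = pd.get_dummies(df["Color"], prefix="Is", dtype=int)
df = pd.concat([df, onehot], axis=1)
print(df)
```

Because no column is “bigger” than another, the model can’t invent a fake ordering among the colors.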


Method 3: Ordinal Encoding 📊

For Ordered Categories: When your labels have a natural ranking.

T-Shirt Size:
Small  → 1
Medium → 2
Large  → 3
XL     → 4

Example:

Customer   Size (Before)   Size (After)
Alice      Small           1
Bob        Large           3
Carol      Medium          2

The Difference from Label Encoding: Here, the order actually matters! Large (3) IS bigger than Small (1).
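The key to ordinal encoding is that *you* define the ranking, for example with an explicit mapping:

```python
import pandas as pd

df = pd.DataFrame({"Customer": ["Alice", "Bob", "Carol"],
                   "Size": ["Small", "Large", "Medium"]})

# Ordinal encoding: we choose the ranking ourselves, so the order is meaningful
size_order = {"Small": 1, "Medium": 2, "Large": 3, "XL": 4}
df["Size_encoded"] = df["Size"].map(size_order)
print(df["Size_encoded"].tolist())  # [1, 3, 2]
```

Unlike the alphabetical codes from label encoding, these numbers carry real meaning: 3 really is a bigger size than 1.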


📏 Feature Scaling Methods

Why Scale?

Imagine comparing these two things:

  • Age: 25 years
  • Salary: $50,000

The salary number is HUGE compared to age. The computer might think salary is 2000x more important just because the number is bigger!

Scaling makes everything fair by putting all numbers on a similar range.


Method 1: Min-Max Scaling (Normalization) 📐

Squishes all values between 0 and 1.

Formula (in simple terms):

New Value = (Value - Minimum) / (Maximum - Minimum)

Example: Ages 20, 30, 40

  • Minimum = 20, Maximum = 40
  • Age 20 → (20-20)/(40-20) = 0
  • Age 30 → (30-20)/(40-20) = 0.5
  • Age 40 → (40-20)/(40-20) = 1

Result:

Original Age   Scaled Age
20             0.0
30             0.5
40             1.0

When to use: When you want values between 0 and 1, and your data has no extreme outliers.
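The formula above, worked through in plain Python (in practice you’d typically reach for scikit-learn’s MinMaxScaler, which does the same arithmetic per column):

```python
# Min-max scaling by hand, using the ages 20, 30, 40 from the example
ages = [20, 30, 40]
lo, hi = min(ages), max(ages)
scaled = [(a - lo) / (hi - lo) for a in ages]
print(scaled)  # [0.0, 0.5, 1.0]
```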


Method 2: Standard Scaling (Z-Score) 📊

Centers data around 0, with most values between -3 and +3.

The idea: How far is each value from the average?

Formula (simple version):

New Value = (Value - Average) / Spread

Here, “Spread” means the standard deviation: how far values typically sit from the average.

Example: Test Scores 60, 70, 80

  • Average = 70
  • If spread = 10
  • Score 60 → (60-70)/10 = -1 (below average)
  • Score 70 → (70-70)/10 = 0 (exactly average)
  • Score 80 → (80-70)/10 = +1 (above average)

When to use: When you have outliers or need data centered around zero.
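The test-score example as code, using the standard library’s `statistics` module (for these three scores, the sample standard deviation happens to be exactly 10):

```python
import statistics

# Z-score scaling for the test scores 60, 70, 80
scores = [60, 70, 80]
mean = statistics.mean(scores)      # 70
spread = statistics.stdev(scores)   # sample standard deviation = 10.0
z = [(s - mean) / spread for s in scores]
print(z)  # [-1.0, 0.0, 1.0]
```

Scikit-learn’s StandardScaler applies the same idea column by column.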


Method 3: Robust Scaling 💪

Ignores extreme values (outliers)!

Imagine everyone in class scored 70-80, but one person scored 200 (maybe cheated?). Regular scaling would be thrown off by that 200.

Robust Scaling uses the median and the interquartile range (the middle 50% of values), so outliers don’t ruin everything.

When to use: When your data has crazy outliers you can’t remove.
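A sketch of the idea—subtract the median, divide by the interquartile range—using the cheating-classmate example (scikit-learn’s RobustScaler does this per column; the exact quartile values depend on the interpolation method):

```python
import statistics

# Everyone scored 70-80, except one suspicious 200
scores = [70, 72, 74, 75, 76, 78, 80, 200]

median = statistics.median(scores)        # 75.5
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1                             # interquartile range

# The outlier barely shifts the median and IQR, so normal scores
# still land near zero after scaling.
robust = [(s - median) / iqr for s in scores]
print(robust)
```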


🔍 Data Deduplication

The Duplicate Problem

You’re organizing your music playlist. Suddenly you notice:

  • “Happy” by Pharrell
  • “Happy” by Pharrell
  • “Happy - Pharrell Williams”

Same song, three times! Duplicates waste space and confuse our AI.


Finding Duplicates

Exact Duplicates: Rows that are 100% identical.

Name   Age   City
John   25    NYC
John   25    NYC
John   26    NYC

Fuzzy Duplicates: Almost the same, but with tiny differences.

Name
John Smith
John Smyth
Jon Smith

Removing Duplicates

Strategy 1: Keep First

Keep the first occurrence, delete the rest.

Strategy 2: Keep Last

Keep the most recent entry.

Strategy 3: Aggregate

If you have duplicate sales records,
add up the amounts instead of deleting.

Example:

Customer   Purchase
Alice      $50
Alice      $30

After aggregation:

Customer   Total Purchase
Alice      $80
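In pandas, “keep first” is the default of `drop_duplicates`, and aggregation is a `groupby` plus a sum:

```python
import pandas as pd

# Exact duplicates: drop_duplicates keeps the first occurrence by default
df = pd.DataFrame({"Name": ["John", "John", "John"],
                   "Age":  [25, 25, 26],
                   "City": ["NYC", "NYC", "NYC"]})
deduped = df.drop_duplicates()  # keep="first"; keep="last" keeps the most recent
print(len(deduped))             # 2 rows left: (John, 25) and (John, 26)

# Aggregation: add up duplicate purchases instead of deleting them
sales = pd.DataFrame({"Customer": ["Alice", "Alice"],
                      "Purchase": [50, 30]})
totals = sales.groupby("Customer", as_index=False)["Purchase"].sum()
print(totals)  # Alice: 80
```

Fuzzy duplicates (“John Smyth” vs. “John Smith”) need extra work, such as string-similarity matching, before they can be dropped.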

🔄 Data Transformations

Why Transform?

Sometimes data is weirdly shaped. Like having:

  • Most people earn $30,000-$70,000
  • A few billionaires earn $1,000,000,000

This skewed data confuses AI models. Transformations help fix the shape!


Transformation 1: Log Transformation 📉

Shrinks huge numbers while keeping small ones similar.

Before:

Salaries: $30K, $50K, $80K, $10M

After Log Transform (base 10):

Log values: 4.5, 4.7, 4.9, 7.0

The million-dollar salary no longer dominates!

Original Data (30K, 50K, 10M) → Apply Log → Transformed (4.5, 4.7, 7.0) → Much More Balanced!
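The salary example, reproduced with a base-10 log (for data containing zeros, a common variant is log(1 + x)):

```python
import math

# Log (base-10) transform of the salaries in the example
salaries = [30_000, 50_000, 80_000, 10_000_000]
logged = [round(math.log10(s), 1) for s in salaries]
print(logged)  # [4.5, 4.7, 4.9, 7.0]
```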

Transformation 2: Square Root Transformation √

Gentler than log, good for count data.

Example: Website Visits

Page      Visits   √Visits
Home      10000    100
About     100      10
Contact   25       5
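The website-visits table as a one-liner:

```python
import math

# Square-root transform of the page-visit counts above
visits = {"Home": 10_000, "About": 100, "Contact": 25}
sqrt_visits = {page: math.sqrt(v) for page, v in visits.items()}
print(sqrt_visits)  # {'Home': 100.0, 'About': 10.0, 'Contact': 5.0}
```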

Transformation 3: Box-Cox Transformation 📦

The Smart Transformer: Automatically finds the best transformation for your data!

Think of it like a shape-shifting power—it adjusts itself to make your data as “normal” (bell-curve shaped) as possible.

When to use: When you’re not sure which transformation to apply.
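A sketch using SciPy (assuming it is installed): `scipy.stats.boxcox` searches for the power, called lambda, that makes the data most bell-shaped. It requires strictly positive values:

```python
import numpy as np
from scipy import stats

# Box-Cox automatically picks the exponent (lambda) that best
# normalizes the data; all values must be > 0.
data = np.array([30_000, 50_000, 80_000, 10_000_000], dtype=float)
transformed, best_lambda = stats.boxcox(data)
print(best_lambda)            # the automatically chosen exponent
print(transformed.round(2))   # still in order, but far less skewed
```

Lambda = 0 corresponds to a log transform and lambda = 0.5 to a square root, so Box-Cox generalizes both of the previous methods.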


Transformation 4: Binning (Discretization) 🗑️

Groups continuous numbers into buckets.

Example: Age Groups

Age   Age Group
5     Child
15    Teen
25    Adult
45    Adult
70    Senior

Bins:

  • 0-12: Child
  • 13-19: Teen
  • 20-59: Adult
  • 60+: Senior
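The bins above map directly onto pandas’ `cut`, which by default includes each bin’s right edge:

```python
import pandas as pd

ages = pd.Series([5, 15, 25, 45, 70])
# Bin edges chosen to match the groups above: (0,12], (12,19], (19,59], (59,120]
groups = pd.cut(ages, bins=[0, 12, 19, 59, 120],
                labels=["Child", "Teen", "Adult", "Senior"])
print(groups.tolist())  # ['Child', 'Teen', 'Adult', 'Adult', 'Senior']
```

The upper edge of 120 is an arbitrary cap standing in for “60+”.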

🎯 Putting It All Together

Here’s your Data Preparation Recipe:

🛒 Raw Data → 🏷️ Encode Categories → 📏 Scale Numbers → 🔍 Remove Duplicates → 🔄 Transform if Needed → ✨ Clean Data Ready!
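The whole recipe can be sketched end to end on a tiny, made-up dataset:

```python
import pandas as pd

# Hypothetical raw data: one categorical color, one numeric age, one city
df = pd.DataFrame({"Color": ["Red", "Yellow", "Red", "Red"],
                   "Age":   [20, 30, 20, 40],
                   "City":  ["NYC", "LA", "NYC", "LA"]})

# 1. Encode categories (one-hot)
df = pd.get_dummies(df, columns=["Color", "City"], dtype=int)
# 2. Scale numbers (min-max)
df["Age"] = (df["Age"] - df["Age"].min()) / (df["Age"].max() - df["Age"].min())
# 3. Remove exact duplicates (rows 0 and 2 are identical)
df = df.drop_duplicates()
print(df)
```

Real projects usually wrap these steps in a scikit-learn Pipeline so the exact same preparation is applied to training and future data.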

Quick Reference Table

Step                   What It Does           Example
Categorical Encoding   Words → Numbers        “Red” → 0 or [1,0,0]
Feature Scaling        Same range for all     50000 → 0.5
Deduplication          Remove copies          3 Johns → 1 John
Transformation         Fix weird shapes       $10M → 7.0

🌟 Key Takeaways

  1. Encoding turns words into numbers computers understand
  2. Scaling puts all features on a comparable range so none dominates just because its numbers are bigger
  3. Deduplication removes wasteful copies
  4. Transformations fix weirdly shaped data

Remember: Good data preparation = Better AI predictions!

Just like a chef who carefully prepares ingredients creates amazing dishes, a data scientist who properly prepares data builds powerful models! 🍳➡️🤖


“Give me six hours to chop down a tree and I will spend the first four sharpening the axe.” — Abraham Lincoln

Translation: Spend time preparing your data well, and your AI will thank you!
