
The Unbalanced Seesaw: Mastering Imbalanced Data

Once Upon a Time in Data Land…

Imagine you’re at a playground with a seesaw. On one side, there are 99 elephants. On the other side, there’s just 1 tiny mouse. What happens? The seesaw crashes down on the elephant side! The poor mouse gets launched into the sky.

This is exactly what happens when your data is imbalanced—and it’s a BIG problem in Data Science.


What is Imbalanced Data?

Simple answer: When one group has WAY more examples than another.

Real examples:

  • Credit card fraud: 99.9% normal transactions, 0.1% fraud
  • Disease detection: 98% healthy patients, 2% sick patients
  • Email spam: 80% regular emails, 20% spam
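
Before picking a fix, look at how lopsided your labels actually are. Here's a minimal sketch, assuming a pandas DataFrame named df with a label column (both names are just placeholders for your own data):

import pandas as pd

# Hypothetical labels: 995 normal transactions (0) and 5 frauds (1)
df = pd.DataFrame({"label": [0] * 995 + [1] * 5})

print(df["label"].value_counts())                 # raw counts per class
print(df["label"].value_counts(normalize=True))   # fractions: 0.995 vs 0.005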

Why is This a Problem?

Think about teaching a child to identify animals:

What You Show       | What They Learn
99 pictures of dogs | “Everything is a dog!”
1 picture of a cat  | “What’s a cat?”

Your machine learning model does the same thing! It becomes lazy and just predicts the majority class because it’s right most of the time.

graph TD
  A["Training Data"] --> B{Is it balanced?}
  B -->|Yes: 50-50| C["Model learns both classes well"]
  B -->|No: 99-1| D["Model ignores minority class"]
  D --> E["Predicts majority class always"]
  E --> F["Misses all the important rare cases!"]
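
You can watch this laziness happen with a baseline that always guesses the majority class. A quick sketch using scikit-learn's DummyClassifier on made-up 99-to-1 toy data:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Toy data: 990 majority samples (class 0), 10 minority samples (class 1)
X = np.random.rand(1000, 3)
y = np.array([0] * 990 + [1] * 10)

# A "model" that always predicts the most common class
lazy = DummyClassifier(strategy="most_frequent").fit(X, y)
preds = lazy.predict(X)

print(accuracy_score(y, preds))              # ~0.99 -- looks impressive
print(recall_score(y, preds, pos_label=1))   # 0.0 -- finds zero rare cases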

The Four Heroes: Solutions for Imbalanced Data

Meet our four heroes who save the day:

  1. Oversampling - Make copies of the rare examples
  2. SMOTE - Create new synthetic rare examples
  3. Undersampling - Remove some common examples
  4. Class Weights - Tell the model “rare = important!”

Hero #1: Oversampling

The Cookie Story

Imagine you’re making cookies. You have:

  • 100 chocolate chip cookies
  • Only 5 oatmeal cookies

Your friends only learn to recognize chocolate chip!

Solution: Make photocopies of those 5 oatmeal cookies until you have 100 of each!

How It Works

BEFORE Oversampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●● (5 samples)

AFTER Oversampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●●●●●●●●●● (100 samples)
         ↑ Duplicated!

Python Example

# Simple random oversampling
from sklearn.utils import resample

# minority_class has far fewer rows than majority_class
minority_upsampled = resample(
    minority_class,
    replace=True,                    # sample with replacement (allow copies)
    n_samples=len(majority_class),   # grow until it matches the majority size
    random_state=42                  # reproducible resampling
)
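
If you're already using the imblearn library (like the SMOTE and undersampling examples later on), the same idea is a one-liner. A quick sketch, assuming X_train and y_train come from your usual train/test split:

from imblearn.over_sampling import RandomOverSampler

# Randomly duplicate minority rows until both classes are the same size
ros = RandomOverSampler(random_state=42)
X_balanced, y_balanced = ros.fit_resample(X_train, y_train)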

Pros and Cons

Pros                 | Cons
Simple to understand | Creates exact duplicates
Easy to implement    | Can cause overfitting
Preserves all data   | Model memorizes copies

Hero #2: SMOTE (Synthetic Minority Oversampling)

The Magic Art Class Story

Instead of photocopying cookies, what if we drew NEW cookies that look similar but aren’t identical?

SMOTE is like a magic artist that:

  1. Looks at a rare example
  2. Finds its neighbors (similar examples)
  3. Draws a NEW example somewhere in between!

Visual Explanation

Original minority points: ● and ★

     ●
      \
       ◆ ← NEW synthetic point!
      /
     ★

SMOTE creates ◆ between ● and ★

How SMOTE Works (Step by Step)

graph TD
  A["Pick a minority sample"] --> B["Find K nearest neighbors"]
  B --> C["Pick one neighbor randomly"]
  C --> D["Draw a line between them"]
  D --> E["Place new point on that line"]
  E --> F["Repeat until balanced!"]
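
Under the hood, each new point is just a spot on the line between a minority sample and one of its neighbors. Here's a tiny hand-rolled sketch of that single step (real SMOTE also does the neighbor search and repeats until balanced):

import numpy as np

rng = np.random.default_rng(42)

# Two nearby minority-class samples (made-up 2D points)
sample = np.array([1.0, 2.0])
neighbor = np.array([2.0, 3.0])

# Pick a random spot on the line segment between them
gap = rng.random()                               # value in [0, 1)
synthetic = sample + gap * (neighbor - sample)

print(synthetic)   # a brand-new point lying between the two originals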

Python Example

from imblearn.over_sampling import SMOTE

# Generate synthetic minority samples until both classes are the same size
smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(
    X_train, y_train
)

Why SMOTE is Better Than Basic Oversampling

Oversampling        | SMOTE
Makes exact copies  | Creates NEW unique samples
Model memorizes     | Model generalizes
Risk of overfitting | Reduces overfitting risk

Hero #3: Undersampling

The Birthday Party Story

You’re planning a party. You invited:

  • 100 kids who like pizza
  • 5 kids who like salad

Instead of forcing everyone to eat pizza, you:

  • Keep all 5 salad lovers
  • Randomly pick only 5 pizza lovers

Now it’s fair: 5 vs 5!

How It Works

BEFORE Undersampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●● (5 samples)

AFTER Undersampling:
Class A: ●● (5 samples) ← Reduced!
Class B: ●● (5 samples)

Python Example

from imblearn.under_sampling import RandomUnderSampler

# Randomly drop majority samples until both classes are the same size
rus = RandomUnderSampler(random_state=42)
X_balanced, y_balanced = rus.fit_resample(
    X_train, y_train
)

Pros and Cons

Pros                  | Cons
Simple and fast       | Loses valuable data!
No synthetic data     | May lose important patterns
Reduces training time | Risky with small datasets

When to Use Undersampling

  • You have LOTS of data in the majority class
  • Training time is a concern
  • The majority class has redundant examples

Hero #4: Class Weights

The Superhero Points Story

Imagine a game where:

  • Catching a common butterfly = 1 point
  • Catching a rare golden butterfly = 100 points!

You’d pay WAY more attention to golden butterflies, right?

Class weights do exactly this for your model!

How It Works

Instead of changing your data, you tell the model:

“Hey! When you make a mistake on the rare class, it’s 100x worse than making a mistake on the common class!”

graph TD
  A["Model makes prediction"] --> B{Was it correct?}
  B -->|Wrong on Common class| C["Small penalty: 1x"]
  B -->|Wrong on Rare class| D["HUGE penalty: 100x"]
  D --> E["Model learns to care about rare class!"]

Python Example

from sklearn.linear_model import LogisticRegression

# Automatic weight calculation (inversely proportional to class frequency)
model = LogisticRegression(
    class_weight='balanced'
)

# Or manual weights: mistakes on class 1 cost 99x more than on class 0
model = LogisticRegression(
    class_weight={0: 1, 1: 99}
)

The Math Behind Balanced Weights

weight = total_samples / (n_classes × class_count)

Example: 1000 total, 950 normal, 50 fraud
- Normal weight = 1000 / (2 × 950) = 0.53
- Fraud weight = 1000 / (2 × 50) = 10.0

Fraud is weighted 19x more!
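
You don't have to do this arithmetic by hand. A quick sketch that reproduces the numbers above with scikit-learn's compute_class_weight (using the same made-up 950/50 split):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Same toy setup: 950 normal (0) and 50 fraud (1) labels
y = np.array([0] * 950 + [1] * 50)

weights = compute_class_weight(
    class_weight="balanced",
    classes=np.array([0, 1]),
    y=y,
)
print(weights)   # [0.526..., 10.0] -- matches the hand calculation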

Why Class Weights are Awesome

Feature               | Benefit
No data modification  | Keep all your original data
Works with any model  | Most sklearn models support it
Computationally cheap | No extra data to process
Easy to tune          | Just adjust the weights

Choosing Your Hero: Decision Guide

graph TD
  A["Imbalanced Data?"] --> B{How much data do you have?}
  B -->|Small dataset| C{Need more samples?}
  B -->|Large dataset| D["Try Undersampling first"]
  C -->|Yes| E["Use SMOTE"]
  C -->|No| F["Use Class Weights"]
  D --> G{Still not working?}
  G -->|Yes| H["Combine methods!"]

Quick Reference Table

Method        | Best When            | Avoid When
Oversampling  | Small minority class | Prone to overfitting
SMOTE         | Need diverse samples | Very noisy data
Undersampling | Huge majority class  | Limited data
Class Weights | Any situation        | Need actual samples

Combining Methods: The Ultimate Power Move

Real data scientists often combine these methods!

Popular Combinations

  1. SMOTE + Undersampling

    • Increase minority with SMOTE
    • Decrease majority with undersampling
    • Meet in the middle!
  2. SMOTE + Class Weights

    • Create synthetic samples
    • Still weight them appropriately

Python Example

from imblearn.combine import SMOTETomek

# SMOTE + Tomek link cleaning
smt = SMOTETomek(random_state=42)
X_balanced, y_balanced = smt.fit_resample(
    X_train, y_train
)
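
Combination #1 (SMOTE plus undersampling) can also be spelled out step by step. A sketch, where the 0.5 and 1.0 ratios are arbitrary choices just to show the "meet in the middle" idea:

from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# Step 1: grow the minority class to half the size of the majority class
X_over, y_over = SMOTE(
    sampling_strategy=0.5, random_state=42
).fit_resample(X_train, y_train)

# Step 2: shrink the majority class until the two classes are equal
X_balanced, y_balanced = RandomUnderSampler(
    sampling_strategy=1.0, random_state=42
).fit_resample(X_over, y_over)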

Measuring Success: Beyond Accuracy

Warning: Never trust accuracy with imbalanced data!

The Accuracy Trap

Dataset: 99 cats, 1 dog
Model predicts: "Everything is a cat!"
Accuracy: 99% (Sounds great!)
Dogs found: 0% (Terrible!)

Better Metrics

Metric    | What It Measures
Precision | Of all predicted positives, how many are correct?
Recall    | Of all actual positives, how many did we find?
F1-Score  | Balance between precision and recall
AUC-ROC   | Overall ranking ability
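
A sketch of computing these with scikit-learn, assuming you already have true labels y_test, hard predictions y_pred, and positive-class probabilities y_proba from a fitted model:

from sklearn.metrics import classification_report, f1_score, roc_auc_score

# Precision, recall, and F1 for every class in one report
print(classification_report(y_test, y_pred))

# Single-number summaries that respect the minority class
print("F1:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_proba))   # y_proba = P(class 1)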

Your Imbalanced Data Checklist

  • [ ] Check your class distribution first
  • [ ] Never trust accuracy alone
  • [ ] Try class weights (easiest!)
  • [ ] Try SMOTE if you need more data
  • [ ] Use F1-score or AUC for evaluation
  • [ ] Combine methods if needed

The Happy Ending

Remember our seesaw? With these four heroes:

  • Oversampling adds more mice (copies)
  • SMOTE creates new mice (synthetic)
  • Undersampling removes some elephants
  • Class Weights makes mice heavier

Now the seesaw is balanced, and everyone can play fairly!

You’re now ready to tackle any imbalanced dataset! Go forth and balance those classes! 🎯
