The Unbalanced Seesaw: Mastering Imbalanced Data
Once Upon a Time in Data Land…
Imagine you’re at a playground with a seesaw. On one side, there are 99 elephants. On the other side, there’s just 1 tiny mouse. What happens? The seesaw crashes down on the elephant side! The poor mouse gets launched into the sky.
This is exactly what happens when your data is imbalanced—and it’s a BIG problem in Data Science.
What is Imbalanced Data?
Simple answer: When one group has WAY more examples than another.
Real examples:
- Credit card fraud: 99.9% normal transactions, 0.1% fraud
- Disease detection: 98% healthy patients, 2% sick patients
- Email spam: 80% regular emails, 20% spam
Why is This a Problem?
Think about teaching a child to identify animals:
| What You Show | What They Learn |
|---|---|
| 99 pictures of dogs | “Everything is a dog!” |
| 1 picture of a cat | “What’s a cat?” |
Your machine learning model does the same thing! It becomes lazy and just predicts the majority class because it’s right most of the time.
graph TD A["Training Data"] --> B{Is it balanced?} B -->|Yes: 50-50| C["Model learns both classes well"] B -->|No: 99-1| D["Model ignores minority class"] D --> E["Predicts majority class always"] E --> F["Misses all the important rare cases!"]
The Four Heroes: Solutions for Imbalanced Data
Meet our four heroes who save the day:
- Oversampling - Make copies of the rare examples
- SMOTE - Create new synthetic rare examples
- Undersampling - Remove some common examples
- Class Weights - Tell the model “rare = important!”
Hero #1: Oversampling
The Cookie Story
Imagine you’re making cookies. You have:
- 100 chocolate chip cookies
- Only 5 oatmeal cookies
If you ask your friends to learn what your cookies look like, they only ever learn to recognize chocolate chip!
Solution: Make photocopies of those 5 oatmeal cookies until you have 100 of each!
How It Works
BEFORE Oversampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●● (5 samples)
AFTER Oversampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●●●●●●●●●● (100 samples)
↑ Duplicated!
Python Example
# Simple random oversampling
from sklearn.utils import resample

# minority_class / majority_class: the rows of each class
# (see below for one way to build them from a DataFrame)
minority_upsampled = resample(
    minority_class,
    replace=True,                    # sample with replacement (allow copies)
    n_samples=len(majority_class),   # match the majority class size
    random_state=42                  # reproducible results
)
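The snippet above assumes you have already split your data by class. If your data lives in a pandas DataFrame, one possible way to build minority_class and majority_class and put everything back together afterwards looks like this (the df variable and the label column name are illustrative assumptions, not part of the original example):
import pandas as pd
# 'label' is an illustrative column name for the class
majority_class = df[df['label'] == 0]
minority_class = df[df['label'] == 1]
# ...run resample() as above, then recombine into one balanced DataFrame
df_balanced = pd.concat([majority_class, minority_upsampled])
df_balanced = df_balanced.sample(frac=1, random_state=42)  # shuffle the rows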
Pros and Cons
| Pros | Cons |
|---|---|
| Simple to understand | Creates exact duplicates |
| Easy to implement | Can cause overfitting |
| Preserves all data | Model memorizes copies |
Hero #2: SMOTE (Synthetic Minority Oversampling Technique)
The Magic Art Class Story
Instead of photocopying cookies, what if we drew NEW cookies that look similar but aren’t identical?
SMOTE is like a magic artist that:
- Looks at a rare example
- Finds its neighbors (similar examples)
- Draws a NEW example somewhere in between!
Visual Explanation
Original minority points: ● and ★
●
\
◆ ← NEW synthetic point!
/
★
SMOTE creates ◆ between ● and ★
How SMOTE Works (Step by Step)
graph TD A["Pick a minority sample"] --> B["Find K nearest neighbors"] B --> C["Pick one neighbor randomly"] C --> D["Draw a line between them"] D --> E["Place new point on that line"] E --> F["Repeat until balanced!"]
Python Example
from imblearn.over_sampling import SMOTE

smote = SMOTE(random_state=42)
X_balanced, y_balanced = smote.fit_resample(X_train, y_train)
Why SMOTE is Better Than Basic Oversampling
| Oversampling | SMOTE |
|---|---|
| Makes exact copies | Creates NEW unique samples |
| Model memorizes | Model generalizes |
| Risk of overfitting | Reduces overfitting risk |
Hero #3: Undersampling
The Birthday Party Story
You’re planning a party. You invited:
- 100 kids who like pizza
- 5 kids who like salad
Instead of forcing everyone to eat pizza, you:
- Keep all 5 salad lovers
- Randomly pick only 5 pizza lovers
Now it’s fair: 5 vs 5!
How It Works
BEFORE Undersampling:
Class A: ●●●●●●●●●● (100 samples)
Class B: ●● (5 samples)
AFTER Undersampling:
Class A: ●● (5 samples) ← Reduced!
Class B: ●● (5 samples)
Python Example
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
X_balanced, y_balanced = rus.fit_resample(X_train, y_train)
Pros and Cons
| Pros | Cons |
|---|---|
| Simple and fast | Loses valuable data! |
| No synthetic data | May lose important patterns |
| Reduces training time | Risky with small datasets |
When to Use Undersampling
- You have LOTS of data in the majority class
- Training time is a concern
- The majority class has redundant examples
Hero #4: Class Weights
The Superhero Points Story
Imagine a game where:
- Catching a common butterfly = 1 point
- Catching a rare golden butterfly = 100 points!
You’d pay WAY more attention to golden butterflies, right?
Class weights do exactly this for your model!
How It Works
Instead of changing your data, you tell the model:
“Hey! When you make a mistake on the rare class, it’s 100x worse than making a mistake on the common class!”
graph TD A["Model makes prediction"] --> B{Was it correct?} B -->|Wrong on Common class| C["Small penalty: 1x"] B -->|Wrong on Rare class| D["HUGE penalty: 100x"] D --> E["Model learns to care about rare class!"]
Python Example
from sklearn.linear_model import LogisticRegression

# Automatic weight calculation
model = LogisticRegression(class_weight='balanced')

# Or manual weights: mistakes on class 1 cost 99x more
model = LogisticRegression(class_weight={0: 1, 1: 99})
The Math Behind Balanced Weights
weight = total_samples / (n_classes × class_count)
Example: 1000 total, 950 normal, 50 fraud
- Normal weight = 1000 / (2 × 950) ≈ 0.53
- Fraud weight = 1000 / (2 × 50) = 10.0
Fraud is weighted 19x more!
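You don't have to do this arithmetic by hand: scikit-learn's compute_class_weight applies the same formula. A quick check with made-up labels matching the example above (950 normal, 50 fraud):
import numpy as np
from sklearn.utils.class_weight import compute_class_weight
# 950 normal (0) and 50 fraud (1) labels
y = np.array([0] * 950 + [1] * 50)
weights = compute_class_weight(
    class_weight='balanced',
    classes=np.unique(y),
    y=y
)
print(weights)  # roughly [0.53, 10.0]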
Why Class Weights are Awesome
| Feature | Benefit |
|---|---|
| No data modification | Keep all your original data |
| Works with most models | Most scikit-learn classifiers accept a class_weight parameter |
| Computationally cheap | No extra data to process |
| Easy to tune | Just adjust the weights |
Choosing Your Hero: Decision Guide
graph TD A["Imbalanced Data?"] --> B{How much data do you have?} B -->|Small dataset| C{Need more samples?} B -->|Large dataset| D["Try Undersampling first"] C -->|Yes| E["Use SMOTE"] C -->|No| F["Use Class Weights"] D --> G{Still not working?} G -->|Yes| H["Combine methods!"]
Quick Reference Table
| Method | Best When | Avoid When |
|---|---|---|
| Oversampling | The minority class is small | Your model is already prone to overfitting |
| SMOTE | You need diverse synthetic samples | The data is very noisy |
| Undersampling | The majority class is huge | You have limited data overall |
| Class Weights | Almost any situation | You need actual extra samples |
Combining Methods: The Ultimate Power Move
Real data scientists often combine these methods!
Popular Combinations
- SMOTE + Undersampling
  - Increase the minority class with SMOTE
  - Decrease the majority class with undersampling
  - Meet in the middle!
- SMOTE + Class Weights
  - Create synthetic samples with SMOTE
  - Still weight the rare class appropriately (a minimal sketch follows the example below)
Python Example
from imblearn.combine import SMOTETomek

# SMOTE oversampling followed by Tomek-link cleaning of borderline samples
smt = SMOTETomek(random_state=42)
X_balanced, y_balanced = smt.fit_resample(X_train, y_train)
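The SMOTE + Class Weights combo from the list above has no ready-made class, but it is easy to wire up yourself. A minimal sketch, where sampling_strategy=0.5 (grow the minority to half the majority's size) is just an illustrative choice:
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression
# Partially rebalance with SMOTE...
smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_res, y_res = smote.fit_resample(X_train, y_train)
# ...then let class weights handle the remaining imbalance
model = LogisticRegression(class_weight='balanced', max_iter=1000)
model.fit(X_res, y_res)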
Measuring Success: Beyond Accuracy
Warning: Never trust accuracy with imbalanced data!
The Accuracy Trap
Dataset: 99 cats, 1 dog
Model predicts: "Everything is a cat!"
Accuracy: 99% (Sounds great!)
Dogs found: 0% (Terrible!)
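You can watch this trap happen with scikit-learn's DummyClassifier, which simply predicts the most common class (the 99-cats / 1-dog data below is made up to match the example):
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
X = np.arange(100).reshape(-1, 1)   # a dummy feature
y = np.array([0] * 99 + [1])        # 99 cats (0), 1 dog (1)
lazy_model = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = lazy_model.predict(X)
print(accuracy_score(y, y_pred))  # 0.99 -- sounds great!
print(recall_score(y, y_pred))    # 0.0  -- found zero dogs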
Better Metrics
| Metric | What It Measures |
|---|---|
| Precision | Of all predicted positives, how many are correct? |
| Recall | Of all actual positives, how many did we find? |
| F1-Score | Balance between precision and recall |
| AUC-ROC | How well the model ranks positives above negatives across all thresholds |
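All of these metrics are one import away in scikit-learn. A minimal sketch, assuming you already have a fitted model plus held-out X_test and y_test from your own train/test split:
from sklearn.metrics import classification_report, roc_auc_score
y_pred = model.predict(X_test)
# Precision, recall and F1 for every class (watch the minority class row!)
print(classification_report(y_test, y_pred, digits=3))
# AUC-ROC needs probability scores, not hard labels
y_scores = model.predict_proba(X_test)[:, 1]
print("AUC-ROC:", roc_auc_score(y_test, y_scores))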
Your Imbalanced Data Checklist
- [ ] Check your class distribution first
- [ ] Never trust accuracy alone
- [ ] Try class weights (easiest!)
- [ ] Try SMOTE if you need more data
- [ ] Use F1-score or AUC for evaluation
- [ ] Combine methods if needed
The Happy Ending
Remember our seesaw? With these four heroes:
- Oversampling adds more mice (copies of the original)
- SMOTE creates brand-new mice (synthetic)
- Undersampling removes some of the elephants
- Class Weights make each mouse count for much more
Now the seesaw is balanced, and everyone can play fairly!
You’re now ready to tackle any imbalanced dataset! Go forth and balance those classes! 🎯
