Decision Trees


🌳 Decision Trees: The Smart Question Game

Imagine you’re playing “20 Questions” with a friend. You ask yes/no questions to guess what they’re thinking. Decision Trees work exactly like that—but for computers!


What is a Decision Tree?

Think of a Decision Tree as a flowchart that helps you make decisions by asking simple questions, one at a time.

🎮 The Guessing Game Analogy

You’re guessing an animal:

  1. “Does it have fur?” → Yes
  2. “Does it bark?” → Yes
  3. “Is it a dog?” → YES! You got it!

That’s exactly how a Decision Tree works!

```mermaid
graph TD
    A[Does it have fur?] -->|Yes| B[Does it bark?]
    A -->|No| C[Does it have scales?]
    B -->|Yes| D[🐕 Dog!]
    B -->|No| E[🐱 Cat!]
    C -->|Yes| F[🐟 Fish!]
    C -->|No| G[🐸 Frog!]
```

Real-Life Examples:

  • An email spam filter asks: “Does it have weird links?” → “Is it from an unknown sender?” → SPAM!
  • A doctor asks: “Does the patient have a fever?” → “Do they have a cough?” → “COVID test needed”

The Decision Tree Algorithm

The algorithm builds this question tree automatically from data!

How It Works (Simple Version)

  1. Look at all your data (like pictures of cats and dogs)
  2. Find the BEST question that splits data most cleanly
  3. Ask that question first (put it at the top)
  4. Repeat for each branch until you can make predictions
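If you like seeing those four steps as code, here is a minimal sketch of the greedy loop in plain Python. Everything in it (the `gini`, `best_split`, and `build_tree` names, the list-of-rows data layout) is illustrative rather than taken from a library, and the “messiness” score it uses is the Gini impurity explained further down this page.

```python
from collections import Counter

def gini(labels):
    """Messiness score: the chance of mislabeling a random item (explained below)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels):
    """Try every (feature, value) question; keep the one that leaves the cleanest branches."""
    best, best_score = None, gini(labels)          # a split must improve on the unsplit data
    for feature in range(len(rows[0])):
        for value in set(row[feature] for row in rows):
            left = [lab for row, lab in zip(rows, labels) if row[feature] == value]
            right = [lab for row, lab in zip(rows, labels) if row[feature] != value]
            if not left or not right:
                continue
            # Weighted messiness of the two branches after asking this question
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (feature, value)
    return best

def build_tree(rows, labels):
    """Steps 2-4: ask the best question, split the data, and repeat on each branch."""
    split = best_split(rows, labels)
    if split is None:                              # already clean enough: predict the majority label
        return Counter(labels).most_common(1)[0][0]
    feature, value = split
    yes = [(row, lab) for row, lab in zip(rows, labels) if row[feature] == value]
    no = [(row, lab) for row, lab in zip(rows, labels) if row[feature] != value]
    return {"question": (feature, value),
            "yes": build_tree([r for r, _ in yes], [l for _, l in yes]),
            "no": build_tree([r for r, _ in no], [l for _, l in no])}
```

Real libraries such as scikit-learn run the same kind of greedy search, just with many more refinements.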

🍕 Pizza Example

You want to predict: “Will a kid eat this pizza?”

| Has Cheese | Has Veggies | Kid Eats? |
|------------|-------------|-----------|
| Yes        | No          | Yes ✓     |
| Yes        | Yes         | No ✗      |
| No         | No          | No ✗      |
| Yes        | No          | Yes ✓     |

Best first question: “Has Cheese?”

  • If No → Kid won’t eat (done!)
  • If Yes → Ask “Has Veggies?”

```mermaid
graph TD
    A[Has Cheese?] -->|No| B[Won't Eat 🙁]
    A -->|Yes| C[Has Veggies?]
    C -->|Yes| D[Won't Eat 🙁]
    C -->|No| E[Will Eat! 😋]
```
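
Here is a rough sketch of the same idea with scikit-learn, in case you want to try it yourself. The 0/1 encoding and the [has_cheese, has_veggies] column order are assumptions added for this sketch, not part of the table above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 0],    # cheese, no veggies    -> eats
     [1, 1],    # cheese, veggies       -> won't eat
     [0, 0],    # no cheese, no veggies -> won't eat
     [1, 0]]    # cheese, no veggies    -> eats
y = ["eats", "won't eat", "won't eat", "eats"]

tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=["has_cheese", "has_veggies"]))  # the learned question tree
print(tree.predict([[1, 0]]))  # a cheesy, veggie-free pizza -> ['eats']
```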

Information Gain & Entropy

How does the computer know which question is BEST?

🎲 Entropy = Messiness

Entropy measures how “mixed up” or messy your data is.

  • Low entropy = Very organized (all same thing)
  • High entropy = Very messy (all mixed up)

Example - A Bag of Marbles:

  • Bag A: 🔴🔴🔴🔴 (all red) → Entropy = 0 (super clean!)
  • Bag B: 🔴🔴🔵🔵 (half-half) → Entropy = 1 (very messy!)
  • Bag C: 🔴🔴🔴🔵 (mostly red) → Entropy = 0.81 (a bit messy)
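
A tiny sketch you can run to check those marble-bag numbers (two classes, entropy measured in bits):

```python
from math import log2

def entropy(counts):
    """Entropy (in bits) of a bag, given class counts such as [3, 1] for 3 red + 1 blue."""
    total = sum(counts)
    return sum(-c / total * log2(c / total) for c in counts if c > 0)

print(entropy([4, 0]))  # Bag A: 0.0
print(entropy([2, 2]))  # Bag B: 1.0
print(entropy([3, 1]))  # Bag C: ~0.81
```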

📈 Information Gain = Cleanup Power

Information Gain tells us how much a question REDUCES messiness.

Good question = High information gain = Cleans up data well!

Simple Example:

  • Before asking “Has Cheese?”: Data is messy (2 Yes, 2 No eaters)
  • After asking: Each branch is cleaner!
  • Information Gain = Messiness Before − Messiness After

The algorithm picks the question with the HIGHEST information gain!
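
Putting numbers on the pizza example, reusing the `entropy()` sketch from the marble bags above:

```python
before = entropy([2, 2])                    # 2 eaters, 2 non-eaters -> 1.0 bit of messiness
cheesy = entropy([2, 1])                    # the 3 cheesy pizzas: 2 eaten, 1 not -> ~0.92
cheeseless = entropy([1])                   # the 1 cheeseless pizza: not eaten -> 0.0
after = (3 / 4) * cheesy + (1 / 4) * cheeseless   # weighted average messiness after the split
print(before - after)                       # information gain of "Has Cheese?" -> ~0.31
```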


Gini Impurity

Another way to measure messiness—even simpler!

🎯 What is Gini Impurity?

Gini Impurity = Chance of guessing WRONG if you pick an item at random and guess its label at random, in proportion to the mix

  • Gini = 0 → Perfect! All items are the same (never wrong)
  • Gini = 0.5 → Worst! 50-50 mix (wrong half the time)

🍎 Fruit Basket Example

Basket A: 🍎🍎🍎🍎 (4 apples)

  • Pick randomly, guess “apple” → Always right!
  • Gini = 0

Basket B: 🍎🍎🍊🍊 (2 apples, 2 oranges)

  • Pick randomly, guess any fruit → Wrong 50% of time
  • Gini = 0.5 😵

Basket C: 🍎🍎🍎🍊 (3 apples, 1 orange)

  • Pick randomly, guess in proportion to the mix (“apple” 75% of the time) → Wrong 37.5% of time
  • Gini = 0.375

Formula (Don’t Worry, It’s Easy!)

Gini = 1 - (probability of A)² - (probability of B)²

For Basket C:

Gini = 1 - (3/4)² - (1/4)²
     = 1 - 0.5625 - 0.0625
     = 0.375

Lower Gini = Better split!
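
The same arithmetic as a runnable sketch (the `gini_impurity` name is made up here; it takes a list of class counts):

```python
def gini_impurity(counts):
    """1 minus the sum of squared class probabilities, from class counts like [3, 1]."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

print(gini_impurity([4, 0]))  # Basket A: 0.0
print(gini_impurity([2, 2]))  # Basket B: 0.5
print(gini_impurity([3, 1]))  # Basket C: 0.375
```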


Tree Pruning

What if your tree gets TOO specific?

🌲 The Overgrown Tree Problem

Imagine a tree that memorizes EVERY detail:

  • “Is it Tuesday?”
  • “Did the person have coffee?”
  • “Is their name Bob?”

This tree would be perfect on training data but terrible on new data!

This is called overfitting—like memorizing answers instead of learning.

✂️ Pruning = Trimming the Tree

Pruning removes branches that are too specific.

Before Pruning:

```mermaid
graph TD
    A[Age > 30?] -->|Yes| B[Income > 50k?]
    A -->|No| C[No Loan]
    B -->|Yes| D[Owns Car?]
    B -->|No| E[No Loan]
    D -->|Yes| F[Has Pet?]
    D -->|No| G[Loan OK]
    F -->|Yes| H[Loan OK]
    F -->|No| I[No Loan]
```

After Pruning:

```mermaid
graph TD
    A[Age > 30?] -->|Yes| B[Income > 50k?]
    A -->|No| C[No Loan]
    B -->|Yes| D[Loan OK]
    B -->|No| E[No Loan]
```

Two Types of Pruning

  1. Pre-pruning = Stop early (limit tree depth)
     • “Don’t ask more than 5 questions”
  2. Post-pruning = Trim after growing
     • Build the full tree, then cut weak branches

Result: Simpler tree that works better on NEW data!
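
Here is a hedged sketch of both styles with scikit-learn; the toy data from `make_classification` and the particular limits (`max_depth=3`, `ccp_alpha=0.02`) are arbitrary choices for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)  # toy data, just for illustration

# Pre-pruning: stop early by limiting how many questions the tree may ask in a row
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=10).fit(X, y)

# Post-pruning: grow the tree, then cost-complexity pruning (ccp_alpha) trims weak branches
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X, y)

unpruned = DecisionTreeClassifier().fit(X, y)
print(unpruned.get_depth(), pre_pruned.get_depth(), post_pruned.get_depth())
# The pruned trees are typically much shallower than the unpruned one.
```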


Random Forests

What’s better than one smart tree? A FOREST of trees!

🌲🌲🌲 The Wisdom of Crowds

Problem: One tree can make mistakes. Solution: Ask MANY trees and vote!

Real Life Example:

  • Don’t ask 1 friend for movie advice
  • Ask 10 friends and pick the most popular answer!

How Random Forests Work

  1. Create many trees (usually 100-1000)
  2. Each tree is slightly different (sees different data)
  3. Each tree votes on the answer
  4. Majority wins!

```mermaid
graph TD
    A[New Data] --> B[Tree 1: Cat]
    A --> C[Tree 2: Dog]
    A --> D[Tree 3: Cat]
    A --> E[Tree 4: Cat]
    A --> F[Tree 5: Dog]
    B --> G[Vote: 3 Cats, 2 Dogs]
    C --> G
    D --> G
    E --> G
    F --> G
    G --> H[🐱 Final Answer: Cat!]
```
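
The vote in that diagram is just counting, as this tiny sketch shows:

```python
from collections import Counter

votes = ["Cat", "Dog", "Cat", "Cat", "Dog"]            # one answer per tree, as in the diagram
winner, count = Counter(votes).most_common(1)[0]
print(f"Final answer: {winner} ({count} of {len(votes)} votes)")  # Final answer: Cat (3 of 5 votes)
```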

Why “Random”?

Each tree gets:

  • Random subset of data (not all examples)
  • Random subset of features (not all questions)

This makes trees different from each other = better voting!
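
A sketch of a forest in scikit-learn, with the two kinds of randomness called out in the parameters (the toy data and the specific numbers are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=10, random_state=0)  # toy data

forest = RandomForestClassifier(
    n_estimators=200,     # how many trees get a vote
    bootstrap=True,       # each tree trains on a random sample of the data (with replacement)
    max_features="sqrt",  # each split considers only a random subset of the features
    random_state=0,
).fit(X, y)

print(forest.predict(X[:1]))  # the majority vote of all 200 trees for the first example
```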


Bagging

The secret sauce behind Random Forests!

🎒 Bagging = Bootstrap + Aggregating

Bootstrap = Pick random samples WITH replacement

What does “with replacement” mean?

Imagine a bag with 5 balls: 🔴🔵🟢🟡🟣

Without replacement: Pick 3, can’t pick same one twice

  • Possible: 🔴🔵🟢

With replacement: Pick 3, put back after each pick

  • Possible: 🔴🔴🔵 (same ball can appear twice!)
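
A quick sketch of the difference using Python’s random module (the colour names stand in for the balls):

```python
import random

balls = ["red", "blue", "green", "yellow", "purple"]   # the five balls in the bag
random.seed(0)

print(random.sample(balls, 3))     # without replacement: three different balls, no repeats possible
print(random.choices(balls, k=3))  # with replacement: the same ball can show up more than once
```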

How Bagging Works

  1. Original data: 100 examples
  2. Create Dataset 1: Pick 100 random samples (with replacement)
  3. Create Dataset 2: Pick 100 more random samples
  4. …repeat for each tree
  5. Train each tree on its own dataset
  6. Combine predictions (voting for classification)

```mermaid
graph TD
    A[Original Data: 100 examples] --> B[Sample 1: 100 picks]
    A --> C[Sample 2: 100 picks]
    A --> D[Sample 3: 100 picks]
    B --> E[Train Tree 1]
    C --> F[Train Tree 2]
    D --> G[Train Tree 3]
    E --> H[Combine Votes!]
    F --> H
    G --> H
```
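
Here is a hand-rolled sketch of those six steps, assuming scikit-learn for the individual trees; real bagging implementations (e.g. scikit-learn’s BaggingClassifier or RandomForestClassifier) do the same thing with more care:

```python
import random
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

random.seed(0)
X, y = make_classification(n_samples=100, random_state=0)  # step 1: "original data: 100 examples"

trees = []
for _ in range(25):                                        # 25 trees keeps the sketch fast
    # Steps 2-4: draw 100 indices WITH replacement to build this tree's own dataset
    idx = [random.randrange(len(X)) for _ in range(len(X))]
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))  # step 5: one tree per sample

# Step 6: every tree votes on a new example and the majority wins
new_example = X[:1]
votes = [tree.predict(new_example)[0] for tree in trees]
print(Counter(votes).most_common(1)[0][0])
```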

Why Does Bagging Help?

  • Each tree sees slightly different data
  • Trees make different mistakes
  • Mistakes cancel out when voting!

Analogy: If 100 people guess your weight:

  • Some guess too high, some too low
  • Average of all guesses is usually pretty close!

Quick Summary

| Concept | One-Liner |
|---------|-----------|
| Decision Tree | A flowchart of yes/no questions |
| Algorithm | Find the best question, split, repeat |
| Entropy | How messy/mixed your data is |
| Information Gain | How much a question cleans up the mess |
| Gini Impurity | Chance of guessing wrong at random |
| Pruning | Trim the tree to avoid memorizing |
| Random Forest | Many trees voting together |
| Bagging | Random sampling to create variety |

🎯 Key Takeaways

  1. Decision Trees ask simple questions to make predictions
  2. Entropy/Gini measure messiness, which is how the algorithm picks the best questions
  3. Pruning keeps trees from getting too complicated
  4. Random Forests combine many trees for better answers
  5. Bagging creates variety by random sampling

Now you understand how computers can learn to make decisions—just like playing a really smart guessing game! 🎮🌳


Remember: A single tree can be fooled, but a forest is wise!
