Handling Missing Data

🩹 Handling Missing Data in Pandas

The Story of the Forgetful Librarian

Imagine a librarian named Lily who keeps a record of all the books in her library. But Lily has a tiny problem—sometimes she forgets to write things down! Some book entries are missing their page counts, others are missing their authors.

Missing data in Pandas is exactly like Lily’s forgetful notes. And today, we’ll learn how to find, remove, or fill in those blank spots!


🔍 Detecting Missing Values

What Does “Missing” Look Like?

In Pandas, a missing value shows up as NaN (Not a Number) or None. Think of it like an empty box on a form—you know something should be there, but it’s blank.

import pandas as pd
import numpy as np

# Lily's book log with missing data
books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace']
})
print(books)

Output:

           title  pages author
0  Python Basics  200.0    Ada
1   Data Science    NaN   None
2       AI Magic  350.0  Grace

See those NaN and None? Those are Lily’s forgotten entries!

Finding the Blanks with isna() and isnull()

To find where the blanks are, use isna() or isnull() (they do the same thing!).

# True = missing, False = not missing
print(books.isna())

Output:

   title  pages  author
0  False  False   False
1  False   True    True
2  False  False   False

Row 1 has two missing values—pages and author!

Finding Non-Missing with notna()

Want to find what’s not missing? Use notna():

print(books.notna())

This flips the True/False—now True means “we have data here!”
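
Here is a small sketch of how you might count the blanks instead of just looking at them. It reuses Lily's book log from above; the variable names (missing_per_column, rows_with_blanks, total_missing) are made up for this example.

```python
import numpy as np
import pandas as pd

# Lily's book log again, rebuilt so this snippet runs on its own
books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace'],
})

# Blanks per column: each True counts as 1 when summed
missing_per_column = books.isna().sum()
print(missing_per_column)

# Rows that contain at least one blank
rows_with_blanks = books[books.isna().any(axis=1)]
print(rows_with_blanks)

# Total number of blank cells in the whole table
total_missing = int(books.isna().sum().sum())
print(total_missing)
```

Counting first gives you a quick feel for how serious the gaps are before you decide to drop or fill anything.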


🏷️ NA and pd.NA

Meet the New Kid: pd.NA

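Pandas 1.0 introduced pd.NA as a single, consistent way to represent missing data. It is the missing-value marker for the nullable data types (such as Int64, boolean, and string), so numbers, text, and booleans can all share the same kind of blank!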

# Old way
old_missing = np.nan

# New way (cleaner!)
new_missing = pd.NA

Why pd.NA is Better

Imagine asking: “Is this missing value True or False?” With np.nan, you’d get confusing answers: np.nan == np.nan is False, yet bool(np.nan) is True. With pd.NA, you get pd.NA back, meaning “I don’t know, it’s missing!”

# pd.NA handles logic better
result = pd.NA | True   # Returns True
result = pd.NA & False  # Returns False
result = pd.NA | False  # Returns pd.NA (uncertain!)

Think of pd.NA as an honest friend who says “I don’t know” instead of guessing!
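
To see pd.NA at work inside a column, here is a minimal sketch that assumes you opt in to the nullable "Int64" dtype. The page counts are borrowed from Lily's log; the variable names are just for illustration.

```python
import numpy as np
import pandas as pd

# A nullable integer column keeps whole numbers and stores the blank as pd.NA
pages = pd.Series([200, None, 350], dtype="Int64")
print(pages)

# The same data in a default column becomes float64, with NaN for the blank
pages_float = pd.Series([200, np.nan, 350])
print(pages_float.dtype)

# Comparing with pd.NA stays "unknown", while NaN simply compares as False
print(pd.NA == 1)        # <NA>
print(np.nan == np.nan)  # False

# isna() detects the blank either way
print(pages.isna().tolist())  # [False, True, False]
```

Notice that the nullable column does not need to turn 200 into 200.0 just to make room for a blank.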


🗑️ Dropping Missing with dropna()

Sometimes, you just want to remove the rows or columns with blanks. That’s what dropna() does!

Drop Rows with Any Missing Value

# Remove any row that has a blank
clean_books = books.dropna()
print(clean_books)

Output:

           title  pages author
0  Python Basics  200.0    Ada
2       AI Magic  350.0  Grace

Row 1 had blanks, so it’s gone!

Drop Only If All Values Are Missing

# Only drop if ENTIRE row is blank
books.dropna(how='all')

Drop Rows Based on Specific Columns

# Only check 'pages' column for blanks
books.dropna(subset=['pages'])

Drop Columns Instead of Rows

# axis=1 means columns
books.dropna(axis=1)

dropna() at a glance:

  • how='any' (the default): remove if ANY value is blank
  • how='all': remove only if ALL values are blank
  • subset: check specific columns only
  • axis=1: remove columns, not rows
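
The dropna() options are easier to compare side by side. The sketch below uses a made-up four-row log (row 2 is completely blank, and the author 'Lin' is invented for the example). It also shows thresh, a dropna() parameter not covered above that keeps rows with at least that many real values.

```python
import numpy as np
import pandas as pd

# A hypothetical log: row 1 misses only pages, row 2 is entirely blank
log = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', None, 'AI Magic'],
    'pages': [200, np.nan, np.nan, 350],
    'author': ['Ada', 'Lin', None, 'Grace'],
})

any_blank = log.dropna()                 # drops rows 1 and 2
all_blank = log.dropna(how='all')        # drops only row 2
by_pages = log.dropna(subset=['pages'])  # drops rows where pages is blank
at_least_two = log.dropna(thresh=2)      # keeps rows with 2 or more real values
no_blank_cols = log.dropna(axis=1)       # every column has a blank, so none survive

print(list(any_blank.index))     # [0, 3]
print(list(all_blank.index))     # [0, 1, 3]
print(list(at_least_two.index))  # [0, 1, 3]
print(no_blank_cols.shape)       # (4, 0)
```

None of these calls change log itself; each one returns a new DataFrame.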

✏️ Filling Missing with fillna()

Instead of removing blanks, what if we fill them in? Like Lily finally remembering and writing down the missing info!

Fill with a Single Value

# Fill blanks in the 'pages' column with 0
books['pages'].fillna(0)

Output:

0    200.0
1      0.0
2    350.0

Fill with the Mean (Average)

# Fill with average page count
avg_pages = books['pages'].mean()
books['pages'].fillna(avg_pages)

Fill Different Columns with Different Values

books.fillna({
    'pages': 0,
    'author': 'Unknown'
})

Now “Unknown” appears where author was missing!
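
One thing that is easy to miss: fillna() returns a new object and leaves the original alone, so you need to keep the result. Here is a small sketch of that, using the median instead of the mean as the fill value (the median is a swapped-in choice for this example, not something Lily's story requires).

```python
import numpy as np
import pandas as pd

books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace'],
})

# Different fill value per column; the median of 200 and 350 is 275
filled = books.fillna({
    'pages': books['pages'].median(),
    'author': 'Unknown',
})
print(filled)

# The original still has its blank because fillna() returned a copy
print(books['pages'].isna().sum())  # 1
```

If you want to keep working with the filled version, assign it back, for example books = books.fillna(...).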


⬆️⬇️ Directional Fill Methods

What if you want to fill blanks using nearby values? Like copying from the cell above or below!

Forward Fill (ffill) - Copy from Above

temps = pd.Series([22, np.nan, np.nan, 25, np.nan])
temps.ffill()

Output:

0    22.0
1    22.0  ← copied from row 0
2    22.0  ← copied from row 0 (the last real value)
3    25.0
4    25.0  ← copied from row 3

The blank looks “up” and copies!

Backward Fill (bfill) - Copy from Below

temps.bfill()

Output:

0    22.0
1    25.0  ← copied from row 3
2    25.0  ← copied from row 3
3    25.0
4     NaN  ← nothing below to copy!

The blank looks “down” and copies!

Limit How Many to Fill

# Fill at most 1 blank after each real value; the rest of a longer gap stays blank
temps.ffill(limit=1)

Which direction?

  • ffill: look UP and copy (fills blanks forward)
  • bfill: look DOWN and copy (fills blanks backward)
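
Here is a short sketch of what limit=1 actually leaves behind, plus one common way to cover the edges by chaining both directions. The series below is a made-up variation of temps with an extra blank at the start.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: a blank at the start, a two-blank gap, a blank at the end
temps = pd.Series([np.nan, 22, np.nan, np.nan, 25, np.nan])

# Only the first blank after each real value is filled
limited = temps.ffill(limit=1)
print(limited)

# Forward fill first, then backward fill to catch the leading blank
both = temps.ffill().bfill()
print(both.tolist())  # [22.0, 22.0, 22.0, 22.0, 25.0, 25.0]
```

In limited, index 0 stays blank (nothing above it) and index 3 stays blank (the limit ran out), which is often exactly the safety net you want for long gaps.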

📈 Interpolating Missing Values

Interpolation is like being a detective. If you know the values before and after a blank, you can guess what’s in the middle!

Linear Interpolation

Imagine a line connecting two points—the missing value is somewhere on that line.

heights = pd.Series([100, np.nan, np.nan, 160])
heights.interpolate()

Output:

0    100.0
1    120.0  ← guessed! 100 + (160 - 100) / 3
2    140.0  ← guessed!
3    160.0

The gaps are filled with evenly spaced values!
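
If you are curious where 120 and 140 come from, the arithmetic can be sketched by hand and compared with what Pandas returns. The names step_size and by_hand are made up for this example.

```python
import numpy as np
import pandas as pd

heights = pd.Series([100, np.nan, np.nan, 160])

start = float(heights[0])
end = float(heights[3])
steps = 3                           # three equal steps from index 0 to index 3
step_size = (end - start) / steps   # (160 - 100) / 3 = 20.0

# Walk along the straight line one step at a time
by_hand = [start + step_size * i for i in range(steps + 1)]
print(by_hand)  # [100.0, 120.0, 140.0, 160.0]

# Let Pandas do the same thing
by_pandas = heights.interpolate().tolist()
print(by_pandas)
```

Both lists should match: linear interpolation just splits the distance between the known values into equal steps.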

Different Interpolation Methods

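# df stands for any DataFrame or Series with blanks (it is not defined above)

# Time-based interpolation (needs a DatetimeIndex)
df.interpolate(method='time')

# Polynomial interpolation (curved line; needs SciPy installed)
df.interpolate(method='polynomial', order=2)

# Index-based (uses the actual numeric index values as positions)
df.interpolate(method='index')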

When to Use Interpolation

Situation          Best Method
Steady growth      linear
Time series data   time
Curved patterns    polynomial
Index matters      index
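
To see why the method matters, here is a minimal sketch comparing linear and time on unevenly spaced dates. The dates and readings are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical readings: one day passes before the blank, three days after it
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
readings = pd.Series([10.0, np.nan, 50.0], index=dates)

# linear ignores the dates and treats the rows as equally spaced
linear = readings.interpolate()

# time weights the guess by how much time has actually passed
timed = readings.interpolate(method='time')

print(linear.iloc[1])  # halfway between 10 and 50
print(timed.iloc[1])   # one quarter of the way from 10 to 50
```

January 2 is only one day into a four-day gap, so the time method puts the guess a quarter of the way up (20) while linear lands in the middle (30).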

🎯 Quick Decision Guide

Missing data found! What to do?

  • Remove it: dropna() (rows or columns?)
  • Fill with a value: fillna() (single value or dict?)
  • Copy neighbors: ffill() / bfill() (forward or backward?)
  • Smart guess: interpolate() (linear or polynomial?)

💡 Pro Tips

  1. Check first! Always use isna().sum() to count blanks before deciding what to do.

  2. Don’t blindly fill! Filling with 0 might mess up calculations. Think about what makes sense for your data.

  3. Interpolation = Smart fill. For ordered data that changes gradually (like temperatures or stock prices), interpolation often gives more realistic values than filling with a single constant.

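  4. Get to know pd.NA. With nullable dtypes such as Int64 or string, pd.NA keeps integers as integers and gives consistent missing-value logic. Ordinary float columns still use np.nan by default.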
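
Tip 2 is worth a quick demonstration. This sketch uses the page counts from Lily's log to show how filling with 0 drags the average down; the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

pages = pd.Series([200, np.nan, 350])

# mean() skips blanks by default, so only 200 and 350 are averaged
mean_skipping_blanks = pages.mean()
print(mean_skipping_blanks)  # 275.0

# After filling with 0, the fake zero is averaged in: (200 + 0 + 350) / 3
mean_after_zero_fill = pages.fillna(0).mean()
print(round(mean_after_zero_fill, 2))  # 183.33
```

A book with 0 pages is not the same as a book whose page count is unknown, and the average shows the difference.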


🏆 You Did It!

You’ve learned how to:

  • Detect missing values with isna() and notna()
  • Understand pd.NA vs np.nan
  • Drop missing data with dropna()
  • Fill blanks with fillna()
  • Use directional fills (ffill, bfill)
  • Interpolate to make smart guesses

Lily the librarian is now organized, and so is your data! 📚✨
