🩹 Handling Missing Data in Pandas
The Story of the Forgetful Librarian
Imagine a librarian named Lily who keeps a record of all the books in her library. But Lily has a tiny problem—sometimes she forgets to write things down! Some book entries are missing their page counts, others are missing their authors.
Missing data in Pandas is exactly like Lily’s forgetful notes. And today, we’ll learn how to find, remove, or fill in those blank spots!
🔍 Detecting Missing Values
What Does “Missing” Look Like?
In Pandas, a missing value shows up as NaN (Not a Number) or None. Think of it like an empty box on a form—you know something should be there, but it’s blank.
import pandas as pd
import numpy as np
# Lily's book log with missing data
books = pd.DataFrame({
'title': ['Python Basics', 'Data Science', 'AI Magic'],
'pages': [200, np.nan, 350],
'author': ['Ada', None, 'Grace']
})
print(books)
Output:
title pages author
0 Python Basics 200.0 Ada
1 Data Science NaN None
2 AI Magic 350.0 Grace
See those NaN and None? Those are Lily’s forgotten entries!
Finding the Blanks with isna() and isnull()
To find where the blanks are, use isna() or isnull() (they do the same thing!).
# True = missing, False = not missing
print(books.isna())
Output:
title pages author
0 False False False
1 False True True
2 False False False
Row 1 has two missing values—pages and author!
Finding Non-Missing with notna()
Want to find what’s not missing? Use notna():
print(books.notna())
This flips the True/False—now True means “we have data here!”
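Here is a small sketch of how Lily might count her blanks before fixing anything. It rebuilds the same books table so it runs on its own; the variable names (missing_per_column, rows_with_blanks, total_missing) are just illustrative choices.

```python
import numpy as np
import pandas as pd

# Lily's book log again, so this snippet runs on its own
books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace'],
})

# True counts as 1, so summing the mask counts blanks per column
missing_per_column = books.isna().sum()
print(missing_per_column)

# Keep only the rows that have at least one blank
rows_with_blanks = books[books.isna().any(axis=1)]
print(rows_with_blanks)

# One number for the whole table
total_missing = int(books.isna().sum().sum())
print(total_missing)  # 2
```

Counting first tells you whether you are dealing with one forgotten entry or half the log, which usually decides whether dropping or filling makes more sense.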
🏷️ NA and pd.NA
Meet the New Kid: pd.NA
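Pandas introduced pd.NA (in version 1.0, and still marked experimental) as a better way to represent missing data. It gives the nullable data types, such as Int64 numbers, string text, and boolean values, one shared missing-value marker!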
# Old way
old_missing = np.nan
# New way (cleaner!)
new_missing = pd.NA
Why pd.NA is Better
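Imagine asking: “Is this missing value True or False?” With np.nan, you’d get confusing answers: bool(np.nan) is True, yet np.nan == np.nan is False. With pd.NA, a logical operation returns pd.NA whenever the answer really depends on the missing value, meaning “I don’t know, it’s missing!”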
# pd.NA handles logic better
result = pd.NA | True # Returns True
result = pd.NA & False # Returns False
result = pd.NA | False # Returns pd.NA (uncertain!)
Think of pd.NA as an honest friend who says “I don’t know” instead of guessing!
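To see pd.NA at work inside a column, here is a minimal sketch using the nullable "Int64" dtype (capital I). The series names plain and nullable are made up for illustration.

```python
import numpy as np
import pandas as pd

# Default behaviour: a blank turns the whole column into floats
plain = pd.Series([200, None, 350])
print(plain.dtype)     # float64

# Nullable integer dtype: the numbers stay integers, the blank is pd.NA
nullable = pd.Series([200, None, 350], dtype="Int64")
print(nullable.dtype)  # Int64
print(nullable.iloc[1] is pd.NA)  # True

# Comparisons show the difference in "honesty"
nan_equal = (np.nan == np.nan)  # False, which surprises most people
na_equal = (pd.NA == pd.NA)     # pd.NA, meaning "can't tell"
print(nan_equal, na_equal)

# isna() treats both kinds of blank the same way
print(plain.isna().tolist(), nullable.isna().tolist())
```

Either way, isna() still finds the blank, so the detection tools from earlier keep working.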
🗑️ Dropping Missing with dropna()
Sometimes, you just want to remove the rows or columns with blanks. That’s what dropna() does!
Drop Rows with Any Missing Value
# Remove any row that has a blank
clean_books = books.dropna()
print(clean_books)
Output:
title pages author
0 Python Basics 200.0 Ada
2 AI Magic 350.0 Grace
Row 1 had blanks, so it’s gone!
Drop Only If All Values Are Missing
# Only drop if ENTIRE row is blank
books.dropna(how='all')
Drop Rows Based on Specific Columns
# Only check 'pages' column for blanks
books.dropna(subset=['pages'])
Drop Columns Instead of Rows
# axis=1 means columns
books.dropna(axis=1)
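The four variations above can be compared side by side on Lily's table. This is only a sketch; the result names (any_rows, all_rows, by_pages, complete_cols) are illustrative.

```python
import numpy as np
import pandas as pd

books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace'],
})

# Default (how='any'): row 1 has blanks, so it is removed
any_rows = books.dropna()
print(list(any_rows.index))         # [0, 2]

# how='all': row 1 still has a title, so nothing is removed
all_rows = books.dropna(how='all')
print(len(all_rows))                # 3

# subset: only blanks in 'pages' count
by_pages = books.dropna(subset=['pages'])
print(list(by_pages.index))         # [0, 2]

# axis=1: drop every column that contains a blank
complete_cols = books.dropna(axis=1)
print(list(complete_cols.columns))  # ['title']

# dropna() returns a new DataFrame; the original is untouched
print(books.shape)                  # (3, 3)
```

Notice that how='all' keeps every row here, because no row in Lily's log is completely blank.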
graph TD
    A[DataFrame with NaN] --> B{dropna}
    B -->|how='any'| C[Remove if ANY blank]
    B -->|how='all'| D[Remove if ALL blank]
    B -->|subset| E[Check specific columns]
    B -->|axis=1| F[Remove columns, not rows]
✏️ Filling Missing with fillna()
Instead of removing blanks, what if we fill them in? Like Lily finally remembering and writing down the missing info!
Fill with a Single Value
# Fill blanks in 'pages' with 0 (returns a new Series; books itself is unchanged)
books['pages'].fillna(0)
Output:
0 200.0
1 0.0
2 350.0
Fill with the Mean (Average)
# Fill with average page count
avg_pages = books['pages'].mean()
books['pages'].fillna(avg_pages)
Fill Different Columns with Different Values
books.fillna({
'pages': 0,
'author': 'Unknown'
})
Now “Unknown” appears where author was missing!
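One detail worth seeing in action: fillna() hands back a new object, so you need to keep the result. Here is a minimal sketch with Lily's table; the name filled is just an illustrative choice.

```python
import numpy as np
import pandas as pd

books = pd.DataFrame({
    'title': ['Python Basics', 'Data Science', 'AI Magic'],
    'pages': [200, np.nan, 350],
    'author': ['Ada', None, 'Grace'],
})

# A different fill value per column, stored in a new DataFrame
filled = books.fillna({'pages': 0, 'author': 'Unknown'})
print(filled)

# No blanks remain in the filled copy...
print(int(filled.isna().sum().sum()))  # 0

# ...but the original still has its two blanks
print(int(books.isna().sum().sum()))   # 2
```

If you want the change to stick, assign the result back, for example books = books.fillna(...).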
⬆️⬇️ Directional Fill Methods
What if you want to fill blanks using nearby values? Like copying from the cell above or below!
Forward Fill (ffill) - Copy from Above
temps = pd.Series([22, np.nan, np.nan, 25, np.nan])
temps.ffill()
Output:
0 22.0
1 22.0 ← copied from row 0
2 22.0 ← row 0's value again, carried down past row 1
3 25.0
4 25.0 ← copied from row 3
The blank looks “up” and copies!
Backward Fill (bfill) - Copy from Below
temps.bfill()
Output:
0 22.0
1 25.0 ← copied from row 3
2 25.0 ← copied from row 3
3 25.0
4 NaN ← nothing below to copy!
The blank looks “down” and copies!
Limit How Many to Fill
# Fill at most 1 consecutive blank in each gap
temps.ffill(limit=1)
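Here is what limit=1 does to the temperature readings, as a small sketch. The names forward_limited and backward_limited are illustrative.

```python
import numpy as np
import pandas as pd

temps = pd.Series([22, np.nan, np.nan, 25, np.nan])

# Forward fill, but only one step into each gap
forward_limited = temps.ffill(limit=1)
print(forward_limited.tolist())   # [22.0, 22.0, nan, 25.0, 25.0]

# Backward fill with the same limit
backward_limited = temps.bfill(limit=1)
print(backward_limited.tolist())  # [22.0, nan, 25.0, 25.0, nan]
```

Row 2 stays blank in the forward fill because it is the second blank in its gap. A limit like this keeps one old reading from being stretched across a long run of missing values.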
graph TD
    A[Blank Cell] --> B{Which direction?}
    B -->|ffill| C[Look UP and copy]
    B -->|bfill| D[Look DOWN and copy]
    C --> E[Fill blanks forward]
    D --> F[Fill blanks backward]
📈 Interpolating Missing Values
Interpolation is like being a detective. If you know the values before and after a blank, you can guess what’s in the middle!
Linear Interpolation
Imagine a line connecting two points—the missing value is somewhere on that line.
heights = pd.Series([100, np.nan, np.nan, 160])
heights.interpolate()
Output:
0 100.0
1 120.0 ← guessed! 100 + (160 - 100) / 3, one of 3 equal steps
2 140.0 ← guessed!
3 160.0
The gaps are filled with evenly spaced values!
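The arithmetic behind those evenly spaced values can be checked by hand. This sketch computes the step size manually and compares it with what interpolate() returns; step and manual are illustrative names.

```python
import numpy as np
import pandas as pd

heights = pd.Series([100, np.nan, np.nan, 160])

# From row 0 to row 3 there are 3 equal steps
step = (160 - 100) / 3  # 20.0
manual = [100 + step * i for i in range(4)]
print(manual)           # [100.0, 120.0, 140.0, 160.0]

# Default interpolation is linear and treats rows as evenly spaced
filled = heights.interpolate()
print(filled.tolist())
```

Each blank moves one step of 20 along the straight line between 100 and 160.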
Different Interpolation Methods
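# df is assumed to be an existing DataFrame (or Series) of numbers
# Time-based interpolation (the index must be a DatetimeIndex)
df.interpolate(method='time')
# Polynomial interpolation (curved line; requires SciPy to be installed)
df.interpolate(method='polynomial', order=2)
# Index-based (uses the actual index values as positions)
df.interpolate(method='index')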
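The difference between these methods shows up when the index is unevenly spaced. Here is a minimal sketch with made-up readings and dates (all values are hypothetical): the blank sits one day after the first reading and three days before the last.

```python
import numpy as np
import pandas as pd

# Hypothetical readings on uneven dates: Jan 1, Jan 2, Jan 5
dates = pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-05'])
readings = pd.Series([10.0, np.nan, 50.0], index=dates)

# 'linear' ignores the dates and lands halfway between the neighbors
by_position = readings.interpolate(method='linear')
print(by_position.iloc[1])  # 30.0

# 'time' sees that Jan 2 is one day into a four-day gap
by_time = readings.interpolate(method='time')
print(by_time.iloc[1])      # about 20.0

# 'index' does the same kind of weighting with a numeric index
numbered = pd.Series([10.0, np.nan, 50.0], index=[0, 1, 4])
by_index = numbered.interpolate(method='index')
print(by_index.iloc[1])     # about 20.0
```

So when the spacing between rows is uneven and meaningful, time or index is usually the safer pick over plain linear.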
When to Use Interpolation
| Situation | Best Method |
|---|---|
| Steady growth | linear |
| Time series data | time |
| Curved patterns | polynomial |
| Index matters | index |
🎯 Quick Decision Guide
graph TD
    A[Missing Data Found!] --> B{What to do?}
    B -->|Remove it| C[dropna]
    B -->|Fill with value| D[fillna]
    B -->|Copy neighbors| E[ffill/bfill]
    B -->|Smart guess| F[interpolate]
    C --> G[Rows or Columns?]
    D --> H[Single value or dict?]
    E --> I[Forward or Backward?]
    F --> J[Linear or Polynomial?]
💡 Pro Tips
- Check first! Always use isna().sum() to count blanks before deciding what to do.
- Don’t blindly fill! Filling with 0 can skew averages and totals. Think about what makes sense for your data.
- Interpolation = Smart fill. For ordered data (like temperatures or stock prices over time), interpolation often gives more realistic results than filling with one constant.
- pd.NA is the future. When creating DataFrames from scratch, prefer pd.NA over np.nan, and pair it with a nullable dtype such as Int64 so the column doesn’t fall back to object.
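To make the “don’t blindly fill” tip concrete, here is a quick sketch using the page counts from Lily's log. It compares the average before and after two fill choices; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

pages = pd.Series([200, np.nan, 350])

# mean() skips blanks by default: (200 + 350) / 2
original_mean = pages.mean()
print(original_mean)         # 275.0

# Filling with 0 adds a fake "0-page book" and drags the average down
zero_mean = pages.fillna(0).mean()
print(round(zero_mean, 2))   # 183.33

# Filling with the mean leaves the average where it was
mean_filled_mean = pages.fillna(original_mean).mean()
print(mean_filled_mean)      # 275.0
```

Neither choice is automatically right. The point is that the fill value becomes part of every calculation that follows.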
🏆 You Did It!
You’ve learned how to:
- ✅ Detect missing values with isna() and notna()
- ✅ Understand pd.NA vs np.nan
- ✅ Drop missing data with dropna()
- ✅ Fill blanks with fillna()
- ✅ Use directional fills (ffill, bfill)
- ✅ Interpolate to make smart guesses
Lily the librarian is now organized, and so is your data! 📚✨