๐งน Data Ordering & Cleaning in Pandas
The Messy Toy Box Story
Imagine you have a giant toy box with hundreds of toys thrown in randomly. Some toys are broken, some are duplicates (you have three identical red cars!), and everything is jumbled up.
Pandas is like having a magical cleaning robot that helps you:
- Sort toys by size, color, or name
- Find and remove duplicate toys
- Replace broken parts with new ones
Letโs learn how to tidy up our data toy box!
๐ฏ Sorting by Values
Think of it like: Arranging books on a shelf by their titles.
When you sort by values, you pick a column and arrange all rows based on that columnโs data.
import pandas as pd
# Our messy data
df = pd.DataFrame({
'name': ['Zara', 'Anna', 'Mike'],
'age': [25, 30, 22]
})
# Sort by age (smallest to biggest)
sorted_df = df.sort_values('age')
Result:
| name | age |
|---|---|
| Mike | 22 |
| Anna | 30 |
| Zara | 25 |
Waitโฆ thatโs wrong! Let me fix:
| name | age |
|---|---|
| Mike | 22 |
| Zara | 25 |
| Anna | 30 |
โจ Now ages go: 22 โ 25 โ 30
๐ข Sorting by Index
Think of it like: Arranging your music playlist by track number.
Every DataFrame has an invisible โrow numberโ called the index. Sorting by index puts rows back in their original order.
# After some shuffling, reset order
df_sorted = df.sort_index()
When to use:
- After filtering, rows might be out of order
- You want to restore original sequence
- Working with time-series data
graph TD A[Shuffled Data<br/>Index: 3, 1, 2] --> B[sort_index] B --> C[Ordered Data<br/>Index: 1, 2, 3]
๐ Sorting by Multiple Columns
Think of it like: Organizing a library first by genre, then by author name.
Sometimes one column isnโt enough. You want to sort by Country first, then by City within each country.
df = pd.DataFrame({
'country': ['USA', 'USA', 'UK', 'UK'],
'city': ['Boston', 'Austin', 'London', 'Bath'],
'pop': [700, 950, 9000, 90]
})
# Sort by country, then city
df.sort_values(['country', 'city'])
Result:
| country | city | pop |
|---|---|---|
| UK | Bath | 90 |
| UK | London | 9000 |
| USA | Austin | 950 |
| USA | Boston | 700 |
๐ First sorted by country (UK before USA), then cities within each country are alphabetical!
โฌ๏ธโฌ๏ธ Ascending and Descending Sort
Think of it like: Counting 1-2-3 (ascending) or 3-2-1 (descending).
- Ascending = smallest to biggest (AโZ, 1โ100)
- Descending = biggest to smallest (ZโA, 100โ1)
# Ascending (default) - low to high
df.sort_values('price', ascending=True)
# Descending - high to low
df.sort_values('price', ascending=False)
Pro tip for multiple columns:
# Country AโZ, but price highโlow
df.sort_values(
['country', 'price'],
ascending=[True, False]
)
graph TD A[Unsorted: 50, 10, 30] --> B{Direction?} B -->|ascending=True| C[10, 30, 50] B -->|ascending=False| D[50, 30, 10]
๐ Ranking Data
Think of it like: Giving medals in a race โ 1st place, 2nd place, 3rd place.
Ranking assigns a position number based on values. Unlike sorting, it keeps rows in place but adds a rank column.
df = pd.DataFrame({
'student': ['Ali', 'Bob', 'Cat'],
'score': [85, 92, 85]
})
# Add rank (highest score = rank 1)
df['rank'] = df['score'].rank(
ascending=False
)
Result:
| student | score | rank |
|---|---|---|
| Ali | 85 | 2.5 |
| Bob | 92 | 1.0 |
| Cat | 85 | 2.5 |
Wait, why 2.5?
Ali and Cat tied! By default, Pandas gives them the average of ranks 2 and 3 = 2.5
Other ranking methods:
method='min'โ Both get rank 2method='max'โ Both get rank 3method='first'โ First one gets 2, second gets 3
๐ Finding Duplicates
Think of it like: Spotting twin toys in your toy box.
Before removing duplicates, you need to find them first!
df = pd.DataFrame({
'fruit': ['apple', 'banana', 'apple'],
'price': [1.0, 0.5, 1.0]
})
# Find ALL duplicate rows
df.duplicated()
Result:
0 False โ First apple (original)
1 False โ Banana (unique)
2 True โ Second apple (duplicate!)
Check specific columns:
# Only check 'fruit' column
df.duplicated(subset=['fruit'])
Count total duplicates:
df.duplicated().sum() # Returns: 1
๐๏ธ Removing Duplicates
Think of it like: Keeping only one of each twin toy.
Once found, duplicates are easy to remove!
# Remove duplicate rows
clean_df = df.drop_duplicates()
Control which duplicate to keep:
# Keep first occurrence (default)
df.drop_duplicates(keep='first')
# Keep last occurrence
df.drop_duplicates(keep='last')
# Remove ALL duplicates
df.drop_duplicates(keep=False)
Remove based on specific columns:
# Keep first row for each unique fruit
df.drop_duplicates(subset=['fruit'])
graph TD A[3 Rows<br/>apple, banana, apple] --> B[drop_duplicates] B --> C[2 Rows<br/>apple, banana]
๐ Replacing Values
Think of it like: Swapping broken toy parts with new ones.
Sometimes data has wrong or messy values. Replace fixes them!
Simple replacement:
df = pd.DataFrame({
'status': ['active', 'inactive', 'actve']
})
# Fix typo: 'actve' โ 'active'
df['status'].replace('actve', 'active')
Multiple replacements:
# Replace several values at once
df.replace({
'actve': 'active',
'inactve': 'inactive'
})
Replace with dictionary mapping:
# Convert codes to labels
df['grade'].replace({
'A': 'Excellent',
'B': 'Good',
'C': 'Average'
})
Replace in specific columns:
df.replace({
'status': {'old': 'new'},
'grade': {'F': 'Fail'}
})
๐ Quick Reference Box
| Task | Code |
|---|---|
| Sort by column | df.sort_values('col') |
| Sort by index | df.sort_index() |
| Sort multiple | df.sort_values(['a','b']) |
| Descending | ascending=False |
| Rank values | df['col'].rank() |
| Find duplicates | df.duplicated() |
| Remove duplicates | df.drop_duplicates() |
| Replace values | df.replace('old','new') |
๐ You Did It!
You now know how to:
- โ Sort data any way you want
- โ Rank items like a competition judge
- โ Hunt down sneaky duplicates
- โ Clean up messy values
Your data toy box is now perfectly organized! ๐
Next time you see messy data, remember: Pandas is your magical cleaning robot. Just tell it what to sort, find, remove, or replace โ and watch the magic happen!