🐼 Pandas Operations: Your Data Kitchen Adventure!
Imagine you’re a chef in a magical kitchen. Your ingredients? Data! And Pandas is your super-smart kitchen assistant that helps you chop, mix, sort, and transform ingredients into delicious insights.
Today, we’ll learn five powerful kitchen tricks:
- 🔤 String Operations (chopping and slicing text)
- 📅 DateTime Operations (working with time)
- 👥 Groupby Operations (sorting into baskets)
- 📊 Aggregation (counting and measuring)
- ↔️ Wide vs Long Data (reshaping your table)
🔤 String Operations in Pandas
What Are String Operations?
Think of text as a piece of string (like yarn!). String operations let you cut, tie, stretch, and change that string however you want.
Simple Example:
- You have a list of names: “JOHN”, “mary”, “Bob”
- You want them all to look the same: “John”, “Mary”, “Bob”
- Pandas helps you fix them all at once!
The Magic .str Accessor
In Pandas, we use .str to work with text. It’s like saying “Hey Pandas, treat this as text!”
import pandas as pd
# Our messy data
names = pd.Series(['JOHN', 'mary', 'Bob'])
# Make all lowercase
names.str.lower()
# Result: ['john', 'mary', 'bob']
# Make all uppercase
names.str.upper()
# Result: ['JOHN', 'MARY', 'BOB']
# Proper case (first letter big)
names.str.title()
# Result: ['John', 'Mary', 'Bob']
Common String Tricks
| Trick | What It Does | Example |
|---|---|---|
.str.lower() |
All small letters | “HELLO” → “hello” |
.str.upper() |
All BIG letters | “hello” → “HELLO” |
.str.strip() |
Remove extra spaces | " hi " → “hi” |
.str.replace() |
Swap one thing for another | “cat” → “dog” |
.str.contains() |
Check if text has something | “hello” has “ell”? ✅ |
.str.len() |
Count characters | “hello” = 5 |
.str.split() |
Break text apart | “a-b-c” → [‘a’,‘b’,‘c’] |
Real Kitchen Example
# Messy ingredient list
ingredients = pd.Series([
' TOMATOES ',
'onions',
'GARLIC cloves'
])
# Clean them up!
clean = ingredients.str.strip()
clean = clean.str.lower()
clean = clean.str.title()
# Result: ['Tomatoes', 'Onions',
# 'Garlic Cloves']
Finding Things in Text
emails = pd.Series([
'john@gmail.com',
'mary@yahoo.com',
'bob@gmail.com'
])
# Find all Gmail users
gmail_users = emails.str.contains('gmail')
# Result: [True, False, True]
# Extract the domain
domains = emails.str.split('@').str[1]
# Result: ['gmail.com', 'yahoo.com',
# 'gmail.com']
📅 DateTime Operations
What Are DateTime Operations?
Time is tricky! Is “01/02/2024” January 2nd or February 1st? Pandas helps you work with dates and times without confusion.
Simple Example:
- You have birthdays as text: “1990-05-15”
- Pandas turns them into real dates you can calculate with!
- “How old is this person?” becomes easy!
Converting to DateTime
import pandas as pd
# Text dates
dates_text = pd.Series([
'2024-01-15',
'2024-06-20',
'2024-12-25'
])
# Convert to real dates
dates = pd.to_datetime(dates_text)
Extracting Parts of a Date
Like taking apart a clock to see the gears!
# Create a date column
df = pd.DataFrame({
'event': ['Birthday', 'Holiday', 'Meeting'],
'date': pd.to_datetime([
'2024-03-15',
'2024-12-25',
'2024-07-04'
])
})
# Extract parts using .dt accessor
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['weekday'] = df['date'].dt.day_name()
Result:
| event | year | month | day | weekday |
|---|---|---|---|---|
| Birthday | 2024 | 3 | 15 | Friday |
| Holiday | 2024 | 12 | 25 | Wednesday |
| Meeting | 2024 | 7 | 4 | Thursday |
Date Math
from datetime import timedelta
# Today's date
today = pd.Timestamp('2024-06-15')
# Add 7 days
next_week = today + timedelta(days=7)
# Result: 2024-06-22
# Difference between dates
date1 = pd.Timestamp('2024-01-01')
date2 = pd.Timestamp('2024-12-31')
difference = date2 - date1
# Result: 365 days
Common DateTime Properties
| Property | What It Gives You |
|---|---|
.dt.year |
The year (2024) |
.dt.month |
Month number (1-12) |
.dt.day |
Day of month (1-31) |
.dt.hour |
Hour (0-23) |
.dt.minute |
Minute (0-59) |
.dt.day_name() |
“Monday”, “Tuesday”… |
.dt.month_name() |
“January”, “February”… |
.dt.quarter |
Quarter (1-4) |
.dt.dayofweek |
0=Monday, 6=Sunday |
👥 Groupby Operations
What Is Groupby?
Imagine you have a big box of colored marbles. Groupby helps you sort them into separate piles by color, then do something with each pile!
Simple Example:
- You have sales data for different stores
- Group by store name
- Now you can see total sales per store!
graph TD A["🎨 All Marbles Mixed"] --> B["Groupby Color"] B --> C["🔴 Red Pile"] B --> D["🔵 Blue Pile"] B --> E["🟢 Green Pile"] C --> F["Count: 5"] D --> G["Count: 3"] E --> H["Count: 7"]
Basic Groupby
import pandas as pd
# Sales data
df = pd.DataFrame({
'store': ['A', 'B', 'A', 'B', 'A'],
'product': ['Apple', 'Apple', 'Banana',
'Banana', 'Apple'],
'sales': [100, 150, 80, 120, 90]
})
# Group by store
grouped = df.groupby('store')
# Total sales per store
grouped['sales'].sum()
# Store A: 270
# Store B: 270
Multiple Grouping
# Group by TWO things: store AND product
grouped = df.groupby(['store', 'product'])
# Sales for each store-product combo
grouped['sales'].sum()
# Store A, Apple: 190
# Store A, Banana: 80
# Store B, Apple: 150
# Store B, Banana: 120
Common Groupby Operations
| Operation | What It Does |
|---|---|
.sum() |
Add all values together |
.mean() |
Find the average |
.count() |
Count how many items |
.min() |
Find smallest value |
.max() |
Find largest value |
.first() |
Get first item |
.last() |
Get last item |
📊 Pandas Aggregation
What Is Aggregation?
Aggregation means combining many values into one summary number. Like counting all your toys instead of listing each one!
Simple Example:
- You have 50 test scores
- Aggregation gives you: Average = 85, Highest = 100, Lowest = 60
- One line of summary instead of 50 numbers!
The .agg() Function
import pandas as pd
df = pd.DataFrame({
'category': ['A', 'A', 'B', 'B'],
'value': [10, 20, 30, 40],
'count': [1, 2, 3, 4]
})
# Multiple aggregations at once!
result = df.groupby('category').agg({
'value': ['sum', 'mean'],
'count': ['min', 'max']
})
Result:
| category | value_sum | value_mean | count_min | count_max |
|---|---|---|---|---|
| A | 30 | 15.0 | 1 | 2 |
| B | 70 | 35.0 | 3 | 4 |
Named Aggregations (Cleaner Way!)
result = df.groupby('category').agg(
total_value=('value', 'sum'),
avg_value=('value', 'mean'),
min_count=('count', 'min'),
max_count=('count', 'max')
)
Custom Aggregation Functions
# Your own function!
def range_func(x):
return x.max() - x.min()
df.groupby('category')['value'].agg(range_func)
# A: 10 (20-10)
# B: 10 (40-30)
Aggregation Without Groupby
# Summarize entire DataFrame
df['value'].agg(['sum', 'mean', 'min', 'max'])
# Result:
# sum 100
# mean 25
# min 10
# max 40
↔️ Wide vs Long Data
What’s the Difference?
Think of a spreadsheet. Wide data spreads information across columns. Long data stacks everything in rows.
Wide Data (like a calendar):
| Name | Jan | Feb | Mar |
|---|---|---|---|
| Alice | 100 | 110 | 120 |
| Bob | 200 | 210 | 220 |
Long Data (like a receipt):
| Name | Month | Value |
|---|---|---|
| Alice | Jan | 100 |
| Alice | Feb | 110 |
| Alice | Mar | 120 |
| Bob | Jan | 200 |
| Bob | Feb | 210 |
| Bob | Mar | 220 |
When to Use Which?
graph TD A["Your Data"] --> B{What do you need?} B -->|Easy to read| C["Wide Format"] B -->|Easy to analyze| D["Long Format"] C --> E["Reports, Dashboards"] D --> F["Groupby, Plotting, ML"]
Going from Wide to Long: melt()
import pandas as pd
# Wide data
wide_df = pd.DataFrame({
'Name': ['Alice', 'Bob'],
'Jan': [100, 200],
'Feb': [110, 210],
'Mar': [120, 220]
})
# Melt to long format
long_df = pd.melt(
wide_df,
id_vars=['Name'], # Keep this column
var_name='Month', # New column for headers
value_name='Sales' # New column for values
)
Result:
| Name | Month | Sales |
|---|---|---|
| Alice | Jan | 100 |
| Alice | Feb | 110 |
| Alice | Mar | 120 |
| Bob | Jan | 200 |
| Bob | Feb | 210 |
| Bob | Mar | 220 |
Going from Long to Wide: pivot()
# Long data back to wide
wide_again = long_df.pivot(
index='Name', # Row labels
columns='Month', # Column headers
values='Sales' # Values to fill in
)
pivot_table() for Aggregation
When you have duplicate entries, use pivot_table():
# Data with duplicates
sales = pd.DataFrame({
'Store': ['A', 'A', 'B', 'B'],
'Product': ['X', 'X', 'X', 'Y'],
'Sales': [10, 20, 30, 40]
})
# Pivot with aggregation
result = pd.pivot_table(
sales,
index='Store',
columns='Product',
values='Sales',
aggfunc='sum' # Combine duplicates!
)
Result:
| Store | X | Y |
|---|---|---|
| A | 30 | NaN |
| B | 30 | 40 |
🎯 Quick Reference Summary
| Operation | Purpose | Key Function |
|---|---|---|
| String | Work with text | .str.method() |
| DateTime | Work with dates | .dt.property |
| Groupby | Split into groups | .groupby() |
| Aggregation | Summarize data | .agg() |
| Reshape | Change layout | melt(), pivot() |
🌟 You Did It!
You’ve just learned five powerful Pandas operations! Like a chef who now knows how to:
- ✂️ Chop text perfectly (String Operations)
- ⏰ Manage cooking times (DateTime)
- 🗂️ Sort ingredients by type (Groupby)
- 📏 Measure and count (Aggregation)
- 🔄 Rearrange your kitchen (Wide vs Long)
Now go cook up some amazing data insights! 🐼✨
