🐼 Pandas Fundamentals: Your Data Kitchen Adventure
Imagine you're a chef in a magical kitchen. Instead of cooking food, you're cooking DATA! Pandas is your super-powered cooking assistant that helps you organize, clean, and transform ingredients (data) into delicious meals (insights).
🎯 What You'll Learn
Think of this journey like learning to cook in a restaurant kitchen:
- Pandas Series: Single ingredient containers
- Pandas DataFrame: Your recipe organizer
- Index Concepts: Labels on your containers
- Reading & Writing Data: Getting ingredients in and out
- Data Selection & Filtering: Picking the right ingredients
- Handling Missing Data: Dealing with empty containers
- Data Type Conversion: Transforming ingredients
📦 Pandas Series: Your Single-Column Container
What is a Series?
Think of a Series like a single column of labeled boxes on a shelf. Each box has:
- A label (index) on the outside
- One thing inside (the value)
Real Life Example:
- Your piggy bank slots labeled by month
- Each slot has coins inside
import pandas as pd
# Create a Series - like labeling jars
fruits = pd.Series(
    [5, 3, 8, 2],
    index=['apples', 'bananas', 'oranges', 'grapes']
)
print(fruits)
Output:
apples     5
bananas    3
oranges    8
grapes     2
dtype: int64
Quick Operations on Series
# Get total fruits
total = fruits.sum() # 18
# Find average
average = fruits.mean() # 4.5
# Get specific fruit
apple_count = fruits['apples'] # 5
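Because a Series knows its labels, you can work on every box at once instead of looping over them. Here is a minimal sketch that reuses the fruits Series from above; the names doubled, plenty, and top_fruit are made up for illustration.

```python
import pandas as pd

# Same fruit counts as in the example above
fruits = pd.Series(
    [5, 3, 8, 2],
    index=['apples', 'bananas', 'oranges', 'grapes']
)

# Arithmetic applies to every value at once
doubled = fruits * 2

# A comparison gives True/False per label; use it to keep the matching boxes
plenty = fruits[fruits > 3]

# Label of the largest value
top_fruit = fruits.idxmax()

print(doubled)
print(plenty)
print(top_fruit)  # oranges
```

The labels travel with the values, so the filtered result still tells you which fruit each number belongs to.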
Pandas DataFrame: Your Super Spreadsheet
What is a DataFrame?
A DataFrame is like a table with rows and columns. Think of it as:
- A notebook with many columns
- Each column is a Series
- Like a mini Excel inside Python!
graph TD
    A["DataFrame"] --> B["Column 1<br>Series"]
    A --> C["Column 2<br>Series"]
    A --> D["Column 3<br>Series"]
    B --> E["Row 0"]
    B --> F["Row 1"]
    B --> G["Row 2"]
Creating a DataFrame
# Like making a class roster
students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [10, 11, 10],
    'grade': ['A', 'B', 'A']
})
print(students)
Output:
      name  age grade
0    Alice   10     A
1      Bob   11     B
2  Charlie   10     A
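To see the "each column is a Series" idea in action, you can pull one column out of the roster and inspect it. This is a small sketch built on the same students data; the variable name ages is just for illustration.

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [10, 11, 10],
    'grade': ['A', 'B', 'A']
})

# Pulling out one column gives back a Series
ages = students['age']

print(type(ages).__name__)     # Series
print(students.shape)          # (3, 3) means 3 rows, 3 columns
print(list(students.columns))  # ['name', 'age', 'grade']
print(ages.max())              # 11
```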
DataFrame from a List
data = [
    ['Pizza', 10],
    ['Burger', 8],
    ['Salad', 5]
]
menu = pd.DataFrame(
    data,
    columns=['food', 'price']
)
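A list of dictionaries is another shape your ingredients might arrive in, for example when each row comes from a separate record. The sketch below builds the same menu both ways; menu_from_records is a name chosen for this example, not something from the original.

```python
import pandas as pd

# List of lists: column names are given separately
data = [
    ['Pizza', 10],
    ['Burger', 8],
    ['Salad', 5]
]
menu = pd.DataFrame(data, columns=['food', 'price'])

# List of dicts: the keys become the column names
records = [
    {'food': 'Pizza', 'price': 10},
    {'food': 'Burger', 'price': 8},
    {'food': 'Salad', 'price': 5}
]
menu_from_records = pd.DataFrame(records)

print(menu)
print(menu_from_records)
```

Both tables hold the same rows, so pick whichever form matches how your data already looks.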
🏷️ Pandas Index Concepts: Your Labeling System
What is an Index?
The index is like name tags on lockers. It helps you find things fast!
graph TD
    A["Index = Labels"] --> B["0: First Row"]
    A --> C["1: Second Row"]
    A --> D["2: Third Row"]
    E["Custom Index"] --> F["Alice: First Row"]
    E --> G["Bob: Second Row"]
    E --> H["Charlie: Third Row"]
Default vs Custom Index
# Default index: 0, 1, 2, 3...
df = pd.DataFrame({'score': [85, 90, 78]})
# Index: 0, 1, 2
# Custom index: meaningful labels
df_custom = pd.DataFrame(
    {'score': [85, 90, 78]},
    index=['Alice', 'Bob', 'Charlie']
)
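A quick way to feel the difference is to look up the same score by label and by position. This sketch repeats the two DataFrames from above; bob_by_label and bob_by_position are illustrative names.

```python
import pandas as pd

# Default index: 0, 1, 2
df = pd.DataFrame({'score': [85, 90, 78]})

# Custom index: names as labels
df_custom = pd.DataFrame(
    {'score': [85, 90, 78]},
    index=['Alice', 'Bob', 'Charlie']
)

print(list(df.index))         # [0, 1, 2]
print(list(df_custom.index))  # ['Alice', 'Bob', 'Charlie']

# Same cell, two ways of asking for it
bob_by_label = df_custom.loc['Bob', 'score']  # by name tag
bob_by_position = df_custom.iloc[1, 0]        # second row, first column

print(bob_by_label, bob_by_position)  # 90 90
```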
Working with Index
# Set a column as index (using the students DataFrame from earlier)
df = students.set_index('name')
# Access by index label
df.loc['Alice']
# Access by position
df.iloc[0]
# Reset back to numbers ('name' becomes a regular column again)
df = df.reset_index()
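Setting and resetting the index is a round trip: the column moves into the label slot and back out again. A small self-contained sketch, assuming the students roster from earlier (by_name and restored are names invented here):

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [10, 11, 10],
    'grade': ['A', 'B', 'A']
})

# 'name' leaves the columns and becomes the row labels
by_name = students.set_index('name')
print(list(by_name.columns))  # ['age', 'grade']

# Label lookup works while 'name' is the index
alice_age = by_name.loc['Alice', 'age']
print(alice_age)  # 10

# reset_index puts 'name' back as a column with a 0, 1, 2 index
restored = by_name.reset_index()
print(list(restored.columns))  # ['name', 'age', 'grade']
```

Note that set_index and reset_index return new DataFrames, so students itself is left unchanged unless you reassign it.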
Reading and Writing Data: Import & Export
Reading Data Files
Think of this as opening a recipe book and copying recipes into your kitchen.
graph TD
    A["External Files"] --> B["CSV Files"]
    A --> C["Excel Files"]
    A --> D["JSON Files"]
    B --> E["pd.read_csv"]
    C --> F["pd.read_excel"]
    D --> G["pd.read_json"]
    E --> H["DataFrame in Python"]
    F --> H
    G --> H
Reading CSV (Most Common!)
# Read a CSV file
df = pd.read_csv('students.csv')
# See first 5 rows
print(df.head())
# See last 3 rows
print(df.tail(3))
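If you don't have a students.csv file handy, you can still try read_csv by handing it text that behaves like a file. The sketch below uses io.StringIO for that; the CSV content is invented sample data, not a real file from this tutorial.

```python
import io
import pandas as pd

# Sample CSV text standing in for a file on disk
csv_text = """name,age,grade
Alice,10,A
Bob,11,B
Charlie,10,A
"""

# StringIO makes the string look like an open file
df = pd.read_csv(io.StringIO(csv_text))

print(df.head(2))  # first 2 rows
print(df.tail(1))  # last row
print(df.shape)    # (3, 3)
```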
Reading Excel
# Read Excel file (.xlsx files need an Excel engine such as openpyxl installed)
df = pd.read_excel('data.xlsx')
# Read specific sheet
df = pd.read_excel(
    'data.xlsx',
    sheet_name='Sheet2'
)
Writing Data Out
# Save to CSV
df.to_csv('output.csv', index=False)
# Save to Excel (needs an Excel engine such as openpyxl installed)
df.to_excel('output.xlsx', index=False)
# Save to JSON
df.to_json('output.json')
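It helps to see what index=False actually changes. This sketch writes a small menu to a temporary folder twice, once with and once without it, and reads each file back. The file name menu.csv and the variable names are assumptions made for the example.

```python
import os
import tempfile
import pandas as pd

menu = pd.DataFrame({
    'food': ['Pizza', 'Burger', 'Salad'],
    'price': [10, 8, 5]
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'menu.csv')

    # With index=False only the real columns are written
    menu.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # Without it, the 0, 1, 2 index is saved as an extra column
    menu.to_csv(path)
    with_index = pd.read_csv(path)

print(list(reloaded.columns))    # ['food', 'price']
print(list(with_index.columns))  # ['Unnamed: 0', 'food', 'price']
```

That stray "Unnamed: 0" column is a common surprise, which is why the examples above pass index=False.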
🎯 Data Selection and Filtering: Finding What You Need
Selecting Columns
# Single column (returns Series)
names = df['name']
# Multiple columns (returns DataFrame)
subset = df[['name', 'age']]
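The number of brackets decides what you get back. Here is a short sketch, assuming a df shaped like the students roster, that checks the type of each result:

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [10, 11, 10],
    'grade': ['A', 'B', 'A']
})

one_column = df['name']           # single brackets: a Series
one_column_table = df[['name']]   # double brackets: a one-column DataFrame
two_columns = df[['name', 'age']]

print(type(one_column).__name__)        # Series
print(type(one_column_table).__name__)  # DataFrame
print(two_columns.shape)                # (3, 2)
```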
Selecting Rows
# By position (iloc)
first_row = df.iloc[0]
first_three = df.iloc[0:3]
# By label (loc); this needs 'name' as the index
alice_row = df.set_index('name').loc['Alice']
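One detail that trips people up: slicing with iloc stops before the end position, while slicing with loc includes the end label. A minimal sketch with a name-labeled table (the data is the same made-up roster used earlier):

```python
import pandas as pd

df = pd.DataFrame(
    {'age': [10, 11, 10], 'grade': ['A', 'B', 'A']},
    index=['Alice', 'Bob', 'Charlie']
)

# Positions 0 and 1; position 2 is NOT included
by_position = df.iloc[0:2]

# Labels from 'Alice' through 'Bob'; 'Bob' IS included
by_label = df.loc['Alice':'Bob']

print(list(by_position.index))  # ['Alice', 'Bob']
print(list(by_label.index))     # ['Alice', 'Bob']

# A single cell: row label first, then column name
print(df.loc['Bob', 'grade'])   # B
```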
Filtering with Conditions
This is like asking questions to your data!
# Who is older than 10?
older = df[df['age'] > 10]
# Who got grade A?
grade_a = df[df['grade'] == 'A']
# Combine conditions with &
smart_young = df[
    (df['grade'] == 'A') &
    (df['age'] == 10)
]
graph TD
    A["All Data"] --> B{age > 10?}
    B -->|Yes| C["Keep Row"]
    B -->|No| D["Skip Row"]
    C --> E["Filtered Data"]
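Besides & (and), you can ask "or" questions with |, flip a question with ~, and match a list of values with isin. The sketch below assumes the students roster plus one extra row, Dana, added only to make the results more interesting.

```python
import pandas as pd

students = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Dana'],  # Dana is a made-up extra row
    'age': [10, 11, 10, 12],
    'grade': ['A', 'B', 'A', 'A']
})

# OR: older than 10, or grade B (each condition needs its own parentheses)
older_or_b = students[(students['age'] > 10) | (students['grade'] == 'B')]

# NOT: everyone who did not get an A
not_a = students[~(students['grade'] == 'A')]

# isin: keep rows whose name is in a list
picked = students[students['name'].isin(['Alice', 'Dana'])]

print(older_or_b['name'].tolist())  # ['Bob', 'Dana']
print(not_a['name'].tolist())       # ['Bob']
print(picked['name'].tolist())      # ['Alice', 'Dana']
```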
Quick Selection Examples
# First 5 rows
df.head()
# Last 5 rows
df.tail()
# Random sample of 3 rows
df.sample(3)
# Rows at positions 10 through 20 (the end, 21, is not included)
df.iloc[10:21]
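Here is a small sketch of those shortcuts on a 30-row table invented for the example. Passing random_state to sample is optional; it is used here so the "random" pick is the same every time you run it.

```python
import pandas as pd

# A simple 30-row table with values 0..29
df = pd.DataFrame({'n': range(30)})

first_two = df.head(2)  # head and tail accept a row count

# Same random_state, same rows
picked_once = df.sample(3, random_state=42)
picked_again = df.sample(3, random_state=42)

# Positions 10 through 20: that is 11 rows
rows_10_to_20 = df.iloc[10:21]

print(first_two['n'].tolist())           # [0, 1]
print(picked_once.equals(picked_again))  # True
print(len(rows_10_to_20))                # 11
```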
🕳️ Handling Missing Data: Dealing with Empty Boxes
What is Missing Data?
Missing data shows as NaN (Not a Number). It's like:
- Empty boxes in your storage
- Blank cells in a spreadsheet
- Information we don't have yet
Finding Missing Data
# Check for missing values
print(df.isnull())
# Count missing in each column
print(df.isnull().sum())
# Check which columns have any missing value
print(df.isnull().any())
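To try these checks, you need a table with a few empty boxes in it. This sketch builds one by hand (the missing spots are placed deliberately for illustration) and also shows one way to pull out the rows that have a gap.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [10, np.nan, 10],     # Bob's age is missing
    'grade': ['A', None, None]   # two grades are missing
})

# How many empty boxes per column?
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Is anything missing anywhere? (one True/False for the whole table)
has_any_missing = df.isnull().any().any()
print(has_any_missing)

# Which rows have at least one empty box?
rows_with_missing = df[df.isnull().any(axis=1)]
print(rows_with_missing)
```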
Dealing with Missing Data
# Option 1: Remove rows with missing
df_clean = df.dropna()
# Option 2: Fill with a value
df_filled = df.fillna(0)
# Option 3: Fill with average
df['age'] = df['age'].fillna(
    df['age'].mean()
)
# Option 4: Fill with previous value
# (fillna(method='ffill') is deprecated in recent pandas)
df_ffill = df.ffill()
graph TD
    A["Missing Data?"] --> B{How to handle?}
    B --> C["dropna<br>Remove it"]
    B --> D["fillna<br>Replace it"]
    D --> E["With 0"]
    D --> F["With mean"]
    D --> G["With previous"]
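The options above can be compared side by side on one tiny table. This is a sketch with invented data; it also shows two variations you may find handy: dropna(subset=...) to look at only some columns, and fillna with a dictionary to use a different filler per column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [10, np.nan, 12],
    'city': ['Rome', 'Oslo', None]
})

# Drop every row that has any gap: only row 0 survives
dropped = df.dropna()

# Drop only when 'age' is missing: rows 0 and 2 survive
dropped_age_only = df.dropna(subset=['age'])

# Different filler per column: mean for age, a placeholder for city
filled = df.fillna({'age': df['age'].mean(), 'city': 'unknown'})

# Carry the previous row's value forward
carried = df.ffill()

print(dropped)
print(dropped_age_only)
print(filled)
print(carried)
```

Which option is right depends on the data: dropping is simple but throws rows away, while filling keeps rows at the cost of made-up values.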
Data Type Conversion: Transforming Your Data
Why Convert Types?
Sometimes data arrives in the wrong format:
- Numbers stored as text, like "123"
- Dates stored as text, like "2024-01-15"
- Categories stored as text
Checking Data Types
# See all column types
print(df.dtypes)
# Common types:
# int64 = whole numbers
# float64 = decimal numbers
# object = text/mixed
# bool = True/False
# datetime64[ns] = dates and times
Converting Types
# Text to number (raises an error if any value is not a whole number)
df['price'] = df['price'].astype(int)
# Number to text
df['id'] = df['id'].astype(str)
# Text to datetime
df['date'] = pd.to_datetime(df['date'])
# Text to category (saves memory when values repeat a lot)
df['color'] = df['color'].astype('category')
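Here are the four conversions applied to one small table so you can watch the dtypes change. The column values are invented sample data, chosen so that every conversion succeeds.

```python
import pandas as pd

df = pd.DataFrame({
    'price': ['10', '8', '5'],  # numbers stored as text
    'id': [1, 2, 3],
    'date': ['2024-01-15', '2024-02-01', '2024-03-10'],
    'color': ['red', 'blue', 'red']
})
print(df.dtypes)  # before

df['price'] = df['price'].astype(int)
df['id'] = df['id'].astype(str)
df['date'] = pd.to_datetime(df['date'])
df['color'] = df['color'].astype('category')
print(df.dtypes)  # after

# Now number math and date parts work
print(df['price'].sum())      # 23
print(df['date'].dt.month.tolist())  # [1, 2, 3]
```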
Handling Conversion Errors
# Safe conversion with errors='coerce'
# Bad values become NaN instead of error
df['age'] = pd.to_numeric(
    df['age'],
    errors='coerce'
)
graph TD
    A["Original Type"] --> B{Convert to?}
    B --> C["int/float<br>astype or to_numeric"]
    B --> D["string<br>astype str"]
    B --> E["datetime<br>pd.to_datetime"]
    B --> F["category<br>astype category"]
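To see what errors='coerce' does with a bad ingredient, try it on a column with one value that is not a number. A minimal sketch with made-up data; it also shows one way to find which original values could not be converted.

```python
import pandas as pd

# 'ten' cannot be turned into a number
raw = pd.Series(['10', '11', 'ten', '12.5'])

# Bad values become NaN instead of raising an error
ages = pd.to_numeric(raw, errors='coerce')
print(ages)

# Count the values that failed to convert
print(ages.isnull().sum())  # 1

# Look at the original text of the failed values
bad_values = raw[ages.isnull()]
print(bad_values.tolist())  # ['ten']
```

Checking bad_values before filling or dropping is a good habit, since coerce hides problems quietly.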
Quick Reference Card
| Task | Code |
|---|---|
| Create Series | pd.Series([1,2,3]) |
| Create DataFrame | pd.DataFrame({'a':[1,2]}) |
| Read CSV | pd.read_csv('file.csv') |
| Write CSV | df.to_csv('out.csv') |
| Select column | df['column'] |
| Filter rows | df[df['age'] > 18] |
| Check missing | df.isnull().sum() |
| Fill missing | df.fillna(0) |
| Convert type | df['col'].astype(int) |
You Did It!
You now understand the 7 fundamental pillars of Pandas:
- ✅ Series - Single columns of data
- ✅ DataFrame - Tables with rows and columns
- ✅ Index - Labels for fast lookup
- ✅ Read/Write - Getting data in and out
- ✅ Selection/Filtering - Finding specific data
- ✅ Missing Data - Handling empty values
- ✅ Type Conversion - Transforming data types
Next step: Practice with real data! Try loading a CSV file and explore it using what you learned.
Remember: Every data scientist started exactly where you are now. Keep practicing, keep exploring, and you'll master Pandas in no time! 🐼✨
