The Magical Drawer Organizer: Mastering tidyr
Your Dataโs Best Friend
Imagine you have a messy toy box. Some toys are scattered everywhere, some are mixed together, and some are missing! tidyr is like a magical drawer organizer that helps you put everything in the right place.
๐ฏ What is tidyr?
tidyr is part of the Tidyverse family in R. It helps you:
- Reshape your data (like folding clothes different ways)
- Split information apart (like separating mixed crayons)
- Join pieces together (like combining puzzle pieces)
- Handle missing stuff (like finding lost toys)
๐ PIVOT FUNCTIONS: The Shape-Shifters
The Bookshelf Analogy
Think of your data like books on a shelf. Sometimes you want them standing tall (long format), and sometimes you want them lying flat (wide format). Pivot functions help you change how your data โsitsโ!
pivot_longer(): Making Data Tall
What Does It Do?
pivot_longer() takes wide data and makes it tall. Itโs like taking books lying flat and standing them up one by one.
When Do You Use It?
When you have column names that are actually values (like months: Jan, Feb, Mar as columns).
Simple Example
Before (Wide - books lying flat):
| Student | Math | Science | Art |
|---|---|---|---|
| Emma | 90 | 85 | 95 |
| Liam | 80 | 90 | 88 |
After (Long - books standing tall):
| Student | Subject | Score |
|---|---|---|
| Emma | Math | 90 |
| Emma | Science | 85 |
| Emma | Art | 95 |
| Liam | Math | 80 |
| Liam | Science | 90 |
| Liam | Art | 88 |
The Code
library(tidyr)
# Create wide data
grades <- tibble(
Student = c("Emma", "Liam"),
Math = c(90, 80),
Science = c(85, 90),
Art = c(95, 88)
)
# Make it tall!
tall_grades <- grades %>%
pivot_longer(
cols = c(Math, Science, Art),
names_to = "Subject",
values_to = "Score"
)
Breaking It Down
| Argument | What It Means | Likeโฆ |
|---|---|---|
cols |
Which columns to pivot | Which books to stand up |
names_to |
New column for old column names | Label for the book titles |
values_to |
New column for the values | Label for the book contents |
pivot_wider(): Making Data Wide
What Does It Do?
pivot_wider() takes tall data and spreads it wide. Itโs like taking standing books and laying them flat side by side.
When Do You Use It?
When you want each unique value to become its own column.
Simple Example
Before (Long):
| City | Month | Temp |
|---|---|---|
| Tokyo | Jan | 5 |
| Tokyo | Feb | 7 |
| Paris | Jan | 4 |
| Paris | Feb | 5 |
After (Wide):
| City | Jan | Feb |
|---|---|---|
| Tokyo | 5 | 7 |
| Paris | 4 | 5 |
The Code
# Create long data
temps <- tibble(
City = c("Tokyo", "Tokyo",
"Paris", "Paris"),
Month = c("Jan", "Feb",
"Jan", "Feb"),
Temp = c(5, 7, 4, 5)
)
# Make it wide!
wide_temps <- temps %>%
pivot_wider(
names_from = Month,
values_from = Temp
)
Breaking It Down
| Argument | What It Means | Likeโฆ |
|---|---|---|
names_from |
Column whose values become new column names | Book titles become shelf labels |
values_from |
Column whose values fill the new columns | Book contents fill the shelves |
๐ Quick Comparison
graph TD A["Wide Data"] -->|pivot_longer| B["Long Data"] B -->|pivot_wider| A style A fill:#ffd700,stroke:#333 style B fill:#87ceeb,stroke:#333
Remember:
- Longer = More rows, fewer columns
- Wider = More columns, fewer rows
โ๏ธ SEPARATE AND UNITE: The Splitters and Joiners
The Birthday Cake Analogy
Sometimes you have a whole cake (combined data), and you need to cut it into slices (separate). Other times, you have slices you want to put back together into a cake (unite)!
separate(): Cutting the Cake
What Does It Do?
separate() splits one column into multiple columns. Like cutting a cake into pieces!
Simple Example
Before:
| id | name_age |
|---|---|
| 1 | Emma_12 |
| 2 | Liam_10 |
| 3 | Olivia_11 |
After:
| id | name | age |
|---|---|---|
| 1 | Emma | 12 |
| 2 | Liam | 10 |
| 3 | Olivia | 11 |
The Code
# Data with combined info
kids <- tibble(
id = 1:3,
name_age = c("Emma_12",
"Liam_10",
"Olivia_11")
)
# Separate into two columns
kids_split <- kids %>%
separate(
col = name_age,
into = c("name", "age"),
sep = "_"
)
Key Arguments
| Argument | What It Means |
|---|---|
col |
Column to split |
into |
Names for new columns |
sep |
What to split on (โ_โ, โ-โ, etc.) |
separate_rows(): One-to-Many Split
What Does It Do?
When one cell has multiple values, separate_rows() creates multiple rows for each value!
Simple Example
Before:
| student | hobbies |
|---|---|
| Emma | swim,read,dance |
| Liam | game,code |
After:
| student | hobbies |
|---|---|
| Emma | swim |
| Emma | read |
| Emma | dance |
| Liam | game |
| Liam | code |
The Code
# Data with lists in cells
hobbies <- tibble(
student = c("Emma", "Liam"),
hobbies = c("swim,read,dance",
"game,code")
)
# Separate into rows
hobbies_long <- hobbies %>%
separate_rows(hobbies, sep = ",")
unite(): Joining the Pieces
What Does It Do?
unite() combines multiple columns into one column. The opposite of separate!
Simple Example
Before:
| first | last | grade |
|---|---|---|
| Emma | Smith | A |
| Liam | Jones | B |
After:
| full_name | grade |
|---|---|
| Emma_Smith | A |
| Liam_Jones | B |
The Code
# Data with separate columns
names <- tibble(
first = c("Emma", "Liam"),
last = c("Smith", "Jones"),
grade = c("A", "B")
)
# Unite into one column
names_joined <- names %>%
unite(
col = "full_name",
first, last,
sep = "_"
)
Key Arguments
| Argument | What It Means |
|---|---|
col |
Name for new combined column |
... |
Columns to combine |
sep |
What to put between values |
โ๏ธ Separate vs Unite
graph LR A["One Column"] -->|separate| B["Multiple Columns"] B -->|unite| A style A fill:#ff9999,stroke:#333 style B fill:#99ff99,stroke:#333
๐ MISSING VALUE HANDLING: Finding the Lost Toys
The Treasure Hunt Analogy
Sometimes toys get lost under the bed (missing values). tidyr helps you:
- Find the missing toys
- Fill in the empty spots
- Remove the unfixable gaps
Understanding NA
In R, NA means โNot Availableโ - a missing value.
# A vector with missing value
ages <- c(10, NA, 12, 8, NA)
Itโs like a puzzle with missing pieces!
drop_na(): Removing Missing Values
What Does It Do?
drop_na() removes rows that have missing values. Like throwing away incomplete puzzle pieces.
Simple Example
Before:
| name | age | score |
|---|---|---|
| Emma | 10 | 95 |
| Liam | NA | 88 |
| Olivia | 11 | NA |
| Noah | 9 | 92 |
After (drop_na()):
| name | age | score |
|---|---|---|
| Emma | 10 | 95 |
| Noah | 9 | 92 |
The Code
# Data with missing values
students <- tibble(
name = c("Emma", "Liam",
"Olivia", "Noah"),
age = c(10, NA, 11, 9),
score = c(95, 88, NA, 92)
)
# Drop ALL rows with ANY NA
clean_data <- students %>%
drop_na()
# Drop rows with NA only in 'age'
clean_age <- students %>%
drop_na(age)
Be careful! drop_na() removes entire rows. Use wisely!
fill(): Filling the Gaps
What Does It Do?
fill() takes the previous (or next) value and copies it into empty spots. Like filling a coloring book by continuing the same color!
Simple Example
Before:
| year | quarter | sales |
|---|---|---|
| 2023 | Q1 | 100 |
| NA | Q2 | 120 |
| NA | Q3 | 115 |
| 2024 | Q1 | 130 |
After (fill year):
| year | quarter | sales |
|---|---|---|
| 2023 | Q1 | 100 |
| 2023 | Q2 | 120 |
| 2023 | Q3 | 115 |
| 2024 | Q1 | 130 |
The Code
# Data with gaps
sales <- tibble(
year = c(2023, NA, NA, 2024),
quarter = c("Q1", "Q2", "Q3", "Q1"),
sales = c(100, 120, 115, 130)
)
# Fill down (default)
filled <- sales %>%
fill(year)
# Fill up
filled_up <- sales %>%
fill(year, .direction = "up")
Direction Options
| Direction | What It Does |
|---|---|
"down" |
Copy from above (default) |
"up" |
Copy from below |
"downup" |
Down first, then up |
"updown" |
Up first, then down |
replace_na(): Swap NA for Something Else
What Does It Do?
replace_na() replaces all NAs with a value you choose. Like putting a placeholder toy where the missing one was!
Simple Example
Before:
| item | price |
|---|---|
| Apple | 1.50 |
| Banana | NA |
| Cherry | 2.00 |
After (replace NA with 0):
| item | price |
|---|---|
| Apple | 1.50 |
| Banana | 0.00 |
| Cherry | 2.00 |
The Code
# Data with NA prices
fruits <- tibble(
item = c("Apple", "Banana", "Cherry"),
price = c(1.50, NA, 2.00)
)
# Replace NA with 0
fruits_fixed <- fruits %>%
mutate(
price = replace_na(price, 0)
)
# Or replace multiple columns
fruits_fixed2 <- fruits %>%
replace_na(list(price = 0))
complete(): Making Explicit Gaps
What Does It Do?
complete() makes sure all combinations exist. Missing ones get NA. Like making sure every puzzle piece slot exists, even if empty!
Simple Example
Before:
| student | subject | score |
|---|---|---|
| Emma | Math | 90 |
| Emma | Art | 95 |
| Liam | Math | 85 |
After (complete all combinations):
| student | subject | score |
|---|---|---|
| Emma | Math | 90 |
| Emma | Art | 95 |
| Liam | Math | 85 |
| Liam | Art | NA |
The Code
# Incomplete data
scores <- tibble(
student = c("Emma", "Emma", "Liam"),
subject = c("Math", "Art", "Math"),
score = c(90, 95, 85)
)
# Complete all combinations
full_data <- scores %>%
complete(student, subject)
# With a fill value
full_filled <- scores %>%
complete(
student, subject,
fill = list(score = 0)
)
๐ Quick Reference
| Function | What It Does | When to Use |
|---|---|---|
drop_na() |
Removes rows with NA | When you must have complete data |
fill() |
Copies previous/next value | When data should repeat |
replace_na() |
Swaps NA for a value | When you want a default |
complete() |
Creates missing combinations | When you need all possibilities |
๐ฏ Missing Values Flow
graph TD A["Data with NAs"] --> B{What do you need?} B -->|Remove them| C["drop_na"] B -->|Copy nearby values| D["fill"] B -->|Use default value| E["replace_na"] B -->|Show all combinations| F["complete"] style A fill:#ffcccc,stroke:#333 style C fill:#99ff99,stroke:#333 style D fill:#99ff99,stroke:#333 style E fill:#99ff99,stroke:#333 style F fill:#99ff99,stroke:#333
๐ PUTTING IT ALL TOGETHER
A Real-World Example
Letโs clean up messy survey data!
library(tidyr)
library(dplyr)
# Messy survey data
survey <- tibble(
response_id = c(1, 2, 3),
name_date = c("Emma_2023-01-15",
"Liam_2023-01-16",
"Olivia_2023-01-17"),
Q1 = c(5, NA, 4),
Q2 = c(4, 3, NA),
Q3 = c(5, 4, 5)
)
# Clean it up!
clean_survey <- survey %>%
# Separate name and date
separate(name_date,
c("name", "date"),
sep = "_") %>%
# Make questions long
pivot_longer(
cols = starts_with("Q"),
names_to = "question",
values_to = "rating"
) %>%
# Fill missing with median
mutate(
rating = replace_na(rating, 4)
)
๐ You Did It!
You now know how to:
- Reshape data with pivot_longer and pivot_wider
- Split and join columns with separate and unite
- Handle missing values with drop_na, fill, replace_na, and complete
Remember the analogies:
- ๐ Pivots = Books standing or lying on a shelf
- ๐ Separate/Unite = Cutting and rejoining cake
- ๐ Missing Values = Finding lost toys
Happy data cleaning! ๐
