tidyr

Back

Loading concept...

The Magical Drawer Organizer: Mastering tidyr

Your Dataโ€™s Best Friend

Imagine you have a messy toy box. Some toys are scattered everywhere, some are mixed together, and some are missing! tidyr is like a magical drawer organizer that helps you put everything in the right place.


๐ŸŽฏ What is tidyr?

tidyr is part of the Tidyverse family in R. It helps you:

  • Reshape your data (like folding clothes different ways)
  • Split information apart (like separating mixed crayons)
  • Join pieces together (like combining puzzle pieces)
  • Handle missing stuff (like finding lost toys)

๐Ÿ”„ PIVOT FUNCTIONS: The Shape-Shifters

The Bookshelf Analogy

Think of your data like books on a shelf. Sometimes you want them standing tall (long format), and sometimes you want them lying flat (wide format). Pivot functions help you change how your data โ€œsitsโ€!


pivot_longer(): Making Data Tall

What Does It Do?

pivot_longer() takes wide data and makes it tall. Itโ€™s like taking books lying flat and standing them up one by one.

When Do You Use It?

When you have column names that are actually values (like months: Jan, Feb, Mar as columns).

Simple Example

Before (Wide - books lying flat):

Student Math Science Art
Emma 90 85 95
Liam 80 90 88

After (Long - books standing tall):

Student Subject Score
Emma Math 90
Emma Science 85
Emma Art 95
Liam Math 80
Liam Science 90
Liam Art 88

The Code

library(tidyr)

# Create wide data
grades <- tibble(
  Student = c("Emma", "Liam"),
  Math = c(90, 80),
  Science = c(85, 90),
  Art = c(95, 88)
)

# Make it tall!
tall_grades <- grades %>%
  pivot_longer(
    cols = c(Math, Science, Art),
    names_to = "Subject",
    values_to = "Score"
  )

Breaking It Down

Argument What It Means Likeโ€ฆ
cols Which columns to pivot Which books to stand up
names_to New column for old column names Label for the book titles
values_to New column for the values Label for the book contents

pivot_wider(): Making Data Wide

What Does It Do?

pivot_wider() takes tall data and spreads it wide. Itโ€™s like taking standing books and laying them flat side by side.

When Do You Use It?

When you want each unique value to become its own column.

Simple Example

Before (Long):

City Month Temp
Tokyo Jan 5
Tokyo Feb 7
Paris Jan 4
Paris Feb 5

After (Wide):

City Jan Feb
Tokyo 5 7
Paris 4 5

The Code

# Create long data
temps <- tibble(
  City = c("Tokyo", "Tokyo",
           "Paris", "Paris"),
  Month = c("Jan", "Feb",
            "Jan", "Feb"),
  Temp = c(5, 7, 4, 5)
)

# Make it wide!
wide_temps <- temps %>%
  pivot_wider(
    names_from = Month,
    values_from = Temp
  )

Breaking It Down

Argument What It Means Likeโ€ฆ
names_from Column whose values become new column names Book titles become shelf labels
values_from Column whose values fill the new columns Book contents fill the shelves

๐Ÿ”„ Quick Comparison

graph TD A["Wide Data"] -->|pivot_longer| B["Long Data"] B -->|pivot_wider| A style A fill:#ffd700,stroke:#333 style B fill:#87ceeb,stroke:#333

Remember:

  • Longer = More rows, fewer columns
  • Wider = More columns, fewer rows

โœ‚๏ธ SEPARATE AND UNITE: The Splitters and Joiners

The Birthday Cake Analogy

Sometimes you have a whole cake (combined data), and you need to cut it into slices (separate). Other times, you have slices you want to put back together into a cake (unite)!


separate(): Cutting the Cake

What Does It Do?

separate() splits one column into multiple columns. Like cutting a cake into pieces!

Simple Example

Before:

id name_age
1 Emma_12
2 Liam_10
3 Olivia_11

After:

id name age
1 Emma 12
2 Liam 10
3 Olivia 11

The Code

# Data with combined info
kids <- tibble(
  id = 1:3,
  name_age = c("Emma_12",
               "Liam_10",
               "Olivia_11")
)

# Separate into two columns
kids_split <- kids %>%
  separate(
    col = name_age,
    into = c("name", "age"),
    sep = "_"
  )

Key Arguments

Argument What It Means
col Column to split
into Names for new columns
sep What to split on (โ€œ_โ€, โ€œ-โ€, etc.)

separate_rows(): One-to-Many Split

What Does It Do?

When one cell has multiple values, separate_rows() creates multiple rows for each value!

Simple Example

Before:

student hobbies
Emma swim,read,dance
Liam game,code

After:

student hobbies
Emma swim
Emma read
Emma dance
Liam game
Liam code

The Code

# Data with lists in cells
hobbies <- tibble(
  student = c("Emma", "Liam"),
  hobbies = c("swim,read,dance",
              "game,code")
)

# Separate into rows
hobbies_long <- hobbies %>%
  separate_rows(hobbies, sep = ",")

unite(): Joining the Pieces

What Does It Do?

unite() combines multiple columns into one column. The opposite of separate!

Simple Example

Before:

first last grade
Emma Smith A
Liam Jones B

After:

full_name grade
Emma_Smith A
Liam_Jones B

The Code

# Data with separate columns
names <- tibble(
  first = c("Emma", "Liam"),
  last = c("Smith", "Jones"),
  grade = c("A", "B")
)

# Unite into one column
names_joined <- names %>%
  unite(
    col = "full_name",
    first, last,
    sep = "_"
  )

Key Arguments

Argument What It Means
col Name for new combined column
... Columns to combine
sep What to put between values

โœ‚๏ธ Separate vs Unite

graph LR A["One Column"] -->|separate| B["Multiple Columns"] B -->|unite| A style A fill:#ff9999,stroke:#333 style B fill:#99ff99,stroke:#333

๐Ÿ” MISSING VALUE HANDLING: Finding the Lost Toys

The Treasure Hunt Analogy

Sometimes toys get lost under the bed (missing values). tidyr helps you:

  • Find the missing toys
  • Fill in the empty spots
  • Remove the unfixable gaps

Understanding NA

In R, NA means โ€œNot Availableโ€ - a missing value.

# A vector with missing value
ages <- c(10, NA, 12, 8, NA)

Itโ€™s like a puzzle with missing pieces!


drop_na(): Removing Missing Values

What Does It Do?

drop_na() removes rows that have missing values. Like throwing away incomplete puzzle pieces.

Simple Example

Before:

name age score
Emma 10 95
Liam NA 88
Olivia 11 NA
Noah 9 92

After (drop_na()):

name age score
Emma 10 95
Noah 9 92

The Code

# Data with missing values
students <- tibble(
  name = c("Emma", "Liam",
           "Olivia", "Noah"),
  age = c(10, NA, 11, 9),
  score = c(95, 88, NA, 92)
)

# Drop ALL rows with ANY NA
clean_data <- students %>%
  drop_na()

# Drop rows with NA only in 'age'
clean_age <- students %>%
  drop_na(age)

Be careful! drop_na() removes entire rows. Use wisely!


fill(): Filling the Gaps

What Does It Do?

fill() takes the previous (or next) value and copies it into empty spots. Like filling a coloring book by continuing the same color!

Simple Example

Before:

year quarter sales
2023 Q1 100
NA Q2 120
NA Q3 115
2024 Q1 130

After (fill year):

year quarter sales
2023 Q1 100
2023 Q2 120
2023 Q3 115
2024 Q1 130

The Code

# Data with gaps
sales <- tibble(
  year = c(2023, NA, NA, 2024),
  quarter = c("Q1", "Q2", "Q3", "Q1"),
  sales = c(100, 120, 115, 130)
)

# Fill down (default)
filled <- sales %>%
  fill(year)

# Fill up
filled_up <- sales %>%
  fill(year, .direction = "up")

Direction Options

Direction What It Does
"down" Copy from above (default)
"up" Copy from below
"downup" Down first, then up
"updown" Up first, then down

replace_na(): Swap NA for Something Else

What Does It Do?

replace_na() replaces all NAs with a value you choose. Like putting a placeholder toy where the missing one was!

Simple Example

Before:

item price
Apple 1.50
Banana NA
Cherry 2.00

After (replace NA with 0):

item price
Apple 1.50
Banana 0.00
Cherry 2.00

The Code

# Data with NA prices
fruits <- tibble(
  item = c("Apple", "Banana", "Cherry"),
  price = c(1.50, NA, 2.00)
)

# Replace NA with 0
fruits_fixed <- fruits %>%
  mutate(
    price = replace_na(price, 0)
  )

# Or replace multiple columns
fruits_fixed2 <- fruits %>%
  replace_na(list(price = 0))

complete(): Making Explicit Gaps

What Does It Do?

complete() makes sure all combinations exist. Missing ones get NA. Like making sure every puzzle piece slot exists, even if empty!

Simple Example

Before:

student subject score
Emma Math 90
Emma Art 95
Liam Math 85

After (complete all combinations):

student subject score
Emma Math 90
Emma Art 95
Liam Math 85
Liam Art NA

The Code

# Incomplete data
scores <- tibble(
  student = c("Emma", "Emma", "Liam"),
  subject = c("Math", "Art", "Math"),
  score = c(90, 95, 85)
)

# Complete all combinations
full_data <- scores %>%
  complete(student, subject)

# With a fill value
full_filled <- scores %>%
  complete(
    student, subject,
    fill = list(score = 0)
  )

๐Ÿ” Quick Reference

Function What It Does When to Use
drop_na() Removes rows with NA When you must have complete data
fill() Copies previous/next value When data should repeat
replace_na() Swaps NA for a value When you want a default
complete() Creates missing combinations When you need all possibilities

๐ŸŽฏ Missing Values Flow

graph TD A["Data with NAs"] --> B{What do you need?} B -->|Remove them| C["drop_na"] B -->|Copy nearby values| D["fill"] B -->|Use default value| E["replace_na"] B -->|Show all combinations| F["complete"] style A fill:#ffcccc,stroke:#333 style C fill:#99ff99,stroke:#333 style D fill:#99ff99,stroke:#333 style E fill:#99ff99,stroke:#333 style F fill:#99ff99,stroke:#333

๐Ÿ† PUTTING IT ALL TOGETHER

A Real-World Example

Letโ€™s clean up messy survey data!

library(tidyr)
library(dplyr)

# Messy survey data
survey <- tibble(
  response_id = c(1, 2, 3),
  name_date = c("Emma_2023-01-15",
                "Liam_2023-01-16",
                "Olivia_2023-01-17"),
  Q1 = c(5, NA, 4),
  Q2 = c(4, 3, NA),
  Q3 = c(5, 4, 5)
)

# Clean it up!
clean_survey <- survey %>%
  # Separate name and date
  separate(name_date,
           c("name", "date"),
           sep = "_") %>%
  # Make questions long
  pivot_longer(
    cols = starts_with("Q"),
    names_to = "question",
    values_to = "rating"
  ) %>%
  # Fill missing with median
  mutate(
    rating = replace_na(rating, 4)
  )

๐ŸŽ‰ You Did It!

You now know how to:

  1. Reshape data with pivot_longer and pivot_wider
  2. Split and join columns with separate and unite
  3. Handle missing values with drop_na, fill, replace_na, and complete

Remember the analogies:

  • ๐Ÿ“š Pivots = Books standing or lying on a shelf
  • ๐ŸŽ‚ Separate/Unite = Cutting and rejoining cake
  • ๐Ÿ” Missing Values = Finding lost toys

Happy data cleaning! ๐Ÿš€

Loading story...

Story - Premium Content

Please sign in to view this story and start learning.

Upgrade to Premium to unlock full access to all stories.

Stay Tuned!

Story is coming soon.

Story Preview

Story - Premium Content

Please sign in to view this concept and start learning.

Upgrade to Premium to unlock full access to all content.