🚀 Advanced R Programming: Performance
Imagine your R code is a chef in a kitchen. A fast chef knows where every ingredient is, uses the right tools, and never wastes a single move. Today, we’ll teach your R code to cook like a pro!
🍳 The Kitchen Analogy
Think of your computer like a kitchen:
- Memory = Your counter space (where you prep food)
- Vectorization = Using a food processor instead of chopping by hand
- Optimization = Finding the fastest recipe
- Profiling = Using a timer to see what takes longest
Let’s make your R code a master chef!
📦 Memory Management
What is Memory?
Memory is like your kitchen counter. You only have so much space!
Simple Example:
- If you put too many bowls on the counter, you can’t work
- Your computer works the same way with data
Why Does Memory Matter?
# Bad: Making copies everywhere
x <- 1:1000000
y <- x # Looks innocent...
y[1] <- 0 # R copies the WHOLE thing!
Until you change y, x and y share the same memory. The moment you modify y, R makes a complete copy of the data (this is called copy-on-modify). That’s like photocopying a whole cookbook just to fix one typo!
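To see this behaviour for yourself, here is a minimal sketch. The vectors are made up for illustration. `tracemem()` is part of base R but only reports copies when R was built with memory profiling, so that part is guarded with `capabilities("profmem")`.

```r
# x and y start out as two names for the same vector
x <- c(10, 20, 30)
y <- x

# Modifying y forces R to copy the data; x is left untouched
y[1] <- 0
print(x)  # x still holds 10 20 30
print(y)  # y now holds 0 20 30

# Optional: watch the copy happen (only if this R build supports it)
if (capabilities("profmem")) {
  a <- c(1, 2, 3)
  tracemem(a)    # start reporting copies of a
  b <- a         # no message: nothing has been copied yet
  b[1] <- 0      # a "tracemem[...]" message marks the copy
  untracemem(a)  # stop tracing
}
```

Some editors (RStudio, for example) hold extra references to objects, so you may see more copy messages there than in a plain R session.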
Smart Memory Tips
graph TD
  A["Create Data"] --> B{Need to Modify?}
  B -->|Yes| C["Modify In-Place"]
  B -->|No| D["Share Memory"]
  C --> E["Efficient!"]
  D --> E
Tip 1: Remove What You Don’t Need
# Free up space!
rm(big_data)
gc() # Optional: R cleans up automatically; gc() forces a collection and reports memory use
Tip 2: Check Your Memory Usage
# How big is my data?
object.size(my_data)
# See all objects
ls()
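Tips 1 and 2 can be combined into one small workflow. The sketch below uses made-up objects (`my_data`, `lookup`) purely for illustration and lists the size of everything in the current environment before removing the largest item.

```r
# Hypothetical objects, for illustration only
my_data <- data.frame(id = 1:100000, value = rnorm(100000))
lookup  <- letters

env <- environment()  # the environment this script runs in

# Size of one object, in human-readable units
print(format(object.size(my_data), units = "MB"))

# Size in bytes of every object in this environment, largest first
sizes <- sapply(ls(envir = env), function(name) {
  as.numeric(object.size(get(name, envir = env)))
})
print(sort(sizes, decreasing = TRUE))

# Drop the big object once it is no longer needed
rm(my_data)
print(exists("my_data", envir = env, inherits = FALSE))  # FALSE
```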
Tip 3: Pre-allocate Space
# Bad: Growing a vector
result <- c()
for(i in 1:1000) {
result <- c(result, i) # Slow!
}
# Good: Pre-allocate
result <- numeric(1000)
for(i in 1:1000) {
result[i] <- i # Fast!
}
🎯 Key Insight: Pre-allocation is like setting out all your bowls before cooking. No running to the cabinet mid-recipe!
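A quick way to check the difference on your own machine is to wrap both versions in functions and time them. This is only a sketch: the function names `grow_vector` and `prealloc_vector` are made up here, and the timings will vary with hardware and R version.

```r
n <- 20000

# Grows the vector one element at a time (the data is re-copied on each append)
grow_vector <- function(n) {
  result <- integer(0)
  for (i in seq_len(n)) {
    result <- c(result, i)
  }
  result
}

# Allocates the full vector up front, then fills it in
prealloc_vector <- function(n) {
  result <- integer(n)
  for (i in seq_len(n)) {
    result[i] <- i
  }
  result
}

time_grow     <- system.time(grown  <- grow_vector(n))[["elapsed"]]
time_prealloc <- system.time(filled <- prealloc_vector(n))[["elapsed"]]

# Same answer either way; only the cost differs
stopifnot(identical(grown, filled))
cat("growing:", time_grow, "s | pre-allocated:", time_prealloc, "s\n")
```

Try increasing `n`: the growing version gets slower much faster than the pre-allocated one, because each append copies everything collected so far.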
⚡ Vectorization Benefits
What is Vectorization?
Imagine you need to peel 100 potatoes:
- Loop way: Peel one, put down, pick up next, peel…
- Vectorized way: Machine peels all 100 at once!
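R is AMAZING at doing things all at once. Under the hood the work still happens element by element, but inside fast compiled code instead of slow interpreted R.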
The Magic of Vectors
# Slow loop way (peeling one by one)
numbers <- 1:1000000
result <- numeric(1000000)
for(i in 1:1000000) {
result[i] <- numbers[i] * 2
}
# Fast vectorized way (all at once!)
result <- numbers * 2
The vectorized way is often 10-100x faster, though the exact speedup depends on the operation, the data size, and your R version.
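Rather than taking the speedup on faith, you can measure it. The sketch below wraps the loop in a helper (the name `double_with_loop` is made up for this example) and compares it against the vectorized expression on the same input.

```r
numbers <- as.numeric(1:200000)

# Loop version: one interpreted R step per element
double_with_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i] * 2
  out
}

loop_time <- system.time(loop_result <- double_with_loop(numbers))[["elapsed"]]
vec_time  <- system.time(vec_result  <- numbers * 2)[["elapsed"]]

# Both approaches must give exactly the same answer
stopifnot(identical(loop_result, vec_result))
cat("loop:", loop_time, "s | vectorized:", vec_time, "s\n")
```

On recent R versions the byte-code compiler makes simple loops like this one faster than they used to be, so the gap you see may be smaller than on older installations.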
Why Vectorization is Fast
graph TD
  A["Loop"] --> B["Check i"]
  B --> C["Get value"]
  C --> D["Calculate"]
  D --> E["Store"]
  E --> F["Repeat 1M times"]
  G["Vectorized"] --> H["Send all to CPU"]
  H --> I["CPU does all at once"]
  I --> J["Done!"]
Common Vectorized Functions
| Instead of Loop | Use This |
|---|---|
| for + sum() | sum(x) |
| for + mean() | mean(x) |
| for + comparison | x > 5 |
| for + math | x * 2 |
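Each row of the table can be checked directly. The sketch below uses a small made-up vector `x`, writes the loop version of every row, and confirms it matches the vectorized one-liner.

```r
x <- c(3, 8, 1, 9, 6)

# for + sum()  ->  sum(x)
loop_total <- 0
for (value in x) loop_total <- loop_total + value
stopifnot(loop_total == sum(x))

# for + mean()  ->  mean(x)
stopifnot(isTRUE(all.equal(loop_total / length(x), mean(x))))

# for + comparison  ->  x > 5
is_big <- logical(length(x))
for (i in seq_along(x)) is_big[i] <- x[i] > 5
stopifnot(identical(is_big, x > 5))

# for + math  ->  x * 2
doubled <- numeric(length(x))
for (i in seq_along(x)) doubled[i] <- x[i] * 2
stopifnot(identical(doubled, x * 2))

print(x > 5)  # FALSE TRUE FALSE TRUE TRUE
```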
Real Example:
# Find all numbers > 50
numbers <- 1:100
# Loop (slow)
big_ones <- c()
for(n in numbers) {
if(n > 50) big_ones <- c(big_ones, n)
}
# Vectorized (fast!)
big_ones <- numbers[numbers > 50]
🎯 Key Insight: If you’re writing a loop in R, ask yourself: “Can I do this all at once?”
🏎️ Performance Optimization
The Golden Rules
- Measure first, optimize second
- Don’t optimize code that runs once
- Make it work, then make it fast
Common Speed Killers
graph TD
  A["Slow Code"] --> B["Growing Objects"]
  A --> C["Unnecessary Copies"]
  A --> D["Loops over Vectors"]
  A --> E["Reading Files Repeatedly"]
Quick Wins
1. Use Built-in Functions
# Slow
my_sum <- 0
for(x in numbers) my_sum <- my_sum + x
# Fast (built-in!)
my_sum <- sum(numbers)
2. Avoid Growing Objects
# Bad: List grows each time
results <- list()
for(i in 1:1000) {
results[[i]] <- do_something(i)
}
# Better: Use lapply
results <- lapply(1:1000, do_something)
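If you would rather keep an explicit loop, pre-allocating the list gives you the same benefit. In the sketch below, `do_something()` is a hypothetical stand-in (the text does not define it); both approaches produce the same list.

```r
# Hypothetical stand-in for do_something() from the text
do_something <- function(i) i^2

n <- 1000

# Option 1: keep the loop, but allocate the list up front
results_loop <- vector("list", n)
for (i in seq_len(n)) {
  results_loop[[i]] <- do_something(i)
}

# Option 2: let lapply() build the list for you
results_lapply <- lapply(seq_len(n), do_something)

stopifnot(identical(results_loop, results_lapply))
print(results_lapply[[3]])  # 9
```

One thing to watch with the loop version: assigning `NULL` with `[[<-` removes a list element, so if your function can return `NULL`, `lapply()` is the safer choice.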
3. Read Data Once
# Bad: Reading in a loop
for(i in 1:10) {
data <- read.csv("file.csv") # Re-reads!
}
# Good: Read once, use many
data <- read.csv("file.csv")
for(i in 1:10) {
process(data) # Reuses the data already in memory
}
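Here is a self-contained version of the read-once pattern you can run as-is. It writes a small throwaway CSV to a temporary file, and `process()` is a hypothetical stand-in that takes an extra `multiplier` argument just so each iteration does something different.

```r
# Write a small throwaway CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5, value = c(10, 20, 30, 40, 50)),
          csv_path, row.names = FALSE)

# Hypothetical stand-in for process() from the text
process <- function(df, multiplier) sum(df$value) * multiplier

# Read once, outside the loop
data <- read.csv(csv_path)

# Reuse the in-memory data frame on every iteration
totals <- numeric(10)
for (i in 1:10) {
  totals[i] <- process(data, i)
}
print(totals)

unlink(csv_path)  # clean up the temporary file
```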
The apply Family
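| Function | When to Use |
|---|---|
| lapply | Apply a function to each list item; always returns a list |
| sapply | Same, but simplify the result to a vector or matrix when possible |
| vapply | Same, but you declare the type and length of each result |
| mapply | Apply a function over several inputs in parallel |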
# Square each number
numbers <- list(1, 2, 3, 4, 5)
squares <- lapply(numbers, function(x) x^2)
# Result: list(1, 4, 9, 16, 25)
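The other members of the family work on the same input. The sketch below runs the squaring example through `sapply()` and `vapply()`, adds a small `mapply()` call, and shows one reason `vapply()` is often preferred inside functions: its return type does not change when the input is empty.

```r
numbers <- list(1, 2, 3, 4, 5)

squares_list <- lapply(numbers, function(x) x^2)              # a list
squares_vec  <- sapply(numbers, function(x) x^2)              # simplified to a numeric vector
squares_safe <- vapply(numbers, function(x) x^2, numeric(1))  # numeric vector, type checked

print(squares_vec)   # 1 4 9 16 25

# mapply() walks several inputs in parallel: 1+4, 2+5, 3+6
sums <- mapply(function(a, b) a + b, 1:3, 4:6)
print(sums)          # 5 7 9

# With empty input, sapply() falls back to a list, vapply() does not
empty_sapply <- sapply(list(), function(x) x^2)
empty_vapply <- vapply(list(), function(x) x^2, numeric(1))
print(is.list(empty_sapply))  # TRUE
print(empty_vapply)           # numeric(0)
```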
🎯 Key Insight: Many of R’s built-in functions are implemented in C. That is why they are usually MUCH faster than an equivalent loop written in R!
🔍 Profiling Code
What is Profiling?
Profiling = Finding the slow parts of your code
Like using a stopwatch to time each step of cooking!
The Simple Way: system.time()
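# How long does this take?
system.time({
  result <- sum(as.numeric(1:10000000)) # as.numeric() guards against integer overflow
})
# Example output (your numbers will differ):
#    user  system elapsed
#    0.05    0.00    0.05
- user: CPU time spent running your R code
- system: CPU time the operating system spent on behalf of your code (for example, reading files)
- elapsed: Real wall-clock time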
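The value returned by `system.time()` is a named numeric vector, so you can pull out individual components and use them in your own code. A minimal sketch, with a made-up workload:

```r
# Time a deliberately loop-heavy computation
timing <- system.time({
  total <- 0
  for (i in 1:500000) total <- total + sqrt(i)
})
print(timing)

# Pull out individual components by name
elapsed_seconds <- timing[["elapsed"]]
cpu_seconds     <- timing[["user.self"]] + timing[["sys.self"]]

cat("wall clock:", elapsed_seconds, "s | CPU:", cpu_seconds, "s\n")
```

If elapsed time is much larger than CPU time, the code is probably waiting on something (disk, network, another process) rather than computing.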
The Pro Way: Rprof()
# Start profiling
Rprof("my_profile.out")
# Run your code
my_slow_function()
# Stop profiling
Rprof(NULL)
# See results
summaryRprof("my_profile.out")
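Here is a runnable sketch of the same workflow. `busy_work()` is a hypothetical stand-in for `my_slow_function()`; it keeps the CPU busy for about half a second so the sampler has something to record. This assumes your R build supports `Rprof()`, which is the case for standard installations.

```r
# Hypothetical slow function: burns CPU for roughly `seconds` seconds
busy_work <- function(seconds = 0.5) {
  deadline <- proc.time()[["elapsed"]] + seconds
  total <- 0
  while (proc.time()[["elapsed"]] < deadline) {
    total <- total + sum(sqrt(runif(10000)))
  }
  total
}

profile_file <- tempfile(fileext = ".out")

Rprof(profile_file, interval = 0.01)  # sample the call stack every 10 ms
invisible(busy_work())
Rprof(NULL)                           # stop profiling

profile <- summaryRprof(profile_file)
print(head(profile$by.self))   # time spent inside each function itself
print(profile$sampling.time)   # total time covered by the samples

unlink(profile_file)
```

`by.self` is usually the first table to read: it ranks functions by the time spent in their own code, not in the functions they call.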
Visual Profiling with profvis
# Install once
install.packages("profvis")
# Profile with pretty pictures!
library(profvis)
profvis({
# Your code here
data <- read.csv("big_file.csv")
result <- process(data)
})
This shows a beautiful flame graph of where time is spent!
Reading Profile Results
graph TD
  A["Total Time: 10 sec"] --> B["read_data: 6 sec"]
  A --> C["process: 3 sec"]
  A --> D["save: 1 sec"]
  B --> E["Focus here first!"]
The 80/20 Rule: Usually 20% of your code takes 80% of the time. Find that 20%!
Benchmarking with microbenchmark
library(microbenchmark)
# Compare two approaches
microbenchmark(
loop = {
s <- 0
for(i in 1:1000) s <- s + i
},
vectorized = sum(1:1000),
times = 100
)
This runs each version 100 times and shows you statistics!
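Before comparing speed, it is worth confirming that both versions return the same answer. The sketch below does that first, then benchmarks only if the `microbenchmark` package is installed. The helper name `loop_sum` is made up for this example.

```r
# Loop version wrapped in a function so it can be reused
loop_sum <- function(n) {
  s <- 0
  for (i in seq_len(n)) s <- s + i
  s
}

# Check both versions agree before timing them
stopifnot(loop_sum(1000) == sum(1:1000))

if (requireNamespace("microbenchmark", quietly = TRUE)) {
  bm <- microbenchmark::microbenchmark(
    loop       = loop_sum(1000),
    vectorized = sum(1:1000),
    times      = 100
  )
  # One row per expression; the median is usually the most stable column
  print(summary(bm)[, c("expr", "median")])
} else {
  message("microbenchmark is not installed; skipping the benchmark")
}
```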
🎯 Key Insight: Never guess where your code is slow. Measure it!
🎓 Putting It All Together
The Performance Checklist
graph TD
  A["Slow Code?"] --> B["Profile First"]
  B --> C["Find Bottleneck"]
  C --> D{What's Slow?}
  D -->|Memory| E["Check Copies"]
  D -->|Loops| F["Try Vectorization"]
  D -->|Functions| G["Use Built-ins"]
  E --> H["Optimize"]
  F --> H
  G --> H
  H --> I["Profile Again"]
  I --> J{Fast Enough?}
  J -->|No| B
  J -->|Yes| K["Done!"]
Real-World Example
Before (Slow):
# Process 1 million rows
result <- c()
for(i in 1:nrow(data)) {
if(data$value[i] > 100) {
result <- c(result, data$value[i] * 2)
}
}
After (Fast):
# Vectorized approach
result <- data$value[data$value > 100] * 2
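Speed Improvement: often 100x or more on large data, because the loop re-copies result on every append. Measure on your own data to confirm!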
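When you swap a loop for a vectorized expression, check two things: that the results match, and how missing values behave. The sketch below uses made-up data to do both. It assumes `data` is a data frame with a numeric `value` column, as in the example above.

```r
set.seed(42)
data <- data.frame(value = runif(5000, min = 0, max = 200))

# Before: grow the result inside a loop
loop_result <- c()
for (i in seq_len(nrow(data))) {
  if (data$value[i] > 100) {
    loop_result <- c(loop_result, data$value[i] * 2)
  }
}

# After: one vectorized expression
vec_result <- data$value[data$value > 100] * 2
stopifnot(identical(loop_result, vec_result))

# Caveat: a logical index keeps NA positions, which() drops them
values <- c(50, 150, NA, 250)
print(values[values > 100] * 2)         # 300  NA 500
print(values[which(values > 100)] * 2)  # 300 500
```

The loop version would stop with an error on the `NA`, because `if` cannot decide on a missing condition. The vectorized version lets the `NA` through silently, so pick `which()` or `!is.na()` explicitly if your data can contain missing values.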
🌟 Summary
| Topic | Key Takeaway |
|---|---|
| Memory | Pre-allocate, remove unused, avoid copies |
| Vectorization | Do things all at once, not one by one |
| Optimization | Use built-ins, avoid growing objects |
| Profiling | Measure first, then optimize |
Your New Superpowers
- ✅ You know how to check memory usage
- ✅ You can write vectorized code
- ✅ You understand the apply family
- ✅ You can profile and find slow code
“The best code is code that doesn’t waste a single CPU cycle—just like the best chef doesn’t waste a single ingredient!”
Go forth and write FAST R code! 🚀
