📊 Distribution Analysis in R: Your Data Detective Toolkit
Imagine you’re a detective. Your job? To understand the story hidden inside numbers. Today, we’ll learn three super-powers that help us crack the case: Quantiles, Correlation, and Scaling!
🎯 The Big Picture
Think of your data like a classroom of kids standing in a line from shortest to tallest.
- Quantiles help you find who’s in the middle, who’s really tall, and who’s really short.
- Correlation tells you if tall kids also tend to have big feet (do two things go together?).
- Scaling is like converting everyone’s height to the same measuring tape so we can compare fairly.
📏 Part 1: Quantile Functions — Finding the Landmarks
What Are Quantiles?
Imagine 100 kids standing in a line from shortest to tallest. Quantiles are like signposts telling you where you are in the line.
- 25th percentile (Q1): 25 kids are shorter than you
- 50th percentile (Median): You’re right in the middle!
- 75th percentile (Q3): 75 kids are shorter than you
🧪 R Example: Finding Quantiles
# Heights of 10 students (in cm)
heights <- c(120, 125, 130, 135, 140,
145, 150, 155, 160, 165)
# Find the median (50th percentile)
median(heights)
# Result: 142.5
# Find Q1, Median, Q3
quantile(heights, c(0.25, 0.50, 0.75))
#    25%    50%    75%
# 131.25 142.50 153.75
📦 The quantile() Function
quantile(x, probs)
- x = your data (a numeric vector)
- probs = which percentiles you want, written as fractions from 0 to 1
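Beyond x and probs, quantile() takes a few optional arguments that are worth knowing. The sketch below reuses the ten heights from the earlier example; the variable names (q_default, q_type1, and so on) are only illustrative.

```r
# Same ten heights as in the example above (cm)
heights <- c(120, 125, 130, 135, 140, 145, 150, 155, 160, 165)

# Default method (type = 7) interpolates between neighbouring values
q_default <- quantile(heights, probs = c(0.25, 0.75))
print(q_default)   # 131.25 and 153.75

# type = 1 never interpolates; it always returns an observed value
q_type1 <- quantile(heights, probs = c(0.25, 0.75), type = 1)
print(q_type1)     # 130 and 155

# names = FALSE drops the "25%" / "75%" labels from the result
q_plain <- quantile(heights, probs = c(0.25, 0.75), names = FALSE)

# na.rm = TRUE is required when the data contain missing values
heights_na <- c(heights, NA)
q_na <- quantile(heights_na, probs = 0.5, na.rm = TRUE)
print(q_na)        # 142.5
```

Different type values can give slightly different answers on small samples, so it helps to say which method you used when reporting quartiles.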
🎁 Special Quantiles You’ll Use Often
| Name | Percentile | R Code |
|---|---|---|
| Minimum | 0% | quantile(x, 0) |
| Q1 | 25% | quantile(x, 0.25) |
| Median | 50% | quantile(x, 0.50) |
| Q3 | 75% | quantile(x, 0.75) |
| Maximum | 100% | quantile(x, 1) |
🔍 Quick Summary with fivenum()
fivenum(heights)
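# [1] 120.0 130.0 142.5 155.0 165.0
This gives you the minimum, lower hinge, median, upper hinge, and maximum, all at once! The hinges are Tukey's version of Q1 and Q3. They are usually close to what quantile() returns, but they can differ slightly because the two functions use different rules.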
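To see how the two summaries line up, here is a small side-by-side comparison on the same heights. The data frame and its column names are made up for display purposes.

```r
heights <- c(120, 125, 130, 135, 140, 145, 150, 155, 160, 165)

# Tukey's five-number summary (min, lower hinge, median, upper hinge, max)
hinges <- fivenum(heights)

# The same five positions using quantile()'s default method
quarts <- quantile(heights, c(0, 0.25, 0.5, 0.75, 1), names = FALSE)

comparison <- data.frame(
  stat     = c("min", "lower", "median", "upper", "max"),
  fivenum  = hinges,
  quantile = quarts
)
print(comparison)

# Interquartile range (Q3 - Q1), based on quantile()'s default method
iqr_value <- IQR(heights)
print(iqr_value)   # 22.5
```

The minimum, median, and maximum always agree; only the two middle landmarks can differ, and the gap shrinks as the sample grows.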
🔗 Part 2: Correlation and Covariance — Do Things Move Together?
The Story
Imagine two friends: Height and Shoe Size. When one friend gets bigger, does the other friend also get bigger? That’s what correlation measures!
Three Types of Relationships
- Positive correlation: 📈 Both go up together
- Negative correlation: 📉 One goes up, the other goes down
- No correlation: 🎲 No pattern
🧪 R Example: Correlation
# Heights and shoe sizes of 5 people
height <- c(150, 160, 170, 180, 190)
shoe_size <- c(6, 7, 8, 9, 10)
# Calculate correlation
cor(height, shoe_size)
# Result: 1 (perfect positive!)
Understanding Correlation Values
| Value | Meaning | Example |
|---|---|---|
| +1 | Perfect positive | Height ↔ Shoe size in the sample data above |
| 0 | No relationship | Shoe size ↔ Favorite color |
| -1 | Perfect negative | Money spent ↔ Money left from a fixed budget |
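One catch: cor() measures straight-line relationships by default (Pearson correlation). The sketch below uses made-up numbers to show that a relationship can be perfectly predictable yet still score weaker than -1 when it follows a curve. Speed and travel time over a fixed distance is one such case.

```r
# Speeds (km/h) and the time (hours) needed to cover a fixed 120 km
speed <- c(20, 40, 60, 80, 120)
travel_time <- 120 / speed            # 6, 3, 2, 1.5, 1

# Pearson correlation: strongly negative, but not -1 (the curve is not a line)
r_pearson <- cor(speed, travel_time)
print(r_pearson)                      # roughly -0.86

# Spearman correlation only looks at rank order, so it sees a perfect pattern
r_spearman <- cor(speed, travel_time, method = "spearman")
print(r_spearman)                     # -1

# A true straight line with negative slope gives exactly -1
hours <- c(0, 1, 2, 3, 4)
remaining <- 120 - 20 * hours         # km left when driving at a steady 20 km/h
r_linear <- cor(hours, remaining)
print(r_linear)                       # -1
```

So a value well short of -1 or +1 does not always mean a weak relationship; plotting the data is a quick way to check for curves.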
🎯 Covariance: Correlation’s Cousin
Covariance also measures whether things move together, but its value depends on the units of both variables (here, cm × shoe sizes), which makes it harder to interpret.
# Calculate covariance
cov(height, shoe_size)
# Result: 25
# Correlation is easier to understand!
# It's always between -1 and +1
When to Use Which?
- Correlation (cor()): When you want to know HOW STRONG the relationship is (always -1 to +1)
- Covariance (cov()): When you need the actual units (used in advanced math)
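The link between the two can be checked directly: correlation is covariance divided by both standard deviations. This short sketch also shows that changing units changes the covariance but not the correlation. The names height_cm, cov_m, and so on are illustrative.

```r
height_cm <- c(150, 160, 170, 180, 190)
shoe_size <- c(6, 7, 8, 9, 10)

# Covariance depends on the units you measure in
cov_cm <- cov(height_cm, shoe_size)         # 25   (heights in cm)
cov_m  <- cov(height_cm / 100, shoe_size)   # 0.25 (same heights in metres)

# Correlation does not care about units
cor_cm <- cor(height_cm, shoe_size)         # 1
cor_m  <- cor(height_cm / 100, shoe_size)   # 1

# Correlation = covariance / (SD of x * SD of y)
cor_by_hand <- cov_cm / (sd(height_cm) * sd(shoe_size))
print(cor_by_hand)                          # 1
```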
📊 Correlation Matrix: Many Variables at Once
# Three variables
age <- c(25, 30, 35, 40, 45)
income <- c(30, 45, 60, 75, 90)
savings <- c(5, 12, 20, 30, 42)
# Create a data frame
data <- data.frame(age, income, savings)
# See all correlations at once (rounded to 2 decimals)
round(cor(data), 2)
#          age income savings
# age     1.00   1.00    0.99
# income  1.00   1.00    0.99
# savings 0.99   0.99    1.00
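Real data often has gaps. By default, a single NA makes every correlation that involves that column come out as NA. The sketch below uses a hypothetical survey data frame (the three variables above with one income value missing) to show the use argument of cor().

```r
# Hypothetical data: same variables as above, one income value missing
survey <- data.frame(
  age     = c(25, 30, 35, 40, 45),
  income  = c(30, 45, NA, 75, 90),
  savings = c(5, 12, 20, 30, 42)
)

# Default behaviour: any pair involving income is NA
m_all <- cor(survey)
print(m_all)

# Drop every row that has a missing value, then correlate
m_complete <- cor(survey, use = "complete.obs")
print(round(m_complete, 2))

# Use as many rows as possible for each pair separately
m_pair <- cor(survey, use = "pairwise.complete.obs")
print(round(m_pair, 2))
```

With "pairwise.complete.obs", different cells of the matrix can be based on different rows, so treat it with some care when the amount of missing data is large.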
⚖️ Part 3: Scaling Data — Making Fair Comparisons
The Problem
Imagine comparing:
- A test score: 85 out of 100
- A race time: 12 seconds
- A weight: 50 kg
How do you compare these? They’re all in different units! Scaling puts everything on the same measuring stick.
Two Popular Scaling Methods
- Z-Score / Standardization: Mean = 0, SD = 1
- Min-Max Normalization: Range 0 to 1
🧪 Z-Score Scaling with scale()
The Z-score tells you: “How many steps away from average are you?” Here, one step is one standard deviation.
# Test scores
scores <- c(70, 80, 90, 100, 110)
# Scale the data
scaled_scores <- scale(scores)
print(scaled_scores)
# [,1]
# [1,] -1.2649111
# [2,] -0.6324555
# [3,] 0.0000000
# [4,] 0.6324555
# [5,] 1.2649111
Understanding Z-Scores
| Z-Score | What It Means |
|---|---|
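| 0 | Exactly average |
| +1 | One standard deviation above average |
| -1 | One standard deviation below average |
| +2 | Two standard deviations above average (unusually high for bell-shaped data) |
| -2 | Two standard deviations below average (unusually low for bell-shaped data) |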
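There is no magic inside scale(): with its default settings it subtracts the mean and divides by the standard deviation. Here is a minimal check by hand, with z_manual and z_scale as illustrative names.

```r
scores <- c(70, 80, 90, 100, 110)

# Z-score by hand: (value - mean) / standard deviation
z_manual <- (scores - mean(scores)) / sd(scores)

# scale() returns a one-column matrix; as.numeric() turns it into a vector
z_scale <- as.numeric(scale(scores))

print(all.equal(z_manual, z_scale))   # TRUE

# scale() also remembers what it subtracted and divided by
scaled <- scale(scores)
center_used <- attr(scaled, "scaled:center")   # 90 (the mean)
sd_used     <- attr(scaled, "scaled:scale")    # about 15.81 (the SD)
```

Keeping the stored center and scale is handy if you later need to convert Z-scores back to the original units.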
🎯 Min-Max Scaling: 0 to 1
This squishes all values between 0 and 1.
# Custom min-max function
min_max <- function(x) {
(x - min(x)) / (max(x) - min(x))
}
# Apply it
scores <- c(70, 80, 90, 100, 110)
min_max(scores)
# [1] 0.00 0.25 0.50 0.75 1.00
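The simple function above breaks in two situations: when every value is the same (it divides by zero) and when the data contain NA. Below is one possible, slightly more defensive version. The name min_max_safe and the choice to return 0 for constant data are assumptions made for this sketch, not a standard R function.

```r
# A more defensive min-max scaler (hypothetical helper)
min_max_safe <- function(x, na.rm = TRUE) {
  rng <- range(x, na.rm = na.rm)    # c(min, max)
  width <- rng[2] - rng[1]

  # If NAs were not removed, the range itself is NA: nothing can be scaled
  if (is.na(width)) {
    return(rep(NA_real_, length(x)))
  }

  # All values identical: scaling is undefined, so return 0 by convention
  if (width == 0) {
    return(ifelse(is.na(x), NA_real_, 0))
  }

  (x - rng[1]) / width
}

min_max_safe(c(70, 80, 90, 100, 110))   # 0.00 0.25 0.50 0.75 1.00
min_max_safe(c(5, 5, 5))                # 0 0 0
min_max_safe(c(1, NA, 3))               # 0 NA 1
```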
When to Use Each?
| Method | Best For | Example |
|---|---|---|
| Z-Score | Comparing how unusual values are | “How far from average?” |
| Min-Max | When you need 0-1 range | Machine learning inputs |
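Back to the original problem of test scores, race times, and weights: scale() works on a whole data frame, column by column. The five rows below are invented purely for illustration.

```r
# Made-up results for five students, each column in a different unit
results <- data.frame(
  test_score = c(85, 70, 92, 60, 78),            # points out of 100
  race_time  = c(12.0, 14.5, 11.2, 15.1, 13.0),  # seconds
  weight     = c(50, 62, 55, 70, 58)             # kg
)

# Standardize every column: each ends up with mean 0 and SD 1
z <- scale(results)
print(round(z, 2))

col_means <- colMeans(z)       # all (numerically) zero
col_sds   <- apply(z, 2, sd)   # all one
```

After scaling, the columns are unit-free, so a Z-score of +1 means the same thing in each of them: one standard deviation above that column's average. Note that for race times a lower (negative) Z-score is the better result.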
🎯 Quick Reference: All Functions
# QUANTILES
quantile(x, 0.5) # Get median
quantile(x, c(0.25, 0.75)) # Get Q1 and Q3
fivenum(x) # Min, lower hinge, Median, upper hinge, Max
# CORRELATION & COVARIANCE
cor(x, y) # Correlation (-1 to +1)
cov(x, y) # Covariance (depends on the units of x and y)
cor(data_frame) # Correlation matrix
# SCALING
scale(x) # Z-score standardization
# Custom min-max: (x - min) / (max - min)
🏆 You Did It!
You just learned three powerful detective tools:
- Quantiles: Find where values stand in the lineup
- Correlation: Discover if things move together
- Scaling: Put everything on the same measuring stick
Now you can analyze any dataset like a pro! 🎉
🧠 Remember This Story
A teacher wanted to understand her class better. She used quantiles to find who scored in the top 25%. She checked correlation to see if study hours predicted test scores. Finally, she used scaling to fairly compare math and art scores. The end!
