The Sorting Hat for Data: Unsupervised Learning in R
Imagine you have a giant box of LEGO pieces: all different colors, shapes, and sizes. Nobody tells you how to sort them. But somehow, you figure out which pieces belong together. That's exactly what unsupervised learning does with data!
The Big Picture
Unsupervised learning is like being a detective with no clues about what you're looking for. You just look at the data and find hidden patterns all by yourself.
No labels. No answers. Just discovery.
| Approach | The Task You're Given |
|---|---|
| Supervised Learning | "Here are cats and dogs. Learn!" |
| Unsupervised Learning | "Here's data. Find patterns!" |
Our Analogy: Think of data as guests at a party. Unsupervised learning figures out who naturally belongs together, without anyone wearing name tags!
Distance Measures: How Close Are Two Points?
Before we group anything, we need to answer: how do we measure "closeness"?
The Ruler for Data
Imagine you're in a city. How far is the coffee shop?
Euclidean Distance: the bird flies straight there (the shortest path)
# Two friends' locations
alice <- c(2, 3) # x=2, y=3
bob <- c(5, 7) # x=5, y=7
# How far apart?
dist_euclidean <- sqrt(
(5-2)^2 + (7-3)^2
)
# Result: 5 units
Manhattan Distance: walk on the streets (like NYC blocks)
# Same friends
dist_manhattan <- abs(5-2) + abs(7-3)
# Result: 7 units (3 blocks + 4 blocks)
Quick Comparison
| Distance Type | How It Works | Best For |
|---|---|---|
| Euclidean | Straight line | Continuous data |
| Manhattan | Grid path | City-block problems |
| Cosine | Angle between vectors | Text similarity |
R's Built-in Help
# Create sample data
points <- matrix(
c(0,0, 3,4, 1,1),
nrow = 3, byrow = TRUE
)
# Calculate all distances at once
dist(points, method = "euclidean")
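dist() also handles Manhattan distance with method = "manhattan". Cosine similarity, on the other hand, is not built into base R's dist(), so here is a minimal hand-rolled sketch (cosine_sim is a helper name made up for this example):
# Manhattan distances for the same three points
dist(points, method = "manhattan")
# Cosine similarity: dot product divided by the two vector lengths
cosine_sim <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}
cosine_sim(c(3, 4), c(1, 1))  # ~0.99: the vectors point in nearly the same direction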
Principal Component Analysis (PCA)
The Problem: Too Many Ingredients!
Imagine a recipe with 100 ingredients. Most of them barely change the taste. PCA finds the ingredients that matter most.
The Magic Trick
PCA is like taking a 3D photo of a sculpture from the best angle, capturing most of the beauty in 2D.
100 Features → PCA Magic → 2-3 Important Features → Same Story, Simpler!
How It Works (Simple Version)
- Find the direction where data spreads the most
- That's PC1, your first "super-ingredient"
- Find the next direction (perpendicular to PC1)
- That's PC2, your second "super-ingredient"
- Keep going until you've captured enough variation
PCA in R
# Sample: Student scores in 5 subjects
scores <- data.frame(
math = c(90,85,75,95,70),
physics = c(88,82,78,92,68),
chemistry = c(85,80,72,90,65),
english = c(70,75,85,65,90),
history = c(72,78,88,60,92)
)
# Apply PCA
pca_result <- prcomp(
scores,
scale. = TRUE # Standardize!
)
# See importance
summary(pca_result)
What You Get
# PC1 might be "Science ability"
# PC2 might be "Humanities ability"
# Two numbers tell the whole story!
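To see what each "super-ingredient" is made of, you can inspect the loadings and scores stored in pca_result; a short follow-up to the code above:
# Loadings: how much each subject contributes to each PC
pca_result$rotation
# Scores: where each student sits on the new axes
pca_result$x[, 1:2]
# Students and subjects on one picture
biplot(pca_result)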
Key Insight
If PC1 + PC2 explain 90% of the variation, you can ignore the other components. That's the power of dimensionality reduction!
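summary(pca_result) already reports these percentages; if you want to compute them yourself, they come straight from the standard deviations stored in the result:
# Proportion of variance explained by each component
var_explained <- pca_result$sdev^2 / sum(pca_result$sdev^2)
# Cumulative share captured by the first few PCs
cumsum(var_explained)
# If the second entry is already around 0.90, PC1 + PC2 tell most of the story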
K-Means Clustering
The Party Organizer
Imagine you're organizing a party with 50 guests. You want to seat them at K tables so similar people sit together.
K-Means does exactly this!
The Algorithm Dance
- Step 1: Pick K random centers
- Step 2: Assign each point to its nearest center
- Step 3: Move each center to the middle of its group
- Step 4: If any points changed groups, repeat from Step 2; otherwise you're done: K clusters found
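To make those steps concrete, here is a tiny hand-rolled pass of the algorithm for K = 2 on made-up toy points (pts, centers, and cluster_id are names invented for this sketch); it only illustrates the assign-and-update loop and is no substitute for kmeans():
# Toy data: 6 points in two dimensions
pts <- matrix(c(1, 1,  1.5, 2,  2, 1,  8, 8,  9, 9,  8, 9),
              ncol = 2, byrow = TRUE)
# Step 1: pick two of the points as starting centers (K = 2)
centers <- pts[c(1, 4), ]
for (iter in 1:10) {
  # Step 2: assign each point to its nearest center (squared Euclidean distance)
  cluster_id <- apply(pts, 1, function(p) {
    which.min(c(sum((p - centers[1, ])^2),
                sum((p - centers[2, ])^2)))
  })
  # Step 3: move each center to the mean of its group
  new_centers <- rbind(colMeans(pts[cluster_id == 1, , drop = FALSE]),
                       colMeans(pts[cluster_id == 2, , drop = FALSE]))
  # Stop once the centers no longer move
  if (all(new_centers == centers)) break
  centers <- new_centers
}
cluster_id  # which cluster each point ended up in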
K-Means in R
# Customer data: age and spending
customers <- data.frame(
age = c(25,30,35,50,55,60,22,28),
spending = c(200,220,180,400,450,380,190,210)
)
# Create 2 groups
km <- kmeans(customers, centers = 2)
# See who's in which group
km$cluster
# Maybe: Young Savers vs. Mature Spenders
Choosing K: The Elbow Method
How many groups? Look for the "elbow" in the plot!
# Try different K values
set.seed(123)  # keep the runs reproducible
wss <- sapply(1:6, function(k) {
  kmeans(customers, centers = k, nstart = 10)$tot.withinss
})
# Plot it
plot(1:6, wss, type = "b",
     xlab = "Number of Clusters",
     ylab = "Total Within Sum of Squares")
# Look for the bend!
Watch Out!
| Problem | Solution |
|---|---|
| Results change each run | Use set.seed() and nstart |
| Different scales | Standardize first! |
| Weird shapes | Try hierarchical clustering or another method |
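Putting those fixes together, a more careful run on the customer data might look like this (a sketch: the seed value and number of random starts are arbitrary choices):
set.seed(123)                          # reproducible result
customers_scaled <- scale(customers)   # put age and spending on the same footing
km <- kmeans(customers_scaled, centers = 2, nstart = 25)  # keep the best of 25 starts
km$cluster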
Hierarchical Clustering
Building a Family Tree for Data
Remember the K-Means party? Hierarchical clustering is different. It builds a family tree showing how everyone is related.
Two Approaches
Bottom-Up (Agglomerative): start with individuals, merge them into families
Top-Down (Divisive): start with everyone together, split into groups
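In R, hclust() takes the bottom-up route. For the top-down route, one option is diana() from the cluster package (usually installed with R); a minimal sketch of the two calls, reusing the customer data from earlier:
# Bottom-up: agglomerative (base R)
hc_up <- hclust(dist(customers), method = "complete")
plot(hc_up)
# Top-down: divisive, via the cluster package
library(cluster)
hc_down <- diana(customers)
pltree(hc_down)  # draw its tree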
The Merging Process
- Step 1: Start with each point as its own cluster
- Step 2: Find the two closest clusters
- Step 3: Merge them into one
- Step 4: If more than one cluster remains, repeat from Step 2; otherwise the tree is complete
Hierarchical Clustering in R
# Same customer data
customers <- data.frame(
age = c(25,30,35,50,55,60,22,28),
spending = c(200,220,180,400,450,380,190,210)
)
# Calculate distances
d <- dist(customers)
# Build the tree
hc <- hclust(d, method = "complete")
# Draw it!
plot(hc)
Cutting the Tree
# Want 2 groups? Ask cutree to cut the tree into k = 2 clusters
groups <- cutree(hc, k = 2)
# Or cut at specific height
groups <- cutree(hc, h = 150)
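To see where a cut lands, you can outline the resulting groups directly on the dendrogram and count the members, reusing the hc and groups objects from above:
# Redraw the tree and box the 2 groups
plot(hc)
rect.hclust(hc, k = 2, border = "red")
# How many customers fell into each group?
table(groups)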
Linkage Methods
| Method | How It Measures the Distance Between Clusters |
|---|---|
| Single | Closest pair of points |
| Complete | Farthest pair of points |
| Average | Mean of all pairwise distances |
| Ward's | Minimizes within-cluster variance |
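Each row of that table corresponds to the method argument of hclust(); note that Ward's method is spelled "ward.D2" in R. A quick sketch comparing linkages on the same distance matrix d:
# Same distances, different linkage rules
hc_single   <- hclust(d, method = "single")
hc_complete <- hclust(d, method = "complete")
hc_average  <- hclust(d, method = "average")
hc_ward     <- hclust(d, method = "ward.D2")
# Put two trees side by side: the shapes can differ quite a bit
par(mfrow = c(1, 2))
plot(hc_single, main = "Single linkage")
plot(hc_complete, main = "Complete linkage")
par(mfrow = c(1, 1))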
When to Use What?
K-Means:
- You know how many groups you want
- Large datasets
- Roughly spherical clusters
Hierarchical:
- You don't know the group count
- You want to see the relationships
- Smaller datasets
Putting It All Together
The Complete Workflow
# 1. Load and prepare data
data <- scale(mydata) # Standardize! (mydata stands for your own numeric data frame)
# 2. Maybe reduce dimensions first
pca <- prcomp(data)
data_reduced <- pca$x[, 1:2] # Use PC1 & PC2
# 3. Choose clustering method
# For unknown K:
hc <- hclust(dist(data_reduced))
plot(hc) # Look at tree
# For known K:
km <- kmeans(data_reduced, centers = 3)
Visualization
# Color by cluster
plot(
data_reduced,
col = km$cluster,
pch = 19,
main = "My Clusters!"
)
# Add cluster centers
points(
km$centers,
col = 1:3, pch = 8, cex = 2
)
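As a sanity check, you can compare how the two methods carve up the same reduced data by cross-tabulating their labels; a sketch that reuses the hc and km objects from the workflow above:
# Cut the tree into the same number of groups as K-Means used
hc_groups <- cutree(hc, k = 3)
# Rows = K-Means clusters, columns = hierarchical clusters;
# counts concentrated in a few cells mean the two methods agree
table(kmeans = km$cluster, hierarchical = hc_groups)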
Quick Summary
| Technique | What It Does | Think Of It As |
|---|---|---|
| Distance Measures | Measures closeness | A ruler for data |
| PCA | Reduces dimensions | Finding best camera angle |
| K-Means | Groups into K clusters | Seating party guests |
| Hierarchical | Builds cluster tree | Creating family tree |
You're Ready!
You now understand the four pillars of unsupervised learning in R:
- Measure distances between points
- Simplify with PCA when needed
- Group with K-Means or Hierarchical clustering
- Visualize and interpret your discoveries
Remember: There's no "right answer" in unsupervised learning. You're an explorer discovering patterns in data. Sometimes the most interesting findings are the ones nobody expected!
"The goal is to turn data into information, and information into insight." - Carly Fiorina
Happy clustering!
