Linear Regression in R: Building Your First Prediction Machine
The Story of the Fortune Teller
Imagine you have a magical crystal ball. But this crystal ball is special - it learns from the past to predict the future. You tell it: “When I study 2 hours, I score 70 marks. When I study 4 hours, I score 85 marks.” The crystal ball notices a pattern and says: “Ah! More study hours = higher scores. Let me draw a line through your data!”
That’s Linear Regression - drawing the best possible straight line through your data points to make predictions.
Formula Objects: Teaching R What to Predict
Before we can predict anything, we need to tell R what we want to predict and what we’ll use to predict it. We do this with a formula.
The Magic Recipe
A formula in R looks like this:
y ~ x
Think of it as saying: “I want to predict y using x”
The ~ symbol (called tilde) means “depends on” or “is predicted by”.
Real Examples
# Predict score based on hours studied
score ~ hours
# Predict house price based on size
price ~ size
# Multiple predictors? No problem!
salary ~ experience + education
# Everything as a predictor
mpg ~ .
Quick Reference
| Formula | Meaning |
|---|---|
| y ~ x | y depends on x |
| y ~ x1 + x2 | y depends on x1 AND x2 |
| y ~ . | y depends on ALL other columns |
| y ~ x - 1 | No intercept (line through origin) |
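Formulas are ordinary R objects, so you can store one in a variable and inspect it before fitting anything. A quick sketch, reusing the score/hours names from the examples above:

```r
# A formula is a first-class object in R
f <- score ~ hours
class(f)      # "formula"
all.vars(f)   # the variable names it mentions

# You can also build one from a string --
# handy when choosing predictors programmatically
f2 <- as.formula("score ~ hours")
```

Because a formula is just an object, you can pass the same one to several model-fitting functions without retyping it.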
Linear Regression: Drawing the Best Line
Now the exciting part! We use lm() - which stands for Linear Model.
Your First Regression
# Create some data
study_data <- data.frame(
hours = c(1, 2, 3, 4, 5),
score = c(55, 60, 70, 75, 85)
)
# Build the model
my_model <- lm(score ~ hours,
data = study_data)
That’s it! R just drew the best possible line through your data.
What Just Happened?
```mermaid
graph TD
    A["Your Data Points"] --> B["lm function"]
    B --> C["Finds Best Line"]
    C --> D["y = intercept + slope × x"]
    D --> E["Ready to Predict!"]
```
See Your Line
# What's the equation?
my_model
# Output:
# (Intercept)        hours
#        46.5          7.5
This tells us: score = 46.5 + 7.5 × hours
So if you study 6 hours: score = 46.5 + 7.5 × 6 = 91.5 marks!
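Rather than reading the numbers off the printed output, you can pull them out with coef(), which returns a named vector, and reproduce the prediction by hand:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

# Named vector: (Intercept) and hours
b <- coef(my_model)

# Predict the score for 6 hours of study by hand
pred6 <- unname(b["(Intercept)"] + b["hours"] * 6)
pred6   # 91.5
```

Doing the arithmetic yourself once is a good way to convince yourself there is no magic in the crystal ball: it really is just intercept + slope × x.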
Regression Summary: The Full Report Card
Want to know how good your prediction line is? Use summary().
Getting the Summary
summary(my_model)
Understanding the Output
The summary shows you several important things:
1. Coefficients Table
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   46.500      1.658   28.04 9.94e-05 ***
hours          7.500      0.500   15.00 0.000640 ***
- Estimate: The actual numbers in your equation
- Std. Error: How uncertain we are about each number
- t value: The Estimate divided by its Std. Error (bigger magnitude = stronger evidence the effect is real)
- Pr(>|t|): The p-value (smaller = more significant)
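These same numbers are available programmatically: summary() returns a list whose coefficients element is the whole table as a matrix, indexable by row and column name:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

coefs <- summary(my_model)$coefficients
coefs["hours", "Estimate"]   # the slope, 7.5
coefs["hours", "Pr(>|t|)"]   # the slope's p-value
```

This matters once you have many models: you can collect slopes and p-values in a loop instead of reading printed output.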
2. R-squared: Your Model’s Grade
Multiple R-squared: 0.9868
Think of R-squared as a percentage score:
- 0.9868 = 98.7% of the variation in scores is explained by study hours
- Higher is better (though a value near 1.0 on small data can signal overfitting!)
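R-squared is also just a field on the summary object, so you can grab it directly:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

s <- summary(my_model)
s$r.squared       # close to 0.987 for this data
s$adj.r.squared   # penalized for the number of predictors
```

Adjusted R-squared is the one to compare when models have different numbers of predictors, since plain R-squared never goes down when you add a variable.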
```mermaid
graph TD
    A["R² = 0"] --> B["Line explains nothing"]
    C["R² = 0.5"] --> D["Line explains 50%"]
    E["R² = 1.0"] --> F["Perfect prediction"]
```
Quick Summary Cheatsheet
| Metric | Good Sign |
|---|---|
| p-value | < 0.05 |
| R-squared | > 0.7 |
| Residual SE | Low number |
Prediction from Models: Using Your Crystal Ball
Now the fun part - making predictions!
The predict() Function
# New data to predict
new_students <- data.frame(
hours = c(6, 7, 8)
)
# Make predictions
predict(my_model,
newdata = new_students)
# Output: 91.5 99.0 106.5
Predictions with Confidence
Not 100% sure about your predictions? Get a range!
# Confidence interval
predict(my_model,
newdata = new_students,
interval = "confidence")
#     fit   lwr    upr
# 1  91.5 86.22  96.78
# 2  99.0 92.25 105.75
# 3 106.5 98.23 114.77
- fit: Your prediction
- lwr: Lower bound of the 95% interval
- upr: Upper bound of the 95% interval
Prediction vs Confidence Intervals
# For the mean response
interval = "confidence"
# For a single new observation
interval = "prediction"
Prediction intervals are wider because individual values vary more than averages.
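You can see the difference directly by requesting both interval types at the same point: the fit is identical, but the prediction interval is wider:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)
one_student <- data.frame(hours = 6)

conf_int <- predict(my_model, newdata = one_student,
                    interval = "confidence")
pred_int <- predict(my_model, newdata = one_student,
                    interval = "prediction")

conf_int[, "upr"] - conf_int[, "lwr"]   # narrower band
pred_int[, "upr"] - pred_int[, "lwr"]   # wider band
```

Use "confidence" when asking "what is the average score of students who study 6 hours?" and "prediction" when asking "what will this one student score?"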
Regression Diagnostics: Is Your Model Healthy?
Just like a doctor checks your health, we need to check our model’s health.
The 4 Key Checks
# Create 4 diagnostic plots
par(mfrow = c(2, 2))
plot(my_model)
This gives you 4 important plots:
1. Residuals vs Fitted
- Should look like random scatter
- No patterns allowed!
2. Normal Q-Q
- Points should follow the diagonal line
- Checks if errors are normally distributed
3. Scale-Location
- Should be a flat horizontal band
- Checks for constant variance
4. Residuals vs Leverage
- Identifies influential outliers
- Watch for points outside dashed lines
```mermaid
graph TD
    A["Run Diagnostics"] --> B{Patterns?}
    B -->|No Pattern| C["Model is Healthy!"]
    B -->|Pattern Found| D["Model Needs Help"]
    D --> E["Transform Data"]
    D --> F["Add Variables"]
    D --> G["Remove Outliers"]
```
Residual Analysis: Learning from Mistakes
A residual is the difference between what actually happened and what your model predicted.
Calculate Residuals
# Get residuals
residuals(my_model)
# Or equivalently
my_model$residuals
Residual = Actual - Predicted
# If actual score = 70
# And predicted = 67
# Residual = 70 - 67 = 3
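A quick sanity check you can always run: fitted values plus residuals must reconstruct the original data exactly:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

# Actual = Predicted + Residual, for every row
reconstructed <- fitted(my_model) + residuals(my_model)
unname(reconstructed)   # 55 60 70 75 85
```

If this identity ever fails, something is wrong with how the model or data was set up (for example, rows dropped due to missing values).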
What Residuals Tell Us
| Pattern | What It Means |
|---|---|
| Random scatter around 0 | Model is good! |
| Curve pattern | Need polynomial term |
| Funnel shape | Variance is not constant |
| Outliers | Some unusual data points |
Visualizing Residuals
# Simple residual plot
plot(my_model$fitted.values,
my_model$residuals)
abline(h = 0, col = "red")
# Should look like random dots
# around the red line
Checking Normality
# Histogram of residuals
hist(residuals(my_model))
# Should look like a bell curve
# Shapiro-Wilk test
shapiro.test(residuals(my_model))
# p-value > 0.05: consistent with normality
Influence Measures: Finding the Troublemakers
Some data points have more power than others. One weird point can pull your whole line in the wrong direction!
Three Key Measures
1. Leverage (Hat Values) How far a point is from the center of your x values.
hatvalues(my_model)
High leverage points are at the edges of your data.
2. Cook’s Distance The overall influence of each point on your model.
cooks.distance(my_model)
Rule of thumb: Watch points where Cook’s D > 4/n
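Applying that rule of thumb in code (with our tiny 5-row dataset, 4/n = 0.8):

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

cd <- cooks.distance(my_model)
n  <- nrow(study_data)

# Indices of points that cross the 4/n threshold
which(cd > 4 / n)
```

For this particular data no point crosses the line, but on real datasets this one-liner is a fast way to list the rows worth a second look.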
3. DFBETAS How much each point changes each coefficient.
dfbetas(my_model)
Visual Detection
# Plot Cook's distance
plot(my_model, which = 4)
# Points with high bars are
# influential!
The influence.measures() Function
Get everything at once:
influence.measures(my_model)
# Shows:
# - dfb.1_: Change in intercept
# - dfb.hour: Change in slope
# - dffit: Overall fit change
# - cov.r: Covariance ratio
# - cook.d: Cook's distance
# - hat: Leverage
What To Do With Influential Points?
```mermaid
graph TD
    A["Find Influential Point"] --> B{Is it valid data?}
    B -->|Yes, valid| C["Keep it, report it"]
    B -->|Data entry error| D["Fix or remove"]
    B -->|Outlier| E["Run model with and without"]
    E --> F["Compare results"]
```
Putting It All Together: The Complete Workflow
# 1. FORMULA: Define what to predict
formula <- score ~ hours
# 2. FIT: Build the model
model <- lm(formula, data = my_data)
# 3. SUMMARY: Check performance
summary(model)
# 4. DIAGNOSE: Check assumptions
par(mfrow = c(2,2))
plot(model)
# 5. RESIDUALS: Analyze errors
hist(residuals(model))
# 6. INFLUENCE: Find outliers
influence.measures(model)
# 7. PREDICT: Make predictions!
predict(model, newdata = new_data)
Key Takeaways
- Formula Objects (y ~ x) tell R what to predict
- lm() builds the linear model
- summary() shows how well it works
- predict() makes new predictions
- Diagnostics check if assumptions are met
- Residuals show where the model makes mistakes
- Influence Measures find powerful outliers
Your Model is Your Crystal Ball
You’ve learned to:
- Write formulas that tell R your prediction goal
- Build models that learn patterns from data
- Read summaries to know how good your predictions are
- Predict new values with confidence intervals
- Check your model’s health with diagnostics
- Analyze residuals to spot problems
- Find influential points that might cause trouble
Now go predict the future! Just remember: your predictions are only as good as the pattern in your data. If the future breaks the pattern, even the best model won’t see it coming.
Happy Modeling!
