Linear Regression in R: Building Your First Prediction Machine
The Story of the Fortune Teller
Imagine you have a magical crystal ball. But this crystal ball is special - it learns from the past to predict the future. You tell it: “When I study 2 hours, I score 70 marks. When I study 4 hours, I score 85 marks.” The crystal ball notices a pattern and says: “Ah! More study hours = higher scores. Let me draw a line through your data!”
That’s Linear Regression - drawing the best possible straight line through your data points to make predictions.
Formula Objects: Teaching R What to Predict
Before we can predict anything, we need to tell R what we want to predict and what we’ll use to predict it. We do this with a formula.
The Magic Recipe
A formula in R looks like this:
y ~ x
Think of it as saying: “I want to predict y using x”
The ~ symbol (called tilde) means “depends on” or “is predicted by”.
Real Examples
# Predict score based on hours studied
score ~ hours
# Predict house price based on size
price ~ size
# Multiple predictors? No problem!
salary ~ experience + education
# Everything as a predictor
mpg ~ .
Quick Reference
| Formula | Meaning |
|---|---|
| y ~ x | y depends on x |
| y ~ x1 + x2 | y depends on x1 AND x2 |
| y ~ . | y depends on ALL other columns |
| y ~ x - 1 | No intercept (line through origin) |
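Formulas are ordinary R objects, so you can store one in a variable and inspect it before fitting anything. A quick sketch, reusing the score/hours names from the examples above:

```r
# A formula is a first-class object in R
f <- score ~ hours
class(f)      # "formula"
all.vars(f)   # the variable names it mentions

# You can also build one from a string --
# handy when choosing predictors programmatically
f2 <- as.formula("score ~ hours")
```

Because a formula is just an object, you can pass the same one to several model-fitting functions without retyping it.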
Linear Regression: Drawing the Best Line
Now the exciting part! We use lm() - which stands for Linear Model.
Your First Regression
# Create some data
study_data <- data.frame(
hours = c(1, 2, 3, 4, 5),
score = c(55, 60, 70, 75, 85)
)
# Build the model
my_model <- lm(score ~ hours,
data = study_data)
That’s it! R just drew the best possible line through your data.
What Just Happened?
```mermaid
graph TD
    A["Your Data Points"] --> B["lm function"]
    B --> C["Finds Best Line"]
    C --> D["y = intercept + slope × x"]
    D --> E["Ready to Predict!"]
```
See Your Line
# What's the equation?
my_model
# Output:
# (Intercept)        hours
#        46.5          7.5
This tells us: score = 46.5 + 7.5 × hours
So if you study 6 hours: score = 46.5 + 7.5 × 6 = 91.5 marks!
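Rather than reading the numbers off the printed output, you can pull them out with coef(), which returns a named vector, and reproduce the prediction by hand:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

# Named vector: (Intercept) and hours
b <- coef(my_model)

# Predict the score for 6 hours of study by hand
pred6 <- unname(b["(Intercept)"] + b["hours"] * 6)
pred6   # 91.5
```

Doing the arithmetic yourself once is a good way to convince yourself there is no magic in the crystal ball: it really is just intercept + slope × x.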
Regression Summary: The Full Report Card
Want to know how good your prediction line is? Use summary().
Getting the Summary
summary(my_model)
Understanding the Output
The summary shows you several important things:
1. Coefficients Table
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   46.500      1.658   28.04 9.94e-05 ***
hours          7.500      0.500   15.00 0.000640 ***
- Estimate: The actual numbers in your equation
- Std. Error: How uncertain we are about each number
- t value: The Estimate divided by its Std. Error (bigger magnitude = stronger evidence the effect is real)
- Pr(>|t|): The p-value (smaller = more significant)
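These same numbers are available programmatically: summary() returns a list whose coefficients element is the whole table as a matrix, indexable by row and column name:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

coefs <- summary(my_model)$coefficients
coefs["hours", "Estimate"]   # the slope, 7.5
coefs["hours", "Pr(>|t|)"]   # the slope's p-value
```

This matters once you have many models: you can collect slopes and p-values in a loop instead of reading printed output.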
2. R-squared: Your Model’s Grade
Multiple R-squared: 0.9868
Think of R-squared as a percentage score:
- 0.9868 = 98.7% of the variation in scores is explained by study hours
- Higher is better (though a value near 1.0 on small data can signal overfitting!)
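R-squared is also just a field on the summary object, so you can grab it directly:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

s <- summary(my_model)
s$r.squared       # close to 0.987 for this data
s$adj.r.squared   # penalized for the number of predictors
```

Adjusted R-squared is the one to compare when models have different numbers of predictors, since plain R-squared never goes down when you add a variable.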
```mermaid
graph TD
    A["R² = 0"] --> B["Line explains nothing"]
    C["R² = 0.5"] --> D["Line explains 50%"]
    E["R² = 1.0"] --> F["Perfect prediction"]
```
Quick Summary Cheatsheet
| Metric | Good Sign |
|---|---|
| p-value | < 0.05 |
| R-squared | > 0.7 |
| Residual SE | Low number |
Prediction from Models: Using Your Crystal Ball
Now the fun part - making predictions!
The predict() Function
# New data to predict
new_students <- data.frame(
hours = c(6, 7, 8)
)
# Make predictions
predict(my_model,
newdata = new_students)
# Output: 91.5 99.0 106.5
Predictions with Confidence
Not 100% sure about your predictions? Get a range!
# Confidence interval
predict(my_model,
newdata = new_students,
interval = "confidence")
#     fit   lwr    upr
# 1  91.5 86.22  96.78
# 2  99.0 92.25 105.75
# 3 106.5 98.23 114.77
- fit: Your prediction
- lwr: Lower bound of the 95% interval
- upr: Upper bound of the 95% interval
Prediction vs Confidence Intervals
# For the mean response
interval = "confidence"
# For a single new observation
interval = "prediction"
Prediction intervals are wider because individual values vary more than averages.
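You can see the difference directly by requesting both interval types at the same point: the fit is identical, but the prediction interval is wider:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)
one_student <- data.frame(hours = 6)

conf_int <- predict(my_model, newdata = one_student,
                    interval = "confidence")
pred_int <- predict(my_model, newdata = one_student,
                    interval = "prediction")

conf_int[, "upr"] - conf_int[, "lwr"]   # narrower band
pred_int[, "upr"] - pred_int[, "lwr"]   # wider band
```

Use "confidence" when asking "what is the average score of students who study 6 hours?" and "prediction" when asking "what will this one student score?"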
Regression Diagnostics: Is Your Model Healthy?
Just like a doctor checks your health, we need to check our model’s health.
The 4 Key Checks
# Create 4 diagnostic plots
par(mfrow = c(2, 2))
plot(my_model)
This gives you 4 important plots:
1. Residuals vs Fitted
- Should look like random scatter
- No patterns allowed!
2. Normal Q-Q
- Points should follow the diagonal line
- Checks if errors are normally distributed
3. Scale-Location
- Should be a flat horizontal band
- Checks for constant variance
4. Residuals vs Leverage
- Identifies influential outliers
- Watch for points outside dashed lines
```mermaid
graph TD
    A["Run Diagnostics"] --> B{Patterns?}
    B -->|No Pattern| C["Model is Healthy!"]
    B -->|Pattern Found| D["Model Needs Help"]
    D --> E["Transform Data"]
    D --> F["Add Variables"]
    D --> G["Remove Outliers"]
```
Residual Analysis: Learning from Mistakes
A residual is the difference between what actually happened and what your model predicted.
Calculate Residuals
# Get residuals
residuals(my_model)
# Or equivalently
my_model$residuals
Residual = Actual - Predicted
# If actual score = 70
# And predicted = 67
# Residual = 70 - 67 = 3
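A quick sanity check you can always run: fitted values plus residuals must reconstruct the original data exactly:

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

# Actual = Predicted + Residual, for every row
reconstructed <- fitted(my_model) + residuals(my_model)
unname(reconstructed)   # 55 60 70 75 85
```

If this identity ever fails, something is wrong with how the model or data was set up (for example, rows dropped due to missing values).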
What Residuals Tell Us
| Pattern | What It Means |
|---|---|
| Random scatter around 0 | Model is good! |
| Curve pattern | Need polynomial term |
| Funnel shape | Variance is not constant |
| Outliers | Some unusual data points |
Visualizing Residuals
# Simple residual plot
plot(my_model$fitted.values,
my_model$residuals)
abline(h = 0, col = "red")
# Should look like random dots
# around the red line
Checking Normality
# Histogram of residuals
hist(residuals(my_model))
# Should look like a bell curve
# Shapiro-Wilk test
shapiro.test(residuals(my_model))
# p-value > 0.05: consistent with normality
Influence Measures: Finding the Troublemakers
Some data points have more power than others. One weird point can pull your whole line in the wrong direction!
Three Key Measures
1. Leverage (Hat Values) How far a point is from the center of your x values.
hatvalues(my_model)
High leverage points are at the edges of your data.
2. Cook’s Distance The overall influence of each point on your model.
cooks.distance(my_model)
Rule of thumb: Watch points where Cook’s D > 4/n
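Applying that rule of thumb in code (with our tiny 5-row dataset, 4/n = 0.8):

```r
# Same data and model as above
study_data <- data.frame(
  hours = c(1, 2, 3, 4, 5),
  score = c(55, 60, 70, 75, 85)
)
my_model <- lm(score ~ hours, data = study_data)

cd <- cooks.distance(my_model)
n  <- nrow(study_data)

# Indices of points that cross the 4/n threshold
which(cd > 4 / n)
```

For this particular data no point crosses the line, but on real datasets this one-liner is a fast way to list the rows worth a second look.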
3. DFBETAS How much each point changes each coefficient.
dfbetas(my_model)
Visual Detection
# Plot Cook's distance
plot(my_model, which = 4)
# Points with high bars are
# influential!
The influence.measures() Function
Get everything at once:
influence.measures(my_model)
# Shows:
# - dfb.1_: Change in intercept
# - dfb.hour: Change in slope
# - dffit: Overall fit change
# - cov.r: Covariance ratio
# - cook.d: Cook's distance
# - hat: Leverage
What To Do With Influential Points?
```mermaid
graph TD
    A["Find Influential Point"] --> B{Is it valid data?}
    B -->|Yes, valid| C["Keep it, report it"]
    B -->|Data entry error| D["Fix or remove"]
    B -->|Outlier| E["Run model with and without"]
    E --> F["Compare results"]
```
Putting It All Together: The Complete Workflow
# 1. FORMULA: Define what to predict
formula <- score ~ hours
# 2. FIT: Build the model
model <- lm(formula, data = my_data)
# 3. SUMMARY: Check performance
summary(model)
# 4. DIAGNOSE: Check assumptions
par(mfrow = c(2,2))
plot(model)
# 5. RESIDUALS: Analyze errors
hist(residuals(model))
# 6. INFLUENCE: Find outliers
influence.measures(model)
# 7. PREDICT: Make predictions!
predict(model, newdata = new_data)
Key Takeaways
- Formula Objects (y ~ x) tell R what to predict
- lm() builds the linear model
- summary() shows how well it works
- predict() makes new predictions
- Diagnostics check if assumptions are met
- Residuals show where the model makes mistakes
- Influence Measures find powerful outliers
Your Model is Your Crystal Ball
You’ve learned to:
- Write formulas that tell R your prediction goal
- Build models that learn patterns from data
- Read summaries to know how good your predictions are
- Predict new values with confidence intervals
- Check your model’s health with diagnostics
- Analyze residuals to spot problems
- Find influential points that might cause trouble
Now go predict the future! Just remember: your predictions are only as good as the pattern in your data. If the future breaks the pattern, even the best model won’t see it coming.
Happy Modeling!
