Regression

  • In a statistical regression you explain one variable (the criterion or dependent variable) by one or more other variables (the predictors or independent variables).
  • You regress the criterion variable on the predictor variables.
  • E.g.: you want to explain life expectancy by certain aspects of behavior (diet, physical activity) and of the environment (pollution, health care).
  • To do so, you set up a mathematical function that estimates a person's life expectancy from all the other variables.

Regression formula

\(y_i = \beta_0 + \beta_1X_i + e_i\)

\(y\) = Criterion variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
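
The formula can be illustrated numerically. The intercept, weight, and error terms below are purely made up for illustration; they are not taken from any model in these notes:

```r
# Hypothetical values to illustrate y_i = beta_0 + beta_1 * X_i + e_i
beta0 <- 2                    # intercept
beta1 <- 0.5                  # weight of the predictor
x     <- c(1, 2, 3)           # predictor values for subjects i = 1..3
e     <- c(0.1, -0.2, 0.05)   # error terms
y     <- beta0 + beta1 * x + e
y  # 2.60 2.80 3.55
```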

lm() function

  • The lm() function fits a regression model.
  • lm(formula, data)
  • Formulas are a basic data type that is applied in many R functions.
  • Basic structure: dependent variable ~ explanatory variables
    (e.g. y ~ x1 + x2)
  • data takes a data frame

lm(dist ~ speed, data = cars)

Example

fit <- lm(dist ~ speed, data = cars)
fit

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  

\(dist_i = -17.579 + 3.932 \cdot speed_i\)
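
The fitted equation can be checked against R's own predict() function; a small sketch (the speed value 10 is arbitrary):

```r
fit <- lm(dist ~ speed, data = cars)
# Predicted stopping distance for a speed of 10, computed by hand ...
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 10
# ... and via predict(); the two agree (about 21.74)
via_predict <- predict(fit, newdata = data.frame(speed = 10))
```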

Use summary() to get a detailed analysis

summary(fit)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Model fit:
\(F(1, 48) = 89.57; p < .001 ; R² = .65\)
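
These statistics do not have to be copied from the printout by hand; they can be extracted from the summary object (field names as documented for base R's summary.lm):

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)
s$r.squared       # multiple R-squared, ~0.651
s$adj.r.squared   # adjusted R-squared, ~0.644
s$fstatistic      # F value plus numerator and denominator df (1 and 48)
```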

Task

  • Take the mtcars dataset.
  • Calculate the correlation of mileage mpg and car weight wt.
  • Plot a scatterplot with ggplot (mileage on the x-axis and weight on the y-axis).
  • Add a regression line with (geom_smooth(method = "lm")).
  • Regress mileage mpg on car weight wt (that is, predict mileage by means of weight).

cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

ggplot(mtcars, aes(x = mpg, y = wt)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Task

  • Take the temp_carbon dataset from the dslabs library.
  • Make yourself familiar with the dataset: ?temp_carbon.
  • Plot a scatterplot with ggplot (year on the x-axis and temp_anomaly on the y-axis).
  • Add a regression line with (geom_smooth(method = "lm")).
  • Regress temp_anomaly on year (that is, predict temp_anomaly by means of year).

library(dslabs)
fit <- lm(temp_anomaly ~ year, 
          data = temp_carbon)
summary(fit)

Call:
lm(formula = temp_anomaly ~ year, data = temp_carbon)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32598 -0.11846 -0.01062  0.11185  0.43367 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.409e+01  6.750e-01  -20.87   <2e-16 ***
year         7.259e-03  3.463e-04   20.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1638 on 137 degrees of freedom
  (129 observations deleted due to missingness)
Multiple R-squared:  0.7623,    Adjusted R-squared:  0.7606 
F-statistic: 439.4 on 1 and 137 DF,  p-value: < 2.2e-16

ggplot(temp_carbon, aes(x = year, 
                        y = temp_anomaly)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Multiple predictors

  • A formula can take multiple predictors:
    y ~ x1 + x2
  • An interaction describes a relation between two (or more) predictors in which the influence of one predictor changes with the value of the other (e.g. weight and smoking both influence blood pressure, and the influence of smoking is even stronger for those who are overweight).
  • An interaction is modeled with a : sign:
    y ~ x1 + x2 + x1:x2
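
In R's formula language, x1 * x2 is shorthand for x1 + x2 + x1:x2, so both ways of writing the model give the same fit:

```r
# Both formulas specify the same model: two main effects plus their interaction
fit_long  <- lm(mpg ~ wt + cyl + wt:cyl, data = mtcars)
fit_short <- lm(mpg ~ wt * cyl,          data = mtcars)
all.equal(coef(fit_long), coef(fit_short))  # TRUE
```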

Example

Predict mileage mpg by weight wt, number of cylinders cyl, and their interaction

fit <- lm(mpg ~ wt + cyl + wt:cyl, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + cyl + wt:cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2288 -1.3495 -0.5042  1.4647  5.2344 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.3068     6.1275   8.863 1.29e-09 ***
wt           -8.6556     2.3201  -3.731 0.000861 ***
cyl          -3.8032     1.0050  -3.784 0.000747 ***
wt:cyl        0.8084     0.3273   2.470 0.019882 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared:  0.8606,    Adjusted R-squared:  0.8457 
F-statistic: 57.62 on 3 and 28 DF,  p-value: 4.231e-12

Task

  • Take the gapminder dataset from the dslabs library.
  • Predict infant mortality (infant_mortality) by year (year) and average number of children per woman (fertility).
  • Take the interaction into account.
  • How to interpret the estimates?
library(dslabs)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = gapminder)
summary(fit)

Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year, 
    data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.327 -11.540  -1.282  11.575 141.101 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -5.902e+02  8.588e+01  -6.873 6.72e-12 ***
year            2.872e-01  4.308e-02   6.666 2.77e-11 ***
fertility       3.574e+02  1.869e+01  19.130  < 2e-16 ***
year:fertility -1.708e-01  9.404e-03 -18.168  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.84 on 9088 degrees of freedom
  (1453 observations deleted due to missingness)
Multiple R-squared:  0.7071,    Adjusted R-squared:  0.707 
F-statistic:  7312 on 3 and 9088 DF,  p-value: < 2.2e-16
  • Although the previous computation is correct, the estimates are misleading. The intercept, for example, represents the expected value of the dependent variable when all predictors are zero (i.e. year is 0 and fertility is 0).
  • We can mend this by centering the predictor variables.
  • Centering is achieved by subtracting the mean of a variable from each measurement of that variable. That is, the value 0 now represents the average.
dat <- gapminder
dat$year <- dat$year - mean(dat$year, na.rm = TRUE)
# alternatively, use the scale function to center a variable:
dat$fertility <- scale(dat$fertility, scale = FALSE)

dat <- gapminder
dat$year <- scale(dat$year, scale = FALSE)
dat$fertility <- scale(dat$fertility, scale = FALSE)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = dat)
summary(fit)

Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year, 
    data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.327 -11.540  -1.282  11.575 141.101 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    53.452151   0.304781  175.38   <2e-16 ***
year           -0.410454   0.019561  -20.98   <2e-16 ***
fertility      17.796992   0.150837  117.99   <2e-16 ***
year:fertility -0.170849   0.009404  -18.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.84 on 9088 degrees of freedom
  (1453 observations deleted due to missingness)
Multiple R-squared:  0.7071,    Adjusted R-squared:  0.707 
F-statistic:  7312 on 3 and 9088 DF,  p-value: < 2.2e-16

Categorical predictors: comparing groups

  • Predictors can be categorical variables.
  • Examples of categorical variables are: students with vs. without special educational needs; gender; hair color.
  • They have to be factors so that R recognizes them as categorical.

Example

  • We predict mpg by wt, the transmission type am, and their interaction.
  • The transmission type is a categorical variable. By default am is not a factor, so we have to turn it into one first.
mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
mtcars$am_factor <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt_centered + am_factor + wt_centered:am_factor, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt_centered + am_factor + wt_centered:am_factor, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  19.2358     0.7357  26.147  < 2e-16 ***
wt_centered                  -3.7859     0.7856  -4.819 4.55e-05 ***
am_factorManual              -2.1677     1.4189  -1.528  0.13779    
wt_centered:am_factorManual  -5.2984     1.4447  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

Task

  • Take the heights dataset from the dslabs library.
  • Turn sex into a factor and regress height on sex.

heights$sex <- factor(heights$sex)
fit <- lm(height ~ sex, data = heights)
summary(fit)

Call:
lm(formula = height ~ sex, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.3148  -2.3148  -0.3148   2.6852  14.0606 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.9394     0.2363  274.82   <2e-16 ***
sexMale       4.3753     0.2687   16.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.645 on 1048 degrees of freedom
Multiple R-squared:  0.2019,    Adjusted R-squared:  0.2012 
F-statistic: 265.1 on 1 and 1048 DF,  p-value: < 2.2e-16
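
With a treatment contrast, the estimates map directly onto the group means, which can be verified (this assumes the heights data from the dslabs library, as in the task above):

```r
library(dslabs)
heights$sex <- factor(heights$sex)
fit <- lm(height ~ sex, data = heights)
group_means <- tapply(heights$height, heights$sex, mean)
# intercept = mean height of the reference group (Female), ~64.94
# slope     = Male - Female difference in mean height, ~4.38
coef(fit)[["(Intercept)"]]
group_means[["Male"]] - group_means[["Female"]]
```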

Contrasts

  • In the previous example, we compared the effect of an automatic against a manual gearshift. This is called a contrast. In this case, the intercept represented the mean mpg for the automatic gearshift and the predictor represented the change in mpg for manual gearshifts.

  • This type of contrast is called a “treatment” contrast.

  • An alternative approach is to let the intercept represent the average of ‘mpg’ across both types of gearshift, so that the predictor represents how much each type of gearshift deviates from that average.

  • This type of contrast is called a Helmert contrast.
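
The two coding schemes can be inspected directly. For a factor with two levels, treatment coding uses 0/1 and Helmert coding uses -1/+1, which is why the Helmert slope is exactly half the treatment slope (7.245 / 2 ≈ 3.622, the numbers shown in the example that follows):

```r
contr.treatment(2)  # second level coded 1, reference level 0
contr.helmert(2)    # levels coded -1 and +1

mtcars$am_factor <- factor(mtcars$am, labels = c("Automatic", "Manual"))
b_treatment <- coef(lm(mpg ~ am_factor, data = mtcars))[[2]]
contrasts(mtcars$am_factor) <- contr.helmert(2)
b_helmert <- coef(lm(mpg ~ am_factor, data = mtcars))[[2]]
c(b_treatment, b_helmert)  # ~7.245 and ~3.622
```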

Example

Fitting lm(mpg ~ am_factor, data = mtcars) with a treatment contrast:

    (Intercept) am_factorManual 
      17.147368        7.244939 

Fitting lm(mpg ~ am_factor, data = mtcars) with a Helmert contrast:

(Intercept)  am_factor1 
   20.76984     3.62247 
  • The contrasts() function gets and sets contrasts for factors.
  • Treatment contrasts for two factor levels:
contrasts(mtcars$am_factor) <- contr.treatment(2)
  • Helmert contrasts for two factor levels:
contrasts(mtcars$am_factor) <- contr.helmert(2)

Task

  • Take the mtcars dataset
  • Predict mpg by wt and the transmission type am and their interaction
  • By default am is not a factor. Please create a factor transmission for am first
  • Set the contrasts of the factor to Helmert and calculate the model.
  • Set the contrasts of the factor to Treatment and calculate the model
  • Compare the results of the two models and discuss them with your seatmate.
mtcars$transmission <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
contrasts(mtcars$transmission) <- contr.helmert(2)
fit1 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit1)

Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                18.1520     0.7094  25.586  < 2e-16 ***
wt_centered                -6.4351     0.7223  -8.909 1.16e-09 ***
transmission1              -1.0839     0.7094  -1.528  0.13779    
wt_centered:transmission1  -2.6492     0.7223  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11
contrasts(mtcars$transmission) <- contr.treatment(2)
fit2 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit2)

Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                19.2358     0.7357  26.147  < 2e-16 ***
wt_centered                -3.7859     0.7856  -4.819 4.55e-05 ***
transmission2              -2.1677     1.4189  -1.528  0.13779    
wt_centered:transmission2  -5.2984     1.4447  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

Design example

library(sjPlot)
tab_model(fit1)
                                 mpg
Predictors                       Estimates   CI              p
(Intercept)                      18.15       16.70 – 19.61   <0.001
wt centered                      -6.44       -7.91 – -4.96   <0.001
transmission [1]                 -1.08       -2.54 – 0.37    0.138
wt centered × transmission [1]   -2.65       -4.13 – -1.17   0.001
Observations                     32
R2 / R2 adjusted                 0.833 / 0.815

tab_model(fit1, fit2)
                                 mpg                                mpg
Predictors                       Estimates  CI              p       Estimates  CI              p
(Intercept)                      18.15      16.70 – 19.61   <0.001  19.24      17.73 – 20.74   <0.001
wt centered                      -6.44      -7.91 – -4.96   <0.001  -3.79      -5.40 – -2.18   <0.001
transmission [1]                 -1.08      -2.54 – 0.37    0.138
wt centered × transmission [1]   -2.65      -4.13 – -1.17   0.001
transmission [2]                                                    -2.17      -5.07 – 0.74    0.138
wt centered × transmission [2]                                      -5.30      -8.26 – -2.34   0.001
Observations                     32                                 32
R2 / R2 adjusted                 0.833 / 0.815                      0.833 / 0.815

tab_model(fit1, fit2,
  show.ci = FALSE, show.se = TRUE,
  string.se = "se", string.est = "B",
  dv.labels = c("Model with helmert contrast", "Model with treatment contrast")
)
                                 Model with helmert contrast   Model with treatment contrast
Predictors                       B       se     p              B       se     p
(Intercept)                      18.15   0.71   <0.001         19.24   0.74   <0.001
wt centered                      -6.44   0.72   <0.001         -3.79   0.79   <0.001
transmission [1]                 -1.08   0.71   0.138
wt centered × transmission [1]   -2.65   0.72   0.001
transmission [2]                                               -2.17   1.42   0.138
wt centered × transmission [2]                                 -5.30   1.44   0.001
Observations                     32                            32
R2 / R2 adjusted                 0.833 / 0.815                 0.833 / 0.815

Multilevel regression formula

\(y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}\)

Level 2:
\(\beta_{0j} = \gamma_{00} + \gamma_{01}W_j + \upsilon_{0j}\)
\(\beta_{1j} = \gamma_{10} + \upsilon_{1j}\)

\(y\) = Criterion variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
\(j\) = Level 2 group number
\(\gamma_{00}\) = Intercept
\(W\) = Level 2 predictor (grouping variable)
\(\gamma_{01}\) = Weight for \(W\)
\(\gamma_{10}\) = Weight of the predictor
\(\upsilon_{0j}\) = Error term for intercept
\(\upsilon_{1j}\) = Error term for slope
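
These notes do not show code for fitting a multilevel model; a common choice in R is the lme4 package (an assumption here, as it is not introduced above). A minimal sketch using lme4's built-in sleepstudy data, with a random intercept and a random slope per subject:

```r
library(lme4)  # assumed to be installed; not part of base R
# Reaction time predicted by days of sleep deprivation; the (Days | Subject)
# term requests a random intercept and a random Days slope for each subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)
```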