Estimating models and predicting values with R
University of Münster
2025-10-30
You have basic knowledge on statistical regression
You know how to fit regression models in R
\(y_i = \beta_0 + \beta_1X_i + e_i\)
\(y\) = Criteria variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
lm() functionlm() function fits a regression model.lm(formula, data)data takes a dataframelm(dist ~ speed, data = cars)
\(dist_i = -17.579 + 3.932 * speed_i\)
summary() to get detailed analyses
Call:
lm(formula = dist ~ speed, data = cars)
Residuals:
Min 1Q Median 3Q Max
-29.069 -9.525 -2.272 9.215 43.201
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.5791 6.7584 -2.601 0.0123 *
speed 3.9324 0.4155 9.464 1.49e-12 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
Modelfit:
\(F(1, 48) = 89.57; p < .001 ; R² = .65\)
mtcars dataset.mpg and car weight wt.geom_smooth(method = "lm")).mpg on car weight wt (that is, predict mileage by means of weight).[1] -0.8676594
Call:
lm(formula = mpg ~ wt, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.5432 -2.3647 -0.1252 1.4096 6.8727
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.2851 1.8776 19.858 < 2e-16 ***
wt -5.3445 0.5591 -9.559 1.29e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528, Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF, p-value: 1.294e-10
temp_carbon dataset from the dslabs library.?temp_carbon.geom_smooth(method = "lm")).temp_anomaly on year (that is, predict temp_anomaly by means of year).
Call:
lm(formula = temp_anomaly ~ year, data = temp_carbon)
Residuals:
Min 1Q Median 3Q Max
-0.32598 -0.11846 -0.01062 0.11185 0.43367
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.409e+01 6.750e-01 -20.87 <2e-16 ***
year 7.259e-03 3.463e-04 20.96 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1638 on 137 degrees of freedom
(129 observations deleted due to missingness)
Multiple R-squared: 0.7623, Adjusted R-squared: 0.7606
F-statistic: 439.4 on 1 and 137 DF, p-value: < 2.2e-16
formula can take multiple predictors:y ~ x1 + x2: sign:Predict mileage 'mpg' by weight 'wt' and number of cylinders 'cyl' and its interaction
Call:
lm(formula = mpg ~ wt + cyl + wt:cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.2288 -1.3495 -0.5042 1.4647 5.2344
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 54.3068 6.1275 8.863 1.29e-09 ***
wt -8.6556 2.3201 -3.731 0.000861 ***
cyl -3.8032 1.0050 -3.784 0.000747 ***
wt:cyl 0.8084 0.3273 2.470 0.019882 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared: 0.8606, Adjusted R-squared: 0.8457
F-statistic: 57.62 on 3 and 28 DF, p-value: 4.231e-12
gapminder dataset from the dslabs library.infant_mortality) by year (year) and average number of children per woman (fertility).library(dslabs)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = gapminder)
summary(fit)
Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year,
data = gapminder)
Residuals:
Min 1Q Median 3Q Max
-81.327 -11.540 -1.282 11.575 141.101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -5.902e+02 8.588e+01 -6.873 6.72e-12 ***
year 2.872e-01 4.308e-02 6.666 2.77e-11 ***
fertility 3.574e+02 1.869e+01 19.130 < 2e-16 ***
year:fertility -1.708e-01 9.404e-03 -18.168 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.84 on 9088 degrees of freedom
(1453 observations deleted due to missingness)
Multiple R-squared: 0.7071, Adjusted R-squared: 0.707
F-statistic: 7312 on 3 and 9088 DF, p-value: < 2.2e-16
dat <- gapminder
dat$year <- scale(dat$year, scale = FALSE)
dat$fertility <- scale(dat$fertility, scale = FALSE)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = dat)
summary(fit)
Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year,
data = dat)
Residuals:
Min 1Q Median 3Q Max
-81.327 -11.540 -1.282 11.575 141.101
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 53.452151 0.304781 175.38 <2e-16 ***
year -0.410454 0.019561 -20.98 <2e-16 ***
fertility 17.796992 0.150837 117.99 <2e-16 ***
year:fertility -0.170849 0.009404 -18.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 25.84 on 9088 degrees of freedom
(1453 observations deleted due to missingness)
Multiple R-squared: 0.7071, Adjusted R-squared: 0.707
F-statistic: 7312 on 3 and 9088 DF, p-value: < 2.2e-16
factors to recognize them as categorical.mpg by wt and the transmission type am and their interaction.am is not a factor. So we have to turn am into a factor first.mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
mtcars$am_factor <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt_centered + am_factor + wt_centered:am_factor, data = mtcars)
summary(fit)
Call:
lm(formula = mpg ~ wt_centered + am_factor + wt_centered:am_factor,
data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.6004 -1.5446 -0.5325 0.9012 6.0909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.2358 0.7357 26.147 < 2e-16 ***
wt_centered -3.7859 0.7856 -4.819 4.55e-05 ***
am_factorManual -2.1677 1.4189 -1.528 0.13779
wt_centered:am_factorManual -5.2984 1.4447 -3.667 0.00102 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
Call:
lm(formula = height ~ sex, data = heights)
Residuals:
Min 1Q Median 3Q Max
-19.3148 -2.3148 -0.3148 2.6852 14.0606
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.9394 0.2363 274.82 <2e-16 ***
sexMale 4.3753 0.2687 16.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.645 on 1048 degrees of freedom
Multiple R-squared: 0.2019, Adjusted R-squared: 0.2012
F-statistic: 265.1 on 1 and 1048 DF, p-value: < 2.2e-16
In the previous example, we compared the effect of an automatic against a manual gearshift. This is called a contrast. In this case, our intercept represented the mpg for the automatic gearshift and the predictor represented the change in the mpg values for manual gearshifts.
This type of contrast is called a “treatment” contrast.
An alternative approach would be to calculate the average of ‘mpg’ across both types of gearshifts and the predictor represents how much the type of gearshift influences ‘mpg’.
This type of contrast is called a Helmert contrast.
Fitting lm(mpg ~ am_factor, data = mtcars) with a treatment contrast:
(Intercept) am_factorManual
17.147368 7.244939
Fitting lm(mpg ~ am_factor, data = mtcars) with a Helmert contrast:
(Intercept) am_factor1
20.76984 3.62247
constrasts() function gets and sets contrasts for factorsmtcars datasetmpg by wt and the transmission type am and their interactionam is not a factor. Please create a factor transmission for am firstmtcars$transmission <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
contrasts(mtcars$transmission) <- contr.helmert(2)
fit1 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit1)
Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission,
data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.6004 -1.5446 -0.5325 0.9012 6.0909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.1520 0.7094 25.586 < 2e-16 ***
wt_centered -6.4351 0.7223 -8.909 1.16e-09 ***
transmission1 -1.0839 0.7094 -1.528 0.13779
wt_centered:transmission1 -2.6492 0.7223 -3.667 0.00102 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
contrasts(mtcars$transmission) <- contr.treatment(2)
fit2 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit2)
Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission,
data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-3.6004 -1.5446 -0.5325 0.9012 6.0909
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.2358 0.7357 26.147 < 2e-16 ***
wt_centered -3.7859 0.7856 -4.819 4.55e-05 ***
transmission2 -2.1677 1.4189 -1.528 0.13779
wt_centered:transmission2 -5.2984 1.4447 -3.667 0.00102 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared: 0.833, Adjusted R-squared: 0.8151
F-statistic: 46.57 on 3 and 28 DF, p-value: 5.209e-11
| mpg | mpg | |||||
| Predictors | Estimates | CI | p | Estimates | CI | p |
| (Intercept) | 18.15 | 16.70 – 19.61 | <0.001 | 19.24 | 17.73 – 20.74 | <0.001 |
| wt centered | -6.44 | -7.91 – -4.96 | <0.001 | -3.79 | -5.40 – -2.18 | <0.001 |
| transmission [1] | -1.08 | -2.54 – 0.37 | 0.138 | |||
| wt centered × transmission [1] |
-2.65 | -4.13 – -1.17 | 0.001 | |||
| transmission [2] | -2.17 | -5.07 – 0.74 | 0.138 | |||
| wt centered × transmission [2] |
-5.30 | -8.26 – -2.34 | 0.001 | |||
| Observations | 32 | 32 | ||||
| R2 / R2 adjusted | 0.833 / 0.815 | 0.833 / 0.815 | ||||
tab_model(fit1, fit2,
show.ci = FALSE, show.se = TRUE,string.se = "se", string.est = "B",
dv.labels = c("Model with helmert contrast", "Model with treatment contrast")
)| Model with helmert contrast | Model with treatment contrast | |||||
| Predictors | B | se | p | B | se | p |
| (Intercept) | 18.15 | 0.71 | <0.001 | 19.24 | 0.74 | <0.001 |
| wt centered | -6.44 | 0.72 | <0.001 | -3.79 | 0.79 | <0.001 |
| transmission [1] | -1.08 | 0.71 | 0.138 | |||
| wt centered × transmission [1] |
-2.65 | 0.72 | 0.001 | |||
| transmission [2] | -2.17 | 1.42 | 0.138 | |||
| wt centered × transmission [2] |
-5.30 | 1.44 | 0.001 | |||
| Observations | 32 | 32 | ||||
| R2 / R2 adjusted | 0.833 / 0.815 | 0.833 / 0.815 | ||||
\(y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}\)
Level 2:
\(\beta_{0j} = \gamma_{01}W_j + \upsilon_{0j}\)
\(\beta_{1j} = \gamma_{10} + \upsilon_{1j}\)
\(y\) = Criteria variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
\(j\) = Level 2 group number
\(\gamma_{00}\) = Intercept
\(W\) = Level 2 predictor (grouping variable)
\(\gamma_{01}\) = Weight for \(W\)
\(\gamma_{10}\) = Weight of the predictor
\(\upsilon_{0j}\) = Error term for intercept \(\upsilon_{0j}\) = Error term for slope
Jürgen Wilbert - Introduction to R