Regression

  • In a statistical regression you explain one variable (the criterion or dependent variable) by one or more other variables (the predictors or independent variables).
  • You regress the criterion variable on the predictor variables.
  • E.g.: you want to explain life expectancy by certain aspects of behavior (diet, physical activity) and of the environment (pollution, health care).
  • To do so, you set up a mathematical function that estimates a person's life expectancy from all the other variables.

Regression formula

\(y_i = \beta_0 + \beta_1X_i + e_i\)

\(y\) = Criterion variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
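
The formula can be illustrated numerically. The intercept, weight, and error terms below are purely made up for illustration; they are not taken from any model in these notes:

```r
# Hypothetical values to illustrate y_i = beta_0 + beta_1 * X_i + e_i
beta0 <- 2                    # intercept
beta1 <- 0.5                  # weight of the predictor
x     <- c(1, 2, 3)           # predictor values for subjects i = 1..3
e     <- c(0.1, -0.2, 0.05)   # error terms
y     <- beta0 + beta1 * x + e
y  # 2.60 2.80 3.55
```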

lm() function

  • The lm() function fits a regression model.
  • lm(formula, data)
  • Formulas are a basic data type that is applied in many R functions.
  • Basic structure: dependent variable ~ explanatory variables
    (e.g. y ~ x1 + x2)
  • data takes a data frame

lm(dist ~ speed, data = cars)

Example

fit <- lm(dist ~ speed, data = cars)
fit

Call:
lm(formula = dist ~ speed, data = cars)

Coefficients:
(Intercept)        speed  
    -17.579        3.932  

\(dist_i = -17.579 + 3.932 \cdot speed_i\)
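
The fitted equation can be checked against R's own predict() function; a small sketch (the speed value 10 is arbitrary):

```r
fit <- lm(dist ~ speed, data = cars)
# Predicted stopping distance for a speed of 10, computed by hand ...
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["speed"]] * 10
# ... and via predict(); the two agree (about 21.74)
via_predict <- predict(fit, newdata = data.frame(speed = 10))
```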

Use summary() to get a detailed analysis

summary(fit)

Call:
lm(formula = dist ~ speed, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max 
-29.069  -9.525  -2.272   9.215  43.201 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.5791     6.7584  -2.601   0.0123 *  
speed         3.9324     0.4155   9.464 1.49e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.38 on 48 degrees of freedom
Multiple R-squared:  0.6511,    Adjusted R-squared:  0.6438 
F-statistic: 89.57 on 1 and 48 DF,  p-value: 1.49e-12

Model fit:
\(F(1, 48) = 89.57; p < .001 ; R² = .65\)
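
These statistics do not have to be copied from the printout by hand; they can be extracted from the summary object (field names as documented for base R's summary.lm):

```r
fit <- lm(dist ~ speed, data = cars)
s <- summary(fit)
s$r.squared       # multiple R-squared, ~0.651
s$adj.r.squared   # adjusted R-squared, ~0.644
s$fstatistic      # F value plus numerator and denominator df (1 and 48)
```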

Task

  • Take the mtcars dataset.
  • Calculate the correlation of mileage mpg and car weight wt.
  • Plot a scatterplot with ggplot (mileage on the x-axis and weight on the y-axis).
  • Add a regression line with (geom_smooth(method = "lm")).
  • Regress mileage mpg on car weight wt (that is, predict mileage by means of weight).

cor(mtcars$mpg, mtcars$wt)
[1] -0.8676594
fit <- lm(mpg ~ wt, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.5432 -2.3647 -0.1252  1.4096  6.8727 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared:  0.7528,    Adjusted R-squared:  0.7446 
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

ggplot(mtcars, aes(x = mpg, y = wt)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Task

  • Take the temp_carbon dataset from the dslabs library.
  • Make yourself familiar with the dataset: ?temp_carbon.
  • Plot a scatterplot with ggplot (year on the x-axis and temp_anomaly on the y-axis).
  • Add a regression line with (geom_smooth(method = "lm")).
  • Regress temp_anomaly on year (that is, predict temp_anomaly by means of year).

library(dslabs)
fit <- lm(temp_anomaly ~ year, 
          data = temp_carbon)
summary(fit)

Call:
lm(formula = temp_anomaly ~ year, data = temp_carbon)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.32598 -0.11846 -0.01062  0.11185  0.43367 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -1.409e+01  6.750e-01  -20.87   <2e-16 ***
year         7.259e-03  3.463e-04   20.96   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1638 on 137 degrees of freedom
  (129 observations deleted due to missingness)
Multiple R-squared:  0.7623,    Adjusted R-squared:  0.7606 
F-statistic: 439.4 on 1 and 137 DF,  p-value: < 2.2e-16

ggplot(temp_carbon, aes(x = year, 
                        y = temp_anomaly)) + 
  geom_point() + 
  geom_smooth(method = "lm")

Multiple predictors

  • A formula can take multiple predictors:
    y ~ x1 + x2
  • An interaction describes a relation between two (or more) predictors in which the influence of one predictor changes with the value of the other (e.g. weight and smoking both influence blood pressure, and the influence of smoking is even stronger for those who are overweight).
  • An interaction is modeled with a : sign:
    y ~ x1 + x2 + x1:x2
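
In R's formula language, x1 * x2 is shorthand for x1 + x2 + x1:x2, so both ways of writing the model give the same fit:

```r
# Both formulas specify the same model: two main effects plus their interaction
fit_long  <- lm(mpg ~ wt + cyl + wt:cyl, data = mtcars)
fit_short <- lm(mpg ~ wt * cyl,          data = mtcars)
all.equal(coef(fit_long), coef(fit_short))  # TRUE
```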

Example

Predict mileage mpg by weight wt, number of cylinders cyl, and their interaction

fit <- lm(mpg ~ wt + cyl + wt:cyl, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt + cyl + wt:cyl, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.2288 -1.3495 -0.5042  1.4647  5.2344 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  54.3068     6.1275   8.863 1.29e-09 ***
wt           -8.6556     2.3201  -3.731 0.000861 ***
cyl          -3.8032     1.0050  -3.784 0.000747 ***
wt:cyl        0.8084     0.3273   2.470 0.019882 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.368 on 28 degrees of freedom
Multiple R-squared:  0.8606,    Adjusted R-squared:  0.8457 
F-statistic: 57.62 on 3 and 28 DF,  p-value: 4.231e-12

Task

  • Take the gapminder dataset from the dslabs library.
  • Predict infant mortality (infant_mortality) by year (year) and average number of children per woman (fertility).
  • Take the interaction into account.
  • How to interpret the estimates?
library(dslabs)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = gapminder)
summary(fit)

Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year, 
    data = gapminder)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.327 -11.540  -1.282  11.575 141.101 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)    -5.902e+02  8.588e+01  -6.873 6.72e-12 ***
year            2.872e-01  4.308e-02   6.666 2.77e-11 ***
fertility       3.574e+02  1.869e+01  19.130  < 2e-16 ***
year:fertility -1.708e-01  9.404e-03 -18.168  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.84 on 9088 degrees of freedom
  (1453 observations deleted due to missingness)
Multiple R-squared:  0.7071,    Adjusted R-squared:  0.707 
F-statistic:  7312 on 3 and 9088 DF,  p-value: < 2.2e-16
  • Although the previous computation is correct, the estimates are misleading. The intercept, for example, represents the expected value of the dependent variable when all predictors are zero (i.e. year is 0 and fertility is 0).
  • We can mend this by centering the predictor variables.
  • Centering is achieved by subtracting the mean of a variable from each measurement of that variable. That is, the value 0 now represents the average.
dat <- gapminder
dat$year <- dat$year - mean(dat$year, na.rm = TRUE)
# alternatively, use the scale function to center a variable:
dat$fertility <- scale(dat$fertility, scale = FALSE)

dat <- gapminder
dat$year <- scale(dat$year, scale = FALSE)
dat$fertility <- scale(dat$fertility, scale = FALSE)
fit <- lm(infant_mortality ~ year + fertility + fertility:year, data = dat)
summary(fit)

Call:
lm(formula = infant_mortality ~ year + fertility + fertility:year, 
    data = dat)

Residuals:
    Min      1Q  Median      3Q     Max 
-81.327 -11.540  -1.282  11.575 141.101 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)    53.452151   0.304781  175.38   <2e-16 ***
year           -0.410454   0.019561  -20.98   <2e-16 ***
fertility      17.796992   0.150837  117.99   <2e-16 ***
year:fertility -0.170849   0.009404  -18.17   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 25.84 on 9088 degrees of freedom
  (1453 observations deleted due to missingness)
Multiple R-squared:  0.7071,    Adjusted R-squared:  0.707 
F-statistic:  7312 on 3 and 9088 DF,  p-value: < 2.2e-16

Categorical predictors: comparing groups

  • Predictors can be categorical variables.
  • Examples of categorical variables are: students with vs. without special educational needs; gender; hair color.
  • They have to be factors so that R recognizes them as categorical.

Example

  • We predict mpg by wt, the transmission type am, and their interaction.
  • The transmission type is a categorical variable. By default am is not a factor, so we have to turn it into one first.
mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
mtcars$am_factor <- factor(mtcars$am, labels = c("Automatic", "Manual"))
fit <- lm(mpg ~ wt_centered + am_factor + wt_centered:am_factor, data = mtcars)
summary(fit)

Call:
lm(formula = mpg ~ wt_centered + am_factor + wt_centered:am_factor, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                  19.2358     0.7357  26.147  < 2e-16 ***
wt_centered                  -3.7859     0.7856  -4.819 4.55e-05 ***
am_factorManual              -2.1677     1.4189  -1.528  0.13779    
wt_centered:am_factorManual  -5.2984     1.4447  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

Task

  • Take the heights dataset from the dslabs library.
  • Turn sex into a factor and regress height on sex.

heights$sex <- factor(heights$sex)
fit <- lm(height ~ sex, data = heights)
summary(fit)

Call:
lm(formula = height ~ sex, data = heights)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.3148  -2.3148  -0.3148   2.6852  14.0606 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  64.9394     0.2363  274.82   <2e-16 ***
sexMale       4.3753     0.2687   16.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.645 on 1048 degrees of freedom
Multiple R-squared:  0.2019,    Adjusted R-squared:  0.2012 
F-statistic: 265.1 on 1 and 1048 DF,  p-value: < 2.2e-16
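
With a treatment contrast, the estimates map directly onto the group means, which can be verified (this assumes the heights data from the dslabs library, as in the task above):

```r
library(dslabs)
heights$sex <- factor(heights$sex)
fit <- lm(height ~ sex, data = heights)
group_means <- tapply(heights$height, heights$sex, mean)
# intercept = mean height of the reference group (Female), ~64.94
# slope     = Male - Female difference in mean height, ~4.38
coef(fit)[["(Intercept)"]]
group_means[["Male"]] - group_means[["Female"]]
```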

Contrasts

  • In the previous example, we compared the effect of an automatic against a manual gearshift. This is called a contrast. In this case, the intercept represented the mean mpg for the automatic gearshift and the predictor represented the change in mpg for manual gearshifts.

  • This type of contrast is called a “treatment” contrast.

  • An alternative approach is to let the intercept represent the average of ‘mpg’ across both types of gearshift, so that the predictor represents how much each type of gearshift deviates from that average.

  • This type of contrast is called a Helmert contrast.
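
The two coding schemes can be inspected directly. For a factor with two levels, treatment coding uses 0/1 and Helmert coding uses -1/+1, which is why the Helmert slope is exactly half the treatment slope (7.245 / 2 ≈ 3.622, the numbers shown in the example that follows):

```r
contr.treatment(2)  # second level coded 1, reference level 0
contr.helmert(2)    # levels coded -1 and +1

mtcars$am_factor <- factor(mtcars$am, labels = c("Automatic", "Manual"))
b_treatment <- coef(lm(mpg ~ am_factor, data = mtcars))[[2]]
contrasts(mtcars$am_factor) <- contr.helmert(2)
b_helmert <- coef(lm(mpg ~ am_factor, data = mtcars))[[2]]
c(b_treatment, b_helmert)  # ~7.245 and ~3.622
```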

Example

Fitting lm(mpg ~ am_factor, data = mtcars) with a treatment contrast:

    (Intercept) am_factorManual 
      17.147368        7.244939 

Fitting lm(mpg ~ am_factor, data = mtcars) with a Helmert contrast:

(Intercept)  am_factor1 
   20.76984     3.62247 
  • The contrasts() function gets and sets contrasts for factors.
  • Treatment contrasts for two factor levels:
contrasts(mtcars$am_factor) <- contr.treatment(2)
  • Helmert contrasts for two factor levels:
contrasts(mtcars$am_factor) <- contr.helmert(2)

Task

  • Take the mtcars dataset
  • Predict mpg by wt and the transmission type am and their interaction
  • By default am is not a factor. Please create a factor transmission for am first
  • Set the contrasts of the factor to Helmert and calculate the model.
  • Set the contrasts of the factor to Treatment and calculate the model
  • Compare the results of the two models and discuss them with your seatmate.
mtcars$transmission <- factor(mtcars$am, labels = c("Automatic", "Manual"))
mtcars$wt_centered <- scale(mtcars$wt, scale = FALSE)
contrasts(mtcars$transmission) <- contr.helmert(2)
fit1 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit1)

Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                18.1520     0.7094  25.586  < 2e-16 ***
wt_centered                -6.4351     0.7223  -8.909 1.16e-09 ***
transmission1              -1.0839     0.7094  -1.528  0.13779    
wt_centered:transmission1  -2.6492     0.7223  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11
contrasts(mtcars$transmission) <- contr.treatment(2)
fit2 <- lm(mpg ~ wt_centered + transmission + wt_centered:transmission, data = mtcars)
summary(fit2)

Call:
lm(formula = mpg ~ wt_centered + transmission + wt_centered:transmission, 
    data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6004 -1.5446 -0.5325  0.9012  6.0909 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)    
(Intercept)                19.2358     0.7357  26.147  < 2e-16 ***
wt_centered                -3.7859     0.7856  -4.819 4.55e-05 ***
transmission2              -2.1677     1.4189  -1.528  0.13779    
wt_centered:transmission2  -5.2984     1.4447  -3.667  0.00102 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.591 on 28 degrees of freedom
Multiple R-squared:  0.833, Adjusted R-squared:  0.8151 
F-statistic: 46.57 on 3 and 28 DF,  p-value: 5.209e-11

Design example

library(sjPlot)
tab_model(fit1)
                                 mpg
Predictors                       Estimates   CI              p
(Intercept)                      18.15       16.70 – 19.61   <0.001
wt centered                      -6.44       -7.91 – -4.96   <0.001
transmission [1]                 -1.08       -2.54 – 0.37    0.138
wt centered × transmission [1]   -2.65       -4.13 – -1.17   0.001
Observations                     32
R2 / R2 adjusted                 0.833 / 0.815

tab_model(fit1, fit2)
                                 mpg                                mpg
Predictors                       Estimates  CI              p       Estimates  CI              p
(Intercept)                      18.15      16.70 – 19.61   <0.001  19.24      17.73 – 20.74   <0.001
wt centered                      -6.44      -7.91 – -4.96   <0.001  -3.79      -5.40 – -2.18   <0.001
transmission [1]                 -1.08      -2.54 – 0.37    0.138
wt centered × transmission [1]   -2.65      -4.13 – -1.17   0.001
transmission [2]                                                    -2.17      -5.07 – 0.74    0.138
wt centered × transmission [2]                                      -5.30      -8.26 – -2.34   0.001
Observations                     32                                 32
R2 / R2 adjusted                 0.833 / 0.815                      0.833 / 0.815

tab_model(fit1, fit2,
  show.ci = FALSE, show.se = TRUE,
  string.se = "se", string.est = "B",
  dv.labels = c("Model with helmert contrast", "Model with treatment contrast")
)
                                 Model with helmert contrast   Model with treatment contrast
Predictors                       B       se     p              B       se     p
(Intercept)                      18.15   0.71   <0.001         19.24   0.74   <0.001
wt centered                      -6.44   0.72   <0.001         -3.79   0.79   <0.001
transmission [1]                 -1.08   0.71   0.138
wt centered × transmission [1]   -2.65   0.72   0.001
transmission [2]                                               -2.17   1.42   0.138
wt centered × transmission [2]                                 -5.30   1.44   0.001
Observations                     32                            32
R2 / R2 adjusted                 0.833 / 0.815                 0.833 / 0.815

Multilevel regression formula

\(y_{ij} = \beta_{0j} + \beta_{1j} X_{ij} + e_{ij}\)

Level 2:
\(\beta_{0j} = \gamma_{00} + \gamma_{01}W_j + \upsilon_{0j}\)
\(\beta_{1j} = \gamma_{10} + \upsilon_{1j}\)

\(y\) = Criterion variable
\(i\) = Subject number (measurement number)
\(\beta_0\) = Intercept of \(y\)
\(\beta_1\) = Weight of predictor \(X\)
\(X\) = Predictor variable
\(e\) = Error term
\(j\) = Level 2 group number
\(\gamma_{00}\) = Intercept
\(W\) = Level 2 predictor (grouping variable)
\(\gamma_{01}\) = Weight for \(W\)
\(\gamma_{10}\) = Weight of the predictor
\(\upsilon_{0j}\) = Error term for intercept
\(\upsilon_{1j}\) = Error term for slope
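
These notes do not show code for fitting a multilevel model; a common choice in R is the lme4 package (an assumption here, as it is not introduced above). A minimal sketch using lme4's built-in sleepstudy data, with a random intercept and a random slope per subject:

```r
library(lme4)  # assumed to be installed; not part of base R
# Reaction time predicted by days of sleep deprivation; the (Days | Subject)
# term requests a random intercept and a random Days slope for each subject
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
summary(fit)
```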