Some basic statistical functions

  • min(), max(): Minimum and maximum
  • mean(), median(): Mean and median
  • sd(), var(): Standard deviation and variance
  • mad(): Median absolute deviation
  • quantile(): Percentile / quantile

Task

  • Take the mtcars dataset
  • Create a new variable lpk (liters per 100km) from mpg
    (Formula: lpk = 282.5 / mpg)
  • Calculate grouped by cylinders (cyl): mean, sd, median, mad, min, max of lpk
  • Round all values by one decimal
  • Hint: Use the tidyverse or dplyr library

:-)

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(cyl) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1) 
cyl mean sd median mad min max
4 10.9 1.8 10.9 2.3 8.3 13.2
6 14.4 1.1 14.3 1.3 13.2 15.9
8 19.3 3.8 18.6 1.9 14.7 27.2

What is happening here?

mtcars <- mutate(mtcars, 
  transmission = factor(am, labels = c("Manual", "Automatic"))
)

Now, lets calculate the stats for the transmission type:

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) 
transmission mean sd median mad min max
Manual 17.35861 4.339156 16.32948 3.344815 11.577869 27.16346
Automatic 12.33851 3.341021 12.39035 3.028011 8.333333 18.83333

… and round the results

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1)
Error in Math.data.frame(list(transmission = 1:2, mean = c(17.3586131968484, : 
non-numeric-alike variable(s) in data frame: transmission

ups :-( … what went wrong?

Solution: only round variables that are numeric:

  • is.numeric() returns a TRUE or FALSE for a vector (is.numeric(1:5); is.numeric(c("A", "B")))
  • across() specifies for mutate()which variable to select.

mutate(across(.cols, .fns, ... )):

  • .cols := columns so select
  • .fns := a function to apply
  • … := arguments to that function

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  mutate(across(where(is.numeric), round, 1))
transmission mean sd median mad min max
Manual 17.4 4.3 16.3 3.3 11.6 27.2
Automatic 12.3 3.3 12.4 3.0 8.3 18.8

Tables in base R

  • table(): Shows frequencies of nominal scaled variables.
  • prop.table(): Calculates proportions from frequency tables.
  • addmargins(): Adds margins to tables.

Example one dimensional

tab <- table(mtcars$cyl) # frequencies
tab

 4  6  8 
11  7 14 
prop.table(tab) # proportions

      4       6       8 
0.34375 0.21875 0.43750 
prop.table(tab) * 100 # percentages 

     4      6      8 
34.375 21.875 43.750 

Example two dimensional

tab <- table(mtcars$cyl, mtcars$am)
tab
   
     0  1
  4  3  8
  6  4  3
  8 12  2
prop.table(tab) * 100
   
         0      1
  4  9.375 25.000
  6 12.500  9.375
  8 37.500  6.250

Example two dimensional with percentages by rows and columns

tab <- table(mtcars$cyl, mtcars$am)
prop.table(tab, margin = 1) * 100  # sum of each row is 100%
   
           0        1
  4 27.27273 72.72727
  6 57.14286 42.85714
  8 85.71429 14.28571
prop.table(tab, margin = 2) * 100 # sum of each column is 100%
   
           0        1
  4 15.78947 61.53846
  6 21.05263 23.07692
  8 63.15789 15.38462

Example with added margins

tab <- table(mtcars$cyl, mtcars$am)
tab <- prop.table(tab, margin = 2) * 100
addmargins(tab)
     
              0         1       Sum
  4    15.78947  61.53846  77.32794
  6    21.05263  23.07692  44.12955
  8    63.15789  15.38462  78.54251
  Sum 100.00000 100.00000 200.00000

Task

  • Take the heights dataset from the dslabs library.
  • Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

library(dslabs)

heights$category[heights$height * 2.54 <= 170] <- "% <= 170 cm"
heights$category[heights$height * 2.54 > 170] <- "% > 170 cm"

tab <- table(heights$sex, heights$category)
tab <- prop.table(tab, margin = 1) * 100
tab <- round(tab, 1)
tab
        
         % <= 170 cm % > 170 cm
  Female        71.0       29.0
  Male          19.8       80.2

Tables with dplyr

mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n())
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2

Shortcut for simple count tables

# Because it is shorter, I don't use pipes here (but I could!)
count(mtcars, cyl) 
cyl n
4 11
6 7
8 14
count(mtcars, cyl, am) 
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2

Tables with dplyr: pivot_wider()

pivot_wider():
creates separate variables from the levels of a factor variable and the values of a second variable

mtcars %>% count(cyl, am)
cyl am n
4 0 3
4 1 8
6 0 4
6 1 3
8 0 12
8 1 2
mtcars %>% count(cyl, am) %>%
  pivot_wider(names_from = "am", values_from = "n")
cyl 0 1
4 3 8
6 4 3
8 12 2

Tables with dplyr: A bit nicer!

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  count(cyl, am) %>% 
  pivot_wider(names_from = "am", values_from = "n") %>%
  rename("Cylinders" = "cyl")
Cylinders Manual Automatic
4 3 8
6 4 3
8 12 2

Task

Recreate the previous task. This time using dplyr:

  • Take the heights dataset from the dslabs library.
  • Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

This is a hard task …

Hint 1: ifelse(height * 2.54 <= 170, "size_a", "size_b")
Hint 2:
size_a / (size_a + size_b) * 100
size_b / (size_a + size_b) * 100
are the row proportions.

library(dslabs)

heights %>%
  mutate(category = ifelse(height * 2.54 <= 170, "size_a", "size_b")) %>% 
  count(sex, category) %>%
  pivot_wider(names_from = "category", values_from = "n") %>%
  mutate(
    "% <= 170" = size_a / (size_a + size_b) * 100,
    "% > 170" = size_b / (size_a + size_b) * 100,
    across(where(is.numeric), round, 1)
  ) %>%
  select(1, 4, 5)
sex % <= 170 % > 170
Female 71.0 29.0
Male 19.8 80.2

Tables with dplyr: For values of a third variable

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(
    n = n(), 
    M = mean(mpg), 
    SD = sd(mpg)
  ) 
cyl am n M SD
4 Manual 3 22.90000 1.4525839
4 Automatic 8 28.07500 4.4838599
6 Manual 4 19.12500 1.6317169
6 Automatic 3 20.56667 0.7505553
8 Manual 12 15.05000 2.7743959
8 Automatic 2 15.40000 0.5656854

pivot_wider() can take values from several variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD")) %>%
  round(1)
cyl n_Manual n_Automatic M_Manual M_Automatic SD_Manual SD_Automatic
4 3 8 22.9 28.1 1.5 4.5
6 4 3 19.1 20.6 1.6 0.8
8 12 2 15.1 15.4 2.8 0.6

Set argument names_vary = "slowest" for a different ordering of variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") %>%
  round(1)
cyl n_Manual M_Manual SD_Manual n_Automatic M_Automatic SD_Automatic
4 3 22.9 1.5 8 28.1 4.5
6 4 19.1 1.6 3 20.6 0.8
8 12 15.1 2.8 2 15.4 0.6

Task

  • Take the storms dataset and calculate n and mean of the wind speed (wind) by month (month) and storm classification (status) in a crosstable.

storms %>% 
  group_by(month, status) %>%
  summarise(
    n = n(),
    M = mean(wind, na.rm = TRUE)
  ) %>%
  pivot_wider(names_from = "status", values_from = c("n", "M")) %>%
  round()
month n_extratropical n_hurricane n_other low n_subtropical storm n_tropical depression n_tropical storm n_subtropical depression n_disturbance n_tropical wave M_extratropical M_hurricane M_other low M_subtropical storm M_tropical depression M_tropical storm M_subtropical depression M_disturbance M_tropical wave
1 29 5 5 6 2 23 NA NA NA 54 70 25 48 30 48 NA NA NA
4 40 NA NA 3 1 18 4 NA NA 38 NA NA 43 30 43 30 NA NA
5 18 NA 49 20 49 60 5 NA NA 42 NA 29 40 29 42 30 NA NA
6 130 18 84 12 213 282 35 35 NA 36 70 24 40 27 44 26 35 NA
7 135 221 181 6 399 645 11 46 7 36 81 23 35 26 46 23 30 34
8 275 1038 319 23 975 1696 36 25 55 36 86 25 40 27 45 24 31 27
9 800 2464 473 72 1331 2522 34 41 41 43 89 27 40 28 46 28 26 30
10 527 803 226 66 429 1049 22 16 NA 42 85 26 47 28 46 29 31 NA
11 183 221 85 48 149 464 4 8 8 47 82 25 47 28 48 30 24 28
12 14 33 31 42 21 71 NA NA NA 44 68 28 52 30 44 NA NA NA

Design example:

Replicate and try to understand the following table example:

library(kableExtra)
tab <- mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") 

names(tab) <- c("Cylinders", "n", "M", "SD", "n", "M", "SD")

kable(tab, 
  caption =  "Table 1.<br>Miles per gallon by cylinders and gearshift", 
  digits = 1, 
  full_width = FALSE
) %>% 
  kable_classic() %>%
  add_header_above(c(" " = 1, "Automatic" = 3, "Manual" = 3))

Table 1.
Miles per gallon by cylinders and gearshift
Automatic
Manual
Cylinders n M SD n M SD
4 3 22.9 1.5 8 28.1 4.5
6 4 19.1 1.6 3 20.6 0.8
8 12 15.1 2.8 2 15.4 0.6

Some statistical functions:

  • t.test(): Calculating a t-test.

  • wilcox.test(): Calculating Wilcox test / U-Test.

  • chisq.test(): Calculating a Pearson \(X^2\)-test.

  • Some more we will not address today:

    • lm(): Regression analyses
    • cor.test(): Calculating a correlation test.
    • binom.test(): Binomial test.
    • fisher.test(): Fisher exact test for count data.
    • ks.test(): One and two sample Kolmogorov-Smirnov Tests.
    • shapiro.test(): Shapiro-Wilk Normality Test.
    • aov(): Analysis of variance

chisq.test()

  • Pearson’s \(X^2\) test is a very versatile test to calculate whether a distribution of observed frequencies equates expected frequencies.
  • A very common application is to test whether frequencies of observations in two variables are related. (E.g. number of patients that improved in a treatment vs. control group).
  • The chisq.test() functions takes a two dimensional frequency table and calculates a \(X^2\) test.

Example with data from an intervention study

# get an external data example
dat <- read.csv("https://goo.gl/j6lRXD")
# Create a two distribution table
tab <- table(dat$treatment, dat$improvement)
tab
             
              improved not-improved
  not-treated       26           29
  treated           35           15
# Test for non random distribution:
chisq.test(tab) # Alternatively: chisq.test(dat$treatment, dat$improvement)

    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 4.6626, df = 1, p-value = 0.03083

Significantly more patients improved in the treatment condition (\(X^2(1)=4.7, p < .05\))

Task

  • Take the starwars dataset from the tidyverse package.
  • Use table() to get the distribution eye_color by hair_color
  • Apply the chisq.test() to calculate a X² test to test for an even distribution.

:-)

tab <- table(starwars$eye_color, starwars$hair_color)
chisq.test(tab)

    Pearson's Chi-squared test

data:  tab
X-squared = 185.19, df = 140, p-value = 0.006316

t.test()

  • Student’s t-test analysis whether two samples originate from the same normal distribution.

  • It is used to test for mean differences in two groups.

  • Arguments of the t.test() function:

    • x and y: Each variable provides data from a samples.
    • formula: If you have one vector with all data (e.g. values) and a second vector with grouping information (e.g. group) use: values ~ group.
    • data: If you work with formula: The name of a data frame.
    • paired: If set TRUE, the test assumes repeated measures of one sample instead of two independent samples.

Example

Does one of two drugs increase hours of sleep better?

extra: change in sleep duration, group : drug given

sleep
extra group ID
0.7 1 1
-1.6 1 2
-0.2 1 3
-1.2 1 4
-0.1 1 5
3.4 1 6
3.7 1 7
0.8 1 8
0.0 1 9
2.0 1 10
1.9 2 1
0.8 2 2
1.1 2 3
0.1 2 4
-0.1 2 5
4.4 2 6
5.5 2 7
1.6 2 8
4.6 2 9
3.4 2 10

Example

# Applying a T-test by providing x and y
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
t.test(x, y)

# Applying a T-test with a formula
t.test(extra ~ group, data = sleep)

Example


    Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33 

Group 2 shows an increase in sleep length of \(\Delta M=1.58\) (\(t(17.8)=1.86, p = .08\))

wilcox.test()

  • Calculates a Wilcoxon rank sum test. Also known as Mann Whitney U-Test
  • This test is applied as an alternative to a t-Test when data are assumed to be non-normal distributed.
  • It takes the same arguments as t.test()

Task

  • Take the sleep dataset from the previous example.
  • Calculate the median and mad (Median absolute deviation) for extra for each group.
  • Calculate a Wilcoxon test with the sleep dataset on the effectiveness of the intervention.

:-)

library(dplyr)
sleep %>% group_by(group) %>%
  summarise(
    median = median(extra), 
    mad = mad(extra)
  )
group median mad
1 0.35 1.55673
2 1.75 2.44629
wilcox.test(extra~group, data = sleep)

    Wilcoxon rank sum test with continuity correction

data:  extra by group
W = 25.5, p-value = 0.06933
alternative hypothesis: true location shift is not equal to 0

Task

  • Install the “dslabs” package
  • Take the gapminder dataset. This dataset includes health and income outcomes for 184 countries from 1960 to 2016.
  • It includes infant_mortality, region and continent.
  • Calculate the mean and median infant_mortality for each continent
  • Calculate a t-test comparing infant_mortality between the regions Southern- and Eastern Europe.

:-)

install.packages("dslabs")
library(dslabs)
gapminder %>% group_by(continent) %>% 
  summarise(
    mean = mean(infant_mortality, na.rm = TRUE),
    median = median(infant_mortality, na.rm = TRUE)
  )
continent mean median
Africa 95.12395 93.40
Americas 42.88145 30.80
Asia 55.26174 43.10
Europe 15.33022 11.25
Oceania 39.10136 29.10

x <- gapminder %>% filter(region == "Southern Europe") %>% select(infant_mortality)
y <- gapminder %>% filter(region == "Eastern Europe") %>% select(infant_mortality)
t.test(x, y)

# or with pipes
gapminder %>% 
  filter(region %in% c("Southern Europe", "Eastern Europe")) %>%
  t.test(infant_mortality ~ region, data = .)

    Welch Two Sample t-test

data:  infant_mortality by region
t = 0.53658, df = 929.22, p-value = 0.5917
alternative hypothesis: true difference in means between group Eastern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
 -1.471847  2.579544
sample estimates:
 mean in group Eastern Europe mean in group Southern Europe 
                     20.36164                      19.80779