Basic statistics

Jürgen Wilbert

University of Münster

2025-10-30

Goals

You know some functions for basic statistical analyses.
You can create contingency tables.
You learn to integrate filtering, selecting, grouping, and subsetting of data frames with functions for statistical analyses.

Descriptive statistics

Some basic statistical functions

min(), max(): Minimum and maximum
mean(), median(): Mean and median
sd(), var(): Standard deviation and variance
mad(): Median absolute deviation
quantile(): Percentile / quantile

Task

Take the mtcars dataset
Create a new variable lpk (liters per 100km) from mpg
(Formula: lpk = 282.5 / mpg)
Calculate grouped by cylinders (cyl): mean, sd, median, mad, min, max of lpk
Round all values by one decimal
Hint: Use the tidyverse or dplyr library

:-)

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(cyl) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1)

cyl	mean	sd	median	mad	min	max
4	10.9	1.8	10.9	2.3	8.3	13.2
6	14.4	1.1	14.3	1.3	13.2	15.9
8	19.3	3.8	18.6	1.9	14.7	27.2

What is happening here?

mtcars <- mutate(mtcars, 
  transmission = factor(am, labels = c("Manual", "Automatic"))
)

Now, lets calculate the stats for the transmission type:

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  )

transmission	mean	sd	median	mad	min	max
Manual	17.35861	4.339156	16.32948	3.344815	11.577869	27.16346
Automatic	12.33851	3.341021	12.39035	3.028011	8.333333	18.83333

… and round the results

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  round(1)

Error in Math.data.frame(list(transmission = 1:2, mean = c(17.3586131968484, : 
non-numeric-alike variable(s) in data frame: transmission

ups :-( … what went wrong?

Solution: only round variables that are numeric:

is.numeric() returns a TRUE or FALSE for a vector (is.numeric(1:5); is.numeric(c("A", "B")))
across() specifies for mutate()which variable to select.

mutate(across(.cols, .fns, ... )):

.cols := columns so select
.fns := a function to apply
… := arguments to that function

mtcars %>% 
  mutate(lpk = 282.5 / mpg) %>% 
  group_by(transmission) %>% 
  summarise(
    mean = mean(lpk),
    sd = sd(lpk),
    median = median(lpk),
    mad = mad(lpk),
    min = min(lpk),
    max = max(lpk)
  ) %>%
  mutate(across(where(is.numeric), round, 1))

transmission	mean	sd	median	mad	min	max
Manual	17.4	4.3	16.3	3.3	11.6	27.2
Automatic	12.3	3.3	12.4	3.0	8.3	18.8

Tables

Tables in base R

table(): Shows frequencies of nominal scaled variables.
prop.table(): Calculates proportions from frequency tables.
addmargins(): Adds margins to tables.

Example one dimensional

tab <- table(mtcars$cyl) # frequencies
tab


 4  6  8 
11  7 14

prop.table(tab) # proportions


      4       6       8 
0.34375 0.21875 0.43750

prop.table(tab) * 100 # percentages


     4      6      8 
34.375 21.875 43.750

Example two dimensional

tab <- table(mtcars$cyl, mtcars$am)
tab

prop.table(tab) * 100

   
         0      1
  4  9.375 25.000
  6 12.500  9.375
  8 37.500  6.250

Example two dimensional with percentages by rows and columns

tab <- table(mtcars$cyl, mtcars$am)
prop.table(tab, margin = 1) * 100  # sum of each row is 100%

   
           0        1
  4 27.27273 72.72727
  6 57.14286 42.85714
  8 85.71429 14.28571

prop.table(tab, margin = 2) * 100 # sum of each column is 100%

   
           0        1
  4 15.78947 61.53846
  6 21.05263 23.07692
  8 63.15789 15.38462

Example with added margins

tab <- table(mtcars$cyl, mtcars$am)
tab <- prop.table(tab, margin = 2) * 100
addmargins(tab)

     
              0         1       Sum
  4    15.78947  61.53846  77.32794
  6    21.05263  23.07692  44.12955
  8    63.15789  15.38462  78.54251
  Sum 100.00000 100.00000 200.00000

Task

Take the heights dataset from the dslabs library.
Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

library(dslabs)

heights$category[heights$height * 2.54 <= 170] <- "% <= 170 cm"
heights$category[heights$height * 2.54 > 170] <- "% > 170 cm"

tab <- table(heights$sex, heights$category)
tab <- prop.table(tab, margin = 1) * 100
tab <- round(tab, 1)
tab

        
         % <= 170 cm % > 170 cm
  Female        71.0       29.0
  Male          19.8       80.2

Tables with dplyr

mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n())

cyl	am	n
4	0	3
4	1	8
6	0	4
6	1	3
8	0	12
8	1	2

Shortcut for simple count tables

# Because it is shorter, I don't use pipes here (but I could!)
count(mtcars, cyl)

cyl	n
4	11
6	7
8	14

count(mtcars, cyl, am)

cyl	am	n
4	0	3
4	1	8
6	0	4
6	1	3
8	0	12
8	1	2

Tables with dplyr: `pivot_wider()`

pivot_wider():
creates separate variables from the levels of a factor variable and the values of a second variable

mtcars %>% count(cyl, am)

cyl	am	n
4	0	3
4	1	8
6	0	4
6	1	3
8	0	12
8	1	2

mtcars %>% count(cyl, am) %>%
  pivot_wider(names_from = "am", values_from = "n")

cyl	0	1
4	3	8
6	4	3
8	12	2

Tables with dplyr: A bit nicer!

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  count(cyl, am) %>% 
  pivot_wider(names_from = "am", values_from = "n") %>%
  rename("Cylinders" = "cyl")

Cylinders	Manual	Automatic
4	3	8
6	4	3
8	12	2

Task

Recreate the previous task. This time using dplyr:

Take the heights dataset from the dslabs library.
Create a cross table depicting: The percentage of females and males that are above and below a height of 170 cm.

This is a hard task …

Hint 1: ifelse(height * 2.54 <= 170, "size_a", "size_b")
Hint 2:
size_a / (size_a + size_b) * 100
size_b / (size_a + size_b) * 100
are the row proportions.

library(dslabs)

heights %>%
  mutate(category = ifelse(height * 2.54 <= 170, "size_a", "size_b")) %>% 
  count(sex, category) %>%
  pivot_wider(names_from = "category", values_from = "n") %>%
  mutate(
    "% <= 170" = size_a / (size_a + size_b) * 100,
    "% > 170" = size_b / (size_a + size_b) * 100,
    across(where(is.numeric), round, 1)
  ) %>%
  select(1, 4, 5)

sex	% <= 170	% > 170
Female	71.0	29.0
Male	19.8	80.2

Tables with dplyr: For values of a third variable

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(
    n = n(), 
    M = mean(mpg), 
    SD = sd(mpg)
  )

cyl	am	n	M	SD
4	Manual	3	22.90000	1.4525839
4	Automatic	8	28.07500	4.4838599
6	Manual	4	19.12500	1.6317169
6	Automatic	3	20.56667	0.7505553
8	Manual	12	15.05000	2.7743959
8	Automatic	2	15.40000	0.5656854

pivot_wider() can take values from several variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD")) %>%
  round(1)

cyl	n_Manual	n_Automatic	M_Manual	M_Automatic	SD_Manual	SD_Automatic
4	3	8	22.9	28.1	1.5	4.5
6	4	3	19.1	20.6	1.6	0.8
8	12	2	15.1	15.4	2.8	0.6

Set argument names_vary = "slowest" for a different ordering of variables:

mtcars %>% 
  mutate(am = factor(am, labels = c("Manual", "Automatic"))) %>%
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") %>%
  round(1)

cyl	n_Manual	M_Manual	SD_Manual	n_Automatic	M_Automatic	SD_Automatic
4	3	22.9	1.5	8	28.1	4.5
6	4	19.1	1.6	3	20.6	0.8
8	12	15.1	2.8	2	15.4	0.6

Task

Take the storms dataset and calculate n and mean of the wind speed (wind) by month (month) and storm classification (status) in a crosstable.

storms %>% 
  group_by(month, status) %>%
  summarise(
    n = n(),
    M = mean(wind, na.rm = TRUE)
  ) %>%
  pivot_wider(names_from = "status", values_from = c("n", "M")) %>%
  round()

month	n_extratropical	n_hurricane	n_other low	n_subtropical storm	n_tropical depression	n_tropical storm	n_subtropical depression	n_disturbance	n_tropical wave	M_extratropical	M_hurricane	M_other low	M_subtropical storm	M_tropical depression	M_tropical storm	M_subtropical depression	M_disturbance	M_tropical wave
1	29	5	5	6	2	23	NA	NA	NA	54	70	25	48	30	48	NA	NA	NA
4	40	NA	NA	3	1	18	4	NA	NA	38	NA	NA	43	30	43	30	NA	NA
5	18	NA	49	20	49	60	5	NA	NA	42	NA	29	40	29	42	30	NA	NA
6	130	18	84	12	213	282	35	35	NA	36	70	24	40	27	44	26	35	NA
7	135	221	181	6	399	645	11	46	7	36	81	23	35	26	46	23	30	34
8	275	1038	319	23	975	1696	36	25	55	36	86	25	40	27	45	24	31	27
9	800	2464	473	72	1331	2522	34	41	41	43	89	27	40	28	46	28	26	30
10	527	803	226	66	429	1049	22	16	NA	42	85	26	47	28	46	29	31	NA
11	183	221	85	48	149	464	4	8	8	47	82	25	47	28	48	30	24	28
12	14	33	31	42	21	71	NA	NA	NA	44	68	28	52	30	44	NA	NA	NA

Design example:

Replicate and try to understand the following table example:

library(kableExtra)
tab <- mtcars %>% 
  group_by(cyl, am) %>% 
  summarise(n = n(), M = mean(mpg), SD = sd(mpg)) %>%
  pivot_wider(names_from = "am", values_from = c("n", "M", "SD"), names_vary = "slowest") 

names(tab) <- c("Cylinders", "n", "M", "SD", "n", "M", "SD")

kable(tab, 
  caption =  "Table 1.<br>Miles per gallon by cylinders and gearshift", 
  digits = 1, 
  full_width = FALSE
) %>% 
  kable_classic() %>%
  add_header_above(c(" " = 1, "Automatic" = 3, "Manual" = 3))

Table 1.
Miles per gallon by cylinders and gearshift
	Automatic			Manual
Cylinders	n	M	SD	n	M	SD
4	3	22.9	1.5	8	28.1	4.5
6	4	19.1	1.6	3	20.6	0.8
8	12	15.1	2.8	2	15.4	0.6

Inferential statistics

Some statistical functions:

t.test(): Calculating a t-test.
wilcox.test(): Calculating Wilcox test / U-Test.
chisq.test(): Calculating a Pearson \(X^2\)-test.
Some more we will not address today:
- lm(): Regression analyses
- cor.test(): Calculating a correlation test.
- binom.test(): Binomial test.
- fisher.test(): Fisher exact test for count data.
- ks.test(): One and two sample Kolmogorov-Smirnov Tests.
- shapiro.test(): Shapiro-Wilk Normality Test.
- aov(): Analysis of variance

`chisq.test()`

Pearson’s \(X^2\) test is a very versatile test to calculate whether a distribution of observed frequencies equates expected frequencies.
A very common application is to test whether frequencies of observations in two variables are related. (E.g. number of patients that improved in a treatment vs. control group).
The chisq.test() functions takes a two dimensional frequency table and calculates a \(X^2\) test.

Example with data from an intervention study

# get an external data example
dat <- read.csv("https://goo.gl/j6lRXD")
# Create a two distribution table
tab <- table(dat$treatment, dat$improvement)
tab

             
              improved not-improved
  not-treated       26           29
  treated           35           15

# Test for non random distribution:
chisq.test(tab) # Alternatively: chisq.test(dat$treatment, dat$improvement)


    Pearson's Chi-squared test with Yates' continuity correction

data:  tab
X-squared = 4.6626, df = 1, p-value = 0.03083

Significantly more patients improved in the treatment condition (\(X^2(1)=4.7, p < .05\))

Task

Take the starwars dataset from the tidyverse package.
Use table() to get the distribution eye_color by hair_color
Apply the chisq.test() to calculate a X² test to test for an even distribution.

:-)

tab <- table(starwars$eye_color, starwars$hair_color)
chisq.test(tab)


    Pearson's Chi-squared test

data:  tab
X-squared = 185.19, df = 140, p-value = 0.006316

`t.test()`

Student’s t-test analysis whether two samples originate from the same normal distribution.
It is used to test for mean differences in two groups.
Arguments of the t.test() function:
- x and y: Each variable provides data from a samples.
- formula: If you have one vector with all data (e.g. values) and a second vector with grouping information (e.g. group) use: values ~ group.
- data: If you work with formula: The name of a data frame.
- paired: If set TRUE, the test assumes repeated measures of one sample instead of two independent samples.

Example

Does one of two drugs increase hours of sleep better?

extra: change in sleep duration, group : drug given

sleep

extra	group	ID
0.7	1	1
-1.6	1	2
-0.2	1	3
-1.2	1	4
-0.1	1	5
3.4	1	6
3.7	1	7
0.8	1	8
0.0	1	9
2.0	1	10
1.9	2	1
0.8	2	2
1.1	2	3
0.1	2	4
-0.1	2	5
4.4	2	6
5.5	2	7
1.6	2	8
4.6	2	9
3.4	2	10

Example

# Applying a T-test by providing x and y
x <- sleep$extra[sleep$group == 1]
y <- sleep$extra[sleep$group == 2]
t.test(x, y)

# Applying a T-test with a formula
t.test(extra ~ group, data = sleep)

Example


    Welch Two Sample t-test

data:  extra by group
t = -1.8608, df = 17.776, p-value = 0.07939
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.3654832  0.2054832
sample estimates:
mean in group 1 mean in group 2 
           0.75            2.33

Group 2 shows an increase in sleep length of \(\Delta M=1.58\) (\(t(17.8)=1.86, p = .08\))

`wilcox.test()`

Calculates a Wilcoxon rank sum test. Also known as Mann Whitney U-Test
This test is applied as an alternative to a t-Test when data are assumed to be non-normal distributed.
It takes the same arguments as t.test()

Task

Take the sleep dataset from the previous example.
Calculate the median and mad (Median absolute deviation) for extra for each group.
Calculate a Wilcoxon test with the sleep dataset on the effectiveness of the intervention.

:-)

library(dplyr)
sleep %>% group_by(group) %>%
  summarise(
    median = median(extra), 
    mad = mad(extra)
  )

group	median	mad
1	0.35	1.55673
2	1.75	2.44629

wilcox.test(extra~group, data = sleep)


    Wilcoxon rank sum test with continuity correction

data:  extra by group
W = 25.5, p-value = 0.06933
alternative hypothesis: true location shift is not equal to 0

Task

Install the “dslabs” package
Take the gapminder dataset. This dataset includes health and income outcomes for 184 countries from 1960 to 2016.
It includes infant_mortality, region and continent.
Calculate the mean and median infant_mortality for each continent
Calculate a t-test comparing infant_mortality between the regions Southern- and Eastern Europe.

:-)

install.packages("dslabs")

library(dslabs)
gapminder %>% group_by(continent) %>% 
  summarise(
    mean = mean(infant_mortality, na.rm = TRUE),
    median = median(infant_mortality, na.rm = TRUE)
  )

continent	mean	median
Africa	95.12395	93.40
Americas	42.88145	30.80
Asia	55.26174	43.10
Europe	15.33022	11.25
Oceania	39.10136	29.10

x <- gapminder %>% filter(region == "Southern Europe") %>% select(infant_mortality)
y <- gapminder %>% filter(region == "Eastern Europe") %>% select(infant_mortality)
t.test(x, y)

# or with pipes
gapminder %>% 
  filter(region %in% c("Southern Europe", "Eastern Europe")) %>%
  t.test(infant_mortality ~ region, data = .)


    Welch Two Sample t-test

data:  infant_mortality by region
t = 0.53658, df = 929.22, p-value = 0.5917
alternative hypothesis: true difference in means between group Eastern Europe and group Southern Europe is not equal to 0
95 percent confidence interval:
 -1.471847  2.579544
sample estimates:
 mean in group Eastern Europe mean in group Southern Europe 
                     20.36164                      19.80779

Basic statistics

Goals

Descriptive statistics

Some basic statistical functions

Task

Tables

Tables in base R

Example one dimensional

Example two dimensional

Example two dimensional with percentages by rows and columns

Example with added margins

Task

Tables with dplyr

Shortcut for simple count tables

Tables with dplyr: pivot_wider()

Tables with dplyr: A bit nicer!

Task

Tables with dplyr: For values of a third variable

Task

Design example:

Inferential statistics

Some statistical functions:

chisq.test()

Example with data from an intervention study

Task

t.test()

Example

Example

Example

wilcox.test()

Task

Task

Tables with dplyr: `pivot_wider()`

`chisq.test()`

`t.test()`

`wilcox.test()`