Computes summary statistics for selected variables within subgroups defined by the combination of one or more grouping variables (e.g., age within sex) and merges the aggregated values back into the original data.

add_group_aggregate(
  dat,
  grouping,
  vars,
  func = list(mean = function(x) mean(x, na.rm = TRUE))
)

Arguments

dat

A data.frame containing the columns listed in grouping and vars.

grouping

A character vector of one or more column names in dat whose joint combinations define the subgroups. For example, c("sex", "age") yields aggregates for each sex-by-age subgroup.

vars

A character vector of column names in dat to be aggregated.

func

A named list of functions, each applied to every variable in vars within each subgroup. The default computes the mean with missing values removed.

Value

A data.frame with the same observations as dat, plus additional columns containing subgroup-level aggregated values for each variable in vars.

Details

Aggregation is performed using stats::aggregate() with by = dat[, grouping], so each unique combination of the grouping variables defines a subgroup. The results are joined back to dat using base::merge() on all grouping columns. Each new column is named after the variable it summarizes, suffixed with the name of the corresponding function in func (e.g., score_mean for the default). If func is an unnamed list, the suffixes "stat1", "stat2", etc. are used instead.
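
The aggregate-then-merge approach described above can be sketched roughly as follows. This is an illustrative approximation only, not the package's actual source: the name sketch_group_aggregate and the toy data frame are hypothetical, and the handling of column names and of single-column grouping (drop = FALSE) are assumptions.

```r
# Hypothetical sketch of the aggregate-then-merge idea; NOT the real
# implementation of add_group_aggregate().
sketch_group_aggregate <- function(
    dat, grouping, vars,
    func = list(mean = function(x) mean(x, na.rm = TRUE))) {
  # Assumption: a completely unnamed list falls back to stat1, stat2, ...
  if (is.null(names(func))) {
    names(func) <- paste0("stat", seq_along(func))
  }
  out <- dat
  for (nm in names(func)) {
    # drop = FALSE keeps data frames even when a single column is selected,
    # because stats::aggregate() needs `by` to be a list
    agg <- stats::aggregate(
      dat[, vars, drop = FALSE],
      by = dat[, grouping, drop = FALSE],
      FUN = func[[nm]]
    )
    # Rename the aggregated columns to <variable>_<function name>
    names(agg)[match(vars, names(agg))] <- paste(vars, nm, sep = "_")
    # Join the subgroup-level values back onto every row
    out <- merge(out, agg, by = grouping)
  }
  out
}

# Toy data (hypothetical) with one missing score
toy <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  age = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7),
  other = 1:5
)
res <- sketch_group_aggregate(toy, grouping = c("sex", "age"), vars = "score")
print(res)
```

Note that base::merge() sorts the result on the by columns by default (sort = TRUE), so in a sketch like this the row order of the result may differ from that of the input.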

Examples

dat <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  age = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7),
  other = 1:5
)

# Mean score per subgroup (sex x age), added back to each row
add_group_aggregate(dat, grouping = c("sex", "age"), vars = "score")
#>   sex age score other score_mean
#> 1   f  10     1     1          1
#> 2   f  10    NA     2          1
#> 3   m  10     3     3          3
#> 4   m  12     5     4          6
#> 5   m  12     7     5          6

# Maximum and median of score and other per subgroup (sex x age)
add_group_aggregate(
  dat,
  grouping = c("sex", "age"),
  vars = c("score", "other"),
  func = list(
    max = function(x) max(x, na.rm = TRUE),
    median = function(x) median(x, na.rm = TRUE)
  )
)
#>   sex age score other score_max other_max score_median other_median
#> 1   f  10     1     1         1         2            1          1.5
#> 2   f  10    NA     2         1         2            1          1.5
#> 3   m  10     3     3         3         3            3          3.0
#> 4   m  12     5     4         7         5            6          4.5
#> 5   m  12     7     5         7         5            6          4.5