Computes summary statistics for selected variables within subgroups defined by the joint combinations of one or more grouping variables (e.g., age within sex) and adds the aggregated values back to every row of the original data.

add_group_aggregate(
  dat,
  grouping,
  vars,
  func = list(mean = function(x) mean(x, na.rm = TRUE))
)

Arguments

dat

A data.frame containing the columns listed in grouping and vars.

grouping

A character vector of one or more column names in dat that define the subgroups (their joint combinations). For example, c("sex", "age") yields aggregates for each sex-by-age subgroup.

vars

A character vector of column names in dat to be aggregated.

func

A list of functions, ideally named, each of which is applied to every variable in vars within each subgroup. The names are used as column suffixes. The default computes the mean with missing values removed.

Value

A data.frame with the same observations as dat, plus additional columns containing subgroup-level aggregated values for each variable in vars. The new columns are named using the pattern <var>_<func_name>, where <var> is the original variable name and <func_name> is the name of the aggregation function.

Details

This implementation avoids repeated merge() calls (which can lead to duplicated columns and ordering issues) by computing a stable subgroup key and indexing results back to the original rows.
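The key-and-index idea can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's actual code: the use of tapply(), the default interaction() separator, and the names key and group_means are chosen only for the example.

```r
# Hypothetical sketch of "compute a subgroup key, then index back".
dat <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  age = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7)
)

# One key per row, built from the joint values of the grouping columns.
key <- interaction(dat[c("sex", "age")], drop = TRUE)

# Aggregate once per subgroup; results are ordered by the key's levels.
group_means <- tapply(dat$score, key, function(x) mean(x, na.rm = TRUE))

# Index the per-group values back onto the rows via the integer codes
# of the key. Row order and row count of dat are left untouched.
dat$score_mean <- as.vector(group_means)[as.integer(key)]
dat$score_mean
#> [1] 1 1 3 6 6
```

Because nothing is merged, no rows are reordered and no duplicate columns can appear; each row simply looks up the value computed for its own subgroup.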

If all grouping variables are NA for a given row, the resulting aggregated columns for that row will also be NA. Missing values in the variables to be aggregated are handled by the functions provided in func (e.g., using na.rm = TRUE within those functions).
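The following base R snippet illustrates both kinds of missingness. It is a sketch only: it shows how interaction() and common summary functions treat NA in general, and it is an assumption that add_group_aggregate() behaves identically in every edge case.

```r
# Missing grouping values: base R's interaction() returns NA for a row
# as soon as one of its components is NA.
sex <- c("f", NA, NA)
age <- c(10, 10, NA)
key <- interaction(sex, age, drop = TRUE)
is.na(key)
#> [1] FALSE  TRUE  TRUE

# Missing values in the aggregated variable: the supplied function decides.
x <- c(1, NA)
mean(x)                # NA, because na.rm defaults to FALSE
mean(x, na.rm = TRUE)  # 1

# A subgroup that contains only NA is a separate edge case.
all_na <- NA_real_
mean(all_na, na.rm = TRUE)                   # NaN
suppressWarnings(max(all_na, na.rm = TRUE))  # -Inf
```

If NaN or -Inf would be unwelcome downstream, one option is to wrap the function in func so that it returns NA when all values in the subgroup are missing.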

Each new column is suffixed with the name of the corresponding function in func, so supplying several functions yields one new column per variable and function. If func is an unnamed list, the suffixes "stat1", "stat2", etc. are used instead.
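The naming rule can be sketched as below. The helper make_names() is hypothetical and exists only for this illustration; in particular, how the function treats a partially named list is not documented here and is not covered by the sketch.

```r
# Hypothetical helper that reproduces the documented <var>_<func_name> pattern.
make_names <- function(vars, func) {
  suffix <- names(func)
  # Fall back to stat1, stat2, ... when the list carries no names at all.
  if (is.null(suffix)) suffix <- paste0("stat", seq_along(func))
  # Variables vary slowest, functions fastest, as in the examples.
  paste0(rep(vars, each = length(suffix)), "_", suffix)
}

named <- list(max = max, median = median)
unnamed <- list(max, median)

make_names(c("score", "other"), named)
#> [1] "score_max"    "score_median" "other_max"    "other_median"
make_names(c("score", "other"), unnamed)
#> [1] "score_stat1" "score_stat2" "other_stat1" "other_stat2"
```

Naming the list explicitly keeps the resulting column names self-explanatory.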

Warning

If values of the grouping variables contain control characters (e.g., line breaks, carriage returns, tabs), the function may not work as intended, because it uses interaction() with a separator to build the subgroup keys and such a value could be confused with the separator. Make sure that the grouping variables do not contain such characters before calling the function.
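A quick pre-check along these lines can flag problematic values before the call. The helper has_special() is hypothetical and not part of the package; it only tests for the three characters named above.

```r
# Hypothetical helper: TRUE if any value contains CR, LF, or a tab.
has_special <- function(x) {
  any(grepl("[\r\n\t]", as.character(x)))
}

dat <- data.frame(
  sex = c("f", "m\n"),  # second value carries a trailing line break
  age = c(10, 12)
)

# Check every grouping column; NA values are treated as unproblematic.
flagged <- vapply(dat[c("sex", "age")], has_special, logical(1))
flagged
#>   sex   age
#>  TRUE FALSE
```

Columns flagged as TRUE could be cleaned first, for example with trimws() or gsub(), depending on what the stray characters mean in the data.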

Author

Juergen Wilbert

Examples

dat <- data.frame(
  sex = c("f", "f", "m", "m", "m"),
  age = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7),
  other = 1:5
)

# Mean score per subgroup (sex x age), added back to each row
add_group_aggregate(dat, grouping = c("sex", "age"), vars = "score")
#>   sex age score other score_mean
#> 1   f  10     1     1          1
#> 2   f  10    NA     2          1
#> 3   m  10     3     3          3
#> 4   m  12     5     4          6
#> 5   m  12     7     5          6

# Maximum and median per subgroup
add_group_aggregate(
  dat,
  grouping = c("sex", "age"),
  vars = c("score", "other"),
  func = list(
    max = function(x) max(x, na.rm = TRUE),
    median = function(x) median(x, na.rm = TRUE)
  )
)
#>   sex age score other score_max score_median other_max other_median
#> 1   f  10     1     1         1            1         2          1.5
#> 2   f  10    NA     2         1            1         2          1.5
#> 3   m  10     3     3         3            3         3          3.0
#> 4   m  12     5     4         7            6         5          4.5
#> 5   m  12     7     5         7            6         5          4.5