Computes summary statistics for selected variables within subgroups defined
by the combination of one or more grouping variables (e.g., age within
sex) and merges the aggregated values back into the original data.
dat: A data.frame containing the columns listed in grouping and
vars.
grouping: A character vector of one or more column names in dat that
define the subgroups (their joint combinations). For example, c("sex", "age") yields aggregates for each sex-by-age subgroup.
vars: A character vector of column names in dat to be aggregated.
func: A named list of functions, each applied to every variable in vars
within each subgroup. The default computes the mean with missing values
removed.
A data.frame with the same observations as dat, plus additional
columns containing subgroup-level aggregated values for each variable in
vars. The new columns are named using the pattern
<var>_<func_name>, where <var> is the original variable name and
<func_name> is the name of the aggregation function.
This implementation avoids repeated merge() calls (which can lead to
duplicated columns and ordering issues) by computing a stable subgroup key and
indexing results back to the original rows.
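The key-and-index idea can be sketched in base R as follows. This is a minimal illustration, not the function's actual source: the helper name group_mean_sketch, the tab separator, and the restriction to a single variable and the mean are all assumptions made for the example.

```r
# Hypothetical helper (not the package's code): compute one aggregate per
# subgroup, then map it back to the rows by key lookup instead of merge().
group_mean_sketch <- function(dat, grouping, var) {
  # One key per row; drop = TRUE keeps only combinations that actually occur.
  # The tab separator is an assumption for this sketch.
  key <- interaction(dat[grouping], drop = TRUE, sep = "\t")
  # One value per subgroup, named by key level.
  agg <- vapply(
    split(dat[[var]], key),
    function(x) mean(x, na.rm = TRUE),
    numeric(1)
  )
  # Look up each row's key: row order and row count of dat are unchanged.
  dat[[paste0(var, "_mean")]] <- unname(agg[as.character(key)])
  dat
}

dat <- data.frame(
  sex   = c("f", "f", "m", "m", "m"),
  age   = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7)
)
res <- group_mean_sketch(dat, c("sex", "age"), "score")
res$score_mean
#> [1] 1 1 3 6 6
```

Because the result is produced by indexing rather than joining, no rows are reordered or duplicated and no suffixed duplicate columns can appear.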
If all grouping variables are NA for a given row, the resulting aggregated
columns for that row will also be NA.
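The following base R sketch shows why a missing key leads to a missing aggregate when results are looked up by key. It illustrates the mechanism under the assumption that keys come from interaction(), and it is not the function's own code; the vectors are made up for the example.

```r
sex   <- c("f", "f", NA)
age   <- c(10, 10, NA)
score <- c(2, 4, 9)

# The third row has no usable key.
key <- interaction(sex, age, drop = TRUE)
is.na(key)
#> [1] FALSE FALSE  TRUE

# split() drops rows with an NA key, so no subgroup value exists for them,
# and looking up an NA key returns NA.
agg <- vapply(split(score, key), mean, numeric(1))
out <- unname(agg[as.character(key)])
out
#> [1]  3  3 NA
```

Note that base R's interaction() also returns NA when only some of its components are missing.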
Missing values in the variables to be aggregated are handled by the
functions provided in func (e.g., using na.rm = TRUE within those
functions).
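As a small illustration of why the supplied functions need to deal with missing values themselves, compare the two calls to mean() below. The n_obs entry is a hypothetical extra function added only for this example.

```r
x <- c(1, NA, 3)

mean(x)                # NA propagates by default
#> [1] NA
mean(x, na.rm = TRUE)  # missing values removed before averaging
#> [1] 2

# A func-style list in which every function handles NA explicitly.
func <- list(
  mean  = function(x) mean(x, na.rm = TRUE),
  n_obs = function(x) as.numeric(sum(!is.na(x)))  # hypothetical example entry
)
stats <- vapply(func, function(f) f(x), numeric(1))
stats
#>  mean n_obs 
#>     2     2 
```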
If multiple functions are provided in func, the
resulting columns are suffixed with the names of the functions in func. If
func is an unnamed list, suffixes "stat1", "stat2", etc. are used.
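The naming rule can be sketched as below. The helper make_names_sketch is hypothetical and only mirrors the documented pattern; how the real function treats a partially named list is not stated here, so the sketch handles only fully named and fully unnamed lists.

```r
# Hypothetical helper: build <var>_<func_name> column names, falling back to
# "stat1", "stat2", ... when func has no names at all.
make_names_sketch <- function(vars, func) {
  fn <- names(func)
  if (is.null(fn)) {
    fn <- paste0("stat", seq_along(func))
  }
  # All functions for the first variable, then all for the second, and so on.
  paste(rep(vars, each = length(fn)), rep(fn, times = length(vars)), sep = "_")
}

named   <- make_names_sketch(c("score", "other"), list(max = max, median = median))
unnamed <- make_names_sketch("score", list(mean, sd))
named
#> [1] "score_max"    "score_median" "other_max"    "other_median"
unnamed
#> [1] "score_stat1" "score_stat2"
```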
If the grouping variables contain special characters (e.g., line breaks,
carriage returns, tabs), the function may not work as intended, since it
uses interaction() with a separator to create subgroup keys. Ensure that
grouping variable values do not contain such characters.
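One way to check the grouping columns before calling the function is sketched below. The helper has_control_chars is hypothetical and tests only for the three characters named above.

```r
# Hypothetical pre-check: flag grouping columns whose values contain a line
# break, carriage return, or tab.
has_control_chars <- function(dat, grouping) {
  vapply(
    dat[grouping],
    function(col) any(grepl("[\n\r\t]", as.character(col))),
    logical(1)
  )
}

bad <- data.frame(
  sex = c("f", "m\tx"),  # second value contains a tab
  age = c(10, 12)
)
flags <- has_control_chars(bad, c("sex", "age"))
flags
#>   sex   age 
#>  TRUE FALSE 
```

Columns flagged TRUE could be cleaned (for example with gsub()) before they are used as grouping variables.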
dat <- data.frame(
  sex   = c("f", "f", "m", "m", "m"),
  age   = c(10, 10, 10, 12, 12),
  score = c(1, NA, 3, 5, 7),
  other = 1:5
)
# Mean score per subgroup (sex x age), added back to each row
add_group_aggregate(dat, grouping = c("sex", "age"), vars = "score")
#>   sex age score other score_mean
#> 1   f  10     1     1          1
#> 2   f  10    NA     2          1
#> 3   m  10     3     3          3
#> 4   m  12     5     4          6
#> 5   m  12     7     5          6
# Maximum and median per subgroup
add_group_aggregate(
  dat,
  grouping = c("sex", "age"),
  vars = c("score", "other"),
  func = list(
    max = function(x) max(x, na.rm = TRUE),
    median = function(x) median(x, na.rm = TRUE)
  )
)
#>   sex age score other score_max score_median other_max other_median
#> 1   f  10     1     1         1            1         2          1.5
#> 2   f  10    NA     2         1            1         2          1.5
#> 3   m  10     3     3         3            3         3          3.0
#> 4   m  12     5     4         7            6         5          4.5
#> 5   m  12     7     5         7            6         5          4.5