Splits a numeric vector into groups at specified percentiles. Depending on type, values equal to a percentile threshold are assigned either to the group above it or to the group below it. Optionally, missing values can be assigned to a separate factor level.

split_at_percentile(x, frac, labels, type = "higher", explicit_na = NA)

Arguments

x

A numeric vector.

frac

A numeric vector with percentiles (between 0 and 1) at which to split the vector. Alternatively, use character strings "median", "tertile", "quartile", "quintile", or "decile" for common splits.

labels

Vector with factor labels, one per resulting group (one more than the number of percentiles in frac).

type

"higher" will split group below fraction and last group equal above last fraction. and "lower" will split group below fraction (vs. equal and above).

explicit_na

If not NA, NAs will be recoded as a factor level of the provided name. If TRUE, the name will default to '(Missing)'.

Value

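A factor with one level per resulting group, i.e. one more level than the number of percentiles in frac, plus an additional level for missing values if explicit_na is set.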

Details

This function computes the specified percentiles of the input vector and assigns each value to a group based on these percentiles. The resulting groups are returned as a factor with the specified labels. The type parameter determines whether values equal to the percentile thresholds are included in the lower or higher group.
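The mechanism described above can be sketched in base R with quantile() and cut(). This is a minimal illustration, not the package's implementation: split_sketch() is a hypothetical name, and the mapping of type onto the right argument of cut() is an assumption based on the description of ties.

```r
# Hypothetical re-implementation for illustration only; not the package's code.
split_sketch <- function(x, frac, labels, type = "higher") {
  # Percentile thresholds, computed from the non-missing values
  breaks <- quantile(x, probs = frac, na.rm = TRUE, names = FALSE)
  # right = FALSE builds intervals [a, b): ties join the higher group.
  # right = TRUE builds intervals (a, b]: ties join the lower group.
  # Note: cut() raises an error if two thresholds coincide.
  cut(
    x,
    breaks = c(-Inf, breaks, Inf),
    labels = labels,
    right = (type == "lower")
  )
}

x <- c(1, 2, 3, 4, 5)  # the median is 3

# With "higher", the value 3 is grouped with 4 and 5
split_sketch(x, frac = 0.5, labels = c("low", "high"), type = "higher")

# With "lower", the value 3 is grouped with 1 and 2
split_sketch(x, frac = 0.5, labels = c("low", "high"), type = "lower")
```

Missing values stay NA in this sketch, because cut() returns NA for NA input.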

Common splits can be specified using character strings for the frac parameter:

  • "median": splits at the 50th percentile

  • "tertile": splits at the 33.3rd and 66.6th percentiles

  • "quartile": splits at the 25th, 50th and 75th percentiles

  • "quintile": splits at the 20th, 40th, 60th and 80th percentiles

  • "decile": splits at the 10th, 20th, ..., 90th percentiles

The labels parameter should contain one more label than the number of percentiles specified in frac, as it defines the labels for each resulting group.

If explicit_na is provided (not NA), missing values in the input vector will be recoded as a separate factor level with the specified name. If explicit_na is set to TRUE, the name will default to '(Missing)'.
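The recoding of missing values can be illustrated with base R alone. The helper below is a hypothetical sketch of the behaviour described for explicit_na, not the function's actual code.

```r
# Hypothetical helper for illustration; not part of the package.
recode_na_sketch <- function(f, explicit_na = NA) {
  # NA (the default) leaves missing values untouched
  if (is.na(explicit_na)) return(f)
  # TRUE falls back to the default level name
  na_label <- if (isTRUE(explicit_na)) "(Missing)" else as.character(explicit_na)
  levels(f) <- c(levels(f), na_label)  # append the new level last
  f[is.na(f)] <- na_label              # recode the missing entries
  f
}

f <- factor(c("low", NA, "high"), levels = c("low", "high"))

recode_na_sketch(f) |> levels()                          # unchanged
recode_na_sketch(f, explicit_na = TRUE) |> levels()      # adds "(Missing)"
recode_na_sketch(f, explicit_na = "Unknown") |> levels() # adds "Unknown"
```

Appending the new level last keeps the ordering of the substantive groups intact, which matches the table output in the examples, where (Missing) appears after the other levels.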

Examples

## Generate sample data
x <- sample(c(1:100, NA), 1000, replace = TRUE) 

## Tertile split
split_at_percentile(x, "tertile", explicit_na = TRUE) |> table()
#> 
#>       low    middle      high (Missing) 
#>       328       327       332        13 

## Quartile split with custom labels
split_at_percentile(
  x,
  frac = c(0.25, 0.5, 0.75), 
  labels = c("0-24", "25-49", "50-74", "75-100")
) |> table()
#> 
#>   0-24  25-49  50-74 75-100 
#>    239    241    257    250 

## Quintile split
split_at_percentile(x, frac = "quintile") |> table() |> prop.table() |> round(2)
#> 
#> quintile 1 quintile 2 quintile 3 quintile 4 quintile 5 
#>       0.20       0.19       0.20       0.20       0.21