In healthcare, we are often faced with high-cardinality variables, where each observation may have zero, one, or more levels, e.g. medications for a model at the patient grain. In these cases, creating a feature variable for each level (each medication), as in one-hot encoding, can be prohibitively computationally intensive and can hurt performance by diminishing the signal-to-noise ratio. get_best_levels identifies a subset of categories that are likely to be valuable features, and add_best_levels adds them to a model data frame.

get_best_levels finds levels of groups that are likely to be useful predictors in d and returns them as a character vector. add_best_levels does the same and adds them, pivoted, to d. The function attempts to find both positive and negative predictors of outcome.

add_best_levels stores the identified best levels and passes them through model training so that in deployment, the same columns created in training are again created (see the final example).

add_best_levels accepts arguments to pivot so that values associated with the levels (e.g. doses of medications) can be used in the new features. However, note that these values are not used in determining the best levels. That is, get_best_levels determines which levels are likely to be good predictors looking only at outcomes where the levels are present or absent; it does not use fill or fun in this determination. See details for more info about how levels are selected.

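add_best_levels(
  d,
  longsheet,
  id,
  groups,
  outcome,
  n_levels = 100,
  min_obs = 1,
  positive_class = "Y",
  cohesion_weight = 2,
  levels = NULL,
  fill,
  fun = sum,
  missing_fill = NA
)

get_best_levels(
  d,
  longsheet,
  id,
  groups,
  outcome,
  n_levels = 100,
  min_obs = 1,
  positive_class = "Y",
  cohesion_weight = 2
)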



d: Data frame to use in models, at desired grain. Has id and outcome.

longsheet: Data frame containing multiple observations per grain. Has id and groups.

id: Name of identifier column, unquoted. Must be present and identical in both tables.

groups: Name of grouping column, unquoted.

outcome: Name of outcome column, unquoted.

n_levels: Number of levels to return, default = 100. An attempt is made to return half levels positively associated with the outcome and half negatively. If n_levels is greater than the number present, all levels will be returned.

min_obs: Minimum number of observations a level must be found in to be considered. Defaults to one, but larger values are often useful because a level present in only a few observations will rarely be a useful predictor.

positive_class: If classification model, the positive class of the outcome, default = "Y"; ignored if regression.

cohesion_weight: For classification problems only, how much to value a level being consistently associated with an outcome relative to its being present in many observations. Default = 2; equal weight is 1. Note that this is a parameter that could potentially be tuned over.

levels: Use this argument when add_best_levels was used in training and you want to add the same columns for deployment. You can pass the model trained on the data frame from add_best_levels, the data frame from add_best_levels, or a character vector of levels to add.

fill: Passed to pivot. Column to be used to fill the values of cells in the output, perhaps after aggregation by fun. If fill is not provided, counts will be used, as though a fill column of 1s had been provided.

fun: Passed to pivot. Function for aggregation, defaults to sum. Custom functions can be used with the same syntax as the apply family of functions, e.g. fun = function(x) some_function(another_fun(x)).

missing_fill: Passed to pivot. Value to fill for combinations of grain and spread that are not present. Defaults to NA, but 0 may be useful as well.
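The interaction of fill, fun, and missing_fill can be illustrated without the package. The following is a minimal base-R sketch of the same idea, not the implementation of pivot; the meds table and the names wide, wide_zero, and counts are hypothetical and exist only for this illustration.

```r
# Hypothetical long table: one row per drug administration
meds <- data.frame(
  patient = c("Z1", "Z1", "Z2", "Z2", "Z3"),
  drug    = c("Ibuprofen", "Ibuprofen", "Ibuprofen", "Paclitaxel", "Paclitaxel"),
  dose    = c(100, 250, 100, 250, 100)
)

# Analogous to fill = dose, fun = sum: total dose per patient-drug pair.
# tapply() returns NA for pairs that never occur, similar in spirit to
# the default missing_fill = NA.
wide <- tapply(meds$dose, list(meds$patient, meds$drug), sum)

# Analogous to missing_fill = 0: absent pairs become a dose of zero
wide_zero <- wide
wide_zero[is.na(wide_zero)] <- 0

# Analogous to omitting fill: counting rows is the same as summing a
# column of 1s for each patient-drug pair
counts <- table(meds$patient, meds$drug)
```

Whether NA or 0 is the better fill depends on the feature: a total dose of 0 is a meaningful value for a patient who never received the drug, whereas NA leaves the handling of those cells to later steps.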


For add_best_levels, d with new columns for the best levels added and a best_levels attribute containing a named list of levels added. For get_best_levels, a character vector of the best levels.


Here is how get_best_levels determines the levels of groups that are likely to be good predictors.

  • For regression: For each group, the difference between the group mean and the grand mean is divided by the standard error of the group mean (i.e. centered_mean(group) / sqrt(var(group) / n(group))), and the groups with the largest absolute values of that statistic are retained.

  • For classification: For each group, two "log-loss-like" statistics are calculated. One is the log of the fraction of observations in which the group does not appear, which captures how ubiquitous the group is: more common groups are more useful as predictors. The other captures how far the group is from being always associated with the same outcome: groups that are consistently associated with either outcome are more useful as predictors. This is calculated as the log of the proportion of outcomes that are not all the same outcome (e.g. if 4/5 observations are positive class, this statistic is log(.2)). This value is then raised to the cohesion_weight power. To ensure retention of both positive and negative predictors, the all-same-outcome that is used as the comparison is determined by which side of the median proportion of positive_class the group falls on.
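The statistics described above can be sketched in base R. This is an illustration under stated assumptions, not the package's code: the toy data and the names d, long, joined, stat, ranked, present_outcomes, and n_total are all hypothetical. How the two classification statistics are combined after applying cohesion_weight is not spelled out here, so the sketch stops at computing them.

```r
# Regression: hypothetical outcomes (y) and a long table of id-level pairs
d <- data.frame(id = 1:6, y = c(10, 12, 11, 20, 22, 15))
long <- data.frame(
  id    = c(1, 2, 3, 4, 5, 5, 6),
  level = c("A", "A", "A", "B", "B", "A", "B")
)

# Attach each observation's outcome to every level it has
joined <- merge(unique(long), d, by = "id")
grand_mean <- mean(d$y)

# (group mean - grand mean) / sqrt(var(group) / n(group)) for each level
stat <- sapply(split(joined$y, joined$level), function(y) {
  (mean(y) - grand_mean) / sqrt(var(y) / length(y))
})

# Levels ordered by the absolute value of the statistic, largest first
ranked <- names(sort(abs(stat), decreasing = TRUE))

# Classification: the two log statistics for one hypothetical level that
# appears in 5 of 20 observations, 4 of which are the positive class.
# Assumes the positive class is the comparison outcome for this level.
present_outcomes <- c("Y", "Y", "Y", "Y", "N")
n_total <- 20
log_absent <- log(1 - length(present_outcomes) / n_total)   # log(0.75)
log_not_cohesive <- log(1 - mean(present_outcomes == "Y"))  # log(0.2)
```

In the regression part, level B ranks above level A because its mean is further from the grand mean relative to its standard error.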



set.seed(45796)
# We have two tables we want to use in our models:
# - df is the model table. It has the outcomes (survived), and we want one
#   prediction for each row in df
# - meds has detailed information on each row (patient) in df. Each patient
#   may have zero, one, or more observations (drugs) in meds, and meds may
#   have associated values (doses).
df <- tibble::tibble(
  patient = paste0("Z", sample(10, 5)),
  age = sample(20:80, 5),
  survived = sample(c("N", "Y"), 5, replace = TRUE, prob = c(1, 2))
)
meds <- tibble::tibble(
  patient = sample(df$patient, 10, replace = TRUE),
  drug = sample(c("Quinapril", "Vancomycin", "Ibuprofen",
                  "Paclitaxel", "Epinephrine", "Dexamethasone"),
                10, replace = TRUE),
  dose = sample(c(100, 250), 10, replace = TRUE)
)

# Identify three drugs likely to be good predictors of survival
get_best_levels(d = df,
                longsheet = meds,
                id = patient,
                groups = drug,
                outcome = survived,
                n_levels = 3)
#> [1] "Epinephrine" "Ibuprofen"   "Paclitaxel"

# Identify four drugs likely to make good features and add them to df.
# The "fill", "fun", and "missing_fill" arguments are passed to
# `pivot`, which allows us to use the total doses of each drug given to the
# patient as our new features
new_df <- add_best_levels(d = df,
                          longsheet = meds,
                          id = patient,
                          groups = drug,
                          outcome = survived,
                          n_levels = 4,
                          fill = dose,
                          fun = sum,
                          missing_fill = 0)
new_df
#> # A tibble: 5 x 7
#>   patient   age survived drug_Dexamethas… drug_Epinephrine drug_Ibuprofen
#>   <chr>   <int> <chr>               <dbl>            <dbl>          <dbl>
#> 1 Z10        23 N                       0                0              0
#> 2 Z2         20 N                     350              100              0
#> 3 Z6         59 Y                       0                0            100
#> 4 Z7         52 Y                     250                0            100
#> 5 Z8         53 N                       0                0              0
#> # … with 1 more variable: drug_Paclitaxel <dbl>

# The names of the medications that were added to df in new_df are stored in the
# best_levels attribute of new_df so that the same columns can be added in
# deployment. This is useful because you need to have the same columns to make
# predictions as you had in model training. When you are ready to add levels to
# a deployment data frame, you can pass to the "levels" argument of
# add_best_levels either the models trained on new_df, new_df itself, or the
# character vector of levels to add.
deployment_df <- tibble::tibble(
  patient = "p6",
  age = 30
)
deployment_meds <- tibble::tibble(
  patient = rep("p6", 2),
  drug = rep("Vancomycin", 2),
  dose = c(100, 250)
)

# Now, even though Vancomycin is the only drug that appears in
# deployment_meds, because we pass new_df to "levels", we get all the columns
# needed to make predictions on a model trained on new_df
add_best_levels(d = deployment_df,
                longsheet = deployment_meds,
                id = patient,
                groups = drug,
                levels = new_df,
                fill = dose,
                missing_fill = 0)
#> # A tibble: 1 x 6
#>   patient   age drug_Epinephrine drug_Ibuprofen drug_Paclitaxel drug_Dexamethas…
#>   <chr>   <dbl>            <dbl>          <dbl>           <dbl>            <dbl>
#> 1 p6         30                0              0               0                0