In healthcare, we are often faced with high-cardinality
variables, where each observation may have zero, one, or more levels, e.g.
medications for a model at the patient grain. In these cases, creating a
feature variable for each level (each medication) as in one-hot encoding
can be prohibitively computationally intensive and can hurt performance by
diminishing the signal-to-noise ratio.
get_best_levels identifies a
subset of categories that are likely to be valuable features, and
add_best_levels adds them to a model data frame.
get_best_levels finds levels of
groups that are likely to be
useful predictors in
d and returns them as a character vector.
add_best_levels does the same and adds them, pivoted, to d.
The function attempts to find both positive and negative predictors of outcome.
add_best_levels stores the identified best levels and passes them
through model training so that in deployment, the same columns created in
training are again created (see the final example).
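The idea of recreating the training columns at deployment can be sketched in base R. This is only an illustration, not the package's implementation: `train_levels` and the toy `deployment_meds` table are hypothetical. Fixing the factor levels before tabulating yields one column per training level, even for levels that never appear in the new data.

```r
# Hypothetical levels identified during training
train_levels <- c("Dexamethasone", "Epinephrine", "Paclitaxel", "Vancomycin")

# Hypothetical deployment data: only one of the training levels appears
deployment_meds <- data.frame(
  patient = c("p6", "p6"),
  drug = c("Vancomycin", "Vancomycin"),
  stringsAsFactors = FALSE
)

# Fixing the factor levels forces a column for every training level
counts <- table(
  patient = deployment_meds$patient,
  drug = factor(deployment_meds$drug, levels = train_levels)
)

# Convert the contingency table to a wide data frame, one row per patient
wide <- as.data.frame.matrix(counts)
names(wide) <- paste0("drug_", names(wide))
wide
```

Levels absent from the deployment data still get a column, filled with zero counts, so the column set matches what a model saw in training.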
add_best_levels accepts arguments to
pivot so that
values associated with the levels (e.g. doses of medications) can be used
in the new features. However, note that these are not used in determining
the best levels. I.e.
get_best_levels determines which levels are
likely to be good predictors looking only at outcomes where the levels are
present or absent; it does not use
fill or fun in this determination. See
details for more info about how levels are selected.
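To make the role of the pivot arguments concrete, here is a minimal base-R sketch of turning a long table into wide features by summing the values in each cell. It does not call the package's `pivot`; the `meds` table below is hypothetical, `sum` stands in for `fun`, and the zero replacement stands in for `missing_fill`.

```r
# Hypothetical long table: several rows per patient
meds <- data.frame(
  patient = c("Z3", "Z8", "Z8", "Z8", "Z6", "Z6"),
  drug = c("Paclitaxel", "Paclitaxel", "Paclitaxel", "Dexamethasone",
           "Epinephrine", "Paclitaxel"),
  dose = c(100, 250, 250, 100, 250, 250),
  stringsAsFactors = FALSE
)

# Sum doses within each patient-drug cell (the role played by `fun`)
wide <- tapply(meds$dose, list(meds$patient, meds$drug), sum)

# Patient-drug pairs that never occur come back as NA;
# replace them with a chosen value (the role played by `missing_fill`)
wide[is.na(wide)] <- 0
wide
```

The values only shape the resulting features. Which levels become columns is decided separately, from presence or absence alone.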
add_best_levels(d, longsheet, id, groups, outcome, n_levels = 100,
  min_obs = 1, positive_class = "Y", cohesion_weight = 2, levels = NULL,
  fill, fun = sum, missing_fill = NA)

get_best_levels(d, longsheet, id, groups, outcome, n_levels = 100,
  min_obs = 1, positive_class = "Y", cohesion_weight = 2)
d: Data frame to use in models, at desired grain. Has id and outcome
longsheet: Data frame containing multiple observations per grain. Has id and groups
id: Name of identifier column, unquoted. Must be present and identical in both tables
groups: Name of grouping column, unquoted
outcome: Name of outcome column, unquoted
n_levels: Number of levels to return, default = 100. An attempt is made to return half of the levels positively associated with the outcome and half negatively associated. If n_levels is greater than the number of levels present, all levels will be returned
min_obs: Minimum number of observations a level must be found in to be considered. Defaults to one, but larger values are often useful because a level present in only a few observations will rarely be useful.
positive_class: If classification model, the positive class of the outcome, default = "Y"; ignored if regression
cohesion_weight: For classification problems only, how much to value a level being consistently associated with an outcome relative to its being present in many observations. Default = 2; equal weight is 1. Note that this is a parameter that could potentially be tuned over.
levels: Use this argument when add_best_levels was used in training and you want to add the same columns for deployment. You can pass the model trained on the data frame from add_best_levels, that data frame itself, or a character vector of the levels to add.

add_best_levels returns d with new columns for the best levels added and a best_levels attribute containing a named list of levels added.
get_best_levels returns a character vector of the best levels.
Here is how
get_best_levels determines the levels of
groups that are likely to be good predictors.
For regression: For each group, the difference between the group mean and the grand mean is divided by the standard error of the group mean (i.e. centered_mean(group) / sqrt(var(group) / n(group))), and the groups with the largest absolute values of that statistic are retained.
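The regression statistic described above can be sketched in a few lines of base R. This is a toy illustration under stated assumptions, not the package's code: the `obs` table is hypothetical, and the grand mean is taken over its rows.

```r
# Hypothetical data: one row per observation-level pair, with a numeric outcome
obs <- data.frame(
  level = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  outcome = c(10, 12, 11, 3, 5, 4, 7, 9, 8)
)

# Assumption: the grand mean is computed over all rows of obs
grand_mean <- mean(obs$outcome)

# Centered group mean divided by sqrt(var(group) / n(group))
level_stat <- sapply(split(obs$outcome, obs$level), function(y) {
  (mean(y) - grand_mean) / sqrt(var(y) / length(y))
})

# Levels ordered by the absolute value of the statistic, largest first
ranked <- names(sort(abs(level_stat), decreasing = TRUE))
ranked
```

Level "c" sits close to the grand mean, so it ranks last. Levels "a" and "b" are far above and below it, so both would be kept first, one as a positive and one as a negative predictor.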
For classification: For each group, two "log-loss-like" statistics are
calculated. One is the log of the fraction of observations in which the
group does not appear, which captures how ubiquitous the group is: more
common groups are more useful as predictors. The other captures how far the
group is from being always associated with the same outcome: groups that
are consistently associated with either outcome are more useful as
predictors. This is calculated as the log of the proportion of outcomes
that are not all the same outcome (e.g. if 4/5 observations are positive
class, this statistic is log(.2)). This value is then raised to the
cohesion_weight power. To ensure that both positive and
negative predictors are retained, the all-same-outcome that is used as the comparison is
determined by which side of the median proportion of positive_class the
group falls on.
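The two classification statistics can be computed by hand as follows. This is a rough sketch under several assumptions, not the package's implementation: the tables `d` and `longsheet` are hypothetical toy data, ties at the median are assigned to the negative side, and the two statistics are left uncombined because the description above does not say how they are merged into one score.

```r
# Hypothetical model table: ten patients, six with the positive class "Y"
d <- data.frame(
  patient = paste0("P", 1:10),
  outcome = rep(c("Y", "N"), times = c(6, 4)),
  stringsAsFactors = FALSE
)
# Hypothetical long table: level A in five patients, level B in four
longsheet <- data.frame(
  patient = c("P1", "P2", "P3", "P4", "P7", "P5", "P7", "P8", "P9"),
  level = rep(c("A", "B"), times = c(5, 4)),
  stringsAsFactors = FALSE
)
cohesion_weight <- 2

# One row per patient-level pair, with the patient's outcome attached
joined <- unique(merge(longsheet, d, by = "patient"))
by_level <- split(joined$outcome, joined$level)

# Statistic 1: log of the fraction of observations in which the level is absent
present_in <- sapply(by_level, length)
log_absent <- log(1 - present_in / nrow(d))

# Statistic 2: log of the proportion of outcomes that break the
# "all the same outcome" pattern, raised to the cohesion_weight power.
# Assumption: levels above the median fraction positive are compared with
# "all positive", the rest with "all negative".
frac_pos <- sapply(by_level, function(y) mean(y == "Y"))
positive_side <- frac_pos > median(frac_pos)
off_outcome <- ifelse(positive_side, 1 - frac_pos, frac_pos)
cohesion <- log(off_outcome) ^ cohesion_weight

rbind(log_absent, cohesion)
```

Level A is positive in 4 of 5 patients, so its second statistic is log(.2) squared, matching the example in the text. A level that is perfectly consistent, or present in every observation, would give log(0) = -Inf here, so a practical version needs a guard against that.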
# get_best_levels() and add_best_levels() come from the healthcareai package
library(healthcareai)
set.seed(45796)

# We have two tables we want to use in our models:
# - df is the model table. It has the outcomes (survived), and we want one
#   prediction for each row in df
# - meds has detailed information on each row (patient) in df. Each patient
#   may have zero, one, or more observations (drugs) in meds, and meds may
#   have associated values (doses).
df <- tibble::tibble(
  patient = paste0("Z", sample(10, 5)),
  age = sample(20:80, 5),
  survived = sample(c("N", "Y"), 5, replace = TRUE, prob = c(1, 2))
)
meds <- tibble::tibble(
  patient = sample(df$patient, 10, replace = TRUE),
  drug = sample(c("Quinapril", "Vancomycin", "Ibuprofen",
                  "Paclitaxel", "Epinephrine", "Dexamethasone"),
                10, replace = TRUE),
  dose = sample(c(100, 250), 10, replace = TRUE)
)

# Identify three drugs likely to be good predictors of survival
get_best_levels(d = df,
                longsheet = meds,
                id = patient,
                groups = drug,
                outcome = survived,
                n_levels = 3)
#> [1] "Epinephrine" "Vancomycin"  "Paclitaxel"

# Identify four drugs likely to make good features and add them to df.
# The "fill", "fun", and "missing_fill" arguments are passed to
# `pivot`, which allows us to use the total doses of each drug given to the
# patient as our new features
new_df <- add_best_levels(d = df,
                          longsheet = meds,
                          id = patient,
                          groups = drug,
                          outcome = survived,
                          n_levels = 4,
                          fill = dose,
                          fun = sum,
                          missing_fill = 0)
new_df
#> # A tibble: 5 x 7
#>   patient   age survived drug_Dexamethas… drug_Epinephrine drug_Paclitaxel
#>   <chr>   <int> <chr>               <dbl>            <dbl>           <dbl>
#> 1 Z5         49 Y                       0                0               0
#> 2 Z7         58 Y                       0                0               0
#> 3 Z3         22 Y                       0                0             100
#> 4 Z8         57 Y                     100                0             600
#> 5 Z6         40 N                       0              250             250
#> # ... with 1 more variable: drug_Vancomycin <dbl>

# The names of the medications that were added to df in new_df are stored in the
# best_levels attribute of new_df so that the same columns can be added in
# deployment. This is useful because you need to have the same columns to make
# predictions as you had in model training. When you are ready to add levels to
# a deployment data frame, you can pass to the "levels" argument of
# add_best_levels either the models trained on new_df, new_df itself, or the
# character vector of levels to add.
deployment_df <- tibble::tibble(
  patient = "p6",
  age = 30
)
deployment_meds <- tibble::tibble(
  patient = rep("p6", 2),
  drug = rep("Vancomycin", 2),
  dose = c(100, 250)
)

# Now, even though Vancomycin is the only drug that appears in
# deployment_meds, because we pass new_df to "levels", we get all the columns
# needed to make predictions on a model trained on new_df
add_best_levels(d = deployment_df,
                longsheet = deployment_meds,
                id = patient,
                groups = drug,
                levels = new_df,
                fill = dose,
                missing_fill = 0)
#> # A tibble: 1 x 6
#>   patient   age drug_Vancomycin drug_Epinephrine drug_Paclitaxel
#>   <chr>   <dbl>           <dbl>            <dbl>           <dbl>
#> 1 p6         30             350                0               0
#> # ... with 1 more variable: drug_Dexamethasone <dbl>