Build efficient features from high-cardinality, multiple-membership factors

In healthcare, we often face high-cardinality variables in which each observation may have zero, one, or more levels, e.g. medications for a model at the patient grain. In these cases, creating a feature variable for each level (each medication), as in one-hot encoding, can be prohibitively computationally intensive and can hurt performance by diminishing the signal-to-noise ratio. get_best_levels identifies a subset of categories that are likely to be valuable features, and add_best_levels adds them to a model data frame.

get_best_levels finds levels of groups that are likely to be useful predictors in d and returns them as a character vector. add_best_levels does the same and adds them, pivoted, to d. Both functions attempt to find both positive and negative predictors of outcome.

add_best_levels stores the identified best levels and passes them through model training so that in deployment, the same columns created in training are again created (see the final example).

add_best_levels accepts arguments to pivot so that values associated with the levels (e.g. doses of medications) can be used in the new features. However, note that these values are not used in determining the best levels. That is, get_best_levels determines which levels are likely to be good predictors by looking only at the outcomes where each level is present or absent; it does not use fill or fun in this determination. See Details for more information about how levels are selected.

add_best_levels(
  d,
  longsheet,
  id,
  groups,
  outcome,
  n_levels = 100,
  min_obs = 1,
  positive_class = "Y",
  cohesion_weight = 2,
  levels = NULL,
  fill,
  fun = sum,
  missing_fill = NA
)

get_best_levels(
  d,
  longsheet,
  id,
  groups,
  outcome,
  n_levels = 100,
  min_obs = 1,
  positive_class = "Y",
  cohesion_weight = 2
)

Arguments

d	Data frame to use in models, at desired grain. Has id and outcome
longsheet	Data frame containing multiple observations per grain. Has id and groups
id	Name of identifier column, unquoted. Must be present and identical in both tables
groups	Name of grouping column, unquoted
outcome	Name of outcome column, unquoted
n_levels	Number of levels to return, default = 100. An attempt is made to return half of the levels positively associated with the outcome and half negatively associated. If n_levels is greater than the number of levels present, all levels will be returned
min_obs	Minimum number of observations in which a level must be found in order to be considered. Defaults to one, but larger values are often useful because a level present in only a few observations will rarely be a useful predictor.
positive_class	If classification model, the positive class of the outcome, default = "Y"; ignored if regression
cohesion_weight	For classification problems only, how much to value a level being consistently associated with an outcome relative to its being present in many observations. Default = 2; equal weight is 1. Note that this is a parameter that could potentially be tuned over.
levels	Use this argument when add_best_levels was used in training and you want to add the same columns for deployment. You can pass the model trained on the data frame from `add_best_levels`, the data frame from `add_best_levels`, or a character vector of levels to add.
fill	Passed to `pivot`. Column to be used to fill the values of cells in the output, perhaps after aggregation by `fun`. If `fill` is not provided, counts will be used, as though a fill column of 1s had been provided.
fun	Passed to `pivot`. Function for aggregation, defaults to `sum`. Custom functions can be used with the same syntax as the apply family of functions, e.g. `fun = function(x) some_function(another_fun(x))`.
missing_fill	Passed to `pivot`. Value to fill for combinations of grain and spread that are not present. Defaults to NA, but 0 may be useful as well.
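
To picture what fill, fun, and missing_fill control, here is a minimal base R sketch of the same long-to-wide idea. It does not call `pivot` and is not the package's implementation; the `meds` table and its values are made up for illustration.

```r
# Hypothetical long table: one row per drug administration.
meds <- data.frame(
  patient = c("p1", "p1", "p2", "p3"),
  drug    = c("Ibuprofen", "Ibuprofen", "Vancomycin", "Ibuprofen"),
  dose    = c(100, 250, 100, 250)
)

# Aggregate dose (playing the role of `fill`) with sum (playing the role of
# `fun`) for every patient-drug combination. Combinations that never occur
# come back as NA, which mirrors the default missing_fill = NA.
wide <- tapply(meds$dose, list(meds$patient, meds$drug), sum)

# Replace absent combinations with 0, analogous to missing_fill = 0.
wide[is.na(wide)] <- 0
wide
```

Swapping `sum` for another aggregator such as `max` would give the largest single dose per patient and drug instead of the total.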

Value

For add_best_levels, d with new columns for the best levels added and best_levels attribute containing a named list of levels added. For get_best_levels, a character vector of the best levels.
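
The role of the best_levels attribute can be sketched in base R. This is only an illustration under stated assumptions: the data frames are hypothetical, the list is assumed to be named by the grouping column, and the loop stands in for whatever add_best_levels does internally when `levels` is supplied.

```r
# Hypothetical training frame with two pivoted drug columns.
train <- data.frame(
  patient         = c("p1", "p2"),
  drug_Ibuprofen  = c(350, 0),
  drug_Vancomycin = c(0, 100)
)

# Record which levels were pivoted (assumed layout: a list named by the
# grouping column).
attr(train, "best_levels") <- list(drug = c("Ibuprofen", "Vancomycin"))

# Hypothetical deployment frame in which only one training level occurs.
deploy <- data.frame(patient = "p9", drug_Vancomycin = 250)

# Recreate every training column that is absent, filled with 0, so the
# deployment frame has the same feature columns as the training frame.
stored <- attr(train, "best_levels")
for (g in names(stored)) {
  for (lev in stored[[g]]) {
    col <- paste0(g, "_", lev)
    if (!col %in% names(deploy)) deploy[[col]] <- 0
  }
}
names(deploy)
```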

Details

Here is how get_best_levels determines the levels of groups that are likely to be good predictors.

For regression: For each group, the difference of the group-mean from the grand-mean is divided by the standard error of the group mean (i.e. centered_mean(group) / sqrt(var(group) / n(group))), and the groups with the largest absolute values of that statistic are retained.
For classification: For each group, two "log-loss-like" statistics are calculated. One is the log of the fraction of observations in which the group does not appear, which captures how ubiquitous the group is: more common groups are more useful as predictors. The other captures how far the group is from being always associated with the same outcome: groups that are consistently associated with either outcome are more useful as predictors. This is calculated as the log of the proportion of outcomes that are not all the same outcome (e.g. if 4/5 observations are positive class, this statistic is log(.2)). This value is then raised to the cohesion_weight power. To ensure that both positive and negative predictors are retained, the all-same-outcome that is used as the comparison is determined by which side of the median proportion of positive_class the group falls on.
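
The statistics described above can be sketched in base R. This is a rough illustration, not the code get_best_levels runs: the data are hypothetical, the grand mean is assumed to be taken over d, ties at the median are assumed to go to the positive side, and the cohesion_weight step and the final ranking are left out because the description does not pin them down.

```r
# Hypothetical model table and long membership table.
d <- data.frame(
  id      = paste0("p", 1:6),
  y_num   = c(10, 12, 11, 20, 22, 9),         # regression outcome
  y_class = c("Y", "Y", "Y", "Y", "N", "N")   # classification outcome
)
long <- data.frame(
  id    = paste0("p", c(1, 2, 3, 4, 5, 4, 5, 6)),
  group = c("A", "A", "A", "B", "B", "C", "C", "C")
)
joined <- merge(long, d, by = "id")

# Regression: centered group mean divided by sqrt(var(group) / n(group)).
grand_mean <- mean(d$y_num)
reg_stat <- sapply(split(joined$y_num, joined$group), function(y) {
  (mean(y) - grand_mean) / sqrt(var(y) / length(y))
})

# Classification, statistic 1: log of the fraction of observations in
# which the group does NOT appear.
by_group  <- split(joined$y_class, joined$group)
n_present <- sapply(by_group, length)
presence  <- log(1 - n_present / nrow(d))

# Classification, statistic 2: log of the proportion of the group's
# outcomes that differ from the reference outcome. The reference is
# all-positive for groups at or above the median proportion positive
# (tie handling is an assumption) and all-negative otherwise.
prop_pos <- sapply(by_group, function(y) mean(y == "Y"))
mismatch <- ifelse(prop_pos >= median(prop_pos), 1 - prop_pos, prop_pos)
cohesion <- log(mismatch)   # -Inf for a perfectly consistent group

round(reg_stat, 3)
```

In this toy data, group B has the largest absolute regression statistic, and group A is perfectly consistent with the positive class, so its mismatch proportion is 0 and its log is -Inf.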

Examples

set.seed(45796)

# We have two tables we want to use in our models:
# - df is the model table. It has the outcomes (survived), and we want one
#   prediction for each row in df
# - meds has detailed information on each row (patient) in df. Each patient
#   may have zero, one, or more observations (drugs) in meds, and meds may
#   have associated values (doses).

df <- tibble::tibble(
  patient = paste0("Z", sample(10, 5)),
  age = sample(20:80, 5),
  survived = sample(c("N", "Y"), 5, replace = TRUE, prob = c(1, 2))
)

meds <- tibble::tibble(
  patient = sample(df$patient, 10, replace = TRUE),
  drug = sample(c("Quinapril", "Vancomycin", "Ibuprofen",
                  "Paclitaxel", "Epinephrine", "Dexamethasone"),
                10, replace = TRUE),
  dose = sample(c(100, 250), 10, replace = TRUE)
)

# Identify three drugs likely to be good predictors of survival

get_best_levels(d = df,
                longsheet = meds,
                id = patient,
                groups = drug,
                outcome = survived,
                n_levels = 3)
#> [1] "Epinephrine" "Ibuprofen"   "Paclitaxel" 

# Identify four drugs likely to make good features and add them to df.
# The "fill", "fun", and "missing_fill" arguments are passed to
# `pivot`, which allows us to use the total doses of each drug given to the
# patient as our new features

new_df <- add_best_levels(d = df,
                          longsheet = meds,
                          id = patient,
                          groups = drug,
                          outcome = survived,
                          n_levels = 4,
                          fill = dose,
                          fun = sum,
                          missing_fill = 0)
new_df
#> # A tibble: 5 x 7
#>   patient   age survived drug_Dexamethas… drug_Epinephrine drug_Ibuprofen
#>   <chr>   <int> <chr>               <dbl>            <dbl>          <dbl>
#> 1 Z10        23 N                       0                0              0
#> 2 Z2         20 N                     350              100              0
#> 3 Z6         59 Y                       0                0            100
#> 4 Z7         52 Y                     250                0            100
#> 5 Z8         53 N                       0                0              0
#> # … with 1 more variable: drug_Paclitaxel <dbl>

# The names of the medications that were added to df in new_df are stored in the
# best_levels attribute of new_df so that the same columns can be added in
# deployment. This is useful because you need to have the same columns to make
# predictions as you had in model training. When you are ready to add levels to
# a deployment data frame, you can pass to the "levels" argument of
# add_best_levels either the models trained on new_df, new_df itself, or the
# character vector of levels to add.

deployment_df <- tibble::tibble(
  patient = "p6",
  age = 30
)
deployment_meds <- tibble::tibble(
  patient = rep("p6", 2),
  drug = rep("Vancomycin", 2),
  dose = c(100, 250)
)

# Now, even though Vancomycin is the only drug that appears in
# deployment_meds, because we pass new_df to "levels", we get all the columns
# needed to make predictions on a model trained on new_df

add_best_levels(d = deployment_df,
                longsheet = deployment_meds,
                id = patient,
                groups = drug,
                levels = new_df,
                fill = dose,
                missing_fill = 0)
#> # A tibble: 1 x 6
#>   patient   age drug_Epinephrine drug_Ibuprofen drug_Paclitaxel drug_Dexamethas…
#>   <chr>   <dbl>            <dbl>          <dbl>           <dbl>            <dbl>
#> 1 p6         30                0              0               0                0
