Summary

In healthcare we often have high-cardinality variables (categories with many distinct values) such as DRG, medication, or physician that we want to use in predictive models. However, using all of the values can be computationally prohibitive and, perhaps paradoxically, can hurt model performance. healthcareai can identify the subset of values that are most strongly associated with the outcome and so are likely to make good model features. This makes model training faster and predictions more accurate.

This vignette documents how to use the add_best_levels function to take high-cardinality factors and engineer features from them that will add to model performance. add_best_levels addresses the problem of a categorical variable having so many levels that using them all would create a prohibitively wide table (one column is created for each level, which is required for model training; healthcareai and other R packages do this automatically), so that training takes too long or signal is lost in noise. For example, a laboratory test results table might have thousands of unique lab tests. That presents two problems:

  1. Creating a column for each lab test makes a very wide table. For example, if one million patients each got one of five thousand possible lab tests, fully encoding those tests would create a one-million by five-thousand table: five billion cells, which will occupy a lot of RAM and likely take too long to train models on (a back-of-the-envelope check appears below).

  2. If only a subset of the tests are associated with the outcome of interest, the other categories can add a lot of noise to the signal, hurting predictive accuracy.

For both of those reasons it would be better to create columns only for the tests that are strongly associated (positively or negatively) with the outcome variable. That is what add_best_levels does.
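
As a quick back-of-the-envelope check of that first point (a sketch with the assumed sizes from above, not data from the package):

n_patients <- 1e6
n_tests <- 5e3
n_cells <- n_patients * n_tests  # five billion cells in the fully encoded table
# Stored densely as doubles (8 bytes each), that's roughly 37 GB of RAM
n_cells * 8 / 1024^3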

Introduction

library(healthcareai)
library(tidyverse)

For this vignette, we introduce a new data table, pima_meds, which contains information on what medications each patient in the pima_diabetes dataset has taken and for how long. This dataset is built into healthcareai as of version 2.1.1, which you can install before it is on CRAN with remotes::install_github("HealthCatalyst/healthcareai-r@v2.1.1"), or you can generate it yourself with the code in Appendix: Data Generation.

Some patients have no meds and so don’t appear in the table, some have one med, and most have more than one. Because of this, the medication information doesn’t fit neatly into the model table (pima_diabetes) without being pivoted. Under the hood, get_best_levels identifies which levels to use, and add_best_levels pivots them into new columns on the model table. Here are the first rows in the pima_meds table.

pima_meds
# > # A tibble: 1,534 x 3
# >   patient_id medication years_taken
# >        <int> <chr>            <dbl>
# > 1          1 metformin        0.856
# > 2          1 metoprolol       9.44 
# > 3          1 nexium           3.91 
# > 4          1 prednisone       2.69 
# > 5          2 tiotropium       3.53 
# > # … with 1,529 more rows

To keep things simple, there are only six medications in pima_meds: insulin and metformin are positively associated with diabetes, prednisone and metoprolol are negatively associated with diabetes, and nexium and tiotropium are equally likely among diabetic and non-diabetic patients. In real work, you probably wouldn’t use these functions with fewer than ~100 unique categories.

Basic Use

Let’s add to pima_diabetes columns for whether or not the patient got the two drugs most strongly associated with diabetes. The code below does that, with comments describing what each argument does.

pima_diabetes_meds <-
  add_best_levels(
    # Data frame with id, outcome, and (optionally) any number of other features
    d = pima_diabetes,
    # Data frame with id, the high-cardinality factor, and (optionally) a column 
    # to be used in pivoting
    longsheet = pima_meds,
    # The name of the ID variable present in both tables, used to join them
    id = patient_id,
    # The name of the high-cardinality factor
    groups = medication,
    # The name of the outcome
    outcome = diabetes,
    # The number of categories to keep. This many columns will be added to the first data frame
    n_levels = 2)
# > No fill column was provided, so using "1" for present entities

add_best_levels correctly identifies two drugs associated with diabetes: insulin and metoprolol. It returns pima_diabetes unchanged, except that there are two new columns, one for each of the drugs identified, with 1 for patients who had the drug, and NA for patients who didn’t. Next, we’ll demonstrate how to change the way those columns are filled.

glimpse(pima_diabetes_meds)
# > Rows: 768
# > Columns: 12
# > $ patient_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
# > $ pregnancies           <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5,…
# > $ plasma_glucose        <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, …
# > $ diastolic_bp          <int> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74,…
# > $ skinfold              <int> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA,…
# > $ insulin               <int> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, N…
# > $ weight_class          <chr> "obese", "overweight", "normal", "overweight", …
# > $ pedigree              <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248…
# > $ age                   <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34,…
# > $ diabetes              <chr> "Y", "N", "Y", "N", "Y", "N", "Y", "N", "Y", "Y…
# > $ medication_insulin    <int> NA, NA, 1, NA, NA, NA, NA, NA, 1, 1, NA, 1, NA,…
# > $ medication_metoprolol <int> 1, NA, 1, NA, 1, NA, 1, NA, NA, NA, 1, NA, 1, N…

Feature Engineering

Note the message that add_best_levels returned above: because no fill column was provided, “1” is used for present entries. That message actually comes through from the pivot function, and it means the data are being one-hot encoded: for each drug identified as a good feature, a variable is created with a “1” indicating that the patient got the drug.

You can customize what goes in those columns by passing arguments (fill, fun, and missing_fill) through to the pivot function. The values passed to those arguments won’t affect what columns get created, only what goes in them. By specifying fill = years_taken and missing_fill = 0, we get columns with the years each patient has been on each of the best-feature drugs with “0” entries for patients who didn’t get that drug.

pima_diabetes_med_duration <-
  add_best_levels(
    d = pima_diabetes,
    longsheet = pima_meds,
    id = patient_id,
    groups = medication,
    outcome = diabetes,
    n_levels = 2,
    ### The following arguments are passed to the `pivot` function.
    # The name of the column in longsheet to be used to fill new columns
    fill = years_taken,
    # The value to use for observations that lack a best level
    missing_fill = 0)
glimpse(pima_diabetes_med_duration)
# > Rows: 768
# > Columns: 12
# > $ patient_id            <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, …
# > $ pregnancies           <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, 5,…
# > $ plasma_glucose        <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125, …
# > $ diastolic_bp          <int> 72, 66, 64, 66, 40, 74, 50, NA, 70, 96, 92, 74,…
# > $ skinfold              <int> 35, 29, NA, 23, 35, NA, 32, NA, 45, NA, NA, NA,…
# > $ insulin               <int> NA, NA, NA, 94, 168, NA, 88, NA, 543, NA, NA, N…
# > $ weight_class          <chr> "obese", "overweight", "normal", "overweight", …
# > $ pedigree              <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.248…
# > $ age                   <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 34,…
# > $ diabetes              <chr> "Y", "N", "Y", "N", "Y", "N", "Y", "N", "Y", "Y…
# > $ medication_insulin    <dbl> 0.0000000, 0.0000000, 1.7606305, 0.0000000, 0.0…
# > $ medication_metoprolol <dbl> 9.4427073, 0.0000000, 8.6279867, 0.0000000, 5.0…

The fun argument is only relevant if there are multiple observations of the same level for the same ID; for example, if a patient could have the same drug listed twice for different periods they were on the drug, we could use fun = sum to get the total length of time they were on the drug.
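
As a constructed illustration (pima_meds_split below is made up for this example and is not part of the package), suppose patient 1’s metoprolol record had been split into two half-length courses. With fun = sum, the pivot totals years_taken across the duplicate rows, so the patient’s medication_metoprolol entry matches the original table:

# Split patient 1's metoprolol record into two half-length courses
split_rows <-
  pima_meds %>%
  filter(patient_id == 1, medication == "metoprolol") %>%
  mutate(years_taken = years_taken / 2)
pima_meds_split <-
  pima_meds %>%
  filter(!(patient_id == 1 & medication == "metoprolol")) %>%
  bind_rows(split_rows, split_rows)
# fun = sum adds the two courses back together in the pivoted column
add_best_levels(d = pima_diabetes,
                longsheet = pima_meds_split,
                id = patient_id,
                groups = medication,
                outcome = diabetes,
                n_levels = 2,
                fill = years_taken,
                fun = sum,
                missing_fill = 0)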

Working with a single table

In the examples above, each observation (each row in d/pima_diabetes) could have zero, one, or many group memberships (rows in longsheet/pima_meds). There is a special case when each observation is in exactly one group. In this case you might have only one table because the groups fit tidily into a single column in d. However, there might still be so many groups that creating a column for each is infeasible for the reasons described in the Summary. There are other options available for this special case, e.g. impact coding as implemented by the vtreat package, but add_best_levels works well here too. In this case, simply provide the same data frame to d and longsheet and proceed as usual, but note that you will need to remove the original column at the end.

For example, suppose we have 100,000 patients’ zip codes and we want to predict their lengths of stay (LOS). In reality we’d want lots of other features, but we’ll keep things simple here. Creating a column for every zip code (we’ll use Intermountain West zip codes, which we’ll pretend are every integer from 80001 to 89999) would create a massive table – model training would be very slow, and including zip codes that aren’t associated with particularly long or short LOS would impair model performance. add_best_levels helps by identifying zip codes associated with especially long or short LOS, and creating dummy variables for those zip codes. Here is that simulated data.

n <- 1e5
zip_los <- tibble(
  id = seq_len(n),
  zip = sample(80001:89999, n, replace = TRUE),
  LOS = zip * 1e-4 + abs(rnorm(n, sd = .1)))
zip_los
# > # A tibble: 100,000 x 3
# >      id   zip   LOS
# >   <int> <int> <dbl>
# > 1     1 81774  8.26
# > 2     2 81839  8.33
# > 3     3 84249  8.48
# > 4     4 89807  9.06
# > 5     5 81683  8.20
# > # … with 99,995 more rows

Now we’ll create dummy variables for the ten zip codes likely to make the best predictors of LOS. Note that LOS was simulated as a linear function of the numeric value of the patient’s zip code (plus some random noise), so we should expect zip codes close to 80001 to be associated with especially short LOS and zip codes close to 89999 to be associated with especially long LOS. Therefore the columns added as best levels should be zip codes near the endpoints.

add_best_levels(
  d = zip_los,
  longsheet = zip_los,
  id = id,
  groups = zip,
  outcome = LOS,
  n_levels = 10,
  missing_fill = 0L
) %>%
  select(-zip)
# > # A tibble: 100,000 x 12
# >      id   LOS zip_80008 zip_80124 zip_80337 zip_81214 zip_82412 zip_88634
# >   <int> <dbl>     <int>     <int>     <int>     <int>     <int>     <int>
# > 1     1  8.26         0         0         0         0         0         0
# > 2     2  8.33         0         0         0         0         0         0
# > 3     3  8.48         0         0         0         0         0         0
# > 4     4  9.06         0         0         0         0         0         0
# > 5     5  8.20         0         0         0         0         0         0
# > # … with 99,995 more rows, and 4 more variables: zip_89136 <int>,
# > #   zip_89299 <int>, zip_89586 <int>, zip_89932 <int>

Note that we remove the zip column at the end because otherwise modeling functions would create a column for every zip code. You only have to do this when providing the same data frame to d and longsheet.

Adding best levels in deployment

It’s one thing to identify and create useful features for model training, but healthcareai goes to great lengths to make model deployment easy and reliable so that you can count on getting good predictions in production. Once you’ve identified the best levels, added them to your training dataset, and trained models on that dataset, you can reliably get the same features in deployment. You do this by passing to the levels argument of add_best_levels either a model trained on a data frame that came back from add_best_levels, or the data frame that came back from add_best_levels itself.

For example, suppose we get a new patient with the following attributes. We can add her medication information to the rest of her data, creating exactly the same columns as in the training dataset. The best levels aren’t being identified here (they couldn’t be because there’s only one observation and we don’t know her outcome!). Instead we apply the information from the training dataset to achieve the same transformations here. Note that the patient got nexium and metoprolol, but columns are created for metoprolol and insulin.

new_patient <- tibble(patient_id = 999, pregnancies = 0, plasma_glucose = 94,
                      diastolic_bp = 69, skinfold = 24, insulin = NA,
                      weight_class = "normal", pedigree = 0.5, age = 22)
new_meds <- tibble(patient_id = rep(999, 2),
                   medication = c("nexium", "metoprolol"),
                   years_taken = c(.25, 2.4))
new_patient_med_duration <-
  add_best_levels(d = new_patient,
                  longsheet = new_meds,
                  id = patient_id,
                  groups = medication,
                  outcome = diabetes,
                  n_levels = 2,
                  levels = pima_diabetes_med_duration,
                  fill = years_taken,
                  missing_fill = 0)
glimpse(new_patient_med_duration)
# > Rows: 1
# > Columns: 11
# > $ patient_id            <dbl> 999
# > $ pregnancies           <dbl> 0
# > $ plasma_glucose        <dbl> 94
# > $ diastolic_bp          <dbl> 69
# > $ skinfold              <dbl> 24
# > $ insulin               <lgl> NA
# > $ weight_class          <chr> "normal"
# > $ pedigree              <dbl> 0.5
# > $ age                   <dbl> 22
# > $ medication_metoprolol <dbl> 2.4
# > $ medication_insulin    <dbl> 0

Using Models to Add Best Levels

When you train models on a data frame that has had best levels added to it, and then you go make predictions from those models, you’ll need to add best levels in the same way on the prediction dataset. Here’s how to do that.

As above, we’re using the determination of best levels from the training data and applying it to the data we want a prediction from, but now we pass the trained model object to the levels argument. That means that you don’t have to carry any additional objects into your deployment environment – just the model and whatever data you want to predict on.

models <- machine_learn(pima_diabetes_med_duration, patient_id, outcome = diabetes,
                        models = "xgb", tune = FALSE)
add_best_levels(d = new_patient,
                longsheet = new_meds,
                id = patient_id,
                groups = medication,
                outcome = diabetes,
                n_levels = 2,
                levels = models,
                fill = years_taken,
                missing_fill = 0) %>%
  predict(models, .)
# > # A tibble: 1 x 12
# >   predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp skinfold
# > *            <dbl>      <dbl>       <dbl>          <dbl>        <dbl>    <dbl>
# > 1         0.000498        999           0             94           69       24
# > # … with 6 more variables: insulin <lgl>, weight_class <chr>, pedigree <dbl>,
# > #   age <dbl>, medication_metoprolol <dbl>, medication_insulin <dbl>

get_best_levels

add_best_levels is a convenience function that wraps two workhorses: get_best_levels and pivot. get_best_levels does the work of identifying which levels should be used as features. You can call it directly the same way you’d call add_best_levels (sans the arguments to pivot), but instead of adding columns to the data frame, you get back a character vector of the best levels. Next, we look at how get_best_levels determines what those levels are.

get_best_levels(pima_diabetes, pima_meds, patient_id, medication, diabetes, 4)
# > [1] "metoprolol" "insulin"    "prednisone" "metformin"
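
To see how the pieces fit together, the following sketch approximates what add_best_levels does by combining get_best_levels with tidyr’s pivot_wider standing in for the package’s internal pivot function. This is an illustration, not the package’s actual code path:

best_meds <- get_best_levels(pima_diabetes, pima_meds, patient_id,
                             medication, diabetes, 2)
pima_meds %>%
  # Keep only the identified levels, then pivot them into columns
  filter(medication %in% best_meds) %>%
  pivot_wider(id_cols = patient_id,
              names_from = medication,
              names_prefix = "medication_",
              values_from = years_taken,
              values_fn = sum,
              values_fill = 0) %>%
  left_join(pima_diabetes, ., by = "patient_id") %>%
  # Patients with no rows in pima_meds get NA from the join; mimic
  # missing_fill = 0 by replacing those NAs
  mutate(across(starts_with("medication_"), ~ replace_na(., 0)))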

Mathematics

Basically, get_best_levels identifies groups that are present in many observations and are strongly and consistently associated with a particular outcome value or class. When possible, it returns equal numbers of groups that are positively and negatively associated with the outcome. The mechanics are different for regression and classification problems.

Regression

When the outcome is numeric, groups that are consistently associated with outcomes far from the mean are likely to be useful, because outcomes near the mean are the easiest to predict. Groups with similar outcomes and groups with many observations are also preferable. So, get_best_levels calculates how far the group-mean outcome is from the overall-mean outcome and balances that against how cohesive the group is (how small the within-group variance is) and how common the group is (what fraction of observations are in the group). Specifically, the following statistic, \(\Omega\), is calculated for each group, and the n_levels groups with the greatest absolute value of \(\Omega\) are used, subject to the constraint that, if possible, equal numbers of groups are used with positive and negative values of \(\bar{y_j} - \bar{y}\). That constraint ensures that the features will be useful for predicting both particularly large and particularly small outcomes. Here, \(j\) is the group, \(y\) is the outcome, \(\sigma^2\) is variance, and \(n_j\) is the number of observations in \(j\).

\[\Omega_j = \frac{\bar{y_j} - \bar{y}}{\sqrt{\sigma^2_j / n_j}}\]
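
As an illustration of this formula (a dplyr sketch, not the package’s internal code), we can compute \(\Omega\) for each zip code in the zip_los data from the previous section and pick a balanced set of levels by hand:

overall_mean <- mean(zip_los$LOS)
omega_by_zip <-
  zip_los %>%
  group_by(zip) %>%
  summarize(n_j = n(), mean_j = mean(LOS), var_j = var(LOS)) %>%
  # var() is NA for single-observation groups, so drop those here
  filter(n_j > 1) %>%
  mutate(omega = (mean_j - overall_mean) / sqrt(var_j / n_j))
# Balance the selection: five most positive and five most negative values
bind_rows(slice_max(omega_by_zip, omega, n = 5),
          slice_min(omega_by_zip, omega, n = 5))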

Classification

For categorical outcomes, an ideal group has two characteristics: it is consistently associated with one of the outcome classes, and it is common among observations. get_best_levels calculates a “log-loss” type statistic for the distance from each of those ideals for each group, and returns the groups closest to the ideal, as above, subject to the constraint that half are associated with the positive class and half with the negative class if possible.

There is one user-customizable parameter for classification, cohesion_weight (\(\xi\) in the equation below), which controls how much to value all the observations in a group having the same outcome, relative to the group being common. Its default value is 2. If the choice of levels is important for model performance, cohesion_weight could be tuned, e.g. by cross validation.

Specifically, the best groups are identified as those with the smallest values of \(\Omega_j\), subject to the constraint that half are associated with each of the positive and negative classes, if possible. \(p_j^{y_1}\) is the proportion of observations in group \(j\) that are of the positive outcome class, and \(C_j\) is an indicator variable for which outcome class group \(j\) is associated with: it takes the value 1 if a greater fraction of observations in \(j\) are associated with the positive class than in the median group, and 0 otherwise.

\[\Omega_j = (-C_j\ln{p_j^{y_1}} - (1 - C_j)(\ln{(1 - p_j^{y_1})}))^\xi \times -\ln (n_j/n)\]

The mathematically intuitive will notice that \(\Omega_j\) blows up when every observation is in the group or the group is universally associated with one outcome. In these cases, the intermediate statistics are moved a small distance (0.5) from the endpoint before their logarithm is taken.
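
Here is a sketch of that calculation for the pima_meds example. It illustrates the formula rather than healthcareai’s implementation, and clamping \(p_j^{y_1}\) away from the endpoints below is one reading of the 0.5 adjustment:

xi <- 2  # the default cohesion_weight
n_obs <- nrow(pima_diabetes)
pima_meds %>%
  left_join(select(pima_diabetes, patient_id, diabetes), by = "patient_id") %>%
  group_by(medication) %>%
  summarize(n_j = n(), p_j = mean(diabetes == "Y")) %>%
  mutate(
    # Keep proportions off 0 and 1 so the logarithms stay finite
    p_j = pmin(pmax(p_j, 0.5 / n_j), 1 - 0.5 / n_j),
    # C_j = 1 if the group leans toward the positive class more than the median group
    C_j = as.numeric(p_j > median(p_j)),
    omega = (-C_j * log(p_j) - (1 - C_j) * log(1 - p_j))^xi * -log(n_j / n_obs)
  ) %>%
  arrange(omega)  # smallest omega = best candidate levels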

Appendix: Data Generation

We now show how we generated the pima_meds table that supplements the pima_diabetes dataset. This is “FYI” and is not necessary to understand how to use add_best_levels or get_best_levels.

First we create a table of medications with a numeric value indicating how common the medication is among diabetic patients, with values close to one indicating medications that are used almost exclusively in diabetics, values close to zero indicating medications that are contraindicated for diabetics, and values near 0.5 indicating medications used equally among diabetic and non-diabetic patients.

Insulin and metformin are both strong indicators that a patient has diabetes. Prednisone is a corticosteroid that can cause blood sugar to spike and so might be used more rarely in patients with diabetes. Metoprolol is a beta-blocker that can mask symptoms of hypoglycemia and so might be avoided by diabetics. Nexium and tiotropium are both common drugs that we don’t expect to be more- or less-common in diabetic patients than in non-diabetic patients.

meds <- tribble(
  ~ name, ~ predicts_diabetes,
  "insulin", .99,
  "metformin", .95,
  "prednisone", .25,
  "metoprolol", .2,
  "nexium", .5,
  "tiotropium", .5
)

Now we take each patient and choose zero to four medications for them at random, with the probability they have each medication proportional to meds$predicts_diabetes if the patient has diabetes, and 1 - meds$predicts_diabetes if they don’t. We also create a years_taken variable, which is sampled at random from an exponential distribution. So, some patients won’t appear in this table (those who got no meds), some will have one row (those who got one med), and some will have multiple rows (those who got more than one med).

pima_meds <-
  pima_diabetes %>%
  group_by(patient_id) %>%
  summarize(drug_name = list(tibble(
    medication = sample(x = meds$name, size = sample(0:4, 1), replace = FALSE,
                        prob = if (diabetes == "Y") meds$predicts_diabetes else 1 - meds$predicts_diabetes),
    years_taken = rexp(n = length(medication), rate = .2)))
  ) %>%
  unnest(cols = c(drug_name))