This vignette demonstrates how to translate healthcareai v1 code into v2 code by translating the example from the old RandomForestDeployment help page. Throughout this vignette, commented code is the v1 code that is being updated in each chunk.

Install and Load the Package

First, check the package version with packageVersion("healthcareai"). If the first digit isn’t 2, you need to update. You should be able to install v2 by running install.packages("healthcareai").
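
The version check can also be written as a comparison instead of reading the first digit by eye. The following is a minimal sketch using only base R; `is_pre_v2` is a hypothetical helper name, not a healthcareai function.

```r
# Hypothetical helper (not part of healthcareai):
# TRUE when a version is older than 2.0.0, i.e. an update is needed.
is_pre_v2 <- function(version) {
  as.package_version(version) < "2.0.0"
}

is_pre_v2("1.2.4")  # TRUE: a 1.x version needs updating
is_pre_v2("2.5.0")  # FALSE: already on v2

# With the package installed, the same check can gate the install:
# if (is_pre_v2(packageVersion("healthcareai"))) install.packages("healthcareai")
```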

Then, load the package.

library(healthcareai)
# > healthcareai version 2.5.0
# > Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com

Load Data

First, we’ll read the data from the repository and inspect it.

# csvfile <- system.file("extdata", 
#                        "HCRDiabetesClinical.csv", 
#                        package = "healthcareai")
# df <- read.csv(file = csvfile, 
#                header = TRUE, 
#                na.strings = c("NULL", "NA", ""))
# 
# df$PatientID <- NULL # Only one ID column (i.e., PatientEncounterID) is needed, so remove this column
## v2 lets you retain any number of identifier columns

df <- read.csv("https://raw.githubusercontent.com/HealthCatalyst/healthcareai-r/8355ee33a6ea2ad549c8f840832b7843ddf27b04/inst/extdata/HCRDiabetesClinical.csv",
               na.strings = "")
str(df)
# > 'data.frame':   1000 obs. of  7 variables:
# >  $ PatientEncounterID : int  1 2 3 4 5 6 7 8 9 10 ...
# >  $ PatientID          : int  10001 10001 10001 10002 10002 10002 10002 10003 10003 10003 ...
# >  $ SystolicBPNBR      : int  167 153 170 187 188 185 189 149 155 160 ...
# >  $ LDLNBR             : int  195 214 191 135 125 178 101 160 144 130 ...
# >  $ A1CNBR             : num  4.2 5 4 4.4 4.3 5 4 5 6.6 8 ...
# >  $ GenderFLG          : chr  "M" "M" "M" "M" ...
# >  $ ThirtyDayReadmitFLG: chr  "N" "N" "N" "N" ...

There are missing values in the outcome variable ThirtyDayReadmitFLG. v2 of the package intentionally does not allow missing values in the outcome, so we will discard those rows. Missing values in predictors are handled automatically, and a variety of imputation methods are supported.

df <- dplyr::filter(df, !is.na(ThirtyDayReadmitFLG))
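
To see how many rows a filter like this removes, it can help to count missing values per column first. The sketch below uses a small made-up data frame (the values are invented for illustration, not taken from HCRDiabetesClinical.csv) and base R only. It also shows that reading with na.strings = "" turns empty fields into NA.

```r
# Toy CSV with one missing predictor value and one missing outcome value.
# These rows are invented for illustration.
toy_csv <- "PatientEncounterID,A1CNBR,ThirtyDayReadmitFLG
1,4.2,N
2,,Y
3,5.0,
4,6.6,N"

toy <- read.csv(text = toy_csv, na.strings = "")

# Count missing values in each column before filtering
colSums(is.na(toy))

# Base R equivalent of the dplyr::filter() call: keep rows with a known outcome
toy_complete <- toy[!is.na(toy$ThirtyDayReadmitFLG), ]
nrow(toy_complete)
```

Only the row with a missing outcome is dropped; the row with a missing predictor stays and would be left for imputation.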

Reserve Validation Set

This step is optional. Withholding some data from model training and evaluating the model on that data gives an honest estimate of how well the model will perform on new data. healthcareai v2 already does this repeatedly under the hood during model training (through cross validation), so the performance reported on training data should be consistent with performance on new data. Validation on a held-out test dataset can still be useful, though, so we demonstrate how to do it here.

split_train_test returns two data frames, train and test, ensuring equal representation of the outcome in each. The following code puts 95% of rows in train and 5% of rows in test. The latter won’t be used in model training, and instead will be used to make predictions to see how well the model performs on data it has never seen.

# dfDeploy <- df[951:1000,]
d <- split_train_test(df, outcome = ThirtyDayReadmitFLG, percent_train = .95)
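
To illustrate what a split that preserves the outcome distribution looks like, here is a base R sketch of a stratified split on made-up data. It is not healthcareai's implementation; `stratified_split` and the toy data frame are hypothetical and exist only for this example.

```r
set.seed(42)

# Toy data: 80 "N" outcomes and 20 "Y" outcomes (invented for illustration)
toy <- data.frame(id = 1:100,
                  outcome = rep(c("N", "Y"), times = c(80, 20)))

# Hypothetical helper: sample the same fraction of rows from each outcome class
stratified_split <- function(df, outcome, percent_train) {
  rows_by_class <- split(seq_len(nrow(df)), df[[outcome]])
  train_rows <- unlist(lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(length(rows) * percent_train))]
  }), use.names = FALSE)
  test_rows <- setdiff(seq_len(nrow(df)), train_rows)
  list(train = df[train_rows, , drop = FALSE],
       test = df[test_rows, , drop = FALSE])
}

d_toy <- stratified_split(toy, "outcome", percent_train = 0.75)

# Both pieces keep the original 80/20 balance of N to Y
table(d_toy$train$outcome)
table(d_toy$test$outcome)
```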

Train Models

machine_learn needs the name of the data frame, any columns that shouldn’t be used in training (identifier columns), and the outcome variable. It will do everything else automatically, including imputing missing values, training multiple algorithms, optimizing algorithm details through cross validation, and much more. Specifying models = "rf" means that only random forests will be trained; if left blank, all supported models will be trained, and the best performing model will be used to make predictions.

# p <- SupervisedModelDevelopmentParams$new()
# p$df <- df
# p$type <- "classification"
# p$impute <- TRUE
# p$grainCol <- "PatientEncounterID"
# p$predictedCol <- "ThirtyDayReadmitFLG"
# p$debug <- FALSE
# p$cores <- 1
# 
# # Run RandomForest
# RandomForest <- RandomForestDevelopment$new(p)
# RandomForest$run()

models <- machine_learn(d$train,
                        PatientEncounterID, PatientID,
                        outcome = ThirtyDayReadmitFLG,
                        models = "rf")
# > Training new data prep recipe...
# > Variable(s) ignored in prep_data won't be used to tune models: PatientEncounterID and PatientID
# > 
# > ThirtyDayReadmitFLG looks categorical, so training classification algorithms.
# > 
# > After data processing, models are being trained on 6 features with 942 observations.
# > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's
# > Training with cross validation: Random Forest
# > 
# > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
# > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***

Save and Load Models

healthcareai v1 automatically saved models in the working directory, and automatically loaded them from the working directory. v2 instead provides functions to make it easy to save and load models. You only need to manage model files if you want to use the models in a different R session than the one in which you trained them.

As with other objects in R, you can save an object to disk like this:

save(models, file = "my_random_forests.RDA")

That will write the models object to the my_random_forests.RDA file in the working directory. You can find out where that is by running getwd().

Loading the models with the following line will reestablish the models object in a new R session. That means you can move the my_random_forests.RDA file to another directory or another computer, and get your models there with this line. If the RDA file is in a different directory than your R script (or project if you use RStudio’s Projects), you’ll need to point to that location relative to the current working directory, e.g. load("../data/trained_models/my_random_forests.RDA"). Here is some recommended reading on working directories and filepaths if you need help loading saved models.

load("my_random_forests.RDA")
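
The round trip can be tried without trained models. The sketch below uses a stand-in list (`models_stub`, a hypothetical object, not a real model) and a temporary directory so that nothing is written to your working directory. It shows that load() restores the object under its original name and returns that name.

```r
# Stand-in for a trained model object (hypothetical, for illustration only)
models_stub <- list(algorithm = "rf", outcome = "ThirtyDayReadmitFLG")

# Write to a temporary directory instead of the working directory
rda_path <- file.path(tempdir(), "my_random_forests.RDA")
save(models_stub, file = rda_path)

# Simulate a fresh session by removing the object, then restore it
rm(models_stub)
restored_names <- load(rda_path)

restored_names          # name(s) of the restored object(s)
models_stub$algorithm   # the object is available again under its old name
```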

Make Predictions

To make predictions on the training dataset, simply pass the model object to predict. To make predictions on new data, pass the data frame to predict after the model object.

# p2 <- SupervisedModelDeploymentParams$new()
# p2$type <- "classification"
# p2$df <- dfDeploy
# p2$grainCol <- "PatientEncounterID"
# p2$predictedCol <- "ThirtyDayReadmitFLG"
# p2$impute <- TRUE
# p2$debug <- FALSE
# p2$cores <- 1
# 
# dL <- RandomForestDeployment$new(p2)
# dL$deploy()
# 
# dfOut <- dL$getOutDf()
# head(dfOut)

predict(models, d$test)
# > Prepping data based on provided recipe
# > "predicted_ThirtyDayReadmitFLG" predicted by Random Forest last trained: 2020-08-05 09:20:40
# > Performance in training: AUROC = 0.87
# > # A tibble: 48 x 8
# >   ThirtyDayReadmi… predicted_Thirt… PatientEncounte… PatientID SystolicBPNBR
# > * <fct>                       <dbl>            <int>     <int>         <int>
# > 1 N                         0.103                 16     10004           117
# > 2 Y                         0.807                 19     10004           187
# > 3 N                         0.00580               29     10005           170
# > 4 N                         0.0516                46     10008           134
# > 5 Y                         0.254                 56     10009           191
# > # … with 43 more rows, and 3 more variables: LDLNBR <int>, A1CNBR <dbl>,
# > #   GenderFLG <chr>
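
To check how well predictions match the actual outcomes, you can compare the two columns directly. The base R sketch below uses only the five rows displayed above, so the resulting numbers are illustrative and say nothing reliable about the full test set. It assumes the predicted value is the probability of "Y", and the 0.5 cutoff is an arbitrary choice for this example.

```r
# The five rows shown in the printed predictions above
shown <- data.frame(
  actual = c("N", "Y", "N", "N", "Y"),
  predicted_prob = c(0.103, 0.807, 0.00580, 0.0516, 0.254)
)

# ASSUMPTION: predicted_prob is the probability of "Y"; 0.5 is an arbitrary cutoff
shown$predicted_class <- ifelse(shown$predicted_prob > 0.5, "Y", "N")
accuracy <- mean(shown$predicted_class == shown$actual)

# Rank-based AUROC: the chance that a random "Y" row scores above a random "N" row
is_pos <- shown$actual == "Y"
n_pos <- sum(is_pos)
n_neg <- sum(!is_pos)
auroc <- (sum(rank(shown$predicted_prob)[is_pos]) - n_pos * (n_pos + 1) / 2) /
  (n_pos * n_neg)

accuracy
auroc
```

On these five rows the cutoff misclassifies the "Y" row predicted at 0.254, while the ranking is perfect because both "Y" rows score above every "N" row. This is why a ranking metric such as AUROC and a cutoff-based metric such as accuracy can disagree.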

Further Reading

For more detail on using v2 or about what’s happening under the hood, see Getting Started or the function reference page. To see how to connect to a database, see Database Connections.