This vignette demonstrates how to translate
healthcareai v1 code into v2 code by translating the example from the old
RandomForestDeployment help page. Throughout this vignette, commented code is the v1 code that is being updated in each chunk.
First, check the package version with packageVersion("healthcareai"). If the first digit isn’t 2, you need to update. You should be able to install v2 by running install.packages("healthcareai"). Then, load the package with library(healthcareai).
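The version check described above can also be done programmatically. The following is a minimal sketch using only base R; the helper name needs_v2_update is hypothetical and not part of healthcareai.

```r
# Hypothetical helper (not part of healthcareai): returns TRUE when the
# supplied version string is older than 2.0.0 and so needs updating.
needs_v2_update <- function(version) {
  package_version(version) < "2.0.0"
}

needs_v2_update("1.2.4")  # a v1 release
needs_v2_update("2.3.0")  # a v2 release

# With the package installed, the same check could be run as:
# needs_v2_update(as.character(packageVersion("healthcareai")))
```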
First, we’ll read the data from the repository and inspect it.
    # csvfile <- system.file("extdata",
    #                        "HCRDiabetesClinical.csv",
    #                        package = "healthcareai")
    # df <- read.csv(file = csvfile,
    #                header = TRUE,
    #                na.strings = c("NULL", "NA", ""))
    #
    # df$PatientID <- NULL # Only one ID column (ie, PatientEncounterID) is needed remove this column

    ## v2 lets you retain any number of identifier columns
    df <- read.csv("https://raw.githubusercontent.com/HealthCatalyst/healthcareai-r/8355ee33a6ea2ad549c8f840832b7843ddf27b04/inst/extdata/HCRDiabetesClinical.csv",
                   na.strings = "")
    str(df)
    # > 'data.frame': 1000 obs. of 7 variables:
    # >  $ PatientEncounterID : int 1 2 3 4 5 6 7 8 9 10 ...
    # >  $ PatientID          : int 10001 10001 10001 10002 10002 10002 10002 10003 10003 10003 ...
    # >  $ SystolicBPNBR      : int 167 153 170 187 188 185 189 149 155 160 ...
    # >  $ LDLNBR             : int 195 214 191 135 125 178 101 160 144 130 ...
    # >  $ A1CNBR             : num 4.2 5 4 4.4 4.3 5 4 5 6.6 8 ...
    # >  $ GenderFLG          : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
    # >  $ ThirtyDayReadmitFLG: Factor w/ 2 levels "N","Y": 1 1 1 1 1 1 1 1 1 2 ...
There are missing values in the outcome variable
ThirtyDayReadmitFLG, and v2 of the package intentionally does not allow missing values in the outcome, so we will discard those rows. Missingness in predictors is automatically taken care of and a variety of imputation methods are supported.
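One way to discard rows with a missing outcome is sketched below in base R. The helper drop_missing_outcome and the toy data frame are hypothetical stand-ins for illustration; in this vignette the same idea would be applied to df and its ThirtyDayReadmitFLG column.

```r
# Hypothetical helper: keep only rows whose outcome column is not missing.
# Missing values in other columns are left alone.
drop_missing_outcome <- function(data, outcome) {
  data[!is.na(data[[outcome]]), , drop = FALSE]
}

# Toy stand-in for the vignette's data frame
toy <- data.frame(
  PatientEncounterID = 1:5,
  A1CNBR = c(4.2, 5.0, NA, 4.4, 4.3),
  ThirtyDayReadmitFLG = c("N", NA, "Y", "N", NA)
)

toy_complete <- drop_missing_outcome(toy, "ThirtyDayReadmitFLG")
toy_complete
```

Rows 2 and 5 are dropped because their outcome is missing, while row 3 is kept even though one of its predictors is missing.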
This is an optional step. Withholding some data from model training and evaluating model performance on that data provides a rigorously honest estimation of how well the model will perform on new data.
healthcareai v2 actually does this repeatedly under the hood in model training (through cross validation), so the performance reported on training data should be consistent with performance on new data; however, validation on a test dataset can still be useful, so we demonstrate how to do it here.
split_train_test returns two data frames, train and test, ensuring equal representation of the outcome in each. The following code puts 95% of rows in train and 5% of rows in test. The latter won’t be used in model training, and instead will be used to make predictions to see how well the model performs on data it has never seen.
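To illustrate what "equal representation of the outcome" means, here is a minimal base R sketch of a stratified 95/5 split. It is not healthcareai's implementation of split_train_test; the function stratified_split, its arguments, and the toy data are hypothetical, and it assumes the outcome column has no missing values.

```r
# Hypothetical sketch of a stratified split (not healthcareai's own code).
# Within each outcome level, percent_train of the rows go to train.
stratified_split <- function(data, outcome, percent_train = 0.95, seed = 42) {
  set.seed(seed)
  # Row indices grouped by outcome level
  idx_by_level <- split(seq_len(nrow(data)), data[[outcome]])
  train_idx <- unlist(lapply(idx_by_level, function(idx) {
    n_train <- round(length(idx) * percent_train)
    idx[sample.int(length(idx), n_train)]
  }), use.names = FALSE)
  test_idx <- setdiff(seq_len(nrow(data)), train_idx)
  list(train = data[train_idx, , drop = FALSE],
       test = data[test_idx, , drop = FALSE])
}

# Toy data: 160 "N" outcomes and 40 "Y" outcomes
toy <- data.frame(id = 1:200, y = rep(c("N", "Y"), times = c(160, 40)))
d_toy <- stratified_split(toy, "y")

nrow(d_toy$train)
nrow(d_toy$test)
table(d_toy$test$y)
```

Because sampling happens separately within each outcome level, the N-to-Y ratio in the test set matches the ratio in the full data.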
machine_learn needs the name of the data frame, any columns that shouldn’t be used in training (identifier columns), and the outcome variable. It will do everything else automatically, including imputing missing values, training multiple algorithms, optimizing algorithm details through cross validation, and much more. Specifying
models = "rf" means that only random forests will be trained; if left blank, all supported models will be trained, and the best performing model will be used to make predictions.
    # p <- SupervisedModelDevelopmentParams$new()
    # p$df <- df
    # p$type <- "classification"
    # p$impute <- TRUE
    # p$grainCol <- "PatientEncounterID"
    # p$predictedCol <- "ThirtyDayReadmitFLG"
    # p$debug <- FALSE
    # p$cores <- 1
    #
    # # Run RandomForest
    # RandomForest <- RandomForestDevelopment$new(p)
    # RandomForest$run()

    models <- machine_learn(d$train, PatientEncounterID, PatientID,
                            outcome = ThirtyDayReadmitFLG, models = "rf")
    # > Training new data prep recipe...
    # > Variable(s) ignored in prep_data won't be used to tune models: PatientEncounterID and PatientID
    # >
    # > ThirtyDayReadmitFLG looks categorical, so training classification algorithms.
    # >
    # > After data processing, models are being trained on 6 features with 942 observations.
    # > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's
    # > Training with cross validation: Random Forest
    # >
    # > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
    # > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
healthcareai v1 automatically saved models in the working directory, and automatically loaded them from the working directory. v2 instead provides functions to make it easy to save and load models. You only need to manage model files if you want to use the models in a different R session than the one in which you trained them.
As with other objects in R, you can save the models object to disk with save(models, file = "my_random_forests.RDA"). That will write the models object to the my_random_forests.RDA file in the working directory. You can find out where that is by running getwd().
Loading the models with load("my_random_forests.RDA") will reestablish the models object in a new R session. That means you can move the my_random_forests.RDA file to another directory or another computer, and get your models there with the same call. If the RDA file is in a different directory than your R script (or project if you use RStudio’s Projects), you’ll need to point to that location relative to the current working directory, e.g. load("../data/trained_models/my_random_forests.RDA"). Here is some recommended reading on working directories and filepaths if you need help loading saved models.
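The save-and-load round trip can be sketched with base R alone. In this sketch the models object is a placeholder list rather than a trained model, and the file is written to a temporary directory (an assumption made so the example leaves your working directory untouched).

```r
# Placeholder object; in the vignette this would be the trained models object.
models <- list(algorithm = "rf", note = "placeholder, not a real model")

# Write to a temporary directory instead of the working directory.
rda_path <- file.path(tempdir(), "my_random_forests.RDA")
save(models, file = rda_path)

# Simulate a fresh R session by removing the object, then restore it.
rm(models)
restored <- load(rda_path)  # load() returns the names of the objects it restored

restored
models$algorithm
```

Note that load() recreates the object under the name it was saved with, so there is no need to assign its result to models.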
To make predictions on the training dataset, simply pass the model object to
predict. To make predictions on new data, pass the data frame to
predict after the model object.
    # p2 <- SupervisedModelDeploymentParams$new()
    # p2$type <- "classification"
    # p2$df <- dfDeploy
    # p2$grainCol <- "PatientEncounterID"
    # p2$predictedCol <- "ThirtyDayReadmitFLG"
    # p2$impute <- TRUE
    # p2$debug <- FALSE
    # p2$cores <- 1
    #
    # dL <- RandomForestDeployment$new(p2)
    # dL$deploy()
    #
    # dfOut <- dL$getOutDf()
    # head(dfOut)

    predict(models, d$test)
    # > Prepping data based on provided recipe
    # > "predicted_ThirtyDayReadmitFLG" predicted by Random Forest last trained: 2018-09-01 17:29:13
    # > Performance in training: AUROC = 0.88
    # > # A tibble: 48 x 8
    # >   ThirtyDayReadmi… predicted_Thirt… PatientEncounte… PatientID
    # > * <fct>                       <dbl>            <int>     <int>
    # > 1 N                         0.248                 22     10004
    # > 2 N                         0.00463               24     10005
    # > 3 N                         0.00366               25     10005
    # > 4 N                         0.115                 79     10014
    # > 5 N                         0.0546               135     10024
    # > # ... with 43 more rows, and 4 more variables: SystolicBPNBR <int>,
    # > #   LDLNBR <int>, A1CNBR <dbl>, GenderFLG <fct>
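Since the test set includes the true outcome alongside the predicted probability, the two can be compared to check performance on unseen data. Below is a minimal base R sketch of AUROC computed from ranks (the Mann-Whitney formulation). The function auroc and the toy vectors are hypothetical, and this is not how healthcareai computes the training AUROC it reports.

```r
# Hypothetical helper: area under the ROC curve from predicted probabilities
# and true labels, using the rank-sum (Mann-Whitney) formulation.
auroc <- function(predicted, actual, positive = "Y") {
  is_pos <- actual == positive
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  r <- rank(predicted)  # average ranks, so ties are handled
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy predictions and outcomes (not taken from the vignette's data)
toy_predicted <- c(0.10, 0.40, 0.35, 0.80)
toy_actual <- c("N", "N", "Y", "Y")

auroc(toy_predicted, toy_actual)  # 0.75
```

With the vignette's objects, the predicted and actual values would come from the predicted_ThirtyDayReadmitFLG and ThirtyDayReadmitFLG columns of the data frame returned by predict.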