This vignette demonstrates how to translate healthcareai v1 code into v2 code by translating the example from the old RandomForestDeployment help page. Throughout this vignette, commented code is the v1 code that is being updated in each chunk.

Install and load the package

First, check the package version with packageVersion("healthcareai"). If the first digit isn’t 2, you need to update. You should be able to install v2 by running install.packages("healthcareai").
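The check and the install can be combined into one guarded step. This is a minimal sketch using only base R functions; the variable name needs_v2 is an illustrative assumption, and it presumes that a v2 release is available from your configured package repository.

```r
# TRUE when healthcareai is missing or older than v2 (needs_v2 is an illustrative name)
needs_v2 <- !requireNamespace("healthcareai", quietly = TRUE) ||
  packageVersion("healthcareai") < "2.0.0"

# Install only when needed; assumes v2 is available from your configured repository
if (needs_v2) install.packages("healthcareai")
```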

Then, load the package.
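Assuming the installation above succeeded, loading is a single call:

```r
# Attach healthcareai so its functions (split_train_test, machine_learn, ...) are available
library(healthcareai)
```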

Reserve Validation Set

This step is optional. Withholding some data from model training and evaluating the model on it gives an honest estimate of how well the model will perform on new data. healthcareai v2 already does this repeatedly under the hood during model training (through cross validation), so the performance reported on training data should be consistent with performance on new data. Validation on a held-out test set can still be useful, though, so we demonstrate it here.

split_train_test returns two data frames, train and test, ensuring equal representation of the outcome in each. The following code puts 95% of rows in train and 5% of rows in test. The latter won’t be used in model training, and instead will be used to make predictions to see how well the model performs on data it has never seen.

# dfDeploy <- df[951:1000,]
d <- split_train_test(df, outcome = ThirtyDayReadmitFLG, percent_train = .95)
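To see what the split produces, the sketch below inspects the two data frames. It is an illustration rather than part of the original example: it assumes the pima_diabetes dataset that ships with healthcareai v2 (outcome column diabetes) as a stand-in for df, and the seed value is arbitrary.

```r
library(healthcareai)

set.seed(2018)  # arbitrary seed, only to make the split reproducible

# pima_diabetes is a stand-in for df; diabetes is its outcome column
d <- split_train_test(pima_diabetes, outcome = diabetes, percent_train = .95)

# Roughly 95% of rows land in train and the remainder in test
nrow(d$train)
nrow(d$test)

# The outcome should be similarly distributed in both pieces
prop.table(table(d$train$diabetes))
prop.table(table(d$test$diabetes))
```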

Train Models

machine_learn needs the data frame, any columns that shouldn’t be used in training (such as identifier columns), and the outcome variable. It does everything else automatically, including imputing missing values, training multiple algorithms, and tuning each algorithm through cross validation. Specifying models = "rf" means that only random forests will be trained; if the models argument is omitted, all supported models are trained and the best-performing one is used to make predictions.
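A call along these lines trains the random forests and then predicts on the withheld rows. This is a hedged sketch, not the original chunk: it assumes the built-in pima_diabetes data in place of df, with patient_id as the identifier column and diabetes as the outcome. With the vignette's data, the outcome would be ThirtyDayReadmitFLG and the ignored column would be whichever identifier df contains.

```r
library(healthcareai)

# Stand-in data: pima_diabetes replaces df for the sake of a runnable example
d <- split_train_test(pima_diabetes, outcome = diabetes, percent_train = .95)

# Columns listed after the data frame (here patient_id) are excluded from training
models <- machine_learn(d$train, patient_id, outcome = diabetes, models = "rf")

# Predict on rows the model has never seen
predictions <- predict(models, newdata = d$test)
head(predictions)
```

Training includes cross-validated tuning, so this can take a little while even on a small dataset.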

Save and Load Models

healthcareai v1 automatically saved models to the working directory and automatically loaded them from it. v2 instead provides functions to make it easy to save and load models. You only need to manage model files if you want to use the models in a different R session than the one in which you trained them.

As with other objects in R, you can save an object to disk like this:
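A minimal sketch using base R's save(). The placeholder list is an assumption that only stands in for models trained with machine_learn, so that the snippet also runs on its own:

```r
# Placeholder so the snippet runs standalone; normally models comes from machine_learn()
if (!exists("models")) models <- list(note = "placeholder for trained models")

# Write the models object to a file in the current working directory
save(models, file = "my_random_forests.RDA")
```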

That will write the models object to the my_random_forests.RDA file in the working directory. You can find out where that is by running getwd().

Loading the models with the following line will reestablish the models object in a new R session. That means you can move the my_random_forests.RDA file to another directory or another computer, and get your models there with this line. If the RDA file is in a different directory than your R script (or project if you use RStudio’s Projects), you’ll need to point to that location relative to the current working directory, e.g. load("../data/trained_models/my_random_forests.RDA"). Here is some recommended reading on working directories and filepaths if you need help loading saved models.
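The round trip can be sketched with base R alone. Here a temporary directory stands in for "another directory", and the placeholder list is a hypothetical stand-in for trained models; neither is part of the original example.

```r
# Hypothetical location: a temporary directory plays the role of another folder
path <- file.path(tempdir(), "my_random_forests.RDA")

# Placeholder object; in practice this is the output of machine_learn()
models <- list(note = "placeholder for trained models")
save(models, file = path)

# Simulate a fresh session by removing the object, then restore it from disk
rm(models)
restored_names <- load(path)

# load() recreates the object under its original name and returns that name
restored_names
exists("models")
```

Because load() restores objects under the names they were saved with, loading a file overwrites any existing object called models in the current session.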

Further Reading

For more detail on using v2 or about what’s happening under the hood, see Getting Started or the function reference page. To see how to connect to a database, see Database Connections.