First we attach the healthcareai R package to make its functions available. If your package version is less than 2.0, none of the code here will work. You can check the package version with packageVersion("healthcareai")
, and you can get the latest stable version by running install.packages("healthcareai")
. If you have v1.X code that you want to use with the new version of the package, check out the Transitioning vignette.
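For example:

packageVersion("healthcareai")      # should be 2.0.0 or greater
# install.packages("healthcareai")  # uncomment to install or update the package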
library(healthcareai)
# > healthcareai version 2.5.0
# > Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com
healthcareai
comes with a built-in dataset documenting diabetes among adult Pima females. Once you attach the package, the dataset is available in the variable pima_diabetes. Let's take a look at the data with the str function. There are 768 records across 10 variables, including one identifier column, several nominal variables, and substantial missingness (represented in R by NA).
str(pima_diabetes)
# > tibble [768 × 10] (S3: tbl_df/tbl/data.frame)
# >  $ patient_id    : int [1:768] 1 2 3 4 5 6 7 8 9 10 ...
# >  $ pregnancies   : int [1:768] 6 1 8 1 0 5 3 10 2 8 ...
# >  $ plasma_glucose: int [1:768] 148 85 183 89 137 116 78 115 197 125 ...
# >  $ diastolic_bp  : int [1:768] 72 66 64 66 40 74 50 NA 70 96 ...
# >  $ skinfold      : int [1:768] 35 29 NA 23 35 NA 32 NA 45 NA ...
# >  $ insulin       : int [1:768] NA NA NA 94 168 NA 88 NA 543 NA ...
# >  $ weight_class  : chr [1:768] "obese" "overweight" "normal" "overweight" ...
# >  $ pedigree      : num [1:768] 0.627 0.351 0.672 0.167 2.288 ...
# >  $ age           : int [1:768] 50 31 32 21 33 30 26 29 53 54 ...
# >  $ diabetes      : chr [1:768] "Y" "N" "Y" "N" ...
If you don’t want to fuss with details any more than necessary, machine_learn
is the function for you. It makes it as easy as possible to implement machine learning models by putting all the details in the background so that you don't have to worry about them. Of course, it might be wise to worry about them, and we'll get to how to do that further down, but for now, you can automatically take care of problems in the data, do basic feature engineering, and tune multiple machine learning models using cross validation with machine_learn
.
machine_learn
always takes the data frame first, then any columns that should not be used by the model (uninformative columns, such as IDs), then the variable to be predicted with outcome =
. If you want machine_learn
to run faster, at the expense of a bit of predictive power, you can set its tune
argument to FALSE
.
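For example, a quick untuned run would look like this (a sketch; output not shown here):

machine_learn(pima_diabetes, patient_id, outcome = diabetes, tune = FALSE)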
quick_models <- machine_learn(pima_diabetes, patient_id, outcome = diabetes)
# > Training new data prep recipe...
# > Variable(s) ignored in prep_data won't be used to tune models: patient_id
# >
# > diabetes looks categorical, so training classification algorithms.
# >
# > After data processing, models are being trained on 12 features with 768 observations.
# > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's, 50 xgb's, and 100 glm's
# > Training with cross validation: Random Forest
# > Training with cross validation: eXtreme Gradient Boosting
# > Training with cross validation: glmnet
# >
# > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
# > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
machine_learn
has told us that it has created a recipe for data preparation (this allows us to do exactly the same data cleaning and feature engineering when you want predictions on a new dataset), is ignoring patient_id
when tuning models as we told it to, is training classification algorithms because the outcome variable diabetes
is categorical, and has executed cross validation for three machine learning models: random forests, XGBoost, and regularized regression. Let’s see what the models look like.
quick_models
# > Algorithms Trained: Random Forest, eXtreme Gradient Boosting, and glmnet
# > Model Name: diabetes
# > Target: diabetes
# > Class: Classification
# > Performance Metric: AUROC
# > Number of Observations: 768
# > Number of Features: 12
# > Models Trained: 2020-08-05 09:09:29
# >
# > Models tuned via 5-fold cross validation over 9 combinations of hyperparameter values.
# > Best model: Random Forest
# > AUPR = 0.71, AUROC = 0.85
# > Optimal hyperparameter values:
# >   mtry = 5
# >   splitrule = extratrees
# >   min.node.size = 20
Everything looks as expected, and the best model is a random forest that achieves AUROC = 0.85. Not bad for one line of code.
Now that we have our models, we can make predictions using the predict
function. If you provide a new data frame to predict
it will make predictions on the new data; otherwise, it will make predictions on the training data.
predictions <- predict(quick_models)
predictions
# > "predicted_diabetes" predicted by Random Forest last trained: 2020-08-05 09:09:29
# > Performance in training: AUROC = 0.85
# > # A tibble: 768 x 11
# >   diabetes predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp
# > * <fct>               <dbl>      <int>       <int>          <int>        <int>
# > 1 Y                 0.678             1           6            148           72
# > 2 N                 0.153             2           1             85           66
# > 3 Y                 0.460             3           8            183           64
# > 4 N                 0.00927           4           1             89           66
# > 5 Y                 0.566             5           0            137           40
# > # … with 763 more rows, and 5 more variables: skinfold <int>, insulin <int>,
# > #   weight_class <chr>, pedigree <dbl>, age <int>
We get a message about when the model was trained and how well it performed in training, and we get back a data frame that looks sort of like the original, but has a new column predicted_diabetes that contains the model-generated probability that each individual has diabetes, and reflects the changes made when preparing the data for model training, e.g. missingness has been filled in and weight_class has been split into a series of “dummy” variables.
We can plot how effectively the model is able to separate diabetic from non-diabetic individuals by calling the plot
function on the output of predict
.
plot(predictions)
If you want outcome-class predictions in addition to predicted probabilities, the outcome_groups argument accomplishes that. If it is TRUE, the overall accuracy of predictions is maximized. If it is a number, it represents the relative cost of a false negative to a false positive. The example below says that one false negative is as bad as two false positives. If you want risk groups instead, see the risk_groups
argument.
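For example, a sketch using the quick_models object trained above, where outcome_groups = 2 means one false negative is as costly as two false positives:

predict(quick_models, outcome_groups = 2) %>% plot()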
It is always a good idea to be aware of where there are missing values in data. The missingness
function helps with that. In addition to looking for values R sees as missing, it looks for other values that might represent missing, such as "NULL"
, and issues a warning if it finds any. Like many healthcareai
functions, it has a plot
method so you can inspect the results more quickly and intuitively by passing the output to plot
.
missingness(pima_diabetes) %>% plot()
It’s good that we don’t have any missingness in our ID or outcome columns. We’ll see how missingness in predictors is addressed further down.
To get an honest picture of how well a model performs (and an accurate estimate of how well it will perform on yet-unseen data), it is wise to hide a small portion of observations from model training and assess model performance on this “validation” or “test” dataset. In fact, healthcareai
does this automatically and repeatedly under the hood, so it’s not strictly necessary, but it’s still a good idea. The split_train_test
function simplifies this, and it ensures the test dataset has proportionally similar characteristics to the training dataset. By default, 80% of observations are used for training; that proportion can be adjusted with the p
parameter. The seed
parameter controls randomness so that you can get the same split every time you run the code if you want strict reproducibility.
split_data <- split_train_test(d = pima_diabetes, outcome = diabetes, p = .9, seed = 84105)
split_data
contains two data frames, named train
and test
.
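For example, you can check the sizes of the two pieces; with p = .9, roughly 90% of rows should land in the training set:

nrow(split_data$train)
nrow(split_data$test)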
One of the major workhorse functions in healthcareai
is prep_data
. It is called under the hood by machine_learn
, so you don’t have to worry about these details if you don’t want to, but eventually you’ll want to customize how your data is prepared; this is where you do that. The helpfile ?prep_data
describes what the function does and how it can be customized. Here, let’s customize preparation to scale and center numeric variables and avoid collapsing rare factor levels into “other”.
The first arguments to prep_data
are the same as those to machine_learn
: data frame, ignored columns, and the outcome column. Then we can specify prep details.
prepped_training_data <- prep_data(split_data$train, patient_id, outcome = diabetes,
                                   center = TRUE, scale = TRUE,
                                   collapse_rare_factors = FALSE)
# > Training new data prep recipe...
The “recipe” that the above message refers to is a set of instructions for how to transform a dataset the way we just transformed our training data. Any machine learning that we do (within healthcareai
) on prepped_training_data
will retain that recipe and apply it before making predictions on new data. That means that when you have models making predictions in production, you don’t have to figure out how to transform the data or worry about encountering missing data or new category levels.
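If you ever want to apply the same preparation to new data yourself (models do this automatically at predict time), a sketch like the following should work, assuming prep_data can extract the recipe from a previously prepped data frame via its recipe argument:

prepped_test_data <- prep_data(split_data$test, recipe = prepped_training_data)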
machine_learn
takes care of data preparation and model training for you, but if you want more precise control, tune_models
and flash_models
are the model-training functions you're looking for. They differ in that tune_models
searches over hyperparameters to optimize model performance, while flash_models
trains models at set hyperparameter values. So, tune_models
produces better models, but takes longer (approaching 10x longer at default settings).
Let’s tune all three available models: random forests (“RF”), regularized regression (i.e. lasso and ridge, “GLM”), and gradient-boosted decision trees (i.e. XGBoost, “XGB”). To optimize model performance, let’s crank tune_depth
up a little from its default value of ten. That will tune the models over more combinations of hyperparameter values in the search for the best model. This increases training time, so be cautious with it at first; for this modest-sized dataset, though, the entire process takes less than a minute on a laptop.
Let’s also select “PR” as our model metric. That optimizes for area under the precision-recall curve rather than the default of area under the receiver operating characteristic curve (“ROC”). This is usually a good idea when one outcome category is much more common than the other category.
models <- tune_models(d = prepped_training_data,
                      outcome = diabetes,
                      tune_depth = 25,
                      metric = "PR")
# > Variable(s) ignored in prep_data won't be used to tune models: patient_id
# >
# > diabetes looks categorical, so training classification algorithms.
# >
# > After data processing, models are being trained on 13 features with 692 observations.
# > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 125 rf's, 125 xgb's, and 250 glm's
# > Training with cross validation: Random Forest
# > Training with cross validation: eXtreme Gradient Boosting
# > Training with cross validation: glmnet
# >
# > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
# > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
You can compare performance across models with evaluate
.
evaluate(models, all_models = TRUE)
# > # A tibble: 3 x 3
# >   model                      AUPR AUROC
# >   <chr>                     <dbl> <dbl>
# > 1 Random Forest             0.703 0.842
# > 2 glmnet                    0.688 0.836
# > 3 eXtreme Gradient Boosting 0.687 0.820
For more detail, you can examine how models perform across hyperparameters by plotting the model object. Here we plot only the best model's performance across hyperparameters by extracting it by name. It looks like extratrees is a superior split rule for this model.
models["Random Forest"] %>% plot()
If you’re feeling the need for speed, flash_models
is the function for you. It uses fixed sets of hyperparameter values to train the models, so you still get a model customized to your data, but without burning the electricity and time to precisely optimize all the details. Here we’ll use models = "RF"
to train only a random forest.
If you want to train a model on fixed hyperparameter values, but you want to choose those values, you can pass them to the hyperparameters
argument of tune_models
. Run get_hyperparameter_defaults()
to see the default values and get a list you can customize.
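A sketch of that workflow; the list structure and element names used below are assumptions, so inspect the output of get_hyperparameter_defaults() before editing it:

hp <- get_hyperparameter_defaults()
str(hp)          # check the real structure; the names below are illustrative
hp$RF$mtry <- 3  # hypothetical tweak
custom_rf <- tune_models(d = prepped_training_data, outcome = diabetes,
                         models = "RF", hyperparameters = hp, metric = "PR")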
untuned_rf <- flash_models(d = prepped_training_data,
                           outcome = diabetes,
                           models = "RF",
                           metric = "PR")
# > Variable(s) ignored in prep_data won't be used to tune models: patient_id
# >
# > diabetes looks categorical, so training classification algorithms.
# >
# > After data processing, models are being trained on 13 features with 692 observations.
# > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 5 rf's
# > Training at fixed values: Random Forest
# >
# > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
# > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
If you trained a GLM model, you can extract model coefficients from it with the interpret
function. These are coefficient estimates from a regularized logistic or linear regression model. If you don't scale your predictors (scaling is not the default in prep_data), the coefficients are in natural units, e.g. the expected change in the log-odds of diabetes for a one-unit increase in plasma glucose. Importantly, natural units mean that you can't compare the sizes of the coefficients to judge the importance of the predictors. To get that interpretation, scale your features during data preparation by calling prep_data with scale = TRUE and then train with flash_models or tune_models. That is what we did above, so the coefficients plotted below are on a comparable, standardized scale; for example, a one-standard-deviation increase in plasma glucose corresponds to an expected increase in the log-odds of diabetes of just over one.
In this plot, the low value of weight_class_normal
signifies that people with normal weight are less likely to have diabetes. Similarly, plasma glucose is associated with increased risk of diabetes after accounting for other variables.
interpret(models) %>% plot()
# > Warning in interpret(models): Interpreting glmnet model, but Random Forest
# > performed best in cross-validation and will be used to make predictions. To use
# > the glmnet model for predictions, extract it with x['glmnet'].
Tree-based methods such as random forests and boosted decision trees can't provide coefficients the way regularized regression models can, but they can provide information about how important each feature (aka predictor, aka variable) is for making accurate predictions. You can see these “variable importances” by calling get_variable_importance
on your model object. Like interpret
and many other functions in healthcareai
, you can plot the output of get_variable_importance
with a simple plot
call.
get_variable_importance(models) %>% plot()
The explore
function reveals how a model makes its predictions. It takes the most important features in a model and uses a variety of “counterfactual” observations across those features to see what predictions the model would make at various combinations of their values. To see the effect of more features, adjust the n_use
argument to plot
, or for different features, specify x_var
and color_var
.
explore(models) %>% plot()
# > With 4 varying features and n_use = 2, using median to aggregate predicted outcomes across age and pregnancies. You could turn `n_use` up to see the impact of more features.
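To see the effect of more features in the plot, a sketch (assuming n_use is passed straight through to the plot method, as the message above suggests):

explore(models) %>% plot(n_use = 3)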
predict
will automatically use the best-performing model from training (evaluated out-of-fold in cross validation). If no new data is passed to predict
it will return out-of-fold predictions from training. The predicted probabilities appear in the predicted_diabetes
column.
predict(models)
# > "predicted_diabetes" predicted by Random Forest last trained: 2020-08-05 09:10:10
# > Performance in training: AUPR = 0.7
# > # A tibble: 692 x 11
# >   diabetes predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp
# > * <fct>               <dbl>      <int>       <int>          <int>        <int>
# > 1 Y                  0.691            1           6            148           72
# > 2 N                  0.142            2           1             85           66
# > 3 Y                  0.432            3           8            183           64
# > 4 N                  0.0219           4           1             89           66
# > 5 Y                  0.534            5           0            137           40
# > # … with 687 more rows, and 5 more variables: skinfold <int>, insulin <int>,
# > #   weight_class <chr>, pedigree <dbl>, age <int>
To get predictions on a new dataset, pass the new data to predict
, and it will automatically be prepared based on the recipe generated on the training data. We can plot the predictions to see how well our model is doing, and we see that it's separating diabetic from non-diabetic individuals pretty well, although there are a fair number of non-diabetics with high predicted probabilities of diabetes. This may be due to optimizing for the precision-recall metric, or may indicate pre-diabetic patients.
Above, we saw how to make outcome-class predictions. Here, we make risk-group predictions, defining four risk groups (low, moderate, high, and extreme) containing 30%, 40%, 20% and 10% of patients, respectively.
test_predictions <- predict(models,
                            split_data$test,
                            risk_groups = c(low = 30, moderate = 40, high = 20, extreme = 10))
# > Prepping data based on provided recipe
plot(test_predictions)
Everything we have done above happens “in memory”. It's all within one R session, so there's no need to save anything to disk or load anything back into R. Putting a machine learning model in production typically means moving the model into a production environment. To do that, save the model with the save_models function.
save_models(models, file = "my_models.RDS")
The above code will store the models
object with all its metadata in the my_models.RDS
file in the working directory, which you can identify with getwd()
. You can move that file to any other directory or machine, even across operating systems, and pull it back into R with the load_models
function.
The only tricky thing here is that you have to point load_models
to the directory that the model file is in. If you don’t provide a filepath, i.e. call load_models()
, you’ll get a dialog box from which you can choose your model file. Otherwise, you can provide load_models
an absolute path to the file, e.g. load_models("C:/Users/user.name/Documents/diabetes/my_models.RDS")
, or a path relative to your working directory, which again you can find with getwd()
, e.g. load_models("data/my_models.RDS")
. If you put the models in the same directory as your R script or project, you can load the models without any file path.
models <- load_models("my_models.RDS")
That will reestablish the models
object in your R session. You can confirm this by clicking on the “Environment” tab in RStudio or by running ls()
to list all objects in your R session.
All the examples above have been classification tasks, predicting a yes/no outcome. Here’s an example of a full regression modeling pipeline on a silly problem: predicting individuals’ ages. The code is very similar to classification.
regression_models <- machine_learn(pima_diabetes, patient_id, outcome = age)
# > Training new data prep recipe...
# > Variable(s) ignored in prep_data won't be used to tune models: patient_id
# >
# > age looks numeric, so training regression algorithms.
# >
# > After data processing, models are being trained on 14 features with 768 observations.
# > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 rf's, 50 xgb's, and 100 glm's
# > Training with cross validation: Random Forest
# > Training with cross validation: eXtreme Gradient Boosting
# > Training with cross validation: glmnet
# >
# > *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
# > *** If there was PHI in training data, normal PHI protocols apply to the model object. ***

summary(regression_models)
# > Models trained: 2020-08-05 09:10:29
# >
# > Models tuned via 5-fold cross validation over 10 combinations of hyperparameter values.
# > Best performance: RMSE = 9.1, MAE = 6.5, Rsquared = 0.41
# > By Random Forest with hyperparameters:
# >   mtry = 4
# >   splitrule = variance
# >   min.node.size = 17
# >
# > Out-of-fold performance of all trained models:
# >
# > $`Random Forest`
# > # A tibble: 10 x 9
# >    mtry splitrule  min.node.size  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD
# >   <int> <chr>              <int> <dbl>    <dbl> <dbl>  <dbl>      <dbl> <dbl>
# > 1     4 variance              17  9.07    0.410  6.53  0.655     0.0283 0.276
# > 2     3 variance               4  9.08    0.412  6.59  0.694     0.0283 0.305
# > 3     5 variance               7  9.09    0.404  6.53  0.601     0.0246 0.235
# > 4     4 extratrees            17  9.16    0.417  6.66  0.814     0.0408 0.441
# > 5     7 variance               2  9.20    0.391  6.62  0.587     0.0209 0.212
# > # … with 5 more rows
# >
# > $`eXtreme Gradient Boosting`
# > # A tibble: 10 x 13
# >      eta max_depth gamma colsample_bytree min_child_weight subsample nrounds
# >    <dbl>     <int> <dbl>            <dbl>            <dbl>     <dbl>   <int>
# > 1 0.0291         4 5.73             0.730            0.248     0.763     570
# > 2 0.176          7 9.76             0.518            3.53      0.744      46
# > 3 0.0990         2 0.350            0.624            2.33      0.526     626
# > 4 0.423          5 6.79             0.643            3.80      0.940      69
# > 5 0.432          5 6.23             0.505           14.2       0.356      30
# > # … with 5 more rows, and 6 more variables: RMSE <dbl>, Rsquared <dbl>,
# > #   MAE <dbl>, RMSESD <dbl>, RsquaredSD <dbl>, MAESD <dbl>
# >
# > $glmnet
# > # A tibble: 20 x 8
# >   alpha  lambda  RMSE Rsquared   MAE RMSESD RsquaredSD MAESD
# >   <dbl>   <dbl> <dbl>    <dbl> <dbl>  <dbl>      <dbl> <dbl>
# > 1     0 0.00128  9.37    0.377  6.74  0.578     0.0798 0.358
# > 2     0 0.00367  9.37    0.377  6.74  0.578     0.0798 0.358
# > 3     0 0.00896  9.37    0.377  6.74  0.578     0.0798 0.358
# > 4     0 0.0218   9.37    0.377  6.74  0.578     0.0798 0.358
# > 5     0 0.0367   9.37    0.377  6.74  0.578     0.0798 0.358
# > # … with 15 more rows
Let’s make a prediction on a hypothetical new patient. Note that the model handles missingness in insulin
and a new category level in weight_class
without a problem (but warns about it).
new_patient <- data.frame(
  pregnancies = 0,
  plasma_glucose = 80,
  diastolic_bp = 55,
  skinfold = 24,
  insulin = NA,
  weight_class = "???",
  pedigree = .2,
  diabetes = "N")
predict(regression_models, new_patient)
# > Warning in ready_with_prep(object, newdata, mi): The following variables(s) had the following value(s) in predict that were not observed in training.
# >  weight_class: ???
# > Prepping data based on provided recipe
# > "predicted_age" predicted by Random Forest last trained: 2020-08-05 09:10:29
# > Performance in training: RMSE = 9.07
# > # A tibble: 1 x 9
# >   predicted_age pregnancies plasma_glucose diastolic_bp skinfold insulin
# > *         <dbl>       <dbl>          <dbl>        <dbl>    <dbl> <lgl>
# > 1          23.7           0             80           55       24 NA
# > # … with 3 more variables: weight_class <chr>, pedigree <dbl>, diabetes <chr>