Make predictions using the best-performing model. For classification models, predicted probabilities are always returned, and you can additionally get predicted outcome classes by specifying outcome_groups or risk groups by specifying risk_groups.

# S3 method for model_list
predict(
  object,
  newdata,
  risk_groups = NULL,
  outcome_groups = NULL,
  prepdata,
  write_log = FALSE,
  ...
)

Arguments

object

model_list object, as from `tune_models`

newdata

data on which to make predictions. If missing, out-of-fold predictions from model training will be returned. If you want new predictions on the training data using the final model, pass the training data to this argument, but know that you're getting over-fit predictions that very likely overestimate model performance relative to what will be achieved on new data. Should have the same structure as the input to `prep_data`, `tune_models`, or `train_models`. `predict` determines whether the data need to be sent through `prep_data` before making predictions and preps them if needed (the `prepdata` argument that formerly controlled this is defunct).

risk_groups

Should predictions be grouped into risk groups and returned in column "predicted_group"? If this is NULL (default), they will not be. If this is a single number, that number of groups will be created with names "risk_group1", "risk_group2", etc. "risk_group1" is always the highest risk (highest predicted probability). The groups will have equal expected sizes, based on the distribution of out-of-fold predictions on the training data. If this is a character vector, its entries will be used as the names of the risk groups, in increasing order of risk, again with equal expected sizes of groups. If you want unequal-size groups, this can be a named numeric vector, where the names will be the names of the risk groups, in increasing order of risk, and the entries will be the relative proportion of observations in the group, again based on the distribution of out-of-fold predictions on the training data. For example, risk_groups = c(low = 2, mid = 1, high = 1) will put the bottom half of predicted probabilities in the "low" group, the next quarter in the "mid" group, and the highest quarter in the "high" group. You can get the cutoff values used to separate groups by passing the output of predict to get_cutoffs. Note that only one of risk_groups and outcome_groups can be specified.

outcome_groups

Should predictions be grouped into outcome classes and returned in column "predicted_group"? If this is NULL (default), they will not be. The threshold for splitting outcome classes is determined on the training data via get_thresholds. If this is TRUE, the threshold is chosen to maximize accuracy, i.e. false positives and false negatives are equally weighted. If this is a number, it is the ratio of the cost (badness) of false negatives (missed detections) to false positives (false alarms). For example, outcome_groups = 5 indicates a preferred ratio of five false alarms to every missed detection, and outcome_groups = .5 indicates that two missed detections is as bad as one false alarm. This value is passed to the cost_fn argument of get_thresholds. You can get the cutoff values used to separate groups by passing the output of predict to get_cutoffs. Note that only one of risk_groups and outcome_groups can be specified; see the sketch following this argument list.

prepdata

Defunct. Data are always prepped in prediction.

write_log

Write prediction metadata to a file? Default is FALSE. If TRUE, will create or append a file called "prediction_log.txt" in the current directory with metadata about predictions. If a character, is the name of a file to create or append with prediction metadata. If you want a unique log file each time predictions are made, use something like write_log = paste0(Sys.time(), " predictions.txt"). This param modifies error behavior and is best used in production. See details.

...

Unused.
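
For classification models, the two grouping arguments are mutually exclusive. A minimal sketch of each, assuming `models` is a trained model_list and `test_df` is new data shaped like the training data (both names are placeholders):

# Outcome groups: one missed detection costs as much as five false alarms
preds <- predict(models, test_df, outcome_groups = 5)

# Risk groups: bottom half "low", next quarter "mid", top quarter "high"
preds <- predict(models, test_df, risk_groups = c(low = 2, mid = 1, high = 1))

# Specifying both risk_groups and outcome_groups in one call is an error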

Value

A tibble data frame: newdata with an additional column containing the predictions, named "predicted_TARGET", where TARGET is the name of the variable being predicted. For classification models, the new column contains predicted probabilities. The tibble has child class "predicted_df" and an attribute, "model_info", that contains information about the model used to make predictions. You can call plot or evaluate on a predicted_df. If write_log is TRUE and this function errors, a zero-row data frame will be returned.

Returned data will contain an attribute, "prediction_log", that contains a tibble of logging info for writing to a database. If write_log is TRUE and predict errors, an empty data frame with the "prediction_log" attribute will still be returned. Extract this attribute using attr(pred, "prediction_log").

Data will also contain a "failed" attribute to easily filter for errors after prediction. Extract using attr(pred, "failed").
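
For example, assuming `pred` holds the output of predict, the attributes described above can be pulled off with attr:

prediction_log <- attr(pred, "prediction_log")  # tibble of logging info
failed <- attr(pred, "failed")                  # flags errored predictions
model_info <- attr(pred, "model_info")          # metadata about the model used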

Details

The model and hyperparameter values with the best out-of-fold performance in model training, according to the selected metric, are used to make predictions. Prepping data inside `predict` has the advantage of returning your predictions with the newdata in its original format.

If write_log is TRUE and an error is encountered, predict will not stop. Instead, it will return the error message as:

- A warning in the console
- A field in the log file
- A column in the "prediction_log" attribute

and a zero-row data frame will be returned in place of predictions.
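
Putting that together, a production-oriented call might look like the following sketch, where `models` and `todays_data` are placeholders and "failed" is assumed to be a single logical flag as described under Value:

pred <- predict(models, newdata = todays_data,
                write_log = paste0(Sys.time(), " predictions.txt"))
# With write_log set, an error produces a warning and a zero-row data frame
# instead of stopping, so a scheduled job can check for failure and react:
if (attr(pred, "failed")) {
  attr(pred, "prediction_log")  # e.g. write this tibble to a database
}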

See also

get_thresholds, get_cutoffs, plot, evaluate

Examples

### Data prep and model training ###
####################################

set.seed(7510)

# Split the first 200 rows in pima_diabetes into a model-training dataset
# containing 3/4 of the data and a test dataset containing 1/4 of the data.
d <- split_train_test(pima_diabetes[1:200, ], diabetes, .75)

# Prep the training data for model training and train regularized regression
# and extreme gradient boosted models
models <- d$train %>%
  prep_data(patient_id, outcome = diabetes) %>%
  flash_models(outcome = diabetes, models = c("glm", "xgb"))
#> Training new data prep recipe...
#> Variable(s) ignored in prep_data won't be used to tune models: patient_id
#> 
#> diabetes looks categorical, so training classification algorithms.
#> 
#> After data processing, models are being trained on 12 features with 151 observations.
#> Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 glm's and 5 xgb's
#> Training at fixed values: glmnet
#> Training at fixed values: eXtreme Gradient Boosting
#> 
#> *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
#> *** If there was PHI in training data, normal PHI protocols apply to the model object. ***
### Making predictions ###
##########################

# Make prediction on test data using the model that performed best in
# cross validation during model training. Before predictions are made, the test
# data is automatically prepared the same way the training data was.
predictions <- predict(models, newdata = d$test)
#> Prepping data based on provided recipe
predictions
#> "predicted_diabetes" predicted by glmnet last trained: 2020-08-05 09:08:42 #> Performance in training: AUROC = 0.84
#> # A tibble: 49 x 11 #> diabetes predicted_diabe… patient_id pregnancies plasma_glucose diastolic_bp #> * <fct> <dbl> <int> <int> <int> <int> #> 1 N 0.340 11 4 110 92 #> 2 Y 0.946 23 7 196 90 #> 3 Y 0.752 25 11 143 94 #> 4 N 0.0587 28 1 97 66 #> 5 Y 0.559 40 4 111 72 #> 6 N 0.853 41 3 180 64 #> 7 N 0.675 42 7 133 84 #> 8 Y 0.941 46 0 180 66 #> 9 N 0.0660 52 1 101 50 #> 10 N 0.815 59 0 146 82 #> # … with 39 more rows, and 5 more variables: skinfold <int>, insulin <int>, #> # weight_class <chr>, pedigree <dbl>, age <int>
evaluate(predictions)
#>      AUPR     AUROC 
#> 0.5902730 0.7365591 
plot(predictions)
### Outcome class predictions ###
#################################

# If you want class predictions in addition to predicted probabilities for
# a classification model, specify outcome_groups. The number passed to
# outcome groups is the cost of a false negative relative to a false positive.
# This example specifies that one missed detection is as bad as ten false
# alarms, and the resulting confusion matrix reflects this preference.
class_preds <- predict(models, newdata = d$test, outcome_groups = 10)
#> Prepping data based on provided recipe
table(actual = class_preds$diabetes, predicted = class_preds$predicted_group)
#>       predicted
#> actual  N  Y
#>      Y  0 18
#>      N  7 24
# You can extract the threshold used to separate predicted Y from predicted N
get_cutoffs(class_preds)
#> Predicted outcomes N and Y separated at: 0.0792
# And you can visualize that cutoff by simply plotting the predictions
plot(class_preds)
### Risk stratification ###
###########################

# Alternatively, you can stratify observations into risk groups by specifying
# the risk_groups parameter. For example, this creates five risk groups
# with custom names. Risk group assignment is based on the distribution of
# predicted probabilities in model training. This is useful because it preserves
# a consistent notion of risk; for example, if you make daily predictions and
# one day happens to contain only low-risk patients, those patients will all
# be classified as low risk. Over the long run, group sizes will be consistent,
# but in any given round of predictions they may differ. If you want fixed
# group sizes, see the following examples.
predict(models, d$test,
        risk_groups = c("very low", "low", "medium", "high", "very high")) %>%
  plot()
#> Prepping data based on provided recipe
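# As noted in the risk_groups documentation, the probability cutoffs that
# separate risk groups can be extracted with get_cutoffs, just as with
# outcome groups (a sketch; output not shown here):
predict(models, d$test,
        risk_groups = c("very low", "low", "medium", "high", "very high")) %>%
  get_cutoffs()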
### Fixed size groups ###
#########################

# If you want groups of fixed sizes, e.g. say you have capacity to admit the three
# highest-risk patients, treat the next five, and have to discharge the remainder,
# you can use predicted probabilities to do that. One way to do that is to
# arrange the predictions data frame in descending order of risk, and then use the
# row numbers to stratify patients
library(dplyr)
predict(models, d$test) %>%
  arrange(desc(predicted_diabetes)) %>%
  mutate(action = case_when(
    row_number() <= 3 ~ "admit",
    row_number() <= 8 ~ "treat",
    TRUE ~ "discharge"
  )) %>%
  select(predicted_diabetes, action, everything())
#> Prepping data based on provided recipe
#> "predicted_diabetes" predicted by glmnet last trained: 2020-08-05 09:08:42 #> Performance in training: AUROC = 0.84
#> # A tibble: 49 x 12 #> predicted_diabe… action diabetes patient_id pregnancies plasma_glucose #> * <dbl> <chr> <fct> <int> <int> <int> #> 1 0.946 admit Y 23 7 196 #> 2 0.941 admit Y 46 0 180 #> 3 0.920 admit Y 187 8 181 #> 4 0.853 treat N 41 3 180 #> 5 0.815 treat N 59 0 146 #> 6 0.801 treat Y 121 0 162 #> 7 0.800 treat Y 112 8 155 #> 8 0.788 treat Y 196 5 158 #> 9 0.752 disch… Y 25 11 143 #> 10 0.711 disch… Y 116 4 146 #> # … with 39 more rows, and 6 more variables: diastolic_bp <int>, #> # skinfold <int>, insulin <int>, weight_class <chr>, pedigree <dbl>, #> # age <int>
# Finally, if you want a fixed group size that is further down on the risk
# scale, you can achieve that with a combination of risk groups and the
# stratifying approach in the last example. For example, say you have capacity
# to admit 5 patients, but you don't want to admit patients in the top 10% of
# risk scores.
predict(models, d$test,
        risk_groups = c("risk acceptable" = 90, "risk too high" = 10)) %>%
  filter(predicted_group == "risk acceptable") %>%
  top_n(n = 5, wt = predicted_diabetes)
#> Prepping data based on provided recipe
#> "predicted_diabetes" predicted by glmnet last trained: 2020-08-05 09:08:42 #> Performance in training: AUROC = 0.84
#> # A tibble: 5 x 12 #> diabetes predicted_diabe… predicted_group patient_id pregnancies #> * <fct> <dbl> <fct> <int> <int> #> 1 Y 0.752 risk acceptable 25 11 #> 2 N 0.815 risk acceptable 59 0 #> 3 Y 0.800 risk acceptable 112 8 #> 4 Y 0.801 risk acceptable 121 0 #> 5 Y 0.788 risk acceptable 196 5 #> # … with 7 more variables: plasma_glucose <int>, diastolic_bp <int>, #> # skinfold <int>, insulin <int>, weight_class <chr>, pedigree <dbl>, #> # age <int>