Get model performance metrics

evaluate(x, ...)
# S3 method for predicted_df
evaluate(x, na.rm = FALSE, ...)
# S3 method for model_list
evaluate(x, all_models = FALSE, ...)

## Arguments

| Argument | Description |
| --- | --- |
| `x` | Object to be evaluated. |
| `...` | Not used. |
| `na.rm` | Logical. If `FALSE` (default), performance metrics will be `NA` if any rows are missing an outcome value. If `TRUE`, performance will be evaluated on the rows that have an outcome value. Only used when evaluating a prediction data frame. |
| `all_models` | Logical. If `FALSE` (default), a numeric vector giving performance metrics for the best-performing model is returned. If `TRUE`, a data frame with performance metrics for all trained models is returned. Only used when evaluating a model_list. |
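
The `na.rm` semantics can be pictured with a small base-R sketch. This is an illustration only: `mae` is a hypothetical helper written for this example, the data are made up, and the package's own metric code is not shown here and may differ.

```r
# Hypothetical helper (not part of the package): mean absolute error that
# mimics the documented na.rm behaviour for missing outcome values.
mae <- function(predicted, actual, na.rm = FALSE) {
  missing_outcome <- is.na(actual)
  if (any(missing_outcome)) {
    # With na.rm = FALSE, any missing outcome makes the metric NA
    if (!na.rm) return(NA_real_)
    # With na.rm = TRUE, score only the rows that have an outcome
    predicted <- predicted[!missing_outcome]
    actual <- actual[!missing_outcome]
  }
  mean(abs(predicted - actual))
}

# Made-up predictions; the second row has no observed outcome
predicted <- c(2.0, 3.5, 5.0)
actual    <- c(2.5, NA, 4.0)

mae(predicted, actual)                # NA
mae(predicted, actual, na.rm = TRUE)  # 0.75
```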

## Value

Either a numeric vector or a data frame, depending on the value of
`all_models`.

## Details

This function gets model performance from a model_list object that
comes from `machine_learn`, `tune_models`, or `flash_models`, or from a data
frame of predictions from `predict.model_list`. For the latter, the data
passed to `predict.model_list` must contain observed outcomes. If you have
predictions and outcomes in a different format, see
`evaluate_classification` or `evaluate_regression` instead.

You may notice that `evaluate(models)` and `evaluate(predict(models))`
return slightly different performance metrics, even though they are being
calculated on the same (out-of-fold) predictions. This is because metrics in
training (returned from `evaluate(models)`) are calculated within each
cross-validation fold and then averaged, while metrics calculated on the
prediction data frame (`evaluate(predict(models))`) are calculated once on
all observations.
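
To see why fold-averaged and pooled metrics can differ on identical predictions, here is a minimal base-R sketch. The `auroc` helper and the six out-of-fold predictions are made up for illustration; this is not the package's internal code.

```r
# Hypothetical helper (not a package function): rank-based AUROC, which is
# equivalent to the Mann-Whitney U statistic. `actual` is coded 0/1.
auroc <- function(predicted, actual) {
  n_pos <- sum(actual == 1)
  n_neg <- sum(actual == 0)
  r <- rank(predicted)  # default ties.method = "average"
  (sum(r[actual == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up out-of-fold predictions: six observations in two folds
oof <- data.frame(
  fold      = c(1, 1, 1, 2, 2, 2),
  actual    = c(1, 0, 0, 1, 1, 0),
  predicted = c(0.9, 0.2, 0.4, 0.3, 0.5, 0.1)
)

# Training-style metric: AUROC within each fold, then averaged
per_fold <- sapply(split(oof, oof$fold),
                   function(d) auroc(d$predicted, d$actual))
fold_averaged <- mean(per_fold)

# Prediction-data-frame-style metric: AUROC once over all observations
pooled <- auroc(oof$predicted, oof$actual)

c(fold_averaged = fold_averaged, pooled = pooled)
```

In this toy data every fold ranks its own positives above its own negatives, so the fold-averaged AUROC is 1. Pooling lets a fold-1 negative (0.4) outrank a fold-2 positive (0.3), which brings the pooled AUROC down to 8/9.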

## Examples
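
The call that created `models` is not shown on this page. The sketch below is a reconstruction under stated assumptions, not the original example: the training rows `pima_diabetes[1:40, ]` are inferred from the 40 observations reported in the log and from rows 41:50 being used for testing, and the `models`, `tune`, and `n_folds` values are inferred from the log messages ("3 xgb's and 3 rf's", "Training at fixed values", "n_folds = 3").

```r
library(healthcareai)

# Assumed reconstruction of the training call; argument values are inferred
# from the log output that follows, not copied from the original example.
models <- machine_learn(pima_diabetes[1:40, ],  # assumed training rows
                        patient_id,             # ID column ignored in training
                        outcome = diabetes,
                        models = c("xgb", "rf"),
                        tune = FALSE,           # train at fixed hyperparameter values
                        n_folds = 3)
```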

#> Training new data prep recipe...

#> Variable(s) ignored in prep_data won't be used to tune models: patient_id

#>
#> diabetes looks categorical, so training classification algorithms.

#>
#> After data processing, models are being trained on 12 features with 40 observations.
#> Based on n_folds = 3 and hyperparameter settings, the following number of models will be trained: 3 xgb's and 3 rf's

#> Training at fixed values: eXtreme Gradient Boosting

#> Training at fixed values: Random Forest

#>
#> *** Models successfully trained. The model object contains the training data minus ignored ID columns. ***
#> *** If there was PHI in training data, normal PHI protocols apply to the model object. ***

# By default, evaluate returns performance of only the best model
evaluate(models)

#>      AUPR     AUROC
#> 0.5856454 0.6636905

# Set all_models = TRUE to see the performance of all trained models
evaluate(models, all_models = TRUE)

#> # A tibble: 2 x 3
#>   model                      AUPR AUROC
#>   <chr>                     <dbl> <dbl>
#> 1 Random Forest             0.586 0.664
#> 2 eXtreme Gradient Boosting 0.558 0.642

# Can also get performance on a test dataset
predictions <- predict(models, newdata = pima_diabetes[41:50, ])

#> Prepping data based on provided recipe

evaluate(predictions)

#>      AUPR     AUROC
#> 0.4305556 0.9047619