Source: R/get_thresholds.R (get_thresholds.Rd)
healthcareai gives you predicted probabilities for classification problems, but sometimes you need to convert probabilities into predicted classes. That requires choosing a threshold: probabilities above the threshold are predicted as the positive class, and probabilities below it as the negative class. This function helps you choose by calculating a suite of model-performance metrics at every possible threshold.
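The conversion itself is just a comparison against the chosen threshold. As a minimal illustration (not package code):

```r
# Convert predicted probabilities to Y/N classes with a threshold of 0.5
probs <- c(0.20, 0.45, 0.51, 0.90)
ifelse(probs > 0.5, "Y", "N")
#> [1] "N" "N" "Y" "Y"
```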
"cost" is an especially useful measure as it allows you to weight how bad a
false alarm is relative to a missed detection. E.g. if for your use case
a missed detection is five times as bad as a false alarm (another way to say
that is that you're willing to allow five false positives for every one
false negative), set cost_fn = 5
and use the threshold that minimizes
cost (see examples
).
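Judging from the example output below, cost appears to be the weighted misclassification rate, (cost_fp * FP + cost_fn * FN) / n. The sketch below illustrates that inferred formula; `cost_at_threshold` is a hypothetical helper, not part of the package:

```r
# Inferred cost formula: weighted error counts divided by number of observations
cost_at_threshold <- function(fp, fn, n, cost_fp = 1, cost_fn = 1) {
  (cost_fp * fp + cost_fn * fn) / n
}

# At threshold 0.628 in the examples below, the model makes 1 false positive
# and 1 false negative on 15 patients:
cost_at_threshold(fp = 1, fn = 1, n = 15)               # default costs: ~0.133
cost_at_threshold(fp = 1, fn = 1, n = 15, cost_fn = 5)  # with cost_fn = 5: 0.4
```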
We recommend plotting the thresholds with their performance measures to see how optimizing for one measure affects performance on the others. See plot.thresholds_df for how to do this.
Usage

get_thresholds(x, optimize = NULL, measures = "all", cost_fp = 1, cost_fn = 1)
Arguments

x         Either a predictions data frame (from predict) or a model_list.

optimize  Optional. If provided, one of the entries in measures. A logical
          column named optimal is added to the returned data frame; it is
          TRUE for the row with the best value of the chosen measure and
          FALSE elsewhere.

measures  Character vector of performance metrics to calculate, or "all"
          (the default), which is equivalent to using all of the following
          measures: cost, acc (accuracy), tpr (true positive rate), fnr
          (false negative rate), tnr (true negative rate), fpr (false
          positive rate), ppv (positive predictive value), and npv (negative
          predictive value). The returned data frame will have one column
          for each measure.

cost_fp   Cost of a false positive. Default = 1. Only affects cost.

cost_fn   Cost of a false negative. Default = 1. Only affects cost.
Value

A tibble with one row for each possible threshold and columns for the threshold and each value in measures.
Examples

library(dplyr)

models <- machine_learn(pima_diabetes[1:15, ], patient_id, outcome = diabetes,
                        models = "xgb", tune = FALSE)

get_thresholds(models)
#> # A tibble: 14 x 9
#>    threshold  cost   acc   tpr   fnr   tnr   fpr   ppv   npv
#>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1   Inf     0.6   0.4   0     1     1     0     NaN   0.4
#>  2     0.839 0.6   0.4   0.111 0.889 0.833 0.167 0.5   0.385
#>  3     0.813 0.533 0.467 0.222 0.778 0.833 0.167 0.667 0.417
#>  4     0.761 0.40  0.6   0.444 0.556 0.833 0.167 0.8   0.5
#>  5     0.676 0.333 0.667 0.556 0.444 0.833 0.167 0.833 0.556
#>  6     0.666 0.267 0.733 0.667 0.333 0.833 0.167 0.857 0.625
#>  7     0.634 0.2   0.8   0.778 0.222 0.833 0.167 0.875 0.714
#>  8     0.628 0.133 0.867 0.889 0.111 0.833 0.167 0.889 0.833
#>  9     0.563 0.2   0.8   0.889 0.111 0.667 0.333 0.8   0.8
#> 10     0.560 0.267 0.733 0.889 0.111 0.5   0.5   0.727 0.75
#> 11     0.457 0.333 0.667 0.889 0.111 0.333 0.667 0.667 0.667
#> 12     0.456 0.4   0.6   0.889 0.111 0.167 0.833 0.615 0.5
#> 13     0.414 0.467 0.533 0.889 0.111 0     1     0.571 0
#> 14     0.383 0.4   0.6   1     0     0     1     0.6   NaN

# Identify the threshold that maximizes accuracy:
get_thresholds(models, optimize = "acc")
#> # A tibble: 14 x 10
#>    threshold  cost   acc   tpr   fnr   tnr   fpr   ppv   npv optimal
#>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#>  1   Inf     0.6   0.4   0     1     1     0     NaN   0.4   FALSE
#>  2     0.839 0.6   0.4   0.111 0.889 0.833 0.167 0.5   0.385 FALSE
#>  3     0.813 0.533 0.467 0.222 0.778 0.833 0.167 0.667 0.417 FALSE
#>  4     0.761 0.40  0.6   0.444 0.556 0.833 0.167 0.8   0.5   FALSE
#>  5     0.676 0.333 0.667 0.556 0.444 0.833 0.167 0.833 0.556 FALSE
#>  6     0.666 0.267 0.733 0.667 0.333 0.833 0.167 0.857 0.625 FALSE
#>  7     0.634 0.2   0.8   0.778 0.222 0.833 0.167 0.875 0.714 FALSE
#>  8     0.628 0.133 0.867 0.889 0.111 0.833 0.167 0.889 0.833 TRUE
#>  9     0.563 0.2   0.8   0.889 0.111 0.667 0.333 0.8   0.8   FALSE
#> 10     0.560 0.267 0.733 0.889 0.111 0.5   0.5   0.727 0.75  FALSE
#> 11     0.457 0.333 0.667 0.889 0.111 0.333 0.667 0.667 0.667 FALSE
#> 12     0.456 0.4   0.6   0.889 0.111 0.167 0.833 0.615 0.5   FALSE
#> 13     0.414 0.467 0.533 0.889 0.111 0     1     0.571 0     FALSE
#> 14     0.383 0.4   0.6   1     0     0     1     0.6   NaN   FALSE

# Assert that one missed detection is as bad as five false alarms and
# filter to the threshold that minimizes "cost" based on that assertion:
get_thresholds(models, optimize = "cost", cost_fn = 5) %>%
  filter(optimal)
#> # A tibble: 1 x 10
#>   threshold  cost   acc   tpr   fnr   tnr   fpr   ppv   npv optimal
#>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <lgl>
#> 1     0.628  0.40 0.867 0.889 0.111 0.833 0.167 0.889 0.833 TRUE

# Inspect the predictions data frame:
predict(models)
#> # A tibble: 15 x 12
#>    diabetes predicted_diabe… predicted_group patient_id pregnancies
#>  * <fct>               <dbl> <fct>                <int>       <int>
#>  1 Y                   0.839 Y                        1           6
#>  2 N                   0.563 N                        2           1
#>  3 Y                   0.666 Y                        3           8
#>  4 N                   0.456 N                        4           1
#>  5 Y                   0.761 Y                        5           0
#>  6 N                   0.560 N                        6           5
#>  7 Y                   0.383 N                        7           3
#>  8 N                   0.457 N                        8          10
#>  9 Y                   0.676 Y                        9           2
#> 10 Y                   0.628 Y                       10           8
#> 11 N                   0.414 N                       11           4
#> 12 Y                   0.634 Y                       12          10
#> 13 N                   0.839 Y                       13          10
#> 14 Y                   0.813 Y                       14           1
#> 15 Y                   0.761 Y                       15           5
#> # … with 7 more variables: plasma_glucose <int>, diastolic_bp <int>,
#> #   skinfold <int>, insulin <int>, weight_class <chr>, pedigree <dbl>,
#> #   age <int>

# If a measure is provided to optimize, the best threshold will be
# highlighted in plots:
get_thresholds(models, optimize = "acc") %>%
  plot()

## Transform probability predictions into classes based on an optimal threshold ##

# Pull the threshold that minimizes cost
optimal_threshold <-
  get_thresholds(models, optimize = "cost") %>%
  filter(optimal) %>%
  pull(threshold)
optimal_threshold
#> [1] 0.6277816

# Add a Y/N column to predictions based on whether the predicted probability
# is greater than the threshold
class_predictions <-
  predict(models) %>%
  mutate(predicted_class_diabetes = case_when(
    predicted_diabetes > optimal_threshold ~ "Y",
    predicted_diabetes <= optimal_threshold ~ "N"
  ))

class_predictions %>%
  select_at(vars(ends_with("diabetes"))) %>%
  arrange(predicted_diabetes)
#> # A tibble: 15 x 3
#>    diabetes predicted_diabetes predicted_class_diabetes
#>  * <fct>                 <dbl> <chr>
#>  1 Y                     0.383 N
#>  2 N                     0.414 N
#>  3 N                     0.456 N
#>  4 N                     0.457 N
#>  5 N                     0.560 N
#>  6 N                     0.563 N
#>  7 Y                     0.628 N
#>  8 Y                     0.634 Y
#>  9 Y                     0.666 Y
#> 10 Y                     0.676 Y
#> 11 Y                     0.761 Y
#> 12 Y                     0.761 Y
#> 13 Y                     0.813 Y
#> 14 Y                     0.839 Y
#> 15 N                     0.839 Y

# Examine the expected counts of true and false positives and negatives:
table(Actual = class_predictions$diabetes,
      Predicted = class_predictions$predicted_class_diabetes)
#>       Predicted
#> Actual N Y
#>      Y 2 7
#>      N 5 1