Tune multiple machine learning models using cross-validation to optimize performance

tune_models(d, outcome, models, metric, positive_class, n_folds = 5,
  tune_depth = 10, hyperparameters = NULL, model_class, model_name = NULL,
  allow_parallel = FALSE)



d: A data frame.


outcome: Name of the column to predict.


models: Names of models to try. See get_supported_models for available models. Default is all available models.


metric: What metric to use to assess model performance? Options for regression: "RMSE" (root-mean-squared error, default), "MAE" (mean-absolute error), or "Rsquared". For classification: "ROC" (area under the receiver operating characteristic curve) or "PR" (area under the precision-recall curve).


positive_class: For classification only, which outcome level is the "yes" case, i.e. should be associated with high probabilities? Defaults to "Y" or "yes" if present; otherwise the first level of the outcome variable (first alphabetically if the training data outcome was not already a factor).


n_folds: How many folds to use in cross-validation? Default = 5.


tune_depth: How many hyperparameter combinations to try? Default = 10. Value is multiplied by 5 for regularized regression. Increasing this value when tuning XGBoost models may be particularly useful for performance.


hyperparameters: Optional, a list of data frames containing hyperparameter values to tune over. If NULL (default), a random, tune_depth-deep search of the hyperparameter space will be performed. If provided, this overrides tune_depth. Should be a named list of data frames where the names of the list correspond to models (e.g. "rf") and each column in the data frame contains hyperparameter values. See hyperparameters for a template. If only one model is specified to the models argument, the data frame can be provided bare to this argument.


model_class: "regression" or "classification". If not provided, this will be determined by the class of `outcome`, with the determination displayed in a message.


model_name: Quoted name of the model. Defaults to the name of the outcome variable.


allow_parallel: Logical, defaults to FALSE. If TRUE and a parallel backend is set up (e.g. with doMC), models with support for parallel training will be trained across cores.
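
The default for positive_class described above can be illustrated with a small base R sketch. The helper name default_positive_class is hypothetical and this is not the package's own implementation; it only mirrors the rule as stated: prefer "Y" or "yes" when one of them is an outcome level, otherwise fall back to the first factor level. The order of preference between "Y" and "yes" when both are present is an assumption.

```r
# Hypothetical helper mirroring the documented default for positive_class.
# Not the package's code; the "Y"-before-"yes" preference is an assumption.
default_positive_class <- function(outcome) {
  # Non-factor input is converted, so its levels are sorted alphabetically
  lvls <- levels(as.factor(outcome))
  preferred <- intersect(c("Y", "yes"), lvls)
  if (length(preferred) > 0) preferred[1] else lvls[1]
}

# "Y" is present, so it is chosen
default_positive_class(c("N", "Y", "N"))

# Character outcome without "Y"/"yes": first level alphabetically
default_positive_class(c("sick", "healthy"))

# Existing factor: its first declared level is used
default_positive_class(factor(c("sick", "healthy"), levels = c("sick", "healthy")))
```

Passing positive_class explicitly avoids relying on this fallback when the outcome levels are not "Y"/"yes".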


Returns a model_list object. You can call plot, summary, evaluate, or predict on a model_list.


Note that this function is training a lot of models (100 by default) and so can take a while to execute. In general a model is trained for each hyperparameter combination in each fold for each model, so run time is a function of length(models) x n_folds x tune_depth. At the default settings, a 1000 row, 10 column data frame should complete in about 30 seconds on a good laptop.
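
As a rough illustration of the run-time formula above, the following base R sketch counts model fits as length(models) x n_folds x tune_depth. The helper n_model_fits and the placeholder model names are hypothetical, and the sketch ignores the 5x multiplier that tune_depth receives for regularized regression.

```r
# Hypothetical helper: approximate number of model fits during tuning,
# using the formula length(models) x n_folds x tune_depth.
# It ignores the 5x tune_depth multiplier for regularized regression.
n_model_fits <- function(models, n_folds = 5, tune_depth = 10) {
  length(models) * n_folds * tune_depth
}

# Two placeholder model names, chosen only for illustration
fits_default <- n_model_fits(c("model_a", "model_b"))
print(fits_default)  # 100

# Doubling tune_depth doubles the number of fits
fits_deeper <- n_model_fits(c("model_a", "model_b"), tune_depth = 20)
print(fits_deeper)   # 200
```

Because the count is a product, reducing any one of the three factors shortens run time proportionally.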

See also

For setting up model training: prep_data, supported_models, hyperparameters

For evaluating models: plot.model_list, evaluate.model_list

For making predictions: predict.model_list

For faster, but not-optimized model training: flash_models

To prepare data and tune models in a single step: machine_learn


### Examples take about 30 seconds to run
library(healthcareai)

# Prepare data for tuning
d <- prep_data(pima_diabetes, patient_id, outcome = diabetes)

# Tune random forest, xgboost, and regularized regression classification models
m <- tune_models(d, outcome = diabetes)

# Get some info about the tuned models
m

# Get more detailed info
summary(m)

# Plot performance over hyperparameter values for each algorithm
plot(m)

# To specify hyperparameter values to tune over, pass a data frame
# of hyperparameter values to the hyperparameters argument:
rf_hyperparameters <-
  expand.grid(
    mtry = 1:5,
    splitrule = c("gini", "extratrees"),
    min.node.size = 1
  )
grid_search_models <-
  tune_models(d = d,
              outcome = diabetes,
              models = "rf",
              hyperparameters = list(rf = rf_hyperparameters)
  )