machine_learn: Prepare data and train machine learning models.

Usage:

  machine_learn(
    d,
    ...,
    outcome,
    models,
    metric,
    positive_class,
    tune = TRUE,
    n_folds = 5,
    tune_depth = 10,
    impute = TRUE,
    model_name = NULL,
    allow_parallel = FALSE
  )



Arguments:

d: A data frame.

...: Columns to be ignored in model training, e.g. ID columns. Unquoted.

outcome: Name of the target column, i.e. what you want to predict. Unquoted. Must be named, i.e. you must specify outcome = .

models: Names of models to try. See get_supported_models for available models. Default is all available models.

metric: Which metric should be used to assess model performance? Options for classification: "ROC" (area under the receiver operating characteristic curve, default) or "PR" (area under the precision-recall curve). Options for regression: "RMSE" (root-mean-squared error, default), "MAE" (mean absolute error), or "Rsquared". Options for multiclass: "Accuracy" (default) or "Kappa" (accuracy adjusted for class imbalance).

tune: If TRUE (default), models will be tuned via tune_models. If FALSE, models will be trained via flash_models, which is substantially faster but produces less predictively powerful models.

positive_class: For classification only, which outcome level is the "yes" case, i.e. should be associated with high probabilities? Defaults to "Y" or "yes" if present; otherwise it is the first level of the outcome variable (first alphabetically if the training data outcome was not already a factor).

n_folds: How many folds to use to assess out-of-fold accuracy? Default = 5. Models are evaluated on out-of-fold predictions whether tune is TRUE or FALSE.

tune_depth: How many hyperparameter combinations to try? Default = 10. The value is multiplied by 5 for regularized regression. Ignored if tune is FALSE.

impute: If TRUE (default), missing values will be filled by hcai_impute.

model_name: Quoted name of the model. Defaults to the name of the outcome variable.

allow_parallel: Deprecated. Instead, control the number of cores through your parallel back end (e.g. with doMC).


Value:

A model_list object. You can call plot, summary, evaluate, or predict on a model_list.


Details:

This is a high-level wrapper function. For finer control of data cleaning and preparation, use prep_data or the functions it wraps. For finer control of model tuning, use tune_models.
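As a sketch, the finer-grained two-step equivalent of this wrapper might look like the following. This assumes the healthcareai package is attached; prep_data and tune_models are the functions named above, while the pima_diabetes dataset and the specific argument choices shown are illustrative, not prescriptive:

```r
library(healthcareai)

# Step 1: clean and prepare the data yourself for finer control over
# imputation, dummy coding, etc. (see prep_data and the functions it wraps)
prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes)

# Step 2: tune models on the prepared data for finer control over tuning
models <- tune_models(prepped, outcome = diabetes)
```

Running machine_learn with its defaults is roughly equivalent to these two calls chained together.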


Examples:

# These examples take about 30 seconds to execute so aren't run automatically,
# but you should be able to execute this code locally.
if (FALSE) {
# Split the data into training and test sets
d <- split_train_test(d = pima_diabetes, outcome = diabetes, percent_train = .9)

### Classification ###

# Clean and prep the training data, specifying that patient_id is an ID column,
# and tune algorithms over hyperparameter values to predict diabetes
diabetes_models <- machine_learn(d$train, patient_id, outcome = diabetes)

# Inspect model specification and performance
diabetes_models

# Make predictions (predicted probability of diabetes) on test data
predict(diabetes_models, d$test)

### Regression ###

# If the outcome variable is numeric, regression models will be trained
age_model <- machine_learn(d$train, patient_id, outcome = age)

# Get detailed information about performance over tuning values
summary(age_model)

# Get available performance metrics
evaluate(age_model)

# Plot training performance on tuning metric (default = RMSE)
plot(age_model)

# If new data isn't specified, get predictions on training data
predict(age_model)

### Faster model training without tuning hyperparameters ###

# Train models at set hyperparameter values by setting tune to FALSE. This is
# faster (especially on larger datasets), but produces models with less
# predictive power.
machine_learn(d$train, patient_id, outcome = diabetes, tune = FALSE)

### Train models optimizing given metric ###

machine_learn(d$train, patient_id, outcome = diabetes, metric = "PR")
}