Prepare data and train machine learning models.

machine_learn(
  d,
  ...,
  outcome,
  models,
  metric,
  tune = TRUE,
  positive_class,
  n_folds = 5,
  tune_depth = 10,
  impute = TRUE,
  model_name = NULL,
  allow_parallel = FALSE
)

Arguments

d

A data frame

...

Columns to be ignored in model training, e.g. ID columns, unquoted.

outcome

Name of the target column, i.e. what you want to predict. Unquoted. Must be named, i.e. you must specify outcome =
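For example, using the column names from the Examples below, the outcome must be passed by name so it isn't mistaken for an ignored column:

machine_learn(d$train, patient_id, outcome = diabetes)  # correct: diabetes is the target
# machine_learn(d$train, patient_id, diabetes)          # wrong: diabetes would be ignored like patient_id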

models

Names of models to try. See get_supported_models for available models. Default is all available models.
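For example, to restrict training to a subset of algorithms (the model-name strings here are illustrative; call get_supported_models() for the actual names):

get_supported_models()  # list the available model names
machine_learn(d$train, patient_id, outcome = diabetes, models = c("rf", "xgb"))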

metric

Which metric should be used to assess model performance? Options for classification: "ROC" (default; area under the receiver operating characteristic curve) or "PR" (area under the precision-recall curve). Options for regression: "RMSE" (default; root-mean-squared error), "MAE" (mean-absolute error), or "Rsquared". Options for multiclass: "Accuracy" (default) or "Kappa" (accuracy, adjusted for class imbalance).
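For example, to optimize mean-absolute error instead of RMSE for a numeric outcome (using the age column from the Examples below):

machine_learn(d$train, patient_id, outcome = age, metric = "MAE")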

tune

If TRUE (default), models will be tuned via tune_models. If FALSE, models will be trained via flash_models, which is substantially faster but produces models with less predictive power.

positive_class

For classification only, which outcome level is the "yes" case, i.e. should be associated with high probabilities? Defaults to "Y" or "yes" if present, otherwise is the first level of the outcome variable (first alphabetically if the training data outcome was not already a factor).
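For example, for an outcome whose levels don't follow the "Y"/"yes" convention (the column and level names here are hypothetical):

# Treat "readmit" as the event of interest, i.e. the high-probability class
machine_learn(d$train, patient_id, outcome = readmitted, positive_class = "readmit")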

n_folds

How many folds to use to assess out-of-fold accuracy? Default = 5. Models are evaluated on out-of-fold predictions whether tune is TRUE or FALSE.

tune_depth

How many hyperparameter combinations to try? Default = 10. Value is multiplied by 5 for regularized regression. Ignored if tune is FALSE.
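For example, a more thorough (and slower) search, with more cross-validation folds and more hyperparameter combinations than the defaults:

machine_learn(d$train, patient_id, outcome = diabetes, n_folds = 10, tune_depth = 25)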

impute

Logical. If TRUE (default), missing values will be filled by hcai_impute.

model_name

Quoted, name of the model. Defaults to the name of the outcome variable.
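For example, to label the model something other than the outcome name (the name here is arbitrary):

machine_learn(d$train, patient_id, outcome = diabetes, model_name = "diabetes_risk_v1")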

allow_parallel

Deprecated. Instead, control the number of cores through your parallel back end (e.g. with doMC).

Value

A model_list object. You can call plot, summary, evaluate, or predict on a model_list.

Details

This is a high-level wrapper function. For finer control of data cleaning and preparation use prep_data or the functions it wraps. For finer control of model tuning use tune_models.
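A sketch of the equivalent two-step workflow (see prep_data and tune_models for their full argument lists; only the arguments shown here are taken from this page):

# Roughly what machine_learn does under the hood
prepped <- prep_data(d$train, patient_id, outcome = diabetes)
diabetes_models <- tune_models(prepped, outcome = diabetes)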

Examples

# These examples take about 30 seconds to execute so aren't run automatically,
# but you should be able to execute this code locally.
if (FALSE) {
# Split the data into training and test sets
d <- split_train_test(d = pima_diabetes,
                      outcome = diabetes,
                      percent_train = .9)

### Classification ###

# Clean and prep the training data, specifying that patient_id is an ID column,
# and tune algorithms over hyperparameter values to predict diabetes
diabetes_models <- machine_learn(d$train, patient_id, outcome = diabetes)

# Inspect model specification and performance
diabetes_models

# Make predictions (predicted probability of diabetes) on test data
predict(diabetes_models, d$test)

### Regression ###

# If the outcome variable is numeric, regression models will be trained
age_model <- machine_learn(d$train, patient_id, outcome = age)

# Get detailed information about performance over tuning values
summary(age_model)

# Get available performance metrics
evaluate(age_model)

# Plot training performance on tuning metric (default = RMSE)
plot(age_model)

# If new data isn't specified, get predictions on training data
predict(age_model)

### Faster model training without tuning hyperparameters ###

# Train models at set hyperparameter values by setting tune to FALSE. This is
# faster (especially on larger datasets), but produces models with less
# predictive power.
machine_learn(d$train, patient_id, outcome = diabetes, tune = FALSE)

### Train models optimizing given metric ###
machine_learn(d$train, patient_id, outcome = diabetes, metric = "PR")
}