prep_data will prepare your data for machine learning. Some steps enhance predictive power, some make sure that the data format is compatible with a wide array of machine learning algorithms, and others provide protection against common problems in model deployment. The following steps are available; those followed by * are applied by default. Many have customization options.

  1. Convert columns with only 0/1 to factor*

  2. Remove columns with near-zero variance*

  3. Convert date columns to useful features*

  4. Fill in missing values via imputation*

  5. Collapse rare categories into "other"*

  6. Center numeric columns

  7. Standardize numeric columns

  8. Create dummy variables from categorical variables*

  9. Add protective levels to factors for rare and missing data*

  10. Convert columns to principal components using PCA

While preparing your data, a recipe will be generated for identical transformation of future data and stored in the `recipe` attribute of the output data frame. If a recipe object is passed to `prep_data` via the `recipe` argument, that recipe will be applied to the data. This allows you to transform data in model training and apply exactly the same transformations in model testing and deployment. The new data must be identical in structure to the data that the recipe was prepared with.
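A condensed sketch of that train-then-deploy workflow, using the bundled pima_diabetes data (the Examples section below shows it in full):

train <- pima_diabetes[1:700, ]
deployment <- pima_diabetes[701:768, ]
prepped_train <- prep_data(train, patient_id, outcome = diabetes)    # learns a recipe
prepped_deployment <- prep_data(deployment, recipe = prepped_train)  # reapplies it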

prep_data(
  d,
  ...,
  outcome,
  recipe = NULL,
  remove_near_zero_variance = TRUE,
  convert_dates = TRUE,
  impute = TRUE,
  collapse_rare_factors = TRUE,
  PCA = FALSE,
  center = FALSE,
  scale = FALSE,
  make_dummies = TRUE,
  add_levels = TRUE,
  logical_to_numeric = TRUE,
  factor_outcome = TRUE,
  no_prep = FALSE
)

Arguments

d

A data frame

...

Optional. Columns to be ignored in preparation and model training, e.g. ID columns. Unquoted; any number of columns can be included here.

outcome

Optional. Unquoted column name that indicates the target variable. If provided, argument must be named. If this target is 0/1, it will be coerced to Y/N if factor_outcome is TRUE; other manipulation steps will not be applied to the outcome.

recipe

Optional. Recipe for how to prep d. In model deployment, pass the output from this function in training to this argument in deployment to prepare the deployment data identically to how the training data was prepared. If training data is big, pull the recipe from the "recipe" attribute of the prepped training data frame and pass that to this argument. If present, all following arguments will be ignored.
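A minimal sketch of pulling the recipe out for deployment when the prepped training frame is large; object names here are illustrative:

prepped_train <- prep_data(pima_diabetes[1:700, ], patient_id, outcome = diabetes)
trained_recipe <- attr(prepped_train, "recipe")
# In deployment, pass only the stored recipe
prepped_new <- prep_data(pima_diabetes[701:768, ], recipe = trained_recipe)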

remove_near_zero_variance

Logical or numeric. If TRUE (default), columns with near-zero variance will be removed. These columns either contain a single value or have a most common value that is far more frequent than the second most common value. Example: in a column with 120 "Male" and 2 "Female", the frequency ratio is 0.0167; it would be excluded by default, or whenever `remove_near_zero_variance` > 0.0166. Larger values remove more columns, and the value must lie between 0 and 1.
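For reference, the frequency ratio in the example above is 2 / 120, and a numeric threshold can be passed directly (0.02, as in the Examples below); roughly, columns whose frequency ratio does not exceed the threshold are removed:

2 / 120
#> [1] 0.01666667
prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes,
                     remove_near_zero_variance = 0.02)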

convert_dates

Logical or character. If TRUE (default), date and time columns are transformed into circular (sine/cosine) representations of hour, day, month, and year, which suit most machine learning algorithms. If FALSE, date and time columns are removed. If character, use "continuous" (same as TRUE), "categories", or "none" (same as FALSE). "categories" produces readable hour, day, month, and year features for interpretation. If make_dummies is TRUE, each unique value of these features becomes a new dummy variable; this creates wide data, which is more challenging for some machine learning models. All columns with the DTS suffix are treated as dates.
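A short sketch of the two main options, using an illustrative admitted_DTS column (the DTS suffix marks it as a date/time feature); see also the date examples below:

d <- cbind(pima_diabetes,
           admitted_DTS = seq(as.POSIXct("2005-01-01"), by = "hour",
                              length.out = nrow(pima_diabetes)))
circular <- prep_data(d, patient_id, outcome = diabetes)  # default: sin/cos features
readable <- prep_data(d, patient_id, outcome = diabetes,
                      convert_dates = "categories")       # readable but wider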

impute

Logical or list. If TRUE (default), columns will be imputed using mean (numeric), and new category (nominal). If FALSE, data will not be imputed. If this is a list, it must be named, with possible entries for `numeric_method`, `nominal_method`, `numeric_params`, `nominal_params`, which are passed to hcai_impute.
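A sketch of a custom imputation list; "bagimpute" and "new_category" are the methods referenced elsewhere on this page, and the optional numeric_params / nominal_params entries (not shown) are passed through to hcai_impute:

prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes,
                     impute = list(numeric_method = "bagimpute",
                                   nominal_method = "new_category"))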

collapse_rare_factors

Logical or numeric. If TRUE (default), factor levels representing less than 3 percent of observations will be collapsed into a new category, `other`. If numeric, must be between 0 and 1, and is the proportion of observations below which levels will be grouped into `other`. See `recipes::step_other`.
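For example, to collapse levels seen in fewer than 5 percent of observations instead of the default 3 percent:

prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes,
                     collapse_rare_factors = 0.05)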

PCA

Integer or logical. PCA reduces training time, particularly for wide datasets, though it renders models less interpretable. If an integer, it is the number of principal components the numeric data will be converted into. If TRUE, numeric data will be converted into 5 principal components. PCA requires centered and scaled data, so it sets `center` and `scale` to TRUE. Default is FALSE.
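For example, to keep three principal components instead of the five used when PCA = TRUE (centering and scaling are switched on automatically):

prepped <- prep_data(pima_diabetes, patient_id, outcome = diabetes, PCA = 3)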

center

Logical. If TRUE, numeric columns will be centered to have a mean of 0. Default is FALSE, unless PCA is performed, in which case it is TRUE.

scale

Logical. If TRUE, numeric columns will be scaled to have a standard deviation of 1. Default is FALSE, unless PCA is performed, in which case it is TRUE.

make_dummies

Logical or list. If TRUE (default), dummy columns will be created for categorical variables. When dummy columns are created, columns are not created for reference levels. By default, the levels are reassigned so the mode value is the reference level. If a named list is provided, those values will replace the reference levels. See the example for details.

add_levels

Logical. If TRUE (default), "other" and "missing" will be added to all nominal columns. This is protective in deployment: new levels found in deployment will become "other" and missingness in deployment can become "missing" if the nominal imputation method is "new_category". If FALSE, these "other" will be added to all nominal variables if collapse_rare_factors is used, and "missingness" may be added depending on details of imputation.

logical_to_numeric

Logical. If TRUE (default), logical variables will be converted to 0/1 integer variables.

factor_outcome

Logical. If TRUE (default) and if all entries in outcome are 0 or 1 they will be converted to factor with levels N and Y for classification. Note that which level is the positive class is set in training functions rather than here.

no_prep

Logical. If TRUE, overrides all other arguments to FALSE so that d is returned unmodified, except that character variables may be converted to factors and a tibble will be returned even if the input was a non-tibble data frame.
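A sketch of bypassing preparation entirely:

untouched <- prep_data(pima_diabetes, no_prep = TRUE)
# returned essentially as-is (as a tibble; characters may become factors)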

Value

Prepared data frame with a reusable recipe object for future data preparation stored in attribute "recipe". The recipe attribute itself contains the names of ignored columns (those passed to ...) in its attribute "ignored_columns".
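A sketch of inspecting what was stored, assuming patient_id was passed to ...:

pd <- prep_data(pima_diabetes, patient_id, outcome = diabetes)
rec <- attr(pd, "recipe")        # reusable recipe object
attr(rec, "ignored_columns")     # names of ignored columns, here "patient_id"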

See also

To let data preparation happen automatically under the hood, see machine_learn

To take finer control of imputation, see impute, and for finer control of data prep in general check out the recipes package: https://topepo.github.io/recipes/

To train models on prepared data, see tune_models and flash_models

Examples

d_train <- pima_diabetes[1:700, ]
d_test <- pima_diabetes[701:768, ]

# Prep data. Ignore patient_id (identifier) and treat diabetes as outcome
d_train_prepped <- prep_data(d = d_train, patient_id, outcome = diabetes)
#> Training new data prep recipe...

# Prep test data by reapplying the same transformations as to training data
d_test_prepped <- prep_data(d_test, recipe = d_train_prepped)
#> Prepping data based on provided recipe

# View the transformations applied and the prepped data
d_test_prepped
#> healthcareai-prepped data. Recipe used to prepare data:
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          8
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Mean Imputation for pregnancies, plasma_glucose, ... [trained]
#> Filling NA with missing for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Collapsing factor levels for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Dummy variables from weight_class [trained]
#> Current data:
#> # A tibble: 68 x 13
#>    pregnancies plasma_glucose diastolic_bp skinfold insulin pedigree   age
#>          <int>          <int>        <int>    <int>   <int>    <dbl> <int>
#>  1           2            122           76       27     200    0.483    26
#>  2           6            125           78       31     154    0.565    49
#>  3           1            168           88       29     154    0.905    52
#>  4           2            129           72       29     154    0.304    41
#>  5           4            110           76       20     100    0.118    27
#>  6           6             80           80       36     154    0.177    28
#>  7          10            115           72       29     154    0.261    30
#>  8           2            127           46       21     335    0.176    22
#>  9           9            164           78       29     154    0.148    45
#> 10           2             93           64       32     160    0.674    23
#> # … with 58 more rows, and 6 more variables: diabetes <fct>,
#> #   weight_class_morbidly.obese <dbl>, weight_class_normal <dbl>,
#> #   weight_class_overweight <dbl>, weight_class_other <dbl>,
#> #   weight_class_missing <dbl>
# Customize preparations:
prep_data(d = d_train, patient_id, outcome = diabetes,
          impute = list(numeric_method = "bagimpute",
                        nominal_method = "bagimpute"),
          collapse_rare_factors = FALSE, center = TRUE, scale = TRUE,
          make_dummies = FALSE, remove_near_zero_variance = .02)
#> Training new data prep recipe...
#> healthcareai-prepped data. Recipe used to prepare data:
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          8
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Bagged tree imputation for pregnancies, plasma_glucose, ... [trained]
#> Bagged tree imputation for weight_class [trained]
#> Centering for pregnancies, plasma_glucose, ... [trained]
#> Scaling for pregnancies, plasma_glucose, ... [trained]
#> Adding levels to: other, missing [trained]
#> Adding levels to: other, missing [trained]
#> Current data:
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <dbl>          <dbl>        <dbl>    <dbl>   <dbl>
#>  1          1       0.646          0.872      -0.0220  0.633    0.570
#>  2          2      -0.840         -1.19       -0.515   0.00774 -0.851
#>  3          3       1.24           2.02       -0.679  -0.826    0.550
#>  4          4      -0.840         -1.06       -0.515  -0.617   -0.599
#>  5          5      -1.14           0.512      -2.65    0.633    0.147
#>  6          6       0.349         -0.175       0.142  -0.305   -0.257
#>  7          7      -0.246         -1.42       -1.83    0.320   -0.660
#>  8          8       1.83          -0.208       0.142   0.424   -0.257
#>  9          9      -0.543          2.47       -0.186   1.67     3.93
#> 10         10       1.24           0.119       1.95    0.320    0.570
#> # … with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <dbl>, diabetes <fct>
# Picking reference levels:
# Dummy variables are not created for reference levels. Mode levels are
# chosen as reference levels by default. The list given to `make_dummies`
# sets the reference level for `weight_class` to "normal". All other values
# in `weight_class` will create a new dummy column that is relative to normal.
prep_data(d = d_train, patient_id, outcome = diabetes,
          make_dummies = list(weight_class = "normal"))
#> Training new data prep recipe...
#> healthcareai-prepped data. Recipe used to prepare data:
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>    outcome          1
#>  predictor          8
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Mean Imputation for pregnancies, plasma_glucose, ... [trained]
#> Filling NA with missing for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Collapsing factor levels for weight_class [trained]
#> Adding levels to: other, missing [trained]
#> Dummy variables from weight_class [trained]
#> Current data:
#> # A tibble: 700 x 14
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin pedigree
#>         <int>       <int>          <int>        <int>    <int>   <int>    <dbl>
#>  1          1           6            148           72       35     154    0.627
#>  2          2           1             85           66       29     154    0.351
#>  3          3           8            183           64       29     154    0.672
#>  4          4           1             89           66       23      94    0.167
#>  5          5           0            137           40       35     168    2.29
#>  6          6           5            116           74       29     154    0.201
#>  7          7           3             78           50       32      88    0.248
#>  8          8          10            115           72       29     154    0.134
#>  9          9           2            197           70       45     543    0.158
#> 10         10           8            125           96       29     154    0.232
#> # … with 690 more rows, and 7 more variables: age <int>, diabetes <fct>,
#> #   weight_class_morbidly.obese <dbl>, weight_class_obese <dbl>,
#> #   weight_class_overweight <dbl>, weight_class_other <dbl>,
#> #   weight_class_missing <dbl>
# `prep_data` also handles date and time features by default:
d <- pima_diabetes %>%
  cbind(
    admitted_DTS = seq(as.POSIXct("2005-1-1 0:00"),
                       length.out = nrow(pima_diabetes), by = "hour")
  )
d_train = d[1:700, ]
prep_data(d = d_train)
#> Training new data prep recipe with no outcome variable specified...
#> healthcareai-prepped data. Recipe used to prepare data:
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>  predictor         11
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Date features from admitted_DTS [trained]
#> Variables removed admitted_DTS [trained]
#> Mean Imputation for patient_id, pregnancies, ... [trained]
#> Filling NA with missing for weight_class, diabetes [trained]
#> Adding levels to: other, missing [trained]
#> Collapsing factor levels for weight_class, diabetes [trained]
#> Adding levels to: other, missing [trained]
#> Dummy variables from weight_class and diabetes [trained]
#> Current data:
#> # A tibble: 700 x 23
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin pedigree
#>         <int>       <int>          <int>        <int>    <int>   <int>    <dbl>
#>  1          1           6            148           72       35     154    0.627
#>  2          2           1             85           66       29     154    0.351
#>  3          3           8            183           64       29     154    0.672
#>  4          4           1             89           66       23      94    0.167
#>  5          5           0            137           40       35     168    2.29
#>  6          6           5            116           74       29     154    0.201
#>  7          7           3             78           50       32      88    0.248
#>  8          8          10            115           72       29     154    0.134
#>  9          9           2            197           70       45     543    0.158
#> 10         10           8            125           96       29     154    0.232
#> # … with 690 more rows, and 16 more variables: age <int>,
#> #   admitted_DTS_dow_sin <dbl>, admitted_DTS_dow_cos <dbl>,
#> #   admitted_DTS_month_sin <dbl>, admitted_DTS_month_cos <dbl>,
#> #   admitted_DTS_year <dbl>, admitted_DTS_hour_sin <dbl>,
#> #   admitted_DTS_hour_cos <dbl>, weight_class_morbidly.obese <dbl>,
#> #   weight_class_normal <dbl>, weight_class_overweight <dbl>,
#> #   weight_class_other <dbl>, weight_class_missing <dbl>, diabetes_Y <dbl>,
#> #   diabetes_other <dbl>, diabetes_missing <dbl>
# Customize how date and time features are handled:
# When `convert_dates` is set to "categories", the prepped data will be more
# readable, but will be wider.
prep_data(d = d_train, convert_dates = "categories")
#> Training new data prep recipe with no outcome variable specified...
#> healthcareai-prepped data. Recipe used to prepare data:
#> Data Recipe
#>
#> Inputs:
#>
#>       role #variables
#>  predictor         11
#>
#> Training data contained 700 data points and 340 incomplete rows.
#>
#> Operations:
#>
#> Sparse, unbalanced variable filter removed no terms [trained]
#> Date features from admitted_DTS [trained]
#> Variables removed admitted_DTS [trained]
#> Mean Imputation for patient_id, pregnancies, ... [trained]
#> Filling NA with missing for weight_class, diabetes, ... [trained]
#> Adding levels to: other, missing [trained]
#> Collapsing factor levels for weight_class, diabetes, ... [trained]
#> Adding levels to: other, missing [trained]
#> Dummy variables from weight_class, diabetes, admitted_DTS_dow, and admitted_DTS_month [trained]
#> Current data:
#> # A tibble: 700 x 28
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin pedigree
#>         <int>       <int>          <int>        <int>    <int>   <int>    <dbl>
#>  1          1           6            148           72       35     154    0.627
#>  2          2           1             85           66       29     154    0.351
#>  3          3           8            183           64       29     154    0.672
#>  4          4           1             89           66       23      94    0.167
#>  5          5           0            137           40       35     168    2.29
#>  6          6           5            116           74       29     154    0.201
#>  7          7           3             78           50       32      88    0.248
#>  8          8          10            115           72       29     154    0.134
#>  9          9           2            197           70       45     543    0.158
#> 10         10           8            125           96       29     154    0.232
#> # … with 690 more rows, and 21 more variables: age <int>,
#> #   admitted_DTS_year <dbl>, admitted_DTS_hour <int>,
#> #   weight_class_morbidly.obese <dbl>, weight_class_normal <dbl>,
#> #   weight_class_overweight <dbl>, weight_class_other <dbl>,
#> #   weight_class_missing <dbl>, diabetes_Y <dbl>, diabetes_other <dbl>,
#> #   diabetes_missing <dbl>, admitted_DTS_dow_Sun <dbl>,
#> #   admitted_DTS_dow_Mon <dbl>, admitted_DTS_dow_Tue <dbl>,
#> #   admitted_DTS_dow_Wed <dbl>, admitted_DTS_dow_Thu <dbl>,
#> #   admitted_DTS_dow_Fri <dbl>, admitted_DTS_dow_other <dbl>,
#> #   admitted_DTS_dow_missing <dbl>, admitted_DTS_month_other <dbl>,
#> #   admitted_DTS_month_missing <dbl>
# PCA to reduce training time:
if (FALSE) {
  start_time <- Sys.time()
  pd <- prep_data(pima_diabetes, patient_id, outcome = diabetes, PCA = FALSE)
  ncol(pd)
  m <- machine_learn(pd, patient_id, outcome = diabetes)
  end_time <- Sys.time()
  end_time - start_time

  start_time <- Sys.time()
  pcapd <- prep_data(pima_diabetes, patient_id, outcome = diabetes, PCA = TRUE)
  ncol(pcapd)
  m <- machine_learn(pcapd, patient_id, outcome = diabetes)
  Sys.time() - start_time
}