impute fills in missing values in a data frame, handling both nominal and numeric variables. Supported methods are mean (numeric only), new_category (nominal only), bagged trees, and k-nearest neighbors (knn).

impute(
  d = NULL,
  ...,
  recipe = NULL,
  numeric_method = "mean",
  nominal_method = "new_category",
  numeric_params = NULL,
  nominal_params = NULL,
  verbose = FALSE
)

Arguments

d

A dataframe or tibble containing data to impute.

...

Optional. Unquoted names of variables that should not be imputed. These are returned unaltered.

recipe

Optional. A recipe object, or a previously imputed data frame that carries a recipe object as an attribute. If provided, the recipe is applied to the new data in d, imputing with the values saved in the recipe. Use this argument to apply the imputation values learned on a training dataset to data in production.

numeric_method

Defaults to "mean". Other choices are "bagimpute" or "knnimpute".

nominal_method

Defaults to "new_category". Other choices are "bagimpute" or "knnimpute".

numeric_params

A named list of parameters to use with the chosen imputation method on numeric data. Options are bag_model (bagimpute only), bag_trees (bagimpute only), bag_options (bagimpute only), knn_K (knnimpute only), impute_with (knnimpute only), or seed_val (bagimpute or knnimpute). See step_bagimpute or step_knnimpute for details.

nominal_params

A named list of parameters to use with the chosen imputation method on nominal data. Options are bag_model (bagimpute only), bag_trees (bagimpute only), bag_options (bagimpute only), knn_K (knnimpute only), impute_with (knnimpute only), or seed_val (bagimpute or knnimpute). See step_bagimpute or step_knnimpute for details.

verbose

If TRUE, prints which variables will be imputed and which method will be used for each. Defaults to FALSE.
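
To make the role of the recipe argument concrete, the following base R sketch illustrates the general idea of learning imputation values on training data and reusing them on new data. It is not how impute is implemented; the data frames train and new_data, the list learned, and the category label "missing" are all made up for illustration.

```r
# Toy data (hypothetical): one numeric and one nominal column with gaps
train <- data.frame(x = c(1, NA, 3), g = c("a", NA, "b"),
                    stringsAsFactors = FALSE)
new_data <- data.frame(x = c(NA, 10), g = c(NA, "a"),
                       stringsAsFactors = FALSE)

# Learn the fill-in values from the training data only
learned <- list(
  x_mean  = mean(train$x, na.rm = TRUE),
  g_level = "missing"  # hypothetical label standing in for a new category
)

# Reuse the learned values on new data instead of recomputing them there
new_data$x[is.na(new_data$x)] <- learned$x_mean
new_data$g[is.na(new_data$g)] <- learned$g_level

new_data
```

The point of the design is that new data is filled with values derived from the training set, so the imputation used in production matches the one used when the model was trained.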

Value

The imputed data frame, with a reusable recipe object stored in the attribute "recipe" for future imputation.

Examples

d <- pima_diabetes
d_train <- d[1:700, ]
d_test <- d[701:768, ]

# Train imputer
train_imputed <- impute(d = d_train, patient_id, diabetes)

# Apply to new data
impute(d = d_test, patient_id, diabetes, recipe = train_imputed)
#> Original missingness and methods used in imputation:
#>
#> # A tibble: 4 x 3
#>   variable     percent_missing imputation_method_used
#>   <chr>                  <dbl> <chr>
#> 1 weight_class            1.47 new_category
#> 2 diastolic_bp            2.94 mean
#> 3 skinfold               26.5  mean
#> 4 insulin                52.9  mean
#>
#> Current data:
#>
#> # A tibble: 68 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <int>        <int>    <int>   <int>
#>  1        701           2            122           76       27     200
#>  2        702           6            125           78       31     154
#>  3        703           1            168           88       29     154
#>  4        704           2            129           72       29     154
#>  5        705           4            110           76       20     100
#>  6        706           6             80           80       36     154
#>  7        707          10            115           72       29     154
#>  8        708           2            127           46       21     335
#>  9        709           9            164           78       29     154
#> 10        710           2             93           64       32     160
#> # … with 58 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>

# Specify methods:
impute(d = d_train, patient_id, diabetes,
       numeric_method = "bagimpute", nominal_method = "new_category")
#> Original missingness and methods used in imputation:
#>
#> # A tibble: 5 x 3
#>   variable       percent_missing imputation_method_used
#>   <chr>                    <dbl> <chr>
#> 1 plasma_glucose           0.714 bagimpute
#> 2 weight_class             1.43  new_category
#> 3 diastolic_bp             4.71  bagimpute
#> 4 skinfold                29.9   bagimpute
#> 5 insulin                 48.3   bagimpute
#>
#> Current data:
#>
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <int>        <int>    <int>   <int>
#>  1          1           6            148           72       35     213
#>  2          2           1             85           66       29      70
#>  3          3           8            183           64       22     203
#>  4          4           1             89           66       23      94
#>  5          5           0            137           40       35     168
#>  6          6           5            116           74       24     131
#>  7          7           3             78           50       32      88
#>  8          8          10            115           71       33     136
#>  9          9           2            197           70       45     543
#> 10         10           8            125           96       32     251
#> # … with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>

# Specify method and param:
impute(d = d_train, patient_id, diabetes,
       nominal_method = "knnimpute", nominal_params = list(knn_K = 4))
#> Original missingness and methods used in imputation:
#>
#> # A tibble: 5 x 3
#>   variable       percent_missing imputation_method_used
#>   <chr>                    <dbl> <chr>
#> 1 plasma_glucose           0.714 mean
#> 2 weight_class             1.43  knnimpute
#> 3 diastolic_bp             4.71  mean
#> 4 skinfold                29.9   mean
#> 5 insulin                 48.3   mean
#>
#> Current data:
#>
#> # A tibble: 700 x 10
#>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
#>         <int>       <int>          <int>        <int>    <int>   <int>
#>  1          1           6            148           72       35     154
#>  2          2           1             85           66       29     154
#>  3          3           8            183           64       29     154
#>  4          4           1             89           66       23      94
#>  5          5           0            137           40       35     168
#>  6          6           5            116           74       29     154
#>  7          7           3             78           50       32      88
#>  8          8          10            115           72       29     154
#>  9          9           2            197           70       45     543
#> 10         10           8            125           96       29     154
#> # … with 690 more rows, and 4 more variables: weight_class <fct>,
#> #   pedigree <dbl>, age <int>, diabetes <chr>
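
The examples above set parameters only for nominal data. As a further sketch, numeric_params can be passed the same way, and the stored recipe can be read back from the result with base R's attr(). This assumes impute and pima_diabetes come from the healthcareai package, which this page does not state; the values knn_K = 5 and seed_val = 30 are arbitrary choices for illustration, and no output is shown because it has not been run here.

```r
# Assumption: healthcareai is the package exporting impute() and pima_diabetes
library(healthcareai)

d_train <- pima_diabetes[1:700, ]

# Impute numeric columns with k-nearest neighbors; nominal columns keep
# the default "new_category" method. Parameter values are illustrative.
knn_imputed <- impute(d = d_train, patient_id, diabetes,
                      numeric_method = "knnimpute",
                      numeric_params = list(knn_K = 5, seed_val = 30))

# The reusable recipe is stored in the "recipe" attribute of the result
rec <- attr(knn_imputed, "recipe")
```

The extracted rec, or knn_imputed itself, could then be supplied as the recipe argument when imputing new data.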