impute
Imputes missing values in your data using a variety of methods
for both nominal and numeric data. Currently supports mean (numeric only),
new_category (nominal only), bagged trees, and knn.
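To make the two default methods concrete, here is a minimal base R sketch of mean imputation on a numeric column and new-category imputation on a nominal column. It only illustrates the idea and is not the package's implementation; the toy data frame and the level name "missing" are assumptions made for this example.

```r
# Toy data with one numeric and one nominal column (hypothetical values)
toy <- data.frame(
  insulin = c(100, NA, 200, NA),
  weight_class = factor(c("normal", NA, "obese", "normal"))
)

# Mean imputation: replace each NA with the mean of the observed values
insulin_mean <- mean(toy$insulin, na.rm = TRUE)
toy$insulin[is.na(toy$insulin)] <- insulin_mean

# New-category imputation: treat missingness as its own level
# ("missing" is an assumed label, not necessarily the one impute() uses)
wc <- as.character(toy$weight_class)
wc[is.na(wc)] <- "missing"
toy$weight_class <- factor(wc)

print(toy)
```

Mean imputation keeps the column mean unchanged but shrinks its variance, while a new category lets a downstream model learn from the fact that a value was missing.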
    impute(
      d = NULL,
      ...,
      recipe = NULL,
      numeric_method = "mean",
      nominal_method = "new_category",
      numeric_params = NULL,
      nominal_params = NULL,
      verbose = FALSE
    )
Argument | Description |
---|---|
d | A data frame or tibble containing data to impute. |
... | Optional. Unquoted names of variables that should not be imputed. These will be returned unaltered. |
recipe | Optional. A recipe object or an imputed data frame (containing a recipe object as an attribute). If provided, this recipe will be applied to impute new data contained in d with values saved in the recipe. Use this argument to apply, in production, the same imputation values that were learned on a training dataset. |
numeric_method | Defaults to "mean". |
nominal_method | Defaults to "new_category". |
numeric_params | A named list with parameters to use with the chosen imputation method on numeric data. |
nominal_params | A named list with parameters to use with the chosen imputation method on nominal data, for example list(knn_K = 4) with "knnimpute". |
verbose | If TRUE, prints which variables will be imputed and which method will be used for each. |
Returns the imputed data frame, with a reusable recipe object for future imputation stored in the attribute "recipe".
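The pattern of learning imputation values on training data, storing them on the returned data frame, and reusing them on new data can be sketched in base R as follows. The helpers `fit_mean_imputer` and `apply_mean_imputer` are hypothetical names invented for this illustration and are not part of the package; impute() stores a recipe object, whereas this sketch stores a plain named vector of means under the same attribute name.

```r
# Hypothetical helper: fill NAs in each named column with a stored value
apply_mean_imputer <- function(d, means) {
  for (col in names(means)) {
    d[[col]][is.na(d[[col]])] <- means[[col]]
  }
  d
}

# Hypothetical helper: learn column means, impute, and attach the means
fit_mean_imputer <- function(d) {
  num_cols <- names(d)[vapply(d, is.numeric, logical(1))]
  means <- vapply(d[num_cols], mean, numeric(1), na.rm = TRUE)
  d <- apply_mean_imputer(d, means)
  attr(d, "recipe") <- means  # stash the learned values on the result
  d
}

train <- data.frame(x = c(1, NA, 3), y = c(10, 20, NA))
test <- data.frame(x = c(NA, 5), y = c(7, NA))

# Learn on training data, then reuse the stored values on new data
train_imputed <- fit_mean_imputer(train)
test_imputed <- apply_mean_imputer(test, attr(train_imputed, "recipe"))
print(test_imputed)
```

Because the test rows are filled with means computed from the training rows only, no information from the new data leaks into the imputation values.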
    d <- pima_diabetes
    d_train <- d[1:700, ]
    d_test <- d[701:768, ]

    # Train imputer
    train_imputed <- impute(d = d_train, patient_id, diabetes)

    # Apply to new data
    impute(d = d_test, patient_id, diabetes, recipe = train_imputed)
    #> Original missingness and methods used in imputation:
    #>
    #> # A tibble: 4 x 3
    #>   variable     percent_missing imputation_method_used
    #>   <chr>                  <dbl> <chr>
    #> 1 weight_class            1.47 new_category
    #> 2 diastolic_bp            2.94 mean
    #> 3 skinfold               26.5  mean
    #> 4 insulin                52.9  mean
    #>
    #> Current data:
    #>
    #> # A tibble: 68 x 10
    #>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
    #>         <int>       <int>          <int>        <int>    <int>   <int>
    #>  1        701           2            122           76       27     200
    #>  2        702           6            125           78       31     154
    #>  3        703           1            168           88       29     154
    #>  4        704           2            129           72       29     154
    #>  5        705           4            110           76       20     100
    #>  6        706           6             80           80       36     154
    #>  7        707          10            115           72       29     154
    #>  8        708           2            127           46       21     335
    #>  9        709           9            164           78       29     154
    #> 10        710           2             93           64       32     160
    #> # … with 58 more rows, and 4 more variables: weight_class <fct>,
    #> #   pedigree <dbl>, age <int>, diabetes <chr>

    # Specify methods:
    impute(d = d_train, patient_id, diabetes,
           numeric_method = "bagimpute", nominal_method = "new_category")
    #> Original missingness and methods used in imputation:
    #>
    #> # A tibble: 5 x 3
    #>   variable       percent_missing imputation_method_used
    #>   <chr>                    <dbl> <chr>
    #> 1 plasma_glucose           0.714 bagimpute
    #> 2 weight_class             1.43  new_category
    #> 3 diastolic_bp             4.71  bagimpute
    #> 4 skinfold                29.9   bagimpute
    #> 5 insulin                 48.3   bagimpute
    #>
    #> Current data:
    #>
    #> # A tibble: 700 x 10
    #>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
    #>         <int>       <int>          <int>        <int>    <int>   <int>
    #>  1          1           6            148           72       35     213
    #>  2          2           1             85           66       29      70
    #>  3          3           8            183           64       22     203
    #>  4          4           1             89           66       23      94
    #>  5          5           0            137           40       35     168
    #>  6          6           5            116           74       24     131
    #>  7          7           3             78           50       32      88
    #>  8          8          10            115           71       33     136
    #>  9          9           2            197           70       45     543
    #> 10         10           8            125           96       32     251
    #> # … with 690 more rows, and 4 more variables: weight_class <fct>,
    #> #   pedigree <dbl>, age <int>, diabetes <chr>

    # Specify method and param:
    impute(d = d_train, patient_id, diabetes,
           nominal_method = "knnimpute", nominal_params = list(knn_K = 4))
    #> Original missingness and methods used in imputation:
    #>
    #> # A tibble: 5 x 3
    #>   variable       percent_missing imputation_method_used
    #>   <chr>                    <dbl> <chr>
    #> 1 plasma_glucose           0.714 mean
    #> 2 weight_class             1.43  knnimpute
    #> 3 diastolic_bp             4.71  mean
    #> 4 skinfold                29.9   mean
    #> 5 insulin                 48.3   mean
    #>
    #> Current data:
    #>
    #> # A tibble: 700 x 10
    #>    patient_id pregnancies plasma_glucose diastolic_bp skinfold insulin
    #>         <int>       <int>          <int>        <int>    <int>   <int>
    #>  1          1           6            148           72       35     154
    #>  2          2           1             85           66       29     154
    #>  3          3           8            183           64       29     154
    #>  4          4           1             89           66       23      94
    #>  5          5           0            137           40       35     168
    #>  6          6           5            116           74       29     154
    #>  7          7           3             78           50       32      88
    #>  8          8          10            115           72       29     154
    #>  9          9           2            197           70       45     543
    #> 10         10           8            125           96       29     154
    #> # … with 690 more rows, and 4 more variables: weight_class <fct>,
    #> #   pedigree <dbl>, age <int>, diabetes <chr>