If your healthcareai models are running out of memory or are very slow, you might be thinking that something is wrong with the package or that you need a bigger computer. While that can be frustrating, there are several tricks that you can use to speed things up and save memory. Read on to see some options for using healthcareai with a large data set.
If you have a dataset with more than 20k rows or 50 columns, you might run into performance issues. The size of your data and the type of model training you use influence how long it will take. Some easy steps to reduce training time:
- Categorical columns with many unique values become many dummy columns in prep_data. Use get_best_levels or prep_data's collapse_rare_factors to limit them.
- prep_data transforms date columns into many columns by default. Use convert_dates = FALSE to prevent that.
- Use flash_models instead of machine_learn to train fewer models during development work.
- Use machine_learn or tune_models with faster settings. Pick 1 model, and use a tune depth of 5.

The plot below provides a rough idea of how long you can expect model training to take for various sizes of datasets, models, and tuning settings. The largest dataset there, approximately 200k rows x 100 columns, requires about 30 minutes to train all three models if tuning isn't performed, and about six hours to tune all three models.
In general, glm tunes very efficiently; however, due to its linear constraints, it may not always provide a performant model. xgb is fast if tuning isn't performed, but it is a complex model for which tuning can be both important and computationally intensive. When you decide to use xgb, we recommend a final round of model tuning with tune_depth turned up at least several times higher than the default value of 10. Tuning time increases linearly with tune_depth, so you can expect turning it up from 10 to 30 to approximately triple model training time. Both xgb and rf can be quite expensive to tune; this is a result of healthcareai exploring some computationally expensive regions of their hyperparameter space. It may be more efficient to examine the results of an initial random search with plot(models), and then tune over the most promising region in hyperparameter space by passing a hyperparameters data frame to tune_models. You can see how the default hyperparameter search spaces are defined in healthcareai:::get_random_hyperparameters.

If you want to squeeze every ounce of performance from one of these models, we suggest iteratively zeroing in on the region of hyperparameter space that optimizes performance.
[Plot: model training time by dataset size, model, and tuning settings]
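For example, you could pass your own grid covering the most promising region. The sketch below is illustrative, not healthcareai's prescribed workflow: prepped_flights is a placeholder for data you have already run through prep_data, the rf column names (mtry, splitrule, min.node.size) are the hyperparameters healthcareai reports for random forests, and you should check ?tune_models for the exact format the hyperparameters argument expects (it may need to be a named list of data frames, one per model).

# Hand-picked random forest grid around a promising region of hyperparameter space
rf_grid <- expand.grid(
  mtry = c(3, 5, 7),            # candidate predictors considered at each split
  splitrule = "extratrees",     # split rule that looked best in the initial random search
  min.node.size = c(1, 5, 10),  # minimum observations in a terminal node
  stringsAsFactors = FALSE
)
m_rf_focused <- tune_models(prepped_flights, outcome = arr_delay, models = "rf",
                            hyperparameters = list(rf = rf_grid))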
If you are running out of memory, it’s probably because your data is too large for the operations that you’ve asked R to do.
In general, machine learning can ignore useless columns. But with large data sets, it’s better to use fewer columns that are more predictive to save computational cost. Remember, machine learning should be iterative. Add columns and see if they help, remove them and see if it hurts, try different transformations.
This document walks through some steps to help you build a better model in less time.
Start by loading up the flights dataset.
library(tidyverse) # > ── Attaching packages ───────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ── # > ✓ ggplot2 3.3.2 ✓ purrr 0.3.4 # > ✓ tibble 3.0.3 ✓ dplyr 1.0.1 # > ✓ tidyr 1.1.1 ✓ stringr 1.4.0 # > ✓ readr 1.3.1 ✓ forcats 0.5.0 # > ── Conflicts ──────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ── # > x dplyr::filter() masks stats::filter() # > x dplyr::lag() masks stats::lag() library(nycflights13) library(healthcareai) # > healthcareai version 2.5.0 # > Please visit https://docs.healthcare.ai for full documentation and vignettes. Join the community at https://healthcare-ai.slack.com d <- nycflights13::flights d # > # A tibble: 336,776 x 19 # > year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time # > <int> <int> <int> <int> <int> <dbl> <int> <int> # > 1 2013 1 1 517 515 2 830 819 # > 2 2013 1 1 533 529 4 850 830 # > 3 2013 1 1 542 540 2 923 850 # > 4 2013 1 1 544 545 -1 1004 1022 # > 5 2013 1 1 554 600 -6 812 837 # > # … with 336,771 more rows, and 11 more variables: arr_delay <dbl>, # > # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>, # > # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
You can get the size of a dataset with dim, nrow, ncol, and object.size. 40.7 MB really doesn't seem like that much. Your computer probably has at least 8 GB of RAM. But you are going to prepare that data for machine learning and then train several models to see which one fits the data best. Larger data sets take more memory and time to process. Despite its small size, if you were to train a model on this data, it would take a long time.
os <- object.size(d)
print(paste("Data is", round(as.numeric(os)/1000000, 1), "mb."))
# > [1] "Data is 40.7 mb."
In general, 336k rows and 19 columns is large but should be workable. Keep in mind that the data changes as you manipulate it, though. For example, if we run prep_data, the size of the data changes. This is because prep_data transforms, adds, and removes columns to prepare your data for machine learning. It will only modify columns, never rows.
Prepping this data increased the number of columns from 19 to 143. The size went up an order of magnitude, to roughly 790 MB. The categorical columns, those made up of characters or factors, are the reason for the extra columns, and thus, the size change. The amount of memory your data takes up is proportional to the number of cells in the table, rows times columns.
R works with data "in-memory." When the size of the data loaded in the session exceeds available memory, the operating system starts swapping to disk, which is about an order of magnitude slower. Check out the Grid View of the Environment tab in RStudio. You can sort by object size to see if you're lugging around a bunch of big items and what's taking up space. You can remove items from the active environment (this doesn't touch anything on disk, but you'll have to re-load or re-create an object to get it back in R) with rm(object_name). More detail on memory usage can be found in Hadley Wickham's book, "Advanced R."
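Here is a minimal sketch of doing the same inspection from the console instead of the Environment pane; the rm call assumes d_clean is a large object you no longer need.

# List objects in the global environment, largest first (sizes in bytes)
sort(sapply(mget(ls()), object.size), decreasing = TRUE)
# Drop a big object you're done with and ask R to release the freed memory
rm(d_clean)
gc()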
library(healthcareai) d <- d %>% mutate(arr_delay = case_when(arr_delay > 15 ~ "Y", TRUE ~ "N"), flight_id = 1:nrow(d)) %>% select(-dep_time, -arr_time, -dep_delay, hour, minute) d_clean <- d %>% prep_data(outcome = arr_delay, tailnum, collapse_rare_factors = FALSE, add_levels = FALSE) # > Warning in prep_data(., outcome = arr_delay, tailnum, collapse_rare_factors = # > FALSE, : These ignored variables have missingness: tailnum # > Training new data prep recipe... # > Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE. # > year dim(d_clean) os <- object.size(d_clean) print(paste("Prepped data is", round(as.numeric(os)/1000000, 1), "mb.")) # > [1] 336776 143 # > [1] "Prepped data is 789.6 mb."
Categorical columns must be expanded into dummy columns for machine learning. For example, carrier has 16 unique values, and those will be turned into 16 columns. These dummy columns are made up of 1s and 0s. There is one dummy column for each unique value in the original column except one; the left-out value can be inferred from a 0 in all of the other columns. For example, if your gender column contained Male and Female, you would get 1 dummy column, Gender_Male, where males are 1 and females are 0. If you wanted that dummy column to instead be Gender_Female, you can use the prep_data argument ref_levels, like this: ref_levels = c(gender = "Female"). prep_data then adds a column to collect missing values in the original column.
Be careful with categorical columns, as they can blow up your data set. If you had a column of DRG codes (or tailnums, which was removed above) with 100s of unique values, your data would gain hundreds of columns!
print(paste("Carrier has", unique(d$carrier) %>% length(), "unique values.")) d_clean %>% select(starts_with("carrier")) # > healthcareai-prepped data. Recipe used to prepare data: # > Current data: # > [1] "Carrier has 16 unique values." # > Data Recipe # > # > Inputs: # > # > role #variables # > outcome 1 # > predictor 15 # > # > Training data contained 336776 data points and 9430 incomplete rows. # > # > Operations: # > # > Sparse, unbalanced variable filter removed year [trained] # > Date features from time_hour [trained] # > Variables removed time_hour [trained] # > Mean Imputation for month, day, sched_dep_time, ... [trained] # > Filling NA with missing for carrier, origin, dest [trained] # > Dummy variables from carrier, origin, and dest [trained] # > # A tibble: 336,776 x 16 # > carrier_X9E carrier_AA carrier_AS carrier_B6 carrier_DL carrier_EV carrier_F9 # > <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> # > 1 0 0 0 0 0 0 0 # > 2 0 0 0 0 0 0 0 # > 3 0 1 0 0 0 0 0 # > 4 0 0 0 1 0 0 0 # > 5 0 0 0 0 1 0 0 # > # … with 336,771 more rows, and 9 more variables: carrier_FL <dbl>, # > # carrier_HA <dbl>, carrier_MQ <dbl>, carrier_OO <dbl>, carrier_US <dbl>, # > # carrier_VX <dbl>, carrier_WN <dbl>, carrier_YV <dbl>, carrier_missing <dbl>
One of the best ways to reduce training time and memory requirements is to limit the number of columns in the data. Here are some easy ways to do it.
Since character variables with lots of unique values, or high cardinality variables, cause very wide datasets, the best way to limit width is understanding which columns those are and whether or not you need them in your data. How many unique values are in a character column, like destination?
d %>% summarize_if(~ is.character(.x) | is.factor(.x), n_distinct)
# > # A tibble: 1 x 5
# >   arr_delay carrier tailnum origin  dest
# >       <int>   <int>   <int>  <int> <int>
# > 1         2      16    4044      3   105
What about how those values correlate with the outcome? In this case, we'll do a visualization of destination, looking at the proportion of each that belongs to the delayed group.
d %>%
  ggplot(aes(x = reorder(dest, arr_delay, function(x) -sum(x == "Y") / length(x)),
             fill = arr_delay)) +
  geom_bar(stat = "count", position = "fill") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = .5)) +
  xlab("Destination") +
  ylab("Proportion Delayed")
But really, you want to know what the distribution of all your variables looks like. You should profile your data using the DataExplorer package. It gives you all sorts of good info about your data and generates an HTML report that you can refer back to later to know if your data might have changed. Check out an example here.
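For example, assuming the DataExplorer package is installed, a one-liner like the sketch below writes an HTML profiling report to your working directory.

library(DataExplorer)
# Profile every column and summarize relationships with the outcome
create_report(d, y = "arr_delay")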
The simplest option is to remove categorical or date columns with more than, say, 50 categories. You might throw out some information, but you can always add it back in later if you need to. You'll notice that's what I did above by ignoring tailnum, which has about 4,000 categories and doesn't likely contain much useful information.
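A quick way to apply that rule with dplyr is sketched below. The 50-category cutoff is arbitrary; note that in this data it would drop dest as well as tailnum, and the next section shows a better way to keep dest.

# Drop character columns with more than 50 unique values (here: tailnum and dest)
d_narrow <- d %>%
  select(!where(~ is.character(.x) && n_distinct(.x) > 50))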
We saw in the plot that destination definitely has useful information, as some destinations are rarely delayed while others are routinely delayed. 105 categories is simply too many to include as-is, though. Here, the collapse_rare_factors argument to prep_data can help by lumping the rare categories into an "other" category. This example puts any category that contains less than 2% of the total data into "other," and destination is shrunk to the 16 most common destinations.
d_clean2 <- d %>% mutate(hour_of_day = lubridate::hour(time_hour)) %>% prep_data(outcome = arr_delay, tailnum, collapse_rare_factors = 0.02) # > Warning in prep_data(., outcome = arr_delay, tailnum, collapse_rare_factors = # > 0.02): These ignored variables have missingness: tailnum # > Training new data prep recipe... # > Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE. # > year d_clean2 %>% select(starts_with("dest")) # > healthcareai-prepped data. Recipe used to prepare data: # > Current data: # > Data Recipe # > # > Inputs: # > # > role #variables # > outcome 1 # > predictor 16 # > # > Training data contained 336776 data points and 9430 incomplete rows. # > # > Operations: # > # > Sparse, unbalanced variable filter removed year [trained] # > Date features from time_hour [trained] # > Variables removed time_hour [trained] # > Mean Imputation for month, day, sched_dep_time, ... [trained] # > Filling NA with missing for carrier, origin, dest [trained] # > Adding levels to: other, missing [trained] # > Collapsing factor levels for carrier, origin, dest [trained] # > Adding levels to: other, missing [trained] # > Dummy variables from carrier, origin, and dest [trained] # > # A tibble: 336,776 x 18 # > dest_ATL dest_BOS dest_CLT dest_DCA dest_DEN dest_DFW dest_DTW dest_FLL # > <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> # > 1 0 0 0 0 0 0 0 0 # > 2 0 0 0 0 0 0 0 0 # > 3 0 0 0 0 0 0 0 0 # > 4 0 0 0 0 0 0 0 0 # > 5 1 0 0 0 0 0 0 0 # > # … with 336,771 more rows, and 10 more variables: dest_IAH <dbl>, # > # dest_LAX <dbl>, dest_MCO <dbl>, dest_MIA <dbl>, dest_MSP <dbl>, # > # dest_ORD <dbl>, dest_RDU <dbl>, dest_SFO <dbl>, dest_TPA <dbl>, # > # dest_missing <dbl>
By default, prep_data will also transform dates into circular numeric columns that can be used by a machine learning model. The simplest conversion would be to use categorical columns that eventually become 0/1 columns for each month and day of week. That method is interpretable and captures the fact that March is not greater than February, but it causes the dataset to grow large. The circular representation stores the same information in fewer, numeric columns, and it ensures that the switch from December (12) to January (1) is equivalent to changing one month. Fewer columns are created and models will train faster.
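This isn't prep_data's internal code, but a sketch of the circular idea applied to the integer month column makes it concrete: each month becomes a point on a circle, so December and January land next to each other.

# Encode month as sine/cosine coordinates on a circle
d %>%
  transmute(month,
            month_sin = sin(2 * pi * month / 12),
            month_cos = cos(2 * pi * month / 12)) %>%
  distinct() %>%
  arrange(month)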
Using circular numeric dates, the one date-time column is converted into 7 numeric columns.
library(lubridate) # > # > Attaching package: 'lubridate' # > The following objects are masked from 'package:base': # > # > date, intersect, setdiff, union d_clean2 <- d %>% select(-year, -month, -day) %>% prep_data(outcome = arr_delay, tailnum, collapse_rare_factors = 0.02, convert_dates = TRUE) # > Warning in prep_data(., outcome = arr_delay, tailnum, collapse_rare_factors = # > 0.02, : These ignored variables have missingness: tailnum # > Training new data prep recipe... d_clean2 %>% select(starts_with("time_hour")) # > healthcareai-prepped data. Recipe used to prepare data: # > Current data: # > Data Recipe # > # > Inputs: # > # > role #variables # > outcome 1 # > predictor 12 # > # > Training data contained 336776 data points and 9430 incomplete rows. # > # > Operations: # > # > Sparse, unbalanced variable filter removed no terms [trained] # > Date features from time_hour [trained] # > Variables removed time_hour [trained] # > Mean Imputation for sched_dep_time, sched_arr_time, ... [trained] # > Filling NA with missing for carrier, origin, dest [trained] # > Adding levels to: other, missing [trained] # > Collapsing factor levels for carrier, origin, dest [trained] # > Adding levels to: other, missing [trained] # > Dummy variables from carrier, origin, and dest [trained] # > # A tibble: 336,776 x 7 # > time_hour_dow_s… time_hour_dow_c… time_hour_month… time_hour_month… # > <dbl> <dbl> <dbl> <dbl> # > 1 0.434 -0.901 0.500 0.866 # > 2 0.434 -0.901 0.500 0.866 # > 3 0.434 -0.901 0.500 0.866 # > 4 0.434 -0.901 0.500 0.866 # > 5 0.434 -0.901 0.500 0.866 # > # … with 336,771 more rows, and 3 more variables: time_hour_year <dbl>, # > # time_hour_hour_sin <dbl>, time_hour_hour_cos <dbl>
Using categorical dates, the one date-time column is turned into 23 sparse (mostly 0) columns.
d_clean2 <- d %>% select(-year, -month, -day) %>% prep_data(outcome = arr_delay, tailnum, collapse_rare_factors = 0.02, convert_dates = "categories") # > Warning in prep_data(., outcome = arr_delay, tailnum, collapse_rare_factors = # > 0.02, : These ignored variables have missingness: tailnum # > Training new data prep recipe... d_clean2 %>% select(starts_with("time_hour")) # > healthcareai-prepped data. Recipe used to prepare data: # > Current data: # > Data Recipe # > # > Inputs: # > # > role #variables # > outcome 1 # > predictor 12 # > # > Training data contained 336776 data points and 9430 incomplete rows. # > # > Operations: # > # > Sparse, unbalanced variable filter removed no terms [trained] # > Date features from time_hour [trained] # > Variables removed time_hour [trained] # > Mean Imputation for sched_dep_time, sched_arr_time, ... [trained] # > Filling NA with missing for carrier, origin, dest, ... [trained] # > Adding levels to: other, missing [trained] # > Collapsing factor levels for carrier, origin, dest, ... [trained] # > Adding levels to: other, missing [trained] # > Dummy variables from carrier, origin, dest, time_hour_dow, and time_hour_month [trained] # > # A tibble: 336,776 x 23 # > time_hour_year time_hour_hour time_hour_dow_S… time_hour_dow_T… # > <dbl> <int> <dbl> <dbl> # > 1 2013 5 0 1 # > 2 2013 5 0 1 # > 3 2013 5 0 1 # > 4 2013 5 0 1 # > 5 2013 6 0 1 # > # … with 336,771 more rows, and 19 more variables: time_hour_dow_Wed <dbl>, # > # time_hour_dow_Thu <dbl>, time_hour_dow_Fri <dbl>, time_hour_dow_Sat <dbl>, # > # time_hour_dow_other <dbl>, time_hour_dow_missing <dbl>, # > # time_hour_month_Jan <dbl>, time_hour_month_Feb <dbl>, # > # time_hour_month_Mar <dbl>, time_hour_month_Apr <dbl>, # > # time_hour_month_May <dbl>, time_hour_month_Jun <dbl>, # > # time_hour_month_Aug <dbl>, time_hour_month_Sep <dbl>, # > # time_hour_month_Oct <dbl>, time_hour_month_Nov <dbl>, # > # time_hour_month_Dec <dbl>, time_hour_month_other <dbl>, # > # time_hour_month_missing <dbl>
Some categories, like tailnum, are not handled well by the collapse_rare_factors argument. There are too many categories, and they are fairly evenly distributed, so they all get moved into the other category, which doesn't help at all. You could remove them by passing them to the ... argument of prep_data, as I did with tailnum.
An alternative to grouping rare categories is to use add_best_levels to make columns for the categories (or levels) that are likely to help differentiate the outcome variable. This function can be used for any grouping variable, like zip code.
The following example identifies 20 tail numbers that could be predictive of the outcome. Instead of removing tail number entirely, you could try adding these columns back to the modeling data with bind_cols. Similarly, you could select the best dest values by using dest as the grouping variable.
data <- d %>% select(flight_id, arr_delay) ls <- d %>% select(flight_id, tailnum, arr_delay) d_best_levels <- data %>% add_best_levels(longsheet = ls, id = flight_id, groups = tailnum, outcome = arr_delay, n_levels = 20, min_obs = 50, missing_fill = 0) # > No fill column was provided, so using "1" for present entities d_best_levels # > # A tibble: 336,776 x 22 # > flight_id arr_delay tailnum_N13133 tailnum_N13970 tailnum_N14953 # > <int> <chr> <int> <int> <int> # > 1 1 N 0 0 0 # > 2 2 Y 0 0 0 # > 3 3 Y 0 0 0 # > 4 4 N 0 0 0 # > 5 5 N 0 0 0 # > # … with 336,771 more rows, and 17 more variables: tailnum_N15910 <int>, # > # tailnum_N17984 <int>, tailnum_N363NB <int>, tailnum_N36915 <int>, # > # tailnum_N3HKAA <int>, tailnum_N3HWAA <int>, tailnum_N3HYAA <int>, # > # tailnum_N4XDAA <int>, tailnum_N514UA <int>, tailnum_N669DN <int>, # > # tailnum_N688DL <int>, tailnum_N710UW <int>, tailnum_N822MQ <int>, # > # tailnum_N844VA <int>, tailnum_N8611A <int>, tailnum_N8869B <int>, # > # tailnum_N992DL <int>
At some point, adding more rows of data will no longer improve performance. Better data will always win against more of the same bad data.
If you have a large data set, like flights, you probably have enough information to do machine learning on a subset and still get good results. Some common ways to reduce length:
Sample a random subset of data. This keeps half the original data with the same ratio of delayed to not delayed.
stratified_sample_d <- d %>% split_train_test(outcome = arr_delay, percent_train = .5)
stratified_sample_d <- stratified_sample_d$train
d %>% count(arr_delay) %>% arrange(desc(n))
stratified_sample_d %>% count(arr_delay) %>% arrange(desc(n))
# > # A tibble: 2 x 2
# >   arr_delay      n
# >   <chr>      <int>
# > 1 N         259146
# > 2 Y          77630
# > # A tibble: 2 x 2
# >   arr_delay      n
# >   <chr>      <int>
# > 1 N         129573
# > 2 Y          38815
Use only data from the last year or two. Healthcare data often goes back several years, but you might find that data from 5 years ago isn't as predictive as more current data anyway!
# From an integer month column
d_recent <- d %>% filter(month >= 6)
# From a date column
d_recent <- d %>% filter(lubridate::month(time_hour) >= 6)
In the case of unbalanced data sets, where there are many more “No” than “Yes” outcomes, throw out some of the “No” rows.
downsampled_d <- caret::downSample(d, as.factor(d$arr_delay)) %>%
  select(-Class, -tailnum)
downsampled_d %>% count(arr_delay) %>% arrange(desc(n))
# >   arr_delay     n
# > 1         N 77630
# > 2         Y 77630
Let's train a model to predict whether a plane arrived more than 15 minutes late or not. The first thing you might do is use machine_learn. I stopped this code after I saw the warning message.
m <- machine_learn(downsampled_d, outcome = arr_delay)
Training new data prep recipe
Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance = FALSE.
year
arr_delay looks categorical, so training classification algorithms.
You've chosen to tune 150 models (n_folds = 5 x tune_depth = 10 x length(models) = 3) on a 336,776 row dataset. This may take a while...
Training with cross validation: Random Forest
Notice the message from machine_learn saying it will be training 150 models! Under the hood, this function is doing some seriously rigorous machine learning for you: it preps the data, trains three algorithms (random forest, XGBoost, and regularized regression), and tunes each over 10 random hyperparameter combinations, evaluating every candidate with 5-fold cross validation.
These steps ensure you get the best-performing model while still being resistant to overfitting, but they come at the cost of computational complexity. You can speed things up by using a shorter model list (models = "rf"), less hyperparameter tuning (tune_depth = 3), and fewer folds (n_folds = 4). Just know that you're cutting corners for the sake of time. We recommend doing your development work quickly using flash_models but then doing the full tune before saving your final model.
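As a sketch, a faster development run with those corner-cutting settings might look like this (using the same downsampled data as above):

# One algorithm, shallower tuning, fewer folds: much faster, somewhat less rigorous
m_quick <- machine_learn(downsampled_d, outcome = arr_delay,
                         models = "rf", tune_depth = 3, n_folds = 4)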
We can use prep_data and flash_models to get an idea of how long our models will take to train. flash_models requires at least 2 folds, meaning 2 models, each trained on half the data; we recommend using at least 4 folds. If 4 models take 6.5 minutes, 150 models will be in the ballpark of 4 hours.
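That ballpark figure is just arithmetic on the 6.5-minute estimate:

# 6.5 minutes for 4 models -> minutes per model, scaled to 150 models, in hours
6.5 / 4 * 150 / 60  # approximately 4.06 hours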
start <- Sys.time() d_clean <- downsampled_d %>% prep_data(outcome = arr_delay, collapse_rare_factors = 0.03, convert_dates = FALSE) # > Training new data prep recipe... # > Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE. # > year m_rf_1 <- flash_models(d_clean, outcome = arr_delay, models = "rf", n_folds = 4) # > # > arr_delay looks categorical, so training classification algorithms. # > # > After data processing, models are being trained on 34 features with 155,260 observations. # > Based on n_folds = 4 and hyperparameter settings, the following number of models will be trained: 4 rf's # > Training at fixed values: Random Forest # > You may, or may not, see messages about progress in growing trees. The estimates are very rough, and you should expect the progress ticker to cycle 5 times. # > # > *** Models successfully trained. The model object contains the training data minus ignored ID columns. *** # > *** If there was PHI in training data, normal PHI protocols apply to the model object. *** Sys.time() - start # > Growing trees.. Progress: 78%. Estimated remaining time: 8 seconds. # > Growing trees.. Progress: 78%. Estimated remaining time: 8 seconds. # > Growing trees.. Progress: 73%. Estimated remaining time: 11 seconds. # > Growing trees.. Progress: 79%. Estimated remaining time: 8 seconds. # > Growing trees.. Progress: 53%. Estimated remaining time: 27 seconds. # > Time difference of 5.320908 mins
m_rf_1 # > Algorithms Trained: Random Forest # > Model Name: arr_delay # > Target: arr_delay # > Class: Classification # > Performance Metric: AUROC # > Number of Observations: 155260 # > Number of Features: 34 # > Models Trained: 2020-08-05 09:16:40 # > # > Models have not been tuned. Performance estimated via 4-fold cross validation at fixed hyperparameter values. # > Best model: Random Forest # > AUPR = 0.71, AUROC = 0.72 # > User-selected hyperparameter values: # > mtry = 5 # > splitrule = extratrees # > min.node.size = 1
If you don't need the finer control of your data preparation, machine_learn with tune = FALSE is equivalent to prep_data %>% flash_models.
m <- machine_learn(d, outcome = arr_delay, models = "rf", tune = FALSE, n_folds = 4)
As different models take different amounts of time, it's good to know what performance looks like for each before choosing one to go forward with. Hyperparameter tuning with GLM is very efficient. However, in some situations, it won't be as accurate as other models, and it can be slower on very wide datasets (hundreds of columns).
On our flights data, GLM was faster, taking only about a quarter of the time RF did. Its performance was not as high, though: RF reached an AUROC of 0.72 while GLM reached 0.67. The performance increase could be worth the extra training time, as random forest appears to fit the data better.
start <- Sys.time() d_clean <- downsampled_d %>% prep_data(outcome = arr_delay, collapse_rare_factors = 0.03, convert_dates = FALSE) # > Training new data prep recipe... # > Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE. # > year m_glm_1 <- flash_models(d_clean, outcome = arr_delay, models = "glm", n_folds = 4) # > # > arr_delay looks categorical, so training classification algorithms. # > # > After data processing, models are being trained on 34 features with 155,260 observations. # > Based on n_folds = 4 and hyperparameter settings, the following number of models will be trained: 40 glm's # > Model training may take a few minutes. # > Training at fixed values: glmnet # > # > *** Models successfully trained. The model object contains the training data minus ignored ID columns. *** # > *** If there was PHI in training data, normal PHI protocols apply to the model object. *** Sys.time() - start # > Time difference of 1.384417 mins
m_glm_1 # > Algorithms Trained: glmnet # > Model Name: arr_delay # > Target: arr_delay # > Class: Classification # > Performance Metric: AUROC # > Number of Observations: 155260 # > Number of Features: 34 # > Models Trained: 2020-08-05 09:18:03 # > # > Models have not been tuned. Performance estimated via 4-fold cross validation at fixed hyperparameter values. # > Best model: glmnet # > AUPR = 0.63, AUROC = 0.67 # > User-selected hyperparameter values: # > alpha = 1 # > lambda = 0.00098
Now that you've settled on a model, you could use either machine_learn with just that model or tune_models with selected tuning options. This gives you a little more rigor than our svelte flash_models run but won't take 4 hours.
start <- Sys.time() d_clean <- downsampled_d %>% prep_data(outcome = arr_delay, collapse_rare_factors = 0.03, convert_dates = FALSE) # > Training new data prep recipe... # > Removing the following 1 near-zero variance column(s). If you don't want to remove them, call prep_data with remove_near_zero_variance as a smaller numeric or FALSE. # > year m_glm_2 <- tune_models(d = d_clean, outcome = arr_delay, models = "glm", tune_depth = 5) # > # > arr_delay looks categorical, so training classification algorithms. # > # > After data processing, models are being trained on 34 features with 155,260 observations. # > Based on n_folds = 5 and hyperparameter settings, the following number of models will be trained: 50 glm's # > Model training may take a few minutes. # > Training with cross validation: glmnet # > # > *** Models successfully trained. The model object contains the training data minus ignored ID columns. *** # > *** If there was PHI in training data, normal PHI protocols apply to the model object. *** Sys.time() - start # > Time difference of 2.343012 mins
m_glm_2 # > Algorithms Trained: glmnet # > Model Name: arr_delay # > Target: arr_delay # > Class: Classification # > Performance Metric: AUROC # > Number of Observations: 155260 # > Number of Features: 34 # > Models Trained: 2020-08-05 09:20:23 # > # > Models tuned via 5-fold cross validation over 10 combinations of hyperparameter values. # > Best model: glmnet # > AUPR = 0.62, AUROC = 0.66 # > Optimal hyperparameter values: # > alpha = 1 # > lambda = 0.0035
We got different performance than when we used flash_models. Why? Hyperparameters are chosen randomly, so we could have gotten lucky before. Using 5-fold cross validation is also more rigorous: it will uncover any differences there might be across subsets of the data and more closely mimic the production environment.
Before simply getting more memory on your computer, try the above on your data. Hopefully, you end up with a better model in less time!