Now that you’ve put in the work of building a model and it’s performing well on historical data, congratulations! It’s now time to prepare it to make recurring predictions. This document focuses on the requirements to satisfy before you actually talk to the DBA/ETL team about creating the recurring schedule. At this point, you have trained a model on historical data and verified that its performance is acceptable.
If you’ve done those, push on below!
To work with models in a production environment, you first need a model saved to disk. While we keep this section minimal, you can find out more about model training in our Getting Started vignette. If you’re curious, you can find more about database connections here.
Note: install the tidyverse via install.packages('tidyverse')
library(healthcareai)
library(DBI)
library(tidyverse)

my_con <- build_connection_string(server = "HCS-GM0004",
                                  database = "SAM")
con <- dbConnect(odbc::odbc(), .connection_string = my_con)

train_query <- "SELECT [FacilityAccountID]
      ,[AdmitAgeNBR]
      ,[GenderCD]
      ,[EthnicGroupDSC]
      ,[MaritalStatusDSC]
      ,[Readmit30FLG]
  FROM [SAM].[Schema].[SummaryVisitBASE]
  WHERE InpatientFLG = 'F'"
d <- db_read(con, train_query)

models <- machine_learn(d = d, FacilityAccountID, models = "rf", outcome = Readmit30FLG)

# Save my model
healthcareai::save_models(models, "readmit_models.RDS")
While the frequency will vary depending on the use case, for most healthcare ML scenarios you’ll need predictions on a recurring basis; daily batch predictions are common. To create predictions, new rows (representing patients or patient encounters) are run against the model that lives in the RDS file.
In a production scenario, however, instead of simply creating predictions in R, you’ll be pushing those predictions to a database. Typically we append to the same table each night, such that, for example, a person gets an updated risk score for each day they’re in the hospital.
Now we focus on the code that will run each night to create predictions. Note a couple of things: the outcome column, Readmit30FLG, is no longer in the query, since it isn’t yet known for these patients, and only the rows that need predictions are pulled, i.e., current inpatients (note the new WHERE clause).

prod_query <- "SELECT [FacilityAccountID]
      ,[AdmitAgeNBR]
      ,[GenderCD]
      ,[EthnicGroupDSC]
      ,[MaritalStatusDSC]
  FROM [SAM].[Schema].[SummaryVisitBASE]
  WHERE InpatientFLG = 'T'"
d_out <- db_read(con, prod_query)

# Load the saved model and score the new rows
models <- healthcareai::load_models("readmit_models.RDS")
predictions <- predict(models, d_out)
Before we worry about pushing rows to the database, look at the format of your output by running glimpse(predictions).
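For the readmission example above, the output might look roughly like the sketch below. The row count and values are purely illustrative, but healthcareai appends a predicted_Readmit30FLG column holding each row’s predicted probability alongside the original columns:

glimpse(predictions)
# Rows: 1,450   (illustrative)
# Columns: 6
# $ FacilityAccountID      <int> 1001, 1002, 1003, ...
# $ AdmitAgeNBR            <dbl> 54, 71, 33, ...
# $ GenderCD               <chr> "F", "M", "F", ...
# $ EthnicGroupDSC         <chr> ...
# $ MaritalStatusDSC       <chr> ...
# $ predicted_Readmit30FLG <dbl> 0.23, 0.61, 0.08, ...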
You might want to push fewer columns than that to the table each night. Here’s how to slim the output down:
predictions <- predictions %>%
  select(FacilityAccountID, predicted_Readmit30FLG) %>%      # Grab just two columns
  rename(PredictedProbabilityNBR = predicted_Readmit30FLG)   # Change the column name
Now that we have only the columns necessary, let’s shift to the database table.
Use a tool like SQL Server Management Studio and a CREATE TABLE statement.
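If you’d rather create the table from R, here’s a minimal sketch using the DBI connection from above. The table and column names match the predictions data frame we just built; the column types are assumptions, so adjust them to your environment:

# Create the output table (column types are assumptions; adjust as needed)
create_query <- "CREATE TABLE Schema.ReadmitMLOutputTable (
    FacilityAccountID INT,
    PredictedProbabilityNBR FLOAT
)"
dbExecute(con, create_query)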
If you’re working in a Health Catalyst environment, also append the standard SAM utility columns:

predictions <- predictions %>%
  add_SAM_utility_cols()
When setting up the output table, make sure that its source query returns no rows (e.g., it ends in WHERE 0 = 1), such that only R is populating the output table, and that its load is Incremental, such that the table isn’t dropped and reloaded each night.

To push the predictions, here we use the RODBC package, which can be installed via install.packages('RODBC'). Note that the database isn’t specified here, since it was defined above in my_con.
library(RODBC)

con <- odbcDriverConnect(connection = my_con)
sqlSave(con, predictions, "Schema.ReadmitMLOutputTable", append = TRUE, rownames = FALSE)
odbcClose(con)
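To confirm the rows landed, here’s a quick check. Since con was reassigned to the RODBC connection and closed above, we reopen a DBI connection first:

# Reopen a DBI connection (the RODBC one above was closed)
con <- dbConnect(odbc::odbc(), .connection_string = my_con)

# Count the rows in the output table
db_read(con, "SELECT COUNT(*) AS RowCNT FROM Schema.ReadmitMLOutputTable")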
If your predictions arrive in the database, congrats! You’re now ready to work with the DBA/ETL* team to schedule this recurring job, such that it runs in a manner that’s dependency-aware.
Before users start consuming your awesome ML work, it’s imperative to verify that the model’s accuracy holds up in production. Performance on retrospective data isn’t the same as performance on incoming production data; problems like data leakage can cause performance to drop.
Effectively, if you’re creating CLABSI predictions, you should wait a few weeks, until a large number of the patients who received risk predictions have left the hospital and had an outcome (CLABSI = Y/N) recorded, so you can compare predictions to actual outcomes. Similarly, if you’re predicting 30-day readmissions, you need to wait ~45 days to have enough data to verify whether you accurately predicted who did or didn’t return to the hospital. If you have a heavy class imbalance (rare CLABSIs, etc.), you’ll need to wait longer. This evaluation can be done via healthcare.ai:
evaluate_classification(predicted, actual)
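For the readmission example, that might look like the sketch below. The eval query and join are assumptions about your schema: we pull each scored encounter’s stored prediction alongside its now-recorded outcome, then compare them:

# Pull stored predictions alongside recorded outcomes (query is illustrative)
eval_query <- "SELECT p.FacilityAccountID
      ,p.PredictedProbabilityNBR
      ,v.Readmit30FLG
  FROM Schema.ReadmitMLOutputTable p
  INNER JOIN [SAM].[Schema].[SummaryVisitBASE] v
    ON v.FacilityAccountID = p.FacilityAccountID
  WHERE v.InpatientFLG = 'F'"  # discharged patients have recorded outcomes
d_eval <- db_read(con, eval_query)

evaluate_classification(d_eval$PredictedProbabilityNBR, d_eval$Readmit30FLG)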
Putting it all together, the nightly script looks like this:

library(healthcareai)
library(DBI)
library(RODBC)
library(tidyverse)

my_con <- build_connection_string(server = "HCS-GM0004",
                                  database = "SAM")
con <- dbConnect(odbc::odbc(), .connection_string = my_con)

# Pull current inpatients; the outcome column isn't pulled, since it's unknown
prod_query <- "SELECT [FacilityAccountID]
      ,[AdmitAgeNBR]
      ,[GenderCD]
      ,[EthnicGroupDSC]
      ,[MaritalStatusDSC]
  FROM [SAM].[Schema].[SummaryVisitBASE]
  WHERE InpatientFLG = 'T'"
d_out <- db_read(con, prod_query)

# Score the new rows against the saved model
models <- healthcareai::load_models("readmit_models.RDS")
predictions <- predict(models, d_out)

# Trim and rename to match the output table
predictions <- predictions %>%
  select(FacilityAccountID, predicted_Readmit30FLG) %>%
  rename(PredictedProbabilityNBR = predicted_Readmit30FLG)

# Append tonight's predictions to the output table
con <- odbcDriverConnect(connection = my_con)
sqlSave(con, predictions, "Schema.ReadmitMLOutputTable", append = TRUE, rownames = FALSE)
odbcClose(con)
*Note: for those working in a Health Catalyst environment, proceed with the ML-Prod document in Spark.