Now that you’ve put in the work of building a model and it’s performing well on historical data, congratulations! It’s now time to prepare it to make reoccurring predictions. This document focuses on the requirements needed before you actually talk to the DBA/ETL team about creating the reoccurring schedule. At this point, you have

  • Chosen a use case and an outcome to predict
  • Gathered your feature (i.e., input variable) list to make predictions
  • Trained a model that meets your performance requirements

If you’ve done those, push on below!

Creating and saving a model

In order to work with models in a production environment, first you must be able to have a saved model on disk. While we keep this section minimal, you can find out more about model training in our Getting Started vignette. If curious, you can find more about database connections here

Note, install tidyverse via install.packages('tidyverse')

Pushing predictions to a database

While the frequency will vary depending on the use case, for most healthcare ML scenarios you’ll need predictions on a reoccurring basis. Daily batch predictions are common. To create predictions, new rows (representing patients or patient encounters) are run against the model that lives in the RDA file.

Instead of simply creating predictions in R, however, in a production scenario you’ll be pushing these predictions to a database. Typically we append to the same table each night, such that a person will get an updated risk score for each day they’re in the hospital (for example).

Creating predictions

Now we focus on the code that will run each night to create predictions. Note a couple of things:

  • The query is different from the development step
    • Here you’re only querying patients that need a prediction (notice WHERE clause)
    • Here you won’t have the label (i.e., outcome) column in your query
  • We read the model from disk; if needed, a web service could be involved

Create the output table

Before we worry about pushing rows to the database, look at the format of your output by typing glimpse(predictions)

You might want push fewer columns than that to a table each night. Here’s how you slim this down:

Now that we have only the columns necessary, let’s shift to the database table.

Outside of a Health Catalyst environment

Use a tool like SQL Server Management Studio and a CREATE TABLE statement.

Inside a Health Catalyst environment

  1. Append a few utility columns to your output, so that it matches the typical structure of tables generated by Subject Area Mart Designer (SAMD):
  1. Use SAMD to create the table (since this creates metadata)
  • This binding requires a query that not only has the column structure, but also WHERE 0 = 1, such that only R is populating the output table.
  • Edit the column types, name the binding something like ReadmitMLOutputTable, and ensure that’s it generated as expected.
  • Set the binding to Incremental, such that the table isn’t dropped and reloaded each night.

Writing predictions to the table

Here we use the RODBC package, which can be installed via install.packages('RODBC'). Note that the database isn’t specified here, since it was defined above.

If your predictions arrive in the database, congrats! You’re now ready to work with the DBA/ETL* team to schedule this reoccurring job, such that it runs in a manner that’s dependency-aware.

Testing in production

Before users start consuming your awesome ML work, it’s imperative that you verify that the model’s accuracy holds up in production. Performance on retrospective data isn’t the same as performance on incoming production data. Sometimes issues like data leakage or other issues cause the performance to drop.

Effectively, if you’re creating CLABSI predictions, you should wait a few weeks until a large number of the patients who have received risk predictions have left the hospital and had an outcome (CLABSI = Y/N) recorded so you can compare predictions to actual outcomes. Similarly, if you’re predicting 30-day readmissions, you need to wait ~45 days until you have enough data to verify if you accurately predicted who did or didn’t return to the hospital. If you have a heavy class imbalance (with rare CLABSIs, etc), you’ll need to wait longer. This evaluation can be done via

evaluate_classification(predicted, actual)