Let’s be honest. You need to get data from a database, do some really awesome stuff with R, and then write the results back to a database. RStudio (and R) can help with that. There are lots of different databases, but this document will focus on Microsoft SQL Server. RStudio supports most databases, and I’d encourage you to look at the db help pages for additional resources. Andiamo!

If you simply want a code snippet to copy and paste, scroll to the bottom.

Connecting to a Database

Before you can do anything, you must make a connection to your database. The build_connection_string function takes care of the syntax for you and returns a connection string that’s ready to be passed into dbConnect. After you have a valid connection, you should be able to browse the database using RStudio’s Connections pane. If you don’t have a connections pane, upgrade your RStudio. It’s worth it.

Your database connection is now stored in the object, con. You can use it in database functions that require a connection. dbConnect will create a persistent connection, meaning that it’s available until you disconnect. If you have a recent version of RStudio, you’ll see the connections pane has updated as well.

Each database, schema, and table is available as a drop down menu for exploration. Clicking the button in the far right column next to a table will bring up a preview of the top 1000 rows of the table.

Reading data

Using SQL Code

From here, you can read data from the database in a couple different ways. You can use db_read to execute SQL code, or you can use dbplyr functionality to execute dplyr code directly against the database.

This query counts admits by year. If you don’t want to pull the results into memory, as might the be the case with large tables, set the pull_into_memory flag to FALSE. This will create a pointer to the database that can be executed “lazily.” In other words R will wait until the last moment to execute the statement.

Using dplyr Code

The other option is to use dplyr code. Set up a reference to the table, then use it to pull data. Here, tbl is creating a reference and storing it in the encounter object. R will treat that object like a data frame and can filter, aggregate, join, etc using R or dplyr code. The last line, collect(), is what actually pulls the data into memory. Be sure to first install.packages('tidyverse')

See here for help with joins using dplyr.

Writing Data

Note: the odbc/DBI route doesn’t work as of 4/18

As of 4/1/2018, there are two ways to write to a database:

  1. The latest version of odbc, and DBI, installed from Github. The CRAN version does not support database schemas.

  2. RODBC

In either case, the data you are trying to write must match the destination table’s column names and data types. To create the table, you use a CREATE TABLE statement in a tool like SSMS (or SAMD, if in the Health Catalyst environment).

This code assumes there is a table called SAM.Sepsis.Predictions with columns:

  • patient_id (int)
  • predicted_probability (float).

Writing with DBI

Note: as of April 18th, DBI is not working for writing to a database

Writing with RODBC

While it’s the route that works (as of April 18th), RODBC won’t be the preferred way to interact with a database long-term because it doesn’t work with RStudio’s connections pane. However, the package is available on CRAN and does support schemas.