Pivot multiple rows per observation to one row with multiple columns

pivot(d, grain, spread, fill, fun = sum, missing_fill = NA, extra_cols)

Arguments

d

data frame

grain

Column that defines rows. Unquoted.

spread

Column that will become multiple columns. Unquoted.

fill

Column to be used to fill the values of cells in the output, perhaps after aggregation by fun. If fill is not provided, counts will be used, as though a fill column of 1s had been provided.

fun

Function for aggregation, defaults to sum. Custom functions can be used with the same syntax as the apply family of functions, e.g. fun = function(x) some_function(another_fun(x)).

missing_fill

Value to fill for combinations of grain and spread that are not present. Defaults to NA, but 0 may be useful as well.

extra_cols

Values of spread to create all-missing_fill columns, for e.g. if you want to add levels that were observed in training but are not present in deployment.

Value

A tibble data frame with one row for each unique value of grain, and one column for each unique value of spread plus one column for the entries in grain.

Entries in the tibble are defined by the fill column. Combinations of grain x spread that are not present in d will be filled in with missing_fill. If there are grain x spread pairs that appear more than once in d, they will be aggregated by fun.

Details

pivot is useful when you want to change the grain of your data, for example from the procedure grain to the patient grain. In that example, each patient might have 0, 1, or more medications. To make a patient-level table, we need a column for each medication, which is what it means to make a wide table. The fill argument dictates what to put in each of the medication columns, e.g. the dose the patient got. fill defaults to "1", as an indicator variable. If any patients have multiple rows for the same medication (say they recieved a med more than once), we need a way to deal with that, which is what the fun argument handles. By default it uses sum, so if fill is left as its default, the count of instances for each patient will be used.

Examples

meds <- tibble::tibble( patient_id = c("A", "A", "A", "B"), medication = c("zoloft", "asprin", "lipitor", "asprin"), pills_per_day = c(1, 8, 2, 4) ) meds
#> # A tibble: 4 x 3 #> patient_id medication pills_per_day #> <chr> <chr> <dbl> #> 1 A zoloft 1 #> 2 A asprin 8 #> 3 A lipitor 2 #> 4 B asprin 4
# Number of pills of each medication each patient gets: pivot( d = meds, grain = patient_id, spread = medication, fill = pills_per_day, missing_fill = 0 )
#> # A tibble: 2 x 4 #> patient_id medication_asprin medication_lipitor medication_zoloft #> <chr> <dbl> <dbl> <dbl> #> 1 A 8 2 1 #> 2 B 4 0 0
bills <- tibble::tibble( patient_id = rep(c("A", "B"), each = 4), dept_id = rep(c("ED", "ICU"), times = 4), charge = runif(8, 0, 1e4), date = as.Date("2024-12-25") - sample(0:2, 8, TRUE) ) bills
#> # A tibble: 8 x 4 #> patient_id dept_id charge date #> <chr> <chr> <dbl> <date> #> 1 A ED 7545. 2024-12-24 #> 2 A ICU 3689. 2024-12-25 #> 3 A ED 3774. 2024-12-23 #> 4 A ICU 4198. 2024-12-24 #> 5 B ED 3632. 2024-12-23 #> 6 B ICU 2180. 2024-12-24 #> 7 B ED 502. 2024-12-24 #> 8 B ICU 1893. 2024-12-24
# Total charges per patient x department: pivot(bills, patient_id, dept_id, charge, sum)
#> # A tibble: 2 x 3 #> patient_id dept_id_ED dept_id_ICU #> <chr> <dbl> <dbl> #> 1 A 11320. 7887. #> 2 B 4134. 4073.
# Count of charges per patient x day: pivot(bills, patient_id, date)
#> No fill column was provided, so using "1" for present entities
#> There are rows that contain the same values of both patient_id and date but you didn't provide a function to 'fun' for their aggregation. Proceeding with the default: fun = sum.
#> # A tibble: 2 x 4 #> patient_id `date_2024-12-23` `date_2024-12-24` `date_2024-12-25` #> <chr> <int> <int> <int> #> 1 A 1 2 1 #> 2 B 1 3 NA
# Can provide a custom function to fun, which will take fill as input. # Get the difference between the greatest and smallest charge in each # department for each patient and format it as currency. pivot(d = bills, grain = patient_id, spread = dept_id, fill = charge, fun = function(x) paste0("$", round(max(x) - min(x), 2)) )
#> # A tibble: 2 x 3 #> patient_id dept_id_ED dept_id_ICU #> <chr> <chr> <chr> #> 1 A $3771.17 $508.65 #> 2 B $3130.63 $287.2