All of Us in R

The Basics

Before diving into the specifics, let’s clarify some vocabulary:

  • SQL database : A structured collection of data where information is stored in tables. Each table is like a spreadsheet with rows and columns. Data can be added, removed, or modified using SQL queries.
  • SQL query : A request for data from a database written in a language called SQL (Structured Query Language).
  • Google BigQuery : A cloud-based SQL database that is used to store the All of Us data.
  • bigrquery : An R package that allows you to interact with Google BigQuery from R.
  • dplyr: A part of the tidyverse in R, dplyr is a package for data manipulation. It provides a set of functions that can be used to filter, select, arrange, mutate, summarize, and join data.
  • dbplyr : Also a part of the tidyverse in R, dbplyr is a database backend for dplyr. It allows you to write R code that is then translated into SQL queries.

Installing and Loading allofus

You can install the allofus R package directly from CRAN, or you can install the development version from Github (https://github.com/roux-ohdsi/allofus).

# Install from CRAN
install.packages("allofus")

# Or install development version
install.packages("remotes")
remotes::install_github("roux-ohdsi/allofus")

Use the library() command to load the allofus package and the dplyr package. Most functionality in allofus relies on the dplyr package, which comes pre-installed in the researcher workbench. Installing allofus will also make sure that the version of dplyr is up-to-date.

library(allofus)
library(dplyr)

Accessing Data

Connecting to the database

A connection to a database is an object that allows you to interact with it from R. The allofus package relies on the bigrquery R package to create a connection to the Google BigQuery database when you run aou_connect().

con <- aou_connect()

The object con is used to refer to the connection to the All of Us database. When you run the aou_connect() function, it also gets stored as the default connection for a session, so you don’t need to include it in other functions from the allofus package. For instance, you can run aou_tables() to see a list of tables in the database:

aou_tables()
#> # A tibble: 68 × 2
#>    table_name               columns                                             
#>    <chr>                    <chr>                                               
#>  1 concept_ancestor         ancestor_concept_id, descendant_concept_id, min_lev…
#>  2 cb_criteria_ancestor     ancestor_id, descendant_id                          
#>  3 attribute_definition     attribute_definition_id, attribute_name, attribute_…
#>  4 cdm_source               cdm_source_name, cdm_source_abbreviation, cdm_holde…
#>  5 concept_class            concept_class_id, concept_class_name, concept_class…
#>  6 concept                  concept_id, concept_name, domain_id, vocabulary_id,…
#>  7 concept_synonym          concept_id, concept_synonym_name, language_concept_…
#>  8 cb_criteria_relationship concept_id_1, concept_id_2                          
#>  9 concept_relationship     concept_id_1, concept_id_2, relationship_id, valid_…
#> 10 condition_occurrence     condition_occurrence_id, person_id, condition_conce…
#> # ℹ 58 more rows

Accessing a table

To access a table in the database, use tbl(con, "tablename"). The resulting object is reference to a table in a database that allows you to interact with it from R. In order to connect to a table in the database, you must first create the database connection (con). For example, to create a reference to the person table:

person_tbl <- tbl(con, "person")

Although the person_tbl object behaves similarly to a data frame, it is not actually a data frame. Instead, it is actually a SQL query that only gets run when you need to access the data. This is a feature of the dbplyr package that allows you to manipulate data without actually retrieving it. The SQL query behind the person_tbl object is:

SELECT *
FROM `person`

When you print person_tbl, you’ll get something like this:

person_tbl
#> # Source:   table<person> [?? x 23]
#> # Database: BigQueryConnection
#>    person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#>      <int64>           <int64>       <int64>        <int64>      <int64>
#>  1   xxxxxxx            903096          1955             NA           NA
#>  2   xxxxxxx            903096          1978             NA           NA
#>  3   xxxxxxx            903096          2000             NA           NA
#>  4   xxxxxxx            903096          1988             NA           NA
#>  5   xxxxxxx            903096          1993             NA           NA
#>  6   xxxxxxx            903096          1959             NA           NA
#>  7   xxxxxxx            903096          1976             NA           NA
#>  8   xxxxxxx            903096          1961             NA           NA
#>  9   xxxxxxx            903096          1952             NA           NA
#> 10   xxxxxxx            903096          1980             NA           NA
#> # ℹ more rows
#> # ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
#> #   ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
#> #   care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
#> #   gender_source_concept_id <int64>, race_source_value <chr>,
#> #   race_source_concept_id <int64>, ethnicity_source_value <chr>,
#> #   ethnicity_source_concept_id <int64>, …

You’ll only see the first 10 rows of the person table (person ids were omitted from the output). This allows you to see what the data looks like without loading the entire table into R, which can be slow or even crash your session.

Instead, you want to perform as much data manipulation as possible on the database. This is more efficient because the operations are translated into SQL and executed on the server, which is faster and requires less memory than processing in R.

Data manipulation on the database

Before bringing data into R, you can manipulate it on the database using the dplyr functions. This allows you to perform operations on the database without bringing the data into R’s memory.

For example, you can subset the person table to women born after 1980:

young_women <- person_tbl %>%
  filter(gender_concept_id == 45878463, year_of_birth > 1980)

Before we print out young_women, the SQL query has not actually been run. In fact, person_tbl is not run either. When we do print it, it will run the following SQL:

SELECT `person`.*
FROM `person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)

and print the first 10 rows:

young_women
#> # Source:   SQL [?? x 23]
#> # Database: BigQueryConnection
#>    person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#>      <int64>           <int64>       <int64>        <int64>      <int64>
#>  1   xxxxxxx          45878463          1992             NA           NA
#>  2   xxxxxxx          45878463          1989             NA           NA
#>  3   xxxxxxx          45878463          1981             NA           NA
#>  4   xxxxxxx          45878463          1990             NA           NA
#>  5   xxxxxxx          45878463          1990             NA           NA
#>  6   xxxxxxx          45878463          1985             NA           NA
#>  7   xxxxxxx          45878463          1987             NA           NA
#>  8   xxxxxxx          45878463          1986             NA           NA
#>  9   xxxxxxx          45878463          1983             NA           NA
#> 10   xxxxxxx          45878463          1998             NA           NA
# ℹ more rows
# ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
#   ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
#   care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
#   gender_source_concept_id <int64>, race_source_value <chr>,
#   race_source_concept_id <int64>, ethnicity_source_value <chr>,
#   ethnicity_source_concept_id <int64>, …

Note that we don’t know how many observations match these conditions yet (the dimensions are [?? x 23]), because it hasn’t been fully executed – only the first 10 rows. To get the total number of observations, we can use tally():

tally(young_women)
#> # Source:   SQL [1 x 1]
#> # Database: BigQueryConnection
#>         n
#>   <int64>
#> 1   76135

This is actually a SQL query that only results in 1 row, so we do get to see the entire thing. It’s much faster to run than to bring the entire table into R and then count the number of rows, because the code is executed on the database:

SELECT count(*) AS `n`
FROM (
  SELECT `person`.*
  FROM `person`
  WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)

The collect() Function

We can bring the result of a query into the local R session using collect():

young_women %>% collect()

This brings the table into your local R workspace as a tibble, or dataframe. This is useful for performing operations that cannot be performed directly on the database, such as certain statistical analyses or plotting. For example, if you’re planning to run a regression analysis on the filtered data, you would first use collect() to bring the data into R.

We can bring in the result of tally() as well:

tally(young_women) %>% collect()
#> A tibble: 1 × 1
#>       n
#> <int64>
#>   76135

Or even the entire person table, although that’s not recommended because it’s so large!

person_data <- person_tbl %>% collect()
person_data
#> # A tibble: 413457 × 23
#> person_id    gender_concept_id   year_of_birth   month_of_birth  day_of_birth    
#>   <int64>              <int64>         <int64>          <int64>       <int64>
#>   xxxxxxx               903096            1955               NA            NA    
#>   xxxxxxx               903096            1978               NA            NA
#>   xxxxxxx               903096            2000               NA            NA
#>   xxxxxxx               903096            1988               NA            NA
#>   xxxxxxx               903096            1993               NA            NA
#>   xxxxxxx               903096            1959               NA            NA
#>   xxxxxxx               903096            1976               NA            NA
#>   xxxxxxx               903096            1961               NA            NA
#>   xxxxxxx               903096            1952               NA            NA
#>   xxxxxxx               903096            1980               NA            NA    

Data manipulation with multiple tables

The All of Us data is spread across multiple tables, for the most part corresponding to the OMOP Common Data Model. This allows for efficient storage and retrieval of data, but it can be a bit tricky to work with at first. Fortunately, dbplyr makes it easy to join tables together.

For example, how did we know that gender_concept_id == 45878463 referred to women? We can look up the names of concept ids in the concept table:

concept_tbl <- tbl(con, "concept") %>%
  select(concept_id, concept_name)
concept_tbl
#> # Source:   SQL [?? x 2]
#> # Database: BigQueryConnection
#>    concept_id concept_name                                      
#>       <int64> <chr>                                             
#>  1   38003166 Durable Medical Equipment - General Classification
#>  2   35805830 DexaBEAM                                          
#>  3   38003221 Blood - Plasma                                    
#>  4    1147839 survey_conduct.survey_start_date                  
#>  5       8623 log reduction                                     
#>  6   38004063 Rehabilitation Practitioner                       
#>  7   38003186 Radiology - Diagnostic - General Classification   
#>  8   35805115 VNCOP-B                                           
#>  9   35805457 VAdCA                                             
#> 10       8581 heartbeat                                         
#> # ℹ more rows

We just want to extract the names of the gender concept ids. To do this, we can join the person table with the concept table. So that we can see the full range of gender ids, first we will count them:

genders_in_aou <- person_tbl %>%
  count(gender_concept_id) %>%
  left_join(concept_tbl, by = join_by(gender_concept_id == concept_id))
genders_in_aou
#> # Source:   SQL [9 x 3]
#> # Database: BigQueryConnection
#>   gender_concept_id       n concept_name                                        
#>             <int64> <int64> <chr>                                               
#> 1           1177221     602 I prefer not to answer                              
#> 2                 0      97 No matching concept                                 
#> 3          45878463  247453 Female                                              
#> 4           1585843     407 Gender Identity: Additional Options                 
#> 5            903096    7356 PMI: Skip                                           
#> 6          45880669  154241 Male                                                
#> 7           1585842     562 Gender Identity: Transgender                        
#> 8           1585841    1213 Gender Identity: Non Binary                         
#> 9        2000000002    1526 Not man only, not woman only, prefer not to answer,…

The result of this SQL query is just 9 rows, so we get to see all of them. Both the counting and the joining were done directly on the database, so this was very efficient.

aou_join()

The allofus package includes a function called aou_join() that makes it easy to join tables together. It includes some additional checks to help avoid mistakes in joining. For example, if we wanted to join the person table with the observation table, dropping people with no observations, we could do it like this:

obs <- person_tbl %>%
  aou_join("observation", type = "inner", by = "person_id")
Warning message:
“There are shared column names not specified in the `by` argument.
→ These column names now end in '_x' and '_y'.
ℹ You can change these suffixes using the `suffix` argument but it cannot
  contain periods (`.`).
→ Consider specifing all shared columns in the `by` argument.
→ Or if these additional shared columns are `NA`, remove them prior to joining.”

The warning message tells us that the person and observation tables share some column names that we didn’t specify as part of the join argument. That is because both tables have a column called provider_id. We can see this by looking at the column names of the obs table that have the default added suffix, “_x” and “_y”:

obs %>%
  select(ends_with("_x"), ends_with("_y")) %>%
  colnames()
#> [1] "provider_id_x" "provider_id_y"

Because this is often a mistake occurring because we are not working with the tables directly, aou_join() warns us about this. We can avoid this warning by specifying all of the columns that we want to join on and removing the columns that we don’t want to join on. For example, we could remove the provider_id column from the person table before joining:

obs <- person_tbl %>%
  select(-provider_id) %>%
  aou_join("observation", type = "inner", by = "person_id")

Joins from different sources?

Unfortunately, we can’t join a table on the database with a dataframe in R. If you end up with one of each, you have a couple of options:

  1. See if you can avoid collecting the dataframe into R. Most allofus functions have a collect = FALSE argument, but sometimes it’s unavoidable.
  2. Bring the table from the database into R using collect() and then join it with the dataframe in R. This can be inefficient if part of the reason for joining is to subset the table down to only data you care about.
  3. First subset the table on the database, then bring it into R and join it with the dataframe in R. For example, if you have a cohort of participants as a dataframe, e.g., as created by aou_atlas_cohort(), and you want to bring in activity data for those participants, you could run:
# instead of aou_join(cohort, "activity_summary", type = "left", by = "person_id")
activity_data <- tbl(con, "activity_summary") %>%
  filter(person_id %in% !!cohort$person_id) %>%
  collect() %>%
  right_join(cohort, by = "person_id")

Viewing the Underlying SQL with show_query()

Understanding the SQL code that dbplyr generates can be insightful, especially if you’re debugging or simply curious about the translation from R to SQL. To view the SQL query that corresponds to your dbplyr operations, use the show_query() function:

obs %>%
  show_query()
SELECT
  `edjnngldox`.`person_id` AS `person_id`,
  `gender_concept_id`,
  `year_of_birth`,
  `month_of_birth`,
  `day_of_birth`,
  `birth_datetime`,
  `race_concept_id`,
  `ethnicity_concept_id`,
  `location_id`,
  `care_site_id`,
  `person_source_value`,
  `gender_source_value`,
  `gender_source_concept_id`,
  `race_source_value`,
  `race_source_concept_id`,
  `ethnicity_source_value`,
  `ethnicity_source_concept_id`,
  `state_of_residence_concept_id`,
  `state_of_residence_source_value`,
  `sex_at_birth_concept_id`,
  `sex_at_birth_source_concept_id`,
  `sex_at_birth_source_value`,
  `observation_id`,
  `observation_concept_id`,
  `observation_date`,
  `observation_datetime`,
  `observation_type_concept_id`,
  `value_as_number`,
  `value_as_string`,
  `value_as_concept_id`,
  `qualifier_concept_id`,
  `unit_concept_id`,
  `zwcwezaowf`.`provider_id` AS `provider_id`,
  `visit_occurrence_id`,
  `visit_detail_id`,
  `observation_source_value`,
  `observation_source_concept_id`,
  `unit_source_value`,
  `qualifier_source_value`,
  `value_source_concept_id`,
  `value_source_value`,
  `questionnaire_response_id`
FROM `person` `edjnngldox`
INNER JOIN `observation` `zwcwezaowf`
  ON (`edjnngldox`.`person_id` = `zwcwezaowf`.`person_id`)

This function prints the SQL query that would be sent to the database. It’s a great way to learn SQL and understand how dbplyr optimizes data manipulation. (Why the gibberish table names? Bugs in previous versions of dbplyr resulted in table names that would break the query, and giving them unique names is a workaround.)

Running SQL code directly

Another approach to working with the data is to write SQL code directly. This is especially useful for complex queries that are difficult to express in dplyr syntax. The allofus package includes a function called aou_sql() that makes it easy to run SQL code directly on the database. For example, we could count the number of people in the person table like this:

aou_sql("SELECT COUNT(*) AS n FROM {CDR}.person")

There are a few important things to note about this code. First, the CDR variable is a special variable referring to what All of Us calls the “curated data repository”. When writing SQL code directly, we don’t need the database connection object con, instead we need to direct the code to the correct tables by preceding the table names with “{CDR}”. This means we can’t run the code we get from show_query() without modification. For example, we could count the number of young women in the dataset, as we did above with the dbplyr approach, like this:

aou_sql("
SELECT count(*) AS `n`
FROM (
  SELECT `person`.*
  FROM {CDR}.`person`
  WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)
")

Second, the aou_sql() function returns a dataframe – the entire result of the SQL query is brought into memory. This means that we want to run an entire query at once, instead of breaking it into multiple steps like we did with dbplyr.