# Install from CRAN
install.packages("allofus")
# Or install development version
install.packages("remotes")
remotes::install_github("roux-ohdsi/allofus")
All of Us in R
The Basics
Before diving into the specifics, let’s clarify some vocabulary:
- SQL database : A structured collection of data where information is stored in tables. Each table is like a spreadsheet with rows and columns. Data can be added, removed, or modified using SQL queries.
- SQL query : A request for data from a database written in a language called SQL (Structured Query Language).
- Google BigQuery : A cloud-based SQL database that is used to store the All of Us data.
- bigrquery : An R package that allows you to interact with Google BigQuery from R.
- dplyr : A part of the tidyverse in R, dplyr is a package for data manipulation. It provides a set of functions that can be used to filter, select, arrange, mutate, summarize, and join data.
- dbplyr : Also a part of the tidyverse in R, dbplyr is a database backend for dplyr. It allows you to write R code that is then translated into SQL queries.
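To make that last point concrete, here is a minimal sketch that runs outside All of Us; it assumes only that the dplyr, dbplyr, and RSQLite packages are installed. memdb_frame() copies a small data frame into a throwaway in-memory SQLite database, a stand-in for BigQuery:

```r
library(dplyr)
library(dbplyr)

# A small table in an in-memory SQLite database (a stand-in for BigQuery)
db_tbl <- memdb_frame(id = 1:3, year_of_birth = c(1975, 1985, 1995))

# This pipeline is not executed in R; it is translated into SQL
query <- db_tbl %>%
  filter(year_of_birth > 1980) %>%
  summarize(n = n())

# show_query() prints the SQL that would be sent to the database
show_query(query)
```

The same mechanism is what lets the All of Us code later in this article run on BigQuery instead of in your R session.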
Installing and Loading allofus
You can install the allofus R package directly from CRAN, or you can install the development version from Github (https://github.com/roux-ohdsi/allofus).
Use the library() command to load the allofus package and the dplyr package. Most functionality in allofus relies on the dplyr package, which comes pre-installed in the Researcher Workbench. Installing allofus will also make sure that the version of dplyr is up-to-date.
library(allofus)
library(dplyr)
Accessing Data
Connecting to the database
A connection to a database is an object that allows you to interact with it from R. The allofus package relies on the bigrquery R package to create a connection to the Google BigQuery database when you run aou_connect().
con <- aou_connect()
The object con is used to refer to the connection to the All of Us database. When you run the aou_connect() function, the connection also gets stored as the default for the session, so you don’t need to include it in other functions from the allofus package. For instance, you can run aou_tables() to see a list of tables in the database:
aou_tables()
#> # A tibble: 68 × 2
#> table_name columns
#> <chr> <chr>
#> 1 concept_ancestor ancestor_concept_id, descendant_concept_id, min_lev…
#> 2 cb_criteria_ancestor ancestor_id, descendant_id
#> 3 attribute_definition attribute_definition_id, attribute_name, attribute_…
#> 4 cdm_source cdm_source_name, cdm_source_abbreviation, cdm_holde…
#> 5 concept_class concept_class_id, concept_class_name, concept_class…
#> 6 concept concept_id, concept_name, domain_id, vocabulary_id,…
#> 7 concept_synonym concept_id, concept_synonym_name, language_concept_…
#> 8 cb_criteria_relationship concept_id_1, concept_id_2
#> 9 concept_relationship concept_id_1, concept_id_2, relationship_id, valid_…
#> 10 condition_occurrence condition_occurrence_id, person_id, condition_conce…
#> # ℹ 58 more rows
Accessing a table
To access a table in the database, use tbl(con, "tablename"). The resulting object is a reference to a table in the database that allows you to interact with it from R. In order to connect to a table, you must first create the database connection (con). For example, to create a reference to the person table:
person_tbl <- tbl(con, "person")
Although the person_tbl object behaves similarly to a data frame, it is not actually a data frame. Instead, it is a SQL query that only gets run when you need to access the data. This is a feature of the dbplyr package that allows you to manipulate data without actually retrieving it. The SQL query behind the person_tbl object is:
SELECT *
FROM `person`
When you print person_tbl, you’ll get something like this:
person_tbl
#> # Source: table<person> [?? x 23]
#> # Database: BigQueryConnection
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> 1 xxxxxxx 903096 1955 NA NA
#> 2 xxxxxxx 903096 1978 NA NA
#> 3 xxxxxxx 903096 2000 NA NA
#> 4 xxxxxxx 903096 1988 NA NA
#> 5 xxxxxxx 903096 1993 NA NA
#> 6 xxxxxxx 903096 1959 NA NA
#> 7 xxxxxxx 903096 1976 NA NA
#> 8 xxxxxxx 903096 1961 NA NA
#> 9 xxxxxxx 903096 1952 NA NA
#> 10 xxxxxxx 903096 1980 NA NA
#> # ℹ more rows
#> # ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
#> # ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
#> # care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
#> # gender_source_concept_id <int64>, race_source_value <chr>,
#> # race_source_concept_id <int64>, ethnicity_source_value <chr>,
#> # ethnicity_source_concept_id <int64>, …
You’ll only see the first 10 rows of the person table (person ids were omitted from the output). This allows you to see what the data looks like without loading the entire table into R, which can be slow or even crash your session.
Instead, you want to perform as much data manipulation as possible on the database. This is more efficient because the operations are translated into SQL and executed on the server, which is faster and requires less memory than processing in R.
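As a small sketch of what that looks like in practice (using the person_tbl reference created above; the specific summary below is illustrative, not from the text), a summary can be computed entirely on the server so that only a one-row result travels back to R:

```r
# Computed in BigQuery; only the one-row summary is returned to R
person_tbl %>%
  summarize(min_year = min(year_of_birth, na.rm = TRUE),
            max_year = max(year_of_birth, na.rm = TRUE))
```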
Data manipulation on the database
Before bringing data into R, you can manipulate it on the database using dplyr functions. This allows you to perform operations on the database without bringing the data into R’s memory.
For example, you can subset the person table to women born after 1980:
young_women <- person_tbl %>%
  filter(gender_concept_id == 45878463, year_of_birth > 1980)
Until we print out young_women, the SQL query is not actually run. (The query behind person_tbl was never run either.) When we do print it, it will run the following SQL:
SELECT `person`.*
FROM `person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
and print the first 10 rows:
young_women
#> # Source: SQL [?? x 23]
#> # Database: BigQueryConnection
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> 1 xxxxxxx 45878463 1992 NA NA
#> 2 xxxxxxx 45878463 1989 NA NA
#> 3 xxxxxxx 45878463 1981 NA NA
#> 4 xxxxxxx 45878463 1990 NA NA
#> 5 xxxxxxx 45878463 1990 NA NA
#> 6 xxxxxxx 45878463 1985 NA NA
#> 7 xxxxxxx 45878463 1987 NA NA
#> 8 xxxxxxx 45878463 1986 NA NA
#> 9 xxxxxxx 45878463 1983 NA NA
#> 10 xxxxxxx 45878463 1998 NA NA
#> # ℹ more rows
#> # ℹ 18 more variables: birth_datetime <dttm>, race_concept_id <int64>,
#> #   ethnicity_concept_id <int64>, location_id <int64>, provider_id <int64>,
#> #   care_site_id <int64>, person_source_value <chr>, gender_source_value <chr>,
#> #   gender_source_concept_id <int64>, race_source_value <chr>,
#> #   race_source_concept_id <int64>, ethnicity_source_value <chr>,
#> #   ethnicity_source_concept_id <int64>, …
Note that we don’t know how many observations match these conditions yet (the dimensions are [?? x 23]), because the query hasn’t been fully executed – only enough to print the first 10 rows. To get the total number of observations, we can use tally():
tally(young_women)
#> # Source: SQL [1 x 1]
#> # Database: BigQueryConnection
#> n
#> <int64>
#> 1 76135
This is itself a SQL query, and its result is just 1 row, so we do get to see the entire thing. It’s much faster than bringing the entire table into R and then counting the rows, because the code is executed on the database:
SELECT count(*) AS `n`
FROM (
SELECT `person`.*
FROM `person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)
The collect() Function
We can bring the result of a query into the local R session using collect():
young_women %>% collect()
This brings the table into your local R workspace as a tibble, or data frame. This is useful for performing operations that cannot be done directly on the database, such as certain statistical analyses or plotting. For example, if you’re planning to run a regression analysis on the filtered data, you would first use collect() to bring the data into R.
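As a hedged sketch of that workflow (assuming the young_women query from above, plus the ggplot2 package, which this article does not itself load):

```r
library(ggplot2)

# Bring only the filtered rows into R...
young_women_df <- young_women %>% collect()

# ...then plot locally, something that cannot run on the database
ggplot(young_women_df, aes(x = year_of_birth)) +
  geom_bar()
```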
We can bring in the result of tally() as well:
tally(young_women) %>% collect()
#> # A tibble: 1 × 1
#> n
#> <int64>
#> 1 76135
Or even the entire person table, although that’s not recommended because it’s so large!
person_data <- person_tbl %>% collect()
person_data
#> # A tibble: 413457 × 23
#> person_id gender_concept_id year_of_birth month_of_birth day_of_birth
#> <int64> <int64> <int64> <int64> <int64>
#> xxxxxxx 903096 1955 NA NA
#> xxxxxxx 903096 1978 NA NA
#> xxxxxxx 903096 2000 NA NA
#> xxxxxxx 903096 1988 NA NA
#> xxxxxxx 903096 1993 NA NA
#> xxxxxxx 903096 1959 NA NA
#> xxxxxxx 903096 1976 NA NA
#> xxxxxxx 903096 1961 NA NA
#> xxxxxxx 903096 1952 NA NA
#> xxxxxxx 903096 1980 NA NA
Data manipulation with multiple tables
The All of Us data is spread across multiple tables, for the most part corresponding to the OMOP Common Data Model. This allows for efficient storage and retrieval of data, but it can be a bit tricky to work with at first. Fortunately, dbplyr makes it easy to join tables together.
For example, how did we know that gender_concept_id == 45878463 referred to women? We can look up the names of concept ids in the concept table:
concept_tbl <- tbl(con, "concept") %>%
  select(concept_id, concept_name)
concept_tbl
#> # Source: SQL [?? x 2]
#> # Database: BigQueryConnection
#> concept_id concept_name
#> <int64> <chr>
#> 1 38003166 Durable Medical Equipment - General Classification
#> 2 35805830 DexaBEAM
#> 3 38003221 Blood - Plasma
#> 4 1147839 survey_conduct.survey_start_date
#> 5 8623 log reduction
#> 6 38004063 Rehabilitation Practitioner
#> 7 38003186 Radiology - Diagnostic - General Classification
#> 8 35805115 VNCOP-B
#> 9 35805457 VAdCA
#> 10 8581 heartbeat
#> # ℹ more rows
We just want to extract the names of the gender concept ids. To do this, we can join the person table with the concept table. So that we can see the full range of gender ids, we will first count them:
genders_in_aou <- person_tbl %>%
  count(gender_concept_id) %>%
  left_join(concept_tbl, by = join_by(gender_concept_id == concept_id))
genders_in_aou
#> # Source: SQL [9 x 3]
#> # Database: BigQueryConnection
#> gender_concept_id n concept_name
#> <int64> <int64> <chr>
#> 1 1177221 602 I prefer not to answer
#> 2 0 97 No matching concept
#> 3 45878463 247453 Female
#> 4 1585843 407 Gender Identity: Additional Options
#> 5 903096 7356 PMI: Skip
#> 6 45880669 154241 Male
#> 7 1585842 562 Gender Identity: Transgender
#> 8 1585841 1213 Gender Identity: Non Binary
#> 9 2000000002 1526 Not man only, not woman only, prefer not to answer,…
The result of this SQL query is just 9 rows, so we get to see all of them. Both the counting and the joining were done directly on the database, so this was very efficient.
aou_join()
The allofus package includes a function called aou_join() that makes it easy to join tables together. It includes some additional checks to help avoid mistakes in joining. For example, if we wanted to join the person table with the observation table, dropping people with no observations, we could do it like this:
obs <- person_tbl %>%
  aou_join("observation", type = "inner", by = "person_id")
Warning message:
“There are shared column names not specified in the `by` argument.
→ These column names now end in '_x' and '_y'.
ℹ You can change these suffixes using the `suffix` argument but it cannot
contain periods (`.`).
→ Consider specifing all shared columns in the `by` argument.
→ Or if these additional shared columns are `NA`, remove them prior to joining.”
The warning message tells us that the person and observation tables share some column names that we didn’t specify as part of the by argument. That is because both tables have a column called provider_id. We can see this by looking at the column names of the obs table that have the default added suffixes, “_x” and “_y”:
obs %>%
  select(ends_with("_x"), ends_with("_y")) %>%
  colnames()
#> [1] "provider_id_x" "provider_id_y"
Because duplicated columns like this are often a sign of a mistake, which is easy to make when we are not working with the tables directly, aou_join() warns us about them. We can avoid the warning by specifying all of the columns we want to join on, or by removing the columns we don’t want to join on before joining. For example, we could remove the provider_id column from the person table:
obs <- person_tbl %>%
  select(-provider_id) %>%
  aou_join("observation", type = "inner", by = "person_id")
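Alternatively, as the warning itself notes, you can keep both copies of the shared column and give them clearer names via the suffix argument. A sketch of that option (the suffix names here are just illustrative):

```r
# Keep both provider_id columns, with descriptive suffixes instead of _x/_y
obs <- person_tbl %>%
  aou_join("observation", type = "inner", by = "person_id",
           suffix = c("_person", "_obs"))
```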
Joins from different sources?
Unfortunately, we can’t join a table on the database with a dataframe in R. If you end up with one of each, you have a couple of options:
- See if you can avoid collecting the dataframe into R. Most allofus functions have a collect = FALSE argument, but sometimes it’s unavoidable.
- Bring the table from the database into R using collect() and then join it with the dataframe in R. This can be inefficient if part of the reason for joining is to subset the table down to only data you care about.
- First subset the table on the database, then bring it into R and join it with the dataframe in R. For example, if you have a cohort of participants as a dataframe, e.g., as created by aou_atlas_cohort(), and you want to bring in activity data for those participants, you could run:
# instead of aou_join(cohort, "activity_summary", type = "left", by = "person_id")
activity_data <- tbl(con, "activity_summary") %>%
  filter(person_id %in% !!cohort$person_id) %>%
  collect() %>%
  right_join(cohort, by = "person_id")
Viewing the Underlying SQL with show_query()
Understanding the SQL code that dbplyr generates can be insightful, especially if you’re debugging or simply curious about the translation from R to SQL. To view the SQL query that corresponds to your dbplyr operations, use the show_query() function:
obs %>%
  show_query()
SELECT
`edjnngldox`.`person_id` AS `person_id`,
`gender_concept_id`,
`year_of_birth`,
`month_of_birth`,
`day_of_birth`,
`birth_datetime`,
`race_concept_id`,
`ethnicity_concept_id`,
`location_id`,
`care_site_id`,
`person_source_value`,
`gender_source_value`,
`gender_source_concept_id`,
`race_source_value`,
`race_source_concept_id`,
`ethnicity_source_value`,
`ethnicity_source_concept_id`,
`state_of_residence_concept_id`,
`state_of_residence_source_value`,
`sex_at_birth_concept_id`,
`sex_at_birth_source_concept_id`,
`sex_at_birth_source_value`,
`observation_id`,
`observation_concept_id`,
`observation_date`,
`observation_datetime`,
`observation_type_concept_id`,
`value_as_number`,
`value_as_string`,
`value_as_concept_id`,
`qualifier_concept_id`,
`unit_concept_id`,
`zwcwezaowf`.`provider_id` AS `provider_id`,
`visit_occurrence_id`,
`visit_detail_id`,
`observation_source_value`,
`observation_source_concept_id`,
`unit_source_value`,
`qualifier_source_value`,
`value_source_concept_id`,
`value_source_value`,
`questionnaire_response_id`
FROM `person` `edjnngldox`
INNER JOIN `observation` `zwcwezaowf`
ON (`edjnngldox`.`person_id` = `zwcwezaowf`.`person_id`)
This function prints the SQL query that would be sent to the database. It’s a great way to learn SQL and understand how dbplyr optimizes data manipulation. (Why the gibberish table names? Bugs in previous versions of dbplyr resulted in table names that would break the query, and giving the tables unique names is a workaround.)
Running SQL code directly
Another approach to working with the data is to write SQL code directly. This is especially useful for complex queries that are difficult to express in dplyr syntax. The allofus package includes a function called aou_sql() that makes it easy to run SQL code directly on the database. For example, we could count the number of people in the person table like this:
aou_sql("SELECT COUNT(*) AS n FROM {CDR}.person")
There are a few important things to note about this code. First, the CDR variable is a special variable referring to what All of Us calls the “curated data repository”. When writing SQL code directly, we don’t need the database connection object con; instead, we direct the code to the correct tables by preceding the table names with “{CDR}”. This means we can’t run the code we get from show_query() without modification. For example, we could count the number of young women in the dataset, as we did above with the dbplyr approach, like this:
aou_sql("
SELECT count(*) AS `n`
FROM (
SELECT `person`.*
FROM {CDR}.`person`
WHERE (`gender_concept_id` = 45878463.0) AND (`year_of_birth` > 1980.0)
)
")
Second, the aou_sql() function returns a dataframe – the entire result of the SQL query is brought into memory. This means that we want to run an entire query at once, instead of breaking it into multiple steps as we did with dbplyr.
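Because aou_sql() returns an ordinary dataframe, the result can be used immediately with plain R. A sketch, assuming the Researcher Workbench environment (the GROUP BY query below is illustrative, not taken from the text above):

```r
# Count participants by gender directly in SQL; the small result
# comes back as a regular dataframe
gender_counts <- aou_sql("
  SELECT gender_concept_id, COUNT(*) AS n
  FROM {CDR}.person
  GROUP BY gender_concept_id
")

# From here on it is ordinary in-memory R
gender_counts$prop <- gender_counts$n / sum(gender_counts$n)
```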