Relational Databases in R, Part 1

Data is often hierarchical, structuring it in a relational database is an efficient way to account for this.
relational databases
learning R
data science
Author

Aaron Wenger

Published

May 13, 2023

The first step in any data analysis workflow is ensuring that the data is clean and has a consistent, known structure. Analysis scripts that start with clean, consistent data are simpler, facilitate easy exploratory data analysis, and are extensible. There are at least several ways for data to be structured, but likely only one that is well-suited to the data analysis tools which will be used. The tidyverse tools in R - and other tools which follow “tidy data” conventions - are “tidy tools” meaning they take tidy data and output tidy data.

Hadley Wickham, one of the main developers of tidyverse packages, wrote a technical paper explaining the concepts of tidy data and tidy tools (Wickham 2014). Thankfully he also coauthored a very approachable book (Wickham, Çetinkaya-Rundel, and Grolemund 2023) in which tidy data is explained in simpler terms. In a few words, tidy data follows three principles:

  1. each variable has its own column
  2. each observation has its own row
  3. each value has its own cell

The third principle may seem obvious but it becomes important in more complex data structures. Figure 1, from chapter 12, shows these three principles visually.

Figure 1: Tidy Data Principles

Introduction to Relational Databases

Often when data is collected, it will have a nested or hierarchical structure. This especially true for education data. For instance, students are nested in classrooms which are nested in schools, which are nested in districts … you get the point. At each level of nesting different variables will be relevant which affects data collection and analysis. For example, students have test scores and schools have average test scores - if all individual test scores are collected, it doesn’t make sense to contact the principle and collect the school average score. And yet we may still collect other, non-aggregated information about the school (public/sectarian, urban/rural, etc.) which are relevant to the analysis.

Each level in a hierarchical dataset is called an observational unit. In order to get this data into a tidy format it will need to restructured so that each observational unit gets it own table. Remember the third principle of tidy data? The third principle may be expressed differently as “each value is expressed only once” (my words) or “each observational unit has its own table” (Wickham 2014). In the above example, all student observations would be collected in one table and all school observations in another.

A set of tables are related to each using key variables, which allow the observations in one table to be related to observations in another. For example, in order to identify the school a student attends, a school_id variable is given to each student.

There are two types of keys. One primary key in included in each table to uniquely identify each observation. One or more foreign keys are included in each table in order to relate observations to those tables where the key is primary. So on the student table, “student_id” is the primary key and “school_id” is the foreign key, corresponding to the primary key on the school table.

Why Use Relational Databases?

From a quick reading of the Wikipedia page on the relational model (data science term for relational databases) there are alternative ways to structure nested data. What those alternatives are and what the relative theoretical advantages of the relational model are, I cannot say. (Perhaps that’s fodder for another post.) Relational databases do have the advantage of being both human- and machine-readable. This means that code can be easily written to manipulate a relational database and people can also open up an excel spreadsheet and peruse each table.

When compared to “flat” data (all variables and values on one table), I have found two distinct advantages: interpretability and extensibility. The relational model represents a structural aspect of the system being investigated in the structure of the data. This enables much more intuitive assessments of the data and interpretations of the results than is possible with a flat structure. Additionally, a flat structure is constraining when a project evolves and more data is collected. Either new variables must added or existing variables must be renamed/redefined to incorporate the new observational unit. In a relational database, a new table and key variable(s) is simply added.

Tidyverse’s dplyr package has “two table verbs” which make it super easy to manipulate data structured as a relational database. One table verbs are functions that take one table as input, perform some operation, and output the resulting table. Two table verbs do the very same thing except they take two tables as input, require a matching variable (a key), and output a single table.

In part two, an example of a “flat” data file will be used to show how a relational database can be constructed. In part three (coming soon), our relational database will be used to generate summary statistics and extended to a fourth table.

References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10).
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd edition. Sebastopol: O’Reilly. https://r4ds.hadley.nz/.