Relational Databases in R, Part 2

In part 1 of this series, the relational model for structuring data was introduced in terms of Hadley Wickham’s tidy data. This post demonstrates how “flat” data can be restructured as a three-table relational database using standard tidyverse tools.

Introduction to Dataset

The dataset used in this example was downloaded from the Michigan government site MI School Data. It holds contact information and various other data points for all current Michigan public educational entities. All 22 variables are shown with the dplyr::glimpse() function:

Rows: 5,010
Columns: 22
$ SchoolYear      <chr> "22 - 23 School Year", "22 - 23 School Year", "22 - 23…
$ ISDCode         <chr> "03", "03", "03", "03", "03", "03", "03", "03", "03", …
$ ISDName         <chr> "Allegan Area Educational Service Agency", "Allegan Ar…
$ DistrictCode    <chr> "03000", "03000", "03000", "03000", "03000", "03010", …
$ DistrictName    <chr> "Allegan Area Educational Service Agency", "Allegan Ar…
$ BuildingCode    <chr> "00000", "03816", "03816", "06730", "07254", "00000", …
$ BuildingName    <chr> "Allegan Area Educational Service Agency", "Early Coll…
$ COUNTY_CODE     <chr> "03", "03", "03", "03", "03", "03", "39", "03", "03", …
$ COUNTY_NAME     <chr> "Allegan", "Allegan", "Allegan", "Allegan", "Allegan",…
$ EntityType      <chr> "ISD", "ISD Non-Instructional Ancillary Facility", "IS…
$ SCHOOL_LEVEL    <chr> "Elem thru High School", "Other", "Other", "Elem thru …
$ LOCALE_NAME     <chr> "Town: Distant", "Not Specified", "Not Specified", "To…
$ SCHOOL_EMPHASIS <chr> "Not Applicable", "Not Applicable", "Not Applicable", …
$ SETTING         <chr> NA, NA, NA, "Multiple Settings", "Multiple Settings", …
$ EMAIL_ADDRESS   <chr> "william.brown@alleganaesa.org", "yhouser@alleganaesa.…
$ PhoneNumber     <dbl> 2695127705, 2695127801, 2695127801, 2695127900, 269512…
$ ADDRESS_LINE_1  <chr> "310 Thomas St", "310 Thomas St", "310 Thomas St", "21…
$ CITY            <chr> "Allegan", "Allegan", "Allegan", "Allegan", "Allegan",…
$ STATE           <chr> "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI", "MI", …
$ ZIP_CODE        <dbl> 49010, 49010, 49010, 49010, 49010, 49080, 49009, 49080…
$ SCHOOL_TYPE     <chr> NA, NA, NA, "Special Education", "Vocational/CTE", NA,…
$ Status          <chr> "Open-Active", "Open-Active", "Open-Active", "Open-Act…

This is a flat dataset as it contains three observational units: Intermediate school districts (ISDs, also referred to as regional educational service agencies, RESAs), school districts, and buildings (typically schools but inclusive of administrative offices and other facilities). These three observational units are hierarchically nested, with ISDs containing districts, which contain schools. The “flatness” of the data is apparent as some rows represent ISDs and the buildings they operate outside of any one district (i.e., administrative offices, supplemental educational facilities). Other rows represent districts and their administrative offices. Generally each row represents a unique building with a distinct address, however some buildings are listed multiple times. This likely represents the co-location of different types of facilities (e.g., administrative offices located in a K-12 school). Some rows appear to be duplicated for no apparent reason ISDs themselves may contain one or more buildings directly, without an intervening district. Often this occurs because ISD administrative offices are located in their own building rather in one local districts. The first step Because the dataset is flat, summary statistics cannot easily be calculated for each level without using complex filtering conditions and grouping.

We won’t be using all these variables and will rename the relevant ones so that they are easier to work with. Also it appears that some closed buildings/schools are present, as well as a few duplicate rows present, so we will remove those (using filter() and unique()) before we start.

Code

mischool_flat <- mischool_flat |> 
  filter(Status == "Open-Active") |> 
  distinct() |>
  select(
    isd_code     = ISDCode,
    isd_name     = ISDName,
    dst_code     = DistrictCode,
    dst_name     = DistrictName,
    bld_code     = BuildingCode,
    bld_name     = BuildingName,
    bld_type     = SCHOOL_EMPHASIS,
    bld_level    = SCHOOL_LEVEL,
    ent_category = EntityType,
    add_street   = ADDRESS_LINE_1,
    add_city     = CITY,
    add_county   = COUNTY_NAME,
    add_zip      = ZIP_CODE
  )

Identifying key variables

Given that there are three observational units, our relational database should include three tables. Each table will require a primary key and at least one foreign key. Our first step will be to establish which variables will serve as primary keys for each table. The only criterion for primary keys is that they have unique values for each observation in their table.

The tables below show that for isd_code, dst_code, and bld_code, not all values are unique in the flat format.

Code

# The occurence of each potential key variable is counted and filtered for values that occur more than once.
# For a better presentation, counts are arranged in ascending order
# kable(head(mischool_flat |> count(isd_code) |> filter(n > 1) |> arrange(n)))
# kable(head(mischool_flat |> count(dst_code) |> filter(n > 1) |> arrange(n)))
# kable(head(mischool_flat |> count(sch_code) |> filter(n > 1) |> arrange(n)))
mischool_flat |> count(isd_code) |> filter(n > 1) |> arrange(n)
mischool_flat |> count(dst_code) |> filter(n > 1) |> arrange(n)
mischool_flat |> count(bld_code) |> filter(n > 1) |> arrange(n)

Table 1: Examining potential key variables

However, if we take the combination of isd_code, dst_code, and bld_code, we find that each row has a unique combined value (the number of distinct combinations, 4882, is the same as the total number of rows, 4882). Thus, it seems that isd_code, dst_code, and sch_code may be good key variables for the three tables that we want to construct.

Creating ISD, District, and School Tables

The administrative offices of ISDs and districts are indicated by a bld_code value of “00000”, according to the available documentation. Oftentimes, these entries duplicate the address (among other variables) of schools with students, representing how administrative offices often share a building with a K-12 school. Thus, bld_code != 00000 can be used as a filtering condition to separate out all entries representing schools.

In a similar manner, all ISD administrative offices appear to have a dst_code value that ends with three zeros. Thus, a condition with str_detect() can be constructed to separate out all entries representing district offices. Using these filtering conditions, we create our tables. Another check, shows that each table has a primary variable - as indicated by the fact there there are no duplicate values in the respective variables:

Code

mischool_database_sch <- mischool_flat |> 
  filter(bld_code != "00000")

mischool_database_dst <- mischool_flat |> 
  filter(!str_detect(dst_code, "000$") & bld_code == "00000") |> 
  select(
    isd_code,
    dst_code,
    bld_code,
    dst_name,
    ent_category,
    add_city,
    add_county
  )

mischool_database_isd <- mischool_flat |>
  filter(str_detect(dst_code, "000$") & bld_code == "00000") |> 
  select(
    isd_code,
    isd_name,
    dst_code,
    bld_code,
    ent_category,
    add_county
  )

# A simpler way to determine if each variable is a good primary key is to find the number of values which appear more than once.
# A primary key should never appear more than once!
# mischool_database_sch |> count(sch_code) |> filter(n > 1)
# kable(head(mischool_database_dst |> count(dst_code) |> filter(n > 1) |> arrange(n)))
# kable(head(mischool_database_isd |> count(isd_code) |> filter(n > 1) |> arrange(n)))
mischool_database_sch |> count(bld_code) |> filter(n > 1)
mischool_database_dst |> count(dst_code) |> filter(n > 1)
mischool_database_isd |> count(isd_code) |> filter(n > 1)

ISD Code

{#tbl-anonymous-1438179-1}

District Code

{#tbl-anonymous-1438179-2}

School Code

{#tbl-anonymous-1438179-3}

Checking primary key variables

Below is shown the school table which contains two additional variables isd_code and dst_code which act as foreign keys for the other two tables.

School Table

With our three tables, we have a relational database. Now, data can be added to each table and extracted for analysis in a consistent manner. Let’s start in the next post by adding some summary statistics that will make the database more useful.

Saving the Database as Separate Tables

It is important to save this state of the database as it represents the end of the beginning of cleaning and restructuring of the MiSchool dataset. Reproducing this database will be very simple and if any breaking changes are introduced in the MiSchool dataset (changing variable names, dropped variables, etc.) then it will be easy to introduce fixes.

Code

filepath <- here("posts", "learning_relational_databases_p2", "data", "relational_database")

write_csv(mischool_database_sch, here(filepath, "mischool_database_sch.csv"))
write_csv(mischool_database_dst, here(filepath, "mischool_database_dst.csv"))
write_csv(mischool_database_isd, here(filepath, "mischool_database_isd.csv"))