Data import and cleaning

This notebook is a starting point for the Special Education project following up the Houston Chronicle’s Denied series. This project has the data necessary to complete the Denied Cleaning chapter in Reporting with Data in R.

Setup

In this step we are setting up the packages we’ll use, tidyverse and janitor.

library(tidyverse)
library(janitor)

Working on DSTUD data

We’ll use a function called list.files() to create a vector of the files we want. The functions below simply pulls up the list of files that we are going to use.

dstud_files <- list.files(
  "data-raw",
  pattern = "DSTUD",
  full.names = TRUE
)

dstud_files

 [1] "data-raw/DSTUD_13.csv" "data-raw/DSTUD_14.csv" "data-raw/DSTUD_15.csv"
 [4] "data-raw/DSTUD_16.csv" "data-raw/DSTUD_17.csv" "data-raw/DSTUD_18.csv"
 [7] "data-raw/DSTUD_19.csv" "data-raw/DSTUD_20.csv" "data-raw/DSTUD_21.csv"
[10] "data-raw/DSTUD_22.csv" "data-raw/DSTUD_23.csv"

We use map() to walk us through the list where we want to apply read_csv() to call out specific columns. Then, we’ll bind the files together with list_rbind() and finish out by normalizing the names.

dstud_raw <- dstud_files |>
  set_names(basename) |>
  map(
    read_csv,
    col_select = c(DISTRICT, DPETALLC, DPETSPEC, DPETSPEP)
  ) |> 
  list_rbind(names_to = "source") |>
  clean_names()

Rows: 1228 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1227 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1219 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1207 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1203 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1200 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1201 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1202 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1204 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1207 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 1209 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): DISTRICT
dbl (3): DPETALLC, DPETSPEC, DPETSPEP

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dstud_raw |> nrow()

[1] 13307

dstud_raw |> head()

dstud_raw |> tail()

We also need to clean the file to remove the apostrophe from district and to build a year variable out of the source.

dstud <- dstud_raw |>
  mutate(
    district = str_remove(district, "'"),
    year = str_c("20", str_sub(source, 7, 8))
  ) |> 
  select(!source)

dstud |> head()

dstud |> tail()

Since the districts aren’t labeled, we can use the DISTRICT variable to “join” the district name to all the other data.

We need the DFLCHART and DFLALTED fields so we can filter out charter and alternative education schools. For this analysis, we only want “traditional” public schools.

Working on reference data

We’ll use read_csv to pull in the variables that we need.

dref_raw <- read_csv(
  "data-raw/DREF_22.csv",
  col_select = c(
    DISTRICT,
    CNTYNAME,
    DISTNAME,
    DFLCHART,
    DFLALTED
    )
  ) |> 
  clean_names()

Rows: 1207 Columns: 5
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): DISTRICT, CNTYNAME, DFLALTED, DFLCHART, DISTNAME

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dref_raw |> head()

Now that we have our reference file, we need to clean up the apostrophes so we can join this data to our student data.

dref <- dref_raw |> 
  mutate(district = str_remove(district, "'"))

dref |> head()

Joining the reference data

In this case, we need to use inner_join to merge the unique district number to its actual name that we can work with later.

sped_joined <- dref |>
  inner_join(dstud, by = "district")

sped_joined |> head()

Let’s glimpse() your new data so you can see all the columns have been added. There should be 13087 rows of joined data.

sped_joined |> glimpse()

Rows: 13,087
Columns: 9
$ district <chr> "001902", "001902", "001902", "001902", "001902", "001902", "…
$ cntyname <chr> "ANDERSON", "ANDERSON", "ANDERSON", "ANDERSON", "ANDERSON", "…
$ distname <chr> "CAYUGA ISD", "CAYUGA ISD", "CAYUGA ISD", "CAYUGA ISD", "CAYU…
$ dflchart <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "…
$ dflalted <chr> "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "N", "…
$ dpetallc <dbl> 595, 553, 577, 568, 576, 575, 564, 557, 535, 574, 593, 1236, …
$ dpetspec <dbl> 73, 76, 76, 78, 82, 83, 84, 82, 78, 84, 83, 113, 107, 126, 14…
$ dpetspep <dbl> 12.3, 13.7, 13.2, 13.7, 14.2, 14.4, 14.9, 14.7, 14.6, 14.6, 1…
$ year     <chr> "2013", "2014", "2015", "2016", "2017", "2018", "2019", "2020…

Cleaning the data

In this section we will filter, select, and remove certain columns to focus on traditional schools in our analysis. We will also remove and rename columns to make them more descriptive.

sped_cleaned <- sped_joined |> 
  filter(dflalted == "N" & dflchart == "N") |> 
  select(
    district,
    distname,
    cntyname,
    year,
    all_count = dpetallc,
    sped_count = dpetspec,
    sped_percent = dpetspep
  )

Adding an audit benchmark

We want to add a column called audit_flag that tells us where the sped_percent is above 8.5. This is a simple true/false condition that is perfect for the if_else() function.

sped_flag <- sped_cleaned |>
  mutate(audit_flag = if_else(
    sped_percent > 8.5,
    "ABOVE",
    "BELOW"
  ))

# this pulls 10 random rows so I can check results
sped_flag |>
  sample_n(10) |>
  select(distname, sped_percent, audit_flag)

Exporting the data

As always, we use write_rds() to locate the path to processed folder.

sped_flag |>
   write_rds("data-processed/01-sped.rds")