library(tidyverse)
library(janitor)
library(DT)Cleaning
The goal of this notebook is to prepare data from the Texas Commission on Environmental Quality - Notices Of Violation (NOV) and Violation Citations - for further analysis. Cleaned data means each data source will have clean variable names, correct data types, and clean categorical values.
Set up
The first step is to load the library packages that will assist with certain tasks in this project.
Importing the data
The data I downloaded comes from the Texas Open Data Portal. It is already in my files’ data_raw folder, so the next steps would be to import it, assign it to an R object, and glimpse at it.
To lowercase the columns into code-friendly format, I will use the clean_names() function.
citations_raw <- read_csv("data_raw/20241112.csv") |> clean_names()Rows: 241025 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Investigation Approved Date, Current Violation Status, Violation S...
dbl (3): Investigation Number, Notice of Violation ID, Violation Tracking N...
lgl (1): Violation Repeat Indicator
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
It is a good idea to glimpse at the data to ensure the columns are in proper format.
citations_raw |> glimpse()Rows: 241,025
Columns: 15
$ investigation_number <dbl> 1708769, 912908, 1385762, 1747889, 1…
$ investigation_approved_date <chr> "May 13, 2021", "Jun 20, 2011", "Nov…
$ notice_of_violation_id <dbl> 5.734687e+14, 1.123664e+14, 4.860436…
$ violation_tracking_number <dbl> 775850, 420378, 629005, 885714, 7216…
$ current_violation_status <chr> "RESOLVED", "RESOLVED", "RESOLVED", …
$ violation_status_date <chr> "Aug 26, 2021", "Sep 29, 2011", "Nov…
$ violation_allegation <chr> "Failure to maintain inadequate stor…
$ violation_resolution <chr> "On July 9, 2021, the TCEQ DFW Regio…
$ violation_repeat_indicator <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ compliance_due_date <chr> "Jun 13, 2021", "Jul 20, 2011", NA, …
$ compliance_history_classification <chr> "MODERATE", "MINOR", "MODERATE", "MO…
$ violation_category <chr> "B", "C", "B", "B", "B", "B", "C", "…
$ investigation_type <chr> "Compliance Investigation", "Complia…
$ county <chr> "JOHNSON", "TRINITY", "RAINS", "POLK…
$ tceq_region <chr> "REGION 04 - DFW METROPLEX", "REGION…
Right off the bat, I notice that I have to change the date formats and the investigation number, notice of violation ID, and violation tracking number columns to be character values since they are categorical variables.
Fixing the formats
For the date to not show up as a numerical value, I will use one of the lubridate functions, mutate(), to modify it. I referred to lubridate documentation for the right format.
I am arranging the dates in a descending order to see the most recent data.
citations_clean <- citations_raw |>
mutate(
investigation_approved_date = mdy(investigation_approved_date),
violation_status_date = mdy(violation_status_date),
compliance_due_date = mdy(compliance_due_date),
investigation_number = as.character(investigation_number),
notice_of_violation_id = as.character(notice_of_violation_id),
violation_tracking_number = as.character(violation_tracking_number)
) |>
arrange(desc(investigation_approved_date))
citations_clean# A tibble: 241,025 × 15
investigation_number investigation_approved_date notice_of_violation_id
<chr> <date> <chr>
1 2021703 2024-11-08 534344812024310
2 2022240 2024-11-08 360584652024304
3 2022445 2024-11-08 219368582024312
4 2015194 2024-11-08 399510332024303
5 2022467 2024-11-08 467347032024305
6 2022467 2024-11-08 467347032024305
7 2021304 2024-11-08 276390142024313
8 2010056 2024-11-08 695336872024313
9 2015194 2024-11-08 399510332024303
10 2018465 2024-11-08 110615902024310
# ℹ 241,015 more rows
# ℹ 12 more variables: violation_tracking_number <chr>,
# current_violation_status <chr>, violation_status_date <date>,
# violation_allegation <chr>, violation_resolution <chr>,
# violation_repeat_indicator <lgl>, compliance_due_date <date>,
# compliance_history_classification <chr>, violation_category <chr>,
# investigation_type <chr>, county <chr>, tceq_region <chr>
I will quickly look at the summary of the 240,000 rows I will be working with.
summary(citations_clean) investigation_number investigation_approved_date notice_of_violation_id
Length:241025 Min. :1998-12-09 Length:241025
Class :character 1st Qu.:2015-01-27 Class :character
Mode :character Median :2018-07-20 Mode :character
Mean :2018-04-28
3rd Qu.:2021-09-22
Max. :2024-11-08
violation_tracking_number current_violation_status violation_status_date
Length:241025 Length:241025 Min. :2001-02-20
Class :character Class :character 1st Qu.:2015-07-23
Mode :character Mode :character Median :2019-02-11
Mean :2018-09-29
3rd Qu.:2022-03-29
Max. :2024-11-08
violation_allegation violation_resolution violation_repeat_indicator
Length:241025 Length:241025 Mode :logical
Class :character Class :character FALSE:233706
Mode :character Mode :character TRUE :7319
compliance_due_date compliance_history_classification violation_category
Min. :1999-06-21 Length:241025 Length:241025
1st Qu.:2015-07-22 Class :character Class :character
Median :2019-03-29 Mode :character Mode :character
Mean :2020-04-05
3rd Qu.:2022-07-15
Max. :3000-12-31
NA's :157222
investigation_type county tceq_region
Length:241025 Length:241025 Length:241025
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
The most recent data is from November 8, 2024. The oldest data goes back to December 9, 1998.
To track specific investigation numbers, I will put this data in a look-up table format.
#citations_clean |> datatable()A different set - import and clean
This data comes from the same organization, but it includes the name and specific geographic location of the regulated entity. Tied together, the two sets provide a more detailed picture.
nov_raw <- read_csv("data_raw/20241028.csv") |> clean_names()Rows: 65639 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): Regulated Entity Number, Regulated Entity Name, Business Type, Phy...
dbl (3): Investigation Number, Notice of Violation ID, Total Violation Count
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
nov_raw |> glimpse()Rows: 65,639
Columns: 21
$ regulated_entity_number <chr> "RN100209931", "RN100209931", "RN1002…
$ regulated_entity_name <chr> "COVESTRO INDUSTRIAL PARK BAYTOWN PLA…
$ business_type <chr> "INDUSTRIAL CHEMICAL MANUFACTURING PL…
$ physical_location <chr> "8500 W BAY RD", "8500 W BAY RD", NA,…
$ physical_city <chr> "BAYTOWN", "BAYTOWN", NA, "CROSBY", "…
$ physical_state <chr> "TX", "TX", NA, "TX", "TX", "TX", "TX…
$ physical_zip_code <chr> "77523", "77523", NA, "77532", "77571…
$ county <chr> "CHAMBERS", "CHAMBERS", "WEBB", "HARR…
$ investigation_number <dbl> 1395926, 1956272, 1910546, 907867, 19…
$ investigation_approved_date <chr> "May 10, 2017", "Feb 28, 2024", "Aug …
$ nov_date <chr> "May 11, 2017", "Feb 28, 2024", "Aug …
$ coordinates_county_centroid <chr> "POINT (-94.72369894531248 29.7123136…
$ tceq_region <chr> "REGION 12 - HOUSTON", "REGION 12 - H…
$ notice_of_violation_id <dbl> 2.295461e+14, 4.613774e+14, 2.395373e…
$ cat_a_violation_tracking_numbers <chr> NA, NA, NA, NA, "862725", NA, NA, NA,…
$ cat_a_violation_citations <chr> NA, NA, NA, NA, "116.115(c); 382.085(…
$ cat_b_violation_tracking_numbers <chr> NA, "871474", "850936", NA, NA, "7273…
$ cat_b_violation_citations <chr> NA, "111.111(a)(1)(B); 122.143(4); 38…
$ cat_c_violation_tracking_numbers <chr> "633665", "871471; 871479", NA, "4340…
$ cat_c_violation_citations <chr> "116.115(c); 117.345(c); 122.143(4); …
$ total_violation_count <dbl> 1, 3, 1, 1, 1, 1, 4, 1, 2, 3, 1, 1, 1…
I will adjust some of the columns to be character values like previously. I will also look at the summary here to see the date range.
nov_clean <- nov_raw |>
mutate(investigation_approved_date = mdy(investigation_approved_date),
nov_date = mdy(nov_date),
investigation_number = as.character(investigation_number),
notice_of_violation_id = as.character(notice_of_violation_id)) |>
filter(year(nov_date) != 3000) |>
arrange(investigation_approved_date)
summary(nov_clean) regulated_entity_number regulated_entity_name business_type
Length:65635 Length:65635 Length:65635
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
physical_location physical_city physical_state physical_zip_code
Length:65635 Length:65635 Length:65635 Length:65635
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
county investigation_number investigation_approved_date
Length:65635 Length:65635 Min. :2010-09-16
Class :character Class :character 1st Qu.:2014-10-28
Mode :character Mode :character Median :2017-11-13
Mean :2017-11-24
3rd Qu.:2021-01-11
Max. :2024-07-01
nov_date coordinates_county_centroid tceq_region
Min. :2010-09-16 Length:65635 Length:65635
1st Qu.:2014-08-27 Class :character Class :character
Median :2017-07-27 Mode :character Mode :character
Mean :2017-09-24
3rd Qu.:2020-11-19
Max. :2024-07-19
notice_of_violation_id cat_a_violation_tracking_numbers
Length:65635 Length:65635
Class :character Class :character
Mode :character Mode :character
cat_a_violation_citations cat_b_violation_tracking_numbers
Length:65635 Length:65635
Class :character Class :character
Mode :character Mode :character
cat_b_violation_citations cat_c_violation_tracking_numbers
Length:65635 Length:65635
Class :character Class :character
Mode :character Mode :character
cat_c_violation_citations total_violation_count
Length:65635 Min. : 1.000
Class :character 1st Qu.: 1.000
Mode :character Median : 1.000
Mean : 2.433
3rd Qu.: 3.000
Max. :73.000
The latest date shows up as the year 3000. I will check whether this is a repeating error by filtering the original data for years after 2024.
nov_date_check <- nov_raw |>
mutate(
nov_date = mdy(nov_date),
extracted_year = year(nov_date)
) |>
filter(extracted_year > 2024)
nov_date_check# A tibble: 4 × 22
regulated_entity_number regulated_entity_name business_type physical_location
<chr> <chr> <chr> <chr>
1 RN100673813 VINCE HAGAN <NA> 330 CLAY RD BLDG …
2 RN106076573 PHILLIPS 66 STATION <NA> 229 N HWY 259
3 RN105779458 OSSF 1749 US HWY 79 N <NA> 1749 US HIGHWAY 7…
4 RN109272617 MJ GROCERY GARY TX <NA> 4883 FM 999
# ℹ 18 more variables: physical_city <chr>, physical_state <chr>,
# physical_zip_code <chr>, county <chr>, investigation_number <dbl>,
# investigation_approved_date <chr>, nov_date <date>,
# coordinates_county_centroid <chr>, tceq_region <chr>,
# notice_of_violation_id <dbl>, cat_a_violation_tracking_numbers <chr>,
# cat_a_violation_citations <chr>, cat_b_violation_tracking_numbers <chr>,
# cat_b_violation_citations <chr>, cat_c_violation_tracking_numbers <chr>, …
There are four dates labeled with the year 3000, which needs to be followed up with a TCEQ information officer.
To see the actual latest date, I will edit an earlier chunk to exclude the mismatched year (at least for now). The clean data set ranges from November 16, 2010 all the way to July 19, 2024. As with the citations data set, I want to have this one in a searchable table format for reference.
This table is temporary commented out for its large size.
#nov_clean |> datatable()Exporting the data
Last but not least, I will export the cleaned data in a native R format.
citations_clean |> write_rds("data_processed/01-citations.rds")
nov_clean |> write_rds("data_processed/01-nov.rds")