Cleaning

The goal of this notebook is to prepare data from the Texas Commission on Environmental Quality - Notices Of Violation (NOV) and Violation Citations - for further analysis. Cleaned data means each data source will have clean variable names, correct data types, and clean categorical values.

Set up

The first step is to load the library packages that will assist with certain tasks in this project.

library(tidyverse)
library(janitor)
library(DT)

Importing the data

The data I downloaded comes from the Texas Open Data Portal. It is already in my files’ data_raw folder, so the next steps would be to import it, assign it to an R object, and glimpse at it.

To lowercase the columns into code-friendly format, I will use the clean_names() function.

citations_raw <- read_csv("data_raw/20241112.csv") |> clean_names()

Rows: 241025 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (11): Investigation Approved Date, Current Violation Status, Violation S...
dbl  (3): Investigation Number, Notice of Violation ID, Violation Tracking N...
lgl  (1): Violation Repeat Indicator

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

It is a good idea to glimpse at the data to ensure the columns are in proper format.

citations_raw |> glimpse()

Rows: 241,025
Columns: 15
$ investigation_number              <dbl> 1708769, 912908, 1385762, 1747889, 1…
$ investigation_approved_date       <chr> "May 13, 2021", "Jun 20, 2011", "Nov…
$ notice_of_violation_id            <dbl> 5.734687e+14, 1.123664e+14, 4.860436…
$ violation_tracking_number         <dbl> 775850, 420378, 629005, 885714, 7216…
$ current_violation_status          <chr> "RESOLVED", "RESOLVED", "RESOLVED", …
$ violation_status_date             <chr> "Aug 26, 2021", "Sep 29, 2011", "Nov…
$ violation_allegation              <chr> "Failure to maintain inadequate stor…
$ violation_resolution              <chr> "On July 9, 2021, the TCEQ DFW Regio…
$ violation_repeat_indicator        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, F…
$ compliance_due_date               <chr> "Jun 13, 2021", "Jul 20, 2011", NA, …
$ compliance_history_classification <chr> "MODERATE", "MINOR", "MODERATE", "MO…
$ violation_category                <chr> "B", "C", "B", "B", "B", "B", "C", "…
$ investigation_type                <chr> "Compliance Investigation", "Complia…
$ county                            <chr> "JOHNSON", "TRINITY", "RAINS", "POLK…
$ tceq_region                       <chr> "REGION 04 - DFW METROPLEX", "REGION…

Right off the bat, I notice that I have to change the date formats and the investigation number, notice of violation ID, and violation tracking number columns to be character values since they are categorical variables.

Fixing the formats

For the date to not show up as a numerical value, I will use one of the lubridate functions, mutate(), to modify it. I referred to lubridate documentation for the right format.

I am arranging the dates in a descending order to see the most recent data.

citations_clean <- citations_raw |>
  mutate(
    investigation_approved_date = mdy(investigation_approved_date),
    violation_status_date = mdy(violation_status_date),
    compliance_due_date = mdy(compliance_due_date),
    investigation_number = as.character(investigation_number),
    notice_of_violation_id = as.character(notice_of_violation_id),
    violation_tracking_number = as.character(violation_tracking_number)
  ) |>
  arrange(desc(investigation_approved_date))

citations_clean

# A tibble: 241,025 × 15
   investigation_number investigation_approved_date notice_of_violation_id
   <chr>                <date>                      <chr>                 
 1 2021703              2024-11-08                  534344812024310       
 2 2022240              2024-11-08                  360584652024304       
 3 2022445              2024-11-08                  219368582024312       
 4 2015194              2024-11-08                  399510332024303       
 5 2022467              2024-11-08                  467347032024305       
 6 2022467              2024-11-08                  467347032024305       
 7 2021304              2024-11-08                  276390142024313       
 8 2010056              2024-11-08                  695336872024313       
 9 2015194              2024-11-08                  399510332024303       
10 2018465              2024-11-08                  110615902024310       
# ℹ 241,015 more rows
# ℹ 12 more variables: violation_tracking_number <chr>,
#   current_violation_status <chr>, violation_status_date <date>,
#   violation_allegation <chr>, violation_resolution <chr>,
#   violation_repeat_indicator <lgl>, compliance_due_date <date>,
#   compliance_history_classification <chr>, violation_category <chr>,
#   investigation_type <chr>, county <chr>, tceq_region <chr>

I will quickly look at the summary of the 240,000 rows I will be working with.

summary(citations_clean)

 investigation_number investigation_approved_date notice_of_violation_id
 Length:241025        Min.   :1998-12-09          Length:241025         
 Class :character     1st Qu.:2015-01-27          Class :character      
 Mode  :character     Median :2018-07-20          Mode  :character      
                      Mean   :2018-04-28                                
                      3rd Qu.:2021-09-22                                
                      Max.   :2024-11-08                                
                                                                        
 violation_tracking_number current_violation_status violation_status_date
 Length:241025             Length:241025            Min.   :2001-02-20   
 Class :character          Class :character         1st Qu.:2015-07-23   
 Mode  :character          Mode  :character         Median :2019-02-11   
                                                    Mean   :2018-09-29   
                                                    3rd Qu.:2022-03-29   
                                                    Max.   :2024-11-08   
                                                                         
 violation_allegation violation_resolution violation_repeat_indicator
 Length:241025        Length:241025        Mode :logical             
 Class :character     Class :character     FALSE:233706              
 Mode  :character     Mode  :character     TRUE :7319                
                                                                     
                                                                     
                                                                     
                                                                     
 compliance_due_date  compliance_history_classification violation_category
 Min.   :1999-06-21   Length:241025                     Length:241025     
 1st Qu.:2015-07-22   Class :character                  Class :character  
 Median :2019-03-29   Mode  :character                  Mode  :character  
 Mean   :2020-04-05                                                       
 3rd Qu.:2022-07-15                                                       
 Max.   :3000-12-31                                                       
 NA's   :157222                                                           
 investigation_type    county          tceq_region       
 Length:241025      Length:241025      Length:241025     
 Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character

The most recent data is from November 8, 2024. The oldest data goes back to December 9, 1998.

To track specific investigation numbers, I will put this data in a look-up table format.

#citations_clean |> datatable()

A different set - import and clean

This data comes from the same organization, but it includes the name and specific geographic location of the regulated entity. Tied together, the two sets provide a more detailed picture.

nov_raw <- read_csv("data_raw/20241028.csv") |> clean_names()

Rows: 65639 Columns: 21
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (18): Regulated Entity Number, Regulated Entity Name, Business Type, Phy...
dbl  (3): Investigation Number, Notice of Violation ID, Total Violation Count

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

nov_raw |> glimpse()

Rows: 65,639
Columns: 21
$ regulated_entity_number          <chr> "RN100209931", "RN100209931", "RN1002…
$ regulated_entity_name            <chr> "COVESTRO INDUSTRIAL PARK BAYTOWN PLA…
$ business_type                    <chr> "INDUSTRIAL CHEMICAL MANUFACTURING PL…
$ physical_location                <chr> "8500 W BAY RD", "8500 W BAY RD", NA,…
$ physical_city                    <chr> "BAYTOWN", "BAYTOWN", NA, "CROSBY", "…
$ physical_state                   <chr> "TX", "TX", NA, "TX", "TX", "TX", "TX…
$ physical_zip_code                <chr> "77523", "77523", NA, "77532", "77571…
$ county                           <chr> "CHAMBERS", "CHAMBERS", "WEBB", "HARR…
$ investigation_number             <dbl> 1395926, 1956272, 1910546, 907867, 19…
$ investigation_approved_date      <chr> "May 10, 2017", "Feb 28, 2024", "Aug …
$ nov_date                         <chr> "May 11, 2017", "Feb 28, 2024", "Aug …
$ coordinates_county_centroid      <chr> "POINT (-94.72369894531248 29.7123136…
$ tceq_region                      <chr> "REGION 12 - HOUSTON", "REGION 12 - H…
$ notice_of_violation_id           <dbl> 2.295461e+14, 4.613774e+14, 2.395373e…
$ cat_a_violation_tracking_numbers <chr> NA, NA, NA, NA, "862725", NA, NA, NA,…
$ cat_a_violation_citations        <chr> NA, NA, NA, NA, "116.115(c); 382.085(…
$ cat_b_violation_tracking_numbers <chr> NA, "871474", "850936", NA, NA, "7273…
$ cat_b_violation_citations        <chr> NA, "111.111(a)(1)(B); 122.143(4); 38…
$ cat_c_violation_tracking_numbers <chr> "633665", "871471; 871479", NA, "4340…
$ cat_c_violation_citations        <chr> "116.115(c); 117.345(c); 122.143(4); …
$ total_violation_count            <dbl> 1, 3, 1, 1, 1, 1, 4, 1, 2, 3, 1, 1, 1…

I will adjust some of the columns to be character values like previously. I will also look at the summary here to see the date range.

nov_clean <- nov_raw |>
  mutate(investigation_approved_date = mdy(investigation_approved_date),
         nov_date = mdy(nov_date),
         investigation_number = as.character(investigation_number),
         notice_of_violation_id = as.character(notice_of_violation_id)) |>
  filter(year(nov_date) != 3000) |>
  arrange(investigation_approved_date)

summary(nov_clean)

 regulated_entity_number regulated_entity_name business_type     
 Length:65635            Length:65635          Length:65635      
 Class :character        Class :character      Class :character  
 Mode  :character        Mode  :character      Mode  :character  
                                                                 
                                                                 
                                                                 
 physical_location  physical_city      physical_state     physical_zip_code 
 Length:65635       Length:65635       Length:65635       Length:65635      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
    county          investigation_number investigation_approved_date
 Length:65635       Length:65635         Min.   :2010-09-16         
 Class :character   Class :character     1st Qu.:2014-10-28         
 Mode  :character   Mode  :character     Median :2017-11-13         
                                         Mean   :2017-11-24         
                                         3rd Qu.:2021-01-11         
                                         Max.   :2024-07-01         
    nov_date          coordinates_county_centroid tceq_region       
 Min.   :2010-09-16   Length:65635                Length:65635      
 1st Qu.:2014-08-27   Class :character            Class :character  
 Median :2017-07-27   Mode  :character            Mode  :character  
 Mean   :2017-09-24                                                 
 3rd Qu.:2020-11-19                                                 
 Max.   :2024-07-19                                                 
 notice_of_violation_id cat_a_violation_tracking_numbers
 Length:65635           Length:65635                    
 Class :character       Class :character                
 Mode  :character       Mode  :character                
                                                        
                                                        
                                                        
 cat_a_violation_citations cat_b_violation_tracking_numbers
 Length:65635              Length:65635                    
 Class :character          Class :character                
 Mode  :character          Mode  :character                
                                                           
                                                           
                                                           
 cat_b_violation_citations cat_c_violation_tracking_numbers
 Length:65635              Length:65635                    
 Class :character          Class :character                
 Mode  :character          Mode  :character                
                                                           
                                                           
                                                           
 cat_c_violation_citations total_violation_count
 Length:65635              Min.   : 1.000       
 Class :character          1st Qu.: 1.000       
 Mode  :character          Median : 1.000       
                           Mean   : 2.433       
                           3rd Qu.: 3.000       
                           Max.   :73.000

The latest date shows up as the year 3000. I will check whether this is a repeating error by filtering the original data for years after 2024.

nov_date_check <- nov_raw |>
  mutate(
    nov_date = mdy(nov_date),
    extracted_year = year(nov_date)
    ) |>
  filter(extracted_year > 2024)

nov_date_check

# A tibble: 4 × 22
  regulated_entity_number regulated_entity_name business_type physical_location 
  <chr>                   <chr>                 <chr>         <chr>             
1 RN100673813             VINCE HAGAN           <NA>          330 CLAY RD BLDG …
2 RN106076573             PHILLIPS 66 STATION   <NA>          229 N HWY 259     
3 RN105779458             OSSF 1749 US HWY 79 N <NA>          1749 US HIGHWAY 7…
4 RN109272617             MJ GROCERY GARY TX    <NA>          4883 FM 999       
# ℹ 18 more variables: physical_city <chr>, physical_state <chr>,
#   physical_zip_code <chr>, county <chr>, investigation_number <dbl>,
#   investigation_approved_date <chr>, nov_date <date>,
#   coordinates_county_centroid <chr>, tceq_region <chr>,
#   notice_of_violation_id <dbl>, cat_a_violation_tracking_numbers <chr>,
#   cat_a_violation_citations <chr>, cat_b_violation_tracking_numbers <chr>,
#   cat_b_violation_citations <chr>, cat_c_violation_tracking_numbers <chr>, …

There are four dates labeled with the year 3000, which needs to be followed up with a TCEQ information officer.

To see the actual latest date, I will edit an earlier chunk to exclude the mismatched year (at least for now). The clean data set ranges from November 16, 2010 all the way to July 19, 2024. As with the citations data set, I want to have this one in a searchable table format for reference.

This table is temporary commented out for its large size.

#nov_clean |> datatable()

Exporting the data

Last but not least, I will export the cleaned data in a native R format.

citations_clean |> write_rds("data_processed/01-citations.rds")

nov_clean |> write_rds("data_processed/01-nov.rds")