Cleaning

Notebook goals

Download the data
Import the data
Check datatypes
Create a total_value variable
Create a control_type variable
Filter the date range (since Jan. 1 2010)
Export the cleaned data

Setup

For starters, we will attach a few handy packages to the project.

library(tidyverse)

Downloading the data

We will now load the data that we want to look at more closely. While the data we will use here if from Prof. McDonald, it is from the Law Enforcement Support Office. Find more information about the program here.

download.file(
  "https://github.com/utdata/rwd-r-leso/blob/main/data-processed/leso.csv?raw=true",
  "data-raw/leso.csv",
  mode = "wb"
)

Importing the data

To view the data inside our working space, we import it into the notebook.

leso_raw <- read_csv("data-raw/leso.csv")

Rows: 95056 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (7): state, agency_name, nsn, item_name, ui, demil_code, station_type
dbl  (4): sheet, quantity, acquisition_value, demil_ic
dttm (1): ship_date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

leso_raw

To see all the columns, we will glimpse at the imported tibble.

leso_raw |> glimpse()

Rows: 95,056
Columns: 12
$ sheet             <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name         <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity          <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui                <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code        <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic          <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date         <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…

When checking the datatypes, these are the things to watch for:

Are the variable names (column names) clean? All lower case with _ separating the word?
Are dates saved in a date format? ship_date looks good with the or “datetime” format.
Are your numbers really numbers? acquisition_value is the column we are most concerned with, and it looks good to go.

Removing unnecessary columns

We can remove the sheet column as it won’t be useful in the analysis. If other columns end up needing to be removed (or kept) we can go back to this document, make corrections, and run the notebook again.

We can also use c() to combine columns that need to be removed together.

leso_tight <- leso_raw |>
  select(!sheet)

leso_tight |> glimpse()

Rows: 95,056
Columns: 11
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name         <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity          <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui                <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code        <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic          <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date         <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…

Creating a total value column

Since the acquisition_value noted in the data is for a single “unit” of each item, we need to create a new column that shows the total according to quantity.

leso_total <- leso_tight |>
  mutate(total_value = quantity * acquisition_value)

leso_total |> glimpse()

Rows: 95,056
Columns: 12
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name         <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity          <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui                <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code        <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic          <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date         <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…

Check the results

Note the new (and removed) columns from the tibble. If there isn’t enough information visible, you can print some sample rows of the data to peek through it.

leso_total |>
  slice_sample(n = 10) |>
  select(agency_name, quantity, acquisition_value, total_value)

If you want to see a different sample of data, just rerun the chunk as you get a new one each time. If you want a different number of lines, change the value inside the function. You can use add on select() to pluck out the columns of interest to check.

Controlled vs. non-controlled

In this set, non-controlled generic items like boots and blankets are removed from the data after one year, but controlled items like guns and airplanes remain on the list until they are returned to the military for disposal. We are only concerned with the controlled items.

According to the agency, “DEMIL codes,” as they are referenced, are the demil_code and demil_ic columns in the data, so we can use those to mark which records are “non-controlled” (A and Q6) and then mark all the rest as “controlled.”

leso_control <- leso_total |> 
  mutate(
    control_type = case_when(
      str_detect(item_name, "AIRPLANE") ~ TRUE,
      (demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
      .default = TRUE
    )
  )

leso_control |> glimpse()

Rows: 95,056
Columns: 13
$ state             <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name       <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn               <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name         <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity          <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui                <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code        <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic          <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date         <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type      <chr> "State", "State", "State", "State", "State", "State"…
$ total_value       <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…
$ control_type      <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…

Filtering our data

To look at more recent results, we will filter the data to records since Jan. 1, 2010. The summary() function allows us to see whether the changes we made are what we wanted.

leso_dated <- leso_control |>
  filter(ship_date >= "2010-01-01")

leso_dated |> summary()

    state           agency_name            nsn             item_name        
 Length:65172       Length:65172       Length:65172       Length:65172      
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
    quantity            ui            acquisition_value   demil_code       
 Min.   : -72.00   Length:65172       Min.   :       0   Length:65172      
 1st Qu.:   1.00   Class :character   1st Qu.:     230   Class :character  
 Median :   1.00   Mode  :character   Median :     499   Mode  :character  
 Mean   :   4.63                      Mean   :   22643                     
 3rd Qu.:   1.00                      3rd Qu.:    3578                     
 Max.   :5000.00                      Max.   :22000000                     
                                                                           
    demil_ic       ship_date                      station_type      
 Min.   :0.000   Min.   :2010-01-06 00:00:00.00   Length:65172      
 1st Qu.:1.000   1st Qu.:2012-03-13 00:00:00.00   Class :character  
 Median :1.000   Median :2014-09-22 00:00:00.00   Mode  :character  
 Mean   :1.437   Mean   :2015-11-16 22:19:38.66                     
 3rd Qu.:1.000   3rd Qu.:2018-08-15 00:00:00.00                     
 Max.   :7.000   Max.   :2024-09-28 00:00:00.00                     
 NA's   :6048                                                       
  total_value       control_type   
 Min.   :-1392035   Mode :logical  
 1st Qu.:     365   FALSE:7257     
 Median :     749   TRUE :57915    
 Mean   :   24382                  
 3rd Qu.:    6656                  
 Max.   :22000000

Exporting cleaned data

We will export the data with all the changes we made into a native R format called .rds.

leso_dated |> write_rds("data-processed/01-leso-all.rds")