library(tidyverse)Cleaning
Notebook goals
- Download the data
- Import the data
- Check datatypes
- Create a total_value variable
- Create a control_type variable
- Filter the date range (since Jan. 1 2010)
- Export the cleaned data
Setup
For starters, we will attach a few handy packages to the project.
Downloading the data
We will now load the data that we want to look at more closely. While the data we will use here if from Prof. McDonald, it is from the Law Enforcement Support Office. Find more information about the program here.
download.file(
"https://github.com/utdata/rwd-r-leso/blob/main/data-processed/leso.csv?raw=true",
"data-raw/leso.csv",
mode = "wb"
)Importing the data
To view the data inside our working space, we import it into the notebook.
leso_raw <- read_csv("data-raw/leso.csv")Rows: 95056 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): state, agency_name, nsn, item_name, ui, demil_code, station_type
dbl (4): sheet, quantity, acquisition_value, demil_ic
dttm (1): ship_date
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
leso_rawTo see all the columns, we will glimpse at the imported tibble.
leso_raw |> glimpse()Rows: 95,056
Columns: 12
$ sheet <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
When checking the datatypes, these are the things to watch for:
- Are the variable names (column names) clean? All lower case with _ separating the word?
- Are dates saved in a date format? ship_date looks good with the
or “datetime” format. - Are your numbers really numbers? acquisition_value is the column we are most concerned with, and it looks good to go.
Removing unnecessary columns
We can remove the sheet column as it won’t be useful in the analysis. If other columns end up needing to be removed (or kept) we can go back to this document, make corrections, and run the notebook again.
We can also use c() to combine columns that need to be removed together.
leso_tight <- leso_raw |>
select(!sheet)
leso_tight |> glimpse()Rows: 95,056
Columns: 11
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
Creating a total value column
Since the acquisition_value noted in the data is for a single “unit” of each item, we need to create a new column that shows the total according to quantity.
leso_total <- leso_tight |>
mutate(total_value = quantity * acquisition_value)
leso_total |> glimpse()Rows: 95,056
Columns: 12
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…
Check the results
Note the new (and removed) columns from the tibble. If there isn’t enough information visible, you can print some sample rows of the data to peek through it.
leso_total |>
slice_sample(n = 10) |>
select(agency_name, quantity, acquisition_value, total_value)If you want to see a different sample of data, just rerun the chunk as you get a new one each time. If you want a different number of lines, change the value inside the function. You can use add on select() to pluck out the columns of interest to check.
Controlled vs. non-controlled
In this set, non-controlled generic items like boots and blankets are removed from the data after one year, but controlled items like guns and airplanes remain on the list until they are returned to the military for disposal. We are only concerned with the controlled items.
According to the agency, “DEMIL codes,” as they are referenced, are the demil_code and demil_ic columns in the data, so we can use those to mark which records are “non-controlled” (A and Q6) and then mark all the rest as “controlled.”
leso_control <- leso_total |>
mutate(
control_type = case_when(
str_detect(item_name, "AIRPLANE") ~ TRUE,
(demil_code == "A" | (demil_code == "Q" & demil_ic == 6)) ~ FALSE,
.default = TRUE
)
)
leso_control |> glimpse()Rows: 95,056
Columns: 13
$ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL"…
$ agency_name <chr> "ABBEVILLE POLICE DEPT", "ABBEVILLE POLICE DEPT", "A…
$ nsn <chr> "2540-01-565-4700", "1385-01-574-4707", "1005-01-587…
$ item_name <chr> "BALLISTIC BLANKET KIT", "UNMANNED VEHICLE,GROUND", …
$ quantity <dbl> 10, 1, 10, 9, 1, 1, 1, 18, 1, 2, 3, 10, 1, 1, 2, 1, …
$ ui <chr> "Kit", "Each", "Each", "Each", "Each", "Each", "Each…
$ acquisition_value <dbl> 16854.24, 10000.00, 1626.00, 365.00, 62627.00, 65800…
$ demil_code <chr> "D", "Q", "D", "D", "C", "C", "C", "Q", "Q", "Q", "Q…
$ demil_ic <dbl> 1, 3, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 1, 3, 3, 3, 3…
$ ship_date <dttm> 2018-01-30 00:00:00, 2017-03-28 00:00:00, 2016-09-1…
$ station_type <chr> "State", "State", "State", "State", "State", "State"…
$ total_value <dbl> 168542.40, 10000.00, 16260.00, 3285.00, 62627.00, 65…
$ control_type <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE…
Filtering our data
To look at more recent results, we will filter the data to records since Jan. 1, 2010. The summary() function allows us to see whether the changes we made are what we wanted.
leso_dated <- leso_control |>
filter(ship_date >= "2010-01-01")
leso_dated |> summary() state agency_name nsn item_name
Length:65172 Length:65172 Length:65172 Length:65172
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
quantity ui acquisition_value demil_code
Min. : -72.00 Length:65172 Min. : 0 Length:65172
1st Qu.: 1.00 Class :character 1st Qu.: 230 Class :character
Median : 1.00 Mode :character Median : 499 Mode :character
Mean : 4.63 Mean : 22643
3rd Qu.: 1.00 3rd Qu.: 3578
Max. :5000.00 Max. :22000000
demil_ic ship_date station_type
Min. :0.000 Min. :2010-01-06 00:00:00.00 Length:65172
1st Qu.:1.000 1st Qu.:2012-03-13 00:00:00.00 Class :character
Median :1.000 Median :2014-09-22 00:00:00.00 Mode :character
Mean :1.437 Mean :2015-11-16 22:19:38.66
3rd Qu.:1.000 3rd Qu.:2018-08-15 00:00:00.00
Max. :7.000 Max. :2024-09-28 00:00:00.00
NA's :6048
total_value control_type
Min. :-1392035 Mode :logical
1st Qu.: 365 FALSE:7257
Median : 749 TRUE :57915
Mean : 24382
3rd Qu.: 6656
Max. :22000000
Exporting cleaned data
We will export the data with all the changes we made into a native R format called .rds.
leso_dated |> write_rds("data-processed/01-leso-all.rds")