Data imported from excel and csv in business situations can have
messy characteristics and data formats. This package provides functions
to tidy your data frame using the power of tidyselect
.
create sample data
tibble::tibble(
date = c("20190101", "20190305", "20201012"),
numeric_val = c(1, NA, 5),
char_val = c("", " val ", "-")
) -> sample_table
sample_table
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 ""
## 2 20190305 NA " val "
## 3 20201012 5 "-"
Data occasionally has different ways to represent NA values.
set_na
checks as default
c("-", "", " ", "null")
but any values can be supplied to
automatically be set to NA. This is helpful when you want to check the
NA profile of a data frame using validata::diagnose
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 NA
## 2 20190305 NA " val "
## 3 20201012 5 NA
remove whitespace from the ends of character variables that may be otherwise undetectable by inspection.
sample_table %>%
remove_whitespace()
## # A tibble: 3 × 3
## date numeric_val char_val
## <chr> <dbl> <chr>
## 1 20190101 1 ""
## 2 20190305 NA "val"
## 3 20201012 5 "-"
automatically convert character columns that should be dates.
## # A tibble: 3 × 3
## date numeric_val char_val
## <date> <dbl> <chr>
## 1 2019-01-01 1 ""
## 2 2019-03-05 NA " val "
## 3 2020-10-12 5 "-"
relocates an unorganized dataframe using heuristics such as putting character and date columns first, and organizing by alphabetical order.
sample_table %>%
relocate_all()
## # A tibble: 3 × 3
## date char_val numeric_val
## <chr> <chr> <dbl>
## 1 20190101 "" 1
## 2 20190305 " val " NA
## 3 20201012 "-" 5
Wrapper function to apply all cleaning operations to a data frame using sensible defaults.
sample_table %>%
clean_frame()
## # A tibble: 3 × 3
## DATE NUMERIC_VAL CHAR_VAL
## <int> <int> <chr>
## 1 20190101 1 NA
## 2 20190305 NA " val "
## 3 20201012 5 NA