Distinct

Confirm Distinct

In data analysis tasks we often have data sets with multiple possible ID columns, but it’s not always clear which combination uniquely identifies each row.

sample_data1 has 125 row with 3 ID type columns and 3 value columns.

head(sample_data1)
#> # A tibble: 6 × 6
#>   ID_COL1 ID_COL2 ID_COL3     VAL1   VAL2   VAL3
#>     <dbl>   <dbl>   <dbl>    <dbl>  <dbl>  <dbl>
#> 1    2413    1034    1014 -0.0639  -1.16  -0.302
#> 2    2413    1034    1322  0.363    1.62   0.165
#> 3    2413    1034    2999 -0.00466  1.23   0.819
#> 4    2413    1034    3544  1.83    -2.58  -0.525
#> 5    2413    1034    9901  0.837   -0.442 -0.341
#> 6    2413    1122    1014 -0.894   -1.11   0.768

Let’s use confirm_distinct iteratively to find the uniquely identifying columns of sample_data1.

sample_data1 %>% 
  confirm_distinct(ID_COL1)
#> database has 120 duplicates at ID_COL1
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2)
#> database has 100 duplicates at ID_COL1, ID_COL2
sample_data1 %>% 
  confirm_distinct(ID_COL1, ID_COL2, ID_COL3)
#> database is distinct at ID_COL1, ID_COL2, ID_COL3

Here we can conclude that the combination of 3 ID columns is the primary key for the data.

Determine Distinct

These steps can be automated with the wrapper function determine distinct.

sample_data1 %>% 
  determine_distinct(matches("ID"))

Mapping

confirm_mapping tells you the mapping between two columns in a data frame:

  • 1 - 1 mapping
  • 1 - many mapping
  • many - 1 mapping
  • many - many mapping

Confirm mapping

confirm_mapping gives the option to view which type of mapping is associated with each individual row.

sample_data1 %>% 
  confirm_mapping(ID_COL1, ID_COL2, view = F)
#> many - many mapping between ID_COL1 and ID_COL2

Determine mapping

Overlap

The overlap functions give a venn style description of the values in 2 columns. This is especially useful before performing a join function, and you want to confirm that the dataframes have matching keys.

Confirm Overlap

confirm_overlap is different from the other confirm functions in that it takes 2 vectors as arguments, instead of a data frame. This is to allow the user to test overlap between different dataframes, or arbitrary vectors if necessary


confirm_overlap(iris$Sepal.Width, iris$Petal.Length) -> iris_overlap
#> # A tibble: 1 × 5
#>   only_in_iris_Sepal.Width only_in_iris_Pet… shared_names total_names pct_shared
#>                      <int>             <int>        <int>       <int> <chr>     
#> 1                       12                32           11          55 20%

confirm_overlap returns a summary data frame invisibly allowing you to access individual elements using the helper functions.

print(iris_overlap)
#> # A tibble: 55 × 4
#>        x iris_Sepal.Width iris_Petal.Length both_flags
#>    <dbl>            <dbl>             <dbl>      <dbl>
#>  1   3.5                1                 1          2
#>  2   3                  1                 1          2
#>  3   3.2                1                 0          1
#>  4   3.1                1                 0          1
#>  5   3.6                1                 1          2
#>  6   3.9                1                 1          2
#>  7   3.4                1                 0          1
#>  8   2.9                1                 0          1
#>  9   3.7                1                 1          2
#> 10   4                  1                 1          2
#> # … with 45 more rows

Find the elements unique to the first column

iris_overlap %>% 
  co_find_only_in_1() %>% 
  head()
#> # A tibble: 6 × 1
#>   iris_Sepal.Width
#>              <dbl>
#> 1              3.2
#> 2              3.1
#> 3              3.4
#> 4              2.9
#> 5              2.3
#> 6              2.8

Find the elements unique to the second column

iris_overlap %>% 
  co_find_only_in_2() %>% 
  head()
#> # A tibble: 6 × 1
#>   iris_Petal.Length
#>               <dbl>
#> 1               1.4
#> 2               1.3
#> 3               1.5
#> 4               1.7
#> 5               1.6
#> 6               1.1

Find the elements shared by both columns

iris_overlap %>% 
  co_find_in_both() %>% 
  head()
#> # A tibble: 6 × 1
#>       x
#>   <dbl>
#> 1   3.5
#> 2   3  
#> 3   3.6
#> 4   3.9
#> 5   3.7
#> 6   4

Determine Overlap

determine_overlap takes a dataframe and a tidyselect specification, and returns a tibble summarizing all of the pairwise overlaps. Only pairs with matching types are tested.

Note that the overlap functions only test pairwise overlaps. For multi-column and large-scale overlap testing, see Complex Upset Plots

string length

confirm string length

Get a frequency table of string lengths in a character column. Table is printed while the original df is returned invisibly with a column indicating the string lengths.

iris %>% 
  confirm_strlen(Species) -> species_len
#>  Species_chr_len  n percent
#>                6 50   33.3%
#>                9 50   33.3%
#>               10 50   33.3%

output is a dataframe

head(species_len)
#> # A tibble: 6 × 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6

choose string length

A helped function for the output of confirm_strlen that filters the database for chosen string lengths.

species_len %>% 
  choose_strlen(len = 6) %>% 
  head()
#> # A tibble: 6 × 6
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species Species_chr_len
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>             <int>
#> 1          5.1         3.5          1.4         0.2 setosa                6
#> 2          4.9         3            1.4         0.2 setosa                6
#> 3          4.7         3.2          1.3         0.2 setosa                6
#> 4          4.6         3.1          1.5         0.2 setosa                6
#> 5          5           3.6          1.4         0.2 setosa                6
#> 6          5.4         3.9          1.7         0.4 setosa                6

diagnose

Reproduction of diagnose from the dlookr package. Usually a good choice for first analyzing a data set.

iris %>% 
  diagnose()
#> # A tibble: 5 × 6
#>   variables    types   missing_count missing_percent unique_count unique_rate
#>   <chr>        <chr>           <int>           <dbl>        <int>       <dbl>
#> 1 Sepal.Length numeric             0               0           35       0.233
#> 2 Sepal.Width  numeric             0               0           23       0.153
#> 3 Petal.Length numeric             0               0           43       0.287
#> 4 Petal.Width  numeric             0               0           22       0.147
#> 5 Species      factor              0               0            3       0.02