Uses confirm_distinct in an iterative fashion to determine the primary keys.
determine_distinct(df, ..., listviewer = TRUE)
df: a data frame
...: columns or a tidyselect specification. Defaults to everything.
listviewer: logical. Defaults to TRUE to view output using the listviewer package.
Value: list
The goal of this function is to automatically determine which columns uniquely identify the rows of a data frame. The output is a printed description of the combinations of columns that form unique identifiers at each level. At level 1, the function tests whether individual columns are primary keys. At level 2, the function tests all n-choose-2 pairs of columns to see whether they form primary keys, and so on for higher levels. The final level tests all columns at once.
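To get a feel for how many candidates each level involves, here is a small base R sketch that counts and lists the column combinations for a three-column frame. The column names are illustrative, and `combn()` is only a stand-in: the documentation does not say how the package enumerates combinations internally.

```r
# Illustrative column names; not taken from the package internals.
cols <- c("ID_COL1", "ID_COL2", "ID_COL3")

# Level k tests choose(n, k) combinations of the n candidate columns.
n <- length(cols)
level_counts <- choose(n, seq_len(n))
names(level_counts) <- paste("LEVEL", seq_len(n))
print(level_counts)

# The level 2 candidates themselves, one character vector per pair.
level2 <- combn(cols, 2, simplify = FALSE)
print(level2)
```

The number of candidates grows quickly with the number of columns, which is one reason to pass a narrower column selection when the data frame is wide.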
Completely unique columns are recorded in level 1 and then dropped from the data frame to facilitate the determination of multi-column primary keys.
If the dataset contains duplicated rows, they are eliminated before proceeding.
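The procedure described above can be approximated in base R as follows. This is a minimal sketch under stated assumptions, not the package's implementation: `is_key()` and `find_keys()` are hypothetical helpers, the `toy` data frame is made up, and the sketch reports every level that yields a key without pruning supersets of keys already found.

```r
# Hypothetical helper: TRUE when `cols` uniquely identify every row of `df`.
is_key <- function(df, cols) {
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

# Hypothetical level-wise search mirroring the description above.
find_keys <- function(df) {
  df <- unique(df)  # eliminate duplicated rows before proceeding
  keys <- list()

  # Level 1: record single-column keys, then drop them from the search.
  single <- Filter(function(col) is_key(df, col), names(df))
  if (length(single) > 0) keys[["LEVEL 1"]] <- as.list(single)
  rest <- setdiff(names(df), single)

  # Levels 2..n: test every combination of the remaining columns.
  for (k in seq_along(rest)[-1]) {
    combos <- combn(rest, k, simplify = FALSE)
    hits <- Filter(function(cols) is_key(df, cols), combos)
    if (length(hits) > 0) {
      keys[[paste("LEVEL", k)]] <-
        vapply(hits, paste, character(1), collapse = ", ")
    }
  }
  keys
}

# Made-up data: rows 4 and 5 are exact duplicates.
toy <- data.frame(
  id_a = c(1, 1, 2, 2, 2),
  id_b = c(1, 2, 1, 2, 2),
  val  = c(0.5, 1.5, 2.5, 3.5, 3.5)
)

result <- find_keys(toy)
print(result)
```

In this toy frame, `val` is unique on its own once the duplicated row is removed, so it is recorded at level 1 and excluded from the pairwise test of `id_a` and `id_b`.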
sample_data1 %>%
head
#> # A tibble: 6 × 6
#> ID_COL1 ID_COL2 ID_COL3 VAL1 VAL2 VAL3
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2413 1034 1014 -0.0639 -1.16 -0.302
#> 2 2413 1034 1322 0.363 1.62 0.165
#> 3 2413 1034 2999 -0.00466 1.23 0.819
#> 4 2413 1034 3544 1.83 -2.58 -0.525
#> 5 2413 1034 9901 0.837 -0.442 -0.341
#> 6 2413 1122 1014 -0.894 -1.11 0.768
## On level 1, each column is tested as a unique identifier. The VAL columns have no
## duplicates and hence qualify, even though they would not normally be considered IDs.
## On level 3, combinations of 3 columns are tested. The output shows that ID_COL1,
## ID_COL2, and ID_COL3 together form a unique key.
## Level 2 does not appear, which implies that no combination of 2 ID_COLs forms a unique key.
sample_data1 %>%
determine_distinct(listviewer = FALSE)
#> $`LEVEL 3`
#> [1] "ID_COL1, ID_COL2, ID_COL3"
#>
#> $`LEVEL 1`
#> $`LEVEL 1`[[1]]
#> [1] "VAL1"
#>
#> $`LEVEL 1`[[2]]
#> [1] "VAL2"
#>
#> $`LEVEL 1`[[3]]
#> [1] "VAL3"
#>
#>