dplyr 1.1.1

  dplyr, dplyr-1-1-0

  Davis Vaughan

We’re stoked to announce the release of dplyr 1.1.1. We don’t typically blog about patch releases, because they generally only fix bugs without significantly changing behavior, but this one includes two important updates:

  • Addressing various performance regressions
  • Refining the multiple match warning thrown by dplyr’s joins

You can see a full list of changes in the release notes. To see the other blog posts in the dplyr 1.1.0 series, head here.

You can install dplyr 1.1.1 from CRAN with:

install.packages("dplyr")

Performance regressions

In the 1.1.0 post on vctrs, we discussed that we’ve rewritten all of dplyr’s vector functions on top of vctrs for improved versatility. Unfortunately, we accidentally made two sets of functions much slower, especially when used on a data frame with many groups:

These performance issues have been addressed, and should be back to 1.0.10 level of performance. case_when() is still slightly slower than 1.0.10, but it isn’t likely to be very noticeable, and we already have plans to improve this further in a future release.

Revisiting multiple matches

In the 1.1.0 post on joins, we discussed the new multiple argument that was added to left_join() and friends, which had a built in safety check that warned when you performed a join where a row from x matched more than one row from y. The TLDR of the discussion below is that we’ve realized that this warning was being thrown in too many cases, so we’ve adjusted it in such a way that it now only catches the most dangerous type of join (a many-to-many join), meaning that you should see the warning much less often.

As a reminder, multiple determines what happens when a row from x matches more than one row from y. You can choose to return "all" of the matches, the "first" or "last" match, or "any" of the matches if you are just interested in detecting if there is at least one. multiple defaulted to a behavior similar to "all", with the added side effect of throwing a warning if multiple matches were actually detected, like this:

student <- tibble(
  student_id = c(1, 2, 3),
  transfer = c(FALSE, TRUE, TRUE),
  initial_term = c("fall 2018", "fall 2020", "fall 2020")
)

term <- tibble(
  student_id = c(1, 1, 2, 3, 3, 3),
  term = c("fall 2018", "spring 2019", "fall 2020", "fall 2020", "spring 2021", "fall 2021"),
  course_load = c(12, 15, 10, 14, 15, 12)
)
# Information about students attending a university.
# One row per (student_id).
student
#> # A tibble: 3 × 3
#>   student_id transfer initial_term
#>        <dbl> <lgl>    <chr>       
#> 1          1 FALSE    fall 2018   
#> 2          2 TRUE     fall 2020   
#> 3          3 TRUE     fall 2020

# Term specific information about each student.
# One row per (student_id, term) combination.
term
#> # A tibble: 6 × 3
#>   student_id term        course_load
#>        <dbl> <chr>             <dbl>
#> 1          1 fall 2018            12
#> 2          1 spring 2019          15
#> 3          2 fall 2020            10
#> 4          3 fall 2020            14
#> 5          3 spring 2021          15
#> 6          3 fall 2021            12
student |>
  left_join(term, join_by(student_id))
#> Warning in left_join(student, term, join_by(student_id)): Each row in `x` is expected to match at most 1 row in `y`.
#> i Row 1 of `x` matches multiple rows.
#> i If multiple matches are expected, set `multiple = "all"` to silence this warning.
#> # A tibble: 6 × 5
#>   student_id transfer initial_term term        course_load
#>        <dbl> <lgl>    <chr>        <chr>             <dbl>
#> 1          1 FALSE    fall 2018    fall 2018            12
#> 2          1 FALSE    fall 2018    spring 2019          15
#> 3          2 TRUE     fall 2020    fall 2020            10
#> 4          3 TRUE     fall 2020    fall 2020            14
#> 5          3 TRUE     fall 2020    spring 2021          15
#> 6          3 TRUE     fall 2020    fall 2021            12

To silence this warning, we encouraged you to set multiple = "all" to be explicit about the fact that you expected a row from x to match multiple rows in y.

The original motivation for this behavior comes from a two-part hypothesis of ours:

  • Users are often surprised when a join returns more rows than the left-hand table started with (in the above example, student has 3 rows but the join result has 6).

  • It is dangerous to allow joins that can result in a Cartesian explosion of the number of rows (i.e. nrow(x) * nrow(y)).

This hypothesis led us to automatically warn on two types of join relationships, one-to-many joins and many-to-many joins. If you aren’t familiar with these terms, here is a quick rundown of the 4 types of join relationships (often discussed in a SQL context), which provide constraints on the number of allowed matches:

  • one-to-one:
    • A row from x can match at most 1 row from y.
    • A row from y can match at most 1 row from x.
  • one-to-many:
    • A row from x can match any number of rows in y.
    • A row from y can match at most 1 row from x.
  • many-to-one:
    • A row from x can match at most 1 row from y.
    • A row from y can match any number of rows in x.
  • many-to-many:
    • A row from x can match any number of rows in y.
    • A row from y can match any number of rows in x.

After gathering some valuable user feedback and conducting an in depth analysis of these join relationships, we’ve determined that the only relationship style actually worth warning on is many-to-many, because that is the one that can result in a Cartesian explosion of rows. In retrospect, the one-to-many relationship is actually quite common, and is symmetrical with many-to-one, which we weren’t warning on. You could actually exploit this fact by switching the above join around, which would silence the warning:

term |>
  left_join(student, join_by(student_id))
#> # A tibble: 6 × 5
#>   student_id term        course_load transfer initial_term
#>        <dbl> <chr>             <dbl> <lgl>    <chr>       
#> 1          1 fall 2018            12 FALSE    fall 2018   
#> 2          1 spring 2019          15 FALSE    fall 2018   
#> 3          2 fall 2020            10 TRUE     fall 2020   
#> 4          3 fall 2020            14 TRUE     fall 2020   
#> 5          3 spring 2021          15 TRUE     fall 2020   
#> 6          3 fall 2021            12 TRUE     fall 2020

We still believe that new users are often surprised when a join returns more rows than they originally started with, but the many-to-one case of this is rarely a problem in practice. So, as of dplyr 1.1.1, we no longer warn on one-to-many relationships, which should drastically reduce the amount of warnings that you see.

Many-to-many relationships

A many-to-many relationship is much harder to construct (which is good). In fact, a database system won’t even let you create one of these “relationships” between two tables directly, instead requiring you to create a third bridge table that turns the many-to-many relationship into two one-to-many relationships. We can “accidentally” create one of these in R though:

course <- tibble(
  student_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  instructor_id = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
  course = c(101, 110, 123, 110, 101, 110, 115, 110, 101),
  term = c(
    "fall 2018", "fall 2018", "spring 2019", "fall 2020", "fall 2020", 
    "fall 2020", "fall 2020", "spring 2021", "fall 2021"
  ),
  grade = c("A", "B", "A", "B", "C", "A", "C", "D", "B")
)

# Information about the courses each student took per semester.
# One row per (student_id, course, term) combination.
course
#> # A tibble: 9 × 5
#>   student_id instructor_id course term        grade
#>        <dbl>         <dbl>  <dbl> <chr>       <chr>
#> 1          1             1    101 fall 2018   A    
#> 2          1             2    110 fall 2018   B    
#> 3          1             3    123 spring 2019 A    
#> 4          2             1    110 fall 2020   B    
#> 5          2             2    101 fall 2020   C    
#> 6          3             1    110 fall 2020   A    
#> 7          3             2    115 fall 2020   C    
#> 8          3             3    110 spring 2021 D    
#> 9          3             4    101 fall 2021   B
# Forgetting to join by both `student_id` and `term`!
term |>
  left_join(course, by = join_by(student_id))
#> Warning in left_join(term, course, by = join_by(student_id)): Detected an unexpected many-to-many relationship between `x` and `y`.
#>  Row 1 of `x` matches multiple rows in `y`.
#>  Row 1 of `y` matches multiple rows in `x`.
#>  If a many-to-many relationship is expected, set `relationship =
#>   "many-to-many"` to silence this warning.
#> # A tibble: 20 × 7
#>    student_id term.x      course_load instructor_id course term.y      grade
#>         <dbl> <chr>             <dbl>         <dbl>  <dbl> <chr>       <chr>
#>  1          1 fall 2018            12             1    101 fall 2018   A    
#>  2          1 fall 2018            12             2    110 fall 2018   B    
#>  3          1 fall 2018            12             3    123 spring 2019 A    
#>  4          1 spring 2019          15             1    101 fall 2018   A    
#>  5          1 spring 2019          15             2    110 fall 2018   B    
#>  6          1 spring 2019          15             3    123 spring 2019 A    
#>  7          2 fall 2020            10             1    110 fall 2020   B    
#>  8          2 fall 2020            10             2    101 fall 2020   C    
#>  9          3 fall 2020            14             1    110 fall 2020   A    
#> 10          3 fall 2020            14             2    115 fall 2020   C    
#> 11          3 fall 2020            14             3    110 spring 2021 D    
#> 12          3 fall 2020            14             4    101 fall 2021   B    
#> 13          3 spring 2021          15             1    110 fall 2020   A    
#> 14          3 spring 2021          15             2    115 fall 2020   C    
#> 15          3 spring 2021          15             3    110 spring 2021 D    
#> 16          3 spring 2021          15             4    101 fall 2021   B    
#> 17          3 fall 2021            12             1    110 fall 2020   A    
#> 18          3 fall 2021            12             2    115 fall 2020   C    
#> 19          3 fall 2021            12             3    110 spring 2021 D    
#> 20          3 fall 2021            12             4    101 fall 2021   B

In the example above, we’ve forgotten to include the term column when joining these two tables together, which accidentally results in a small explosion of rows (we end up with 20 rows, more than in either original input, but not quite the maximum possible amount, which is a whopping 54 rows!). Luckily, dplyr warns us that at least one row in each table matches more than one row in the opposite table - a sign that something isn’t right. At this point we can do one of two things:

  • Look into the new relationship argument that the warning mentions (we’ll discuss this below)

  • Look at our join to see if we made a mistake

Of course, in this case we’ve messed up, and adding term into the by expression results in the correct (and silent) join:

term |>
  left_join(course, by = join_by(student_id, term))
#> # A tibble: 9 × 6
#>   student_id term        course_load instructor_id course grade
#>        <dbl> <chr>             <dbl>         <dbl>  <dbl> <chr>
#> 1          1 fall 2018            12             1    101 A    
#> 2          1 fall 2018            12             2    110 B    
#> 3          1 spring 2019          15             3    123 A    
#> 4          2 fall 2020            10             1    110 B    
#> 5          2 fall 2020            10             2    101 C    
#> 6          3 fall 2020            14             1    110 A    
#> 7          3 fall 2020            14             2    115 C    
#> 8          3 spring 2021          15             3    110 D    
#> 9          3 fall 2021            12             4    101 B

Join relationships

To adjust the joins to only warn on many-to-many relationships, we’ve done two things:

  • multiple now defaults to "all", and is now focused solely on limiting the matches returned if multiple are detected, rather than also optionally warning/erroring.

  • We’ve added a new relationship argument.

The relationship argument allows you to explicitly specify the expected join relationship between the keys of x and y using the exact options we listed above: "one-to-one", "one-to-many", "many-to-one", and "many-to-many". If the constraints of the relationship you choose are violated, an error is thrown. For example, we could use this to require that the student + term join contains a one-to-many relationship between the two tables:

student |>
  left_join(term, join_by(student_id), relationship = "one-to-many")
#> # A tibble: 6 × 5
#>   student_id transfer initial_term term        course_load
#>        <dbl> <lgl>    <chr>        <chr>             <dbl>
#> 1          1 FALSE    fall 2018    fall 2018            12
#> 2          1 FALSE    fall 2018    spring 2019          15
#> 3          2 TRUE     fall 2020    fall 2020            10
#> 4          3 TRUE     fall 2020    fall 2020            14
#> 5          3 TRUE     fall 2020    spring 2021          15
#> 6          3 TRUE     fall 2020    fall 2021            12

Let’s violate this by adding a duplicate row in student:

student_bad <- student |>
  tibble::add_row(
    student_id = 1, 
    transfer = FALSE, 
    initial_term = "fall 2019", 
    .after = 1
  )

student_bad
#> # A tibble: 4 × 3
#>   student_id transfer initial_term
#>        <dbl> <lgl>    <chr>       
#> 1          1 FALSE    fall 2018   
#> 2          1 FALSE    fall 2019   
#> 3          2 TRUE     fall 2020   
#> 4          3 TRUE     fall 2020
student_bad |>
  left_join(term, join_by(student_id), relationship = "one-to-many")
#> Error in `left_join()`:
#> ! Each row in `y` must match at most 1 row in `x`.
#>  Row 1 of `y` matches multiple rows in `x`.

The default value of relationship doesn’t add any constraints, but for equality joins it will check to see if a many-to-many relationship exists, and will warn if one occurs (like with the term + course join from above). As mentioned before, this is quite hard to do, and often means you have a mistake in your join call or in the data itself. If you really do want to perform a join with this kind of relationship, to silence the warning you can explicitly specify relationship = "many-to-many".

One last thing to note is that relationship doesn’t handle the case of an unmatched row. For that, you should use the unmatched argument that was also added in 1.1.0. The combination of relationship and unmatched provides a complete set of tools for adding production level quality control checks to your joins.

Acknowledgements

The examples used in this blog post were adapted from @eipi10 in this issue.

We’d like to thank all 66 contributors who help in someway, whether it was filing issues or contributing code and documentation: @alexhallam, @ammar-gla, @arnaudgallou, @ArthurAndrews, @AuburnEagle-578, @batpigandme, @billdenney, @Bisaloo, @bitplane, @chrarnold, @D5n9sMatrix, @daattali, @DanChaltiel, @DavisVaughan, @dieghernan, @dkutner, @eipi10, @eitsupi, @emilBeBri, @fawda123, @fedassembly, @fkohrt, @gavinsimpson, @geogale, @ggrothendieck, @hadley, @hope-data-science, @jaganmn, @jakub-jedrusiak, @JorisChau, @krlmlr, @krprasangdas, @larry77, @lionel-, @lschneiderbauer, @LukasWallrich, @maellecoursonnais, @manhnguyen48, @mattansb, @mgirlich, @mhaynam, @MichaelChirico, @mine-cetinkaya-rundel, @mkoohafkan, @moodymudskipper, @Moohan, @msgoussi, @multimeric, @osheen1, @Pozdniakov, @psychelzh, @pur80a, @robayo, @rszulkin, @salim-b, @sda030, @sfirke, @shannonpileggi, @stephLH, @szabgab, @tjebo, @Torvaney, @twest820, @vanillajonathan, @warnes, and @zknitter.