We’re stoked to announce the release of dplyr 1.1.1. We don’t typically blog about patch releases, because they generally only fix bugs without significantly changing behavior, but this one includes two important updates:
- Addressing various performance regressions
- Refining the
multiple
match warning thrown by dplyr’s joins
You can see a full list of changes in the release notes. To see the other blog posts in the dplyr 1.1.0 series, head here.
You can install dplyr 1.1.1 from CRAN with:
install.packages("dplyr")
Performance regressions
In the 1.1.0 post on vctrs, we discussed that we’ve rewritten all of dplyr’s vector functions on top of vctrs for improved versatility. Unfortunately, we accidentally made two sets of functions much slower, especially when used on a data frame with many groups:
-
case_when()
andif_else()
These performance issues have been addressed, and should be back to 1.0.10 level of performance.
case_when()
is still slightly slower than 1.0.10, but it isn’t likely to be very noticeable, and we already have plans to improve this further in a future release.
Revisiting multiple matches
In the
1.1.0 post on joins, we discussed the new multiple
argument that was added to
left_join()
and friends, which had a built in safety check that warned when you performed a join where a row from x
matched more than one row from y
. The TLDR of the discussion below is that we’ve realized that this warning was being thrown in too many cases, so we’ve adjusted it in such a way that it now only catches the most dangerous type of join (a many-to-many join), meaning that you should see the warning much less often.
As a reminder, multiple
determines what happens when a row from x
matches more than one row from y
. You can choose to return "all"
of the matches, the "first"
or "last"
match, or "any"
of the matches if you are just interested in detecting if there is at least one. multiple
defaulted to a behavior similar to "all"
, with the added side effect of throwing a warning if multiple matches were actually detected, like this:
student <- tibble(
student_id = c(1, 2, 3),
transfer = c(FALSE, TRUE, TRUE),
initial_term = c("fall 2018", "fall 2020", "fall 2020")
)
term <- tibble(
student_id = c(1, 1, 2, 3, 3, 3),
term = c("fall 2018", "spring 2019", "fall 2020", "fall 2020", "spring 2021", "fall 2021"),
course_load = c(12, 15, 10, 14, 15, 12)
)
# Information about students attending a university.
# One row per (student_id).
student
#> # A tibble: 3 × 3
#> student_id transfer initial_term
#> <dbl> <lgl> <chr>
#> 1 1 FALSE fall 2018
#> 2 2 TRUE fall 2020
#> 3 3 TRUE fall 2020
# Term specific information about each student.
# One row per (student_id, term) combination.
term
#> # A tibble: 6 × 3
#> student_id term course_load
#> <dbl> <chr> <dbl>
#> 1 1 fall 2018 12
#> 2 1 spring 2019 15
#> 3 2 fall 2020 10
#> 4 3 fall 2020 14
#> 5 3 spring 2021 15
#> 6 3 fall 2021 12
student |>
left_join(term, join_by(student_id))
#> Warning in left_join(student, term, join_by(student_id)): Each row in `x` is expected to match at most 1 row in `y`.
#> i Row 1 of `x` matches multiple rows.
#> i If multiple matches are expected, set `multiple = "all"` to silence this warning.
#> # A tibble: 6 × 5
#> student_id transfer initial_term term course_load
#> <dbl> <lgl> <chr> <chr> <dbl>
#> 1 1 FALSE fall 2018 fall 2018 12
#> 2 1 FALSE fall 2018 spring 2019 15
#> 3 2 TRUE fall 2020 fall 2020 10
#> 4 3 TRUE fall 2020 fall 2020 14
#> 5 3 TRUE fall 2020 spring 2021 15
#> 6 3 TRUE fall 2020 fall 2021 12
To silence this warning, we encouraged you to set multiple = "all"
to be explicit about the fact that you expected a row from x
to match multiple rows in y
.
The original motivation for this behavior comes from a two-part hypothesis of ours:
-
Users are often surprised when a join returns more rows than the left-hand table started with (in the above example,
student
has 3 rows but the join result has 6). -
It is dangerous to allow joins that can result in a Cartesian explosion of the number of rows (i.e.
nrow(x) * nrow(y)
).
This hypothesis led us to automatically warn on two types of join relationships, one-to-many joins and many-to-many joins. If you aren’t familiar with these terms, here is a quick rundown of the 4 types of join relationships (often discussed in a SQL context), which provide constraints on the number of allowed matches:
- one-to-one:
- A row from
x
can match at most 1 row fromy
. - A row from
y
can match at most 1 row fromx
.
- A row from
- one-to-many:
- A row from
x
can match any number of rows iny
. - A row from
y
can match at most 1 row fromx
.
- A row from
- many-to-one:
- A row from
x
can match at most 1 row fromy
. - A row from
y
can match any number of rows inx
.
- A row from
- many-to-many:
- A row from
x
can match any number of rows iny
. - A row from
y
can match any number of rows inx
.
- A row from
After gathering some valuable user feedback and conducting an in depth analysis of these join relationships, we’ve determined that the only relationship style actually worth warning on is many-to-many, because that is the one that can result in a Cartesian explosion of rows. In retrospect, the one-to-many relationship is actually quite common, and is symmetrical with many-to-one, which we weren’t warning on. You could actually exploit this fact by switching the above join around, which would silence the warning:
term |>
left_join(student, join_by(student_id))
#> # A tibble: 6 × 5
#> student_id term course_load transfer initial_term
#> <dbl> <chr> <dbl> <lgl> <chr>
#> 1 1 fall 2018 12 FALSE fall 2018
#> 2 1 spring 2019 15 FALSE fall 2018
#> 3 2 fall 2020 10 TRUE fall 2020
#> 4 3 fall 2020 14 TRUE fall 2020
#> 5 3 spring 2021 15 TRUE fall 2020
#> 6 3 fall 2021 12 TRUE fall 2020
We still believe that new users are often surprised when a join returns more rows than they originally started with, but the many-to-one case of this is rarely a problem in practice. So, as of dplyr 1.1.1, we no longer warn on one-to-many relationships, which should drastically reduce the amount of warnings that you see.
Many-to-many relationships
A many-to-many relationship is much harder to construct (which is good). In fact, a database system won’t even let you create one of these “relationships” between two tables directly, instead requiring you to create a third bridge table that turns the many-to-many relationship into two one-to-many relationships. We can “accidentally” create one of these in R though:
course <- tibble(
student_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
instructor_id = c(1, 2, 3, 1, 2, 1, 2, 3, 4),
course = c(101, 110, 123, 110, 101, 110, 115, 110, 101),
term = c(
"fall 2018", "fall 2018", "spring 2019", "fall 2020", "fall 2020",
"fall 2020", "fall 2020", "spring 2021", "fall 2021"
),
grade = c("A", "B", "A", "B", "C", "A", "C", "D", "B")
)
# Information about the courses each student took per semester.
# One row per (student_id, course, term) combination.
course
#> # A tibble: 9 × 5
#> student_id instructor_id course term grade
#> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 1 101 fall 2018 A
#> 2 1 2 110 fall 2018 B
#> 3 1 3 123 spring 2019 A
#> 4 2 1 110 fall 2020 B
#> 5 2 2 101 fall 2020 C
#> 6 3 1 110 fall 2020 A
#> 7 3 2 115 fall 2020 C
#> 8 3 3 110 spring 2021 D
#> 9 3 4 101 fall 2021 B
# Forgetting to join by both `student_id` and `term`!
term |>
left_join(course, by = join_by(student_id))
#> Warning in left_join(term, course, by = join_by(student_id)): Detected an unexpected many-to-many relationship between `x` and `y`.
#> ℹ Row 1 of `x` matches multiple rows in `y`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
#> ℹ If a many-to-many relationship is expected, set `relationship =
#> "many-to-many"` to silence this warning.
#> # A tibble: 20 × 7
#> student_id term.x course_load instructor_id course term.y grade
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 1 fall 2018 12 1 101 fall 2018 A
#> 2 1 fall 2018 12 2 110 fall 2018 B
#> 3 1 fall 2018 12 3 123 spring 2019 A
#> 4 1 spring 2019 15 1 101 fall 2018 A
#> 5 1 spring 2019 15 2 110 fall 2018 B
#> 6 1 spring 2019 15 3 123 spring 2019 A
#> 7 2 fall 2020 10 1 110 fall 2020 B
#> 8 2 fall 2020 10 2 101 fall 2020 C
#> 9 3 fall 2020 14 1 110 fall 2020 A
#> 10 3 fall 2020 14 2 115 fall 2020 C
#> 11 3 fall 2020 14 3 110 spring 2021 D
#> 12 3 fall 2020 14 4 101 fall 2021 B
#> 13 3 spring 2021 15 1 110 fall 2020 A
#> 14 3 spring 2021 15 2 115 fall 2020 C
#> 15 3 spring 2021 15 3 110 spring 2021 D
#> 16 3 spring 2021 15 4 101 fall 2021 B
#> 17 3 fall 2021 12 1 110 fall 2020 A
#> 18 3 fall 2021 12 2 115 fall 2020 C
#> 19 3 fall 2021 12 3 110 spring 2021 D
#> 20 3 fall 2021 12 4 101 fall 2021 B
In the example above, we’ve forgotten to include the term
column when joining these two tables together, which accidentally results in a small explosion of rows (we end up with 20 rows, more than in either original input, but not quite the maximum possible amount, which is a whopping 54 rows!). Luckily, dplyr warns us that at least one row in each table matches more than one row in the opposite table - a sign that something isn’t right. At this point we can do one of two things:
-
Look into the new
relationship
argument that the warning mentions (we’ll discuss this below) -
Look at our join to see if we made a mistake
Of course, in this case we’ve messed up, and adding term
into the by expression results in the correct (and silent) join:
term |>
left_join(course, by = join_by(student_id, term))
#> # A tibble: 9 × 6
#> student_id term course_load instructor_id course grade
#> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
#> 1 1 fall 2018 12 1 101 A
#> 2 1 fall 2018 12 2 110 B
#> 3 1 spring 2019 15 3 123 A
#> 4 2 fall 2020 10 1 110 B
#> 5 2 fall 2020 10 2 101 C
#> 6 3 fall 2020 14 1 110 A
#> 7 3 fall 2020 14 2 115 C
#> 8 3 spring 2021 15 3 110 D
#> 9 3 fall 2021 12 4 101 B
Join relationship
s
To adjust the joins to only warn on many-to-many relationships, we’ve done two things:
-
multiple
now defaults to"all"
, and is now focused solely on limiting the matches returned if multiple are detected, rather than also optionally warning/erroring. -
We’ve added a new
relationship
argument.
The relationship
argument allows you to explicitly specify the expected join relationship between the keys of x
and y
using the exact options we listed above: "one-to-one"
, "one-to-many"
, "many-to-one"
, and "many-to-many"
. If the constraints of the relationship you choose are violated, an error is thrown. For example, we could use this to require that the student
+ term
join contains a one-to-many relationship between the two tables:
student |>
left_join(term, join_by(student_id), relationship = "one-to-many")
#> # A tibble: 6 × 5
#> student_id transfer initial_term term course_load
#> <dbl> <lgl> <chr> <chr> <dbl>
#> 1 1 FALSE fall 2018 fall 2018 12
#> 2 1 FALSE fall 2018 spring 2019 15
#> 3 2 TRUE fall 2020 fall 2020 10
#> 4 3 TRUE fall 2020 fall 2020 14
#> 5 3 TRUE fall 2020 spring 2021 15
#> 6 3 TRUE fall 2020 fall 2021 12
Let’s violate this by adding a duplicate row in student
:
student_bad <- student |>
tibble::add_row(
student_id = 1,
transfer = FALSE,
initial_term = "fall 2019",
.after = 1
)
student_bad
#> # A tibble: 4 × 3
#> student_id transfer initial_term
#> <dbl> <lgl> <chr>
#> 1 1 FALSE fall 2018
#> 2 1 FALSE fall 2019
#> 3 2 TRUE fall 2020
#> 4 3 TRUE fall 2020
student_bad |>
left_join(term, join_by(student_id), relationship = "one-to-many")
#> Error in `left_join()`:
#> ! Each row in `y` must match at most 1 row in `x`.
#> ℹ Row 1 of `y` matches multiple rows in `x`.
The default value of relationship
doesn’t add any constraints, but for equality joins it will check to see if a many-to-many relationship exists, and will warn if one occurs (like with the term
+ course
join from above). As mentioned before, this is quite hard to do, and often means you have a mistake in your join call or in the data itself. If you really do want to perform a join with this kind of relationship, to silence the warning you can explicitly specify relationship = "many-to-many"
.
One last thing to note is that relationship
doesn’t handle the case of an unmatched row. For that, you should use the unmatched
argument that was also added in 1.1.0. The combination of relationship
and unmatched
provides a complete set of tools for adding production level quality control checks to your joins.
Acknowledgements
The examples used in this blog post were adapted from @eipi10 in this issue.
We’d like to thank all 66 contributors who help in someway, whether it was filing issues or contributing code and documentation: @alexhallam, @ammar-gla, @arnaudgallou, @ArthurAndrews, @AuburnEagle-578, @batpigandme, @billdenney, @Bisaloo, @bitplane, @chrarnold, @D5n9sMatrix, @daattali, @DanChaltiel, @DavisVaughan, @dieghernan, @dkutner, @eipi10, @eitsupi, @emilBeBri, @fawda123, @fedassembly, @fkohrt, @gavinsimpson, @geogale, @ggrothendieck, @hadley, @hope-data-science, @jaganmn, @jakub-jedrusiak, @JorisChau, @krlmlr, @krprasangdas, @larry77, @lionel-, @lschneiderbauer, @LukasWallrich, @maellecoursonnais, @manhnguyen48, @mattansb, @mgirlich, @mhaynam, @MichaelChirico, @mine-cetinkaya-rundel, @mkoohafkan, @moodymudskipper, @Moohan, @msgoussi, @multimeric, @osheen1, @Pozdniakov, @psychelzh, @pur80a, @robayo, @rszulkin, @salim-b, @sda030, @sfirke, @shannonpileggi, @stephLH, @szabgab, @tjebo, @Torvaney, @twest820, @vanillajonathan, @warnes, and @zknitter.