dplyr 1.1.0: Joins

  dplyr, dplyr-1-1-0

  Davis Vaughan

dplyr 1.1.0 is out now! This is a giant release, so we’re splitting the release announcement up into four blog posts which we’ll post over the course of this week. Today, we’re focusing on joins, including the new join_by() syntax, new warnings for multiple matches, inequality joins, rolling joins, and new tools for handling unmatched rows. To learn more about joins, you might want to read the updated joins chapter in the upcoming 2nd edition of R for Data Science.

This version of dplyr includes a number of features inspired by our data.table friends. The inequality and rolling joins we discuss today were popularized in R by data.table, and greatly inspired our own implementation.

To see the other blog posts in this series, head here.

You can install it from CRAN with:

install.packages("dplyr")

join_by()

Consider the following two tables, transactions and companies. transactions tracks sales across various years for different companies, and companies connects the short company id to its actual company name - either Patagonia (a fellow B-Corp!) or RStudio.

transactions <- tibble(
  company = c("A", "A", "B", "B"),
  year = c(2019, 2020, 2021, 2023),
  revenue = c(50, 4, 10, 12)
)
transactions
#> # A tibble: 4 × 3
#>   company  year revenue
#>   <chr>   <dbl>   <dbl>
#> 1 A        2019      50
#> 2 A        2020       4
#> 3 B        2021      10
#> 4 B        2023      12

companies <- tibble(
  id = c("A", "B"),
  name = c("Patagonia", "RStudio")
)
companies
#> # A tibble: 2 × 2
#>   id    name     
#>   <chr> <chr>    
#> 1 A     Patagonia
#> 2 B     RStudio

To join these two tables together, we might use an inner join:

transactions |> 
  inner_join(companies, by = c(company = "id"))
#> # A tibble: 4 × 4
#>   company  year revenue name     
#>   <chr>   <dbl>   <dbl> <chr>    
#> 1 A        2019      50 Patagonia
#> 2 A        2020       4 Patagonia
#> 3 B        2021      10 RStudio  
#> 4 B        2023      12 RStudio

This works great, but has always felt a little clunky. Specifying c(company = "id") is a little awkward because it uses =, not ==: here we’re asserting that we want company to equal id, not naming a function argument or performing assignment. We’ve improved on this with a new helper, join_by(), which takes expressions in a way that allows you to more naturally express this join:

join_by(company == id)
#> Join By:
#> - company == id

This join specification can be used as the by argument in any of the *_join() functions:

transactions |> 
  inner_join(companies, by = join_by(company == id))
#> # A tibble: 4 × 4
#>   company  year revenue name     
#>   <chr>   <dbl>   <dbl> <chr>    
#> 1 A        2019      50 Patagonia
#> 2 A        2020       4 Patagonia
#> 3 B        2021      10 RStudio  
#> 4 B        2023      12 RStudio

This small quality of life improvement is just one of the many new features that come with join_by(). We’ll look at more of these next.

Multiple matches


Update: As of March 22, dplyr 1.1.1 is available on CRAN, which alters the behavior of multiple match detection so that you see warnings much less often. Read all about it or install it now with install.packages("dplyr").


To make things a little more interesting, we’ll add one more column to companies, and one more row:

companies <- tibble(
  id = c("A", "B", "B"),
  since = c(1973, 2009, 2022),
  name = c("Patagonia", "RStudio", "Posit")
)

companies
#> # A tibble: 3 × 3
#>   id    since name     
#>   <chr> <dbl> <chr>    
#> 1 A      1973 Patagonia
#> 2 B      2009 RStudio  
#> 3 B      2022 Posit

This table now also tracks name changes that have happened over the course of a company’s history. In 2022, we changed our name from RStudio to Posit, so we’ve tracked that as an additional row in our dataset. Note that both RStudio and Posit are given an id of B, which links back to the transactions table.

If we were to join these two tables together, ideally we’d bring over the name that was in effect when the transaction took place. For example, for the transaction in 2021, the company was still RStudio, so ideally we’d only match up against the RStudio row in companies. If we colored the expected matches, they’d look something like this:

How can we do this? We can try the same join from before, but we won’t like the results:

faulty <- transactions |> 
  inner_join(companies, by = join_by(company == id))
#> Warning in inner_join(transactions, companies, by = join_by(company == id)): Each row in `x` is expected to match at most 1 row in `y`.
#>  Row 3 of `x` matches multiple rows.
#>  If multiple matches are expected, set `multiple = "all"` to silence this
#>   warning.

faulty
#> # A tibble: 6 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2021      10  2022 Posit    
#> 5 B        2023      12  2009 RStudio  
#> 6 B        2023      12  2022 Posit

Company A matches correctly, but since we only joined on the company id, we get multiple matches for each of company B's transactions and end up with more rows than we started with. This is a problem, as we were expecting a 1:1 match for each row in transactions. Multiple matches in equality joins like this one are typically unexpected (even though they are baked in to SQL) so we’ve also added a new warning to alert you when this happens. If multiple matches are expected, you can explicitly set multiple = "all" to silence this warning. This also serves as a code “sign post” for future readers of your code to let them know that this is a join that is expected to increase the number of rows in the data. If multiple matches aren’t expected, you can also set multiple = "error" to immediately halt the analysis. We expect this will be useful as a quality control check for production code where you might rerun analyses with new data on a rolling basis.

Inequality joins

To actually fix this issue, we’ll need to expand our join specification to include another condition. Let’s zoom in to just 2021:

filter(faulty, company == "B", year == 2021)
#> # A tibble: 2 × 5
#>   company  year revenue since name   
#>   <chr>   <dbl>   <dbl> <dbl> <chr>  
#> 1 B        2021      10  2009 RStudio
#> 2 B        2021      10  2022 Posit

We want to retain the match with RStudio, but not with Posit (because the name hasn’t changed yet). One way to express this is by using the year and since columns to state that you only want a match if the transaction year occurred after a name change:

# `year[i] >= since`?
2021 >= 2009
#> [1] TRUE
2021 >= 2022
#> [1] FALSE

Because join_by() accepts expressions, we can express this inequality directly inside the join specification:

join_by(company == id, year >= since)
#> Join By:
#> - company == id
#> - year >= since
transactions |>
  inner_join(companies, join_by(company == id, year >= since))
#> # A tibble: 5 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2009 RStudio  
#> 5 B        2023      12  2022 Posit

This eliminated the 2021 match to Posit, as expected! This type of join is known as an inequality join, i.e. it involves at least one join expression containing one of the following inequality conditions: >=, >, <=, or <.

However, we still have 2 matches corresponding to the 2023 year. In this case, we only wanted the match to Posit. We can understand why we are still getting multiple matches here by running the same row-by-row analysis as before:

# `year[i] >= since`? Both are true!
2023 >= 2009
#> [1] TRUE
2023 >= 2022
#> [1] TRUE

To remove the last problematic match of the 2023 transaction to the RStudio name, we’ll need to refine our join specification one more time.

Rolling joins

Inequality conditions like year >= since are powerful, but since the condition is only bounded on one side it is common for them to return a large number of matches. Since multiple matches are the typical case with inequality joins, we don’t get a warning like with the equality join, but we clearly still haven’t gotten the join right. As a reminder, here are where we still have too many matches:

transactions |>
  inner_join(companies, join_by(company == id, year >= since)) |>
  filter(company == "B", year == 2023)
#> # A tibble: 2 × 5
#>   company  year revenue since name   
#>   <chr>   <dbl>   <dbl> <dbl> <chr>  
#> 1 B        2023      12  2009 RStudio
#> 2 B        2023      12  2022 Posit

We need a way to filter down the matches returned from year >= since to only the most recent name change. In other words, we prefer the Posit match over the RStudio match because 2022 is closer to the transaction year of 2023 than 2009 is. We can express this in join_by() by using a helper named closest().

transactions |>
  inner_join(companies, join_by(company == id, closest(year >= since)))
#> # A tibble: 4 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2022 Posit

closest(year >= since) finds all of the matches in since for a particular year, and then filters them down to only the closest match to that year. This is known as a rolling join, because in this case it rolls the most recent name change forward to match up with the transaction. Rolling joins were popularized by data.table, and are related to ASOF joins supported by some SQL flavors.

There is a third new class of joins supported by join_by() that we won’t discuss today known as overlap joins. These are particularly useful in time series where you are looking for cases where a date or range of dates from one table overlaps a range of dates in another table. There are three helpers for overlap joins: between(), overlaps(), and within(), which you can read more about in the documentation.

Unmatched rows

I mentioned earlier that we expected a 1:1 match between transactions and companies. We saw that multiple can help protect us from having too many matches, but what about not having enough? Consider what happens if we add a new company to transactions without a corresponding match in companies.

transactions <- transactions |>
  tibble::add_row(company = "C", year = 2023, revenue = 15)

transactions
#> # A tibble: 5 × 3
#>   company  year revenue
#>   <chr>   <dbl>   <dbl>
#> 1 A        2019      50
#> 2 A        2020       4
#> 3 B        2021      10
#> 4 B        2023      12
#> 5 C        2023      15
transactions |>
  inner_join(
    companies, 
    join_by(company == id, closest(year >= since))
  )
#> # A tibble: 4 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2022 Posit

We’ve accidentally lost the C row! If you don’t expect any unmatched rows, you can now catch this problem automatically by using our other new quality control argument, unmatched:

transactions |>
  inner_join(
    companies, 
    join_by(company == id, closest(year >= since)),
    unmatched = "error"
  )
#> Error in `inner_join()`:
#> ! Each row of `x` must have a match in `y`.
#>  Row 5 of `x` does not have a match.

If you’ve been questioning why I’ve been using an inner_join() over a left_join() this whole time, unmatched is why. We could use a left_join():

transactions |>
  left_join(
    companies, 
    join_by(company == id, closest(year >= since)),
    unmatched = "error"
  )
#> # A tibble: 5 × 5
#>   company  year revenue since name     
#>   <chr>   <dbl>   <dbl> <dbl> <chr>    
#> 1 A        2019      50  1973 Patagonia
#> 2 A        2020       4  1973 Patagonia
#> 3 B        2021      10  2009 RStudio  
#> 4 B        2023      12  2022 Posit    
#> 5 C        2023      15    NA NA

But you’ll notice that we don’t get an error here. unmatched will only error if the input that has the potential to drop rows has an unmatched row. The reason you’d use a left_join() is to ensure that rows from x are always retained, so it wouldn’t make sense to error when rows from x are also unmatched. If y had unmatched rows instead, then it would have errored because those rows would otherwise be lost from the join. In an inner_join(), both inputs can potentially drop rows, so unmatched = "error" checks for unmatched rows in both inputs.