dplyr 1.1.0: The power of vctrs


  Davis Vaughan

Today’s dplyr 1.1.0 post is focused on various updates to vector functions, like case_when() and between(). If you missed our previous posts, you can also see the other blog posts in this series. All of dplyr’s vector functions are now backed by vctrs, which typically results in better error messages, better performance, and greater versatility.

install.packages("dplyr")

library(dplyr)

case_when()

If you’ve used case_when() before, you’ve probably written a statement like this:

x <- c(1, 12, -5, 6, -2, NA, 0)
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)
#> Error: `NA` must be <character>, not <logical>.

Like me, you’ve probably forgotten that case_when() has historically been strict about the types on the right-hand side of the ~, which means that I needed to use NA_character_ here instead of NA. Luckily, the switch to vctrs means that the above code now “just works”:

case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)
#> [1] "small" "large" NA      "small" NA      NA      "small"

You’ve probably also written a statement like this:

case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  is.na(x) ~ "missing",
  TRUE ~ "other"
)
#> [1] "small"   "large"   "other"   "small"   "other"   "missing" "small"

In this case, we have a fall-through “default” captured by TRUE ~. This has always felt a little awkward and is fairly difficult to explain to new R users. To make this clearer, we’ve added an explicit .default argument that we encourage you to use instead:

case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  is.na(x) ~ "missing",
  .default = "other"
)
#> [1] "small"   "large"   "other"   "small"   "other"   "missing" "small"

.default will always be processed last, regardless of where you put it in the call to case_when(), so we recommend placing it at the very end.
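As a minimal sketch of that ordering rule, the two calls below differ only in where .default is written. Putting it first is done here purely for illustration and is not a recommended style. Elements whose conditions all evaluate to NA, like the missing value in x, also fall through to .default.

```r
library(dplyr)

x <- c(1, 12, -5, 6, -2, NA, 0)

# .default written last, as recommended
at_end <- case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  .default = "other"
)

# .default written first: it is still only used where no condition matched
at_start <- case_when(
  .default = "other",
  x >= 10 ~ "large",
  x >= 0 ~ "small"
)

identical(at_end, at_start)
#> [1] TRUE

at_end
#> [1] "small" "large" "other" "small" "other" "other" "small"
```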

We haven’t started any formal deprecation process for TRUE ~ yet, but now that there is a better solution available we encourage you to switch over. We do plan to deprecate this feature in the future because it involves some slightly problematic recycling rules (but we wouldn’t even begin this process for at least a year).

case_match()

Another type of case_when() statement you’ve probably written is some kind of value remapping like:

x <- c("USA", "Canada", "Wales", "UK", "China", NA, "Mexico", "Russia")

case_when(
  x %in% c("USA", "Canada", "Mexico") ~ "North America",
  x %in% c("Wales", "UK") ~ "Europe",
  x %in% "China" ~ "Asia"
)
#> [1] "North America" "North America" "Europe"        "Europe"       
#> [5] "Asia"          NA              "North America" NA

Remapping values in this way is so common that SQL gives it its own name: the “simple” case statement. To streamline this further, we’ve taken out some of the repetition involved with x %in% by introducing case_match(), a variant of case_when() that allows you to specify one or more values on the left-hand side of the ~, rather than logical vectors.

case_match(
  x,
  c("USA", "Canada", "Mexico") ~ "North America",
  c("Wales", "UK") ~ "Europe",
  "China" ~ "Asia"
)
#> [1] "North America" "North America" "Europe"        "Europe"       
#> [5] "Asia"          NA              "North America" NA

I think that case_match() is particularly neat because it can be wrapped into an ad-hoc replacement helper if you just need to collapse or replace a few problematic values in a vector, while leaving everything else unchanged:

replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

replace_match(
  x, 
  "USA" ~ "United States", 
  c("UK", "Wales") ~ "United Kingdom",
  NA ~ "[Missing]"
)
#> [1] "United States"  "Canada"         "United Kingdom" "United Kingdom"
#> [5] "China"          "[Missing]"      "Mexico"         "Russia"
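case_match() is not limited to character vectors. Here is a small sketch with a numeric input, where scores and the labels are made-up values for illustration. It shows several values sharing one left-hand side, an explicit NA match, and .default catching everything else.

```r
library(dplyr)

# Hypothetical numeric input, including a missing value
scores <- c(1, 2, 5, 9, NA)

labels <- case_match(
  scores,
  c(1, 2) ~ "low",       # several values can share one right-hand side
  5 ~ "mid",
  NA ~ "missing",        # missing values are matched explicitly
  .default = "high"      # anything not matched above
)

labels
#> [1] "low"     "low"     "mid"     "high"    "missing"
```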

consecutive_id()

At Posit, we have regular company update meetings. Since we are all remote, these meetings are over Zoom. Zoom has a neat feature where it can record the transcript of your call, and it will report who was speaking and what they said. It looks something like this:

transcript <- tribble(
  ~name, ~text,
  "Hadley", "I'll never learn Python.",
  "Davis", "But aren't you speaking at PyCon?",
  "Hadley", "So?",
  "Hadley", "That doesn't influence my decision.",
  "Hadley", "I'm not budging!",
  "Mara", "Typical, Hadley. Stubborn as always.",
  "Davis", "Fair enough!",
  "Davis", "Let's move on."
)

transcript
#> # A tibble: 8 × 2
#>   name   text                                
#>   <chr>  <chr>                               
#> 1 Hadley I'll never learn Python.            
#> 2 Davis  But aren't you speaking at PyCon?   
#> 3 Hadley So?                                 
#> 4 Hadley That doesn't influence my decision. 
#> 5 Hadley I'm not budging!                    
#> 6 Mara   Typical, Hadley. Stubborn as always.
#> 7 Davis  Fair enough!                        
#> 8 Davis  Let's move on.

We were working with this data and wanted a way to collapse each continuous thought down to one line. For example, rows 3-5 all contain a single idea from Hadley, so we’d like those to be collapsed into a single line. This isn’t quite as straightforward as a simple group-by-name and summarise():

transcript |>
  summarise(text = stringr::str_flatten(text, collapse = " "), .by = name)
#> # A tibble: 3 × 2
#>   name   text                                                                   
#>   <chr>  <chr>                                                                  
#> 1 Hadley I'll never learn Python. So? That doesn't influence my decision. I'm n…
#> 2 Davis  But aren't you speaking at PyCon? Fair enough! Let's move on.          
#> 3 Mara   Typical, Hadley. Stubborn as always.

This isn’t quite right because it collapsed the first row where Hadley says “I’ll never learn Python” alongside rows 3-5. We need a way to identify consecutive runs representing when a single person is speaking, which is exactly what consecutive_id() is for!

transcript |>
  mutate(id = consecutive_id(name))
#> # A tibble: 8 × 3
#>   name   text                                    id
#>   <chr>  <chr>                                <int>
#> 1 Hadley I'll never learn Python.                 1
#> 2 Davis  But aren't you speaking at PyCon?        2
#> 3 Hadley So?                                      3
#> 4 Hadley That doesn't influence my decision.      3
#> 5 Hadley I'm not budging!                         3
#> 6 Mara   Typical, Hadley. Stubborn as always.     4
#> 7 Davis  Fair enough!                             5
#> 8 Davis  Let's move on.                           5

consecutive_id() takes one or more columns and generates an integer vector that increments every time a value in one of those columns changes. This gives us something we can group on to correctly flatten our text.

transcript |>
  mutate(id = consecutive_id(name)) |>
  summarise(text = stringr::str_flatten(text, collapse = " "), .by = c(id, name))
#> # A tibble: 5 × 3
#>      id name   text                                                    
#>   <int> <chr>  <chr>                                                   
#> 1     1 Hadley I'll never learn Python.                                
#> 2     2 Davis  But aren't you speaking at PyCon?                       
#> 3     3 Hadley So? That doesn't influence my decision. I'm not budging!
#> 4     4 Mara   Typical, Hadley. Stubborn as always.                    
#> 5     5 Davis  Fair enough! Let's move on.

Grouping by id alone is actually enough, but I’ve also grouped by name for a convenient way to drag the name along into the summary table.

consecutive_id() is inspired by data.table::rleid(), which serves a similar purpose.
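To illustrate the "one or more columns" part, here is a minimal sketch that passes two plain vectors. The names speaker and topic are invented for this example. The id increments whenever the combination of values changes from one position to the next.

```r
library(dplyr)

# Hypothetical inputs: a new run starts when either vector changes
speaker <- c("a", "a", "b", "b", "a")
topic <- c(1, 1, 1, 2, 2)

ids <- consecutive_id(speaker, topic)

ids
#> [1] 1 1 2 3 4
```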

Miscellaneous updates

  • between() is no longer restricted to length 1 left and right boundaries. They are now allowed to be length 1 or the same length as x. Additionally, between() now works with any type supported by vctrs, rather than just with numerics and date-times.

  • if_else() has received the same updates as case_when(). In particular, it is no longer as strict about typed missing values.

  • The ranking functions, like dense_rank(), now allow data frame inputs as a way to rank by multiple columns at once.

  • first(), last(), and nth() have all gained an na_rm argument since they are summary functions.

  • na_if() now casts y to the type of x to make it clear that it is type stable on x. In particular, this means you can no longer do na_if(<tbl>, 0), which previously accidentally allowed you to replace every 0 in every column of the tibble with NA. This function has always been intended as a vector function, and that usage is considered off-label. You can also now replace NaN values in double and complex vectors with NA, using na_if(x, NaN).
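The updates in this list can be sketched briefly as follows. All input values are made up for illustration, and the variable names are not part of dplyr.

```r
library(dplyr)

# between(): bounds may be as long as x and are applied element-wise
in_range <- between(c(1, 5, 10), left = c(0, 6, 9), right = c(2, 8, 10))
in_range
#> [1]  TRUE FALSE  TRUE

# if_else(): a bare NA is cast to the type of the other branch
answer <- if_else(c(TRUE, FALSE, NA), "yes", NA)
answer
#> [1] "yes" NA    NA

# dense_rank(): a data frame input ranks by a, then breaks ties with b
ranks <- dense_rank(tibble(a = c(1, 1, 2), b = c(2, 1, 1)))
ranks
#> [1] 2 1 3

# first(): na_rm = TRUE skips missing values before picking the first one
first_seen <- first(c(NA, 2, 3), na_rm = TRUE)
first_seen
#> [1] 2

# na_if(): NaN can now be replaced with NA
cleaned <- na_if(c(1, NaN, 3), NaN)
cleaned
#> [1]  1 NA  3
```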