Today’s dplyr 1.1.0 post is focused on various updates to vector functions, like case_when() and between(). If you missed our previous posts, you can also see the other blog posts in this series. All of dplyr’s vector functions are now backed by vctrs, which typically results in better error messages, better performance, and greater versatility.
install.packages("dplyr")
library(dplyr)
case_when()
If you’ve used case_when() before, you’ve probably written a statement like this:
x <- c(1, 12, -5, 6, -2, NA, 0)
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)
#> Error: `NA` must be <character>, not <logical>.
Like me, you’ve probably forgotten that case_when() has historically been strict about the types on the right-hand side of the ~, which means that I needed to use NA_character_ here instead of NA. Luckily, the switch to vctrs means that the above code now “just works”:
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)
#> [1] "small" "large" NA "small" NA NA "small"
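The relaxed handling of NA does not mean that anything goes on the right-hand side. As a rough sketch (the specific values here are my own, not from the release notes), outputs that have no common type under vctrs, such as a string and a number, should still be rejected. The tryCatch() wrapper is only there so the example keeps running after the error:

```r
library(dplyr)

x <- c(1, 12, -5, 6, -2, NA, 0)

# An untyped NA is fine: it is cast to the common type of the other outputs.
ok <- case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  x < 0 ~ NA
)

# Character and double outputs have no common type, so this should error.
# The handler returns a marker string instead of stopping the script.
res <- tryCatch(
  case_when(
    x >= 10 ~ "large",
    x >= 0 ~ 1
  ),
  error = function(e) "incompatible types"
)
res
```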
You’ve probably also written a statement like this:
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  is.na(x) ~ "missing",
  TRUE ~ "other"
)
#> [1] "small" "large" "other" "small" "other" "missing" "small"
In this case, we have a fall-through “default” captured by TRUE ~. This has always felt a little awkward and is fairly difficult to explain to new R users. To make this clearer, we’ve added an explicit .default argument that we encourage you to use instead:
case_when(
  x >= 10 ~ "large",
  x >= 0 ~ "small",
  is.na(x) ~ "missing",
  .default = "other"
)
#> [1] "small" "large" "other" "small" "other" "missing" "small"
.default will always be processed last, regardless of where you put it in the call to case_when(), so we recommend placing it at the very end.
We haven’t started any formal deprecation process for TRUE ~ yet, but now that there is a better solution available we encourage you to switch over. We do plan to deprecate this feature in the future because it involves some slightly problematic recycling rules (but we wouldn’t even begin this process for at least a year).
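One place where .default reads more naturally than TRUE ~ is when the fallback is itself a vector rather than a single value. The sketch below assumes, as I understand the documentation, that .default may be either length 1 or the same length as the conditions; the choice of as.character(x) as the fallback is just for illustration:

```r
library(dplyr)

x <- c(1, 12, -5, 6, -2, NA, 0)

# Label large values and fall back to the original value (as text) elsewhere.
# The NA in x matches no condition, so it takes the default, which is also NA.
labelled <- case_when(
  x >= 10 ~ "large",
  .default = as.character(x)
)
labelled
#> [1] "1"     "large" "-5"    "6"     "-2"    NA      "0"
```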
case_match()
Another type of case_when() statement you’ve probably written is some kind of value remapping like:
x <- c("USA", "Canada", "Wales", "UK", "China", NA, "Mexico", "Russia")
case_when(
  x %in% c("USA", "Canada", "Mexico") ~ "North America",
  x %in% c("Wales", "UK") ~ "Europe",
  x %in% "China" ~ "Asia"
)
#> [1] "North America" "North America" "Europe" "Europe"
#> [5] "Asia" NA "North America" NA
Remapping values in this way is so common that SQL gives it its own name: the “simple” case statement. To streamline this further, we’ve taken out some of the repetition involved with x %in% by introducing case_match(), a variant of case_when() that allows you to specify one or more values on the left-hand side of the ~, rather than logical vectors.
case_match(
  x,
  c("USA", "Canada", "Mexico") ~ "North America",
  c("Wales", "UK") ~ "Europe",
  "China" ~ "Asia"
)
#> [1] "North America" "North America" "Europe"        "Europe"
#> [5] "Asia"          NA              "North America" NA
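case_match() is not limited to strings. As a small sketch with made-up data (the score vector and labels are hypothetical), here is the same pattern on a numeric vector, matching NA explicitly and using .default for everything left over:

```r
library(dplyr)

score <- c(1, 2, 3, NA, 5)

# Each left-hand side lists the values to match; NA can be matched directly.
bucket <- case_match(
  score,
  c(1, 2) ~ "low",
  3 ~ "mid",
  NA ~ "missing",
  .default = "high"
)
bucket
#> [1] "low"     "low"     "mid"     "missing" "high"
```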
I think that case_match() is particularly neat because it can be wrapped into an ad-hoc replacement helper if you just need to collapse or replace a few problematic values in a vector, while leaving everything else unchanged:
replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}
replace_match(
  x,
  "USA" ~ "United States",
  c("UK", "Wales") ~ "United Kingdom",
  NA ~ "[Missing]"
)
#> [1] "United States" "Canada" "United Kingdom" "United Kingdom"
#> [5] "China" "[Missing]" "Mexico" "Russia"
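Because the helper passes .ptype = x, the result keeps the type of the input, so the same trick should work for non-character vectors too. Here is a hedged sketch that blanks out sentinel codes in a numeric vector; the readings data and the -99 and -1 codes are invented for the example:

```r
library(dplyr)

# Same helper as above, repeated so this example runs on its own.
replace_match <- function(x, ...) {
  case_match(x, ..., .default = x, .ptype = x)
}

readings <- c(1.5, -99, 3.2, -1, 4)

# Turn the sentinel codes into NA; every other value is left untouched.
# The untyped NA is cast to double because of .ptype = x.
cleaned <- replace_match(readings, c(-99, -1) ~ NA)
cleaned
#> [1] 1.5  NA 3.2  NA 4.0
```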
consecutive_id()
At Posit, we have regular company update meetings. Since we are all remote, these meetings are over Zoom. Zoom has a neat feature where it can record the transcript of your call, and it will report who was speaking and what they said. It looks something like this:
transcript <- tribble(
  ~name, ~text,
  "Hadley", "I'll never learn Python.",
  "Davis", "But aren't you speaking at PyCon?",
  "Hadley", "So?",
  "Hadley", "That doesn't influence my decision.",
  "Hadley", "I'm not budging!",
  "Mara", "Typical, Hadley. Stubborn as always.",
  "Davis", "Fair enough!",
  "Davis", "Let's move on."
)
transcript
#> # A tibble: 8 × 2
#>   name   text
#>   <chr>  <chr>
#> 1 Hadley I'll never learn Python.
#> 2 Davis  But aren't you speaking at PyCon?
#> 3 Hadley So?
#> 4 Hadley That doesn't influence my decision.
#> 5 Hadley I'm not budging!
#> 6 Mara   Typical, Hadley. Stubborn as always.
#> 7 Davis  Fair enough!
#> 8 Davis  Let's move on.
We were working with this data and wanted a way to collapse each continuous thought down to one line. For example, rows 3-5 all contain a single idea from Hadley, so we’d like those to be collapsed into a single line. This isn’t quite as straightforward as a simple group-by-name and summarise():
transcript |>
  summarise(text = stringr::str_flatten(text, collapse = " "), .by = name)
#> # A tibble: 3 × 2
#>   name   text
#>   <chr>  <chr>
#> 1 Hadley I'll never learn Python. So? That doesn't influence my decision. I'm n…
#> 2 Davis  But aren't you speaking at PyCon? Fair enough! Let's move on.
#> 3 Mara   Typical, Hadley. Stubborn as always.
This isn’t quite right because it collapsed the first row where Hadley says “I’ll never learn Python” alongside rows 3-5. We need a way to identify consecutive runs representing when a single person is speaking, which is exactly what consecutive_id() is for!
transcript |>
  mutate(id = consecutive_id(name))
#> # A tibble: 8 × 3
#>   name   text                                    id
#>   <chr>  <chr>                                <int>
#> 1 Hadley I'll never learn Python.                 1
#> 2 Davis  But aren't you speaking at PyCon?        2
#> 3 Hadley So?                                      3
#> 4 Hadley That doesn't influence my decision.      3
#> 5 Hadley I'm not budging!                         3
#> 6 Mara   Typical, Hadley. Stubborn as always.     4
#> 7 Davis  Fair enough!                             5
#> 8 Davis  Let's move on.                           5
consecutive_id() takes one or more columns and generates an integer vector that increments every time a value in one of those columns changes. This gives us something we can group on to correctly flatten our text.
transcript |>
  mutate(id = consecutive_id(name)) |>
  summarise(text = stringr::str_flatten(text, collapse = " "), .by = c(id, name))
#> # A tibble: 5 × 3
#>      id name   text
#>   <int> <chr>  <chr>
#> 1     1 Hadley I'll never learn Python.
#> 2     2 Davis  But aren't you speaking at PyCon?
#> 3     3 Hadley So? That doesn't influence my decision. I'm not budging!
#> 4     4 Mara   Typical, Hadley. Stubborn as always.
#> 5     5 Davis  Fair enough! Let's move on.
Grouping by id alone is actually enough, but I’ve also grouped by name for a convenient way to drag the name along into the summary table.
consecutive_id() is inspired by data.table::rleid(), which serves a similar purpose.
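To see the “increments every time a value changes” rule outside of a data frame, here is a small sketch on bare vectors. The inputs are made up for illustration. Note that a run of NA values counts as a single run, and that with several inputs a change in any one of them starts a new id:

```r
library(dplyr)

# One input: a new id starts whenever the value differs from the previous one.
single <- consecutive_id(c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, NA, NA))
single
#> [1] 1 1 2 2 3 4 5 5

# Two inputs: the id changes when either vector changes.
speaker <- c("a", "a", "a", "b")
topic <- c(1, 1, 2, 2)
paired <- consecutive_id(speaker, topic)
paired
#> [1] 1 1 2 3
```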
Miscellaneous updates
- between() is no longer restricted to length 1 left and right boundaries. They are now allowed to be length 1 or the same length as x. Additionally, between() now works with any type supported by vctrs, rather than just with numerics and date-times.
- if_else() has received the same updates as case_when(). In particular, it is no longer as strict about typed missing values.
- The ranking functions, like dense_rank(), now allow data frame inputs as a way to rank by multiple columns at once.
- first(), last(), and nth() have all gained an na_rm argument since they are summary functions.
- na_if() now casts y to the type of x to make it clear that it is type stable on x. In particular, this means you can no longer do na_if(<tbl>, 0), which previously accidentally allowed you to attempt to replace every 0 in every column with NA. This function has always been intended as a vector function, and this is considered off-label usage. It also now replaces NaN values in double and complex vectors.
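The updates in the list above can be tried out with a few one-liners. This is only a sketch with invented inputs, not an exhaustive tour, and it sticks to the vector usage that these functions are intended for:

```r
library(dplyr)

# between(): bounds may be length 1 or the same length as x.
in_range <- between(c(1, 5, 10), left = c(0, 6, 10), right = 10)

# if_else(): an untyped NA is accepted alongside a character value.
answer <- if_else(c(TRUE, FALSE, NA), "yes", NA)

# dense_rank(): a data frame input ranks by its columns from left to right.
ranks <- dense_rank(tibble(a = c(1, 1, 2), b = c(2, 1, 1)))

# first() and last(): na_rm = TRUE skips missing values.
first_value <- first(c(NA, 2, 3), na_rm = TRUE)
last_value <- last(c(1, 2, NA), na_rm = TRUE)

# na_if(): values of x that are equal to y become NA.
zeros_removed <- na_if(c(1, 0, 3, 0), 0)
```

Here in_range is TRUE, FALSE, TRUE, ranks is 2, 1, 3, and both first_value and last_value are 2.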