dplyr 1.1.0: pick(), reframe(), and arrange()


  Davis Vaughan

In this final dplyr 1.1.0 post, we’ll take a look at two new verbs, pick() and reframe(), along with some changes to arrange() that improve both reproducibility and performance. If you missed our previous posts, you should definitely go back and check them out!

You can install dplyr 1.1.0 from CRAN with:

install.packages("dplyr")

pick()

One thing we noticed after dplyr 1.0.0 was released is that many people like to use across() for its column selection features while working inside a data-masking function like mutate() or summarise(). This is typically useful if you have a function that takes data frames as inputs, or if you need to compute features about a specific subset of columns.

library(dplyr)

df <- tibble(
  x_1 = c(1, 3, 2, 1, 2), 
  x_2 = 6:10, 
  w_4 = 11:15, 
  y_2 = c(5, 2, 4, 0, 6)
)

df |>
  summarise(
    n_x = ncol(across(starts_with("x"))),
    n_y = ncol(across(starts_with("y")))
  )
#> # A tibble: 1 × 2
#>     n_x   n_y
#>   <int> <int>
#> 1     2     1

across() is intended to apply a function to each of these columns, rather than just select them, which is why its name doesn’t feel natural for this operation. In dplyr 1.1.0 we’ve introduced pick(), a specialized column selection variant with a more natural name:

df |>
  summarise(
    n_x = ncol(pick(starts_with("x"))),
    n_y = ncol(pick(starts_with("y")))
  )
#> # A tibble: 1 × 2
#>     n_x   n_y
#>   <int> <int>
#> 1     2     1
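
As a further illustration of handing a selection to a function that expects a data frame, the sketch below counts missing values per row with rowSums() and is.na(). The `scores` tibble and its column names are made up for this example and are not from the release itself.

```r
library(dplyr)

# Hypothetical data (not from the post) with a few missing values
scores <- tibble(
  id  = 1:3,
  x_1 = c(1, NA, 3),
  x_2 = c(NA, NA, 6),
  y_1 = c(7, 8, 9)
)

res <- scores |>
  mutate(
    # pick() returns a tibble of the selected columns;
    # is.na() turns that tibble into a logical matrix
    n_missing_x = rowSums(is.na(pick(starts_with("x")))),
    # rowSums() also accepts a data frame directly
    total_x = rowSums(pick(starts_with("x")), na.rm = TRUE)
  )

res
```

Here `n_missing_x` and `total_x` are computed only from `x_1` and `x_2`; `id` and `y_1` are never passed to rowSums().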

pick() is particularly useful in combination with ranking functions like dense_rank(), which have been upgraded in 1.1.0 to take data frames as inputs, serving as a way to jointly rank by multiple columns at once.

df |>
  mutate(
    rank1 = dense_rank(x_1), 
    rank2 = dense_rank(pick(x_1, y_2)) # Using `y_2` to break ties in `x_1`
  )
#> # A tibble: 5 × 6
#>     x_1   x_2   w_4   y_2 rank1 rank2
#>   <dbl> <int> <int> <dbl> <int> <int>
#> 1     1     6    11     5     1     2
#> 2     3     7    12     2     3     5
#> 3     2     8    13     4     2     3
#> 4     1     9    14     0     1     1
#> 5     2    10    15     6     2     4
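
The data above has no rows that tie on both columns, so it cannot show how joint ties are handled. The following sketch uses a small made-up tibble (`ties`, with hypothetical columns `a` and `b`) and relies on min_rank() accepting a data frame in the same way dense_rank() does.

```r
library(dplyr)

# Hypothetical data: rows 1 and 2 are identical in both columns
ties <- tibble(a = c(1, 1, 2, 1), b = c(5, 5, 3, 0))

ranked <- ties |>
  mutate(
    # dense_rank() leaves no gaps after a tie
    dense = dense_rank(pick(a, b)),
    # min_rank() skips a rank after a tie
    min_rk = min_rank(pick(a, b))
  )

ranked
```

Rows 1 and 2 share a rank under both functions; the two only differ in the rank given to the row that follows the tie.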

We haven’t deprecated using across() without supplying .fns yet, but we plan to in the future now that pick() exists as a better alternative.

reframe()

As we mentioned in the coming soon blog post, in dplyr 1.1.0 we’ve decided to walk back the change we introduced to summarise() in dplyr 1.0.0 that allowed it to return per-group results of any length, rather than results of length 1. We think that the idea of multi-row results is extremely powerful, as it serves as a flexible way to apply arbitrary operations to each group, but we’ve realized that summarise() wasn’t the best home for it because it increases the chance for users to run into silent recycling bugs (thanks to Kirill Müller and David Robinson for bringing this to our attention).

As an example, here we’re computing the mean and standard deviation of x, grouped by g. Unfortunately, I accidentally forgot to use sd(x) and instead just typed x. Because of how tidyverse recycling rules work, the multi-row behavior silently recycled the size 1 mean values instead of erroring, so rather than 2 rows, we end up with 5.

df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

df
#> # A tibble: 5 × 3
#>       g     x     y
#>   <dbl> <dbl> <dbl>
#> 1     1     4     5
#> 2     1     3     1
#> 3     1     6     2
#> 4     2     2     8
#> 5     2     8     9

df |>
  summarise(
    x_average = mean(x),
    x_sd = x, # Oops
    .by = g
  )
#> Warning: Returning more (or less) than 1 row per `summarise()` group was deprecated in
#> dplyr 1.1.0.
#> ℹ Please use `reframe()` instead.
#> ℹ When switching from `summarise()` to `reframe()`, remember that `reframe()`
#>   always returns an ungrouped data frame and adjust accordingly.
#> # A tibble: 5 × 3
#>       g x_average  x_sd
#>   <dbl>     <dbl> <dbl>
#> 1     1      4.33     4
#> 2     1      4.33     3
#> 3     1      4.33     6
#> 4     2      5        2
#> 5     2      5        8

summarise() now throws a warning when any group returns a result that isn’t length 1. We expect to upgrade this to an error in the future to revert summarise() back to its “safe” behavior of requiring 1 row per group.
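
For comparison, here is a sketch of the call as it was presumably intended, with sd(x) in place of x. Every summary now has length 1, so summarise() returns one row per group and should not emit the deprecation warning.

```r
library(dplyr)

df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

fixed <- df |>
  summarise(
    x_average = mean(x),
    x_sd = sd(x), # the summary that was originally intended
    .by = g
  )

# One row per group: 2 rows rather than 5
fixed
```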

summarise() also wasn’t the best name for a function with this feature, as the name itself implies one row per group. After gathering some feedback, we’ve settled on a new verb with a more appropriate name, reframe(). We think of reframe() as a way to “do something” to each group, with no restrictions on the number of rows returned per group. The name has a nice connection to the tibble functions tibble::enframe() and tibble::deframe(), which are used for converting vectors to data frames and vice versa:

  • enframe(): Takes a vector, returns a data frame

  • deframe(): Takes a data frame, returns a vector

  • reframe(): Takes a data frame, returns a data frame
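
The three functions in this list can be tried side by side. This is a minimal sketch: the named vector `v` is invented for illustration, and `df` is the same tibble used in the summarise() example.

```r
library(dplyr)
library(tibble)

v <- c(a = 1, b = 2)

# enframe(): vector in, two-column data frame (name, value) out
ef <- enframe(v)

# deframe(): two-column data frame in, named vector out
dv <- deframe(ef)

df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

# reframe(): data frame in, data frame out, any number of rows per group
rng <- df |>
  reframe(x_range = range(x), .by = g)

# As the deprecation message notes, reframe() returns an ungrouped
# data frame even when its input is grouped
grouped_input <- df |>
  group_by(g) |>
  reframe(x_range = range(x))

is_grouped_df(grouped_input)
```

range() returns two values per group, so `rng` has two rows for each value of `g`.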

One nice application of reframe() is computing quantiles at various probability thresholds. It’s particularly nice if we wrap quantile() into a helper that returns a data frame, which reframe() then automatically unpacks.

quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    value = quantile(x, probs, na.rm = TRUE),
    prob = probs
  )
}

df |>
  reframe(quantile_df(x), .by = g)
#> # A tibble: 6 × 3
#>       g value  prob
#>   <dbl> <dbl> <dbl>
#> 1     1   3.5  0.25
#> 2     1   4    0.5 
#> 3     1   5    0.75
#> 4     2   3.5  0.25
#> 5     2   5    0.5 
#> 6     2   6.5  0.75

This also works well if you want to apply it to multiple columns using across():

df |>
  reframe(across(x:y, quantile_df), .by = g)
#> # A tibble: 6 × 3
#>       g x$value $prob y$value $prob
#>   <dbl>   <dbl> <dbl>   <dbl> <dbl>
#> 1     1     3.5  0.25    1.5   0.25
#> 2     1     4    0.5     2     0.5 
#> 3     1     5    0.75    3.5   0.75
#> 4     2     3.5  0.25    8.25  0.25
#> 5     2     5    0.5     8.5   0.5 
#> 6     2     6.5  0.75    8.75  0.75

Because quantile_df() returns a tibble, we end up with packed data frame columns. You’ll often want to unpack these into their individual columns, and across() has gained a new .unpack argument in 1.1.0 that helps you do exactly that:

df |>
  reframe(across(x:y, quantile_df, .unpack = TRUE), .by = g)
#> # A tibble: 6 × 5
#>       g x_value x_prob y_value y_prob
#>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>
#> 1     1     3.5   0.25    1.5    0.25
#> 2     1     4     0.5     2      0.5 
#> 3     1     5     0.75    3.5    0.75
#> 4     2     3.5   0.25    8.25   0.25
#> 5     2     5     0.5     8.5    0.5 
#> 6     2     6.5   0.75    8.75   0.75
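
If the default `outer_inner` names are not what you want, .unpack is also documented as accepting a glue specification built from `{outer}` and `{inner}`. A sketch reusing `df` and `quantile_df()` from above; the `.` separator is just an illustrative choice.

```r
library(dplyr)

df <- tibble(
  g = c(1, 1, 1, 2, 2),
  x = c(4, 3, 6, 2, 8),
  y = c(5, 1, 2, 8, 9)
)

quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(
    value = quantile(x, probs, na.rm = TRUE),
    prob = probs
  )
}

unpacked <- df |>
  reframe(
    # {outer} is the original column name, {inner} the name inside the packed tibble
    across(x:y, quantile_df, .unpack = "{outer}.{inner}"),
    .by = g
  )

names(unpacked)
```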

We expect that seeing reframe() in a colleague’s code will serve as an extremely clear signal that something “special” is happening, because they’ve made a conscious decision to opt into the 1% case of returning multiple rows per group.

arrange()

We also mentioned in the coming soon post that arrange() has undergone two user-facing changes:

  • When sorting character vectors, the C locale is now the default, rather than the system locale

  • A new .locale argument, powered by stringi, allows you to explicitly request an alternative locale using a stringi locale identifier (like "en" for English, or "fr" for French)
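
To make the difference concrete, here is a small sketch with mixed-case strings (the `letters_df` tibble is made up for illustration). In the C locale, uppercase letters sort before all lowercase letters, whereas a locale such as "en" is expected to group each letter with its uppercase form. The .locale call needs the stringi package, so it is guarded here.

```r
library(dplyr)

letters_df <- tibble(x = c("a", "b", "C", "B", "c"))

# Default in dplyr 1.1.0: C locale, so uppercase sorts before lowercase
c_order <- arrange(letters_df, x)
c_order$x
#> [1] "B" "C" "a" "b" "c"

# Explicit American English ordering via stringi, if it is installed
if (requireNamespace("stringi", quietly = TRUE)) {
  en_order <- arrange(letters_df, x, .locale = "en")
  print(en_order$x)
}
```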

These changes were made for two reasons:

  • Much faster performance by default, due to usage of a custom radix sort algorithm inspired by data.table’s forder()

  • Improved reproducibility across R sessions, where different computers might use different system locales and different operating systems have different ways to specify the same system locale

If you use arrange() for the purpose of grouping similar values together (and don’t care much about the specific locale that it uses to do so), then you’ll likely see performance improvements of up to 100x in dplyr 1.1.0. If you do care about the locale and supply .locale, you should still see improvements of up to 10x.

# 10,000 random strings, sampled up to 1,000,000 rows
dictionary <- stringi::stri_rand_strings(10000, length = 10, pattern = "[a-z]")
str <- tibble(x = sample(dictionary, size = 1e6, replace = TRUE))
str
#> # A tibble: 1,000,000 × 1
#>    x         
#>    <chr>     
#>  1 slpqkdtpyr
#>  2 xtoucpndhc
#>  3 vsvfoqcyqm
#>  4 gnbpkwcmse
#>  5 xutzdqxpsi
#>  6 gkolsrndrz
#>  7 mitqahkkou
#>  8 eehfrrimhd
#>  9 ymxxjczjsv
#> 10 svpvizfxwe
#> # … with 999,990 more rows

# dplyr 1.0.10 (American English system locale)
bench::mark(arrange(str, x))
#> # A tibble: 1 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 arrange(str, x)   4.38s    4.89s     0.204    12.7MB    0.148

# dplyr 1.1.0 (C locale default, 100x faster)
bench::mark(arrange(str, x))
#> # A tibble: 1 × 6
#>   expression          min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>     <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 arrange(str, x)  42.3ms   46.6ms      20.8    22.4MB     46.0

# dplyr 1.1.0 (American English `.locale`, 10x faster)
bench::mark(arrange(str, x, .locale = "en"))
#> # A tibble: 1 × 6
#>   expression                           min median `itr/sec` mem_alloc
#>   <bch:expr>                      <bch:tm> <bch:>     <dbl> <bch:byt>
#> 1 arrange(str, x, .locale = "en")    377ms  430ms      2.21    27.9MB
#> # … with 1 more variable: `gc/sec` <dbl>

We are hopeful that switching to a C locale default will have relatively little impact in exchange for much faster performance. To read more about the exact differences between the C locale and locales like American English or Spanish, see the coming soon post or our detailed tidyup. If you are having trouble converting an existing script over to the new behavior, you can set the temporary global option options(dplyr.legacy_locale = TRUE), which will revert to the pre-1.1.0 behavior of using the system locale. We expect to remove this option in a future release.
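
One way to limit the legacy option to part of a script is sketched below. It relies on options() returning the previous values so they can be restored afterwards. The `words` tibble is made up, and the legacy result depends on your system locale, so no output is shown for it.

```r
library(dplyr)

words <- tibble(x = c("a", "b", "C", "B", "c"))

# Temporarily revert to the pre-1.1.0 system-locale behavior
old <- options(dplyr.legacy_locale = TRUE)
legacy_sorted <- arrange(words, x)

# Restore the previous setting so later code uses the C locale default again
options(old)
default_sorted <- arrange(words, x)
```

Keeping the option scoped like this makes it easier to drop once the script has been updated, which matters because the option is expected to be removed.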

Acknowledgements

A big thanks to the 88 contributors who helped make the 1.1.0 release possible by opening issues, contributing features and documentation, and asking questions! @7708801314520dym, @abalter, @aghaynes, @AlbertRapp, @AlexGaithuma, @algsat, @andrewbaxter439, @andrewpbray, @asadow, @asmlgkj, @barbosawf, @barnabasharris, @bart1, @bergsmat, @chrisbrownlie, @cjyetman, @CNUlichao, @daattali, @DanChaltiel, @davidchall, @DavisVaughan, @ddsjoberg, @donboyd5, @drmowinckels, @dxtxs1, @eitsupi, @eogoodwin, @erhoppe, @eutwt, @ggrothendieck, @grayskripko, @H-Mateus, @hadley, @haozhou1988, @hassanjfry, @Hesham999666, @hideaki, @jeffreypullin, @jic007, @jmbarbone, @jonspring, @jonthegeek, @jpeacock29, @kendonB, @kenkoonwong, @kevinushey, @krlmlr, @larry77, @latot, @lionel-, @llayman12, @LukasWallrich, @m-sostero, @machow, @mc-unimi, @mgacc0, @mgirlich, @MichelleSMA, @mine-cetinkaya-rundel, @moodymudskipper, @moriarais, @NicChr, @nstjhp, @omarwh, @orgadish, @rempsyc, @rorynolan, @ryanvoyack, @selkamand, @seth-cp, @shalom-lab, @shannonpileggi, @simonpcouch, @sjackson1997, @spono, @stibu81, @tfehring, @Theresaliu, @TimBMK, @TimTeaFan, @Torvaney, @turbanisch, @weiyangtham, @wurli, @xet869, @yuliaUU, @yutannihilation, and @zeehio.