dplyr 1.0.4: if_any() and if_all()

  Romain Francois

We’re happy to announce the release of dplyr 1.0.4, featuring two new functions, if_all() and if_any(), and performance improvements to across().

You can install it from CRAN with:

install.packages("dplyr")

You can see a full list of changes in the release notes.

if_any() and if_all()

The new across() function introduced as part of dplyr 1.0.0 is proving to be a successful addition to dplyr. In case you missed it, across() lets you conveniently express a set of actions to be performed across a tidy selection of columns.
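As a quick refresher, here is a minimal sketch of across() inside summarise(). The tibble `toy` and its columns are made up for illustration and do not come from this post:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 2, 3),
  y = c(10, 20, 30)
)

# Apply mean() to every numeric column, within each group
res <- toy %>%
  group_by(g) %>%
  summarise(across(where(is.numeric), mean))

res
```

The grouping column g is left out of the selection automatically, so only x and y are averaged.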

across() is very useful within summarise() and mutate(), but it’s hard to use with filter() because it is not clear how the per-column results should be combined into a single logical vector. To fill that gap, we’re introducing two new functions, if_all() and if_any(). Let’s dive straight into an example:

library(dplyr, warn.conflicts = FALSE)
library(palmerpenguins)

big <- function(x) {
  x > mean(x, na.rm = TRUE)
}

# keep rows if all the selected columns are "big"
penguins %>% 
  filter(if_all(contains("bill"), big))
#> # A tibble: 61 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           46            21.5              194        4200
#>  2 Adelie  Dream            44.1          19.7              196        4400
#>  3 Adelie  Torge…           45.8          18.9              197        4150
#>  4 Adelie  Biscoe           45.6          20.3              191        4600
#>  5 Adelie  Torge…           44.1          18                210        4000
#>  6 Gentoo  Biscoe           44.4          17.3              219        5250
#>  7 Gentoo  Biscoe           50.8          17.3              228        5600
#>  8 Chinst… Dream            46.5          17.9              192        3500
#>  9 Chinst… Dream            50            19.5              196        3900
#> 10 Chinst… Dream            51.3          19.2              193        3650
#> # … with 51 more rows, and 2 more variables: sex <fct>, year <int>

# keep rows where at least one of the columns is "big"
penguins %>% 
  filter(if_any(contains("bill"), big))
#> # A tibble: 296 x 8
#>    species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g
#>    <fct>   <fct>           <dbl>         <dbl>            <int>       <int>
#>  1 Adelie  Torge…           39.1          18.7              181        3750
#>  2 Adelie  Torge…           39.5          17.4              186        3800
#>  3 Adelie  Torge…           40.3          18                195        3250
#>  4 Adelie  Torge…           36.7          19.3              193        3450
#>  5 Adelie  Torge…           39.3          20.6              190        3650
#>  6 Adelie  Torge…           38.9          17.8              181        3625
#>  7 Adelie  Torge…           39.2          19.6              195        4675
#>  8 Adelie  Torge…           34.1          18.1              193        3475
#>  9 Adelie  Torge…           42            20.2              190        4250
#> 10 Adelie  Torge…           37.8          17.3              180        3700
#> # … with 286 more rows, and 2 more variables: sex <fct>, year <int>

Both functions operate similarly to across(), but they also combine the per-column results into a single logical vector: if_all() is TRUE for a row when every result is TRUE, and if_any() is TRUE when at least one result is TRUE.
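To make the combination rule concrete, here is a small sketch on made-up data. The tibble `toy` and the helper `above_3()` are hypothetical names used only for this illustration. It compares if_all() and if_any() with the same conditions written out by hand using `&` and `|`:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  x = c(1, 5, 9),
  y = c(2, 8, 1)
)

# Hypothetical predicate: is each value greater than 3?
above_3 <- function(v) v > 3

# if_all(): a row is kept only when the predicate holds for every column
all_rows   <- toy %>% filter(if_all(c(x, y), above_3))
manual_all <- toy %>% filter(above_3(x) & above_3(y))

# if_any(): a row is kept when the predicate holds for at least one column
any_rows   <- toy %>% filter(if_any(c(x, y), above_3))
manual_any <- toy %>% filter(above_3(x) | above_3(y))

nrow(all_rows)  # 1: only the row (5, 8) passes on both columns
nrow(any_rows)  # 2: the rows (5, 8) and (9, 1) pass on at least one column
```

With only two columns the handwritten version is just as short, but if_all() and if_any() keep working unchanged when the tidy selection grows to many columns.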

Although if_all() and if_any() were designed with filter() in mind, we found that they are also useful within mutate() and summarise():

penguins %>% 
  filter(!is.na(bill_length_mm)) %>% 
  mutate(
    category = case_when(
      if_all(contains("bill"), big) ~ "both big", 
      if_any(contains("bill"), big) ~ "one big", 
      TRUE                          ~ "small"
    )) %>% 
  count(category)
#> # A tibble: 3 x 2
#>   category     n
#> * <chr>    <int>
#> 1 both big    61
#> 2 one big    235
#> 3 small       46
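The example above uses mutate(). As a rough sketch of the summarise() case, the logical vectors returned by if_all() and if_any() can be passed to sum() to count matching rows per group. The tibble `toy` and the helper `above_3()` are assumptions made up for this illustration:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for this illustration
toy <- tibble(
  g = c("a", "a", "b", "b"),
  x = c(1, 5, 9, 4),
  y = c(2, 8, 1, 7)
)

# Hypothetical predicate: is each value greater than 3?
above_3 <- function(v) v > 3

# Count, per group, the rows where all / any of the columns pass the predicate
counts <- toy %>%
  group_by(g) %>%
  summarise(
    n_all = sum(if_all(c(x, y), above_3)),
    n_any = sum(if_any(c(x, y), above_3))
  )

counts
```

Here sum() works because TRUE counts as 1 and FALSE as 0. Real data with missing values would likely need na.rm = TRUE.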

Faster across()

One of the main motivations for across() was eliminating the need for every verb to have _at, _if, and _all variants. Unfortunately, this came with a performance cost. In this release, we have redesigned across() to eliminate that penalty in many cases. In the following example, the old and new approaches now take roughly the same amount of time.

library(vroom)

mun2014 <- vroom(
  "https://data.regardscitoyens.org/elections/2014_municipales/MN14_Bvot_T1_01-49.txt", 
  col_select = -c('X4','X9','X10','X11'), col_types = list(), col_names = FALSE, 
  locale = locale(encoding = "WINDOWS-1252"), altrep = FALSE
) 

bench::workout({
  a <- mun2014 %>% group_by_if(is.character)
  b <- a %>% summarise_if(is.numeric, sum)
})
#> # A tibble: 2 x 3
#>   exprs                                       process     real
#>   <bch:expr>                                 <bch:tm> <bch:tm>
#> 1 a <- mun2014 %>% group_by_if(is.character)    151ms    151ms
#> 2 b <- a %>% summarise_if(is.numeric, sum)      847ms    848ms

bench::workout({
  c <- mun2014 %>% group_by(across(where(is.character)))
  d <- c %>% summarise(across(where(is.numeric), sum)) 
})
#> `summarise()` has grouped output by 'X2', 'X3', 'X5'. You can override using the `.groups` argument.
#> # A tibble: 2 x 3
#>   exprs                                                   process     real
#>   <bch:expr>                                             <bch:tm> <bch:tm>
#> 1 c <- mun2014 %>% group_by(across(where(is.character)))    179ms    179ms
#> 2 d <- c %>% summarise(across(where(is.numeric), sum))      776ms    777ms
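For readers migrating older code, here is a small sketch of how each scoped variant maps onto across(). The tibble `toy` is made up for this illustration, and `double_it()` is a hypothetical helper:

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data, invented for this illustration
toy <- tibble(id = c("a", "b"), x = c(1, 2), y = c(3, 4))

# Hypothetical helper used as the transformation
double_it <- function(v) v * 2

# _if variant: select columns with a predicate
old_if <- toy %>% mutate_if(is.numeric, double_it)
new_if <- toy %>% mutate(across(where(is.numeric), double_it))

# _at variant: select columns by name
old_at <- toy %>% mutate_at(vars(x), double_it)
new_at <- toy %>% mutate(across(x, double_it))

# _all variant: select every column
old_all <- toy %>% select(x, y) %>% mutate_all(double_it)
new_all <- toy %>% select(x, y) %>% mutate(across(everything(), double_it))

# Each pair should produce the same result
all.equal(old_if, new_if)
all.equal(old_at, new_at)
all.equal(old_all, new_all)
```

The same pattern applies to summarise() and group_by(), as in the benchmark above.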

Acknowledgements

Merci to all contributors of code, issues and documentation to this release:

@abalter, @cuixueqin, @eggrandio, @everetr, @hadley, @hjohns12, @iago-pssjd, @jahonamir, @krlmlr, @lionel-, @lotard, @luispfonseca, @mbcann01, @mutahiwachira, @Robinlovelace, @romainfrancois, @rpruim, @shahronak47, @shangguandong1996, @sylvainDaubree, @tomazweiss, @vhorvath, @wasdoff, and @Yunuuuu.