tidyselect 1.2.0

  lifecycle, tidyselect

  Lionel Henry and Hadley Wickham

tidyselect 1.2.0 hit CRAN last week and includes a few updates to the syntax of selections in tidyverse functions like dplyr::select(...) and tidyr::pivot_longer(cols = ).

tidyselect is a low-level package that provides the backend for selection contexts in tidyverse functions. A selection context is an argument like cols in pivot_longer() or a set of arguments like ... in select() [1]. In these special contexts, you can use a domain-specific language that helps you create a selection of columns. For example, you can select multiple columns with c(), a range of columns with :, and complex matches with selection helpers such as starts_with(). Under the hood, this selection syntax is interpreted and processed by the tidyselect package.
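As a brief illustration of the syntax just described, the sketch below uses the built-in mtcars data set and assumes dplyr is attached. The particular columns are chosen only for the example.

```r
library(dplyr)

# Several columns at once with c()
by_name <- mtcars |> select(c(mpg, cyl))

# A contiguous range of columns with `:`
by_range <- mtcars |> select(mpg:disp)

# A pattern match with a selection helper
by_prefix <- mtcars |> select(starts_with("d"))

names(by_name)
names(by_range)
names(by_prefix)
```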

In this post, we’ll cover the most important lifecycle changes in the selection syntax that tidyverse users (package developers in particular) should know about. You can see a full list of changes in the release notes. We’ll start with a quick recap of what it means in practice for a feature to be deprecated or soft-deprecated.

Note: With this release of tidyselect, some error messages will be suboptimal until dplyr 1.1.0 is released (planned in late October). We recommend waiting until then before updating tidyselect (though it’s not a big deal if you have already updated).

About soft-deprecation

Deprecation of features in tidyverse packages is handled by the lifecycle package. See https://www.tidyverse.org/blog/2021/02/lifecycle-1-0-0/ for an introduction.

The main feature of lifecycle is that it distinguishes between two stages of deprecation (soft-deprecation and deprecation) and two usage modes (direct and indirect).

  • For script users, direct usage is when you use a deprecated feature from the global environment. If the deprecated feature was used inside a package function that you are calling, it is considered indirect usage.

  • For package developers, the distinction between direct and indirect usage is made by testthat in unit tests. If a function in your package calls the feature, it is considered direct usage. If the feature is called by a function in another package that you are calling, it’s indirect usage.

To sum up, direct usage is when your own code uses the deprecated feature, and indirect usage is when someone else’s code uses it. This distinction matters because it determines how verbose (and thus how annoying) the deprecation warnings are.

  • For soft-deprecation, indirect usage is always silent because we only want to alert people who are actually able to fix the problem.

    Direct usage only generates one warning every 8 hours to avoid being too annoying during this transition period, so that you can continue to work with existing code, ignore the warnings, and update to the new patterns on your own time.

  • For deprecation, it’s now really time to update the code. Direct usage gives a warning every time so that deprecated features can no longer be ignored.

    Indirect usage will now also warn, but only once every 8 hours since indirect users can’t fix the problem themselves. The warning message automatically picks up the package URL where the usage was detected so that you can easily report the deprecation to the relevant maintainers.

lifecycle warnings are set up to helpfully inform you about upcoming changes while being as discreet as possible. All of the features deprecated in tidyselect in this blog post are in the soft-deprecation stage, and will remain this way for at least one year.
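For package developers, the sketch below shows one way a soft-deprecation can be signalled and how the warning can be forced to appear while hunting for deprecated usage. It assumes the lifecycle package is installed; mypkg, old_helper(), and new_helper() are hypothetical names used only for illustration.

```r
library(lifecycle)

# Hypothetical package function that soft-deprecates itself.
# The "mypkg::" prefix is made up so the message can name a package.
old_helper <- function() {
  deprecate_soft("1.2.0", "mypkg::old_helper()", "mypkg::new_helper()")
  invisible(TRUE)
}

# By default, soft-deprecation warnings are throttled or silent.
# Setting the lifecycle_verbosity option to "warning" makes every call warn.
old_opts <- options(lifecycle_verbosity = "warning")
caught <- tryCatch(
  old_helper(),
  lifecycle_warning_deprecated = function(w) conditionMessage(w)
)
options(old_opts)

cat(caught, "\n")
```

Restoring the previous options afterwards keeps the throttled default behaviour for the rest of the session.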

Supplying character vectors of column names outside of all_of() and any_of()

To specify a column selection using a character vector of names, you normally use all_of() or any_of().

vars <- c("cyl", "am")
mtcars |> select(all_of(vars)) |> glimpse()
#> Rows: 32
#> Columns: 2
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
#> $ am  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…

all_of() is adamant that it must select all of the requested columns:

mtcars |> select(all_of(letters))
#> Error in `select()`:
#> ℹ In argument: `all_of(letters)`.
#> Caused by error in `all_of()`:
#> ! Can't subset elements that don't exist.
#> ✖ Elements `a`, `b`, `c`, `d`, `e`, etc. don't exist.

any_of() is more lenient and ignores any names that are not present in the data frame. In this case, it ends up selecting nothing:

mtcars |> select(any_of(letters))
#> data frame with 0 columns and 32 rows
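This leniency is handy when removing columns that may or may not be present. The sketch below assumes dplyr is attached; "not_a_column" is a hypothetical name that does not exist in mtcars.

```r
library(dplyr)

# One real column and one hypothetical, non-existent column
to_drop <- c("cyl", "not_a_column")

# any_of() skips the missing name, so only cyl is removed
trimmed <- mtcars |> select(-any_of(to_drop))

names(trimmed)
```

With all_of() in place of any_of(), the same call would fail because of the missing name.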

Another feature of all_of() and any_of() is that they remove all ambiguity between variables in your environment like vars or letters (env-variables) and variables inside the data frame like cyl or am (data-variables). Let’s add vars in the data frame to see what happens:

my_data <- mtcars |> mutate(vars = 1:n())
my_data |> select(all_of(vars)) |> glimpse()
#> Rows: 32
#> Columns: 2
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
#> $ am  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…

Because vars was supplied to all_of(), select() will never confuse it with my_data$vars. In technical terms, there is no data-masking within selection helpers like all_of(), any_of(), or even starts_with(). It is safe to supply env-variables to these functions without worrying about data-masking ambiguity.

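This is not the case, however, if you supply a character vector outside of all_of() or any_of(). Here select() finds the vars column of my_data first and ignores the env-variable of the same name: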

my_data |> select(vars) |> glimpse()
#> Rows: 32
#> Columns: 1
#> $ vars <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,…

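This is why we have decided to deprecate direct supply of character vectors in favour of using all_of() and any_of(). When the name does not match a column, as with vars in the unmodified mtcars, the selection still falls back to the env-variable, but you will now get a soft-deprecation warning recommending all_of():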

mtcars |> select(vars) |> glimpse()
#> Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
#> ℹ Please use `all_of()` or `any_of()` instead.
#>   # Was:
#>   data %>% select(vars)
#> 
#>   # Now:
#>   data %>% select(all_of(vars))
#> 
#> See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> Rows: 32
#> Columns: 2
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
#> $ am  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
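For functions that receive column names as a character vector, the migration usually amounts to wrapping the argument in all_of() or any_of(). The sketch below is one possible shape for such a wrapper; pick_columns() and its strict argument are hypothetical and not part of dplyr or tidyselect.

```r
library(dplyr)

# Hypothetical helper: `cols` is a character vector of column names.
# strict = TRUE errors on missing names, strict = FALSE ignores them.
pick_columns <- function(data, cols, strict = TRUE) {
  if (strict) {
    select(data, all_of(cols))
  } else {
    select(data, any_of(cols))
  }
}

strict_result <- pick_columns(mtcars, c("cyl", "am"))
lenient_result <- pick_columns(mtcars, c("cyl", "not_a_column"), strict = FALSE)

names(strict_result)
names(lenient_result)
```

Because cols is always passed through a selection helper, the wrapper behaves the same whether or not data happens to contain a column named cols.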

Using .data inside selections

The .data pronoun is a convenient way of programming with data-masking functions like mutate() and filter(). It has two main functions:

  1. Retrieve a data frame column from a name stored in a variable with [[.

    var <- "am"
    mtcars |> transmute(am = .data[[var]] * 10) |> glimpse()
    #> Rows: 32
    #> Columns: 1
    #> $ am <dbl> 10, 10, 10, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 10…

  2. For package developers, .data is helpful to silence R CMD check notes about unknown variables. When the static analysis checker of R encounters an expression like mtcars |> mutate(am * 2), it has no way of knowing that am is a data frame variable. Since it doesn’t see any variable am in your environment, it emits a note about a potential typo in the code.

    The .data$col pattern is used to work around this issue: mtcars |> mutate(.data$am * 2) doesn’t produce any notes.

Whereas .data is very useful in data-masking functions, its usage in selections is much more limited. As we have seen in the previous section, retrieving a variable from a character vector should be done with all_of():

var <- "am"
mtcars |> select(all_of(var)) |> glimpse()
#> Rows: 32
#> Columns: 1
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,…

And to avoid the R CMD check note about unknown variables, it is much cleaner to wrap the column name in quotes:

mtcars |> select("am") |> glimpse()
#> Rows: 32
#> Columns: 1
#> $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,…

Allowing the .data pronoun in selection contexts also makes the distinction between tidy-selections and data-masking blurrier. And so we have decided to deprecate it in selections:

var <- "am"
mtcars |> select(.data[[var]]) |> invisible()
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `all_of(var)` (or `any_of(var)`) instead of `.data[[var]]`
mtcars |> select(.data$am) |> invisible()
#> Warning: Use of .data in tidyselect expressions was deprecated in tidyselect 1.2.0.
#> ℹ Please use `"am"` instead of `.data$am`

This deprecation does not affect the use of .data in data-masking contexts.
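To make the split concrete, the sketch below combines both contexts in one function: .data inside mutate() (data-masking) and all_of() plus a quoted name inside select() (selection). The function scale_and_keep() and its arguments are hypothetical, and dplyr is assumed to be attached.

```r
library(dplyr)

# Hypothetical helper: `col` is a single column name stored as a string
scale_and_keep <- function(data, col, multiplier = 10) {
  data |>
    # Data-masking context: .data[[col]] is still the recommended pattern
    mutate(scaled = .data[[col]] * multiplier) |>
    # Selection context: use all_of() for the variable and quotes for fixed names
    select(all_of(col), "scaled")
}

result <- scale_and_keep(mtcars, "am")
names(result)
```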

Acknowledgements

Many thanks to all contributors (issues and PRs) to this release!

@alexpghayes, @angela-li, @apreshill, @arneschillert, @batpigandme, @behrman, @bensoltoff, @braceandbracket, @brshallo, @bwalsh5, @carneybill, @ChrisDunleavy, @ColinFay, @courtiol, @csgillespie, @DavisVaughan, @dgrtwo, @DivadNojnarg, @dpprdan, @dpseidel, @drmowinckels, @dylan-cooper, @EconomiCurtis, @edgararuiz-zz, @EdwinTh, @elben10, @EmilHvitfeldt, @espinielli, @fenguoerbian, @gaborcsardi, @giocomai, @gregrs-uk, @gregswinehart, @gvelasq, @hadley, @hfrick, @hplieninger, @ismayc, @jameslairdsmith, @jayhesselberth, @jemus42, @jennybc, @jimhester, @juliasilge, @justmytwospence, @karawoo, @krlmlr, @leafyoung, @lionel-, @lorenzwalthert, @LucyMcGowan, @maelle, @markdly, @martin-ueding, @maurolepore, @MichaelChirico, @mikemahoney218, @mine-cetinkaya-rundel, @mitchelloharawild, @pkq, @PursuitOfDataScience, @rgerecke, @richierocks, @Robinlovelace, @robinsones, @romainfrancois, @rosseji, @rudeboybert, @saghirb, @sbearrows, @sharlagelfand, @simonpcouch, @stedy, @stephlocke, @stragu, @sysilviakim, @thisisdaryn, @thomasp85, @thuettel, @tmstauss, @topepo, @tracykteal, @tyluRp, @vspinu, @warint, @wibeasley, @yitao-li, and @yutannihilation.


  1. If you are wondering whether a particular argument supports selections, look in the function documentation. Arguments tagged with <tidy-select> implement the selection dialect. By contrast, arguments tagged with <data-masking> only allow you to refer to data frame columns directly.