tidyselect 1.2.0 hit CRAN last week and includes a few updates to the syntax of selections in tidyverse functions like tidyr::pivot_longer(cols = ).
tidyselect is a low-level package that provides the backend for selection contexts in tidyverse functions. A selection context is an argument like cols in pivot_longer() or a set of arguments like the dots (...) in select() [1]. In these special contexts, you can use a domain-specific language that helps you create a selection of columns. For example, you can select multiple columns with c(), a range of columns with :, and complex matches with selection helpers such as starts_with(). Under the hood, this selection syntax is interpreted and processed by the tidyselect package.
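As a quick illustration of this selection language, here is a minimal sketch using dplyr::select() on the built-in mtcars data frame. The choice of columns is ours and is only meant to show the three forms mentioned above.

```r
library(dplyr)

# Select multiple columns with c()
with_c <- mtcars |> select(c(mpg, cyl))
names(with_c)
#> [1] "mpg" "cyl"

# Select a range of adjacent columns with :
with_range <- mtcars |> select(mpg:disp)
names(with_range)
#> [1] "mpg"  "cyl"  "disp"

# Select columns by name pattern with a selection helper
with_helper <- mtcars |> select(starts_with("d"))
names(with_helper)
#> [1] "disp" "drat"
```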
In this post, we’ll cover the most important lifecycle changes in the selection syntax that tidyverse users (package developers in particular) should know about. You can see a full list of changes in the release notes. We’ll start with a quick recap of what it means in practice for a feature to be deprecated or soft-deprecated.
Note: With this release of tidyselect, some error messages will be suboptimal until dplyr 1.1.0 is released (planned in late October). We recommend waiting until then before updating tidyselect (though it’s not a big deal if you have already updated).
Deprecation of features in tidyverse packages is handled by the lifecycle package. See https://www.tidyverse.org/blog/2021/02/lifecycle-1-0-0/ for an introduction.
The main feature of lifecycle is that it distinguishes between two stages of deprecation, soft-deprecation and deprecation, and two usage modes, direct and indirect.
For script users, direct usage is when you use a deprecated feature from the global environment. If the deprecated feature was used inside a package function that you are calling, it is considered indirect usage.
For package developers, the distinction between direct and indirect usages is made by testthat in unit tests. If a function in your package calls the feature, it is considered direct usage. If the feature is called by a function in another package that you are calling, it's indirect usage.
To sum up, direct usage is when your own code uses the deprecated feature, and indirect usage is when someone else’s code uses it. This distinction matters because it determines how verbose (and thus how annoying) the deprecation warnings are.
For soft-deprecation, indirect usage is always silent because we only want to alert people who are actually able to fix the problem. Direct usage only generates one warning every 8 hours to avoid being too annoying during this transition period, so that you can continue to work with existing code, ignore the warnings, and update to the new patterns on your own time.
For deprecation, it’s now really time to update the code. Direct usage gives a warning every time so that deprecated features can no longer be ignored. Indirect usage will now also warn, but only once every 8 hours since indirect users can’t fix the problem themselves. The warning message automatically picks up the package URL where the usage was detected so that you can easily report the deprecation to the relevant maintainers.
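The four combinations of stage and usage mode can be summarised in a small lookup function. This is only a sketch of the rules as described above, not how lifecycle is implemented; warning_frequency() is a hypothetical helper that we made up for illustration.

```r
# Purely illustrative summary of the rules described above.
# `warning_frequency()` is a hypothetical helper, not part of lifecycle.
warning_frequency <- function(stage = c("soft-deprecated", "deprecated"),
                              usage = c("direct", "indirect")) {
  stage <- match.arg(stage)
  usage <- match.arg(usage)

  if (stage == "soft-deprecated") {
    # Only people who can fix the problem are alerted, and only rarely.
    if (usage == "direct") "once every 8 hours" else "never"
  } else {
    # Direct users are always warned; indirect users are warned rarely.
    if (usage == "direct") "every time" else "once every 8 hours"
  }
}

warning_frequency("soft-deprecated", "indirect")
#> [1] "never"
warning_frequency("deprecated", "direct")
#> [1] "every time"
```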
lifecycle warnings are set up to helpfully inform you about upcoming changes while being as discreet as possible. All of the features deprecated in tidyselect in this blog post are in the soft-deprecation stage, and will remain this way for at least one year.
all_of() is adamant that it must select all of the requested columns:
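Here is a minimal sketch of that behaviour, assuming dplyr is attached. The vars vector is our own illustrative definition, and we use the built-in letters vector as a set of names that are not columns of mtcars.

```r
library(dplyr)

# Illustrative character vector of column names; both exist in mtcars.
vars <- c("cyl", "am")
selected <- mtcars |> select(all_of(vars))
names(selected)  # both requested columns are selected

# `letters` contains "a", "b", ..., none of which are columns of mtcars,
# so all_of() fails. tryCatch() captures the error so the script keeps going.
outcome <- tryCatch(
  mtcars |> select(all_of(letters)),
  error = function(cnd) "error: some requested columns don't exist"
)
outcome
#> [1] "error: some requested columns don't exist"
```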
any_of() is more lenient and ignores any names that are not present in the data frame. In this case, it ends up selecting nothing:
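As a sketch, again assuming dplyr is attached and using letters as a vector of names that don't exist in mtcars (the name not_a_column is likewise made up for illustration):

```r
library(dplyr)

# None of the names in `letters` are columns of mtcars, so nothing is selected.
nothing <- mtcars |> select(any_of(letters))
ncol(nothing)
#> [1] 0

# Names that do exist are kept; the unknown name is silently ignored.
some <- mtcars |> select(any_of(c("cyl", "not_a_column")))
names(some)
#> [1] "cyl"
```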
Another feature of all_of() and any_of() is that they remove all ambiguity between variables in your environment like letters (env-variables) and variables inside the data frame like am (data-variables). Let’s add vars in the data frame to see what happens:
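A minimal sketch of this experiment, assuming vars is an env-variable holding the names "cyl" and "am" (our own definition) and my_data is a hypothetical name for the modified data frame:

```r
library(dplyr)

# Env-variable holding column names (assumed definition).
vars <- c("cyl", "am")

# Add a data-variable that is also called `vars`.
my_data <- mtcars |> mutate(vars = 1:n())

# all_of() looks up `vars` in the environment, not in the data frame,
# so the columns named in the character vector are selected.
selected <- my_data |> select(all_of(vars))
names(selected)
```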
Because vars was supplied to all_of(), select() will never confuse it with mtcars$vars. In technical terms, there is no data-masking within selection helpers like all_of(), any_of(), or even starts_with(). It is safe to supply env-variables to these functions without worrying about data-masking ambiguity.
This is not the case however if you supply a character vector outside of all_of() or any_of():
mtcars |> select(vars) |> glimpse()
#> Warning: Using an external vector in selections was deprecated in tidyselect 1.1.0.
#> ℹ Please use `all_of()` or `any_of()` instead.
#>   # Was:
#>   data %>% select(vars)
#>
#>   # Now:
#>   data %>% select(all_of(vars))
#>
#> See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
#> Rows: 32
#> Columns: 2
#> $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4…
#> $ am  <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1…
Retrieve a data frame column from a name stored in a variable with .data[[var]].
For package developers, .data is helpful to silence R CMD check notes about unknown variables. When the static analysis checker of R encounters an expression like mtcars |> mutate(am * 2), it has no way of knowing that am is a data frame variable. Since it doesn’t see any variable am in your environment, it emits a note about a potential typo in the code. The .data$col pattern is used to work around this issue: mtcars |> mutate(.data$am * 2) doesn’t produce any notes.
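Both uses of .data in a data-masking function can be sketched as follows. The names var, doubled, and double_am() are hypothetical and only serve the illustration.

```r
library(dplyr)

# 1. Retrieve a column from a name stored in a variable.
var <- "am"
by_name <- mtcars |> mutate(doubled = .data[[var]] * 2)

# 2. Refer to a column explicitly, as a package function would do so that
#    the static checker does not see a bare, unknown `am` symbol.
double_am <- function(data) {
  data |> mutate(doubled = .data$am * 2)
}
by_dollar <- double_am(mtcars)

identical(by_name$doubled, by_dollar$doubled)
#> [1] TRUE
```

In a real package, .data itself would also need to be imported, typically from rlang.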
Whereas .data is very useful in data-masking functions, its usage in selections is much more limited. As we have seen in the previous section, retrieving a variable from a character vector should be done with all_of():
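For instance, a minimal sketch with a hypothetical variable var holding the column name:

```r
library(dplyr)

# `var` is an env-variable containing the name of the column we want.
var <- "am"

picked <- mtcars |> select(all_of(var))
names(picked)
#> [1] "am"
```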
And to avoid the R CMD check note about unknown variables, it is much cleaner to wrap the column name in quotes:
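A short sketch of the quoted form, assuming dplyr is attached. Since "am" is a string rather than a bare symbol, the static checker has nothing to flag:

```r
library(dplyr)

# A quoted column name works directly in a selection context.
quoted <- mtcars |> select("am")
names(quoted)
#> [1] "am"
```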
Using the .data pronoun in selection contexts also makes the distinction between tidy-selections and data-masking blurrier. And so we have decided to deprecate it in selections:
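The affected patterns look like the sketch below. We assume tidyselect 1.2.0, where these forms are soft-deprecated: they should still select the column, but may emit a lifecycle deprecation warning depending on how and how often they are called. The variable var is hypothetical.

```r
library(dplyr)

var <- "am"

# Deprecated in selections: both forms may emit a deprecation warning.
old_dollar  <- mtcars |> select(.data$am)
old_bracket <- mtcars |> select(.data[[var]])

# While soft-deprecated, the forms still select the requested column.
names(old_dollar)
names(old_bracket)
```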
This deprecation does not affect the use of .data in data-masking contexts.
Many thanks to all contributors (issues and PRs) to this release!
@alexpghayes, @angela-li, @apreshill, @arneschillert, @batpigandme, @behrman, @bensoltoff, @braceandbracket, @brshallo, @bwalsh5, @carneybill, @ChrisDunleavy, @ColinFay, @courtiol, @csgillespie, @DavisVaughan, @dgrtwo, @DivadNojnarg, @dpprdan, @dpseidel, @drmowinckels, @dylan-cooper, @EconomiCurtis, @edgararuiz-zz, @EdwinTh, @elben10, @EmilHvitfeldt, @espinielli, @fenguoerbian, @gaborcsardi, @giocomai, @gregrs-uk, @gregswinehart, @gvelasq, @hadley, @hfrick, @hplieninger, @ismayc, @jameslairdsmith, @jayhesselberth, @jemus42, @jennybc, @jimhester, @juliasilge, @justmytwospence, @karawoo, @krlmlr, @leafyoung, @lionel-, @lorenzwalthert, @LucyMcGowan, @maelle, @markdly, @martin-ueding, @maurolepore, @MichaelChirico, @mikemahoney218, @mine-cetinkaya-rundel, @mitchelloharawild, @pkq, @PursuitOfDataScience, @rgerecke, @richierocks, @Robinlovelace, @robinsones, @romainfrancois, @rosseji, @rudeboybert, @saghirb, @sbearrows, @sharlagelfand, @simonpcouch, @stedy, @stephlocke, @stragu, @sysilviakim, @thisisdaryn, @thomasp85, @thuettel, @tmstauss, @topepo, @tracykteal, @tyluRp, @vspinu, @warint, @wibeasley, @yitao-li, and @yutannihilation.
[1] If you are wondering whether a particular argument supports selections, look in the function documentation. Arguments tagged with <tidy-select> implement the selection dialect. By contrast, arguments tagged with <data-masking> only allow you to refer to data frame columns directly.