Erratum tidyr 0.7.0

Photo by Edu Grande

In tidyr 0.7.0, we introduced a stronger distinction between data expressions and context expressions for selection verbs like gather(). However, that change caused a lot of trouble and confusion, so we have updated tidyselect (the backend for selection functions) to revert that behaviour. In this article, we provide a few comments on these changes as well as some notes on good practices for writing functions with tidyverse tools. Finally, we introduce two new selection features that help write safer code: improved support for strings and character vectors, and a new selection helper, last_col().

You can install the new version of tidyselect from CRAN:

install.packages("tidyselect")

Updated selection rules

Since tidyr 0.7.0, selecting functions like gather() use the tidyselect package as a backend. Tidyselect was extracted from dplyr and provides the mechanism for helpers like starts_with() or everything(). However, tidyselect had one big change compared to dplyr: data expressions could no longer refer to contextual variables. A data expression is defined as either a bare symbol like var, or a call like var1:var2 or c(var1, var2). Any other expression is a context expression. The semantic change meant that it was no longer legal to refer to a contextual variable in a data expression:

var <- 5
mtcars %>% gather("key", "value", 1:var)

We thought this was a relatively uncommon occurrence in practice. However, the change broke a lot of code that had this form:

df %>% gather("key", "value", 1:ncol(df))

Although the change was well-intentioned, it proved to be too disruptive and we have reverted it.

Note that we still maintain a distinction between data and context expressions. The notion of context expression (anything that’s not a symbol or a call to : or c()) was introduced in dplyr 0.7.0. Since that version, context expressions cannot refer to the data. This makes it safer to refer to other objects.

Safety versus convenience

Most datawise functions in R obey the same set of rules regarding the visibility of objects. Visibility is hierarchical: data frame objects override those found in the context. This is convenient for interactive use and scripts, but can also cause issues. In the following example, should gather() select the first three columns (using the x defined in the global environment), or should it select the first two columns (using the column named x)?

x <- 3
df <- tibble(w = 1, x = 2, y = 3)
df %>% gather("key", "value", 1:x)
#> # A tibble: 2 x 3
#>       y   key value
#>   <dbl> <chr> <dbl>
#> 1     3     w     1
#> 2     3     x     2

The answer is that it selects the first two variables because x is first found in the data.
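
The same lookup order can be seen outside the tidyverse. The following is a minimal base R sketch, using with() as a stand-in for a datawise function (the variable z is introduced here purely for illustration):

```r
# A global variable and a data frame column share the name `x`.
x <- 3
df <- data.frame(w = 1, x = 2, y = 3)

# with() evaluates its expression with the columns of `df` in scope,
# so `x` resolves to the column rather than to the global variable.
from_data <- with(df, x)

# A name that is not a column falls through to the surrounding context.
z <- 10
from_context <- with(df, x + z)
```

Here from_data holds the column value and from_context combines the column x with the contextual z, which is the hierarchical visibility described above.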

In practice, the hierarchical ambiguity is not a big problem when you use these tools interactively or in scripts. However, it becomes worrying when you’re writing reusable functions, because you don’t know in advance which variables the input data frame will contain. For those cases where ambiguity matters, the tidy eval feature of quasiquotation allows you to explicitly pick a variable from the context:

df %>% gather("key", "value", 1:(!! x))
#> # A tibble: 3 x 2
#>     key value
#>   <chr> <dbl>
#> 1     w     1
#> 2     x     2
#> 3     y     3
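
In a reusable function, the same technique might look like the sketch below. The helper gather_first() and its argument n are hypothetical names chosen for illustration, and the example assumes tidyr is installed:

```r
library(tidyr)

# Hypothetical helper: gather the first `n` columns of `data`.
gather_first <- function(data, n) {
  # `!!` inlines the value of the argument `n` before the selection is
  # evaluated, so a column that happens to be named `n` cannot shadow it.
  gather(data, "key", "value", 1:(!!n))
}

# The data frame deliberately contains a column named `n`.
df <- data.frame(w = 1, n = 2, y = 3)
res <- gather_first(df, 3)
```

Because the value 3 is unquoted into the selection, all three columns are gathered even though df has a column called n.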

The special semantics of selection functions

Selection functions have always been a bit special in the tidyverse. They don’t behave exactly like regular quoting functions. In almost all quoting functions in R, variables represent a data frame column. That’s why expressions like this are natural:

lm(mass ~ scale(height), data = starwars)

Since variables represent actual columns, you can include them in expressions as if they were actual objects. For instance scale(height) is equivalent to scale(starwars$height). The same is true for most tidyverse tools, e.g. dplyr::mutate():

starwars %>% mutate(height = scale(height))

However in selection functions, variables do not represent columns but column positions. That is a subtle but important distinction. When you type height, tidyselect actually sees the integer 2. This makes sense for several reasons:

  • Expressions such as name:mass evaluate to 1:3 in a natural way.

  • You can select the same column multiple times. For example if you supply a selection for the dataset starwars with starts_with("s") and ends_with("s"), the variables species and starships would be matched twice. It is easy for tidyselect to take the intersection of the two sets of column positions. If the sets contained vectors instead, it could not determine whether there were two different but identical vectors rather than the same vector selected twice.

  • Finally and most importantly, if the variables evaluated to the column vectors, we would have no information about their names or positions, which we need to reconstitute the data frame.

Since the variables represent positions, expressions such as select(sqrt(hair_color):mass^2) are valid but won’t do what you might think at first. In the selection context, that expression translates to 2:9 because hair_color and mass are the fourth and third columns of the data frame, respectively.
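
To make the idea of names standing for positions concrete, here is a toy model in base R. It is only a sketch of the semantics, not how tidyselect is implemented, and the small data frame merely mimics the names and order of the first four starwars columns:

```r
# Toy stand-in for the first columns of starwars (names and order only).
df <- data.frame(name = "Luke", height = 172, mass = 77, hair_color = "blond")

# Bind each column name to its position rather than to its contents.
positions <- as.list(seq_along(df))
names(positions) <- names(df)

# Evaluate selection-like expressions against those bindings.
height_pos <- eval(quote(height), positions)
range_pos  <- eval(quote(name:mass), positions)
odd_pos    <- eval(quote(sqrt(hair_color):mass^2), positions)
```

In this model height_pos is 2, range_pos is 1:3, and odd_pos is 2:9, since sqrt(4) is 2 and 3^2 is 9.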

Safety in selection functions

Given the special semantics of selection functions, we had more freedom to solve the hierarchical ambiguity of quoting functions. Indeed, apart from :, c() or -, there rarely is any need for referring to column positions in helpers like starts_with() or contains(). For this reason, dplyr 0.7.0 introduced the notion of context expressions. Data frame columns are no longer in scope in these calls in order to solve the hierarchical ambiguity. This has the downside that context expressions in selection functions behave a bit differently from the rest of the tidyverse, but we gain safety in exchange.

Given these special semantics, it seemed to make sense to give data expressions the opposite behaviour and only allow references to the data. This would solve the ambiguity in the opposite direction. As explained above, this broke too much code. We had to change it back, and the issue of hierarchical ambiguity came back along with it.

Luckily, tidyselect 0.2.0 also introduces a few features that help write safer code for data expressions. First, the support for strings and character vectors has been improved. All data expressions fully support strings. It is now valid to supply strings to - and :, as in these examples:

starwars %>% gather("key", "value", "name" : "films")
#> # A tibble: 957 x 4
#>     vehicles starships   key     value
#>       <list>    <list> <chr>    <list>
#>  1 <chr [2]> <chr [2]>  name <chr [1]>
#>  2 <chr [0]> <chr [0]>  name <chr [1]>
#>  3 <chr [0]> <chr [0]>  name <chr [1]>
#>  4 <chr [0]> <chr [1]>  name <chr [1]>
#>  5 <chr [1]> <chr [0]>  name <chr [1]>
#>  6 <chr [0]> <chr [0]>  name <chr [1]>
#>  7 <chr [0]> <chr [0]>  name <chr [1]>
#>  8 <chr [0]> <chr [0]>  name <chr [1]>
#>  9 <chr [0]> <chr [1]>  name <chr [1]>
#> 10 <chr [1]> <chr [5]>  name <chr [1]>
#> # ... with 947 more rows
starwars %>% gather("key", "value", -"height")
#> # A tibble: 1,044 x 3
#>    height   key     value
#>     <int> <chr>    <list>
#>  1    172  name <chr [1]>
#>  2    167  name <chr [1]>
#>  3     96  name <chr [1]>
#>  4    202  name <chr [1]>
#>  5    150  name <chr [1]>
#>  6    178  name <chr [1]>
#>  7    165  name <chr [1]>
#>  8     97  name <chr [1]>
#>  9    183  name <chr [1]>
#> 10    182  name <chr [1]>
#> # ... with 1,034 more rows

Note that this only applies to c(), - or : because it would not make sense to write seq("name", "mass"). Also, it only makes sense to support strings in this way because of the special nature of selection functions. This wouldn’t work with mutate() or lm() since they wouldn’t be able to differentiate between a column name and an actual column (obtained by recycling the string to column length if the data frame has more than one row).
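
The recycling behaviour mentioned above can be illustrated with base R's transform(), used here as a stand-in for a function like mutate() (the column name copy is made up for the example):

```r
df <- data.frame(height = c(172, 167, 96))

# In an ordinary datawise function a string is just a value: it is
# recycled to the number of rows instead of being looked up as a column.
with_string <- transform(df, copy = "height")

# A bare name, by contrast, refers to the column itself.
with_column <- transform(df, copy = height)
```

The first call fills copy with the text "height" on every row, while the second copies the numeric column, so a string cannot double as a column reference outside of selection functions.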

The purpose of supporting strings in selection functions is to make it easier to unquote column names. Excluding columns with quasiquotation is now as simple as this:

x <- "height"
starwars %>% gather("key", "value", -(!! x))
#> # A tibble: 1,044 x 3
#>    height   key     value
#>     <int> <chr>    <list>
#>  1    172  name <chr [1]>
#>  2    167  name <chr [1]>
#>  3     96  name <chr [1]>
#>  4    202  name <chr [1]>
#>  5    150  name <chr [1]>
#>  6    178  name <chr [1]>
#>  7    165  name <chr [1]>
#>  8     97  name <chr [1]>
#>  9    183  name <chr [1]>
#> 10    182  name <chr [1]>
#> # ... with 1,034 more rows

The second feature introduced in tidyselect 0.2.0 is the last_col() helper. We noticed in bug reports that many people use variants of:

x <- starwars
x %>% gather("key", "value", 3 : ncol(x))
#> # A tibble: 957 x 4
#>                  name height   key     value
#>                 <chr>  <int> <chr>    <list>
#>  1     Luke Skywalker    172  mass <dbl [1]>
#>  2              C-3PO    167  mass <dbl [1]>
#>  3              R2-D2     96  mass <dbl [1]>
#>  4        Darth Vader    202  mass <dbl [1]>
#>  5        Leia Organa    150  mass <dbl [1]>
#>  6          Owen Lars    178  mass <dbl [1]>
#>  7 Beru Whitesun lars    165  mass <dbl [1]>
#>  8              R5-D4     97  mass <dbl [1]>
#>  9  Biggs Darklighter    183  mass <dbl [1]>
#> 10     Obi-Wan Kenobi    182  mass <dbl [1]>
#> # ... with 947 more rows

That is potentially unsafe in functions since the data frame might contain a column named x. You can now use last_col() instead:

# Importing last_col() because it's not exported in dplyr yet
last_col <- tidyselect::last_col

x %>% gather("key", "value", 3 : last_col())
#> # A tibble: 957 x 4
#>                  name height   key     value
#>                 <chr>  <int> <chr>    <list>
#>  1     Luke Skywalker    172  mass <dbl [1]>
#>  2              C-3PO    167  mass <dbl [1]>
#>  3              R2-D2     96  mass <dbl [1]>
#>  4        Darth Vader    202  mass <dbl [1]>
#>  5        Leia Organa    150  mass <dbl [1]>
#>  6          Owen Lars    178  mass <dbl [1]>
#>  7 Beru Whitesun lars    165  mass <dbl [1]>
#>  8              R5-D4     97  mass <dbl [1]>
#>  9  Biggs Darklighter    183  mass <dbl [1]>
#> 10     Obi-Wan Kenobi    182  mass <dbl [1]>
#> # ... with 947 more rows
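
Wrapped in a function, this pattern could look like the following sketch. The helper gather_from() and its argument first are hypothetical, and the example assumes tidyr and tidyselect (0.2.0 or later) are installed; last_col() is called through the tidyselect namespace for the reason given above:

```r
library(tidyr)

# Hypothetical helper: gather every column from position `first` onwards.
gather_from <- function(data, first) {
  # `!!` inlines `first`, and last_col() asks tidyselect for the final
  # column, so the selection never refers to the data frame by name.
  gather(data, "key", "value", (!!first):tidyselect::last_col())
}

# A column named `x` does not interfere with the selection.
df <- data.frame(x = 1, a = 2, b = 3)
res <- gather_from(df, 2)
```

The result keeps the column x as is and gathers a and b into key and value pairs.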