In tidyr 0.7.0, we introduced a stronger distinction between data expressions and context expressions for selection verbs like gather(). However, that change caused a lot of trouble and confusion, so we have updated tidyselect (the backend for selection functions) to revert that behaviour. In this article, we provide a few comments on these changes as well as some notes on good practices for writing functions with tidyverse tools. Finally, we introduce two new selection features that help write safer code: improved support for strings and character vectors, and a new selection helper, last_col().
You can install the new version of tidyselect from CRAN:
install.packages("tidyselect")
Updated selection rules
Since tidyr 0.7.0, selecting functions like gather() use the tidyselect package as backend. Tidyselect was extracted from dplyr and provides the mechanism for helpers like starts_with() or everything(). However, tidyselect had one big change compared to dplyr: data expressions could no longer refer to contextual variables. A data expression is defined as either a bare symbol like var, or a call like var1:var2 or c(var1, var2). Any other expression is a context expression. The semantic change meant that it was no longer legal to refer to a contextual variable in a data expression:
var <- 5
mtcars %>% gather("key", "value", 1:var)
We thought this was a relatively uncommon occurrence in practice. However, the change broke a lot of code of this form:
df %>% gather("key", "value", 1:ncol(df))
Although the change was well-intentioned, it proved to be too disruptive and we have reverted it.
Note that we still maintain a distinction between data and context expressions. The notion of context expression (anything that’s not a symbol or a call to : or c()) was introduced in dplyr 0.7.0. Since that version, context expressions cannot refer to the data. This makes it safer to refer to other objects.
Safety versus convenience
Most datawise functions in R obey the same set of rules regarding the visibility of objects. Visibility is hierarchical: data frame objects override those found in the context. This is convenient for interactive use and scripts, but can also cause issues. In the following example, should gather() select the first three columns (using the x defined in the global environment), or should it select the first two columns (using the column named x)?
x <- 3
df <- tibble(w = 1, x = 2, y = 3)
df %>% gather("key", "value", 1:x)
#> # A tibble: 2 x 3
#> y key value
#> <dbl> <chr> <dbl>
#> 1 3 w 1
#> 2 3 x 2
The answer is that it selects the first two variables because x is first found in the data.
In practice, the hierarchical ambiguity is not a big problem when you use these tools interactively or in scripts. However it becomes worrying when you’re writing reusable functions because you don’t know in advance the variables in the input data frame. For those cases where ambiguity matters, the tidy eval feature of quasiquotation allows you to explicitly pick a variable from the context:
df %>% gather("key", "value", 1:(!! x))
#> # A tibble: 3 x 2
#> key value
#> <chr> <dbl>
#> 1 w 1
#> 2 x 2
#> 3 y 3
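The same idea carries over to reusable functions. The following is a minimal sketch, assuming the tidyr and tibble packages are installed; gather_first() and its argument n are hypothetical names used only for illustration. Unquoting n with !! should ensure that the function argument is used even when the input data frame happens to contain a column called n.

```r
library(tidyr)
library(tibble)

# Hypothetical helper: gather the first `n` columns of `data`.
# `!! n` inlines the value of the argument, so a column named `n`
# in `data` cannot shadow it.
gather_first <- function(data, n) {
  gather(data, "key", "value", 1:(!! n))
}

# The data frame deliberately contains a column named `n`
df <- tibble(w = 1, n = 2, y = 3)

# All three columns are gathered because `n` comes from the argument
out <- gather_first(df, 3)
```

Without the unquoting operator, the column n would take precedence and only the first two columns would be gathered, which is the hierarchical ambiguity described above.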
The special semantics of selection functions
Selection functions have always been a bit special in the tidyverse. They don’t behave exactly like regular quoting functions. In almost all quoting functions in R, variables represent a data frame column. That’s why expressions like this are natural:
lm(mass ~ scale(height), data = starwars)
Since variables represent actual columns, you can include them in expressions as if they were actual objects. For instance, scale(height) is equivalent to scale(starwars$height). The same is true for most tidyverse tools, e.g. dplyr::mutate():
starwars %>% mutate(height = scale(height))
However, in selection functions, variables do not represent columns but column positions. That is a subtle but important distinction. When you type height, tidyselect actually sees the integer 2. This makes sense for several reasons:
- Expressions such as name:mass evaluate to 1:3 in a natural way.
- You can select the same column multiple times. For example, if you supply a selection for the dataset starwars with starts_with("s") and ends_with("s"), the variables species and starships would be matched twice. It is easy for tidyselect to take the union of the two sets of column positions. If the sets contained vectors instead, it could not determine whether there were two different but identical vectors rather than the same vector selected twice.
- Finally and most importantly, if the variables evaluated to the column vectors, we would have no information about their names or positions, which we need to reconstitute the data frame.
Since the variables represent positions, expressions such as select(sqrt(hair_color):mass^2) are valid but won’t do what you might think at first. In the selection context, that expression translates to 2:9 because hair_color and mass are the fourth and third columns of the data frame.
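To make the position semantics concrete, here is a rough base R sketch. It is not how tidyselect is implemented; select_positions() is a hypothetical helper that simply evaluates an expression with each column name bound to its integer position.

```r
# Hypothetical helper, for illustration only: bind each column name to
# its integer position, then evaluate the selection expression.
select_positions <- function(data, expr) {
  positions <- as.list(setNames(seq_along(data), names(data)))
  eval(substitute(expr), positions, parent.frame())
}

# A small stand-in with the same first four columns as starwars
df <- data.frame(name = "Luke", height = 172, mass = 77, hair_color = "blond")

# `name` is 1 and `mass` is 3, so this is the sequence 1:3
simple <- select_positions(df, name:mass)

# `hair_color` is 4 and `mass` is 3, so this is sqrt(4):3^2
odd <- select_positions(df, sqrt(hair_color):mass^2)
```

Here simple holds the positions 1 to 3, while odd holds the positions 2 to 9, which has nothing to do with the contents of the columns.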
Safety in selection functions
Given the special semantics of selection functions, we had more freedom to solve the hierarchical ambiguity of quoting functions. Indeed, apart from :, c() or -, there is rarely any need to refer to column positions in helpers like starts_with() or contains(). For this reason, dplyr 0.7.0 introduced the notion of context expressions. Data frame columns are no longer in scope in these calls in order to solve the hierarchical ambiguity. This has the downside that context expressions in selection functions behave a bit differently from the rest of the tidyverse, but we gain safety in exchange.
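As a small illustration of context expressions, the sketch below assumes dplyr is installed and uses a made-up data frame whose column prefix clashes with a variable of the same name. Because the argument of starts_with() is a context expression, the lookup should find the variable defined outside the data frame rather than the column.

```r
library(dplyr)

# Made-up data: one column is deliberately named `prefix`
df <- data.frame(prefix = 1, wx = 2, wy = 3)

# A contextual variable with the same name as the column
prefix <- "w"

# Inside the helper call, `prefix` refers to the contextual variable,
# so the columns whose names start with "w" are selected
out <- select(df, starts_with(prefix))
names(out)
```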
Given these special semantics, it seemed to make sense to give data expressions the opposite behaviour and only allow references to the data. This would solve the ambiguity in the opposite direction. As explained above, this broke too much code. We had to change it back, and the issue of hierarchical ambiguity came back along with it.
Luckily, tidyselect 0.2.0 also introduces a few features that help write safer code for data expressions. First, the support for strings and character vectors has been improved. All data expressions fully support strings. It is now valid to supply strings to the - and : operators:
starwars %>% gather("key", "value", "name" : "films")
#> # A tibble: 957 x 4
#> vehicles starships key value
#> <list> <list> <chr> <list>
#> 1 <chr [2]> <chr [2]> name <chr [1]>
#> 2 <chr [0]> <chr [0]> name <chr [1]>
#> 3 <chr [0]> <chr [0]> name <chr [1]>
#> 4 <chr [0]> <chr [1]> name <chr [1]>
#> 5 <chr [1]> <chr [0]> name <chr [1]>
#> 6 <chr [0]> <chr [0]> name <chr [1]>
#> 7 <chr [0]> <chr [0]> name <chr [1]>
#> 8 <chr [0]> <chr [0]> name <chr [1]>
#> 9 <chr [0]> <chr [1]> name <chr [1]>
#> 10 <chr [1]> <chr [5]> name <chr [1]>
#> # ... with 947 more rows
starwars %>% gather("key", "value", -"height")
#> # A tibble: 1,044 x 3
#> height key value
#> <int> <chr> <list>
#> 1 172 name <chr [1]>
#> 2 167 name <chr [1]>
#> 3 96 name <chr [1]>
#> 4 202 name <chr [1]>
#> 5 150 name <chr [1]>
#> 6 178 name <chr [1]>
#> 7 165 name <chr [1]>
#> 8 97 name <chr [1]>
#> 9 183 name <chr [1]>
#> 10 182 name <chr [1]>
#> # ... with 1,034 more rows
Note that this only applies to c(), - or : because it would not make sense to write seq("name", "mass"). Also, it only makes sense to support strings in this way because of the special nature of selection functions. This wouldn’t work with mutate() or lm() since they wouldn’t be able to differentiate between a column name and an actual column (a string could be a legitimate column, recycled to column length if the data frame has more than one row).
The purpose of supporting strings in selection functions is to make it easier to unquote column names. Excluding columns with quasiquotation is now as simple as this:
x <- "height"
starwars %>% gather("key", "value", -(!! x))
#> # A tibble: 1,044 x 3
#> height key value
#> <int> <chr> <list>
#> 1 172 name <chr [1]>
#> 2 167 name <chr [1]>
#> 3 96 name <chr [1]>
#> 4 202 name <chr [1]>
#> 5 150 name <chr [1]>
#> 6 178 name <chr [1]>
#> 7 165 name <chr [1]>
#> 8 97 name <chr [1]>
#> 9 183 name <chr [1]>
#> 10 182 name <chr [1]>
#> # ... with 1,034 more rows
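In a function, the same pattern might look like the sketch below. It assumes tidyr is installed; gather_except() and the toy data frame are hypothetical and only meant to show a column name passed as a string and unquoted inside the selection.

```r
library(tidyr)

# Hypothetical helper: gather every column except the one named by `col`.
# `col` is a string, unquoted with `!!` so the selection sees the name itself.
gather_except <- function(data, col) {
  gather(data, "key", "value", -(!! col))
}

# Toy data: `id` is kept as is, `a` and `b` are gathered
df <- data.frame(id = 1:2, a = 3:4, b = 5:6)
out <- gather_except(df, "id")
```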
The second feature introduced in tidyselect 0.2.0 is the last_col() helper. We noticed in bug reports that many people use variants of:
x <- starwars
x %>% gather("key", "value", 3 : ncol(x))
#> # A tibble: 957 x 4
#> name height key value
#> <chr> <int> <chr> <list>
#> 1 Luke Skywalker 172 mass <dbl [1]>
#> 2 C-3PO 167 mass <dbl [1]>
#> 3 R2-D2 96 mass <dbl [1]>
#> 4 Darth Vader 202 mass <dbl [1]>
#> 5 Leia Organa 150 mass <dbl [1]>
#> 6 Owen Lars 178 mass <dbl [1]>
#> 7 Beru Whitesun lars 165 mass <dbl [1]>
#> 8 R5-D4 97 mass <dbl [1]>
#> 9 Biggs Darklighter 183 mass <dbl [1]>
#> 10 Obi-Wan Kenobi 182 mass <dbl [1]>
#> # ... with 947 more rows
That is potentially unsafe in functions since the data frame might contain a column named x. You can now use last_col() instead:
# Importing last_col() because it's not exported in dplyr yet
last_col <- tidyselect::last_col
x %>% gather("key", "value", 3 : last_col())
#> # A tibble: 957 x 4
#> name height key value
#> <chr> <int> <chr> <list>
#> 1 Luke Skywalker 172 mass <dbl [1]>
#> 2 C-3PO 167 mass <dbl [1]>
#> 3 R2-D2 96 mass <dbl [1]>
#> 4 Darth Vader 202 mass <dbl [1]>
#> 5 Leia Organa 150 mass <dbl [1]>
#> 6 Owen Lars 178 mass <dbl [1]>
#> 7 Beru Whitesun lars 165 mass <dbl [1]>
#> 8 R5-D4 97 mass <dbl [1]>
#> 9 Biggs Darklighter 183 mass <dbl [1]>
#> 10 Obi-Wan Kenobi 182 mass <dbl [1]>
#> # ... with 947 more rows
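Wrapped in a function, this could look like the following sketch, which assumes tidyr and tidyselect are installed. gather_from() and its start argument are hypothetical names; last_col() is called with its namespace prefix so that nothing needs to be imported.

```r
library(tidyr)

# Hypothetical helper: gather from position `start` to the last column.
# `!! start` inlines the argument, and last_col() avoids calling ncol()
# on a name that could clash with a column of `data`.
gather_from <- function(data, start) {
  gather(data, "key", "value", (!! start):tidyselect::last_col())
}

# Toy data: columns `c` and `d` are gathered, `a` and `b` are kept
df <- data.frame(a = 1, b = 2, c = 3, d = 4)
out <- gather_from(df, 3)
```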