dplyr 0.7.5

Photo by Phúc Long

We’re excited to announce version 0.7.5 of the dplyr package, the grammar of data manipulation in the tidyverse. This minor release includes the move to tidyselect, features like scoped operations on grouped data frames and support for raw vectors, and a number of bug fixes. Please see the release notes for the full list of improvements and bug fixes.

The next planned release of dplyr, for which work has already started, will be a feature release. Many of its features are already available in the development version.

tidyselect

dplyr has always supported selecting columns by name, by exclusion, by range, by match, by position, or by unquoting a list of symbols:

library(dplyr)

tbl <- data.frame(a1 = 1, a2 = 2, a3 = 3, b = "x")

tbl %>%
  select(a1, a2)
#>   a1 a2
#> 1  1  2
tbl %>%
  select(-b)
#>   a1 a2 a3
#> 1  1  2  3
tbl %>%
  select(a1:a3)
#>   a1 a2 a3
#> 1  1  2  3
tbl %>%
  select(starts_with("a"))
#>   a1 a2 a3
#> 1  1  2  3
tbl %>%
  select(2:4)
#>   a2 a3 b
#> 1  2  3 x
vars <- syms(c("a2", "b"))
tbl %>%
  select(!!!vars)
#>   a2 b
#> 1  2 x

Last year, the core code that provides this functionality was moved out of dplyr into the fairly new tidyselect package. Selecting columns in a data frame (or items in a character vector, for that matter) is a common task in many other situations. The tidyselect package offers a consistent and convenient interface with full support for quasiquotation, and is used by more than 20 packages, and now also by dplyr. Internally, the select() calls above are translated into the following tidyselect operations:

tbl_names <- names(tbl)

tbl_names %>%
  tidyselect::vars_select(a1, a2)
#>   a1   a2 
#> "a1" "a2"
tbl_names %>%
  tidyselect::vars_select(-b)
#>   a1   a2   a3 
#> "a1" "a2" "a3"
tbl_names %>%
  tidyselect::vars_select(a1:a3)
#>   a1   a2   a3 
#> "a1" "a2" "a3"
tbl_names %>%
  tidyselect::vars_select(starts_with("a"))
#>   a1   a2   a3 
#> "a1" "a2" "a3"
tbl_names %>%
  tidyselect::vars_select(2:4)
#>   a2   a3    b 
#> "a2" "a3"  "b"
vars <- syms(c("a2", "b"))
tbl_names %>%
  tidyselect::vars_select(!!!vars)
#>   a2    b 
#> "a2"  "b"
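
Because tidyselect operates on plain character vectors, other packages can offer the same selection syntax without going through dplyr. The following is a minimal sketch of that idea, not code taken from any particular package: pick_names() is a hypothetical helper that simply forwards its dots to tidyselect::vars_select().

```r
# pick_names() is a hypothetical helper, shown for illustration only.
# It forwards the selection expressions in `...` to tidyselect, which
# resolves them against the column names of `data`.
pick_names <- function(data, ...) {
  tidyselect::vars_select(names(data), ...)
}

tbl <- data.frame(a1 = 1, a2 = 2, a3 = 3, b = "x")

# The helper is namespaced here so the sketch does not rely on
# dplyr or tidyselect being attached.
by_prefix <- pick_names(tbl, tidyselect::starts_with("a"))
without_b <- pick_names(tbl, -b)
```

Both calls should return the names a1, a2 and a3, mirroring the select() calls shown earlier.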

The net effect of this change is improved consistency across the tidyverse and the other packages that use tidyselect. The user interface is affected in two ways:

  • The select_vars(), select_var() and rename_vars() functions are soft-deprecated and will start issuing warnings in a future version. Instead, use tidyselect::vars_select(), tidyselect::vars_pull() and tidyselect::vars_rename(), respectively.

  • select() and rename() fully support character vectors. You can now unquote variables like this:

    vars <- c("a2", "b")
    select(tbl, !!vars)
    #>   a2 b
    #> 1  2 x
    
    select(tbl, -(!!vars))
    #>   a1 a3
    #> 1  1  3
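
The examples above only show vars_select(). As a rough sketch of the other two replacements named in the first bullet, the snippet below calls tidyselect::vars_pull() and tidyselect::vars_rename() on the same column names. The variable names pulled and renamed are chosen for illustration only.

```r
tbl_names <- c("a1", "a2", "a3", "b")

# vars_pull() returns the name of a single variable,
# identified either by name or by position.
pulled <- tidyselect::vars_pull(tbl_names, a2)

# vars_rename() returns all variables; a renamed variable carries
# its new name as the name of the corresponding vector element.
renamed <- tidyselect::vars_rename(tbl_names, c = b)
```

Here pulled should be "a2", and renamed should keep all four original names as values, with the last element now named c.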
    

Scoped verbs for grouped data

Scoped verbs are useful when you want to apply the same operation to multiple columns. These functions end with _all (affect all columns), _at (affect selected columns), or _if (affect columns that satisfy a predicate), and replaced the older _each family of functions in dplyr 0.7.0. In the most recent version of dplyr, these functions have been extended to work on grouped data frames. Because the group columns need to stay unchanged, these operations work slightly differently on grouped data frames:

  • In select_*(), the group columns are always kept:

    grouped_iris <-
      iris %>%
      group_by(Species) %>%
      slice(1:2)
    
    grouped_iris %>%
      select_if(funs(is.numeric))
    #> # A tibble: 6 x 5
    #> # Groups:   Species [3]
    #>   Species    Sepal.Length Sepal.Width Petal.Length Petal.Width
    #>   <fct>             <dbl>       <dbl>        <dbl>       <dbl>
    #> 1 setosa              5.1         3.5          1.4         0.2
    #> 2 setosa              4.9         3            1.4         0.2
    #> 3 versicolor          7           3.2          4.7         1.4
    #> 4 versicolor          6.4         3.2          4.5         1.5
    #> 5 virginica           6.3         3.3          6           2.5
    #> 6 virginica           5.8         2.7          5.1         1.9
    
  • In mutate_*() and transmute_*(), group columns are never altered:

    grouped_iris %>%
      mutate_all(funs(. + 1))
    #> # A tibble: 6 x 5
    #> # Groups:   Species [3]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
    #> 1          6.1         4.5          2.4         1.2 setosa    
    #> 2          5.9         4            2.4         1.2 setosa    
    #> 3          8           4.2          5.7         2.4 versicolor
    #> 4          7.4         4.2          5.5         2.5 versicolor
    #> 5          7.3         4.3          7           3.5 virginica 
    #> 6          6.8         3.7          6.1         2.9 virginica
    
  • filter_*() currently includes group columns:

    grouped_iris %>%
      filter_if(funs(is.numeric), all_vars(. > 1))
    #> # A tibble: 4 x 5
    #> # Groups:   Species [2]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
    #> 1          7           3.2          4.7         1.4 versicolor
    #> 2          6.4         3.2          4.5         1.5 versicolor
    #> 3          6.3         3.3          6           2.5 virginica 
    #> 4          5.8         2.7          5.1         1.9 virginica
    
  • arrange_*() ignores group columns:

    grouped_iris %>%
      arrange_all()
    #> # A tibble: 6 x 5
    #> # Groups:   Species [3]
    #>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species   
    #>          <dbl>       <dbl>        <dbl>       <dbl> <fct>     
    #> 1          4.9         3            1.4         0.2 setosa    
    #> 2          5.1         3.5          1.4         0.2 setosa    
    #> 3          5.8         2.7          5.1         1.9 virginica 
    #> 4          6.3         3.3          6           2.5 virginica 
    #> 5          6.4         3.2          4.5         1.5 versicolor
    #> 6          7           3.2          4.7         1.4 versicolor
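
To make the special treatment of group columns more concrete, the sketch below compares a scoped verb on the grouped data with the same verb after ungroup(). It is only an illustration under the behaviour described above. The names kept, dropped and shifted are arbitrary, and plain functions are passed instead of funs() purely to keep the example short.

```r
library(dplyr)

grouped_iris <- iris %>%
  group_by(Species) %>%
  slice(1:2)

# On grouped data, Species is kept even though it is not numeric.
kept <- grouped_iris %>%
  select_if(is.numeric)

# After ungroup(), Species is an ordinary column and is dropped.
dropped <- grouped_iris %>%
  ungroup() %>%
  select_if(is.numeric)

# mutate_all() skips the grouping column and only transforms
# the four numeric columns.
shifted <- grouped_iris %>%
  mutate_all(function(x) x + 1)
```

If a scoped verb should treat the group columns like any other column, calling ungroup() first is one option.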
    

Raw vectors

The raw data type is a first-class citizen in R’s type system, but has been somewhat neglected in dplyr so far. In this version of dplyr you can compute on that data type:

raw_tbl <- tibble(a = 1:3, b = as.raw(1:3))
raw_tbl %>%
  filter(b < 2)
#> # A tibble: 1 x 2
#>       a b    
#>   <int> <raw>
#> 1     1 01
raw_tbl %>%
  arrange(desc(b))
#> # A tibble: 3 x 2
#>       a b    
#>   <int> <raw>
#> 1     3 03   
#> 2     2 02   
#> 3     1 01
all_equal(raw_tbl, slice(raw_tbl, 3:1))
#> [1] TRUE
left_join(slice(raw_tbl, 1:2), slice(raw_tbl, 2:3), by = "b")
#> # A tibble: 2 x 3
#>     a.x b       a.y
#>   <int> <raw> <int>
#> 1     1 01       NA
#> 2     2 02        2

Be aware that the raw type has no NA value, so the result of a join may be surprising: unmatched rows are filled with 00 instead of NA.

left_join(slice(raw_tbl, 1:2), slice(raw_tbl, 2:3), by = "a")
#> # A tibble: 2 x 3
#>       a b.x   b.y  
#>   <int> <raw> <raw>
#> 1     1 01    00   
#> 2     2 02    02
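
The same limitation shows up in base R, and it means that is.na() cannot be used to detect unmatched rows in a raw column. The sketch below shows the base R behaviour and one possible workaround, which is to carry a marker column on the right-hand table. The column name matched is an assumption made for this example, not part of dplyr.

```r
library(dplyr)

# Base R: coercing NA to raw yields 00 (with a warning), and
# is.na() is always FALSE for raw vectors.
coerced <- suppressWarnings(as.raw(NA))
is.na(coerced)

# `matched` is a hypothetical marker column. After the join it is NA
# for rows of `left` that have no partner in `right`.
left  <- tibble(a = 1:2, b = as.raw(1:2))
right <- tibble(a = 2:3, b = as.raw(2:3), matched = TRUE)

joined <- left_join(left, right, by = "a")
unmatched <- is.na(joined$matched)
```

With this approach, the 00 in b.y for the first row can be told apart from a genuine zero byte by checking unmatched.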

Welcome back Romain

Romain François, the author of the data frame backend for dplyr, has joined the team and hit the ground running. He has implemented many of the features described in this blog post, and is now focused on features of the next release.

The next release involves substantial refactoring of the internals to make hybrid evaluation simpler and less surprising, a new implementation of grouping that better respects levels of factors, and a redesign of the grouping metadata to replace the current collection of attributes with a single tidy tibble. This is ambitious work, and it is great to have Romain on board to tackle it.

Welcome (back), Romain, looking forward to a great time!

Acknowledgments

Thanks to all contributors to dplyr, your feedback helps make this package better and easier to use: @2533245542, @aammd, @ablack3, @adder, @AHoerner, @AjarKeen, @ajay-d, @alexfun, @alexhallam, @alexiglaser, @AljazJ, @amjiuzi, @andreaspano, @AndreMikulec, @andresimi, @andrewjpfeiffer, @anescalc, @AngryR11, @apreshill, @aswan89, @Athospd, @aurelberra, @austensen, @baileych, @batpigandme, @behrman, @benmarwick, @bensoltoff, @bheavner, @bigmw, @billdenney, @bilydr, @BishtDinesh, @bjornerstedt, @bkkkk, @bobokdalibor, @brendanf, @brianstamper, @briglass, @brooke-watson, @capelastegui, @cderv, @CerebralMastication, @ChadEfaw, @ChiWPak, @chrnin, @chunjiw, @cipherz, @cjyetman, @ckarras, @cmhh, @cnjr2, @colearendt, @ColinFay, @coloneltriq, @congdanh8391, @coolbutuseless, @copernican, @courtiol, @cperk, @cturbelin, @cuttlefish44, @daattali, @dadwalrajiv, @dan87134, @danielcanueto, @danielmcauley, @danielsjf, @danishahmadamu, @dantonnoriega, @darrkj, @DasHammett, @DataStrategist, @DataWookie, @davharris, @davidkane9, @DavisVaughan, @deeenes, @deymos314, @dgromer, @dhicks, @djbirke, @dkincaid, @donaldmusgrove, @dpeterson71, @dpolychr, @dpprdan, @drf5n, @dustindall, @eamoncaddigan, @earthcli, @echasnovski, @econandrew, @EconomiCurtis, @edgararuiz, @eduardgrebe, @edublancas, @EdwardJRoss, @edwindj, @EdwinTh, @edzer, @elben10, @EmilRehnberg, @emilyriederer, @enesn, @erikerhardt, @etiennebr, @evanbiederstedt, @filipefilardi, @flying-sheep, @fmichonneau, @fnamugera, @foo-bar-baz-qux, @foundinblank, @fpmcardoso, @Fredo-XVII, @gadenbuie, @ganong123, @garrettgman, @GeorgeRJacobs, @ggrothendieck, @ghaarsma, @gireeshkbogu, @greg-botwin, @gtumuluri, @GuillaumePressiat, @halpo, @hameddashti, @hannesmuehleisen, @happyfishyqy, @happyshows, @harryzyming, @hdelrio, @heavywatal, @Henrik-P, @homerhanumat, @Hong-Revo, @HuangRicky, @huftis, @hughjonesd, @iangow, @ijlyttle, @ilyaminati, @iron0012, @itcarroll, @jabranham, @Jafet, @jakefrost, @jalsalam, @jamesthurgood34, @jarauh, @jarekj71, 
@jarodmeng, @JasonAizkalns, @jasperDD, @javierluraschi, @jbao, @jcfisher, @jcheng5, @jennybc, @jerryfuyu0104, @jerryzhujian9, @jessekps, @jfcharney, @jgellar, @jhofman, @jianboli, @jimvine, @jjacks12, @jjchern, @JLYJabc, @jnolis, @joelgombin, @JohnMount, @jonocarroll, @josnarog, @jrosen48, @jrubinstein, @jschelbert, @jtelleriar, @jthurner, @jtrecenti, @juliangkr, @jwhendy, @jwnorman, @karldw, @KasperSkytte, @kdaily, @kerry-ja, @ketansahils, @kevinykuo, @kforner, @klmr, @knbknb, @knokknok, @Koantig, @komalsrathi, @konny0201, @kravhowe, @kylebarron, @kylelundstedt, @larmarange, @lawremi, @lbakerIsazi, @lepennec, @leungi, @lgautier, @lhunsicker, @lindesaysh, @lionel-, @lmullen, @Lopa2016, @lorenzwalthert, @lukeholman, @MargaretJones, @markvanderloo, @Marlein, @mathematiguy, @matsuobasho, @mattbaggott, @matthieugomez, @MatthieuStigler, @md0u80c9, @mdancho84, @mdlincoln, @metanoid, @mgirlich, @michaellevy, @mienkoja, @mikldk, @mine-cetinkaya-rundel, @mkirzon, @mkwiecinski, @mlell, @moodymudskipper, @mr-majkel, @mredaelli, @mrkowalski, @msberends, @msgoussi, @mtmorgan, @mungojam, @mwillumz, @my-katie, @MZLABS, @nachocab, @nc6, @neelrakholia, @Nick-Rivera, @nickbond, @nilescbn, @OssiLehtinen, @otoomet, @otsaw, @pachamaltese, @paulponcet, @petehobo, @PeterBolo, @pgensler, @phirsch, @piccolbo, @pierucci, @potterzot, @profdave, @Prometheus77, @pssguy, @QuentinRoy, @ramongallego, @rappster, @rasmusrhl, @rebeccaferrell, @renlund, @rgknight, @RickPack, @ringprince, @rkrug, @rtaph, @rundel, @russellpierce, @s-geissler, @saberbouabid, @salim-b, @sammcq, @sandan, @saurfang, @SeabassWells, @sfirke, @shizidushu, @shntnu, @sibojan, @Sidt1, @simonthelwall, @skranz, @sollano, @spedygiorgio, @srlivingstone, @stephlocke, @steromano, @stevenfazzio, @strengejacke, @stufield, @SulevR, @sz-cgt, @t-kalinowski, @takahisah, @thomascwells, @thomasp85, @timothyslau, @tobiasgerstenberg, @topepo, @tslumley, @tvedebrink, @twolodzko, @tzoltak, @VikrantDogra, @VincentGuyader, @vitallish, @vjcitn, 
@vnijs, @vpanfilov, @vspinu, @washcycle, @WaterworthD, @wch, @wenbostar, @wodsworth, @xuefliang, @youcc, @yutannihilation, @zeehio, @zenggyu, @zhangchuck, and @zx8754
