dplyr 0.8.2

Romain François

Introduction

We’re delighted to announce the release of dplyr 0.8.2 on CRAN 🍉 !

This is a minor maintenance release in the 0.8.* series, addressing a collection of issues since the 0.8.1 and 0.8.0 versions.

top_n() and top_frac()

top_n() has been around for a long time in dplyr, as a convenient wrapper around filter() and min_rank(), to select top (or bottom) entries in each group of a tibble.

In this release, top_n() is no longer limited to a constant number of entries per group, its n argument is now quoted to be evaluated later in the context of the group.

Here are the top half countries, i.e. n() / 2, in terms of life expectancy in 2007.

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
gapminder::gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  top_n(n() / 2, lifeExp)
#> # A tibble: 70 x 6
#> # Groups:   continent [5]
#>    country   continent  year lifeExp        pop gdpPercap
#>    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 Algeria   Africa     2007    72.3   33333216     6223.
#>  2 Argentina Americas   2007    75.3   40301927    12779.
#>  3 Australia Oceania    2007    81.2   20434176    34435.
#>  4 Austria   Europe     2007    79.8    8199783    36126.
#>  5 Bahrain   Asia       2007    75.6     708573    29796.
#>  6 Belgium   Europe     2007    79.4   10392226    33693.
#>  7 Benin     Africa     2007    56.7    8078314     1441.
#>  8 Canada    Americas   2007    80.7   33390141    36319.
#>  9 Chile     Americas   2007    78.6   16284741    13172.
#> 10 China     Asia       2007    73.0 1318683096     4959.
#> # … with 60 more rows

top_frac() is new convenience shortcut for the top n percent, i.e.

gapminder::gapminder %>% 
  filter(year == 2007) %>% 
  group_by(continent) %>% 
  top_frac(0.5, lifeExp)
#> # A tibble: 70 x 6
#> # Groups:   continent [5]
#>    country   continent  year lifeExp        pop gdpPercap
#>    <fct>     <fct>     <int>   <dbl>      <int>     <dbl>
#>  1 Algeria   Africa     2007    72.3   33333216     6223.
#>  2 Argentina Americas   2007    75.3   40301927    12779.
#>  3 Australia Oceania    2007    81.2   20434176    34435.
#>  4 Austria   Europe     2007    79.8    8199783    36126.
#>  5 Bahrain   Asia       2007    75.6     708573    29796.
#>  6 Belgium   Europe     2007    79.4   10392226    33693.
#>  7 Benin     Africa     2007    56.7    8078314     1441.
#>  8 Canada    Americas   2007    80.7   33390141    36319.
#>  9 Chile     Americas   2007    78.6   16284741    13172.
#> 10 China     Asia       2007    73.0 1318683096     4959.
#> # … with 60 more rows

tbl_vars() and group_cols()

tbl_vars() now returns a dplyr_sel_vars object that keeps track of the grouping variables. This information powers group_cols(), which can now be used in every function that uses tidy selection of columns.

Functions in the tidyverse and beyond may start to use the tbl_vars()/group_cols() duo, starting from tidyr and this pull request

# pak::pkg_install("tidyverse/tidyr#668")

iris %>%
  group_by(Species) %>% 
  tidyr::gather("flower_att", "measurement", -group_cols())
#> # A tibble: 600 x 3
#> # Groups:   Species [3]
#>    Species flower_att   measurement
#>    <fct>   <chr>              <dbl>
#>  1 setosa  Sepal.Length         5.1
#>  2 setosa  Sepal.Length         4.9
#>  3 setosa  Sepal.Length         4.7
#>  4 setosa  Sepal.Length         4.6
#>  5 setosa  Sepal.Length         5  
#>  6 setosa  Sepal.Length         5.4
#>  7 setosa  Sepal.Length         4.6
#>  8 setosa  Sepal.Length         5  
#>  9 setosa  Sepal.Length         4.4
#> 10 setosa  Sepal.Length         4.9
#> # … with 590 more rows

group_split(), group_map(), group_modify()

group_split() always keeps a ptype attribute to track the prototype of the splits.

mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_split()
#> list()
#> attr(,"ptype")
#> # A tibble: 0 x 11
#> # … with 11 variables: mpg <dbl>, cyl <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>

group_map() and group_modify() benefit from this in the edge case where there are no groups.

mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_map(~.x)
#> list()
#> attr(,"ptype")
#> # A tibble: 0 x 10
#> # … with 10 variables: mpg <dbl>, disp <dbl>, hp <dbl>, drat <dbl>,
#> #   wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>, carb <dbl>

mtcars %>%
  group_by(cyl) %>%
  filter(hp < 0) %>% 
  group_modify(~.x)
#> # A tibble: 0 x 11
#> # Groups:   cyl [0]
#> # … with 11 variables: cyl <dbl>, mpg <dbl>, disp <dbl>, hp <dbl>,
#> #   drat <dbl>, wt <dbl>, qsec <dbl>, vs <dbl>, am <dbl>, gear <dbl>,
#> #   carb <dbl>

Thanks

Thanks to all contributors for this release.

@abirasathiy, @ajkroeg, @alejandroschuler, @anuj2054, @arider2, @arielfuentes, @artidata, @BenPVD, @bkmontgom, @brodieG, @cderv, @clanker, @clemenshug, @CSheehan1, @danielecook, @dannyparsons, @daskandalis, @davidbaniadam, @DavisVaughan, @deliciouslytyped, @earowang, @fkatharina, @hadley, @Hardervidertsie, @iago-pssjd, @IndrajeetPatil, @jackdolgin, @jangorecki, @jimhester, @jjesusfilho, @jonjhitchcock, @jxu, @krlmlr, @laresbernardo, @lionel-, @LukeGoodsell, @madmark81, @MarkusBerroth, @matheus-donato, @mattfidler, @MatthieuStigler, @md0u80c9, @michaelhogersosis, @MikeJohnPage, @MJL9588, @moodymudskipper, @mwillumz, @Nelson-Gon, @qdread, @randomgambit, @rcorty, @romainfrancois, @romatik, @spressi, @sstoeckl, @stephLH, @urskalbitzer, @vpanfilov, and @ZahraEconomist.

Upcoming events
London
November 18 – 19, 2019
This two-day course will provide an overview of using R for supervised learning. You will learn how to build, visualize, test, and compare prediction-based models.
Miami, FL
December 16 – 18, 2019
Improve your tool building skills with this small hands-on workshop on a boat. All profits go to support the mission of the field school.