dplyr 0.8.0

We’re tickled pink to announce the release of version 0.8.0 of dplyr, the grammar of data manipulation in the tidyverse. This is a major update that has kept us busy for almost a year. We take the coincidence of a Valentine’s Day release as a sign of continuous ❤️ for dplyr’s approach to tidy data manipulation.

Important changes are discussed in detail in the pre-release post. We are grateful to members of the community for their feedback over the last couple of months; it has been tremendously useful in making the release process smoother.

The bulk of the changes are internal, and part of an ongoing effort to make the codebase more robust and less surprising. This is an investment that will continue to pay off for years, and serve as a foundation for more innovations in the future.

For a comprehensive list of changes, please see the NEWS for the 0.8.0 release. The sections below discuss the main changes.

Group hug

Grouping has always been at the center of what dplyr is about. This release expands on the existing group_by() with a set of experimental functions that offer a variety of perspectives on the notion of grouping.

We believe they offer new unique possibilities, but we welcome community feedback and use cases before we put a 💍 on them. Let’s illustrate them with a subset from the well-known gapminder data.

library(dplyr)

oceania <- gapminder::gapminder %>% 
  filter(continent == "Oceania") %>% 
  mutate(yr1952 = year - 1952) %>% 
  select(-continent) %>% 
  group_by(country)
oceania
#> # A tibble: 24 x 6
#> # Groups:   country [2]
#>    country    year lifeExp      pop gdpPercap yr1952
#>    <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
#>  1 Australia  1952    69.1  8691212    10040.      0
#>  2 Australia  1957    70.3  9712569    10950.      5
#>  3 Australia  1962    70.9 10794968    12217.     10
#>  4 Australia  1967    71.1 11872264    14526.     15
#>  5 Australia  1972    71.9 13177000    16789.     20
#>  6 Australia  1977    73.5 14074100    18334.     25
#>  7 Australia  1982    74.7 15184200    19477.     30
#>  8 Australia  1987    76.3 16257249    21889.     35
#>  9 Australia  1992    77.6 17481977    23425.     40
#> 10 Australia  1997    78.8 18565243    26998.     45
#> # … with 14 more rows

group_nest() is similar to tidyr::nest(), but focuses on the variables to nest by instead of the nested columns.

oceania %>% 
  group_nest()
#> # A tibble: 2 x 2
#>   country     data             
#>   <fct>       <list>           
#> 1 Australia   <tibble [12 × 5]>
#> 2 New Zealand <tibble [12 × 5]>
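group_nest() can also be called on an ungrouped data frame, in which case you name the variables to nest by directly. The following is a minimal sketch on the built-in iris data (chosen here only because it needs no extra package), not a reproduction of the gapminder example above:

```r
library(dplyr)

# On an ungrouped data frame, pass the variables to nest by.
nested <- iris %>%
  group_nest(Species)

# The nesting variable stays a regular column; all other columns
# are packed into the list-column `data`.
names(nested)

# One row per species.
nrow(nested)

# Each element of `data` is a tibble with the remaining columns.
names(nested$data[[1]])
```

The point of the design is that you only spell out the nesting variable (Species); you never have to list the columns that end up inside data.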

group_split() is a tidy version of base::split(). In particular, it respects a group_by()-like grouping specification, and refuses to name its result.

oceania %>% 
  group_split()
#> [[1]]
#> # A tibble: 12 x 6
#>    country    year lifeExp      pop gdpPercap yr1952
#>    <fct>     <int>   <dbl>    <int>     <dbl>  <dbl>
#>  1 Australia  1952    69.1  8691212    10040.      0
#>  2 Australia  1957    70.3  9712569    10950.      5
#>  3 Australia  1962    70.9 10794968    12217.     10
#>  4 Australia  1967    71.1 11872264    14526.     15
#>  5 Australia  1972    71.9 13177000    16789.     20
#>  6 Australia  1977    73.5 14074100    18334.     25
#>  7 Australia  1982    74.7 15184200    19477.     30
#>  8 Australia  1987    76.3 16257249    21889.     35
#>  9 Australia  1992    77.6 17481977    23425.     40
#> 10 Australia  1997    78.8 18565243    26998.     45
#> 11 Australia  2002    80.4 19546792    30688.     50
#> 12 Australia  2007    81.2 20434176    34435.     55
#> 
#> [[2]]
#> # A tibble: 12 x 6
#>    country      year lifeExp     pop gdpPercap yr1952
#>    <fct>       <int>   <dbl>   <int>     <dbl>  <dbl>
#>  1 New Zealand  1952    69.4 1994794    10557.      0
#>  2 New Zealand  1957    70.3 2229407    12247.      5
#>  3 New Zealand  1962    71.2 2488550    13176.     10
#>  4 New Zealand  1967    71.5 2728150    14464.     15
#>  5 New Zealand  1972    71.9 2929100    16046.     20
#>  6 New Zealand  1977    72.2 3164900    16234.     25
#>  7 New Zealand  1982    73.8 3210650    17632.     30
#>  8 New Zealand  1987    74.3 3317166    19007.     35
#>  9 New Zealand  1992    76.3 3437674    18363.     40
#> 10 New Zealand  1997    77.6 3676187    21050.     45
#> 11 New Zealand  2002    79.1 3908037    23190.     50
#> 12 New Zealand  2007    80.2 4115771    25185.     55
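To make the "refuses to name its result" point concrete, here is a small sketch comparing group_split() with base::split(). It uses the built-in iris data as a stand-in, and the object name pieces is ours, purely for illustration:

```r
library(dplyr)

# A group_by()-like specification can be given directly.
pieces <- iris %>%
  group_split(Species)

# One element per group.
length(pieces)

# No names are attached to the result.
names(pieces)

# base::split() names each piece after the factor level instead.
names(split(iris, iris$Species))
```

If you need to know which piece belongs to which group, the keys are still available from the grouping columns inside each piece, or from group_keys() on the grouped data.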

group_map() and group_walk() offer a way to iterate on groups of a grouped data frame.

oceania %>% 
  mutate(yr1952 = year - 1952) %>% 
  group_map(~broom::tidy(stats::lm(lifeExp ~ yr1952, data = .x)))
#> # A tibble: 4 x 6
#> # Groups:   country [2]
#>   country     term        estimate std.error statistic  p.value
#>   <fct>       <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 Australia   (Intercept)   68.4      0.337      203.  2.07e-19
#> 2 Australia   yr1952         0.228    0.0104      21.9 8.67e-10
#> 3 New Zealand (Intercept)   68.7      0.437      157.  2.66e-18
#> 4 New Zealand yr1952         0.193    0.0135      14.3 5.41e- 8
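group_walk() is the side-effect counterpart: the function is called once per group with the rows of the group (.x) and a one-row tibble holding the group key (.y). Below is a minimal sketch, assuming the built-in iris data and a hypothetical environment called sizes that simply records the number of rows seen for each group:

```r
library(dplyr)

# Hypothetical accumulator used only for this sketch.
sizes <- new.env()

iris %>%
  group_by(Species) %>%
  group_walk(function(.x, .y) {
    # .x: rows of the current group
    # .y: one-row tibble with the group key
    assign(as.character(.y$Species), nrow(.x), envir = sizes)
  })

# One entry per group was recorded.
sort(ls(sizes))
get("setosa", envir = sizes)
```

In practice the side effect would more likely be writing a file or drawing a plot per group; recording a count just keeps the sketch self-contained.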

group_data(), group_rows(), and group_keys() expose the grouping information, which has been restructured as a tibble.

oceania %>% 
  group_data()
#> # A tibble: 2 x 2
#>   country     .rows     
#>   <fct>       <list>    
#> 1 Australia   <int [12]>
#> 2 New Zealand <int [12]>

oceania %>% 
  group_keys()
#> # A tibble: 2 x 1
#>   country    
#>   <fct>      
#> 1 Australia  
#> 2 New Zealand

oceania %>% 
  group_rows()
#> [[1]]
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12
#> 
#> [[2]]
#>  [1] 13 14 15 16 17 18 19 20 21 22 23 24
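The three functions are views of the same structure: as the output above suggests, group_keys() matches the key columns of group_data(), and group_rows() matches its .rows list-column. A short sketch of that relationship, using the built-in mtcars data grouped by cyl as an assumed example:

```r
library(dplyr)

by_cyl <- mtcars %>%
  group_by(cyl)

# group_data(): key column(s) followed by the `.rows` list-column.
gd <- group_data(by_cyl)
names(gd)

# group_keys(): just the key columns, one row per group.
keys <- group_keys(by_cyl)
keys$cyl

# group_rows(): just the row indices, one integer vector per group.
rows <- group_rows(by_cyl)

# Every row of the original data belongs to exactly one group.
sum(lengths(rows)) == nrow(mtcars)
```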

group_by() gains a .drop argument which you can set to FALSE to respect empty groups associated with factors (more on this below).

Give factors some love

The internal grouping algorithm has been redesigned to make it possible to better respect factor levels and empty groups. To limit the disruption, we have not made this the default behaviour. To keep empty groups, you have to set group_by()’s .drop argument to FALSE.

This can make data manipulation more predictable and reliable, because when factors are involved, the groups are based on the levels of the factors, rather than on which levels happen to have data points.

Let’s illustrate this with our favourite flowers 💐, and a function, species_count(), that counts the number of each species after a filter(), and structures it as a tibble with one column per species.

species_count <- function(...) {
  iris %>% 
    filter(...) %>% 
    group_by(Species, .drop = FALSE) %>% 
    summarise(n = n()) %>% 
    tidyr::spread(Species, n)
}

Because we use .drop = FALSE we get one column per level of the factor, even when there’s no data associated with a level:

species_count(Petal.Length > 3)
#> # A tibble: 1 x 3
#>   setosa versicolor virginica
#>    <int>      <int>     <int>
#> 1      0         49        50
species_count(Petal.Length > 6.5)
#> # A tibble: 1 x 3
#>   setosa versicolor virginica
#>    <int>      <int>     <int>
#> 1      0          0         4
species_count(Petal.Length > 42)
#> # A tibble: 1 x 3
#>   setosa versicolor virginica
#>    <int>      <int>     <int>
#> 1      0          0         0

Getting zeros instead of missing columns makes it easier to combine multiple results:

limits <- seq(0, 8, by = .5)
limits %>% 
  purrr::map_dfr( ~species_count(Petal.Length > .x)) %>% 
  mutate(Petal.Length = limits) %>% 
  select(Petal.Length, everything())
#> # A tibble: 17 x 4
#>    Petal.Length setosa versicolor virginica
#>           <dbl>  <int>      <int>     <int>
#>  1          0       50         50        50
#>  2          0.5     50         50        50
#>  3          1       49         50        50
#>  4          1.5     13         50        50
#>  5          2        0         50        50
#>  6          2.5      0         50        50
#>  7          3        0         49        50
#>  8          3.5      0         45        50
#>  9          4        0         34        50
#> 10          4.5      0         14        49
#> 11          5        0          1        41
#> 12          5.5      0          0        25
#> 13          6        0          0         9
#> 14          6.5      0          0         4
#> 15          7        0          0         0
#> 16          7.5      0          0         0
#> 17          8        0          0         0
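For contrast, here is a minimal sketch of the default behaviour next to .drop = FALSE. The data frame df and its factor f, which has an unused level "c", are made up for illustration:

```r
library(dplyr)

# Toy data: level "c" exists in the factor but has no rows.
df <- tibble(
  f = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  x = 1:3
)

# Default: the empty level is dropped, so only two groups remain.
dropped <- df %>%
  group_by(f) %>%
  summarise(n = n())
nrow(dropped)
#> [1] 2

# .drop = FALSE: one group per factor level, with a zero count for "c".
kept <- df %>%
  group_by(f, .drop = FALSE) %>%
  summarise(n = n())
kept$n
#> [1] 2 1 0
```

With the default, the number of rows in the summary depends on which levels survive earlier filtering; with .drop = FALSE it depends only on the levels of the factor.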

Thanks

Thanks to all contributors for this release.

@abouf, @adisarid, @adrfantini, @aetiologicCanada, @afdta, @albertomv83, @alistaire47, @aloes2512, @andresimi, @antaldaniel, @AnthonyEbert, @ArtemSokolov, @AshesITR, @bakaburg1, @batpigandme, @bbachrach, @bbolker, @behrman, @BenjaminLouis, @bifouba, @billdenney, @bnicenboim, @BobMuenchen, @brooke-watson, @CarolineBarret, @cbailiss, @CerebralMastication, @cfhammill, @cfry-propeller, @choisy, @ChrisBeeley, @chrsigg, @clauswilke, @ClaytonJY, @colearendt, @ColinFay, @coolbutuseless, @Copepoda, @cpsievert, @dah33, @damianooldoni, @DanChaltiel, @danyal123, @DavisVaughan, @Demetrio92, @dewoller, @dfalbel, @DiogoFerrari, @dirkschumacher, @dmenne, @dmvianna, @dongzhuoer, @earowang, @echasnovski, @eddelbuettel, @EdwinTh, @eijoac, @elbersb, @Eli-Berkow, @EmilHvitfeldt, @epetrovski, @erblast, @etienne-s, @foundinblank, @FrancoisGuillem, @geotheory, @ggrothendieck, @GoldbergData, @gowerc, @grayskripko, @GrimTrigger88, @grizzthepro64, @hadley, @hafen, @heavywatal, @helix123, @henrikmidtiby, @hpeaker, @htc502, @hughjonesd, @ignacio82, @igoldin2u, @igordot, @ilarischeinin, @Ilia-Kosenkov, @IndrajeetPatil, @ipofanes, @jasonmhoule, @jayhesselberth, @jennybc, @jepusto, @jflynn264, @jialu512, @JiaxiangBU, @jimhester, @jkylearmstrongibx, @jnolis, @JohnMount, @jonkeane, @jonthegeek, @jschelbert, @jsekamane, @jtelleria, @kendonB, @kevinykuo, @krlmlr, @langbe, @ldecicco-USGS, @leungi, @libbieweimer, @lionel-, @liz-is, @lloven, @ltrgoddard, @luccastermans, @maicel1978, @Make42, @MalditoBarbudo, @markdly, @markvanderloo, @mattbk, @maxheld83, @melissakey, @mem48, @mgirlich, @mikmart, @MilesMcBain, @minhsphuc12, @mkoohafkan, @momeara, @moodymudskipper, @move[bot], @nealpsmith, @NightWinkle, @o1iv3r, @PascalKieslich, @petermeissner, @peterzsohar, @philstraforelli, @PMassicotte, @PPICARDO, @privefl, @prokulski, @quartin, @rabutler-usbr, @ramongallego, @randomgambit, @rappster, @rensa, @reshmamena, @richard987, @richierocks, @RickPack, @riship2009, @RobertMyles, @romainfrancois, @rontomer, 
@roumail, @rozsoma, @rundel, @rupesh2017, @s-fleck, @S-UP, @salmansyed0709, @schloerke, @seasmith, @sharlagelfand, @shizidushu, @simon-anasta, @skaltman, @skylarhopkins, @sowla, @statsccpr, @stenhaug, @streamline55, @stuartE9, @stufield, @suzanbaert, @sverchkov, @thackl, @the-knife, @ThiAmm, @thisisnic, @tinyheero, @tmelconian, @tobadia, @tonyelhabr, @torbjorn, @trueNico, @tungmilan, @TylerGrantSmith, @ukkonen, @vincentanutama, @vnijs, @wanfahmi, @waynelapierre, @wch, @wdenton, @wgrundlingh, @wmayner, @wolski, @yiqinfu, @yutannihilation, @Zanidean, @Zedseayou, @zslajchrt, @zx8754, and @zzygyx9119.