dplyr 1.0.0: last minute additions

Hadley Wickham, Kirill Müller

This post is the latest in a series of post leading up the the dplyr 1.0.0 release on May 15. So far, the series has covered:

Today I wanted to talk about two cool new features that we’ve added since I started blogging about dplyr 1.0.0: summarise() now gives you greater control over how the results are grouped, and a new set of functions make it easier to modify rows.

Getting the dev version

If you’d like to try out anything you see in this blog post, you can install the development version of dplyr with:

devtools::install_github("tidyverse/dplyr")

Note that the development version won’t become 1.0.0 until it’s released, but at this point, it’s very similar to what we’ll be submitting to CRAN on May 15.

library(dplyr, warn.conflicts = FALSE)

summarise() and grouping

There's a common confusion about the result of summarise(). How do you think the result of the following code will be grouped?

homeworld_species <- starwars %>% 
  group_by(homeworld, species) %>% 
  summarise(n = n())

You might be surprised to learn that it’s grouped by homeworld:

head(homeworld_species, 3)
#> # A tibble: 3 x 3
#> # Groups:   homeworld [3]
#>   homeworld   species     n
#>   <chr>       <chr>   <int>
#> 1 Alderaan    Human       3
#> 2 Aleen Minor Aleena      1
#> 3 Bespin      Human       1

That’s because summarise() always peels off the last group, based on the logic that this group now occupies a single row so there’s no point grouping by it. This behaviour made perfect sense to me at the time I implemented it, but it’s been a long standing source of confusion among dplyr users (and it doesn’t make sense if your summary returns multiple rows).

Unfortunately, it would be very difficult to change this default now because a lot of code probably relies on it. Instead, we’re doing the next best thing: exposing the default behaviour more explicitly and making it easier to change. In dplyr 1.0.0, the code above will display a message telling you how the result has been grouped:

homeworld_species <- starwars %>% 
  group_by(homeworld, species) %>% 
  summarise(n = n())
#> `summarise()` regrouping by 'homeworld' (override with `.groups` argument)

The text hints at how to take control of grouping and eliminate the message: a new .groups argument allows you to control the grouping of the result. It currently has four possible values:

  • .groups = "drop_last" drops the last grouping level (i.e. the default behaviour sans message).
  • .groups = "drop" drops all grouping levels and returns a tibble.
  • .groups = "keep" preserves the grouping of the input.
  • .groups = "rowwise" turns each row into its own group.

If you find the default message annoying, you can suppress by setting a global option:

options(dplyr.summarise.inform = FALSE)

.groups is very new, so we’ve marked it as experimental, meaning that it may change in the future. Please let us know what you think of it to help us make a decision about its future.

Row mutation

Thanks to Kirill Müller, dplyr has a new experimental family of row mutation functions inspired by SQL’s UPDATE, INSERT, UPSERT, and DELETE. Like the join functions, they all work with a pair of data frames:

  • rows_update(x, y) updates existing rows in x with values in y.
  • rows_patch(x, y) works like rows_update() but only changes NA values.
  • rows_insert(x, y) adds new rows to x from y.
  • rows_upsert(x, y) updates existing rows in x and adds new rows from y.
  • rows_delete(x, y) deletes rows in x that match rows in y.

The rows_ functions match x and y using keys. A key is one or more variables that uniquely identifies each row. All rows_ functions check that the keys of x and y are valid (i.e. unique) before doing anything.

Let’s see how these work with some toy data:

df <- tibble(a = 1:3, b = letters[c(1:2, NA)], c = 0.5 + 0:2)
df
#> # A tibble: 3 x 3
#>       a b         c
#>   <int> <chr> <dbl>
#> 1     1 a       0.5
#> 2     2 b       1.5
#> 3     3 <NA>    2.5

We can use rows_insert() to add new rows:

new <- tibble(a = c(4, 5), b = c("d", "e"), c = c(3.5, 4.5))
df %>% rows_insert(new)
#> Matching, by = "a"
#> # A tibble: 5 x 3
#>       a b         c
#>   <dbl> <chr> <dbl>
#> 1     1 a       0.5
#> 2     2 b       1.5
#> 3     3 <NA>    2.5
#> 4     4 d       3.5
#> 5     5 e       4.5

Note that rows_insert() will fail if we attempt to insert a row that already exists:

df %>% rows_insert(tibble(a = 3, b = "c"))
#> Matching, by = "a"
#> Error: Attempting to insert duplicate rows.

(The error messages are very minimal right now; if people find these functions useful we’ll invest more effort in useful errors.)

If you want to update existing values, use rows_update(). As you might expect, it’ll error if one of the rows to update doesn’t exist:

df %>% rows_update(tibble(a = 3, b = "c"))
#> Matching, by = "a"
#> # A tibble: 3 x 3
#>       a b         c
#>   <int> <chr> <dbl>
#> 1     1 a       0.5
#> 2     2 b       1.5
#> 3     3 c       2.5

df %>% rows_update(tibble(a = 4, b = "d"))
#> Matching, by = "a"
#> Error: Attempting to update missing rows.

rows_patch() is a variant of rows_update() that will only update values in x that are NA.

df %>% 
  rows_patch(tibble(a = 2:3, b = "B"))
#> Matching, by = "a"
#> # A tibble: 3 x 3
#>       a b         c
#>   <int> <chr> <dbl>
#> 1     1 a       0.5
#> 2     2 b       1.5
#> 3     3 B       2.5

If you want to update or insert, you can use rows_upsert():

df %>% 
  rows_upsert(tibble(a = 3, b = "c")) %>% 
  rows_upsert(tibble(a = 4, b = "d"))
#> Matching, by = "a"
#> Matching, by = "a"
#> # A tibble: 4 x 3
#>       a b         c
#>   <dbl> <chr> <dbl>
#> 1     1 a       0.5
#> 2     2 b       1.5
#> 3     3 c       2.5
#> 4     4 d      NA

These functions are designed particularly with an eye towards mutable backends where you really might want to modify existing datasets in place (e.g. data.tables, databases, and googlesheets). That’s a dangerous operation so you’ll need to explicitly opt-in to modification with in_place = TRUE. For example, the dm package will use these functions to update multiple related tables in the correct order, in memory or on the database. Expect to hear more about this in the future.