dplyr backends: multidplyr 0.1.0, dtplyr 1.1.0, dbplyr 2.1.0

  dplyr

  Hadley Wickham

One of my favourite things about dplyr is that it decouples describing the data manipulation you want from its actual computation. This makes it possible to take basically the same dplyr code and execute it in radically different ways by using different backends. This blog post covers a passel of updates to the dplyr backends that we maintain:

  • multidplyr, which spreads computation over multiple cores, is now on CRAN!

  • dtplyr, which translates your dplyr code to the wonderfully fast data.table package, now supports all dplyr 1.0.0 features.

  • dbplyr, which translates your dplyr code to SQL, now also translates many tidyr verbs.

You can install these packages in one fell swoop with:

install.packages(c("multidplyr", "dtplyr", "dbplyr"))

I’ll explain these changes in more detail below. But to get started, we need to load dplyr.

library(dplyr, warn.conflicts = FALSE)

multidplyr 0.1.0

multidplyr creates multiple R processes and spreads your data out across them, providing a simple way to take advantage of multiple cores. To use it, start by creating a cluster of R processes and load dplyr on them:

library(multidplyr)
cluster <- new_cluster(4)
cluster_library(cluster, "dplyr")

Then spread data across those processes using partition():

flight_dest <- nycflights13::flights %>% 
  group_by(dest) %>% 
  partition(cluster)
flight_dest
#> Source: party_df [336,776 x 19]
#> Groups: dest
#> Shards: 4 [81,594--86,548 rows]
#> 
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      544            545        -1     1004           1022
#> 2  2013     1     1      558            600        -2      923            937
#> 3  2013     1     1      559            600        -1      854            902
#> 4  2013     1     1      602            610        -8      812            820
#> 5  2013     1     1      602            605        -3      821            805
#> 6  2013     1     1      611            600        11      945            931
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

The data is now spread across four “shards”, each consisting of around 80,000 rows. Generally, you’ll want to group the data before partitioning; this ensures that all observations in a group end up on the same worker.

You can work with this party_df as if it were a local data frame, but any work will be spread out across all of the processes (which your operating system will usually allocate to different cores).

mean_delay <- flight_dest %>% 
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>% 
  filter(n > 25)
mean_delay
#> Source: party_df [96 x 3]
#> Shards: 4 [21--26 rows]
#> 
#>   dest  delay     n
#>   <chr> <dbl> <int>
#> 1 ABQ    4.38   254
#> 2 AUS    6.02  2439
#> 3 BQN    8.25   896
#> 4 BTV    8.95  2589
#> 5 BUF    8.95  4681
#> 6 CLE    9.18  4573
#> # … with 90 more rows

Once you’re done with the expensive computation, you can bring the results back to the current session with collect(). Learn more in vignette("multidplyr").

results <- collect(mean_delay)
head(results)
#> # A tibble: 6 x 3
#>   dest  delay     n
#>   <chr> <dbl> <int>
#> 1 ABQ    4.38   254
#> 2 AUS    6.02  2439
#> 3 BQN    8.25   896
#> 4 BTV    8.95  2589
#> 5 BUF    8.95  4681
#> 6 CLE    9.18  4573

multidplyr is a good fit for problems where the bottleneck is complex, non-dplyr computation (e.g. fitting models). There’s some overhead in initially partitioning the data and in transferring commands to each worker, so it’s not a magic bullet, but it is very easy to use.
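For example, here’s a minimal sketch of that kind of workload, fitting a separate linear model for each destination and collecting the fitted models back afterwards (the model formula is purely illustrative):

delay_models <- flight_dest %>% 
  summarise(
    # one lm() per destination, fitted on the worker that holds that group
    mod = list(lm(arr_delay ~ dep_delay, data = cur_data()))
  ) %>% 
  collect()

Because each destination lives entirely on one shard, the expensive lm() calls run in parallel across the workers.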

multidplyr is still quite young, so please try it out and let us know about any problems that you encounter.

dtplyr 1.1.0

dtplyr translates dplyr pipelines into equivalent data.table code. data.table is incredibly fast, so this often yields performance improvements.

To use it, start by creating a lazy_dt() object, which records your dplyr actions:

library(dtplyr)
dt <- lazy_dt(mtcars)

cyl_summary <- dt %>% 
  group_by(cyl) %>% 
  summarise(across(disp:wt, mean))

You can see the translation with show_query() or execute the data.table code by converting back to a data frame, data.table, or tibble:

cyl_summary %>% show_query()
#> `_DT1`[, .(disp = mean(disp), hp = mean(hp), drat = mean(drat), 
#>     wt = mean(wt)), keyby = .(cyl)]

cyl_summary %>% as_tibble()
#> # A tibble: 3 x 5
#>     cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  105.  82.6  4.07  2.29
#> 2     6  183. 122.   3.59  3.12
#> 3     8  353. 209.   3.23  4.00

The big news in this release is that dtplyr can now translate all of the features that arrived in dplyr 1.0.0, including across(), relocate(), rename_with(), and the new slice_min()/slice_max() helpers.

Thanks to Mark Fairbanks, dtplyr has also gained its first translation of a tidyr function: pivot_wider(), which is translated to dcast(). You can expect more tidyr translations in the next release 😄.
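Here’s a quick sketch of the new translation in action, using tidyr’s fish_encounters dataset (any long-format data would do):

library(tidyr)

fish <- lazy_dt(tidyr::fish_encounters)
fish %>% 
  pivot_wider(names_from = station, values_from = seen) %>% 
  show_query()

As with the dplyr verbs, show_query() reveals the underlying dcast() call without executing anything.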

I also took this as an opportunity to thoroughly refresh the documentation so that all translated verbs now have their own help pages that briefly describe how the translation works. You can read about the other minor improvements and bug fixes in the release notes.

dbplyr 2.1.0

dbplyr translates dplyr pipelines to their SQL equivalents. If you’re new to using dplyr and SQL together, I highly recommend Irene Steves’ rstudio::global() talk, “The dynamic duo: SQL and R”. It discusses why you might want to use dbplyr to generate SQL, and why you should still learn SQL.

The biggest change in this release is the addition of translations for many tidyr verbs, like pivot_longer(), pivot_wider(), complete(), and replace_na(). These were contributed by Maximilian Girlich, and in recognition of his sustained and substantial contributions to the package, he has been added as a package author.
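As a small illustration, here’s a sketch of one of the simpler translations, replace_na(), using memdb_frame(), which creates a temporary table in an in-memory SQLite database (this requires the RSQLite package):

library(dbplyr)

db <- memdb_frame(x = c(1, NA, 3))
db %>% 
  tidyr::replace_na(list(x = 0)) %>% 
  show_query()

The generated SQL should handle the missing values with a COALESCE() call.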

This release also includes major improvements to the across() translation, including translation of formulas. (As with dtplyr, across() can’t currently use where(), because I don’t know of a way to determine the column types without executing the query.) There are also a bunch of other minor translation improvements and bug fixes, which you can read about in the release notes.
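To see the formula translation in action, here’s a sketch using lazy_frame() with a simulated connection, which lets you inspect the generated SQL without connecting to a real database (the columns here are made up):

lf <- lazy_frame(g = 1, x = 1, y = 2, con = simulate_postgres())
lf %>% 
  group_by(g) %>% 
  summarise(across(x:y, ~ mean(.x, na.rm = TRUE))) %>% 
  show_query()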

Acknowledgements

A big thanks to all of the contributors who helped make these releases possible: