One of my favourite things about dplyr is that it decouples describing the data manipulation you want from its actual computation. This makes it possible to take basically the same dplyr code and execute it in radically different ways by using different backends. This blog post covers a passel of updates to the dplyr backends that we maintain:
multidplyr, which spreads computation over multiple cores, is now on CRAN!
dtplyr, which translates your dplyr code to data.table, can now translate all of the features that arrived in dplyr 1.0.0.
dbplyr, which translates your dplyr code to SQL, now also translates many tidyr verbs.
You can install these packages in one fell swoop with:
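```r
# Install the released versions from CRAN
install.packages(c("multidplyr", "dtplyr", "dbplyr"))
```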
I’ll explain these changes in more detail below. But to get started, we need to load dplyr.
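If you're following along, that's just:

```r
library(dplyr)
```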
multidplyr creates multiple R processes and spreads your data out across them, providing a simple way to take advantage of multiple cores. To use it, start by creating a cluster of R processes and load dplyr on them:
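A minimal sketch of that setup (four workers is an assumption here, chosen to match the four shards shown below):

```r
library(multidplyr)

# Create a cluster of four R processes...
cluster <- new_cluster(4)
# ...and load dplyr on each of them
cluster_library(cluster, "dplyr")
```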
Then spread data across those processes using `partition()`:
```r
flight_dest <- nycflights13::flights %>% 
  group_by(dest) %>% 
  partition(cluster)
flight_dest
#> Source: party_df [336,776 x 19]
#> Groups: dest
#> Shards: 4 [81,594--86,548 rows]
#> 
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      544            545        -1     1004           1022
#> 2  2013     1     1      558            600        -2      923            937
#> 3  2013     1     1      559            600        -1      854            902
#> 4  2013     1     1      602            610        -8      812            820
#> 5  2013     1     1      602            605        -3      821            805
#> 6  2013     1     1      611            600        11      945            931
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
```
The data is now spread across four “shards”, each consisting of roughly 80,000 rows. Generally, you’ll want to group the data before partitioning, which ensures that all observations in a group end up on the same worker.
You can work with this `party_df` as if it were a data frame, but any work will be spread across all the processes (which your operating system will usually allocate to different cores).
```r
mean_delay <- flight_dest %>% 
  summarise(delay = mean(arr_delay, na.rm = TRUE), n = n()) %>% 
  filter(n > 25)
mean_delay
#> Source: party_df [96 x 3]
#> Shards: 4 [21--26 rows]
#> 
#>   dest  delay     n
#>   <chr> <dbl> <int>
#> 1 ABQ    4.38   254
#> 2 AUS    6.02  2439
#> 3 BQN    8.25   896
#> 4 BTV    8.95  2589
#> 5 BUF    8.95  4681
#> 6 CLE    9.18  4573
#> # … with 90 more rows
```
multidplyr is a good fit for problems where the bottleneck is complex, non-dplyr computation (e.g. fitting models). There’s some overhead in partitioning the data and transferring commands to each worker, so it’s not a magic bullet, but it is very easy to use.
multidplyr is still quite young, so please try it out and let us know about any problems that you encounter.
To use dtplyr, which translates your dplyr code to data.table, start by creating a `lazy_dt()` object that records your dplyr actions:
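```r
library(dtplyr)

# Reconstructed example: mtcars and across(disp:wt, mean) are
# inferred from the translation and output shown below
dt <- lazy_dt(mtcars)

cyl_summary <- dt %>% 
  group_by(cyl) %>% 
  summarise(across(disp:wt, mean))
```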
You can see the translation with `show_query()`, or execute the data.table code by converting back to a data frame, data table, or tibble:
```r
cyl_summary %>% show_query()
#> `_DT1`[, .(disp = mean(disp), hp = mean(hp), drat = mean(drat), 
#>     wt = mean(wt)), keyby = .(cyl)]

cyl_summary %>% as_tibble()
#> # A tibble: 3 x 5
#>     cyl  disp    hp  drat    wt
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  105.  82.6  4.07  2.29
#> 2     6  183. 122.   3.59  3.12
#> 3     8  353. 209.   3.23  4.00
```
The big news in this release is that dtplyr can now translate all of the features that arrived in dplyr 1.0.0. This includes:
`where()` is not currently supported because I don’t know how to figure out the column types without executing the pipeline (which might take a long time).
relocate(), which is translated to the
```r
dt %>% group_by(cyl) %>% slice_sample(n = 5) %>% show_query()
#> `_DT1`[`_DT1`[, .I[sample.int(.N, min(5L, .N))], by = .(cyl)]$V1]

dt %>% group_by(cyl) %>% slice_head(n = 1) %>% show_query()
#> `_DT1`[, head(.SD, 1L), keyby = .(cyl)]

dt %>% group_by(cyl) %>% slice_min(mpg, n = 1) %>% show_query()
#> `_DT1`[, .SD[order(mpg)][frankv(mpg, ties.method = "min", na.last = "keep") <= 
#>     1L], keyby = .(cyl)]
```
Thanks to Mark Fairbanks, dtplyr has also gained its first translation of a tidyr function: `pivot_wider()`, which is translated to `dcast()`. You can expect more tidyr translations in the next release 😄.
I also took this as an opportunity to thoroughly refresh the documentation so that all translated verbs now have their own help pages that briefly describe how the translation works. You can read about the other minor improvements and bug fixes in the release notes.
dbplyr translates dplyr pipelines to their SQL equivalents. If you’re new to using dplyr and SQL together, I highly recommend Irene Steves’ rstudio::global() talk, “The dynamic duo: SQL and R”. It discusses why you might want to use dbplyr to generate SQL, and why you should still learn SQL.
The biggest change in this release is the addition of translations for many tidyr verbs, like `replace_na()`. These were contributed by Maximilian Girlich, and in recognition of his sustained and substantial contributions to the package, he has been added as a package author.
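As a rough sketch of what these translations look like (the `memdb_frame()` helper, which creates a temporary in-memory SQLite table, is my choice here, not something from the post):

```r
library(dplyr, warn.conflicts = FALSE)
library(dbplyr)

# A lazy table backed by an in-memory SQLite database
df <- memdb_frame(x = c(1, NA, 3))

# replace_na() is now translated to SQL instead of erroring;
# show_query() prints the generated SQL without touching the data
df %>% 
  tidyr::replace_na(list(x = 0)) %>% 
  show_query()
```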
This release also includes major improvements to the `across()` translation, including translation of formulas. (Like dtplyr, `across()` can’t currently use `where()`, because I don’t know of a way to figure out the column types without executing the query.) There are also a bunch of other minor translation improvements and bug fixes, which you can read about in the release notes.
A big thanks to all of the contributors who helped make these releases possible:
dbplyr: @abalter, @Alternikaner, @andrew-schulman, @andyquinterom, @awong234, @ben1787, @bersbersbers, @bwcastillo, @chris-billingham, @coponhub, @DavidPatShuiFong, @dcaud, @dpprdan, @dstoeckel, @elicit-bergmann, @hadley, @hdplsa, @iangow, @Janlow, @LukasTang, @McWraith, @mfherman, @mgirlich, @mr-c, @mszefler, @N1h1l1sT, @nathaneastwood, @nlneas1, @okhoma, @pachamaltese, @peterdutey, @pgramme, @robchallen, @shearer, @sheepworrier, @shosaco, @spirosparaskevasFBB, @tonyk7440, @TuomoNieminen, @yitao-li, and @yiugn
dtplyr: @AdrienMtgn, @batpigandme, @boerjames, @cassiel74, @dan-reznik, @ds-jim, @edavidaja, @edgararuiz-zz, @engineerchange, @fkgruber, @gmonaie, @hadley, @hope-data-science, @jasonopolis, @jimhester, @JohnMount, @larspijnappel, @lbenz-mdsol, @markfairbanks, @MichaelChirico, @Mitschka, @myoung3, @nigeljmckernan, @PMassicotte, @pnacht, @psanker, @rossellhayes, @RudolfCardinal, @sbashevkin, @ShixiangWang, @skiamu, @smingerson, @sonoshah, @tingjhenjiang, @tylerferguson, @TysonStanley, @yiugn, and @ykaeber.
multidplyr: @12tafran, @adviksh, @ahoho, @baldeagle, @borisveytsman, @brianmsm, @ChiWPak, @cluelessgumshoe, @CorradoLanera, @cscheid, @cwaldock1, @damiaan, @david-awam-jansen, @dewoller, @donaldRwilliams, @dzhang32, @eliferden, @FvD, @GegznaV, @germanium, @ghost, @guokai8, @hadley, @huisaddison, @iago-pssjd, @impactanalysts, @isaac-florence, @javadba, @jiho, @JosiahParry, @julou, @kartiksubbarao, @kyp0717, @lucazav, @MarioClueless, @Maschette, @McChickenNuggets, @miho87, @njudd, @philiporlando, @picarus, @samkhan1, @SGMStalin, @stanstrup, @taqtiqa-mark, @tmstauss, @tsengj, @wibeasley, @willtudorevans, and @zhengjiji456.