We’re well chuffed to announce the release of duckplyr 1.1.0. This is a dplyr backend powered by DuckDB, a fast in-process analytical database system1. duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can’t. You can install it from CRAN with:
install.packages("duckplyr")
This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources.
A drop-in replacement for dplyr
Imagine you have to wrangle a huge dataset, like this one from the TPC-H benchmark, a famous database benchmarking dataset.
lineitem_tbl <- duckdb:::sql("INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;")
lineitem_tbl <- tibble::as_tibble(lineitem_tbl)
dplyr::glimpse(lineitem_tbl)
#> Rows: 6,001,215
#> Columns: 16
#> $ l_orderkey <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 6, …
#> $ l_partkey <dbl> 155190, 67310, 63700, 2132, 24027, 15635, 106170, 4297…
#> $ l_suppkey <dbl> 7706, 7311, 3701, 4633, 1534, 638, 1191, 1798, 6540, 3…
#> $ l_linenumber <dbl> 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 1, …
#> $ l_quantity <dbl> 17, 36, 8, 28, 24, 32, 38, 45, 49, 27, 2, 28, 26, 30, …
#> $ l_extendedprice <dbl> 21168.23, 45983.16, 13309.60, 28955.64, 22824.48, 4962…
#> $ l_discount <dbl> 0.04, 0.09, 0.10, 0.09, 0.10, 0.07, 0.00, 0.06, 0.10, …
#> $ l_tax <dbl> 0.02, 0.06, 0.02, 0.06, 0.04, 0.02, 0.05, 0.00, 0.00, …
#> $ l_returnflag <chr> "N", "N", "N", "N", "N", "N", "N", "R", "R", "A", "A",…
#> $ l_linestatus <chr> "O", "O", "O", "O", "O", "O", "O", "F", "F", "F", "F",…
#> $ l_shipdate <date> 1996-03-13, 1996-04-12, 1996-01-29, 1996-04-21, 1996-…
#> $ l_commitdate <date> 1996-02-12, 1996-02-28, 1996-03-05, 1996-03-30, 1996-…
#> $ l_receiptdate <date> 1996-03-22, 1996-04-20, 1996-01-31, 1996-05-16, 1996-…
#> $ l_shipinstruct <chr> "DELIVER IN PERSON", "TAKE BACK RETURN", "TAKE BACK RE…
#> $ l_shipmode <chr> "TRUCK", "MAIL", "REG AIR", "AIR", "FOB", "MAIL", "RAI…
#> $ l_comment <chr> "to beans x-ray carefull", " according to the final fo…
To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr:
library(duckplyr)
#> Loading required package: dplyr
#> The duckplyr package is configured to fall back to dplyr when it encounters an incompatibility.
#> Fallback events can be collected and uploaded for analysis to guide future development. By
#> default, data will be collected but no data will be uploaded.
#> ℹ Automatic fallback uploading is not controlled and therefore disabled, see
#> `?duckplyr::fallback()`.
#> ✔ Number of reports ready for upload: 4.
#> → Review with `duckplyr::fallback_review()`, upload with `duckplyr::fallback_upload()`.
#> ℹ Configure automatic uploading with `duckplyr::fallback_config()`.
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
Now we can express the well-known (at least in the database community!) “TPC-H benchmark query 1” in dplyr syntax and execute it in DuckDB via duckplyr.
tpch_dplyr <- function(lineitem) {
lineitem |>
filter(l_shipdate <= !!as.Date("1998-09-02")) |>
summarise(
sum_qty = sum(l_quantity),
sum_base_price = sum(l_extendedprice),
sum_disc_price = sum(l_extendedprice * (1 - l_discount)),
sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
avg_qty = mean(l_quantity),
avg_price = mean(l_extendedprice),
avg_disc = mean(l_discount),
count_order = n(),
.by = c(l_returnflag, l_linestatus)
) |>
arrange(l_returnflag, l_linestatus)
}
tpch_dplyr(lineitem_tbl)
#> # A tibble: 4 × 10
#> l_returnflag l_linestatus sum_qty sum_base_price sum_disc_price sum_charge
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A F 37734107 56586554401. 53758257135. 55909065223.
#> 2 N F 991417 1487504710. 1413082168. 1469649223.
#> 3 N O 74476040 111701729698. 106118230308. 110367043872.
#> 4 R F 37719753 56568041381. 53741292685. 55889619120.
#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
#> # count_order <int>
Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax. Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. Not only is the syntax the same, the semantics are too! If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. Over time, we expect fewer and fewer fallbacks to dplyr to be needed.
How to use duckplyr
There are two ways to use duckplyr:
- As above, you can call library(duckplyr) to replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends.
- Create individual “duck frames” using conversion functions like duckplyr::duckdb_tibble() or duckplyr::as_duckdb_tibble(), or ingestion functions like duckplyr::read_csv_duckdb().
Here’s an example of the second form:
out <- lineitem_tbl |>
duckplyr::as_duckdb_tibble() |>
tpch_dplyr()
out
#> # A duckplyr data frame: 10 variables
#> l_returnflag l_linestatus sum_qty sum_base_price sum_disc_price sum_charge
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A F 37734107 56586554401. 53758257135. 55909065223.
#> 2 N F 991417 1487504710. 1413082168. 1469649223.
#> 3 N O 74476040 111701729698. 106118230308. 110367043872.
#> 4 R F 37719753 56568041381. 53741292685. 55889619120.
#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
#> # count_order <int>
Note that the resulting object is indistinguishable from a regular tibble, except for the additional class.
typeof(out)
#> [1] "list"
class(out)
#> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
out$count_order
#> [1] 1478493 38854 2920374 1478870
Operations not yet supported by duckplyr are automatically outsourced to dplyr. For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism. By default, the fallback is silent, but you can make it visible by setting an environment variable. This is useful if you want to better understand what’s making your code slow.
Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
lineitem_tbl |>
duckplyr::as_duckdb_tibble() |>
filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus))
#> Cannot process duckplyr query with DuckDB, falling back to dplyr.
#> ℹ `filter(.by = ...)` not implemented, try `mutate(.by = ...)` followed by a simple `filter()`.
#> # A duckplyr data frame: 16 variables
#> l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 37531 35 3 50 73426.
#> 2 131 44255 9264 2 50 59962.
#> 3 199 132072 9612 1 50 55204.
#> 4 231 198124 644 3 50 61106
#> 5 260 155887 5888 1 50 97144
#> 6 263 142891 434 3 50 96694.
#> 7 323 163628 1177 1 50 84581
#> 8 354 58125 8126 3 50 54156
#> 9 484 183351 5870 3 50 71718.
#> 10 485 149523 9524 1 50 78626
#> # ℹ more rows
#> # ℹ 10 more variables: l_discount <dbl>, l_tax <dbl>, l_returnflag <chr>,
#> # l_linestatus <chr>, l_shipdate <date>, l_commitdate <date>,
#> # l_receiptdate <date>, l_shipinstruct <chr>, l_shipmode <chr>,
#> # l_comment <chr>
You can also use DuckDB functions directly with the dd$ qualifier. Functions with this prefix are not translated at all and are passed through directly to DuckDB. For example, the following code uses DuckDB’s internal implementation of the Levenshtein distance:
tibble(a = "dbplyr", b = "duckplyr") %>%
mutate(c = dd$levenshtein(a, b))
#> # A tibble: 1 × 3
#> a b c
#> <chr> <chr> <dbl>
#> 1 dbplyr duckplyr 3
See vignette("duckdb") for more information on these features.
If you’re working with dbplyr too, you can use as_tbl() to convert a duckplyr tibble to a dbplyr lazy table. This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality. With as_duckdb_tibble(), you can convert a dbplyr lazy table back to a duckplyr tibble. Both operations work without intermediate materialization.
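As a rough sketch of this round trip (assuming both functions can be called with just the object to convert; the dbplyr step in the middle stands in for your existing code):

duck_frame <- duckplyr::as_duckdb_tibble(lineitem_tbl)
# Hand over to dbplyr: a lazy table that existing dbplyr code or inline SQL can work with
lazy_tbl <- duckplyr::as_tbl(duck_frame)
# ... existing dbplyr pipeline here ...
# And back to a duck frame, without materializing the intermediate result
duck_frame_again <- duckplyr::as_duckdb_tibble(lazy_tbl)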
Benchmark
duckplyr is often much faster than dplyr. The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not.
We use tpch_dplyr() as defined above to run the query with dplyr. The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. The collect() at the end is required only for this benchmark to ensure fairness.2
tpch_duckplyr <- function(lineitem) {
lineitem |>
duckplyr::as_duckdb_tibble() |>
tpch_dplyr() |>
collect()
}
And now we compare the two:
bench::mark(
tpch_dplyr(lineitem_tbl),
tpch_duckplyr(lineitem_tbl),
check = ~ all.equal(.x, .y, tolerance = 1e-10)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tpch_dplyr(lineitem_tbl) 611.6ms 611.6ms 1.64 1.25GB 1.64
#> 2 tpch_duckplyr(lineitem_tbl) 71.4ms 72.3ms 13.8 314.38KB 0
In this example, duckplyr is a lot faster than dplyr. It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to bench::mark().
Out-of-memory data
As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory. In this case, you want:
- To work with data stored in modern formats designed for large data (e.g. Parquet).
- To be able to store large intermediate results on disk, keeping them out of memory.
- Fast computation!
duckplyr provides each of these features:
- You can read data from disk with functions like read_parquet_duckdb().
- You can save intermediate results to disk with compute_parquet() and compute_csv().
- duckplyr takes advantage of DuckDB’s query planner, which considers your entire pipeline holistically to figure out the most efficient way to get the data you need.
See vignette("large") for a walkthrough and more details.
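As a rough illustration of such a workflow, the sketch below reads a Parquet file lazily, stores an intermediate result on disk, and continues the pipeline from there. The file paths and the orders columns are hypothetical, and the exact arguments of these functions may differ.

# Hypothetical paths and columns, shown only to illustrate the workflow
orders <- duckplyr::read_parquet_duckdb("orders.parquet")

large_orders <- orders |>
  filter(o_totalprice > 100000) |>
  # Persist the intermediate result to Parquet instead of keeping it in memory
  duckplyr::compute_parquet("large_orders.parquet")

# Continue the analysis from the on-disk intermediate result
large_orders |>
  count(o_orderpriority)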
Help us improve duckplyr!
Our goals for future development of duckplyr include:
- Enabling users to provide custom translations of dplyr functionality;
- Making it easier to contribute code to duckplyr;
- Supporting more dplyr and tidyr functionality natively in DuckDB.
You can help!
- Please report any issues, especially regarding unknown incompatibilities. See vignette("limits").
- Contribute to the codebase after reading duckplyr’s contributing guide.
- Turn on telemetry to help us hear about the most frequent fallbacks, so we can prioritize working on the corresponding missing dplyr translations. See vignette("telemetry") and duckplyr::fallback_sitrep().
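For example, reviewing and sharing the collected fallback reports might look like this, using the functions named in the startup message above (a sketch, not a prescribed workflow):

# Situation report on fallback logging for this machine
duckplyr::fallback_sitrep()
# Inspect the reports that are ready for upload, then share them
duckplyr::fallback_review()
duckplyr::fallback_upload()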
Additional resources
Eager to learn more about duckplyr, besides trying it out yourself? The duckplyr website features several articles. Furthermore, the blog post “duckplyr: dplyr Powered by DuckDB” by Hannes Mühleisen provides some context on duckplyr, including its inner workings, as does a section of the R-hub blog post “Lazy introduction to laziness in R” by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick.
Acknowledgements
A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse duckdb!
@adamschwing, @alejandrohagan, @andreranza, @apalacio9502, @apsteinmetz, @barracuda156, @beniaminogreen, @bob-rietveld, @brichards920, @cboettig, @davidjayjackson, @DavisVaughan, @Ed2uiz, @eitsupi, @era127, @etiennebacher, @eutwt, @fmichonneau, @hadley, @hannes, @hawkfish, @IndrajeetPatil, @JanSulavik, @JavOrraca, @jeroen, @jhk0530, @joakimlinde, @JosiahParry, @kevbaer, @larry77, @lnkuiper, @lorenzwalthert, @lschneiderbauer, @luisDVA, @math-mcshane, @meersel, @multimeric, @mytarmail, @nicki-dese, @PMassicotte, @prasundutta87, @rafapereirabr, @Robinlovelace, @romainfrancois, @sparrow925, @stefanlinner, @szarnyasg, @thomasp85, @TimTaylor, @Tmonster, @toppyy, @wibeasley, @yjunechoe, @ywhcuhk, @zhjx19, @ablack3, @actuarial-lonewolf, @ajdamico, @amirmazmi, @anderson461123, @andrewGhazi, @Antonov548, @appiehappie999, @ArthurAndrews, @arthurgailes, @babaknaimi, @bcaradima, @bdforbes, @bergest, @bill-ash, @BorgeJorge, @brianmsm, @chainsawriot, @ckarnes, @clementlefevre, @cregouby, @cy-james-lee, @daranzolin, @david-cortes, @DavZim, @denis-or, @developertest1234, @dicorynia, @dsolito, @e-kotov, @EAVWing, @eddelbuettel, @edward-burn, @elefeint, @eli-daniels, @elysabethpc, @erikvona, @florisvdh, @gaborcsardi, @ggrothendieck, @hdmm3, @hope-data-science, @IoannaNika, @jabrown-aepenergy, @JamesLMacAulay, @jangorecki, @javierlenzi, @Joe-Heffer-Shef, @kalibera, @lboller-pwbm, @lgaborini, @m-muecke, @meztez, @mgirlich, @mtmorgan, @nassuphis, @nbc, @olivroy, @pdet, @phdjsep, @pierre-lamarche, @r2evans, @ran-codes, @rplsmn, @Saarialho, @SimonCoulombe, @tau31, @thohan88, @ThomasSoeiro, @timothygmitchell, @vincentarelbundock, @VincentGuyader, @wlangera, @xbasics, @xiaodaigh, @xtimbeau, @yng-me, @Yousuf28, @yutannihilation, and @zcatav
Special thanks to Joe Thorley (@joethorley) for help with choosing the right words.
1. If you haven’t heard of it yet, watch Hannes Mühleisen’s keynote at posit::conf(2024). ↩︎
2. If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See vignette("prudence") for details. ↩︎