We’re well chuffed to announce the release of duckplyr 1.1.0. This is a dplyr backend powered by DuckDB, a fast in-process analytical database system1. duckplyr uses the power of DuckDB for impressive performance where it can, and seamlessly falls back to R where it can’t. You can install it from CRAN with:
install.packages("duckplyr")
This article shows how duckplyr can be used instead of dplyr, explains how you can help improve the package, and shares a selection of further resources.
A drop-in replacement for dplyr
Imagine you have to wrangle a huge dataset, like this one from the TPC-H benchmark, a famous database benchmarking dataset.
lineitem_tbl <- duckdb:::sql("INSTALL tpch; LOAD tpch; CALL dbgen(sf=1); FROM lineitem;")
lineitem_tbl <- tibble::as_tibble(lineitem_tbl)
dplyr::glimpse(lineitem_tbl)
#> Rows: 6,001,215
#> Columns: 16
#> $ l_orderkey <dbl> 1, 1, 1, 1, 1, 1, 2, 3, 3, 3, 3, 3, 3, 4, 5, 5, 5, 6, …
#> $ l_partkey <dbl> 155190, 67310, 63700, 2132, 24027, 15635, 106170, 4297…
#> $ l_suppkey <dbl> 7706, 7311, 3701, 4633, 1534, 638, 1191, 1798, 6540, 3…
#> $ l_linenumber <dbl> 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 4, 5, 6, 1, 1, 2, 3, 1, …
#> $ l_quantity <dbl> 17, 36, 8, 28, 24, 32, 38, 45, 49, 27, 2, 28, 26, 30, …
#> $ l_extendedprice <dbl> 21168.23, 45983.16, 13309.60, 28955.64, 22824.48, 4962…
#> $ l_discount <dbl> 0.04, 0.09, 0.10, 0.09, 0.10, 0.07, 0.00, 0.06, 0.10, …
#> $ l_tax <dbl> 0.02, 0.06, 0.02, 0.06, 0.04, 0.02, 0.05, 0.00, 0.00, …
#> $ l_returnflag <chr> "N", "N", "N", "N", "N", "N", "N", "R", "R", "A", "A",…
#> $ l_linestatus <chr> "O", "O", "O", "O", "O", "O", "O", "F", "F", "F", "F",…
#> $ l_shipdate <date> 1996-03-13, 1996-04-12, 1996-01-29, 1996-04-21, 1996-…
#> $ l_commitdate <date> 1996-02-12, 1996-02-28, 1996-03-05, 1996-03-30, 1996-…
#> $ l_receiptdate <date> 1996-03-22, 1996-04-20, 1996-01-31, 1996-05-16, 1996-…
#> $ l_shipinstruct <chr> "DELIVER IN PERSON", "TAKE BACK RETURN", "TAKE BACK RE…
#> $ l_shipmode <chr> "TRUCK", "MAIL", "REG AIR", "AIR", "FOB", "MAIL", "RAI…
#> $ l_comment <chr> "to beans x-ray carefull", " according to the final fo…
To work with this in duckplyr instead of dplyr, all you need to do is load duckplyr:
library(duckplyr)
#> Loading required package: dplyr
#> The duckplyr package is configured to fall back to dplyr when it encounters an incompatibility.
#> Fallback events can be collected and uploaded for analysis to guide future development. By
#> default, data will be collected but no data will be uploaded.
#> ℹ Automatic fallback uploading is not controlled and therefore disabled, see
#> `?duckplyr::fallback()`.
#> ✔ Number of reports ready for upload: 4.
#> → Review with `duckplyr::fallback_review()`, upload with `duckplyr::fallback_upload()`.
#> ℹ Configure automatic uploading with `duckplyr::fallback_config()`.
#> ✔ Overwriting dplyr methods with duckplyr methods.
#> ℹ Turn off with `duckplyr::methods_restore()`.
Now we can express the well-known (at least in the database community!) “TPC-H benchmark query 1” in dplyr syntax and execute it in DuckDB via duckplyr.
tpch_dplyr <- function(lineitem) {
lineitem |>
filter(l_shipdate <= !!as.Date("1998-09-02")) |>
summarise(
sum_qty = sum(l_quantity),
sum_base_price = sum(l_extendedprice),
sum_disc_price = sum(l_extendedprice * (1 - l_discount)),
sum_charge = sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)),
avg_qty = mean(l_quantity),
avg_price = mean(l_extendedprice),
avg_disc = mean(l_discount),
count_order = n(),
.by = c(l_returnflag, l_linestatus)
) |>
arrange(l_returnflag, l_linestatus)
}
tpch_dplyr(lineitem_tbl)
#> # A tibble: 4 × 10
#> l_returnflag l_linestatus sum_qty sum_base_price sum_disc_price sum_charge
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A F 37734107 56586554401. 53758257135. 55909065223.
#> 2 N F 991417 1487504710. 1413082168. 1469649223.
#> 3 N O 74476040 111701729698. 106118230308. 110367043872.
#> 4 R F 37719753 56568041381. 53741292685. 55889619120.
#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
#> # count_order <int>
Like other dplyr backends such as dtplyr and dbplyr, duckplyr gives you higher performance without learning a different syntax. Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies. Not only is the syntax the same, the semantics are too! If an operation cannot be carried out with DuckDB, it is automatically outsourced to dplyr. Over time, we expect fewer and fewer fallbacks to dplyr to be needed.
How to use duckplyr
There are two ways to use duckplyr:
- As above, you can call library(duckplyr) to replace all existing dplyr methods. This is safe because duckplyr is guaranteed to give exactly the same results as dplyr, unlike other backends.
- Create individual “duck frames” using conversion functions like duckplyr::duckdb_tibble() or duckplyr::as_duckdb_tibble(), or ingestion functions like duckplyr::read_csv_duckdb().
Here’s an example of the second form:
out <- lineitem_tbl |>
duckplyr::as_duckdb_tibble() |>
tpch_dplyr()
out
#> # A duckplyr data frame: 10 variables
#> l_returnflag l_linestatus sum_qty sum_base_price sum_disc_price sum_charge
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 A F 37734107 56586554401. 53758257135. 55909065223.
#> 2 N F 991417 1487504710. 1413082168. 1469649223.
#> 3 N O 74476040 111701729698. 106118230308. 110367043872.
#> 4 R F 37719753 56568041381. 53741292685. 55889619120.
#> # ℹ 4 more variables: avg_qty <dbl>, avg_price <dbl>, avg_disc <dbl>,
#> # count_order <int>
Note that the resulting object is indistinguishable from a regular tibble, except for the additional class.
typeof(out)
#> [1] "list"
class(out)
#> [1] "duckplyr_df" "tbl_df" "tbl" "data.frame"
out$count_order
#> [1] 1478493 38854 2920374 1478870
Operations not yet supported by duckplyr are automatically outsourced to dplyr. For instance, filtering on grouped data is not supported, but it still works thanks to the fallback mechanism. By default, the fallback is silent, but you can make it visible by setting an environment variable. This is useful if you want to better understand what’s making your code slow.
Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
lineitem_tbl |>
duckplyr::as_duckdb_tibble() |>
filter(l_quantity == max(l_quantity), .by = c(l_returnflag, l_linestatus))
#> Cannot process duckplyr query with DuckDB, falling back to dplyr.
#> ℹ `filter(.by = ...)` not implemented, try `mutate(.by = ...)` followed by a simple `filter()`.
#> # A duckplyr data frame: 16 variables
#> l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 5 37531 35 3 50 73426.
#> 2 131 44255 9264 2 50 59962.
#> 3 199 132072 9612 1 50 55204.
#> 4 231 198124 644 3 50 61106
#> 5 260 155887 5888 1 50 97144
#> 6 263 142891 434 3 50 96694.
#> 7 323 163628 1177 1 50 84581
#> 8 354 58125 8126 3 50 54156
#> 9 484 183351 5870 3 50 71718.
#> 10 485 149523 9524 1 50 78626
#> # ℹ more rows
#> # ℹ 10 more variables: l_discount <dbl>, l_tax <dbl>, l_returnflag <chr>,
#> # l_linestatus <chr>, l_shipdate <date>, l_commitdate <date>,
#> # l_receiptdate <date>, l_shipinstruct <chr>, l_shipmode <chr>,
#> # l_comment <chr>
You can also use DuckDB functions directly with the dd$ qualifier. Functions with this prefix are not translated at all and are passed through directly to DuckDB. For example, the following code uses DuckDB’s internal implementation of the Levenshtein distance:
tibble(a = "dbplyr", b = "duckplyr") %>%
mutate(c = dd$levenshtein(a, b))
#> # A tibble: 1 × 3
#> a b c
#> <chr> <chr> <dbl>
#> 1 dbplyr duckplyr 3
See vignette("duckdb") for more information on these features.
If you’re working with dbplyr too, you can use as_tbl() to convert a duckplyr tibble to a dbplyr lazy table. This allows you to seamlessly interact with existing code that might use inline SQL or other dbplyr functionality. With as_duckdb_tibble(), you can convert a dbplyr lazy table back to a duckplyr tibble. Both operations work without intermediate materialization.
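As a rough sketch of this round trip (assuming both functions can be called with just the object to convert; the dbplyr step in the middle stands in for your existing code):

duck_frame <- duckplyr::as_duckdb_tibble(lineitem_tbl)
# Hand over to dbplyr: a lazy table that existing dbplyr code or inline SQL can work with
lazy_tbl <- duckplyr::as_tbl(duck_frame)
# ... existing dbplyr pipeline here ...
# And back to a duck frame, without materializing the intermediate result
duck_frame_again <- duckplyr::as_duckdb_tibble(lazy_tbl)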
Benchmark
duckplyr is often much faster than dplyr. The comparison below is done in a fresh R session where dplyr is attached but duckplyr is not.
We use tpch_dplyr() as defined above to run the query with dplyr. The function that runs it with duckplyr only wraps the input data in a duck frame and forwards it to the dplyr function. The collect() at the end is required only for this benchmark to ensure fairness.2
tpch_duckplyr <- function(lineitem) {
lineitem |>
duckplyr::as_duckdb_tibble() |>
tpch_dplyr() |>
collect()
}
And now we compare the two:
bench::mark(
tpch_dplyr(lineitem_tbl),
tpch_duckplyr(lineitem_tbl),
check = ~ all.equal(.x, .y, tolerance = 1e-10)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
#> 1 tpch_dplyr(lineitem_tbl) 611.6ms 611.6ms 1.64 1.25GB 1.64
#> 2 tpch_duckplyr(lineitem_tbl) 71.4ms 72.3ms 13.8 314.38KB 0
In this example, duckplyr is a lot faster than dplyr. It also appears to use much less memory, but this is misleading: DuckDB manages the memory, not R, so the memory usage is not visible to bench::mark().
Out-of-memory data
As well as improved speed with in-memory datasets, duckplyr makes it easy to work with datasets that are too big to fit in memory. In this case, you want:
- To work with data stored in modern formats designed for large data (e.g. Parquet).
- To be able to store large intermediate results on disk, keeping them out of memory.
- Fast computation!
duckplyr provides each of these features:
- You can read data from disk with functions like read_parquet_duckdb().
- You can save intermediate results to disk with compute_parquet() and compute_csv().
- duckplyr takes advantage of DuckDB’s query planner, which considers your entire pipeline holistically to figure out the most efficient way to get the data you need.
See vignette("large") for a walkthrough and more details.
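As a rough illustration of such a workflow, the sketch below reads a Parquet file lazily, stores an intermediate result on disk, and continues the pipeline from there. The file paths and the orders columns are hypothetical, and the exact arguments of these functions may differ.

# Hypothetical paths and columns, shown only to illustrate the workflow
orders <- duckplyr::read_parquet_duckdb("orders.parquet")

large_orders <- orders |>
  filter(o_totalprice > 100000) |>
  # Persist the intermediate result to Parquet instead of keeping it in memory
  duckplyr::compute_parquet("large_orders.parquet")

# Continue the analysis from the on-disk intermediate result
large_orders |>
  count(o_orderpriority)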
Help us improve duckplyr!
Our goals for future development of duckplyr include:
- Enabling users to provide custom translations of dplyr functionality;
- Making it easier to contribute code to duckplyr;
- Supporting more dplyr and tidyr functionality natively in DuckDB.
You can help!
- Please report any issues, especially regarding unknown incompatibilities. See vignette("limits").
- Contribute to the codebase after reading duckplyr’s contributing guide.
- Turn on telemetry to help us hear about the most frequent fallbacks, so we can prioritize working on the corresponding missing dplyr translations. See vignette("telemetry") and duckplyr::fallback_sitrep().
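For example, reviewing and sharing the collected fallback reports might look like this, using the functions named in the startup message above (a sketch, not a prescribed workflow):

# Situation report on fallback logging for this machine
duckplyr::fallback_sitrep()
# Inspect the reports that are ready for upload, then share them
duckplyr::fallback_review()
duckplyr::fallback_upload()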
Additional resources
Eager to learn more about duckplyr, besides trying it out yourself? The duckplyr website features several articles. Furthermore, the blog post “duckplyr: dplyr Powered by DuckDB” by Hannes Mühleisen provides some context on duckplyr, including its inner workings, as does a section of the R-hub blog post “Lazy introduction to laziness in R” by Maëlle Salmon, Athanasia Mo Mowinckel and Hannah Frick.
Acknowledgements
A big thanks to all folks who filed issues, created PRs and generally helped to improve duckplyr and its workhorse duckdb!
@adamschwing, @alejandrohagan, @andreranza, @apalacio9502, @apsteinmetz, @barracuda156, @beniaminogreen, @bob-rietveld, @brichards920, @cboettig, @davidjayjackson, @DavisVaughan, @Ed2uiz, @eitsupi, @era127, @etiennebacher, @eutwt, @fmichonneau, @hadley, @hannes, @hawkfish, @IndrajeetPatil, @JanSulavik, @JavOrraca, @jeroen, @jhk0530, @joakimlinde, @JosiahParry, @kevbaer, @larry77, @lnkuiper, @lorenzwalthert, @lschneiderbauer, @luisDVA, @math-mcshane, @meersel, @multimeric, @mytarmail, @nicki-dese, @PMassicotte, @prasundutta87, @rafapereirabr, @Robinlovelace, @romainfrancois, @sparrow925, @stefanlinner, @szarnyasg, @thomasp85, @TimTaylor, @Tmonster, @toppyy, @wibeasley, @yjunechoe, @ywhcuhk, @zhjx19, @ablack3, @actuarial-lonewolf, @ajdamico, @amirmazmi, @anderson461123, @andrewGhazi, @Antonov548, @appiehappie999, @ArthurAndrews, @arthurgailes, @babaknaimi, @bcaradima, @bdforbes, @bergest, @bill-ash, @BorgeJorge, @brianmsm, @chainsawriot, @ckarnes, @clementlefevre, @cregouby, @cy-james-lee, @daranzolin, @david-cortes, @DavZim, @denis-or, @developertest1234, @dicorynia, @dsolito, @e-kotov, @EAVWing, @eddelbuettel, @edward-burn, @elefeint, @eli-daniels, @elysabethpc, @erikvona, @florisvdh, @gaborcsardi, @ggrothendieck, @hdmm3, @hope-data-science, @IoannaNika, @jabrown-aepenergy, @JamesLMacAulay, @jangorecki, @javierlenzi, @Joe-Heffer-Shef, @kalibera, @lboller-pwbm, @lgaborini, @m-muecke, @meztez, @mgirlich, @mtmorgan, @nassuphis, @nbc, @olivroy, @pdet, @phdjsep, @pierre-lamarche, @r2evans, @ran-codes, @rplsmn, @Saarialho, @SimonCoulombe, @tau31, @thohan88, @ThomasSoeiro, @timothygmitchell, @vincentarelbundock, @VincentGuyader, @wlangera, @xbasics, @xiaodaigh, @xtimbeau, @yng-me, @Yousuf28, @yutannihilation, and @zcatav
Special thanks to Joe Thorley (@joethorley) for help with choosing the right words.
1. If you haven’t heard of it yet, watch Hannes Mühleisen’s keynote at posit::conf(2024). ↩︎
2. If omitted, the results would be unchanged but the measurements would be wrong. The computation would then be triggered by the check. See vignette("prudence") for details. ↩︎