dtplyr 1.3.0

  dplyr, dtplyr

  Hadley Wickham

We’re thrilled to announce the release of dtplyr 1.3.0. dtplyr gives you the speed of data.table with the syntax of dplyr; you write dplyr (and tidyr) code and dtplyr translates it to the data.table equivalent.

You can install it from CRAN with:

install.packages("dtplyr")

This blog post will give you an overview of the changes in this version: dtplyr no longer adds translations directly to data.tables, it includes some dplyr 1.1.0 updates, and we have made some performance improvements. As always, you can see a full list of changes in the release notes.

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

Breaking changes

In previous versions, dtplyr registered translations that kicked in whenever you used a data.table. This caused problems: merely loading dtplyr could make otherwise working code fail, because dplyr and tidyr functions would now return lazy_dt objects instead of data.table objects. To avoid this problem, we have removed those S3 methods, so you must now explicitly opt in to dtplyr translations by using lazy_dt().
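As a rough sketch of what explicit opt-in looks like in practice (the `sales` table and its column names are made up for illustration and are not part of the release):

```r
library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# A plain data.table; `sales` and its columns are hypothetical example data.
sales <- data.table(
  region = c("a", "b", "a", "b"),
  amount = c(10, 20, 30, 40)
)

# Opt in explicitly: wrap the data.table so that dplyr verbs are translated.
lazy_sales <- lazy_dt(sales)

totals <- lazy_sales |>
  group_by(region) |>
  summarise(total = sum(amount))

# `totals` is still a lazy description of the computation, not a data.table.
class(totals)
show_query(totals)

# Ask for the result to run the generated data.table code.
# as_tibble() and as.data.frame() work in the same way.
result <- as.data.table(totals)
result
```

The original `sales` object stays an ordinary data.table throughout, so code that never calls lazy_dt() is not affected by having dtplyr loaded.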

dplyr 1.1.0

This release brings support for dplyr 1.1.0’s per-operation grouping and pick():

dt <- lazy_dt(data.frame(x = 1:10, id = 1:2))
dt |> 
  summarise(mean = mean(x), .by = id) |> 
  show_query()
#> `_DT1`[, .(mean = mean(x)), keyby = .(id)]

dt <- lazy_dt(data.frame(x = 1:10, y = runif(10)))
dt |> 
  mutate(row_sum = rowSums(pick(x))) |> 
  show_query()
#> copy(`_DT2`)[, `:=`(row_sum = rowSums(data.table(x = x)))]

Per-operation grouping was one of the dplyr 1.1.0 features inspired by data.table, so it’s neat to see it come full circle in this dtplyr release. Future releases will add support for other dplyr 1.1.0 features like the new join_by() syntax and reframe().
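To make the "per-operation" part concrete, here is a small sketch that compares `.by` with a persistent `group_by()`. The object names (`dt_by`, `by_result`, `grouped`) and the `centred` column are only illustrative:

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt_by <- lazy_dt(data.frame(x = 1:10, id = 1:2))

# .by groups this single mutate() only; the result carries no grouping.
by_result <- dt_by |>
  mutate(centred = x - mean(x), .by = id)
group_vars(by_result)

# group_by() is persistent: the grouping stays until you ungroup().
grouped <- dt_by |>
  group_by(id) |>
  mutate(centred = x - mean(x))
group_vars(grouped)

# Both approaches compute the same column.
as.data.frame(by_result)
as.data.frame(ungroup(grouped))
```

Calling show_query() on each pipeline is an easy way to compare the data.table code that the two styles generate.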

Improved translations

dtplyr gains new translations for add_count() and unite(), and the ranking functions min_rank(), dense_rank(), percent_rank(), and cume_dist() are now mapped to their data.table equivalents:

dt |> add_count() |> show_query()
#> copy(`_DT2`)[, `:=`(n = .N)]

dt |> tidyr::unite("z", c(x, y)) |> show_query()
#> copy(`_DT2`)[, `:=`(z = paste(x, y, sep = "_"))][, `:=`(c("x", 
#> "y"), NULL)]

dt |> mutate(r = min_rank(x)) |> show_query()
#> copy(`_DT2`)[, `:=`(r = frank(x, ties.method = "min", na.last = "keep"))]

dt |> mutate(r = dense_rank(x)) |> show_query()
#> copy(`_DT2`)[, `:=`(r = frank(x, ties.method = "dense", na.last = "keep"))]

This release also includes three translation improvements that yield better performance. When the data has already been copied, arrange() will use setorder() instead of order(), and select() will drop unwanted columns by reference (i.e. with var := NULL). And slice() now uses an intermediate variable to reduce the computation time of row selection.
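If you want to see these translations for yourself, one way is to compare show_query() with and without an earlier copying step. The sketch below assumes that a preceding mutate() is enough to count as "already copied"; the data and object names (`dt_perf`, `copied`) are made up for illustration, and the exact generated code may differ between dtplyr versions:

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

dt_perf <- lazy_dt(data.frame(x = c(3, 1, 2), y = c("c", "a", "b")))

# mutate() works on a copy of the input, so later steps in the same
# pipeline can modify that copy by reference.
copied <- dt_perf |> mutate(z = x * 2)

# After a copy: inspect how arrange() and select() are translated.
copied |> arrange(x) |> show_query()
copied |> select(-y) |> show_query()

# Without a prior copy: compare with the translations above.
dt_perf |> arrange(x) |> show_query()
dt_perf |> select(-y) |> show_query()

# The results themselves are the same whichever translation is used.
sorted <- copied |> arrange(x) |> as_tibble()
trimmed <- copied |> select(-y) |> as_tibble()
```

Because the by-reference operations only ever touch the internal copy, the data frame you passed to lazy_dt() is left unchanged.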

Acknowledgements

A massive thanks to Mark Fairbanks, who did most of the work for this release, ably aided by the other dtplyr maintainers @eutwt and Maximilian Girlich. And thanks to everyone else who helped make this release possible, whether it was with code, documentation, or insightful comments: @abalter, @akaviaLab, @camnesia, @caparks2, @DavisVaughan, @eipi10, @hadley, @jmbarbone, @johnF-moore, @lschneiderbauer, and @NicChr.