We’re thrilled to announce the release of dtplyr 1.3.0. dtplyr gives you the speed of data.table with the syntax of dplyr; you write dplyr (and tidyr) code and dtplyr translates it to the data.table equivalent.
You can install it from CRAN with:
install.packages("dtplyr")
This blog post will give you an overview of the changes in this version: dtplyr no longer adds translations directly to data.tables, it includes some dplyr 1.1.0 updates, and we have made some performance improvements. As always, you can see a full list of changes in the release notes.
Breaking changes
In previous versions, dtplyr registered translations that kicked in whenever you used a data.table. This caused problems because merely loading dtplyr could make otherwise working code fail: dplyr and tidyr functions would now return lazy_dt objects instead of data.table objects. To avoid this problem, we have removed those S3 methods, so you must now explicitly opt in to dtplyr translations by using lazy_dt().
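As a minimal sketch of the explicit opt-in workflow (the table and its column names, `sales`, `region`, and `amount`, are made up for illustration), you wrap a data.table in lazy_dt(), apply dplyr verbs, and convert back once you want the result:

```r
library(data.table)
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

# A plain data.table; the column names are purely illustrative.
sales <- data.table(
  region = c("a", "b", "a", "b"),
  amount = c(10, 20, 30, 40)
)

# Explicit opt-in: wrap the data.table so dplyr verbs are translated lazily.
totals <- sales |>
  lazy_dt() |>
  group_by(region) |>
  summarise(total = sum(amount))

# Nothing has been computed yet; inspect the generated data.table code.
show_query(totals)

# Converting back to a data.table runs the translated code.
result <- as.data.table(totals)
```

Here `result` is an ordinary data.table again, so code downstream of this pipeline never sees a lazy_dt object.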
dplyr 1.1.0
This release brings support for dplyr 1.1.0’s per-operation grouping and pick():
dt <- lazy_dt(data.frame(x = 1:10, id = 1:2))
dt |>
summarise(mean = mean(x), .by = id) |>
show_query()
#> `_DT1`[, .(mean = mean(x)), keyby = .(id)]
dt <- lazy_dt(data.frame(x = 1:10, y = runif(10)))
dt |>
mutate(row_sum = rowSums(pick(x))) |>
show_query()
#> copy(`_DT2`)[, `:=`(row_sum = rowSums(data.table(x = x)))]
Per-operation grouping was one of the dplyr 1.1.0 features inspired by data.table, so it’s neat to see it come full circle in this dtplyr release. Future releases will add support for other dplyr 1.1.0 features like the new join_by() syntax and reframe().
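To make per-operation grouping concrete, here is a small sketch that compares .by with the persistent group_by() approach on the same toy data as above. The variable names `by_result` and `group_result` are ours, and the comparison is illustrative rather than a statement about how dtplyr implements either path:

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

dt <- lazy_dt(data.frame(x = 1:10, id = 1:2))

# Per-operation grouping: the grouping applies to this summarise() only.
by_result <- dt |>
  summarise(mean = mean(x), .by = id) |>
  as.data.frame()

# Persistent grouping with group_by(), for comparison.
group_result <- dt |>
  group_by(id) |>
  summarise(mean = mean(x)) |>
  as.data.frame()

# Both approaches should give the same per-id means.
all.equal(by_result$mean, group_result$mean)
```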
Improved translations
dtplyr gains new translations for add_count() and unite(), and the ranking functions min_rank(), dense_rank(), percent_rank(), and cume_dist() are now mapped to their data.table equivalents:
dt |> add_count() |> show_query()
#> copy(`_DT2`)[, `:=`(n = .N)]
dt |> tidyr::unite("z", c(x, y)) |> show_query()
#> copy(`_DT2`)[, `:=`(z = paste(x, y, sep = "_"))][, `:=`(c("x",
#> "y"), NULL)]
dt |> mutate(r = min_rank(x)) |> show_query()
#> copy(`_DT2`)[, `:=`(r = frank(x, ties.method = "min", na.last = "keep"))]
dt |> mutate(r = dense_rank(x)) |> show_query()
#> copy(`_DT2`)[, `:=`(r = frank(x, ties.method = "dense", na.last = "keep"))]
This release also includes three translation improvements that yield better performance. When data has previously been copied, arrange() will use setorder() instead of order(), and select() will drop unwanted columns by reference (i.e. with var := NULL). And slice() now uses an intermediate variable to reduce computation time of row selection.
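One way to see these translations for yourself is to compare show_query() on a pipeline that has already copied its input with one that has not. The sketch below assumes that a preceding mutate() is what triggers the copy, as in the copy(...) output shown earlier. We have deliberately left out the printed queries, since the exact generated code may differ between versions:

```r
library(dplyr, warn.conflicts = FALSE)
library(dtplyr)

dt <- lazy_dt(data.frame(x = c(3, 1, 2), y = c("c", "a", "b")))

# mutate() copies the input, so the later arrange() and select() steps
# are free to modify that copy by reference.
copied <- dt |>
  mutate(z = x * 2) |>
  arrange(x) |>
  select(x, z)
show_query(copied)

# With no prior copy, the translation has to leave the original data
# untouched, so the generated code is expected to look different.
uncopied <- dt |>
  arrange(x) |>
  select(x)
show_query(uncopied)

# Whichever translation is used, the computed result is the same.
sorted <- as.data.frame(copied)
```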
Acknowledgements
A massive thanks to Mark Fairbanks who did most of the work for this release, ably aided by the other dtplyr maintainers @eutwt and Maximilian Girlich. And thanks to everyone else who helped make this release possible, whether it was with code, documentation, or insightful comments: @abalter, @akaviaLab, @camnesia, @caparks2, @DavisVaughan, @eipi10, @hadley, @jmbarbone, @johnF-moore, @lschneiderbauer, and @NicChr.