dtplyr 1.2.0

  Hadley Wickham

We’re thrilled to announce that dtplyr 1.2.0 is now on CRAN. dtplyr gives you the speed of data.table with the syntax of dplyr; you write dplyr (and tidyr) code and dtplyr translates it to the data.table equivalent.

You can install dtplyr from CRAN with:

install.packages("dtplyr")
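
If you haven't used dtplyr before, here is a minimal sketch of the basic workflow described above. It uses the built-in mtcars dataset, which is my choice for illustration rather than anything specific to this release: wrap a data frame with lazy_dt(), write ordinary dplyr code, inspect the translation with show_query(), and then force evaluation with as_tibble().

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

# Wrap a data frame in a lazy data.table; nothing is computed yet
cars <- lazy_dt(mtcars)

# Ordinary dplyr verbs build up a data.table expression
by_cyl <- cars %>%
  filter(wt < 5) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg))

# Print the data.table code that dtplyr generated
show_query(by_cyl)

# Run the query and return the result as a tibble
result <- as_tibble(by_cyl)
```

Nothing is evaluated until you ask for the result, so the whole pipeline can be translated into data.table code before it runs.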

I’ll discuss three major changes in this blog post:

  • New authors
  • tidyr translations
  • Improvements to joins

There are also over 20 minor improvements to the quality of translations; you can see a full list in the release notes.

library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

New authors

The biggest news in this release is the addition of three new authors: Mark Fairbanks, Maximilian Girlich, and Ryan Dickerson are now dtplyr authors in recognition of their significant and sustained contributions. In fact, they implemented the bulk of the improvements in this release!

tidyr translations

dtplyr gains translations for many more tidyr verbs including complete(), drop_na(), expand(), fill(), nest(), pivot_longer(), replace_na(), and separate(). A few examples are shown below:

dt <- lazy_dt(data.frame(x = c(NA, "x.y", "x.z", "y.z")))
dt %>% 
  separate(x, c("A", "B"), sep = "\\.", remove = FALSE) %>% 
  show_query()
#> copy(`_DT1`)[, `:=`(c("A", "B"), tstrsplit(x, split = "\\."))]

dt <- lazy_dt(data.frame(x = c(1, NA, NA, 2, NA)))
dt %>% 
  fill(x) %>% 
  show_query()
#> copy(`_DT2`)[, `:=`(x = nafill(x, "locf"))]

dt %>% 
  replace_na(list(x = 99)) %>% 
  show_query()
#> copy(`_DT2`)[, `:=`(x = fcoalesce(x, 99))]

dt <- lazy_dt(relig_income)
dt %>%
  pivot_longer(!religion, names_to = "income", values_to = "count") %>% 
  show_query()
#> melt(`_DT3`, measure.vars = c("<$10k", "$10-20k", "$20-30k", 
#> "$30-40k", "$40-50k", "$50-75k", "$75-100k", "$100-150k", ">150k", 
#> "Don't know/refused"), variable.name = "income", value.name = "count", 
#>     variable.factor = FALSE)
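
show_query() only prints the translation. As a rough sketch of what happens when the same pipelines are actually run, the code below collects each result with as_tibble(); the object names (df, filled, replaced, long) are mine and are only there for illustration.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(tidyr)

df <- data.frame(x = c(1, NA, NA, 2, NA))

# fill() carries the last observed value forward (translated to nafill(x, "locf"))
filled <- lazy_dt(df) %>%
  fill(x) %>%
  as_tibble()

# replace_na() swaps each NA for the supplied value (translated to fcoalesce(x, 99))
replaced <- lazy_dt(df) %>%
  replace_na(list(x = 99)) %>%
  as_tibble()

# pivot_longer() stacks every column except religion (translated to melt())
long <- lazy_dt(relig_income) %>%
  pivot_longer(!religion, names_to = "income", values_to = "count") %>%
  as_tibble()

# There should be one output row per religion/income pair
nrow(long) == nrow(relig_income) * (ncol(relig_income) - 1)
```

Row order after pivot_longer() may differ from what tidyr gives on a plain data frame, so it's safer to compare row counts and column names than to compare the rows position by position.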

Improvements to joins

The join functions have been overhauled: inner_join(), left_join(), and right_join() now all produce a call to [, rather than to merge():

dt1 <- lazy_dt(data.frame(x = 1:3))
dt2 <- lazy_dt(data.frame(x = 2:3, y = c("a", "b")))

dt1 %>% inner_join(dt2, by = "x") %>% show_query()
#> `_DT4`[`_DT5`, on = .(x), nomatch = NULL, allow.cartesian = TRUE]
dt1 %>% left_join(dt2, by = "x") %>% show_query()
#> `_DT5`[`_DT4`, on = .(x), allow.cartesian = TRUE]
dt2 %>% right_join(dt1, by = "x") %>% show_query()
#> `_DT5`[`_DT4`, on = .(x), allow.cartesian = TRUE]

This can make the translation a little longer for simple joins, but it greatly simplifies the underlying code. This simplification has made it easier to more closely match dplyr behaviour for column order, handling named by specifications, Cartesian joins with by = character(), and managing duplicated variable names.
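
One way to convince yourself that the translated joins behave like dplyr is to run the same join through both paths and compare the results. This is only a sketch using the small tables from the example above; the names df1, df2, inner, left, inner_ref, and left_ref are mine.

```r
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)

df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 2:3, y = c("a", "b"))

# Joins translated by dtplyr, then collected as tibbles
inner <- lazy_dt(df1) %>%
  inner_join(lazy_dt(df2), by = "x") %>%
  as_tibble()
left <- lazy_dt(df1) %>%
  left_join(lazy_dt(df2), by = "x") %>%
  as_tibble()

# Reference results from dplyr on ordinary data frames
inner_ref <- inner_join(df1, df2, by = "x")
left_ref <- left_join(df1, df2, by = "x")

# Compare column order and row counts between the two paths
identical(names(inner), names(inner_ref))
identical(names(left), names(left_ref))
nrow(inner) == nrow(inner_ref)
nrow(left) == nrow(left_ref)
```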

Acknowledgements

As always, tidyverse packages wouldn’t be possible without the community, so a big thanks goes out to all 35 folks who helped to make this release a reality: @akr-source, @batpigandme, @bguillod, @cgoo4, @chenx2018, @D-Se, @eutwt, @hadley, @jatherrien, @jdmoralva, @jennybc, @jtlandis, @kmishra9, @lutzgruber, @lutzgruber-quantco, @markfairbanks, @mgirlich, @mrcaseb, @nassuphis, @nigeljmckernan, @NZambranoc, @PMassicotte, @psads-git, @quid-agis, @romainfrancois, @roni-fultheim, @samlipworth, @sanjmeh, @sbashevkin, @StatsGary, @torema-ed, @verajosemanuel, @Waldi73, @wurli, and @yiugn.