dbplyr 2.2.0

  dbplyr, dplyr

  Hadley Wickham

We’re chuffed to announce the release of dbplyr 2.2.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.

You can install it from CRAN with:

install.packages("dbplyr")

This blog post will discuss some of the biggest improvements to SQL translations, introduce copy_inline(), and discuss support for dplyr’s row_ functions. You can see a full list of changes in the release notes.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

SQL improvements

This release brings with it a host of useful improvements to SQL generation. Firstly, dbplyr uses * where possible. This is particularly nice when you have a table with many names:

lf <- lazy_frame(!!!setNames(as.list(1:26), letters))
lf
#> <SQL>
#> SELECT *
#> FROM `df`

If you’re familiar with dbplyr’s old SQL output, you’ll also notice that the output receives some basic syntax highlighting and much improved line breaks and indenting.

The use of * is particularly nice when you have a subquery. Previously the generated SQL would have repeated the column names a to z twice, once for each subquery.

lf |> 
  mutate(x2 = x + 1, x3 = x2 + 1)
#> <SQL>
#> SELECT *, `x2` + 1.0 AS `x3`
#> FROM (
#>   SELECT *, `x` + 1.0 AS `x2`
#>   FROM `df`
#> ) `q01`

show_query(), compute() and collect() have experimental support for common table expressions (CTEs), available by setting cte = TRUE argument. CTEs are the database equivalent of the pipe; they allow you to write subqueries in the order in which they’re evaluated, rather than the opposite.

lf |> 
  mutate(x2 = x + 1, x3 = x2 + 1) |> 
  show_query(cte = TRUE)
#> <SQL>
#> WITH `q01` AS (
#>   SELECT *, `x` + 1.0 AS `x2`
#>   FROM `df`
#> )
#> SELECT *, `x2` + 1.0 AS `x3`
#> FROM `q01`

We’ve also added support for translating cut(): this is a very useful base R function that’s fiddly to express in SQL:

lf <- lazy_frame(x = 1)

translate_sql(
  cut(x, c(0, 25, 50, 100))
)
#> <SQL> CASE
#> WHEN (`x` <= 0.0) THEN NULL
#> WHEN (`x` <= 25.0) THEN '(0,25]'
#> WHEN (`x` <= 50.0) THEN '(25,50]'
#> WHEN (`x` <= 100.0) THEN '(50,100]'
#> WHEN (`x` > 100.0) THEN NULL
#> END
  
# Can provide custom labels
translate_sql(
  cut(x, c(0, 25, 50, 100), labels = c("small", "medium", "large"))
)
#> <SQL> CASE
#> WHEN (`x` <= 0.0) THEN NULL
#> WHEN (`x` <= 25.0) THEN 'small'
#> WHEN (`x` <= 50.0) THEN 'medium'
#> WHEN (`x` <= 100.0) THEN 'large'
#> WHEN (`x` > 100.0) THEN NULL
#> END

# And use Inf/-Inf bounds
translate_sql(
  cut(
    x, 
    breaks = c(-Inf, 25, 50, Inf), 
    labels = c("small", "medium", "large")
  )
)
#> <SQL> CASE
#> WHEN (`x` <= 25.0) THEN 'small'
#> WHEN (`x` <= 50.0) THEN 'medium'
#> WHEN (`x` > 50.0) THEN 'large'
#> END

There are also a whole host of minor translation improvements which you can read about in the release notes.

copy_inline()

copy_inline() provides a new way to get data out of R and into the database by embedding the data directly in the query. This is a natural complement to copy_to() which writes data to a temporary table. copy_inline() is faster for small datasets and is particularly useful when you don’t have the permissions needed to create temporary tables. Here’s a very simple example of what the generated SQL will look like for PostgreSQL

df <- data.frame(x = 1:5, y = letters[1:5])
show_query(copy_inline(simulate_postgres(), df))
#> <SQL>
#> SELECT CAST(`x` AS INTEGER) AS `x`, CAST(`y` AS TEXT) AS `y`
#> FROM (  VALUES (1, 'a'), (2, 'b'), (3, 'c'), (4, 'd'), (5, 'e')) AS drvd(`x`, `y`)

Row modification

dplyr 1.0.0 added a family of row modification functions: rows_insert(), rows_append(), rows_update(), rows_patch(), rows_upsert(), and rows_delete(). These functions were inspired by SQL and are now supported by dbplyr.

The primary purpose of these functions is to modify the underlying tables. Because that purpose is dangerous, you’ll need to deliberate opt-in to modification by setting in_place = TRUE. Use the default behaviour, in_place = FALSE, to simulate what the result will be.

With in_place = FALSE, rows_insert() and rows_append() performs an INSERT, rows_update() and rows_path() perform an UPDATE, and rows_delete() performs a DELETE.

Acknowledgements

Most of the work in this release was done by dbplyr author @mgirlich: thanks for all your continued hard work!

And a big thanks to all 77 other contributors who’s comments, code, and discussion helped make a better package: @001ben, @1beb, @Ada-Nick, @admivsn, @alex-m-ffm, @andreassoteriadesmoj, @andyquinterom, @apalacio10, @apalacio9502, @aris-hastings, @asimumba, @ben1787, @boshek, @caljnj, @carlganz, @CLRafaelR, @coponhub, @cslewis04, @dbaston, @dpprdan, @DrFabach, @EarlGlynn, @edonnachie, @eipi10, @eitsupi, @fh-afrachioni, @fh-kpikhart, @ggpinto, @GuillaumePressiat, @hadley, @HarlanH, @hdplsa, @iangow, @James-G-Hill, @jennybc, @jiaqizhu-learning, @jonkeane, @jsspurgeon, @julieinsan, @k6adams, @kelnerrr, @kmishra9, @krlmlr, @Leprechault, @Liudvikas-vinted, @LukasWallrich, @m-sostero, @maelle, @mattcane, @mfherman, @mkoohafkan, @Mosk915, @nassuphis, @nirski, @nviets, @overmar, @p-schaefer, @plogacev, @randy3k, @recleev, @rmcd1024, @rsund, @rvomm, @samssann, @sfirke, @Sir-Chibi, @sitendug, @somatusag, @stephenashton-dhsc, @swnydick, @thothal, @torbjorn, @tsengj, @vspinu, @Waftmaster, @williamlai2, and @yitao-li.