dbplyr 2.0.0

  dbplyr, dplyr

  Hadley Wickham

We’re pleased to announce the release of dbplyr 2.0.0. dbplyr is a database backend for dplyr that allows you to use a remote database as if it was a collection of local data frames: you write ordinary dplyr code and dbplyr translates it to SQL for you.

You can install it from CRAN with:

install.packages("dbplyr")

This blog post covers the major improvements in this version:

Please see the release notes for a full list of changes.

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

dplyr 1.0.0 compatibility

dbplyr now supports all relevant features added in dplyr 1.0.0:

  • across() is now translated into individual SQL statements.

    lf <- lazy_frame(g = 1, a = 1, b = 2, c = 3)
    lf %>% 
      group_by(g) %>% 
      summarise(across(everything(), mean, na.rm = TRUE))
    #> <SQL>
    #> SELECT `g`, AVG(`g`) AS `g`, AVG(`a`) AS `a`, AVG(`b`) AS `b`, AVG(`c`) AS `c`
    #> FROM `df`
    #> GROUP BY `g`
  • rename() and select() support dplyr tidyselect syntax, apart from predicate functions which can’t easily work on computed queries. You can now use rename_with() to programmatically rename columns.

    lf <- lazy_frame(x1 = 1, x2 = 2, x3 = 3, y1 = 4, y2 = 3)
    lf %>% select(starts_with("x") & !"x3")
    #> <SQL>
    #> SELECT `x1`, `x2`
    #> FROM `df`
    lf %>% select(ends_with("2") | ends_with("3"))
    #> <SQL>
    #> SELECT `x2`, `y2`, `x3`
    #> FROM `df`
    lf %>% rename_with(toupper)
    #> <SQL>
    #> SELECT `x1` AS `X1`, `x2` AS `X2`, `x3` AS `X3`, `y1` AS `Y1`, `y2` AS `Y2`
    #> FROM `df`
  • relocate() makes it easy to move columns around:

    lf <- lazy_frame(x1 = 1, x2 = 2, y1 = 4, y2 = 3)
    lf %>% relocate(starts_with("y"))
    #> <SQL>
    #> SELECT `y1`, `y2`, `x1`, `x2`
    #> FROM `df`
  • slice_min(), slice_max(), and slice_sample() are now supported, and slice_head() and slice_tail() throw informative error messages (since they don’t make sense for databases).

    lf <- lazy_frame(g = rep(1:2, 5), x = 1:10)
    lf %>% 
      group_by(g) %>% 
      slice_min(x, prop = 0.5)
    #> <SQL>
    #> SELECT `g`, `x`
    #> FROM (SELECT `g`, `x`, CUME_DIST() OVER (PARTITION BY `g` ORDER BY `x`) AS `q01`
    #> FROM `df`) `q01`
    #> WHERE (`q01` <= 0.5)
    
    lf %>% 
      group_by(g) %>% 
      slice_sample(x, n = 10, with_ties = TRUE)
    #> <SQL>
    #> SELECT `g`, `x`
    #> FROM (SELECT `g`, `x`, ROW_NUMBER() OVER (PARTITION BY `g` ORDER BY random()) AS `q01`
    #> FROM `df`) `q01`
    #> WHERE (`q01` <= 10)

    Note that these slices are translated into window functions, and because you can’t use a window function directly inside a WHERE clause, they must be wrapped in a subquery.

SQL translation

The dbplyr documentation now does a much better job of providing the details of its SQL translation. Each backend and each major verb has a documentation page giving the basics of the translation. This will hopefully make it much easier to learn what is and isn’t supported by dbplyr. Visit https://dbplyr.tidyverse.org/reference/index.html to see the new docs.

There are also many improvements to SQL generation. Here are a few of the most important:

  • Join functions gain an na_matches argument that allows you to control whether or not NA (NULL) values match other NA values. The default is "never", which is the usual behaviour in databases. You can set na_matches = "na" to match R’s usual join behaviour.

    df1 <- tibble(x = c(1, 2, NA))
    df2 <- tibble(x = c(NA, 1), y = 1:2)
    df1 %>% inner_join(df2, by = "x")
    #> # A tibble: 2 x 2
    #>       x     y
    #>   <dbl> <int>
    #> 1     1     2
    #> 2    NA     1
    
    db1 <- memdb_frame(x = c(1, 2, NA))
    db2 <- memdb_frame(x = c(NA, 1), y = 1:2)
    db1 %>% inner_join(db2, by = "x")
    #> # Source:   lazy query [?? x 2]
    #> # Database: sqlite 3.34.1 [:memory:]
    #>       x     y
    #>   <dbl> <int>
    #> 1     1     2
    
    db1 %>% inner_join(db2, by = "x", na_matches = "na")
    #> # Source:   lazy query [?? x 2]
    #> # Database: sqlite 3.34.1 [:memory:]
    #>       x     y
    #>   <dbl> <int>
    #> 1    NA     1
    #> 2     1     2

    This translation is powered by the new sql_expr_matches() generic, because every database seems to have a slightly different way to express this idea. Learn more at https://modern-sql.com/feature/is-distinct-from.

    db1 %>% inner_join(db2, by = "x") %>% show_query()
    #> <SQL>
    #> SELECT `LHS`.`x` AS `x`, `y`
    #> FROM `dbplyr_001` AS `LHS`
    #> INNER JOIN `dbplyr_002` AS `RHS`
    #> ON (`LHS`.`x` = `RHS`.`x`)
    db1 %>% inner_join(db2, by = "x", na_matches = "na") %>% show_query()
    #> <SQL>
    #> SELECT `LHS`.`x` AS `x`, `y`
    #> FROM `dbplyr_001` AS `LHS`
    #> INNER JOIN `dbplyr_002` AS `RHS`
    #> ON (`LHS`.`x` IS `RHS`.`x`)
  • Subqueries no longer include an ORDER BY clause. This is not part of the formal SQL specification so it has very limited support across databases. Now such queries generate a warning suggesting that you move your arrange() call later in the pipeline.

    lf <- lazy_frame(g = rep(1:2, each = 5), x = sample(1:10))
    lf %>% 
      group_by(g) %>% 
      summarise(n = n()) %>% 
      arrange(desc(n)) %>% 
      filter(n > 1)
    #> Warning: ORDER BY is ignored in subqueries without LIMIT
    #>  Do you need to move arrange() later in the pipeline or use window_order() instead?
    #> <SQL>
    #> SELECT *
    #> FROM (SELECT `g`, COUNT(*) AS `n`
    #> FROM `df`
    #> GROUP BY `g`) `q01`
    #> WHERE (`n` > 1.0)

    As the warning suggests, there’s one exception: ORDER BY is still generated if a LIMIT is present. Across databases, this tends to change which rows are returned, but not necessarily their order.

    lf %>% 
      group_by(g) %>% 
      summarise(n = n()) %>% 
      arrange(desc(n)) %>% 
      head(5) %>% 
      filter(n > 1)
    #> <SQL>
    #> SELECT *
    #> FROM (SELECT `g`, COUNT(*) AS `n`
    #> FROM `df`
    #> GROUP BY `g`
    #> ORDER BY `n` DESC
    #> LIMIT 5) `q01`
    #> WHERE (`n` > 1.0)
  • dbplyr includes built-in backends for Redshift (which only differs from PostgreSQL in a few places) and SAP HANA. These require the development versions of RPostgres and odbc respectively.

    lf <- lazy_frame(x = "a", y = "b", con = simulate_redshift())
    lf %>% mutate(z = paste0(x, y))
    #> <SQL>
    #> SELECT `x`, `y`, `x` || `y` AS `z`
    #> FROM `df`

There are a number of minor changes that affect the translation of individual functions. Here are a few of the most important:

  • All backends now translate n() to count(*) and support ::

    lf <- lazy_frame(x = 1:10)
    lf %>% summarise(n = dplyr::n())
    #> <SQL>
    #> SELECT COUNT(*) AS `n`
    #> FROM `df`
  • PostgreSQL gets translations for lubridate period functions:

    lf <- lazy_frame(x = Sys.Date(), con = simulate_postgres())
    lf %>%
      mutate(year = x + years(1))
    #> <SQL>
    #> SELECT `x`, `x` + CAST('1 years' AS INTERVAL) AS `year`
    #> FROM `df`
  • Oracle assumes version 12c is available so we can use a simpler translation for head() that works in more places:

    lf <- lazy_frame(x = 1, con = simulate_oracle())
    lf %>% head(5)
    #> <SQL>
    #> SELECT *
    #> FROM (`df`) 
    #> FETCH FIRST 5 ROWS ONLY

Thanks to the artistic talents of Allison Horst, dbplyr has a beautiful new logo:

Extensibility

Finally, dbplyr introduces a number of new generics to help tease apart the currently overly complicated relationship with dplyr. This should make creating new backends much easier, but does require some changes from existing backends. These changes should be invisible to the end user and will play out slowly over the next 12 months. See vignette("backend-2", package = "dbplyr") for details.

Acknowledgements

A big thanks to everyone who helped with this release by reporting bugs, discussing issues, and contributing code: @abalter, @adhi-r, @adithya604, @admoseremic, @ahmed-alhindawi, @alexfun, @alexkyllo, @alistaire47, @batpigandme, @BenCarlsen, @bengowan, @bersbersbers, @bertrandh, @bkkkk, @boshek, @bradenkinard, @cderv, @CerebralMastication, @chris-billingham, @cmichaud92, @cole-johanson, @copernican, @daattali, @Daveyr, @davidchall, @DavidPatShuiFong, @dereksonderegger, @dfrankow, @dkulp2, @dpprdan, @dsen6644, @DSLituiev, @EarlGlynn, @edgararuiz, @edoardomichielon, @elbamos, @ericemc3, @fahadshery, @fh-jgutman, @ftoresh, @GrayAlex49, @gregleleu, @hadley, @halpo, @hannes101, @hansvancalster, @hrbrmstr, @huelf, @iangow, @ianmcook, @jakeybob, @Janlow, @jarodmeng, @javierluraschi, @jerisalan, @jessekps, @jimhester, @jkylearmstrong, @jmerone, @jonkeane, @kmishra9, @kohleth, @kondofersky, @krlmlr, @lionel-, @lorenzwalthert, @LukasWallrich, @lukerobert, @lymanmark, @machow, @martin-a-wade, @mgirlich, @MikeJohnPage, @millerh1, @mkirzon, @moodymudskipper, @mskyttner, @Naareman, @natbprice, @okhoma, @OssiLehtinen, @PauloJhonny, @r2evans, @ramnathv, @returnString, @rjpat, @rlh1994, @robchallen, @roboton, @romainfrancois, @rundel, @saadaslam, @samssann, @samstiyer, @schradj, @sheepworrier, @shosaco, @shyams80, @stiberger, @stvrd, @tarunn90, @tedmoorman, @tgvaughan, @tonyk7440, @trevorcampbell, @TuomoNieminen, @tvedebrink, @vadimus202, @vnijs, @wangyuchen, @yitao-li, and @ZahraEconomist.