readr 2.0.0

  Jim Hester

We’re thrilled to announce the release of readr 2.0.0!

The readr package makes it easy to get rectangular data out of comma-separated (csv), tab-separated (tsv), or fixed-width (fwf) files and into R. It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.

The easiest way to install the latest version from CRAN is to install the whole tidyverse.

install.packages("tidyverse")

Alternatively, install just readr from CRAN:

install.packages("readr")

This blog post will show off the most important changes to the package. These include built-in support for reading multiple files at once, lazy reading, and automatic delimiter guessing, among other changes.

You can see a full list of changes in the readr release notes and vroom release notes.

readr 2nd edition

readr 2.0.0 is a major release of readr and introduces a new 2nd edition parsing and writing engine implemented via the vroom package. This engine takes advantage of lazy reading, multi-threading and performance characteristics of modern SSD drives to significantly improve the performance of reading and writing compared to the 1st edition engine.

We have done our best to ensure that the two editions parse csv files as similarly as possible, but in case there are differences that affect your code, you can use the with_edition() or local_edition() functions to temporarily change the edition of readr for a section of code.
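For example, a minimal sketch using one of readr's bundled example files:

```r
library(readr)

# Parse this one expression with the 1st edition engine
with_edition(1, read_csv(readr_example("mtcars.csv")))

# Or switch to the 1st edition for the remainder of the current scope
local_edition(1)
```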

We will continue to support the 1st edition for a number of releases, but our goal is to ensure that the 2nd edition is uniformly better than the 1st edition so we plan to eventually deprecate and then remove the 1st edition code.

Reading multiple files at once

The 2nd edition has built-in support for reading sets of files with the same columns into one output table in a single command. Just pass a vector of filenames to the reading function.

First we generate some files to read by splitting the nycflights dataset by airline.

library(nycflights13)
# Write one TSV file per carrier, e.g. /tmp/flights_AA.tsv
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("/tmp/flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to readr.

If the filenames contain data, such as the date when the sample was collected, use the id argument to include the paths as a column in the data. You will likely have to post-process the paths to keep only the relevant portion for your use case.

library(dplyr)
files <- fs::dir_ls(path = "/tmp", glob = "*flights*tsv")
files
#> /tmp/flights_9E.tsv /tmp/flights_AA.tsv /tmp/flights_AS.tsv 
#> /tmp/flights_B6.tsv /tmp/flights_DL.tsv /tmp/flights_EV.tsv 
#> /tmp/flights_F9.tsv /tmp/flights_FL.tsv /tmp/flights_HA.tsv 
#> /tmp/flights_MQ.tsv /tmp/flights_OO.tsv /tmp/flights_UA.tsv 
#> /tmp/flights_US.tsv /tmp/flights_VX.tsv /tmp/flights_WN.tsv 
#> /tmp/flights_YV.tsv
readr::read_tsv(files, id = "path")
#> Rows: 336776 Columns: 20
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
#> dttm  (1): time_hour
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 336,776 x 20
#>    path         year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr>       <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 /tmp/fligh…  2013     1     1      810            810         0     1048
#>  2 /tmp/fligh…  2013     1     1     1451           1500        -9     1634
#>  3 /tmp/fligh…  2013     1     1     1452           1455        -3     1637
#>  4 /tmp/fligh…  2013     1     1     1454           1500        -6     1635
#>  5 /tmp/fligh…  2013     1     1     1507           1515        -8     1651
#>  6 /tmp/fligh…  2013     1     1     1530           1530         0     1650
#>  7 /tmp/fligh…  2013     1     1     1546           1540         6     1753
#>  8 /tmp/fligh…  2013     1     1     1550           1550         0     1844
#>  9 /tmp/fligh…  2013     1     1     1552           1600        -8     1749
#> 10 /tmp/fligh…  2013     1     1     1554           1600        -6     1701
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

Lazy reading

Like vroom, the 2nd edition uses lazy reading by default. This means when you first call a read_*() function the delimiters and newlines throughout the entire file are found, but the data is not actually read until it is used in your program. This can provide substantial speed improvements for reading character data. It is particularly useful during interactive exploration of only a subset of a full dataset.

However, this also means that problematic values are not necessarily seen immediately, but only when they are actually read. Because of this, a warning is issued the first time a problem is encountered, which may happen after the initial read.

Run problems() on your dataset to read the entire dataset and return all of the problems found. Run problems(lazy = TRUE) if you only want to retrieve the problems found so far.
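As a small sketch of how this plays out, using literal data in which the stray string "oops" cannot be parsed as a double:

```r
library(readr)

# Force the column to double; the last value cannot be parsed
df <- read_csv(I("x\n1\n2\noops"), col_types = "d")

# Reads the full dataset and reports every parsing problem found
problems(df)
```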

Deleting files after reading is also impacted by laziness. On Windows, open files cannot be deleted while a process has them open. Because readr keeps a file open when reading lazily, you cannot read a file lazily and then immediately delete it. readr will in most cases close the file once it has been completely read; however, if you know you want to delete the file after reading it, it is best to pass lazy = FALSE when reading the file.
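For example, a minimal sketch using a temporary file:

```r
library(readr)

path <- tempfile(fileext = ".csv")
write_csv(data.frame(a = 1, b = 2), path)

# Read eagerly so readr releases the file handle...
df <- read_csv(path, lazy = FALSE)

# ...and the file can then be deleted safely, even on Windows
file.remove(path)
```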

Delimiter guessing

The 2nd edition supports automatic guessing of delimiters. This feature is inspired by the automatic guessing in data.table::fread(), though the precise method used to perform the guessing differs. Because of this you can now use read_delim() without specifying a delim argument in many cases.

x <- read_delim(readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.

New column specification output

On February 11, 2021 we conducted a survey on Twitter asking for the community’s opinion on the column specification output in readr. We received over 750 😲 responses, and they revealed a lot of useful information:

  • 3/4 of respondents found printing the column specifications helpful. 👍
  • 2/3 of respondents preferred the 2nd edition output vs 1st edition output. 💅
  • Only 1/5 of respondents correctly knew how to suppress printing of the column specifications. 🤯

Based on these results we have added two new ways to more easily suppress the column specification printing.
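Both mechanisms go through show_col_types: pass it as an argument to suppress the message for a single call, or set the readr.show_col_types option to suppress it for the whole session.

```r
library(readr)

# Suppress the column specification message for one call
x <- read_csv(readr_example("mtcars.csv"), show_col_types = FALSE)

# Or suppress it globally for the current session
options(readr.show_col_types = FALSE)
```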

We will continue to print the column specifications by default, using the new style of output.

Note you can still obtain the old output style by printing the column specification object directly.

spec(x)
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

Or show the new style by calling summary() on the specification object.

summary(spec(x))
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

Column selection

The 2nd edition introduces a new argument, col_select, which makes selecting columns to keep (or omit) more straightforward than before. col_select uses the same interface as dplyr::select(), so you can perform very flexible selection operations.

  • Select with the column names directly.

    data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(year, flight, tailnum))
    #> Rows: 32729 Columns: 3
    #> ── Column specification ───────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (1): tailnum
    #> dbl (2): year, flight
    #> 
    #>  Use `spec()` to retrieve the full column specification for this data.
    #>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Or by numeric column.

    data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(1, 2))
    #> Rows: 32729 Columns: 2
    #> ── Column specification ───────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> dbl (2): year, month
    #> 
    #>  Use `spec()` to retrieve the full column specification for this data.
    #>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Drop columns by name by prefixing them with -.

    data <- read_tsv("/tmp/flights_AA.tsv",
      col_select = c(-dep_time, -(air_time:time_hour)))
    #> Rows: 32729 Columns: 13
    #> ── Column specification ───────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr (4): carrier, tailnum, origin, dest
    #> dbl (9): year, month, day, sched_dep_time, dep_delay, arr_time, sched_a...
    #> 
    #>  Use `spec()` to retrieve the full column specification for this data.
    #>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Use the selection helpers such as ends_with().

    data <- read_tsv("/tmp/flights_AA.tsv", col_select = ends_with("time"))
    #> Rows: 32729 Columns: 5
    #> ── Column specification ───────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> dbl (5): dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
    #> 
    #>  Use `spec()` to retrieve the full column specification for this data.
    #>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
  • Or even rename columns by using a named list.

    data <- read_tsv("/tmp/flights_AA.tsv", col_select = list(plane = tailnum, everything()))
    #> Rows: 32729 Columns: 19
    #> ── Column specification ───────────────────────────────────────────────────
    #> Delimiter: "\t"
    #> chr   (4): carrier, tailnum, origin, dest
    #> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
    #> dttm  (1): time_hour
    #> 
    #>  Use `spec()` to retrieve the full column specification for this data.
    #>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
    data
    #> # A tibble: 32,729 x 19
    #>    plane   year month   day dep_time sched_dep_time dep_delay arr_time
    #>    <chr>  <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
    #>  1 N619AA  2013     1     1      542            540         2      923
    #>  2 N3ALAA  2013     1     1      558            600        -2      753
    #>  3 N3DUAA  2013     1     1      559            600        -1      941
    #>  4 N633AA  2013     1     1      606            610        -4      858
    #>  5 N3EMAA  2013     1     1      623            610        13      920
    #>  6 N3BAAA  2013     1     1      628            630        -2     1137
    #>  7 N3CYAA  2013     1     1      629            630        -1      824
    #>  8 N3GKAA  2013     1     1      635            635         0     1028
    #>  9 N4WNAA  2013     1     1      656            700        -4      854
    #> 10 N5FMAA  2013     1     1      656            659        -3      949
    #> # … with 32,719 more rows, and 11 more variables: sched_arr_time <dbl>,
    #> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
    #> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
    #> #   time_hour <dttm>

Name repair

Often the names of columns in the original dataset are not ideal to work with. The 2nd edition uses the same name_repair argument as in the tibble package, so you can use one of the default name repair strategies or provide a custom function. One useful approach is to use the janitor::make_clean_names() function.

read_tsv("/tmp/flights_AA.tsv", name_repair = janitor::make_clean_names)
#> # A tibble: 32,729 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1  2013     1     1      542            540         2      923
#>  2  2013     1     1      558            600        -2      753
#>  3  2013     1     1      559            600        -1      941
#>  4  2013     1     1      606            610        -4      858
#>  5  2013     1     1      623            610        13      920
#>  6  2013     1     1      628            630        -2     1137
#>  7  2013     1     1      629            630        -1      824
#>  8  2013     1     1      635            635         0     1028
#>  9  2013     1     1      656            700        -4      854
#> 10  2013     1     1      656            659        -3      949
#> # … with 32,719 more rows, and 12 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

read_tsv("/tmp/flights_AA.tsv", name_repair = ~ janitor::make_clean_names(., case = "lower_camel"))
#> # A tibble: 32,729 x 19
#>     year month   day depTime schedDepTime depDelay arrTime schedArrTime
#>    <dbl> <dbl> <dbl>   <dbl>        <dbl>    <dbl>   <dbl>        <dbl>
#>  1  2013     1     1     542          540        2     923          850
#>  2  2013     1     1     558          600       -2     753          745
#>  3  2013     1     1     559          600       -1     941          910
#>  4  2013     1     1     606          610       -4     858          910
#>  5  2013     1     1     623          610       13     920          915
#>  6  2013     1     1     628          630       -2    1137         1140
#>  7  2013     1     1     629          630       -1     824          810
#>  8  2013     1     1     635          635        0    1028          940
#>  9  2013     1     1     656          700       -4     854          850
#> 10  2013     1     1     656          659       -3     949          959
#> # … with 32,719 more rows, and 11 more variables: arrDelay <dbl>,
#> #   carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   airTime <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   timeHour <dttm>

UTF-16 and UTF-32 support

The 2nd edition now has much better support for UTF-16 and UTF-32 multi-byte unicode encodings. When files with these encodings are read they are automatically converted to UTF-8 internally in an efficient streaming fashion.
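As before, you opt in to a specific encoding via the locale argument. A sketch, assuming a hypothetical UTF-16 encoded file data-utf16.csv:

```r
library(readr)

# The file is converted to UTF-8 internally in a streaming fashion
df <- read_csv("data-utf16.csv",
  locale = locale(encoding = "UTF-16LE"))
```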

Control over quoting and escaping when writing

You can now explicitly control how fields are quoted and escaped when writing with the quote and escape arguments to write_*() functions.

The quote argument has three options:

  1. "needed" - quote fields only when needed.
  2. "all" - always quote all fields.
  3. "none" - never quote any fields.

The escape argument also has three options, controlling how quote characters within fields are escaped:

  1. "double" - use a doubled quote to escape a quote.
  2. "backslash" - use a backslash to escape a quote.
  3. "none" - do nothing to escape quotes.
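A small sketch of the two arguments together, writing to a temporary file:

```r
library(readr)

df <- data.frame(x = c("plain", "has \"quotes\""), y = 1:2)

path <- tempfile(fileext = ".csv")

# Quote every field, and escape embedded quotes with a backslash
write_csv(df, path, quote = "all", escape = "backslash")

readLines(path)
```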

We hope these options will give people the flexibility they need when writing files using readr.

Literal data

In the 1st edition the reading functions treated any input with a newline in it, or any vector of length > 1, as literal data. In the 2nd edition, vectors of length > 1 are assumed to correspond to multiple files, so we now have a more explicit way to represent literal data: wrap the input in I().

readr::read_csv(I("a,b\n1,2"))
#> Rows: 1 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#> 
#>  Use `spec()` to retrieve the full column specification for this data.
#>  Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

Lighter installation requirements

readr should now be much easier to install. Previous versions of readr used the Boost C++ library to do some of the numeric parsing. While Boost is a well written, robust library, the BH package which contains it has a large number of files (1500+), which can take a long time to install. In addition, the code within these headers is complicated and can take a large amount of memory (2+ GB) to compile, which made it challenging to compile readr from source in some cases.

readr no longer depends on Boost or the BH package, so should compile more quickly in most cases.

Deprecated and superseded functions and features

  • melt_csv(), melt_delim(), melt_tsv() and melt_fwf() have been superseded by functions of the same name in the meltr package, and the versions in readr have been deprecated. These functions rely on the 1st edition parsing code and would be challenging to update to the new parser, so they will be removed when the 1st edition parsing code is eventually removed from readr.

  • read_table2() has been renamed to read_table(), and the read_table2() name has been deprecated. Most users seem to expect read_table() to work like utils::read.table(), so the different names caused confusion. If you want the previous strict behavior of read_table() you can use read_fwf() with fwf_empty() directly (#717).

  • Normalizing newlines in files with just carriage returns \r is no longer supported. The last major OS to use only CR as the newline was ‘classic’ Mac OS, which had its final release in 2001.

License changes

We are systematically re-licensing tidyverse and r-lib packages to use the MIT license, to make our package licenses as clear and permissive as possible.

To this end the readr and vroom packages are now released under the MIT license.

Acknowledgements

A big thanks to everyone who helped make this release possible by testing the development versions, asking questions, providing reprexes, writing code and more! @Aariq, @adamroyjones, @antoine-sachet, @basille, @batpigandme, @benjaminhlina, @bigey, @billdenney, @binkleym, @BrianOB, @cboettig, @CTMCBP, @Dana996, @DarwinAwardWinner, @deeenes, @dernst, @dicorynia, @estroger34, @FixTestRepeat, @GegznaV, @giocomai, @GiuliaPais, @hadley, @HedvigS, @HenrikBengtsson, @hidekoji, @hongooi73, @hsbadr, @idshklein, @jasyael, @JeremyPasco, @jimhester, @jonasfoe, @jzadra, @KasperThystrup, @keesdeschepper, @kingcrimsontianyu, @KnutEBakke, @krlmlr, @larnsce, @ldecicco-USGS, @M3IT, @maelle, @martinmodrak, @meowcat, @messersc, @mewu3, @mgperry, @michaelquinn32, @MikeJohnPage, @mine-cetinkaya-rundel, @msberends, @nbenn, @niheaven, @peranti, @petrbouchal, @pfh, @pgramme, @Raesu, @rmcd1024, @rmvpaeme, @sebneus, @seth127, @Shians, @sonicdoe, @svraka, @timothy-barry, @tmalsburg, @vankesteren, @xuqingyu, and @yutannihilation.