We’re thrilled to announce the release of readr 2.0.0!
The readr package makes it easy to get rectangular data out of comma separated (csv), tab separated (tsv) or fixed width files (fwf) and into R. It is designed to flexibly parse many types of data found in the wild, while still cleanly failing when data unexpectedly changes.
The easiest way to install the latest version from CRAN is to install the whole tidyverse.
install.packages("tidyverse")

Alternatively, install just readr from CRAN:

install.packages("readr")

This blog post will show off the most important changes to the package. These include built-in support for reading multiple files at once, lazy reading, and automatic guessing of delimiters, among other changes.
You can see a full list of changes in the readr release notes and vroom release notes.
readr 2nd edition
readr 2.0.0 is a major release of readr and introduces a new 2nd edition parsing and writing engine implemented via the vroom package. This engine takes advantage of lazy reading, multi-threading and performance characteristics of modern SSD drives to significantly improve the performance of reading and writing compared to the 1st edition engine.
We have done our best to ensure that the two editions parse csv files as similarly as possible, but in case there are differences that affect your code, you can use the with_edition() or local_edition() functions to temporarily change the edition of readr for a section of code:

- with_edition(1, read_csv("my_file.csv")) will read my_file.csv with the 1st edition of readr.
- readr::local_edition(1) placed at the top of your function or script will use the 1st edition for the rest of the function or script.
We will continue to support the 1st edition for a number of releases, but our goal is to ensure that the 2nd edition is uniformly better than the 1st edition so we plan to eventually deprecate and then remove the 1st edition code.
Reading multiple files at once
The 2nd edition has built-in support for reading sets of files with the same columns into one output table in a single command. Just pass the filenames to be read in the same vector to the reading function.
First we generate some files to read by splitting the nycflights dataset by airline.
library(nycflights13)
purrr::iwalk(
  split(flights, flights$carrier),
  ~ vroom::vroom_write(.x, glue::glue("/tmp/flights_{.y}.tsv"), delim = "\t")
)

Then we can efficiently read them into one tibble by passing the filenames directly to readr.
If the filenames contain data, such as the date when the sample was collected, use the id argument to include the paths as a column in the data. You will likely have to post-process the paths to keep only the relevant portion for your use case.
library(dplyr)
files <- fs::dir_ls(path = "/tmp", glob = "*flights*tsv")
files
#> /tmp/flights_9E.tsv /tmp/flights_AA.tsv /tmp/flights_AS.tsv 
#> /tmp/flights_B6.tsv /tmp/flights_DL.tsv /tmp/flights_EV.tsv 
#> /tmp/flights_F9.tsv /tmp/flights_FL.tsv /tmp/flights_HA.tsv 
#> /tmp/flights_MQ.tsv /tmp/flights_OO.tsv /tmp/flights_UA.tsv 
#> /tmp/flights_US.tsv /tmp/flights_VX.tsv /tmp/flights_WN.tsv 
#> /tmp/flights_YV.tsv
readr::read_tsv(files, id = "path")
#> Rows: 336776 Columns: 20
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
#> dttm  (1): time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 336,776 x 20
#>    path         year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr>       <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 /tmp/fligh…  2013     1     1      810            810         0     1048
#>  2 /tmp/fligh…  2013     1     1     1451           1500        -9     1634
#>  3 /tmp/fligh…  2013     1     1     1452           1455        -3     1637
#>  4 /tmp/fligh…  2013     1     1     1454           1500        -6     1635
#>  5 /tmp/fligh…  2013     1     1     1507           1515        -8     1651
#>  6 /tmp/fligh…  2013     1     1     1530           1530         0     1650
#>  7 /tmp/fligh…  2013     1     1     1546           1540         6     1753
#>  8 /tmp/fligh…  2013     1     1     1550           1550         0     1844
#>  9 /tmp/fligh…  2013     1     1     1552           1600        -8     1749
#> 10 /tmp/fligh…  2013     1     1     1554           1600        -6     1701
#> # … with 336,766 more rows, and 12 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

Lazy reading
Like vroom, the 2nd edition uses lazy reading by default. This means when you first call a read_*() function the delimiters and newlines throughout the entire file are found, but the data is not actually read until it is used in your program. This can provide substantial speed improvements for reading character data. It is particularly useful during interactive exploration of only a subset of a full dataset.
However, this also means that problematic values are not necessarily seen immediately, only when they are actually read. Because of this, a warning will be issued the first time a problem is encountered, which may happen well after the initial read.
Run
problems() on your dataset to read the entire dataset and return all of the problems found. Run
problems(lazy = TRUE) if you only want to retrieve the problems found so far.
Deleting files after reading is also impacted by laziness. On Windows, open files cannot be deleted while a process has them open. Because readr keeps a file open when reading lazily, you cannot read a file and then immediately delete it. readr will in most cases close the file once it has been completely read. However, if you know you will want to delete the file after reading it, it is best to pass lazy = FALSE when reading the file.
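As a small illustration (the file name and the malformed value here are invented for this example), passing lazy = FALSE forces an eager read, so problems surface immediately and the file handle is released as soon as the read completes:

```r
library(readr)

# A tiny file with one value that cannot be parsed as a double
# (file name is just for this example)
writeLines(c("x,y", "1,2", "oops,4"), "lazy-demo.csv")

# lazy = FALSE reads eagerly: problems are found right away and the
# file handle is released once the read completes
df <- read_csv("lazy-demo.csv", col_types = "dd", lazy = FALSE)

# problems() returns a tibble describing every parsing failure
probs <- problems(df)

# Because nothing holds the file open, deleting it works immediately,
# even on Windows
file.remove("lazy-demo.csv")
```

Here problems() reports the row and column where "oops" failed to parse as a double, and the file can be deleted straight away because no lazy reader is holding it open.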
Delimiter guessing
The 2nd edition supports automatic guessing of delimiters. This feature is inspired by the automatic guessing in
data.table::fread(), though the precise method used to perform the guessing differs. Because of this you can now use
read_delim() without specifying a delim argument in many cases.
x <- read_delim(readr_example("mtcars.csv"))
#> Rows: 32 Columns: 11
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

New column specification output
On February 11, 2021 we conducted a survey on Twitter asking for the community’s opinion on the column specification output in readr. We received over 750 😲 responses, and the survey revealed a lot of useful information:
- 3/4 of respondents found printing the column specifications helpful. 👍
- 2/3 of respondents preferred the 2nd edition output to the 1st edition output. 💅
- Only 1/5 of respondents correctly knew how to suppress printing of the column specifications. 🤯
Based on these results we have added two new ways to more easily suppress the column specification printing.
- Use read_csv(show_col_types = FALSE) to disable printing for a single function call.
- Use options(readr.show_col_types = FALSE) to disable printing for the entire session.
We will also continue to print the column specifications and use the new style output.
Note you can still obtain the old output style by printing the column specification object directly.
spec(x)
#> cols(
#>   mpg = col_double(),
#>   cyl = col_double(),
#>   disp = col_double(),
#>   hp = col_double(),
#>   drat = col_double(),
#>   wt = col_double(),
#>   qsec = col_double(),
#>   vs = col_double(),
#>   am = col_double(),
#>   gear = col_double(),
#>   carb = col_double()
#> )

Or show the new style by calling summary() on the specification object.
summary(spec(x))
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (11): mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb

Column selection
The 2nd edition introduces a new argument, col_select, which makes selecting columns to keep (or omit) more straightforward than before. col_select uses the same interface as
dplyr::select(), so you can perform very flexible selection operations.
Select with the column names directly.
data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(year, flight, tailnum))
#> Rows: 32729 Columns: 3
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (1): tailnum
#> dbl (2): year, flight
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Or by numeric column position.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(1, 2))
#> Rows: 32729 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (2): year, month
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Drop columns by name by prefixing them with -.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = c(-dep_time, -(air_time:time_hour)))
#> Rows: 32729 Columns: 13
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr (4): carrier, tailnum, origin, dest
#> dbl (9): year, month, day, sched_dep_time, dep_delay, arr_time, sched_a...
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Use the selection helpers such as ends_with().

data <- read_tsv("/tmp/flights_AA.tsv", col_select = ends_with("time"))
#> Rows: 32729 Columns: 5
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> dbl (5): dep_time, sched_dep_time, arr_time, sched_arr_time, air_time
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Or even rename columns by using a named list.

data <- read_tsv("/tmp/flights_AA.tsv", col_select = list(plane = tailnum, everything()))
#> Rows: 32729 Columns: 19
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: "\t"
#> chr   (4): carrier, tailnum, origin, dest
#> dbl  (14): year, month, day, dep_time, sched_dep_time, dep_delay, arr_t...
#> dttm  (1): time_hour
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

data
#> # A tibble: 32,729 x 19
#>    plane   year month   day dep_time sched_dep_time dep_delay arr_time
#>    <chr>  <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1 N619AA  2013     1     1      542            540         2      923
#>  2 N3ALAA  2013     1     1      558            600        -2      753
#>  3 N3DUAA  2013     1     1      559            600        -1      941
#>  4 N633AA  2013     1     1      606            610        -4      858
#>  5 N3EMAA  2013     1     1      623            610        13      920
#>  6 N3BAAA  2013     1     1      628            630        -2     1137
#>  7 N3CYAA  2013     1     1      629            630        -1      824
#>  8 N3GKAA  2013     1     1      635            635         0     1028
#>  9 N4WNAA  2013     1     1      656            700        -4      854
#> 10 N5FMAA  2013     1     1      656            659        -3      949
#> # … with 32,719 more rows, and 11 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, origin <chr>,
#> #   dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>
Name repair
Often the names of columns in the original dataset are not ideal to work with. The 2nd edition uses the same name_repair argument as in the tibble package, so you can use one of the default name repair strategies or provide a custom function. One useful approach is to use the janitor::make_clean_names() function.
read_tsv("/tmp/flights_AA.tsv", name_repair = janitor::make_clean_names)
#> # A tibble: 32,729 x 19
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <dbl> <dbl> <dbl>    <dbl>          <dbl>     <dbl>    <dbl>
#>  1  2013     1     1      542            540         2      923
#>  2  2013     1     1      558            600        -2      753
#>  3  2013     1     1      559            600        -1      941
#>  4  2013     1     1      606            610        -4      858
#>  5  2013     1     1      623            610        13      920
#>  6  2013     1     1      628            630        -2     1137
#>  7  2013     1     1      629            630        -1      824
#>  8  2013     1     1      635            635         0     1028
#>  9  2013     1     1      656            700        -4      854
#> 10  2013     1     1      656            659        -3      949
#> # … with 32,719 more rows, and 12 more variables: sched_arr_time <dbl>,
#> #   arr_delay <dbl>, carrier <chr>, flight <dbl>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>
read_tsv("/tmp/flights_AA.tsv", name_repair = ~ janitor::make_clean_names(., case = "lower_camel"))
#> # A tibble: 32,729 x 19
#>     year month   day depTime schedDepTime depDelay arrTime schedArrTime
#>    <dbl> <dbl> <dbl>   <dbl>        <dbl>    <dbl>   <dbl>        <dbl>
#>  1  2013     1     1     542          540        2     923          850
#>  2  2013     1     1     558          600       -2     753          745
#>  3  2013     1     1     559          600       -1     941          910
#>  4  2013     1     1     606          610       -4     858          910
#>  5  2013     1     1     623          610       13     920          915
#>  6  2013     1     1     628          630       -2    1137         1140
#>  7  2013     1     1     629          630       -1     824          810
#>  8  2013     1     1     635          635        0    1028          940
#>  9  2013     1     1     656          700       -4     854          850
#> 10  2013     1     1     656          659       -3     949          959
#> # … with 32,719 more rows, and 11 more variables: arrDelay <dbl>,
#> #   carrier <chr>, flight <dbl>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   airTime <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   timeHour <dttm>

UTF-16 and UTF-32 support
The 2nd edition now has much better support for the multi-byte UTF-16 and UTF-32 Unicode encodings. When files with these encodings are read, they are automatically converted to UTF-8 internally in an efficient streaming fashion.
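For example, you can tell readr the encoding of a file via locale() and it will handle the conversion to UTF-8 for you. A minimal sketch (the UTF-16LE file here is generated purely for illustration):

```r
library(readr)

# Write a small UTF-16LE encoded csv (created here just for the example)
con <- file("utf16-demo.csv", open = "w", encoding = "UTF-16LE")
writeLines(c("name,city", "José,Zürich"), con)
close(con)

# Declare the encoding via locale(); readr converts the file to UTF-8
# internally while reading
df <- read_csv("utf16-demo.csv", locale = locale(encoding = "UTF-16LE"),
               show_col_types = FALSE)

file.remove("utf16-demo.csv")
```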
Control over quoting and escaping when writing
You can now explicitly control how fields are quoted and escaped when writing with the quote and escape arguments to write_*() functions.
quote has three options:

- "needed" - quote fields only when needed.
- "all" - always quote all fields.
- "none" - never quote any fields.
 
escape also has three options, to control how quote characters are escaped:

- "double" - use double quotes to escape a quote character.
- "backslash" - use a backslash to escape a quote character.
- "none" - do nothing to escape quote characters.
 
We hope these options will give people the flexibility they need when writing files using readr.
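For instance (the file name and data are invented for this example), to quote every field and escape embedded quotes with backslashes rather than by doubling them:

```r
library(readr)

df <- data.frame(x = c('contains a "quoted" word', "plain"), y = 1:2)

# Quote all fields, and escape any embedded quote characters with a
# backslash instead of doubling them
write_csv(df, "quoting-demo.csv", quote = "all", escape = "backslash")

# Inspect the raw output, then clean up
lines <- readLines("quoting-demo.csv")
file.remove("quoting-demo.csv")
```

With escape = "double" (the csv convention) the embedded quotes would instead be written as "" inside the quoted field.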
Literal data
In the 1st edition the reading functions treated any input with a newline in it, or any vector of length > 1, as literal data. In the 2nd edition, vectors of length > 1 are now assumed to correspond to multiple files. Because of this we now have a more explicit way to represent literal data: put I() around the input.
readr::read_csv(I("a,b\n1,2"))
#> Rows: 1 Columns: 2
#> ── Column specification ───────────────────────────────────────────────────
#> Delimiter: ","
#> dbl (2): a, b
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2

Lighter installation requirements
readr should now be much easier to install. Previous versions of readr used the Boost C++ libraries to do some of the numeric parsing. While these are well-written, robust libraries, the BH package which contains them has a large number of files (1500+) which can take a long time to install. In addition, the code within these headers is complicated and can take a large amount of memory (2+ GB) to compile, which made it challenging to compile readr from source in some cases.
readr no longer depends on Boost or the BH package, so should compile more quickly in most cases.
Deprecated and superseded functions and features
- melt_csv(), melt_delim(), melt_tsv() and melt_fwf() have been superseded by functions of the same name in the meltr package, and the versions in readr have been deprecated. These functions rely on the 1st edition parsing code and would be challenging to update to the new parser. When the 1st edition parsing code is eventually removed from readr they will be removed as well.
- read_table2() has been renamed to read_table() and read_table2() has been deprecated. Most users seem to expect read_table() to work like utils::read.table(), so the different names caused confusion. If you want the previous strict behavior of read_table() you can use read_fwf() with fwf_empty() directly (#717).
- Normalizing newlines in files with just carriage returns (\r) is no longer supported. The last major OS to use only CR as the newline was ‘classic’ Mac OS, which had its final release in 2001.
License changes
We are systematically re-licensing tidyverse and r-lib packages to use the MIT license, to make our package licenses as clear and permissive as possible.
To this end the readr and vroom packages are now released under the MIT license.
Acknowledgements
A big thanks to everyone who helped make this release possible by testing the development versions, asking questions, providing reprexes, writing code and more! @Aariq, @adamroyjones, @antoine-sachet, @basille, @batpigandme, @benjaminhlina, @bigey, @billdenney, @binkleym, @BrianOB, @cboettig, @CTMCBP, @Dana996, @DarwinAwardWinner, @deeenes, @dernst, @dicorynia, @estroger34, @FixTestRepeat, @GegznaV, @giocomai, @GiuliaPais, @hadley, @HedvigS, @HenrikBengtsson, @hidekoji, @hongooi73, @hsbadr, @idshklein, @jasyael, @JeremyPasco, @jimhester, @jonasfoe, @jzadra, @KasperThystrup, @keesdeschepper, @kingcrimsontianyu, @KnutEBakke, @krlmlr, @larnsce, @ldecicco-USGS, @M3IT, @maelle, @martinmodrak, @meowcat, @messersc, @mewu3, @mgperry, @michaelquinn32, @MikeJohnPage, @mine-cetinkaya-rundel, @msberends, @nbenn, @niheaven, @peranti, @petrbouchal, @pfh, @pgramme, @Raesu, @rmcd1024, @rmvpaeme, @sebneus, @seth127, @Shians, @sonicdoe, @svraka, @timothy-barry, @tmalsburg, @vankesteren, @xuqingyu, and @yutannihilation.