readxl 1.1.0

Photo by David Hood

We are pleased to announce that readxl 1.1.0 is now available on CRAN. Learn more about readxl at http://readxl.tidyverse.org.

The readxl package makes it easy to get tabular data out of Excel files and into R with code, not mouse clicks. It supports both the legacy .xls format and the modern XML-based .xlsx format. readxl is expressly designed to be easy to install and use on all operating systems. Therefore it has no external dependencies, such as Java or Perl, which have historically been a source of aggravation with some R packages that read Excel files.

The easiest way to install the latest version from CRAN is to install the whole tidyverse.

install.packages("tidyverse")

Alternatively, install just readxl from CRAN:

install.packages("readxl")

Regardless, you will still need to load readxl explicitly via library(readxl), because it is not a core tidyverse package loaded via library(tidyverse). Do this like so in your scripts:

library(readxl)

readxl was last released almost exactly one year ago, in April 2017, at version 1.0.0. The associated blog post summarizes many nifty new features in version 1.0.0. In contrast, version 1.1.0 is considerably less exciting for most users, but includes two important improvements:

  • Security patches in the embedded libxls library
  • Better support for detecting the format of files, i.e. whether to read as xls or xlsx

Version 1.1.0 also includes several other small features and bug fixes. For full details on all versions, see the changelog.

Read (or skip!) to the end for a story about how readxl was used in a recent Los Angeles Times piece about homeless arrests. Or, rather, was not used.

Security fixes for libxls

readxl embeds the library libxls in order to read xls files. Three security vulnerabilities have been identified in libxls and were shown to affect readxl (CVE = Common Vulnerabilities and Exposures):

Those have now been addressed upstream and version 1.1.0 of readxl embeds a version of libxls that includes those fixes.

Two other CVEs mention readxl, but actually concern parts of libxls unused by readxl or problems that were identified and fixed before readxl existed. CVE-2017-12111, CVE-2017-2919

The recent maintenance of libxls by Evan Miller (@evanmiller) is a very positive development and has allowed us to close many readxl issues related to crashes or other unsavory behaviour seen when reading specific xls files.

Is it xls or xlsx?

readxl has 3 main functions for reading Excel files:

read_excel() has gotten better at inferring the file format and this functionality is now exposed, in case others need access. This is implemented in excel_format(), which first calls format_from_ext() to consult the extension and may then fall back to format_from_signature() to consult the file’s signature, a.k.a. its “magic number”.

format_from_ext() and, therefore, excel_format(), now recognize more file extensions in the xlsx family, namely .xlsm, .xltx, and .xltm, in addition to .xlsx.

library(readxl)
library(tibble)

tibble(
  excel_paths  = c("a.xlsx", "b.xls", "c.xlsm", "d.xltx", "e.jpg", "f.csv", "g"),
  format = excel_format(excel_paths, guess = FALSE) # consult extension **only**
)
#> # A tibble: 7 x 2
#>   excel_paths format
#>   <chr>       <chr> 
#> 1 a.xlsx      xlsx  
#> 2 b.xls       xls   
#> 3 c.xlsm      xlsx  
#> 4 d.xltx      xlsx  
#> 5 e.jpg       <NA>  
#> 6 f.csv       <NA>  
#> 7 g           <NA>

format_from_signature() and, therefore, excel_format() can determine the format of extensionless Excel files.

guess_my_format <- c(
  readxl_example("clippy.xls"),
  readxl_example("datasets.xlsx"),
  file.path(R.home("doc"), "html", "logo.jpg")
)

## copy and strip the extension
filename <- tools::file_path_sans_ext(basename(guess_my_format))
tmpdir <- tempdir()
file.copy(from = guess_my_format, to = file.path(tmpdir, filename))
#> [1] TRUE TRUE TRUE

tibble(
  filename = filename,
  format = excel_format(file.path(tmpdir, filename))
)
#> # A tibble: 3 x 2
#>   filename format
#>   <chr>    <chr> 
#> 1 clippy   xls   
#> 2 datasets xlsx  
#> 3 logo     <NA>

readxl in the real world

According to METACRAN, readxl has over 2.7 million cumulative downloads and is typically in the top 50 CRAN packages in terms of monthly downloads. readxl has relatively few reverse dependencies (only about 70), so I believe the vast majority of usage occurs in users’ R scripts. And these users throw an incredibly diverse set of Excel files at readxl, mostly with quiet success.

A recent failure, however, inspired an upgrade of readxl’s parsing of xlsx files. Christine Zhang (@underthecurve) did the data work behind a front-page Los Angeles Times story about homeless arrests. The Los Angeles Police Department provides daily arrest logs in the form of xlsx files, written by an unknown third-party tool. These files are a literal implementation of the “minimal conformant SpreadsheetML package” and we had never seen such a specimen before. Both R/tidyverse/readxl and Python/pandas/xlrd refused to read these files. Undeterred, Christine took advantage of the fact that xlsx is “just” a bunch of XML files and took matters into her own hands. But I got busy, making sure that readxl would be ready the next time.

Read the full backstory behind the LA Times story here: How We Found New Patterns in LA’s Homeless Arrest Data. In the end, Christine got her job done and readxl is a more capable package as a result of this challenge.

Future outlook

I predict the next user-visible changes in readxl will be improvements around column type specification, cell coercion, and reading from more general forms of input, e.g., from a URL. You can see other improvements and bug fixes that are on the radar in readxl’s GitHub issues.

The next release will also have a breaking-ish change around name repair. readxl will switch to tibble::set_tidy_names(), which remediates missing and duplicate variable names. I will make this change soon in the dev version, so that interested users can begin to adjust.

Acknowledgements

I am very thankful to the maintainers of the embedded libraries, especially the recent work on libxls by David Hoerl (@dhoerl) and Evan Miller (@evanmiller). readxl includes a great deal of compiled code, from disparate sources, and Jim Hester (@jimhester) is a fantastic troubleshooter. David Hood (@thoughtfulbloke) delivered some delightful Clippy photos (featured here), in response to an absurd Twitter request.

A big thanks goes out to the 86 users who contributed issues and pull requests since the previous readxl release: @aaa34169, @afdta, @alexeyknorre, @alexhallam, @anjurad, @arnyeinstein, @batpigandme, @bbrewington, @bellafeng, @burchill, @chrisholbrook, @Courvoisier13, @danielsjf, @DavisVaughan, @dchiu911, @Deepu298, @dpprdan, @ea-guerette, @espinielli, @gergness, @gp2x, @heseber, @hlynurhallgrims, @hrecht, @huftis, @hughmarera, @iiLaurens, @ilpepe, @Ironholds, @jameshowison, @jamesLSI, @jcolomb, @jebyrnes, @jekriske-lilly, @jennybc, @jeroen, @jimhester, @jjcad, @Jmarks199, @KKPMW, @krlmlr, @kwstat, @KyleHaynes, @Lu2017, @lz1nwm, @m-macaskill, @mbeer, @mdbauer, @melikovk, @MichaelChirico, @MidhunT, @MikhailLagutin, @mplatzer, @msgoussi, @nacnudus, @nealrichardson, @nick-ulle, @nortonle, @oozdmr, @PMassicotte, @ramanan82, @reinderien, @reinierv4, @robbriers, @RobertMyles, @rsbivand, @rstub, @ruaridhw, @sebastianschweer, @shoebodh, @simonthelwall, @slfan2013, @smasuda, @sncr-github, @stephlocke, @steve4444, @sz-cgt, @t-kalinowski, @tarunparmar, @tbeu, @thothal, @tomsing1, @tres-pitt, @vkapartzianis, @willtudorevans, and @zauster.

Contents
Upcoming events
Washington, DC
Aug 23-24
This two-day course will provide an overview of using R for supervised learning. The session will step through the process of building, visualizing, testing, and comparing models that are focused on prediction.