rvest 1.0.0

  rvest

  Hadley Wickham

I’m tickled pink to announce the release of rvest 1.0.0. rvest is designed to make it easy to scrape (i.e. harvest) data from HTML web pages.

You can install it from CRAN with:

install.packages("rvest")

This release includes two major improvements that make it easier to extract text and tables. I also took this opportunity to tidy up the interface to better match the tidyverse standards that have emerged since rvest was created in 2012. This is a major release that marks rvest as stable. That means we promise to avoid breaking changes as much as possible, and where they are needed, we will provide a significant deprecation cycle.

You can see a full list of changes in the release notes.

New features

It’s been a while since I took a good look at rvest, and the GitHub issues suggested that there were two sources of long-standing frustration with rvest: html_text() and html_table().

html_text() was a source of frustration because it extracts raw text from underlying HTML. It ignores HTML’s line breaks (i.e. <br>) but preserves non-significant whitespace, making it a pain to use:

library(rvest)

html <- minimal_html(
  "<p>  
    This is a paragraph.
    This another sentence.<br>This should start on a new line
  </p>"
)
html %>% html_text() %>% writeLines()
#>   
#>     This is a paragraph.
#>     This another sentence.This should start on a new line
#> 

The new html_text2() is inspired by JavaScript’s innerText property and uses a handful of heuristics to generate more useful output:

html %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line

html_table() was frustrating because it failed on many tables that used row or column spans. I’ve now re-written it from scratch, closely following the algorithm that browsers use. This means that there are far fewer tables for which it fails to produce useful output, and I have deprecated the fill argument because it’s no longer needed.

Here’s a little example with row span, column span, and a missing cell:

html <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan='2' rowspan='2'>1</td><td>2</td></tr>
  <tr><td rowspan='2'>3</td></tr>
  <tr><td>4</td></tr>
</table>")

html %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 3 x 3
#>       A     B     C
#>   <int> <int> <int>
#> 1     1     1     2
#> 2     1     1     3
#> 3     4    NA     3

html_table() now returns a tibble rather than a data frame (to be more compatible with the rest of the tidyverse), and its performance has been considerably improved (10x for the motivating example). It also gains new na.strings and convert arguments to better control how NAs and strings are processed. See the docs for more details.
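As a rough sketch of how the two new arguments might be used, the example below parses a small made-up table twice. The table contents and the choice of "-" as a missing-value marker are assumptions for illustration only; check the html_table() documentation for the exact semantics.

```r
library(rvest)

# A made-up table: ids with leading zeros, and "-" standing in for a missing score
html <- minimal_html("<table>
  <tr><th>id</th><th>score</th></tr>
  <tr><td>001</td><td>4.5</td></tr>
  <tr><td>002</td><td>-</td></tr>
</table>")
table <- html %>% html_element("table")

# Treat "-" as NA while columns are converted to their guessed types
parsed <- table %>% html_table(na.strings = "-")

# Skip type conversion entirely, keeping every cell as a string
raw <- table %>% html_table(convert = FALSE)
```

Turning conversion off is handy when leading zeros matter (as with the ids here) and you would rather convert columns yourself afterwards.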

While it’s not a major feature, it’s worth noting that rvest is now much smaller (~100 Kb vs ~1 Mb) thanks to a rewrite of vignette("rvest") and making the SelectorGadget article web-only.

API changes

Since this is the 1.0.0 release, I included a large number of API changes to make rvest more compatible with current tidyverse conventions. Older functions have been deprecated, so existing code will continue to work (albeit with a few new warnings).

  • rvest now imports xml2 rather than depending on it. This is cleaner because it avoids attaching all the xml2 functions that you’re probably not going to use. To reduce the chance of breakages, rvest re-exports xml2 functions read_html() and url_absolute(); if you use other functions, your code will now need an explicit library(xml2).

  • html_form() now returns an object with class rvest_form. Fields within a form now have class rvest_field, instead of a variety of classes that were lacking the rvest_ prefix. All functions for working with forms have a common html_form_ prefix, e.g. set_values() became html_form_set().

  • html_node() and html_nodes() have been superseded in favor of html_element() and html_elements() since they (almost) always return elements, not nodes. This vocabulary will better match what you’re likely to see when learning about HTML.

  • html_session() is now session() and returns an object of class rvest_session. All functions that work with session objects now have a common session_ prefix.

  • Long-deprecated html(), html_tag(), xml() functions have been removed.

  • minimal_html() (which doesn’t appear to be used by any other package) has had its arguments flipped to make it more intuitive.

  • guess_encoding() has been renamed to html_encoding_guess() to avoid a clash with readr::guess_encoding(). repair_encoding() was deprecated because it doesn’t appear to have ever worked.

  • pluck() is no longer exported to avoid a clash with purrr::pluck(); if you need it, use purrr::map_chr() and friends instead.

  • xml_tag(), xml_node(), and xml_nodes() have been formally deprecated in favor of their html_ equivalents.
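To give a feel for what a few of these changes look like in practice, here is a small sketch using the new names. The HTML snippet and the field name q are made up for illustration, and this is only one way you might update older code, not an exhaustive migration guide.

```r
library(rvest)
library(xml2)  # needed now for xml2 functions that rvest doesn't re-export

# A made-up page with two paragraphs and a tiny form
html <- minimal_html("
  <p class='lead'>First</p>
  <p>Second</p>
  <form><input type='text' name='q'></form>
")

# html_element()/html_elements() replace html_node()/html_nodes()
first <- html %>% html_element("p")
all_p <- html %>% html_elements("p")

# xml_name() comes from xml2, which rvest no longer attaches for you
tag <- xml_name(first)

# Form helpers share the html_form_ prefix; set_values() became html_form_set()
form <- html_form(html)[[1]]
form <- html_form_set(form, q = "rvest")
```

If you only need one or two xml2 functions, calling them as xml2::xml_name() works just as well as attaching the whole package.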

Acknowledgements

A big thanks to all the folks who helped make this release possible through their issues, comments, and pull requests 😄

@13kay, @adam52, @AgnieszkaTomczyk, @ahaseemkunjucl, @akshaynagpal, @AlanMex1990, @alex23lemm, @amjiuzi, @antoine-lizee, @arilamstein, @artemklevtsov, @batpigandme, @bbrewington, @bedantaguru, @bramtayl, @brshallo, @charleswg, @christopherhastings, @chuchu89, @conjugateprior, @cpsievert, @craigcitro, @cranknasty, @cungbac, @curtisalexander, @cwickham, @data-steve, @dbuijs, @Deleetdk, @dholstius, @DiegoKoz, @dmi3kno, @dpprdan, @englianhu, @etabeta78, @ethanbsmith, @flpezet, @garrettgman, @georgevbsantiago, @geotheory, @ghost, @gokceneraslan, @gunawebs, @hadley, @happyshows, @hauj12123, @HBossier, @hemans, @higgi13425, @himanshudhingra, @hsancen, @ignotus0001, @ilarischeinin, @IndrajeetPatil, @iProcrastinate, @jaanos, @JackWilb, @JakeRuss, @jamjaemin, @javrucebo, @jeffisabelle, @jeroen, @jeroenjanssens, @jgilfillan, @jimhester, @jjchern, @jl5000, @jlewis91, @jmgirard, @johncollins, @JohnMount, @jonathan-g, @Jonathanyni, @joranE, @joshualeond, @jpmarindiaz, @jrnold, @jrosen48, @juba, @jubjubbc, @jullybobble, @kendonB, @kevin199011, @kevinrue, @kiernann, @kjschaudt, @ktaylora, @ktmud, @kurtis14, @leoluyi, @LeslieTse, @lifan0127, @litao1105, @magic-lantern, @MarcinKosinski, @markdanese, @MichaelChirico, @mikegros, @mikemc, @MislavSag, @mitchelloharawild, @mobcdi, @Monduiz, @moodymudskipper, @mrchypark, @MrFlick, @msberends, @msgoussi, @myliserta, @mzorgdrager, @nalimilan, @neilfws, @NicolasRuth, @nitishgupta4291, @noamross, @np2201, @npjc, @oguzhanogreden, @OmarGonD, @oNIenSis, @oriolmirosa, @Osc2wall, @petermeissner, @petrbouchal, @PritishDsouza, @PriyaShaji, @pssguy, @qpmnguyen, @r2evans, @rafaminos, @ramnathv, @renkun-ken, @rentrop, @richierocks, @rjpat, @romainfrancois, @rpalsaxena, @salauer, @SamoPP, @san1289, @sco-lo-digital, @seasmith, @sfirke, @sillasgonzaga, @slowkow, @smach, @smbache, @stenevang, @StephaneKazmierczak, @stevecondylios, @swiftsam, @swishderzy, @targeteer, @tbates, @The-Janitor, @thomasd2, @tomasbarcellos, @TyGu1, @wbuchanan, @WHardyPL, 
@WilDoane, @wldnjs, @yogesh1612, @yrochat, @yutannihilation, and @zheguzai100.