I’m tickled pink to announce the release of rvest 1.0.0. rvest is designed to make it easy to scrape (i.e. harvest) data from HTML web pages.
You can install it from CRAN with install.packages("rvest").
This release includes two major improvements that make it easier to extract text and tables. I also took this opportunity to tidy up the interface to better match the tidyverse standards that have emerged since rvest was created in 2012. This is a major release that marks rvest as stable. That means we promise to avoid breaking changes as much as possible, and where they are needed, we will provide a significant deprecation cycle.
You can see a full list of changes in the release notes.
It’s been a while since I took a good look at rvest, and the GitHub issues suggested that there were two sources of long-standing frustration with rvest: html_text() and html_table().
html_text() was a source of frustration because it extracts raw text from the underlying HTML. It ignores HTML’s line breaks (i.e. <br>) but preserves non-significant whitespace, making it a pain to use:
html <- minimal_html(
  "<p>
    This is a paragraph.
    This another sentence.<br>This should start on a new line
  </p>"
)
html %>% html_text() %>% writeLines()
#>
#>     This is a paragraph.
#>     This another sentence.This should start on a new line
#>
The new html_text2() function is inspired by JavaScript’s innerText() function and uses a handful of heuristics to generate more useful output:
html %>% html_text2() %>% writeLines()
#> This is a paragraph. This another sentence.
#> This should start on a new line
html_table() was frustrating because it failed on many tables that used row or column spans. I’ve now rewritten it from scratch, closely following the algorithm that browsers use. This means that there are far fewer tables for which it fails to produce useful output, and I have deprecated the fill argument because it’s no longer needed.
Here’s a little example with row span, column span, and a missing cell:
html <- minimal_html("<table>
  <tr><th>A</th><th>B</th><th>C</th></tr>
  <tr><td colspan='2' rowspan='2'>1</td><td>2</td></tr>
  <tr><td rowspan='2'>3</td></tr>
  <tr><td>4</td></tr>
</table>")

html %>%
  html_element("table") %>%
  html_table()
#> # A tibble: 3 x 3
#>       A     B     C
#>   <int> <int> <int>
#> 1     1     1     2
#> 2     1     1     3
#> 3     4    NA     3
html_table() now returns a tibble rather than a data frame (to be more compatible with the rest of the tidyverse), and its performance has been considerably improved (10x for the motivating example). It also gains new na.strings and convert arguments to better control how NAs and strings are processed. See the docs for more details.
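As a rough sketch of how these two arguments behave, the example below builds a small made-up table (the id and score columns and the "n/a" marker are purely illustrative) and reads it three ways. It assumes rvest 1.0.0 or later is installed.

```r
library(rvest)

# A hypothetical table: ids with leading zeros and a custom missing-value marker
html <- minimal_html("<table>
  <tr><th>id</th><th>score</th></tr>
  <tr><td>001</td><td>3.5</td></tr>
  <tr><td>002</td><td>n/a</td></tr>
</table>")
tbl <- html %>% html_element("table")

# Default: each column is type-converted, so "001" is read as the integer 1,
# while score stays character because "n/a" is not recognised as missing
default <- tbl %>% html_table()

# Declare "n/a" as a missing-value string so score can become numeric
with_na <- tbl %>% html_table(na.strings = "n/a")

# Turn conversion off to keep every cell exactly as a string
as_text <- tbl %>% html_table(convert = FALSE)
```

Setting convert = FALSE is the safer choice when a column holds identifiers such as zip codes, where leading zeros carry meaning.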
While it’s not a major feature, it’s worth noting that rvest is now much smaller (~100 Kb vs ~1 Mb) thanks to a rewrite of vignette("rvest") and making the SelectorGadget article web-only.
Since this is the 1.0.0 release, I included a large number of API changes to make rvest more compatible with current tidyverse conventions. Older functions have been deprecated, so existing code will continue to work (albeit with a few new warnings).
rvest now imports xml2 rather than depending on it. This is cleaner because it avoids attaching all the xml2 functions that you’re probably not going to use. To reduce the chance of breakage, rvest re-exports the xml2 functions read_html() and url_absolute(); if you use other functions, your code will now need an explicit library(xml2).
html_form() now returns an object with class rvest_form. Fields within a form now have class rvest_field, instead of a variety of classes that were lacking the rvest_ prefix. All functions for working with forms have a common html_form_ prefix.
html_node() and html_nodes() have been superseded in favor of html_element() and html_elements() since they (almost) always return elements, not nodes. This vocabulary will better match what you’re likely to see when learning about HTML.
html_session() is now session() and returns an object of class rvest_session. All functions that work with session objects now have a common session_ prefix.
Several long-deprecated functions, including xml(), have been removed.
minimal_html() (which doesn’t appear to be used by any other package) has had its arguments flipped to make it more intuitive.
guess_encoding() has been renamed to html_encoding_guess() to avoid a clash with a same-named function in another package.
repair_encoding() was deprecated because it doesn’t appear to have ever worked.
pluck() is no longer exported to avoid a clash with purrr::pluck(); if you need it, use purrr::map_chr() and friends instead.
xml_node() and xml_nodes() have been formally deprecated in favour of their html_ equivalents.
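To give a feel for the renamed functions, here is a minimal sketch that uses the new names on a made-up page and form (the markup, the /search action and the q field are all hypothetical). It assumes rvest 1.0.0 or later with xml2 installed, and it only touches local HTML, so no network access is needed.

```r
library(rvest)

html <- minimal_html("
  <h1>Title</h1>
  <p class='intro'>First</p>
  <p>Second</p>
  <form action='/search' method='get'>
    <input type='text' name='q'>
  </form>
")

# html_element() returns the first match (previously html_node())
first_p <- html %>% html_element("p") %>% html_text2()

# html_elements() returns every match (previously html_nodes())
all_p <- html %>% html_elements("p") %>% html_text2()

# xml2 is no longer attached by rvest, so other xml2 functions need an
# explicit prefix (or a library(xml2) call)
tag <- html %>% html_element("h1") %>% xml2::xml_name()

# Forms: html_form() gives an rvest_form; helpers share the html_form_ prefix
form <- html %>% html_element("form") %>% html_form()
form <- html_form_set(form, q = "rvest")
```

Old code that still calls html_node() or html_nodes() keeps working, so a migration like this can be done gradually.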
A big thanks to all the folks who helped make this release possible through their issues, comments, and pull requests 😄
@13kay, @adam52, @AgnieszkaTomczyk, @ahaseemkunjucl, @akshaynagpal, @AlanMex1990, @alex23lemm, @amjiuzi, @antoine-lizee, @arilamstein, @artemklevtsov, @batpigandme, @bbrewington, @bedantaguru, @bramtayl, @brshallo, @charleswg, @christopherhastings, @chuchu89, @conjugateprior, @cpsievert, @craigcitro, @cranknasty, @cungbac, @curtisalexander, @cwickham, @data-steve, @dbuijs, @Deleetdk, @dholstius, @DiegoKoz, @dmi3kno, @dpprdan, @englianhu, @etabeta78, @ethanbsmith, @flpezet, @garrettgman, @georgevbsantiago, @geotheory, @ghost, @gokceneraslan, @gunawebs, @hadley, @happyshows, @hauj12123, @HBossier, @hemans, @higgi13425, @himanshudhingra, @hsancen, @ignotus0001, @ilarischeinin, @IndrajeetPatil, @iProcrastinate, @jaanos, @JackWilb, @JakeRuss, @jamjaemin, @javrucebo, @jeffisabelle, @jeroen, @jeroenjanssens, @jgilfillan, @jimhester, @jjchern, @jl5000, @jlewis91, @jmgirard, @johncollins, @JohnMount, @jonathan-g, @Jonathanyni, @joranE, @joshualeond, @jpmarindiaz, @jrnold, @jrosen48, @juba, @jubjubbc, @jullybobble, @kendonB, @kevin199011, @kevinrue, @kiernann, @kjschaudt, @ktaylora, @ktmud, @kurtis14, @leoluyi, @LeslieTse, @lifan0127, @litao1105, @magic-lantern, @MarcinKosinski, @markdanese, @MichaelChirico, @mikegros, @mikemc, @MislavSag, @mitchelloharawild, @mobcdi, @Monduiz, @moodymudskipper, @mrchypark, @MrFlick, @msberends, @msgoussi, @myliserta, @mzorgdrager, @nalimilan, @neilfws, @NicolasRuth, @nitishgupta4291, @noamross, @np2201, @npjc, @oguzhanogreden, @OmarGonD, @oNIenSis, @oriolmirosa, @Osc2wall, @petermeissner, @petrbouchal, @PritishDsouza, @PriyaShaji, @pssguy, @qpmnguyen, @r2evans, @rafaminos, @ramnathv, @renkun-ken, @rentrop, @richierocks, @rjpat, @romainfrancois, @rpalsaxena, @salauer, @SamoPP, @san1289, @sco-lo-digital, @seasmith, @sfirke, @sillasgonzaga, @slowkow, @smach, @smbache, @stenevang, @StephaneKazmierczak, @stevecondylios, @swiftsam, @swishderzy, @targeteer, @tbates, @The-Janitor, @thomasd2, @tomasbarcellos, @TyGu1, @wbuchanan, @WHardyPL, 
@WilDoane, @wldnjs, @yogesh1612, @yrochat, @yutannihilation, and @zheguzai100.