stringr 1.5.0

We’re chuffed to announce the release of stringr 1.5.0. stringr provides a cohesive set of functions designed to make working with strings as easy as possible.

You can install it from CRAN with:

install.packages("stringr")

This blog post will give you an overview of the biggest changes (you can get a detailed list of all changes from the release notes). Firstly, we need to update you on some (small) breaking changes we’ve made to make stringr more consistent with the rest of the tidyverse. Then, we’ll give a quick overview of improvements to documentation and stringr’s new license. Lastly, we’ll finish off by diving into a few of the many small, but useful, functions that we’ve accumulated in the three and a half years since the last release.

library(stringr)

Breaking changes

Let’s start with the important stuff: the breaking changes. We’ve tried to keep these small and we don’t believe they’ll affect much code in the wild (they only affected ~20 of the ~1,600 packages that use stringr). But we believe they’re important to make, as a consistent set of rules makes the tidyverse as a whole more predictable and easier to learn.

Recycling rules

stringr functions now consistently implement the tidyverse recycling rules¹, which are stricter than the previous rules in two ways. Firstly, we no longer recycle a shorter vector just because the longer vector’s length is an integer multiple of its length:

str_detect(letters, c("x", "y"))
#> Error in `str_detect()`:
#> ! Can't recycle `string` (size 26) to match `pattern` (size 2).

Secondly, a 0-length vector no longer implies a 0-length output. Instead it’s recycled using the usual rules:

str_detect(letters, character())
#> Error in `str_detect()`:
#> ! Can't recycle `string` (size 26) to match `pattern` (size 0).
str_detect("x", character())
#> logical(0)

Neither of these situations occurs very commonly in data analysis, so this change primarily brings consistency with the rest of the tidyverse without affecting much existing code.

Finally, stringr functions are generally a little stricter because we require the inputs to be vectors of some type. Again, this is unlikely to affect your data analysis code and will result in a clearer error if you accidentally pass in something weird:

str_detect(mean, "x")
#> Error in `str_detect()`:
#> ! `string` must be a vector, not a function.
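
If you have code that relied on the old recycling behaviour, there are a couple of ways you might update it. The following is a minimal sketch, not an official migration recipe: the `pattern`, `explicit`, and `any_of` names are ours, and which rewrite is right depends on what the original code meant to do.

```r
library(stringr)

# Old code may have relied on c("a", "b") being silently recycled 13 times.
# Spelling out the recycling keeps that behaviour, but makes it explicit.
pattern <- rep(c("a", "b"), length.out = length(letters))
explicit <- str_detect(letters, pattern)
sum(explicit)
#> [1] 2

# If the intent was really "does each string match any of these?",
# a single alternation pattern says so directly.
any_of <- str_detect(letters, "x|y")
letters[any_of]
#> [1] "x" "y"
```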

Empty patterns

In many stringr functions, "" will match or split on every character. This is motivated by base R’s strsplit():

strsplit("abc", "")[[1]]
#> [1] "a" "b" "c"
str_split("abc", "")[[1]]
#> [1] "a" "b" "c"

When creating stringr (over 13 years ago!), I took this idea and ran with it, implementing similar support in every function where it might possibly work. But I missed an important problem with str_detect().

What should str_detect(x, "") return? You can argue two ways:

To be consistent with str_split(), it should return TRUE whenever there are characters to match, i.e. x != "".
It’s common to build up a set of possible matches by doing str_flatten(matches, "|"). What should this match if matches is empty? Ideally it would match nothing, implying that str_detect(x, "") should be equivalent to x == "".

This inconsistency potentially leads to some subtle bugs, so use of "" in str_detect() (and a few other related functions) is now an error:

str_detect(letters, "")
#> Error in `str_detect()`:
#> ! `pattern` can't be the empty string (`""`).
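
If you build patterns with str_flatten(matches, "|"), one way to cope with the new error is to handle the empty case yourself. Here is a small sketch: `detect_any()` is a hypothetical helper written for this post, not a stringr function, and it assumes that an empty set of matches should match nothing.

```r
library(stringr)

# Hypothetical helper: TRUE where `x` matches any regular expression in `matches`
detect_any <- function(x, matches) {
  if (length(matches) == 0) {
    # No candidate patterns, so we decide that nothing can match
    return(rep(FALSE, length(x)))
  }
  str_detect(x, str_flatten(matches, "|"))
}

detect_any(c("apple", "pear"), c("pp", "zz"))
#> [1]  TRUE FALSE
detect_any(c("apple", "pear"), character())
#> [1] FALSE FALSE
```

Note that this only guards against an empty `matches`; an element that is itself "" would still need separate handling.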

Documentation and licensing

Now that we’ve got the breaking changes out of the way we can focus on the new stuff 😃. Most importantly, there’s a new vignette that provides some advice if you’re transitioning from (or to) base R’s string functions: vignette("from-base", package = "stringr"). It was written by Sara Stoudt during the 2019 Tidyverse developer day, and has finally made it to the released version!

We’ve also spent a bunch of time reviewing the documentation, particularly the topic titles and descriptions. They’re now more informative and less duplicative, which hopefully makes it easier to find the function that you’re looking for. See the complete list of functions in the reference index.

Finally, stringr is now officially re-licensed as MIT.

New features

The biggest improvement is to str_view() which has gained a bunch of new features, including using the cli package so it can work in more places. We also have a grab bag of new functions that fill in small functionality gaps.

str_view()

str_view() uses ANSI colouring rather than an HTML widget. This means it works in more places and requires fewer dependencies. str_view() now:

Displays strings with special characters:

x <- c("\\", "\"\nabcdef\n\"")
x
#> [1] "\\"             "\"\nabcdef\n\""

str_view(x)
#> [1] │ \
#> [2] │ "
#>     │ abcdef
#>     │ "

Highlights unusual whitespace characters:
```
str_view("\t")
#> [1] │ {\t}
```

By default, only shows matching strings:

str_view(fruit, "(.)\\1")
#>  [1] │ a<pp>le
#>  [5] │ be<ll> pe<pp>er
#>  [6] │ bilbe<rr>y
#>  [7] │ blackbe<rr>y
#>  [8] │ blackcu<rr>ant
#>  [9] │ bl<oo>d orange
#> [10] │ bluebe<rr>y
#> [11] │ boysenbe<rr>y
#> [16] │ che<rr>y
#> [17] │ chili pe<pp>er
#> [19] │ cloudbe<rr>y
#> [21] │ cranbe<rr>y
#> [23] │ cu<rr>ant
#> [28] │ e<gg>plant
#> [29] │ elderbe<rr>y
#> [32] │ goji be<rr>y
#> [33] │ g<oo>sebe<rr>y
#> [38] │ hucklebe<rr>y
#> [47] │ lych<ee>
#> [50] │ mulbe<rr>y
#> ... and 9 more

(This makes str_view_all() redundant and hence deprecated.)
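
If you still want to see every string, matching or not, the `match` argument controls which elements are displayed. The sketch below reflects our reading of the str_view() documentation (`TRUE` shows only matches, `NA` shows everything, `FALSE` shows only non-matches); the vector `x` is made up for illustration, and we’ve left the printed output out because it depends on your terminal.

```r
library(stringr)

x <- c("apple", "pear", "kiwi")

# Default (match = TRUE): only strings containing a doubled letter are shown
str_view(x, "(.)\\1")

# match = NA shows every string, highlighting matches where they exist
str_view(x, "(.)\\1", match = NA)

# match = FALSE shows only the strings that do not match
str_view(x, "(.)\\1", match = FALSE)
```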

Comparing strings

There are three new functions related to comparing strings:

str_equal() compares two character vectors using Unicode rules, optionally ignoring case:

str_equal("a", "A")
#> [1] FALSE
str_equal("a", "A", ignore_case = TRUE)
#> [1] TRUE

str_rank() completes the set of order/rank/sort functions:

str_rank(c("a", "c", "b", "b"))
#> [1] 1 4 2 2
# compare to:
str_order(c("a", "c", "b", "b"))
#> [1] 1 3 4 2

str_unique() returns unique values, optionally ignoring case:

str_unique(c("a", "a", "A"))
#> [1] "a" "A"
str_unique(c("a", "a", "A"), ignore_case = TRUE)
#> [1] "a"
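
To see what "using Unicode rules" buys you in str_equal(), it helps to compare two strings that look identical but are encoded differently. This is a small illustration along the lines of the str_equal() documentation; the variable names `a1` and `a2` are just for the example.

```r
library(stringr)

a1 <- "\u00e1"   # precomposed "á" (one code point)
a2 <- "a\u0301"  # "a" followed by a combining acute accent (two code points)

# Base R compares the underlying code points, so these differ
a1 == a2
#> [1] FALSE

# str_equal() treats canonically equivalent text as equal
str_equal(a1, a2)
#> [1] TRUE
```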

Splitting

str_split() gains two useful variants:

str_split_1() is tailored for the special case of splitting up a single string. It returns a character vector, not a list, and errors if you try and give it multiple values:
```
str_split_1("x-y-z", "-")
#> [1] "x" "y" "z"
str_split_1(c("x-y", "a-b-c"), "-")
#> Error in `str_split_1()`:
#> ! `string` must be a single string, not a character vector.
```
It’s a shortcut for the common pattern of unlist(str_split(x, " ")).

str_split_i() extracts a single piece from the split string:

x <- c("a-b-c", "d-e", "f-g-h-i")
str_split_i(x, "-", 2)
#> [1] "b" "e" "g"

str_split_i(x, "-", 4)
#> [1] NA  NA  "i"

str_split_i(x, "-", -1)
#> [1] "c" "e" "i"
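
To make the relationship with str_split() concrete, here is a rough sketch of what the two variants replace. The `pieces` and `manual` names are ours, and the vapply() version is just one way you might have written this before; it isn’t how stringr implements it.

```r
library(stringr)

x <- c("a-b-c", "d-e", "f-g-h-i")

# str_split_1() on a single string gives the same pieces as unlisting str_split()
identical(str_split_1(x[1], "-"), unlist(str_split(x[1], "-")))

# str_split_i(x, "-", 2) picks the second piece of each split string
pieces <- str_split(x, "-")
manual <- vapply(pieces, function(p) p[2], character(1))
all(str_split_i(x, "-", 2) == manual)
```

Both comparisons should come back TRUE, which is a handy sanity check if you’re replacing older code with the new functions.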

Miscellaneous

str_escape() escapes regular expression metacharacters, providing an alternative to fixed() if you want to compose a pattern from user supplied strings:
```
str_view("[hello]", str_escape("[]"))
```

str_extract() can now extract a capturing group instead of the complete match:

x <- c("Chapter 1", "Section 2.3", "Chapter 3", "Section 4.1.1")
str_extract(x, "([A-Za-z]+) ([0-9.]+)", group = 1)
#> [1] "Chapter" "Section" "Chapter" "Section"
str_extract(x, "([A-Za-z]+) ([0-9.]+)", group = 2)
#> [1] "1"     "2.3"   "3"     "4.1.1"

str_flatten() gains a last argument which is used to power the new str_flatten_comma():

str_flatten_comma(c("cats", "dogs", "mice"))
#> [1] "cats, dogs, mice"
str_flatten_comma(c("cats", "dogs", "mice"), last = " and ")
#> [1] "cats, dogs and mice"
str_flatten_comma(c("cats", "dogs", "mice"), last = ", and ")
#> [1] "cats, dogs, and mice"

# correctly handles the two element case with the Oxford comma
str_flatten_comma(c("cats", "dogs"), last = ", and ")
#> [1] "cats and dogs"

str_like() works like str_detect() but uses SQL’s LIKE syntax:

fruit <- c("apple", "banana", "pear", "pineapple")
fruit[str_like(fruit, "%apple")]
#> [1] "apple"     "pineapple"
fruit[str_like(fruit, "p__r")]
#> [1] "pear"
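
Coming back to str_escape(), the difference it makes is easiest to see with str_detect(). In this short sketch, `user_input` and `candidates` are made-up values standing in for whatever your users supply:

```r
library(stringr)

# Pretend this came from a user; "." would normally match any character
user_input <- "a.b"
candidates <- c("a.b", "axb")

# Used as-is, the "." acts as a wildcard
str_detect(candidates, user_input)
#> [1] TRUE TRUE

# Escaped, the "." is matched literally
str_detect(candidates, str_escape(user_input))
#> [1]  TRUE FALSE

# Unlike fixed(), the escaped piece can be combined with other regex syntax
str_detect(candidates, str_c("^", str_escape(user_input), "$"))
#> [1]  TRUE FALSE
```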

Acknowledgements

A big thanks to all 114 folks who contributed to this release through pull requests and issues! @aaronrudkin, @adisarid, @AleSR13, @anfederico, @AR1337, @arisp99, @avila, @balthasars, @batpigandme, @bbarros50, @bbo2adwuff, @bensenmansen, @bfgray3, @Bisaloo, @bonmac, @botan, @bshor, @carlganz, @chintanp, @chrimaho, @chris2b5, @clemenshug, @courtiol, @dachosen1, @dan-reznik, @datawookie, @david-romano, @DavisVaughan, @dbarrows, @deann88, @denrou, @deschen1, @dsg38, @dtburk, @elbersb, @geotheory, @ghost, @GrimTrigger88, @hadley, @iago-pssjd, @IndigoJay, @jashapiro, @JBGruber, @jennybc, @jimhester, @jjesusfilho, @jmbarbone, @joethorley, @jonas-hag, @jonthegeek, @joshyam-k, @jpeacock29, @jzadra, @KasperThystrup, @kendonB, @kieran-mace, @kiernann, @Kodiologist, @leej3, @leowill01, @LimaRAF, @lmwang9527, @Ludsfer, @lz01, @Marcade80, @Mashin6, @MattCowgill, @maxheld83, @mgirlich, @MichaelChirico, @michaelweylandt, @mikeaalv, @misea, @mitchelloharawild, @mkvasnicka, @mrcaseb, @mtnbikerjoshua, @mwip, @nachovoss, @neonira, @Nischal-Karki-ATW, @oliverbeagley, @orgadish, @pachadotdev, @PathosEthosLogos, @pdelboca, @petermeissner, @phargarten2, @programLyrique, @psads-git, @psychelzh, @PursuitOfDataScience, @richardjtelford, @richelbilderbeek, @rjpat, @romatik, @rressler, @rwbaer, @salim-b, @sammo3182, @sastoudt, @SchmidtPaul, @seasmith, @selesnow, @slee981, @Tal1987, @tanzatanza, @THChan11, @travis-leith, @vladtarko, @wdenton, @wurli, @Yingjie4Science, and @zeehio.

¹ You might wonder why we developed our own set of recycling rules for the tidyverse instead of using the base R rules. That’s because, unfortunately, there isn’t a consistent set of rules used by base R, but a suite of variations.