tidyr 1.3.0

  Hadley Wickham

We’re pleased to announce the release of tidyr 1.3.0. tidyr provides a set of tools for transforming data frames to and from tidy data, where each variable is a column and each observation is a row. Tidy data is a convention for matching the semantics and structure of your data that makes using the rest of the tidyverse (and many other R packages) much easier.

You can install it from CRAN with:

install.packages("tidyr")

This post highlights the biggest changes in this release:

  • A new family of separate_*() functions supersedes separate() and extract() and comes with useful debugging features.

  • unnest_wider() and unnest_longer() gain a bundle of useful improvements.

  • pivot_longer() gets a new cols_vary argument.

  • nest(.by) provides a new (and hopefully final) way to create nested datasets.

You should also notice generally improved errors with this release: we check function arguments more aggressively, and take care to always report the name of the function that you called, not some internal helper. As usual, you can find a full set of changes in the release notes.

library(tidyr)
library(dplyr, warn.conflicts = FALSE)

separate_*() family of functions

The biggest feature of this release is a new, experimental family of functions for separating string columns:

                                   Make columns                 Make rows
Separate with delimiter            separate_wider_delim()       separate_longer_delim()
Separate by position               separate_wider_position()    separate_longer_position()
Separate with regular expression   separate_wider_regex()

These functions collectively supersede extract(), separate(), and separate_rows() because they have more consistent names and arguments, have better performance (thanks to stringr), and provide a new approach for handling problems. The table below lists the older function that each one replaces:

                                   Make columns                     Make rows
Separate with delimiter            separate(sep = string)           separate_rows()
Separate by position               separate(sep = integer vector)   N/A
Separate with regular expression   extract()

Here I’ll focus on the wider functions because they generally present the most interesting challenges. Let’s start by grabbing some census data with the tidycensus package:

vt_census <- tidycensus::get_decennial(
  geography = "block",
  state = "VT",
  county = "Washington",
  variables = "P1_001N",
  year = 2020
)
#> Getting data from the 2020 decennial Census
#> Using the PL 94-171 Redistricting Data summary file
#> Note: 2020 decennial Census data use differential privacy, a technique that
#> introduces errors into data to preserve respondent confidentiality.
#>  Small counts should be interpreted with caution.
#>  See https://www.census.gov/library/fact-sheets/2021/protecting-the-confidentiality-of-the-2020-census-redistricting-data.html for additional guidance.
#> This message is displayed once per session.
vt_census
#> # A tibble: 2,150 × 4
#>    GEOID           NAME                                           variable value
#>    <chr>           <chr>                                          <chr>    <dbl>
#>  1 500239555021014 Block 1014, Block Group 1, Census Tract 9555.… P1_001N     21
#>  2 500239555021015 Block 1015, Block Group 1, Census Tract 9555.… P1_001N     19
#>  3 500239555021016 Block 1016, Block Group 1, Census Tract 9555.… P1_001N      0
#>  4 500239555021017 Block 1017, Block Group 1, Census Tract 9555.… P1_001N      0
#>  5 500239555021018 Block 1018, Block Group 1, Census Tract 9555.… P1_001N     43
#>  6 500239555021019 Block 1019, Block Group 1, Census Tract 9555.… P1_001N     68
#>  7 500239555021020 Block 1020, Block Group 1, Census Tract 9555.… P1_001N     30
#>  8 500239555021021 Block 1021, Block Group 1, Census Tract 9555.… P1_001N      0
#>  9 500239555021022 Block 1022, Block Group 1, Census Tract 9555.… P1_001N     18
#> 10 500239555021023 Block 1023, Block Group 1, Census Tract 9555.… P1_001N     93
#> # … with 2,140 more rows

The GEOID column is made up of four components: a 2-digit state identifier, a 3-digit county identifier, a 6-digit tract identifier, and a 4-digit block identifier. We can use separate_wider_position() to extract these into their own variables:

vt_census |>
  select(GEOID) |> 
  separate_wider_position(
    GEOID,
    widths = c(state = 2, county = 3, tract = 6, block = 4)
  )
#> # A tibble: 2,150 × 4
#>    state county tract  block
#>    <chr> <chr>  <chr>  <chr>
#>  1 50    023    955502 1014 
#>  2 50    023    955502 1015 
#>  3 50    023    955502 1016 
#>  4 50    023    955502 1017 
#>  5 50    023    955502 1018 
#>  6 50    023    955502 1019 
#>  7 50    023    955502 1020 
#>  8 50    023    955502 1021 
#>  9 50    023    955502 1022 
#> 10 50    023    955502 1023 
#> # … with 2,140 more rows
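
Widths without a name are still matched but are left out of the result, which is handy for skipping filler characters. Here is a minimal sketch using a made-up code column (the data and column names are illustrative and not part of the census example):

```r
library(tidyr)

# Hypothetical data: a 4-digit year, a one-character separator, a 2-digit month
periods <- tibble(code = c("2023-01", "2024-12"))

year_month <- periods |>
  separate_wider_position(
    code,
    # The unnamed width of 1 consumes the "-" without creating a column
    widths = c(year = 4, 1, month = 2)
  )
year_month
```

The result has only the year and month columns, both still character vectors.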

The NAME column contains this same information in a text form, with each component separated by a comma. We can use separate_wider_delim() to break up this sort of data into individual variables:

vt_census |>
  select(NAME) |> 
  separate_wider_delim(
    NAME,
    delim = ", ",
    names = c("block", "block_group", "tract", "county", "state")
  )
#> # A tibble: 2,150 × 5
#>    block      block_group   tract                county            state  
#>    <chr>      <chr>         <chr>                <chr>             <chr>  
#>  1 Block 1014 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  2 Block 1015 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  3 Block 1016 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  4 Block 1017 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  5 Block 1018 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  6 Block 1019 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  7 Block 1020 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  8 Block 1021 Block Group 1 Census Tract 9555.02 Washington County Vermont
#>  9 Block 1022 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> 10 Block 1023 Block Group 1 Census Tract 9555.02 Washington County Vermont
#> # … with 2,140 more rows
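
If you only need some of the pieces, you can put NA in names for the components you want to discard. The sketch below uses a hypothetical two-row stand-in for the NAME column so that it runs without downloading census data:

```r
library(tidyr)

# Hypothetical two-row stand-in for the NAME column
places <- tibble(
  NAME = c(
    "Block 1014, Block Group 1, Census Tract 9555.02, Washington County, Vermont",
    "Block 1015, Block Group 1, Census Tract 9555.02, Washington County, Vermont"
  )
)

blocks <- places |>
  separate_wider_delim(
    NAME,
    delim = ", ",
    # NA entries are split off but not returned as columns
    names = c("block", "block_group", "tract", NA, NA)
  )
blocks
```

The county and state pieces are still counted when checking the number of pieces, so rows with too few or too many components would still be reported.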

You’ll notice that each row contains a lot of duplicated information (“Block”, “Block Group”, …). You could certainly use mutate() and string manipulation to clean this up, but there’s a more direct approach that you can use if you’re familiar with regular expressions. The new separate_wider_regex() takes a vector of regular expressions that are matched in order, from left to right. If you name the regular expression, it will appear in the output; otherwise, it will be dropped. I think this leads to a particularly elegant solution to many problems.

vt_census |>
  select(NAME) |> 
  separate_wider_regex(
    NAME,
    patterns = c(
      "Block ", block = "\\d+", ", ",
      "Block Group ", block_group = "\\d+", ", ",
      "Census Tract ", tract = "\\d+.\\d+", ", ",
      county = "[^,]+", ", ",
      state = ".*"
    )
  )
#> # A tibble: 2,150 × 5
#>    block block_group tract   county            state  
#>    <chr> <chr>       <chr>   <chr>             <chr>  
#>  1 1014  1           9555.02 Washington County Vermont
#>  2 1015  1           9555.02 Washington County Vermont
#>  3 1016  1           9555.02 Washington County Vermont
#>  4 1017  1           9555.02 Washington County Vermont
#>  5 1018  1           9555.02 Washington County Vermont
#>  6 1019  1           9555.02 Washington County Vermont
#>  7 1020  1           9555.02 Washington County Vermont
#>  8 1021  1           9555.02 Washington County Vermont
#>  9 1022  1           9555.02 Washington County Vermont
#> 10 1023  1           9555.02 Washington County Vermont
#> # … with 2,140 more rows
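
The named versus unnamed distinction is easier to see on a smaller input. This is a minimal sketch with made-up data, where the column names group and unit are illustrative assumptions:

```r
library(tidyr)

# Hypothetical data: a one-letter code, a dash, then a numeric id
readings <- tibble(x = c("m-123", "f-455", "f-123"))

parsed <- readings |>
  separate_wider_regex(
    x,
    patterns = c(
      group = "[a-z]",  # named, so it becomes a column
      "-",              # unnamed, so it is matched and discarded
      unit = "\\d+"     # named, so it becomes a column
    )
  )
parsed
```

Both output columns are character vectors; converting unit to a number is left to a later mutate() step.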

These functions also have a new way to report problems. Let’s start with a very simple example:

df <- tibble(
  id = 1:3,
  x = c("a", "a-b", "a-b-c")
)

df |> separate_wider_delim(x, delim = "-", names = c("x", "y"))
#> Error in `separate_wider_delim()`:
#> ! Expected 2 pieces in each element of `x`.
#> ! 1 value was too short.
#>  Use `too_few = "debug"` to diagnose the problem.
#>  Use `too_few = "align_start"/"align_end"` to silence this message.
#> ! 1 value was too long.
#>  Use `too_many = "debug"` to diagnose the problem.
#>  Use `too_many = "drop"/"merge"` to silence this message.

We’ve requested two columns in the output (x and y), but the first row has only one element and the last row has three elements, so separate_wider_delim() can’t do what we’ve asked. The error lays out your options for resolving the problem using the too_few and too_many arguments. I’d recommend always starting with "debug" to get more information about the problem:

probs <- df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
#> Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and `x_remainder`.
probs
#> # A tibble: 3 × 7
#>      id a     b     x     x_ok  x_pieces x_remainder
#>   <int> <chr> <chr> <chr> <lgl>    <int> <chr>      
#> 1     1 a     NA    a     FALSE        1 ""         
#> 2     2 a     b     a-b   TRUE         2 ""         
#> 3     3 a     b     a-b-c FALSE        3 "-c"

This adds three new variables: x_ok tells you whether x could be separated as you requested, x_pieces tells you the actual number of pieces, and x_remainder shows you anything that remains after the columns you asked for. You can use this information to fix the problems in the input, or you can use the other options to too_few and too_many to tell separate_wider_delim() to fix them for you:

df |> 
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "align_start",
    too_many = "drop"
  )
#> # A tibble: 3 × 3
#>      id a     b    
#>   <int> <chr> <chr>
#> 1     1 a     NA   
#> 2     2 a     b    
#> 3     3 a     b
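
Two follow-ups are sketched below as one possible workflow rather than a recommended recipe: using the debug columns to pull out just the problematic rows, and trying the other repair options ("align_end" and "merge"). The object names bad_rows and repaired are illustrative.

```r
library(tidyr)

df <- tibble(
  id = 1:3,
  x = c("a", "a-b", "a-b-c")
)

# Debug mode warns that it is adding columns; the warning is silenced here
probs <- suppressWarnings(
  separate_wider_delim(
    df, x,
    delim = "-",
    names = c("a", "b"),
    too_few = "debug",
    too_many = "debug"
  )
)

# Keep only the rows that could not be split as requested
bad_rows <- probs[!probs$x_ok, ]
bad_rows$id

# Alternative repairs: pad short rows at the start and
# fold any extra pieces into the last column
repaired <- df |>
  separate_wider_delim(
    x,
    delim = "-",
    names = c("a", "b"),
    too_few = "align_end",
    too_many = "merge"
  )
repaired
```

With "align_end" the single piece in row 1 lands in b and a is NA; with "merge" row 3 keeps "b-c" in b instead of dropping "c".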

too_few and too_many also work with separate_wider_position(), and too_few works with separate_wider_regex(). The longer variants don’t need these arguments because varying numbers of rows don’t matter in the same way that varying numbers of columns do:

df |> separate_longer_delim(x, delim = "-")
#> # A tibble: 6 × 2
#>      id x    
#>   <int> <chr>
#> 1     1 a    
#> 2     2 a    
#> 3     2 b    
#> 4     3 a    
#> 5     3 b    
#> 6     3 c
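
The positional variant works the same way, cutting each string into fixed-width chunks. A minimal sketch with made-up data:

```r
library(tidyr)

# Hypothetical data: strings to be cut into chunks of two characters
chunks <- tibble(id = 1:2, x = c("abcd", "efg"))

long <- chunks |> separate_longer_position(x, width = 2)
long
```

The final chunk of "efg" is just "g", so a string whose length is not a multiple of width yields a shorter last piece rather than an error.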

These functions are still experimental so we are actively seeking feedback. Please try them out and let us know if you find them useful or if there are other features you’d like to see.

unnest_wider() and unnest_longer() improvements

unnest_longer() and unnest_wider() have both received some quality of life and consistency improvements. Most importantly:

  • unnest_wider() now gives a better error when unnesting an unnamed vector:

    df <- tibble(
      id = 1:2,
      x = list(c("a", "b"), c("d", "e", "f"))
    )
    df |> 
      unnest_wider(x)
    #> Error in `unnest_wider()`:
    #>  In column: `x`.
    #>  In row: 1.
    #> Caused by error:
    #> ! Can't unnest elements with missing names.
    #>  Supply `names_sep` to generate automatic names.
    
    df |> 
      unnest_wider(x, names_sep = "_")
    #> # A tibble: 2 × 4
    #>      id x_1   x_2   x_3  
    #>   <int> <chr> <chr> <chr>
    #> 1     1 a     b     NA   
    #> 2     2 d     e     f
    

    And this same behaviour now also applies to partially named vectors.

  • unnest_longer() has gained a keep_empty argument like unnest(), and it now treats NULL and empty vectors the same way:

    df <- tibble(
      id = 1:3,
      x = list(NULL, integer(), 1:3)
    )
    
    df |> unnest_longer(x)
    #> # A tibble: 3 × 2
    #>      id     x
    #>   <int> <int>
    #> 1     3     1
    #> 2     3     2
    #> 3     3     3
    df |> unnest_longer(x, keep_empty = TRUE)
    #> # A tibble: 5 × 2
    #>      id     x
    #>   <int> <int>
    #> 1     1    NA
    #> 2     2    NA
    #> 3     3     1
    #> 4     3     2
    #> 5     3     3
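
For contrast with the unnamed case in the first bullet, vectors that are fully named need no names_sep, because the element names become the column names. A minimal sketch with made-up data:

```r
library(tidyr)

# Hypothetical data: each element carries its own names
named <- tibble(
  id = 1:2,
  x = list(c(a = "a", b = "b"), c(a = "d", b = "e", c = "f"))
)

wide <- named |> unnest_wider(x)
wide
```

Row 1 has no element named c, so that cell is filled with NA.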
    

pivot_longer(cols_vary)

By default, pivot_longer() creates its output row-by-row:

df <- tibble(
  x = 1:2,
  y = 3:4,
  z = 5:6
)

df |> 
  pivot_longer(
    everything(),
    names_to = "name",
    values_to = "value"
  )
#> # A tibble: 6 × 2
#>   name  value
#>   <chr> <int>
#> 1 x         1
#> 2 y         3
#> 3 z         5
#> 4 x         2
#> 5 y         4
#> 6 z         6

You can now request to create the output column-by-column with cols_vary = "slowest":

df |> 
  pivot_longer(
    everything(),
    names_to = "name",
    values_to = "value",
    cols_vary = "slowest"
  )
#> # A tibble: 6 × 2
#>   name  value
#>   <chr> <int>
#> 1 x         1
#> 2 x         2
#> 3 y         3
#> 4 y         4
#> 5 z         5
#> 6 z         6
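
One way to think about the two settings: "fastest" (the default) cycles through the columns within each input row, while "slowest" finishes one column before moving to the next. The sketch below checks that sorting the default output by name reproduces the "slowest" order for this small example:

```r
library(tidyr)

df <- tibble(x = 1:2, y = 3:4, z = 5:6)

fastest <- df |> pivot_longer(everything(), cols_vary = "fastest")
slowest <- df |> pivot_longer(everything(), cols_vary = "slowest")

# Sorting the default output by column name gives the same value order
resorted <- fastest[order(fastest$name), ]
identical(resorted$value, slowest$value)
```

The equivalence holds here because the column names happen to sort in their original order; cols_vary = "slowest" keeps the input column order regardless of how the names sort.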

nest(.by)

A nested data frame is a data frame where one (or more) columns is a list of data frames. Nested data frames are a powerful tool that allows you to turn groups into rows and can facilitate certain types of data manipulation that would be very tricky otherwise. (One place to learn more about them is my 2016 talk “Managing many models with R”.)

Over the years we’ve made a number of attempts at getting the correct interface for nesting, including tidyr::nest(), dplyr::nest_by(), and dplyr::group_nest(). In this version of tidyr we’ve taken one more stab at it by adding a new argument to nest(): .by, inspired by the upcoming dplyr 1.1.0 release. This means that nest() now allows you to specify the variables you want to nest by as an alternative to specifying the variables that appear in the nested data.

# Specify what to nest by
mtcars |> 
  nest(.by = cyl)
#> # A tibble: 3 × 2
#>     cyl data              
#>   <dbl> <list>            
#> 1     6 <tibble [7 × 10]> 
#> 2     4 <tibble [11 × 10]>
#> 3     8 <tibble [14 × 10]>

# Specify what should be nested
mtcars |> 
  nest(data = -cyl)
#> # A tibble: 3 × 2
#>     cyl data              
#>   <dbl> <list>            
#> 1     6 <tibble [7 × 10]> 
#> 2     4 <tibble [11 × 10]>
#> 3     8 <tibble [14 × 10]>

# Specify both (to drop variables)
mtcars |> 
  nest(data = mpg:drat, .by = cyl)
#> # A tibble: 3 × 2
#>     cyl data             
#>   <dbl> <list>           
#> 1     6 <tibble [7 × 5]> 
#> 2     4 <tibble [11 × 5]>
#> 3     8 <tibble [14 × 5]>

If this function is all we hope it to be, we’re likely to supersede dplyr::nest_by() and dplyr::group_nest() in the future. This has the nice property of placing the functions for nesting and unnesting in the same package (tidyr).
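
As a small illustration of nesting and unnesting working together, the sketch below nests mtcars by cyl, inspects the list-column, and then reverses the operation. The object names are illustrative.

```r
library(tidyr)

nested <- mtcars |> nest(.by = cyl)

# Each element of the list-column is an ordinary tibble
rows_per_group <- vapply(nested$data, nrow, integer(1))
rows_per_group

# unnest() reverses the operation, restoring one row per car
flat <- nested |> unnest(data)
dim(flat)
```

The round trip returns all 32 rows and 11 columns, although the rows come back grouped by cyl and the row names of mtcars are not preserved.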

Acknowledgements

A big thanks to all 51 contributors who helped make this release possible by writing code and documentation, asking questions, and reporting bugs! @AdrianS85, @ahcyip, @allenbaron, @AnBarbosaBr, @ArthurAndrews, @bart1, @billdenney, @bknakker, @bwiernik, @crissthiandi, @daattali, @DavisVaughan, @dcaud, @DSLituiev, @elgabbas, @fabiangehring, @hadley, @ilikegitlab, @jennybc, @jic007, @Joao-O-Santos, @joeycouse, @jonspring, @kevinushey, @krlmlr, @lionel-, @lotard, @lschneiderbauer, @lucylgao, @markfairbanks, @martina-starc, @MatthieuStigler, @mattnolan001, @mattroumaya, @mdkrause, @mgirlich, @millermc38, @modche, @moodymudskipper, @mspittler, @olivroy, @piokol23, @ppreshant, @ramiromagno, @Rengervn, @rjake, @roohitk, @struckma, @tjmahr, @weirichs, and @wurli.