tibble 2.0.1

I’m pleased to announce that version 2.0.1 of the tibble package is on CRAN now, just in time for rstudio::conf(). Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not, with nicer default output too! Grab the latest version with:

install.packages("tibble")

This release required a bit of preparation, including a pre-release blog post that described the breaking changes, mostly in as_tibble(), new_tibble(), set_tidy_names(), tidy_names(), and names<-(), and a patch release that fixed problems found after the initial 2.0.0 release. In this blog post, I focus on a few user- and programmer-related changes, and give an outlook on future development:

  • view(), nameless enframe(), 2D columns
  • Lifecycle, robustness, name repair, row names, glimpse() for subclasses
  • vctrs, dependencies, decorations

For a complete overview please see the release notes.

Use the issue tracker to submit bugs or suggest ideas. Your contributions are always welcome.

Changes that affect users

view

The experimental view() function forwards its input to utils::View() (only in interactive mode) and always returns its input invisibly, which is useful for pipe-based workflows. Currently it is unclear if this functionality should live in tibble or elsewhere.

# This is a no-op in non-interactive mode.
# In interactive mode, a viewer window/pane will open.
iris %>%
  view()
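
Because view() returns its input invisibly, it can also sit in the middle of a pipeline. The following is a minimal sketch, assuming magrittr is attached to provide the pipe; the nrow() step is only an illustration of "something downstream".

```r
library(tibble)
library(magrittr)  # assumption: attached here only to provide %>%

# view() passes its input through unchanged, so later steps see the
# same data whether or not a viewer window was opened.
n_rows <- iris %>%
  view() %>%
  nrow()

n_rows
#> [1] 150
```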

Nameless enframe

The enframe() function has always been a good way to convert a (named) vector to a two-column data frame. In this version, conversion to a one-column data frame is also supported by setting the name argument to NULL. This is now the recommended way to turn a vector into a one-column tibble, due to changes to the default implementation of as_tibble().

enframe(letters[1:3])

#> # A tibble: 3 x 2
#>    name value
#>   <int> <chr>
#> 1     1 a    
#> 2     2 b    
#> 3     3 c

enframe(letters[1:3], name = NULL)

#> # A tibble: 3 x 1
#>   value
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 c
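
The examples above use an unnamed vector. For a named vector, the names end up in the name column. A small sketch with made-up data; deframe() is used here only to show the way back from the two-column form.

```r
library(tibble)

x <- c(a = 1, b = 2, c = 3)

# Two columns: the names go into `name`, the values into `value`.
two_col <- enframe(x)
two_col$name
#> [1] "a" "b" "c"

# One column: only a `value` column is created.
one_col <- enframe(x, name = NULL)
names(one_col)
#> [1] "value"

# deframe() turns the two-column form back into a named vector.
round_trip <- deframe(two_col)
```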

2D columns

tibble() now supports columns that are matrices or data frames. These have always been supported in data frames and are used in some modelling functions. We are looking forward to supporting these and other exciting use cases; see also the Matrix and data frame columns chapter of adv-r. The number of rows in these objects must be consistent with the length of the other columns. Internally, this feature required using NROW() instead of length() in a few spots: NROW() conveniently returns the length for vectors and the number of rows for 2D objects. The required support in pillar was added last year.

tibble(
  a = 1:3,
  b = tibble(c = 4:6),
  d = tibble(e = 7:9, f = tibble(g = 10, h = 11)),
  i = diag(3)
)

#> # A tibble: 3 x 4
#>       a   b$c   d$e  $f$g   $$h i[,1]  [,2]  [,3]
#>   <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     4     7    10    11     1     0     0
#> 2     2     5     8    10    11     0     1     0
#> 3     3     6     9    10    11     0     0     1
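
The difference between NROW() and length() mentioned above can be checked in base R alone. This sketch does not show tibble internals, only why NROW() is the convenient choice when a column may be a vector, a matrix, or a data frame.

```r
# NROW() treats a vector like a one-column matrix.
sizes <- c(
  vector     = NROW(1:3),
  matrix     = NROW(diag(3)),
  data_frame = NROW(data.frame(c = 4:6))
)
sizes
#>     vector     matrix data_frame 
#>          3          3          3

# length() answers a different question for 2D objects.
length(diag(3))              # number of cells
#> [1] 9
length(data.frame(c = 4:6))  # number of columns
#> [1] 1
```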

Changes that affect package developers

Lifecycle

All functions have been assigned a lifecycle. The tibble package has now reached the “stable” lifecycle; functions in a different lifecycle stage are marked as such in their documentation. One example is the add_row() function: it is unclear if it should ensure that all columns have length one by wrapping in a list if necessary, and a better implementation is perhaps possible once tibble uses the vctrs package, see below. Therefore this function is marked “questioning”. Learn more about lifecycle in the tidyverse at https://www.tidyverse.org/lifecycle/.

Robustness

The new .rows argument to tibble() and as_tibble() allows specifying the expected number of rows explicitly, even if it’s evident from the data. This supports writing even more defensive code. The nrow argument to the low-level new_tibble() constructor is now mandatory. On the other hand, most expensive checks have been moved to the new validate_tibble() function. This means that construction of tibbles is now faster by default if you know that the inputs are correct, but you can always double-check if needed. See also the S3 classes chapter in adv-r for motivation.

tibble(a = 1, b = 1:3, .rows = 3)

#> # A tibble: 3 x 2
#>       a     b
#>   <dbl> <int>
#> 1     1     1
#> 2     1     2
#> 3     1     3

tibble(a = 1, b = 2:3, .rows = 3)
#> Error: Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 3: Requested with `.rows` argument
#> * Length 2: Column `b`

tibble(a = 1, .rows = 3)

#> # A tibble: 3 x 1
#>       a
#>   <dbl>
#> 1     1
#> 2     1
#> 3     1

as_tibble(iris[1:3, ], .rows = 3)

#> # A tibble: 3 x 5
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>          <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#> 1          5.1         3.5          1.4         0.2 setosa 
#> 2          4.9         3            1.4         0.2 setosa 
#> 3          4.7         3.2          1.3         0.2 setosa

new_tibble(list(a = 1:3), nrow = 3)

#> # A tibble: 3 x 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3

bad <- new_tibble(list(a = 1:2), nrow = 3)
validate_tibble(bad)
#> Error: Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 3: Requested with `nrow` argument
#> * Length 2: Column `a`

Name repair

Column name repair has more direct support, via the new .name_repair argument to tibble() and as_tibble(). It takes the following values:

  • "minimal": No name repair or checks, beyond basic existence.
  • "unique": Make sure names are unique and not empty.
  • "check_unique": (default value), no name repair, but check they are unique.
  • "universal": Make the names unique and syntactic.
  • a function: apply custom name repair (e.g., .name_repair = make.names or .name_repair = ~make.names(., unique = TRUE) for names in the style of base R).

## by default, duplicate names are not allowed
tibble(`1a` = 1, `1a` = 2)
#> Error: Column name `1a` must not be duplicated.
#> Use .name_repair to specify repair.

## you can authorize duplicate names
tibble(`1a` = 1, `1a` = 2, .name_repair = "minimal")

#> # A tibble: 1 x 2
#>    `1a`  `1a`
#>   <dbl> <dbl>
#> 1     1     2

## or request that the names be made unique
tibble(`1a` = 1, `1a` = 2, .name_repair = "unique")
#> New names:
#> * `1a` -> `1a..1`
#> * `1a` -> `1a..2`

#> # A tibble: 1 x 2
#>   `1a..1` `1a..2`
#>     <dbl>   <dbl>
#> 1       1       2

## or universal
tibble(`1a` = 1, `1a` = 2, .name_repair = "universal")
#> New names:
#> * `1a` -> ..1a..1
#> * `1a` -> ..1a..2

#> # A tibble: 1 x 2
#>   ..1a..1 ..1a..2
#>     <dbl>   <dbl>
#> 1       1       2
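
The function form of .name_repair is not shown above. A short sketch: the first call uses the base R style formula from the list, the second uses add_index(), a hypothetical helper written only for this illustration.

```r
library(tibble)

# Base R style: syntactic and unique names via make.names().
base_style <- tibble(
  `1a` = 1, `1a` = 2,
  .name_repair = ~ make.names(., unique = TRUE)
)
names(base_style)
#> [1] "X1a"   "X1a.1"

# Any function that maps a character vector of names to a new
# character vector can be used. add_index() is a made-up example.
add_index <- function(nms) paste0("col_", seq_along(nms), "_", nms)

indexed <- tibble(`1a` = 1, `1a` = 2, .name_repair = add_index)
names(indexed)
#> [1] "col_1_1a" "col_2_1a"
```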

Row names

Row name handling is stricter. Row names were never supported in tibble() and new_tibble(), and are now stripped by default in as_tibble(). The rownames argument to as_tibble() supports:

  • NULL: remove row names (default),
  • NA: keep row names,
  • A string: the name of the new column that will contain the existing row names, which are no longer present in the result.

The old default can be restored by calling pkgconfig::set_config("tibble::rownames", NA). This also works for packages that import tibble.

rownames(as_tibble(mtcars))
#>  [1] "1"  "2"  "3"  "4"  "5"  "6"  "7"  "8"  "9"  "10" "11" "12" "13" "14"
#> [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
#> [29] "29" "30" "31" "32"

as_tibble(mtcars, rownames = "make_model")

#> # A tibble: 32 x 12
#>    make_model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear
#>    <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1 Mazda RX4   21       6  160    110  3.9   2.62  16.5     0     1     4
#>  2 Mazda RX4…  21       6  160    110  3.9   2.88  17.0     0     1     4
#>  3 Datsun 710  22.8     4  108     93  3.85  2.32  18.6     1     1     4
#>  4 Hornet 4 …  21.4     6  258    110  3.08  3.22  19.4     1     0     3
#>  5 Hornet Sp…  18.7     8  360    175  3.15  3.44  17.0     0     0     3
#>  6 Valiant     18.1     6  225    105  2.76  3.46  20.2     1     0     3
#>  7 Duster 360  14.3     8  360    245  3.21  3.57  15.8     0     0     3
#>  8 Merc 240D   24.4     4  147.    62  3.69  3.19  20       1     0     4
#>  9 Merc 230    22.8     4  141.    95  3.92  3.15  22.9     1     0     4
#> 10 Merc 280    19.2     6  168.   123  3.92  3.44  18.3     1     0     4
#> # … with 22 more rows, and 1 more variable: carb <dbl>
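
The remaining option, rownames = NA, keeps the row names on the result. A brief sketch that contrasts it with the default, using has_rownames() from tibble to inspect the outcome:

```r
library(tibble)

# Default: row names are dropped.
dropped <- as_tibble(mtcars)
has_rownames(dropped)
#> [1] FALSE

# rownames = NA: row names are kept.
kept <- as_tibble(mtcars, rownames = NA)
has_rownames(kept)
#> [1] TRUE
head(rownames(kept), 3)
#> [1] "Mazda RX4"     "Mazda RX4 Wag" "Datsun 710"
```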

glimpse for subclasses

The glimpse() function shows information obtained from tbl_sum() in the header, e.g. grouping information for grouped_df from dplyr, or other information from packages that override the tbl_df class.

library(dplyr) # needed for group_by()
iris %>%
  group_by(Species) %>%
  glimpse()

#> Observations: 150
#> Variables: 5
#> Groups: Species [3]
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
#> $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
#> $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
#> $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
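
Packages with their own tbl_df subclass can feed extra header lines to glimpse() by providing a tbl_sum() method. The sketch below uses a hypothetical class my_tbl and a made-up "Source" entry; it assumes the method is defined at the top level of a script, whereas a package would register it as an S3 method.

```r
library(tibble)

# Hypothetical subclass "my_tbl", created through new_tibble().
x <- new_tibble(list(a = 1:3), nrow = 3L, class = "my_tbl")

# Extend the default header with one extra named entry.
tbl_sum.my_tbl <- function(x) {
  c(NextMethod(), "Source" = "demo data")
}

# The extra entry is part of the summary that printing and
# glimpse() pick up for the header.
tbl_sum(x)
glimpse(x)
```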

Outlook

vctrs

The plan is to use vctrs in tibble 2.1.0. This package is a solid foundation for handling coercion, concatenation and recycling in vectors of arbitrary type. The support provided by vctrs will yield a better add_row() implementation. In return, name repair, which is currently defined in tibble, should likely live in vctrs.

Dependencies

Currently, installing tibble can bring in almost a dozen other packages:

tools::package_dependencies("tibble", recursive = TRUE, which = "Imports")
#> $tibble
#>  [1] "cli"        "crayon"     "fansi"      "methods"    "pillar"    
#>  [6] "pkgconfig"  "rlang"      "utils"      "assertthat" "grDevices" 
#> [11] "utf8"       "tools"

Some of them, namely fansi and utf8, contain code that requires compilation and are only required for optional features. The plan is to make these packages, and crayon, suggested packages of cli, and to provide fallback implementations there. When finished, taking a strong dependency on tibble won’t add too many new dependencies (again): rlang, vctrs and cli will be used by most of the tidyverse anyway, so pillar is the only truly new strong dependency. Packages that subclass tbl_df should import tibble to make sure that the subsetting operator [ always behaves the same. Constructing (subclasses of) tibbles should happen through new_tibble() only.
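
For a package that subclasses tbl_df, construction through new_tibble() could look roughly like the following. This is a sketch only: the class name measurements and both functions are hypothetical, and the split into a cheap constructor and a checking helper follows the pattern from the S3 classes chapter in adv-r.

```r
library(tibble)

# Low-level constructor for a hypothetical "measurements" subclass:
# cheap, trusts its inputs.
new_measurements <- function(x, nrow) {
  new_tibble(x, nrow = nrow, class = "measurements")
}

# User-facing helper: derives nrow from the first column and runs
# the more expensive checks via validate_tibble().
measurements <- function(...) {
  x <- list(...)
  validate_tibble(new_measurements(x, nrow = NROW(x[[1]])))
}

m <- measurements(id = 1:3, value = c(0.1, 0.2, 0.3))
class(m)
#> [1] "measurements" "tbl_df"       "tbl"          "data.frame"
```

Calling the helper with columns of inconsistent length, for example measurements(id = 1:3, value = 1:2), is expected to fail in validate_tibble().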

Decorations

Tibbles have a very opinionated way to print their data, not always in line with users’ expectations, and sometimes clearly wrong (e.g. for numerical data where the absolute mean is much larger than the standard deviation). It seems difficult to devise a formatting that suits all needs, especially for numbers: how do we tell if a number represents money, or perhaps is a misspecified categorical variable or a UID? Decorations are an idea that might help here. A decoration is applied only when printing a vector, which behaves identically to a bare vector otherwise. Decorations can be “learned” from the data (using heuristics), or specified directly after import or when creating a column, and stored in attributes like "class". It will be important to make sure that these attributes survive subsetting and perhaps some arithmetic transformations, which is easiest to achieve with the help of vctrs.

Acknowledgments

Thanks to Brodie Gaslam (@brodieG) for his help with formatting this blog post and for spotting inaccurate wording.

We also received issues, pull requests, and comments from 108 people since tibble 1.4.2. Thanks to everyone:

@adam-gruer, @aegerton, @alaindanet, @alexpghayes, @alexwhan, @alistaire47, @anhqle, @batpigandme, @brendanf, @brodieG, @cfhammill, @christophsax, @cimentadaj, @czeildi, @DasHammett, @DavisVaughan, @earowang, @Eluvias, @Enchufa2, @esford3, @flying-sheep, @gavinsimpson, @GeorgeHayduke, @gregorp, @hadley, @IndrajeetPatil, @iron0012, @isteves, @jeffreyhanson, @jennybc, @jimhester, @JLYJabc, @joranE, @jtelleriar, @karldw, @kendonB, @kevinushey, @kovla, @lbusett, @lionel-, @lorenzwalthert, @lwiklendt, @mattfidler, @MatthieuStigler, @maxheld83, @michaelweylandt, @mingsu, @momeara, @PalaceChan, @pat-s, @plantarum, @prosoitos, @ptoche, @QuLogic, @ralonso-igenomix, @randomgambit, @riccardopinosio, @romainfrancois, @tomroh, @Woosah, @yonicd, and @yutannihilation.
