I’m pleased to announce that version 2.0.1 of the tibble package is on CRAN now, just in time for rstudio::conf(). Tibbles are a modern reimagining of the data frame, keeping what time has shown to be effective, and throwing out what is not, with nicer default output too! Grab the latest version with:
install.packages("tibble")
This release required a bit of preparation, including a pre-release blog post that described the breaking changes, mostly in as_tibble(), new_tibble(), set_tidy_names(), tidy_names(), and names<-(), and a patch release that fixed problems found after the initial 2.0.0 release.
In this blog post, I focus on a few user- and programmer-related changes, and give an outlook on future development:
- view(), nameless enframe(), 2D columns
- Lifecycle, robustness, name repair, row names, glimpse() for subclasses
- vctrs, dependencies, decorations
For a complete overview please see the release notes.
Use the issue tracker to submit bugs or suggest ideas; your contributions are always welcome.
Changes that affect users
view
The experimental view() function forwards its input to utils::View() (only in interactive mode) and always returns its input invisibly, which is useful for pipe-based workflows.
Currently it is unclear if this functionality should live in tibble or elsewhere.
# This is a no-op in non-interactive mode.
# In interactive mode, a viewer window/pane will open.
iris %>%
  view()
Nameless enframe
The enframe() function has always been a good way to convert a (named) vector to a two-column data frame.
In this version, conversion to a one-column data frame is also supported by setting the name argument to NULL.
This is now the recommended way to turn a vector into a one-column tibble, due to changes to the default implementation of as_tibble().
enframe(letters[1:3])
#> # A tibble: 3 x 2
#> name value
#> <int> <chr>
#> 1 1 a
#> 2 2 b
#> 3 3 c
enframe(letters[1:3], name = NULL)
#> # A tibble: 3 x 1
#> value
#> <chr>
#> 1 a
#> 2 b
#> 3 c
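For named vectors, the names end up in the name column. The following is a minimal sketch, assuming the tibble package is attached; the vector x is made up for illustration, and deframe() is used only to show the reverse direction:

```r
library(tibble)

# A small named vector, invented for this example.
x <- c(a = 1, b = 2, c = 3)

# Names become the `name` column, values the `value` column.
two_col <- enframe(x)

# deframe() turns a two-column data frame back into a named vector.
round_trip <- deframe(two_col)

# With name = NULL, only the `value` column is created.
one_col <- enframe(x, name = NULL)
```

The round trip through enframe() and deframe() should give back the original named vector.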
2D columns
tibble() now supports columns that are matrices or data frames.
These have always been supported in data frames and are used in some modelling functions.
We are looking forward to supporting these and other exciting use cases; see also the Matrix and data frame columns chapter of adv-r.
The number of rows in these objects must be consistent with the length of the other columns.
Internally, this feature required using NROW() instead of length() in a few spots.
NROW() conveniently returns the length for vectors and the number of rows for 2D objects.
The required support in pillar was added earlier last year.
tibble(
  a = 1:3,
  b = tibble(c = 4:6),
  d = tibble(e = 7:9, f = tibble(g = 10, h = 11)),
  i = diag(3)
)
#> # A tibble: 3 x 4
#> a b$c d$e $f$g $$h i[,1] [,2] [,3]
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 4 7 10 11 1 0 0
#> 2 2 5 8 10 11 0 1 0
#> 3 3 6 9 10 11 0 0 1
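The difference between length() and NROW() mentioned above can be seen with base R alone. This is only an illustration with made-up objects, not the code used inside tibble:

```r
v <- 1:3                             # plain vector
m <- diag(3)                         # 3 x 3 matrix
df <- data.frame(c = 4:6, d = 7:9)   # data frame with 3 rows, 2 columns

len <- c(v = length(v), m = length(m), df = length(df))
rows <- c(v = NROW(v), m = NROW(m), df = NROW(df))

len   # 3, 9 (all cells), 2 (number of columns)
rows  # 3, 3, 3
```

Only NROW() agrees across all three objects, which is what a consistency check on column sizes needs.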
Changes that affect package developers
Lifecycle
All functions have been assigned a lifecycle.
The tibble package has now reached the “stable” lifecycle; functions in a different lifecycle stage are marked as such in their documentation.
One example is the add_row() function: it is unclear if it should ensure that all columns have length one by wrapping in a list if necessary, and a better implementation is perhaps possible once tibble uses the vctrs package (see below).
Therefore this function is marked “questioning”.
Learn more about lifecycle in the tidyverse at https://www.tidyverse.org/lifecycle/.
Robustness
The new .rows argument to tibble() and as_tibble() allows specifying the expected number of rows explicitly, even if it’s evident from the data.
This supports writing even more defensive code.
The nrow argument to the low-level new_tibble() constructor is now mandatory; on the other hand, most expensive checks have been moved to the new validate_tibble() function.
This means that construction of tibbles is now faster by default if you know that the inputs are correct, but you can always double-check if needed.
See also the S3 classes chapter in adv-r for motivation.
tibble(a = 1, b = 1:3, .rows = 3)
#> # A tibble: 3 x 2
#> a b
#> <dbl> <int>
#> 1 1 1
#> 2 1 2
#> 3 1 3
tibble(a = 1, b = 2:3, .rows = 3)
#> Error: Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 3: Requested with `.rows` argument
#> * Length 2: Column `b`
tibble(a = 1, .rows = 3)
#> # A tibble: 3 x 1
#> a
#> <dbl>
#> 1 1
#> 2 1
#> 3 1
as_tibble(iris[1:3, ], .rows = 3)
#> # A tibble: 3 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
new_tibble(list(a = 1:3), nrow = 3)
#> # A tibble: 3 x 1
#> a
#> <int>
#> 1 1
#> 2 2
#> 3 3
bad <- new_tibble(list(a = 1:2), nrow = 3)
validate_tibble(bad)
#> Error: Tibble columns must have consistent lengths, only values of length one are recycled:
#> * Length 3: Requested with `nrow` argument
#> * Length 2: Column `a`
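In package code, the split between new_tibble() and validate_tibble() could be wrapped in a small helper. The sketch below is one possible pattern; make_tbl() and its validate flag are hypothetical names, not part of tibble:

```r
library(tibble)

# Hypothetical helper: construct cheaply, validate only on request.
make_tbl <- function(cols, n, validate = TRUE) {
  out <- new_tibble(cols, nrow = n)
  if (validate) validate_tibble(out)
  out
}

# Consistent input passes validation.
ok <- make_tbl(list(a = 1:3, b = letters[1:3]), n = 3)

# Inconsistent input is caught by validate_tibble().
bad <- try(make_tbl(list(a = 1:2), n = 3), silent = TRUE)
failed <- inherits(bad, "try-error")
```

Internal callers that already guarantee correct input can pass validate = FALSE and skip the checks.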
Name repair
Column name repair has more direct support, via the new .name_repair argument to tibble() and as_tibble().
It takes the following values:
- "minimal": No name repair or checks, beyond basic existence.
- "unique": Make sure names are unique and not empty.
- "check_unique": (default value), no name repair, but check they are unique.
- "universal": Make the names unique and syntactic.
- a function: apply custom name repair (e.g., .name_repair = make.names or .name_repair = ~make.names(., unique = TRUE) for names in the style of base R).
## by default, duplicate names are not allowed
tibble(`1a` = 1, `1a` = 2)
#> Error: Column name `1a` must not be duplicated.
#> Use .name_repair to specify repair.
## you can authorize duplicate names
tibble(`1a` = 1, `1a` = 2, .name_repair = "minimal")
#> # A tibble: 1 x 2
#> `1a` `1a`
#> <dbl> <dbl>
#> 1 1 2
## or request that the names be made unique
tibble(`1a` = 1, `1a` = 2, .name_repair = "unique")
#> New names:
#> * `1a` -> `1a..1`
#> * `1a` -> `1a..2`
#> # A tibble: 1 x 2
#> `1a..1` `1a..2`
#> <dbl> <dbl>
#> 1 1 2
## or universal
tibble(`1a` = 1, `1a` = 2, .name_repair = "universal")
#> New names:
#> * `1a` -> ..1a..1
#> * `1a` -> ..1a..2
#> # A tibble: 1 x 2
#> ..1a..1 ..1a..2
#> <dbl> <dbl>
#> 1 1 2
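The function variant of .name_repair is not shown above. A minimal sketch, assuming the tibble package is attached; lower_unique() is a hypothetical repair function written for this example:

```r
library(tibble)

# Base-R style repair, using the formula shorthand for a function.
base_style <- tibble(
  `a b` = 1, `a b` = 2,
  .name_repair = ~ make.names(., unique = TRUE)
)
names(base_style)

# Hypothetical custom repair: lower-case all names, then make them unique.
lower_unique <- function(nms) make.unique(tolower(nms), sep = "_")

custom <- tibble(A = 1, a = 2, .name_repair = lower_unique)
names(custom)
```

A repair function receives the character vector of names and must return a vector of the same length.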
Row names
Row name handling is stricter.
Row names were never supported in tibble() and new_tibble(), and are now stripped by default in as_tibble().
The rownames argument to as_tibble() supports:
- NULL: remove row names (default),
- NA: keep row names,
- A string: the name of the new column that will contain the existing row names, which are no longer present in the result.
The old default can be restored by calling pkgconfig::set_config("tibble::rownames", NA); this also works for packages that import tibble.
rownames(as_tibble(mtcars))
#> [1] "1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14"
#> [15] "15" "16" "17" "18" "19" "20" "21" "22" "23" "24" "25" "26" "27" "28"
#> [29] "29" "30" "31" "32"
as_tibble(mtcars, rownames = "make_model")
#> # A tibble: 32 x 12
#> make_model mpg cyl disp hp drat wt qsec vs am gear
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4
#> 2 Mazda RX4… 21 6 160 110 3.9 2.88 17.0 0 1 4
#> 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4
#> 4 Hornet 4 … 21.4 6 258 110 3.08 3.22 19.4 1 0 3
#> 5 Hornet Sp… 18.7 8 360 175 3.15 3.44 17.0 0 0 3
#> 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3
#> 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3
#> 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4
#> 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4
#> 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4
#> # … with 22 more rows, and 1 more variable: carb <dbl>
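The three settings of the rownames argument can be compared side by side. A short sketch, assuming the tibble package is attached; has_rownames() is used here only to inspect the results:

```r
library(tibble)

stripped <- as_tibble(mtcars)                         # default: NULL
kept <- as_tibble(mtcars, rownames = NA)              # keep row names
moved <- as_tibble(mtcars, rownames = "make_model")   # move to a column

has_rownames(stripped)  # FALSE
has_rownames(kept)      # TRUE
names(moved)[1]         # "make_model"
```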
glimpse for subclasses
The glimpse() function shows information obtained from tbl_sum() in the header, e.g. grouping information for grouped_df from dplyr, or other information from packages that override the tbl_df class.
iris %>%
  group_by(Species) %>%
  glimpse()
#> Observations: 150
#> Variables: 5
#> Groups: Species [3]
#> $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5…
#> $ Sepal.Width <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3…
#> $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1…
#> $ Petal.Width <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0…
#> $ Species <fct> setosa, setosa, setosa, setosa, setosa, setosa, set…
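A package that subclasses tbl_df could add its own header entries by providing a tbl_sum() method. The sketch below is only an illustration: the class my_tbl and the Source entry are hypothetical, and a real package would register the method in its NAMESPACE rather than define it at top level:

```r
library(tibble)

# Hypothetical subclass "my_tbl", created here only for illustration.
x <- tibble(a = 1:3)
class(x) <- c("my_tbl", class(x))

# Keep the default header entries and append one more.
tbl_sum.my_tbl <- function(x, ...) {
  c(NextMethod(), Source = "example data")
}

header <- tbl_sum(x)
names(header)

# The extra entry should then also appear in the glimpse() header.
glimpse(x)
```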
Outlook
vctrs
The plan is to use vctrs in tibble 2.1.0.
This package is a solid foundation for handling coercion, concatenation and recycling in vectors of arbitrary type.
The support provided by vctrs will yield a better add_row() implementation; in return, name repair, which is currently defined in tibble, should likely live in vctrs.
Dependencies
Currently, installing tibble can bring in almost a dozen other packages:
tools::package_dependencies("tibble", recursive = TRUE, which = "Imports")
#> $tibble
#> [1] "cli" "crayon" "fansi" "methods" "pillar"
#> [6] "pkgconfig" "rlang" "utils" "assertthat" "grDevices"
#> [11] "utf8" "tools"
Some of them, namely fansi and utf8, contain code that requires compilation and are only required for optional features.
The plan is to make these packages, along with crayon, suggested packages of cli, and to provide fallback implementations there.
When finished, taking a strong dependency on tibble won’t add too many new dependencies (again): rlang, vctrs and cli will be used by most of the tidyverse anyway, so pillar is the only truly new strong dependency.
Packages that subclass tbl_df should import tibble to make sure that the subsetting operator [ always behaves the same.
Constructing (subclasses of) tibbles should happen through new_tibble() only.
Decorations
Tibbles have a very opinionated way to print their data, not always in line with users’ expectations, and sometimes clearly wrong (e.g. for numerical data where the absolute mean is much larger than the standard deviation).
It seems difficult to devise a formatting that suits all needs, especially for numbers: how do we tell if a number represents money, or perhaps is a misspecified categorical variable or a UID?
Decorations are an idea that might help here.
A decoration is applied only when printing a vector, which behaves identically to a bare vector otherwise.
Decorations can be “learned” from the data (using heuristics), or specified directly after import or when creating a column, and stored in attributes like "class".
It will be important to make sure that these attributes survive subsetting and perhaps some arithmetic transformations, which is easiest to achieve with the help of vctrs.
Acknowledgments
Thanks to Brodie Gaslam (@brodieG) for his help with formatting this blog post and for spotting inaccurate wording.
We also received issues, pull requests, and comments from 108 people since tibble 1.4.2. Thanks to everyone:
@adam-gruer, @aegerton, @alaindanet, @alexpghayes, @alexwhan, @alistaire47, @anhqle, @batpigandme, @brendanf, @brodieG, @cfhammill, @christophsax, @cimentadaj, @czeildi, @DasHammett, @DavisVaughan, @earowang, @Eluvias, @Enchufa2, @esford3, @flying-sheep, @gavinsimpson, @GeorgeHayduke, @gregorp, @hadley, @IndrajeetPatil, @iron0012, @isteves, @jeffreyhanson, @jennybc, @jimhester, @JLYJabc, @joranE, @jtelleriar, @karldw, @kendonB, @kevinushey, @kovla, @lbusett, @lionel-, @lorenzwalthert, @lwiklendt, @mattfidler, @MatthieuStigler, @maxheld83, @michaelweylandt, @mingsu, @momeara, @PalaceChan, @pat-s, @plantarum, @prosoitos, @ptoche, @QuLogic, @ralonso-igenomix, @randomgambit, @riccardopinosio, @romainfrancois, @tomroh, @Woosah, @yonicd, and @yutannihilation.