Q1 2023 tidymodels digest

  tidymodels, yardstick, recipes, dials

  Emil Hvitfeldt

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like these posts from the past couple of months:

Since our last roundup post, there have been CRAN releases of 24 tidymodels packages. Here are links to their NEWS files:

We’ll highlight a few especially notable changes below: more informative errors and faster code. First, loading the collection of packages:

library(tidymodels)  # loads the core tidymodels packages
library(embed)       # extra recipe steps, including step_pca_truncated()

data("ames", package = "modeldata")  # Ames housing data

More informative errors

In the last few months we have been focused on refining error messages so that it is easier for users to pinpoint what went wrong and where. Since a modeling pipeline can be quite complicated, uninformative errors are a no-go.

Across the tidymodels packages, error messages will now indicate the user-facing function that caused the error rather than the internal function it came from.

From dials, an error that used to look like this:

degree(range = c(1L, 5L))
#> Error in `new_quant_param()`:
#> ! Since `type = 'double'`, please use that data type for the range.

Now the message says that the error came from degree() rather than new_quant_param():

degree(range = c(1L, 5L))
#> Error in `degree()`:
#> ! Since `type = 'double'`, please use that data type for the range.
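
For reference, a call that satisfies the check, passing the range as doubles rather than integers, runs without complaint (a minimal sketch, with the printed parameter object omitted):

degree(range = c(1, 5))  # doubles match the parameter's 'double' type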

The same improvement can be seen with the yardstick metrics:

mtcars |>
  accuracy(vs, am)
#> Error in `dplyr::summarise()`:
#> ℹ In argument: `.estimate = metric_fn(truth = vs, estimate = am, na_rm =
#>   na_rm)`.
#> Caused by error in `validate_class()`:
#> ! `truth` should be a factor but a numeric was supplied.

which now errors much more informatively:

mtcars |>
  accuracy(vs, am)
#> Error in `accuracy()`:
#> ! `truth` should be a factor, not a `numeric`.
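
The fix on the user side is unchanged: encode the class columns as factors first. A minimal sketch of a call that works (output omitted):

# Convert the numeric 0/1 columns to factors before computing accuracy
mtcars |>
  mutate(vs = factor(vs), am = factor(am)) |>
  accuracy(vs, am)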

Lastly, one of the biggest improvements came in recipes, which now shows which step caused the error instead of saying it happened in prep() or bake(). This is a huge improvement, since preprocessing pipelines often string together many steps.

Before

recipe(~., data = ames) |>
  step_novel(Neighborhood, new_level = "Gilbert") |>
  prep()
#> Error in `prep()`:
#> ! Columns already contain the new level: Neighborhood

Now

recipe(~., data = ames) |>
  step_novel(Neighborhood, new_level = "Gilbert") |>
  prep()
#> Error in `step_novel()`:
#> Caused by error in `prep()` at recipes/R/recipe.R:437:8:
#> ! Columns already contain the new level: Neighborhood
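
The error message also points to the fix: use a level that isn't already in the data. A minimal sketch, where "completely_new" is an arbitrary placeholder level chosen for illustration:

recipe(~., data = ames) |>
  step_novel(Neighborhood, new_level = "completely_new") |>
  prep()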

Especially when calls to recipes functions are deeply nested inside the call stack, like in fit_resamples() or tune_grid(), these changes make a big difference.

Things are getting faster

As we have written about in The tidymodels is getting a whole lot faster and Writing performant code with tidy tools, we have been working on tightening up the performance of the tidymodels code. These changes are mostly related to the infrastructure code, meaning that the speedup brings you closer to the speed of the underlying model implementations.

A different kind of speedup comes with step_pca_truncated(), a new step in the embed package.

Principal component analysis (PCA) is a really powerful and fast method for dimensionality reduction of large data sets. However, for data with many columns, it can be computationally expensive to calculate all the principal components. step_pca_truncated() works in much the same way as step_pca(), but it only calculates the number of components it needs.

# Standard PCA: computes the full decomposition, then keeps 3 components
pca_normal <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = 3)

# Truncated PCA: computes only the 3 requested components
pca_truncated <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(all_nominal_predictors()) |>
  step_pca_truncated(all_numeric_predictors(), num_comp = 3)

tictoc::tic()
prep(pca_normal) |> bake(ames)
#> # A tibble: 2,930 × 4
#>    Sale_Price     PC1    PC2   PC3
#>         <int>   <dbl>  <dbl> <dbl>
#>  1     215000 -31793.  4151. -197.
#>  2     105000 -12198.  -611. -524.
#>  3     172000 -14911.  -265. 7568.
#>  4     244000 -12072. -1813.  918.
#>  5     189900 -14418.  -345. -302.
#>  6     195500 -10704. -1367. -204.
#>  7     213500  -5858. -2805.  114.
#>  8     191500  -5932. -2762.  131.
#>  9     236500  -6368. -2862.  325.
#> 10     189000  -8368. -2219.  126.
#> # ℹ 2,920 more rows
tictoc::toc()
#> 0.782 sec elapsed
tictoc::tic()
prep(pca_truncated) |> bake(ames)
#> # A tibble: 2,930 × 4
#>    Sale_Price     PC1    PC2   PC3
#>         <int>   <dbl>  <dbl> <dbl>
#>  1     215000 -31793.  4151. -197.
#>  2     105000 -12198.  -611. -524.
#>  3     172000 -14911.  -265. 7568.
#>  4     244000 -12072. -1813.  918.
#>  5     189900 -14418.  -345. -302.
#>  6     195500 -10704. -1367. -204.
#>  7     213500  -5858. -2805.  114.
#>  8     191500  -5932. -2762.  131.
#>  9     236500  -6368. -2862.  325.
#> 10     189000  -8368. -2219.  126.
#> # ℹ 2,920 more rows
tictoc::toc()
#> 0.162 sec elapsed

The speedup will be orders of magnitude larger for very wide data.
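
To see that effect, you can time the truncated step on a deliberately wide data set. This is a hypothetical benchmark: the 1,000 x 5,000 matrix of random noise below is made up purely for illustration, and swapping in step_pca() lets you compare the two approaches yourself:

set.seed(1234)
# 1,000 rows by 5,000 columns of random noise
wide <- as.data.frame(matrix(rnorm(1000 * 5000), nrow = 1000))

tictoc::tic()
recipe(~., data = wide) |>
  step_pca_truncated(all_numeric_predictors(), num_comp = 3) |>
  prep() |>
  bake(wide)
tictoc::toc()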

Acknowledgements

We’d like to thank those in the community who contributed to tidymodels in the last quarter:

We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!