Q4 2023 tidymodels digest

  tidymodels, recipes

  Emil Hvitfeldt

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like these posts from the past couple of months:

Since our last roundup post, there have been CRAN releases of 7 tidymodels packages. Here are links to their NEWS files:

We’ll highlight a few especially notable changes below: updated warnings when normalizing, and better error messages in recipes.

library(tidymodels)

data("ames", package = "modeldata")

Updated warnings when normalizing

The latest release of recipes features an overhaul of the warnings and error messages to use the cli package. With this release, we are starting a broader effort to provide more informative signals when things don’t go well.

The first type of issue we now signal is an attempt to normalize data that contains non-finite values such as NA or Inf. These can sneak in for several reasons, and before this release they did so silently. Below we create a recipe using the ames data set; before we normalize, we take the logarithm of all variables that pertain to square footage.

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF")) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
#> Warning: Columns `BsmtFin_SF_1`, `BsmtFin_SF_2`, `Bsmt_Unf_SF`, `Total_Bsmt_SF`,
#> `Second_Flr_SF`, `Wood_Deck_SF`, and `Open_Porch_SF` returned NaN, because
#> variance cannot be calculated and scaling cannot be used. Consider avoiding
#> `Inf` or `-Inf` values and/or setting `na_rm = TRUE` before normalizing.

We now get a warning that something happened, telling us that Inf or -Inf values were encountered and the affected columns could not be scaled. Knowing that, we can go back and investigate what went wrong. If we exclude step_normalize() and bake() the recipe, we see that a number of -Inf values appear.

recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF")) |>
  prep() |>
  bake(new_data = NULL, contains("SF")) |>
  glimpse()
#> Rows: 2,930
#> Columns: 8
#> $ BsmtFin_SF_1  <dbl> 0.6931472, 1.7917595, 0.0000000, 0.0000000, 1.0986123, 1…
#> $ BsmtFin_SF_2  <dbl> -Inf, 4.969813, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf, -Inf…
#> $ Bsmt_Unf_SF   <dbl> 6.089045, 5.598422, 6.006353, 6.951772, 4.919981, 5.7807…
#> $ Total_Bsmt_SF <dbl> 6.984716, 6.782192, 7.192182, 7.654443, 6.833032, 6.8308…
#> $ First_Flr_SF  <dbl> 7.412160, 6.797940, 7.192182, 7.654443, 6.833032, 6.8308…
#> $ Second_Flr_SF <dbl> -Inf, -Inf, -Inf, -Inf, 6.552508, 6.519147, -Inf, -Inf, …
#> $ Wood_Deck_SF  <dbl> 5.347108, 4.941642, 5.973810, -Inf, 5.356586, 5.886104, …
#> $ Open_Porch_SF <dbl> 4.127134, -Inf, 3.583519, -Inf, 3.526361, 3.583519, -Inf…

Looking at the bare data set, we notice that the -Inf values all appear where the original data contains 0, which makes sense since log(0) returns -Inf.

ames |>
  select(contains("SF")) |>
  glimpse()
#> Rows: 2,930
#> Columns: 8
#> $ BsmtFin_SF_1  <dbl> 2, 6, 1, 1, 3, 3, 3, 1, 3, 7, 7, 1, 7, 3, 3, 1, 3, 3, 4,…
#> $ BsmtFin_SF_2  <dbl> 0, 144, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1120, 0, 0, …
#> $ Bsmt_Unf_SF   <dbl> 441, 270, 406, 1045, 137, 324, 722, 1017, 415, 994, 763,…
#> $ Total_Bsmt_SF <dbl> 1080, 882, 1329, 2110, 928, 926, 1338, 1280, 1595, 994, …
#> $ First_Flr_SF  <int> 1656, 896, 1329, 2110, 928, 926, 1338, 1280, 1616, 1028,…
#> $ Second_Flr_SF <int> 0, 0, 0, 0, 701, 678, 0, 0, 0, 776, 892, 0, 676, 0, 0, 1…
#> $ Wood_Deck_SF  <int> 210, 140, 393, 0, 212, 360, 0, 0, 237, 140, 157, 483, 0,…
#> $ Open_Porch_SF <int> 62, 0, 36, 0, 34, 36, 0, 82, 152, 60, 84, 21, 75, 0, 54,…

Knowing that it was 0 that caused the problem, we can set an offset to avoid taking log(0).

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_log(contains("SF"), offset = 0.5) |>
  step_normalize(all_numeric_predictors()) |>
  prep()
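The offset is added to each value before the logarithm is taken, which is why it removes the -Inf values; a quick check in the console shows the effect:

```r
log(0)       # -Inf, the source of the problem
log(0 + 0.5) # -0.6931472, finite once the offset is applied
```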

These warnings appear in step_scale(), step_normalize(), step_center(), and step_range().
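The na_rm option mentioned in the warning applies when missing values, rather than infinite ones, are the culprit. As a sketch (assuming your numeric predictors contain NA but no Inf), the centering and scaling statistics can be computed while ignoring missing values:

```r
library(tidymodels)

data("ames", package = "modeldata")

# Sketch: ignore NA values when estimating means and standard deviations.
# This does not help with Inf/-Inf, which should be fixed upstream.
rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_normalize(all_numeric_predictors(), na_rm = TRUE) |>
  prep()
```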

Better error messages in recipes

Another problem that comes up a lot when using recipes is accidentally selecting variables of the wrong type. Previously, this produced the following error:

recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(starts_with("Lot_")) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error in `prep()`:
#> ! All columns selected for the step should be string, factor, or ordered.

In the newest release, it will detail the offending variables and what was wrong with them.

recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(starts_with("Lot_")) |>
  prep()
#> Error in `step_dummy()`:
#> Caused by error in `prep()`:
#> ✖ All columns selected for the step should be factor or ordered.
#> • 1 double variable found: `Lot_Frontage`
#> • 1 integer variable found: `Lot_Area`
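If the intent was to make dummy variables only from the qualitative Lot_ columns, one possible fix (a sketch using tidyselect's `&` to intersect two selections) is to restrict the selection to nominal variables:

```r
library(tidymodels)

data("ames", package = "modeldata")

# Sketch: combine selectors so only factor/ordered Lot_ columns are kept,
# leaving the numeric Lot_ columns out of the dummy step.
rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_dummy(starts_with("Lot_") & all_nominal_predictors()) |>
  prep()
```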

Coming Attractions

In the next month or so we are planning a cascade of CRAN releases. There is a lot of new functionality coming your way, especially in the tune package.

A number of our packages will (finally) be able to cohesively fit, evaluate, tune, and predict models for event times (a.k.a., survival analysis). If you don’t do this type of work, you might not notice the new capabilities. However, if you do, tidymodels will be able to do a lot more for you.

We’ve also implemented a number of features related to model fairness. These tools allow tidymodels users to identify when machine learning models behave unfairly towards certain groups of people, and will also be included in the upcoming releases of tidymodels packages in Q1.

We’ll highlight a lot of these new capabilities in blog posts here as well as tutorials on tidymodels.org.

So, there’s a lot more coming! We are very excited to have these features officially available and to see what people can do with them.

Acknowledgements

We’d like to thank those in the community who contributed to tidymodels in the last quarter:

We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!