recipes 1.3.0

  recipes, tidymodels

  Emil Hvitfeldt

We’re thrilled to announce the release of recipes 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps.

You can install it from CRAN with:

install.packages("recipes")

This blog post will walk through some of the highlights of this release, which include changes to how strings_as_factors is specified, the deprecation of step_select(), a new contrasts argument for step_dummy(), broader tidyselect support, and improvements to step_impute_bag().

You can see a full list of changes in the release notes.

Let’s first load the package:

library(recipes)

strings_as_factors

By default, recipes converts predictor strings to factors, and the option controlling this used to live in prep(). This caused an issue when you wanted to set strings_as_factors = FALSE for a recipe that is used somewhere else, such as in a workflow, where you don’t call prep() yourself.

This is no longer an issue as we have moved the argument to recipe() itself. We are at the same time deprecating the use of strings_as_factors when used in prep(). Here is an example:

library(modeldata)
tate_text
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>     <dbl> <fct>              <chr>                                  <fct>  <dbl>
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

We are loading the modeldata package to get tate_text, which has a character column, title. If we don’t do anything, that column turns into a factor.

recipe(~., data = tate_text) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>     <dbl> <fct>              <fct>                                  <fct>  <dbl>
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

But if we set strings_as_factors = FALSE in recipe(), title stays a character column.

recipe(~., data = tate_text, strings_as_factors = FALSE) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#>        id artist             title                                  medium  year
#>     <dbl> <fct>              <chr>                                  <fct>  <dbl>
#>  1  21926 Absalon            Proposals for a Habitat                Video…  1990
#>  2  20472 Auerbach, Frank    Michael                                Etchi…  1990
#>  3  20474 Auerbach, Frank    Geoffrey                               Etchi…  1990
#>  4  20473 Auerbach, Frank    Jake                                   Etchi…  1990
#>  5  20513 Auerbach, Frank    To the Studios                         Oil p…  1990
#>  6  21389 Ayres, OBE Gillian Phaëthon                               Oil p…  1990
#>  7 121187 Barlow, Phyllida   Untitled                               Acryl…  1990
#>  8  19455 Baselitz, Georg    Green VIII                             Woodc…  1990
#>  9  20938 Beattie, Basil     Present Bound                          Oil p…  1990
#> 10 105941 Beuys, Joseph      Joseph Beuys: A Private Collection. A… Print…  1990
#> # ℹ 4,274 more rows

This change should also make pragmatic sense, as whether you want to turn strings into factors is something that should be encoded in the recipe itself.
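To make the difference easy to check programmatically, here is a minimal sketch that compares the column class under both settings. The toy data frame and the object names (toy, baked_default, baked_chr) are made up for illustration; the sketch assumes recipes 1.3.0 or later is installed.

```r
library(recipes)

# Hypothetical toy data: one character column and one numeric column
toy <- data.frame(
  label = c("a", "b", "a"),
  value = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

# Default behaviour: character predictors are converted to factors
baked_default <- recipe(~ ., data = toy) |>
  prep() |>
  bake(new_data = NULL)

# Opting out in recipe() itself keeps the column as character
baked_chr <- recipe(~ ., data = toy, strings_as_factors = FALSE) |>
  prep() |>
  bake(new_data = NULL)

class(baked_default$label)
class(baked_chr$label)
```

Because the setting now travels with the recipe, anything that preps the recipe on your behalf should see the same choice.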

Deprecating step_select()

We have started the process of deprecating step_select(). Given the number of issues people are having with the step, and the fact that it doesn’t play well with workflows, we think this is the right call.

There are two main use cases for step_select(): removing variables and selecting variables. Removing variables was done with - in step_select():

recipe(mpg ~ ., mtcars) |>
  step_select(-starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These use cases can seamlessly be converted to use step_rm() without the - for the same result.

recipe(mpg ~ ., mtcars) |>
  step_rm(starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#>      cyl    hp    wt  qsec    vs    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6   110  2.62  16.5     0     1     4     4  21  
#>  2     6   110  2.88  17.0     0     1     4     4  21  
#>  3     4    93  2.32  18.6     1     1     4     1  22.8
#>  4     6   110  3.22  19.4     1     0     3     1  21.4
#>  5     8   175  3.44  17.0     0     0     3     2  18.7
#>  6     6   105  3.46  20.2     1     0     3     1  18.1
#>  7     8   245  3.57  15.8     0     0     3     4  14.3
#>  8     4    62  3.19  20       1     0     4     2  24.4
#>  9     4    95  3.15  22.9     1     0     4     2  22.8
#> 10     6   123  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use select() to do that before passing the data into recipe(). This is especially helpful since recipes are now stricter about their input types, so passing only the data you need avoids unnecessary errors.

If you need to do the selection after another step takes effect you should still be able to do so, by using step_rm() in the following manner.

step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))
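As a concrete sketch of that pattern, the example below normalizes the predictors and then keeps only two of them. The keep vector and the choice of step_normalize() as the earlier step are assumptions made purely for illustration.

```r
library(recipes)

# Hypothetical choice of predictors to keep after earlier steps have run
keep <- c("disp", "hp")

baked <- recipe(mpg ~ ., data = mtcars) |>
  step_normalize(all_numeric_predictors()) |>
  # Remove every predictor except the ones named in `keep`
  step_rm(all_predictors(), -all_of(keep)) |>
  prep() |>
  bake(new_data = NULL)

names(baked)
```

The outcome mpg is untouched because all_predictors() never selects it, so the baked data should contain only disp, hp, and mpg.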

step_dummy() contrasts argument

Contrasts such as contr.treatment() and contr.poly() are used in step_dummy() to determine how the step should translate categorical values into one or more numeric columns. Traditionally the contrasts were set using options() like so:

options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly"))
recipe(~species + island, penguins) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                <dbl>          <dbl>        <dbl>            <dbl>
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows

The issue with this approach is that it pulls from options() when it needs it instead of storing the information. This means that if you put this recipe in production you will need to set the option in the production environment to match that of the training environment.

To fix this issue, we have given step_dummy() a contrasts argument that works in much the same way. You simply specify the contrast you want, and it will be stored in the object for easy deployment.

recipe(~species + island, penguins) |>
  step_dummy(
    all_nominal_predictors(), contrasts = "contr.poly") |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#>    species_Chinstrap species_Gentoo island_Dream island_Torgersen
#>                <dbl>          <dbl>        <dbl>            <dbl>
#>  1            -0.707          0.408        0.707            0.408
#>  2            -0.707          0.408        0.707            0.408
#>  3            -0.707          0.408        0.707            0.408
#>  4            -0.707          0.408        0.707            0.408
#>  5            -0.707          0.408        0.707            0.408
#>  6            -0.707          0.408        0.707            0.408
#>  7            -0.707          0.408        0.707            0.408
#>  8            -0.707          0.408        0.707            0.408
#>  9            -0.707          0.408        0.707            0.408
#> 10            -0.707          0.408        0.707            0.408
#> # ℹ 334 more rows
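To see where these numbers come from, here is a small sketch on a made-up three-level factor. It leaves the global contrasts option at its default and sets the contrast only through the step, then prints base R's polynomial contrast matrix for comparison. The toy data and the object names are hypothetical.

```r
library(recipes)

# Hypothetical toy data: one row per level of a three-level factor
toy <- data.frame(
  size = factor(
    c("small", "medium", "large"),
    levels = c("small", "medium", "large")
  )
)

# The contrast is given to the step; options(contrasts = ...) is not changed
baked <- recipe(~ size, data = toy) |>
  step_dummy(size, contrasts = "contr.poly") |>
  prep() |>
  bake(new_data = NULL)

baked

# Base R's polynomial contrast matrix for three levels, for comparison
contr.poly(3)
```

Each row of the baked data should line up with the corresponding row of contr.poly(3), which is why every Adelie penguin above gets the same pair of values.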

If you are using a contrast from an external package, such as hardhat::contr_one_hot(), you will need to load that package with library(hardhat) and set contrasts = "contr_one_hot". You will also need to call library(hardhat) in any production environment where you use this recipe.

tidyselect can be used everywhere

Several steps such as step_pls() and step_impute_bag() require the selection of more than just the affected columns. step_pls() needs you to select an outcome variable and step_impute_bag() needs you to select which variables to impute with, impute_with, if you don’t want to use all predictors. Previously these needed to be strings or use special selectors like imp_vars(). You don’t have to do that anymore. You can now use tidyselect in these arguments too.

recipe(mpg ~ ., mtcars) |>
  step_pls(all_predictors(), outcome = mpg) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 3
#>      mpg   PLS1   PLS2
#>    <dbl>  <dbl>  <dbl>
#>  1  21    0.693  0.895
#>  2  21    0.650  0.654
#>  3  22.8  2.78   0.378
#>  4  21.4  0.210 -0.368
#>  5  18.7 -1.95   0.845
#>  6  18.1  0.137 -0.624
#>  7  14.3 -2.77   0.364
#>  8  24.4  1.81  -1.30 
#>  9  22.8  2.12  -1.95 
#> 10  19.2  0.531 -1.51 
#> # ℹ 22 more rows

Arguments that allow multiple selections now also work with recipes selectors like all_numeric_predictors() and has_role().

recipe(mpg ~ ., mtcars) |>
  step_impute_bag(all_predictors(), impute_with = has_role("predictor")) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 11
#>      cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb   mpg
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1     6  160    110  3.9   2.62  16.5     0     1     4     4  21  
#>  2     6  160    110  3.9   2.88  17.0     0     1     4     4  21  
#>  3     4  108     93  3.85  2.32  18.6     1     1     4     1  22.8
#>  4     6  258    110  3.08  3.22  19.4     1     0     3     1  21.4
#>  5     8  360    175  3.15  3.44  17.0     0     0     3     2  18.7
#>  6     6  225    105  2.76  3.46  20.2     1     0     3     1  18.1
#>  7     8  360    245  3.21  3.57  15.8     0     0     3     4  14.3
#>  8     4  147.    62  3.69  3.19  20       1     0     4     2  24.4
#>  9     4  141.    95  3.92  3.15  22.9     1     0     4     2  22.8
#> 10     6  168.   123  3.92  3.44  18.3     1     0     4     4  19.2
#> # ℹ 22 more rows

These changes are backwards compatible, meaning that the old ways still work with minimal warnings.
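As a further sketch, the example below passes bare column names to impute_with. This assumes that the tidyselect support described above extends to c() with unquoted names, and that the ipred package, which step_impute_bag() relies on, is installed. The cars_na data set is made up by blanking out two values of wt.

```r
library(recipes)

# Hypothetical data: introduce a couple of missing values to impute
cars_na <- mtcars
cars_na$wt[c(2, 5)] <- NA

baked <- recipe(mpg ~ ., data = cars_na) |>
  # Impute wt from disp and hp only, selected with bare names
  step_impute_bag(wt, impute_with = c(disp, hp)) |>
  prep() |>
  bake(new_data = NULL)

# Check whether any missing values remain in the imputed column
anyNA(baked$wt)
```

Restricting impute_with to a few columns also keeps the fitted bagged trees smaller than imputing from all predictors.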

step_impute_bag() now takes up less memory

We have another benefit for users of step_impute_bag(). For each variable it imputes, it fits a bagged tree model, which is then used to predict the missing values. We noticed that these models had a larger memory footprint than needed. This has been remedied, so there should now be a noticeable decrease in size for recipes with step_impute_bag().

rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_impute_bag(starts_with("Lot_"), impute_with = all_numeric_predictors()) |>
  prep()

lobstr::obj_size(rec)
#> 20.23 MB

This recipe previously took up over 75 MB and now takes up about 20 MB.

Acknowledgements

Many thanks to all the people who contributed to recipes since the last release!

@chillerb, @dshemetov, @EmilHvitfeldt, @kevbaer, @nhward, @regisely, and @topepo.