We’re thrilled to announce the release of recipes 1.3.0. recipes lets you create a pipeable sequence of feature engineering steps.
You can install it from CRAN with:
install.packages("recipes")
This blog post will walk through some of the highlights of this release, which includes changes to how strings_as_factors is specified, the deprecation of step_select(), a new contrasts argument for step_dummy(), and improvements for step_impute_bag().
You can see a full list of changes in the release notes.
Let’s first load the package:
library(recipes)
strings_as_factors
Recipes by default convert predictor strings to factors, and the option for that was located in prep(). This caused an issue when you wanted to set strings_as_factors = FALSE for a recipe that is used somewhere else, like in a workflow.
This is no longer an issue, as we have moved the argument to recipe() itself. At the same time, we are deprecating the use of strings_as_factors in prep(). Here is an example:
library(modeldata)
tate_text
#> # A tibble: 4,284 × 5
#> id artist title medium year
#> <dbl> <fct> <chr> <fct> <dbl>
#> 1 21926 Absalon Proposals for a Habitat Video… 1990
#> 2 20472 Auerbach, Frank Michael Etchi… 1990
#> 3 20474 Auerbach, Frank Geoffrey Etchi… 1990
#> 4 20473 Auerbach, Frank Jake Etchi… 1990
#> 5 20513 Auerbach, Frank To the Studios Oil p… 1990
#> 6 21389 Ayres, OBE Gillian Phaëthon Oil p… 1990
#> 7 121187 Barlow, Phyllida Untitled Acryl… 1990
#> 8 19455 Baselitz, Georg Green VIII Woodc… 1990
#> 9 20938 Beattie, Basil Present Bound Oil p… 1990
#> 10 105941 Beuys, Joseph Joseph Beuys: A Private Collection. A… Print… 1990
#> # ℹ 4,274 more rows
We are loading the modeldata package to get tate_text, which has a character column title. If we don’t do anything, it turns into a factor.
recipe(~., data = tate_text) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#> id artist title medium year
#> <dbl> <fct> <fct> <fct> <dbl>
#> 1 21926 Absalon Proposals for a Habitat Video… 1990
#> 2 20472 Auerbach, Frank Michael Etchi… 1990
#> 3 20474 Auerbach, Frank Geoffrey Etchi… 1990
#> 4 20473 Auerbach, Frank Jake Etchi… 1990
#> 5 20513 Auerbach, Frank To the Studios Oil p… 1990
#> 6 21389 Ayres, OBE Gillian Phaëthon Oil p… 1990
#> 7 121187 Barlow, Phyllida Untitled Acryl… 1990
#> 8 19455 Baselitz, Georg Green VIII Woodc… 1990
#> 9 20938 Beattie, Basil Present Bound Oil p… 1990
#> 10 105941 Beuys, Joseph Joseph Beuys: A Private Collection. A… Print… 1990
#> # ℹ 4,274 more rows
But we can set strings_as_factors = FALSE in recipe() and title will stay a character column.
recipe(~., data = tate_text, strings_as_factors = FALSE) |>
  prep() |>
  bake(tate_text)
#> # A tibble: 4,284 × 5
#> id artist title medium year
#> <dbl> <fct> <chr> <fct> <dbl>
#> 1 21926 Absalon Proposals for a Habitat Video… 1990
#> 2 20472 Auerbach, Frank Michael Etchi… 1990
#> 3 20474 Auerbach, Frank Geoffrey Etchi… 1990
#> 4 20473 Auerbach, Frank Jake Etchi… 1990
#> 5 20513 Auerbach, Frank To the Studios Oil p… 1990
#> 6 21389 Ayres, OBE Gillian Phaëthon Oil p… 1990
#> 7 121187 Barlow, Phyllida Untitled Acryl… 1990
#> 8 19455 Baselitz, Georg Green VIII Woodc… 1990
#> 9 20938 Beattie, Basil Present Bound Oil p… 1990
#> 10 105941 Beuys, Joseph Joseph Beuys: A Private Collection. A… Print… 1990
#> # ℹ 4,274 more rows
This change should also make pragmatic sense, as whether you want to turn strings into factors is something that should be encoded into the recipe itself.
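To make the motivation concrete, here is a minimal sketch on a toy data frame. The helper prep_elsewhere() is hypothetical and only stands in for downstream code, such as a workflow, that calls prep() on your behalf without exposing its arguments. It assumes recipes 1.3.0 or later.

```r
library(recipes)

# Toy data with one character column (illustrative, not from the post).
toy <- data.frame(
  label = c("a", "b", "a"),
  value = c(1, 2, 3),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for code that preps a recipe for you.
prep_elsewhere <- function(rec) {
  prep(rec)
}

rec_default <- recipe(~ ., data = toy)
rec_keep_chr <- recipe(~ ., data = toy, strings_as_factors = FALSE)

# The setting travels with the recipe, so the helper needs no extra argument.
class_default <- class(bake(prep_elsewhere(rec_default), new_data = NULL)$label)
class_keep_chr <- class(bake(prep_elsewhere(rec_keep_chr), new_data = NULL)$label)

class_default
class_keep_chr
```

Here class_default should be "factor" and class_keep_chr should be "character", even though the code that calls prep() knows nothing about the option.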
Deprecating step_select()
We have started the process of deprecating step_select(). Given the number of issues people are having with the step and the fact that it doesn’t play well with workflows, we think this is the right call.
There are two main use cases where step_select() was used: removing variables and selecting variables. Removing variables was done with - in step_select():
recipe(mpg ~ ., mtcars) |>
  step_select(-starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#> cyl hp wt qsec vs am gear carb mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 110 2.62 16.5 0 1 4 4 21
#> 2 6 110 2.88 17.0 0 1 4 4 21
#> 3 4 93 2.32 18.6 1 1 4 1 22.8
#> 4 6 110 3.22 19.4 1 0 3 1 21.4
#> 5 8 175 3.44 17.0 0 0 3 2 18.7
#> 6 6 105 3.46 20.2 1 0 3 1 18.1
#> 7 8 245 3.57 15.8 0 0 3 4 14.3
#> 8 4 62 3.19 20 1 0 4 2 24.4
#> 9 4 95 3.15 22.9 1 0 4 2 22.8
#> 10 6 123 3.44 18.3 1 0 4 4 19.2
#> # ℹ 22 more rows
These use cases can seamlessly be converted to use step_rm() without the - for the same result.
recipe(mpg ~ ., mtcars) |>
  step_rm(starts_with("d")) |>
  prep() |>
  bake(new_data = NULL)
#> # A tibble: 32 × 9
#> cyl hp wt qsec vs am gear carb mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 110 2.62 16.5 0 1 4 4 21
#> 2 6 110 2.88 17.0 0 1 4 4 21
#> 3 4 93 2.32 18.6 1 1 4 1 22.8
#> 4 6 110 3.22 19.4 1 0 3 1 21.4
#> 5 8 175 3.44 17.0 0 0 3 2 18.7
#> 6 6 105 3.46 20.2 1 0 3 1 18.1
#> 7 8 245 3.57 15.8 0 0 3 4 14.3
#> 8 4 62 3.19 20 1 0 4 2 24.4
#> 9 4 95 3.15 22.9 1 0 4 2 22.8
#> 10 6 123 3.44 18.3 1 0 4 4 19.2
#> # ℹ 22 more rows
For selecting variables there are two cases. The first is as a tool to select which variables to use in our model. We recommend that you use select() to do that before passing the data into recipe(). This is especially helpful since recipes are stricter with respect to their input types, so it pays to pass only the data you need.
The second is when you need to do the selection after another step takes effect. You should still be able to do so by using step_rm() in the following manner.
step_rm(recipe, all_predictors(), -all_of(<variables that you want to keep>))
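As a concrete sketch of this pattern, the example below keeps two predictors after a normalization step. The vector keep and the choice of step_normalize() as the earlier step are illustrative assumptions, not part of the release.

```r
library(recipes)

# Hypothetical choice of predictors to keep.
keep <- c("cyl", "hp")

kept <- recipe(mpg ~ ., data = mtcars) |>
  # An earlier step that has to run before the selection.
  step_normalize(all_numeric_predictors()) |>
  # Remove every predictor except the ones named in `keep`.
  step_rm(all_predictors(), -all_of(keep)) |>
  prep() |>
  bake(new_data = NULL)

names(kept)
```

The outcome mpg is untouched because all_predictors() only selects columns with the predictor role, so the result should contain cyl, hp, and mpg.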
step_dummy() contrasts argument
Contrasts such as contr.treatment() and contr.poly() are used in step_dummy() to determine how the step should translate categorical values into one or more numeric columns. Traditionally the contrasts were set using options() like so:
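# Set the contrasts globally (the traditional approach).
options(contrasts = c(unordered = "contr.poly", ordered = "contr.poly"))
recipe(~species + island, penguins) |>
  step_dummy(all_nominal_predictors()) |>
  prep() |>
  bake(new_data = penguins)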
#> # A tibble: 344 × 4
#> species_Chinstrap species_Gentoo island_Dream island_Torgersen
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.707 0.408 0.707 0.408
#> 2 -0.707 0.408 0.707 0.408
#> 3 -0.707 0.408 0.707 0.408
#> 4 -0.707 0.408 0.707 0.408
#> 5 -0.707 0.408 0.707 0.408
#> 6 -0.707 0.408 0.707 0.408
#> 7 -0.707 0.408 0.707 0.408
#> 8 -0.707 0.408 0.707 0.408
#> 9 -0.707 0.408 0.707 0.408
#> 10 -0.707 0.408 0.707 0.408
#> # ℹ 334 more rows
The issue with this approach is that it pulls from options() when it needs it instead of storing the information. This means that if you put this recipe in production, you will need to set the option in the production environment to match that of the training environment.
To fix this issue we have given step_dummy() a contrasts argument that works in much the same way. You simply specify the contrast you want and it will be stored in the object for easy deployment.
recipe(~species + island, penguins) |>
  step_dummy(all_nominal_predictors(), contrasts = "contr.poly") |>
  prep() |>
  bake(new_data = penguins)
#> # A tibble: 344 × 4
#> species_Chinstrap species_Gentoo island_Dream island_Torgersen
#> <dbl> <dbl> <dbl> <dbl>
#> 1 -0.707 0.408 0.707 0.408
#> 2 -0.707 0.408 0.707 0.408
#> 3 -0.707 0.408 0.707 0.408
#> 4 -0.707 0.408 0.707 0.408
#> 5 -0.707 0.408 0.707 0.408
#> 6 -0.707 0.408 0.707 0.408
#> 7 -0.707 0.408 0.707 0.408
#> 8 -0.707 0.408 0.707 0.408
#> 9 -0.707 0.408 0.707 0.408
#> 10 -0.707 0.408 0.707 0.408
#> # ℹ 334 more rows
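One way to convince yourself that the contrast is stored in the step is to bake the same prepped recipe under two different global settings. This is a minimal sketch on a toy factor, assuming recipes 1.3.0 or later; the data and the choice of contr.helmert and contr.sum are illustrative.

```r
library(recipes)

# Toy factor with three levels (illustrative, not from the post).
toy <- data.frame(group = factor(c("a", "b", "c", "a")))

rec <- recipe(~ group, data = toy) |>
  step_dummy(group, contrasts = "contr.helmert") |>
  prep()

baked_training_env <- bake(rec, new_data = toy)

# Simulate a "production" session whose global contrasts differ.
old_opts <- options(contrasts = c(unordered = "contr.sum", ordered = "contr.poly"))
baked_production_env <- bake(rec, new_data = toy)
options(old_opts) # restore the previous setting

# Expected to be TRUE if the contrast is stored in the step as described above.
identical(baked_training_env, baked_production_env)
```

If the two results match, the recipe no longer depends on options() being set consistently across environments.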
If you are using a contrast from an external package, such as hardhat::contr_one_hot(), you will need to load the package with library(hardhat) in the environment you are working in and set contrasts = "contr_one_hot". You will also need to call library(hardhat) in any production environment where you use this recipe.
tidyselect can be used everywhere
Several steps, such as step_pls() and step_impute_bag(), require the selection of more than just the affected columns. step_pls() needs you to select an outcome variable, and step_impute_bag() needs you to select which variables to impute with, via impute_with, if you don’t want to use all predictors. Previously, these needed to be strings or use special selectors like imp_vars(). You don’t have to do that anymore. You can now use tidyselect in these arguments too.
recipe(mpg ~ ., mtcars) |>
  step_pls(all_predictors(), outcome = mpg) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 3
#> mpg PLS1 PLS2
#> <dbl> <dbl> <dbl>
#> 1 21 0.693 0.895
#> 2 21 0.650 0.654
#> 3 22.8 2.78 0.378
#> 4 21.4 0.210 -0.368
#> 5 18.7 -1.95 0.845
#> 6 18.1 0.137 -0.624
#> 7 14.3 -2.77 0.364
#> 8 24.4 1.81 -1.30
#> 9 22.8 2.12 -1.95
#> 10 19.2 0.531 -1.51
#> # ℹ 22 more rows
Arguments that allow for multiple selections now also work with recipes selectors like all_numeric_predictors() and has_role().
recipe(mpg ~ ., mtcars) |>
  step_impute_bag(all_predictors(), impute_with = has_role("predictor")) |>
  prep() |>
  bake(new_data = mtcars)
#> # A tibble: 32 × 11
#> cyl disp hp drat wt qsec vs am gear carb mpg
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 6 160 110 3.9 2.62 16.5 0 1 4 4 21
#> 2 6 160 110 3.9 2.88 17.0 0 1 4 4 21
#> 3 4 108 93 3.85 2.32 18.6 1 1 4 1 22.8
#> 4 6 258 110 3.08 3.22 19.4 1 0 3 1 21.4
#> 5 8 360 175 3.15 3.44 17.0 0 0 3 2 18.7
#> 6 6 225 105 2.76 3.46 20.2 1 0 3 1 18.1
#> 7 8 360 245 3.21 3.57 15.8 0 0 3 4 14.3
#> 8 4 147. 62 3.69 3.19 20 1 0 4 2 24.4
#> 9 4 141. 95 3.92 3.15 22.9 1 0 4 2 22.8
#> 10 6 168. 123 3.92 3.44 18.3 1 0 4 4 19.2
#> # ℹ 22 more rows
These changes are backwards compatible, meaning that the old ways still work with minimal warnings.
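Since mtcars has no missing values, the example above does not actually impute anything. The sketch below introduces two missing weights and selects the imputation predictors with plain tidyselect syntax. It assumes recipes 1.3.0 or later with the ipred package installed for the bagged trees; the data modification and the choice of disp and hp are illustrative.

```r
library(recipes)

# mtcars with two artificially missing weights (illustrative).
cars_na <- mtcars
cars_na$wt[c(2, 5)] <- NA

rec <- recipe(mpg ~ ., data = cars_na) |>
  # Impute `wt` from two predictors chosen by bare column name.
  step_impute_bag(wt, impute_with = c(disp, hp)) |>
  prep()

imputed <- bake(rec, new_data = cars_na)

# Check whether any missing weights remain after imputation.
anyNA(imputed$wt)
```

The imputed values themselves vary from run to run because the bagged trees are fit on bootstrap samples, so only the absence of missing values is checked here.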
step_impute_bag() now takes up less memory
We have another benefit for users of step_impute_bag(). For each variable it imputes, it fits a bagged tree model, which is then used to predict the missing values. It was noticed that these models had a larger memory footprint than was needed. This has been remedied, so there should now be a noticeable decrease in size for recipes with step_impute_bag().
rec <- recipe(Sale_Price ~ ., data = ames) |>
  step_impute_bag(starts_with("Lot_"), impute_with = all_numeric_predictors()) |>
  prep()
lobstr::obj_size(rec)
#> 20.23 MB
This recipe previously took up over 75 MB and now takes up about 20 MB.
Acknowledgements
Many thanks to all the people who contributed to recipes since the last release!
@chillerb, @dshemetov, @EmilHvitfeldt, @kevbaer, @nhward, @regisely, and @topepo.