recipes 0.1.14

  tidymodels, recipes

  Max Kuhn

We’re stoked to announce the release of recipes 0.1.14. recipes is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization.

You can install it from CRAN with:

install.packages("recipes")

You can see a full list of changes in the release notes. This post covers the most notable improvements and changes.

An alternative to juice()

Now that we have taught with recipes for a few years, we’ve realized that there is a lot of confusion about the differences between juice() and bake():

  • juice(recipe) returns the preprocessed training set (at very low computational cost).
  • bake(recipe, new_data) applies the recipe to any data set (e.g., training, testing, or unknowns).

We were not able to find ways to make this distinction clear for many users.

How could we solve this issue? We decided to come up with a better alternative to juice() that would be more intuitive. As a result, all applications of the recipe can now use bake():

  • bake(recipe, new_data = some_data_set) works as before.
  • bake(recipe, new_data = NULL) now returns the preprocessed training set.

This has precedent in base R: many predict() methods re-predict the training set when the newdata argument is NULL or missing. Note that there is no default for new_data; you have to set it to NULL explicitly to get the training set.
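As a minimal sketch of the equivalence, the following uses the built-in mtcars data and a single normalization step. Both are chosen here only for illustration and are not part of the release itself. The two calls should return the same preprocessed training set:

```r
library(recipes)

# Illustrative recipe on mtcars (an assumption for this sketch):
# center and scale all predictors, then estimate the step on the training set.
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  prep()

# New interface: ask bake() for the preprocessed training set.
train_baked <- bake(rec, new_data = NULL)

# Old interface: juice() is still available.
train_juiced <- juice(rec)

# Compare the two results.
all.equal(train_baked, train_juiced)
```

Because new_data has no default, calling bake(rec) without that argument is an error rather than a silent request for the training set.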

We felt that this was the best API change that we could make. An external poll showed some agreement:

[Figure: poll results]

juice(), which is still my favorite R function name of all time, will not be removed; you can still use it. However, we will not use it in training materials or most documentation.

Imputation with linear models

Tim Zhou contributed a step to use linear models for imputation. This is a nice, compact method for adding an imputation equation for numeric predictors into the recipe. The syntax is similar to the existing imputation steps. Here’s an example from the Ames data:

library(tidymodels)
data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)

# Set some of the latitude values to be missing: 

set.seed(393)
ames_missing <- ames
ames_missing$Latitude[sample(1:nrow(ames), 200)] <- NA

We might be able to reasonably approximate the missing values based on the other geographic predictors (Longitude and Neighborhood) as well as a few aspects of the houses (e.g., MS_Zoning and Alley). A linear model is created with these predictors in order to estimate the missing Latitude data:

imputed_ames <-
  recipe(Sale_Price ~ ., data = ames_missing) %>%
  step_impute_linear(
    Latitude,
    impute_with = imp_vars(Longitude, Neighborhood, MS_Zoning, Alley), 
    id = "lm-imp"
  ) %>%
  prep(ames_missing)

This plot shows the missing data’s true values on the x-axis and their imputed values on the y-axis:

[Figure: true Latitude values (x-axis) versus imputed values (y-axis)]
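A comparison like this could be built roughly as follows. This is a sketch rather than the code behind the figure above. It continues from the ames, ames_missing, and imputed_ames objects created earlier, and the names was_missing, imputed_values, and comparison are introduced here only for illustration:

```r
library(tidymodels)

# Rows whose Latitude was artificially set to missing.
was_missing <- is.na(ames_missing$Latitude)

# The preprocessed training set, keeping only the imputed column.
imputed_values <- bake(imputed_ames, new_data = NULL, Latitude)

# Pair each true value with its imputed counterpart.
comparison <- data.frame(
  true    = ames$Latitude[was_missing],
  imputed = imputed_values$Latitude[was_missing]
)

# Points near the diagonal line indicate accurate imputations.
ggplot(comparison, aes(x = true, y = imputed)) +
  geom_abline(col = "green", alpha = 0.5) +
  geom_point(alpha = 0.3) +
  coord_equal()
```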

In future versions, we will standardize on the naming convention step_impute_*(). The existing functions will be soft-deprecated for a reasonable time period to ensure backward compatibility.

Other changes

This version of the package adds a logging option for prep() that prints information on how the data differ before and after each step is prepared:

ames_rec <- recipe(Sale_Price ~ ., data = ames) %>%
  step_BoxCox(Lot_Area, Gr_Liv_Area) %>%
  step_other(Neighborhood, threshold = 0.05)  %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5) %>% 
  prep(log_changes = TRUE)
## step_BoxCox (BoxCox_KMeZW): same number of columns
## 
## step_other (other_b4CM3): same number of columns
## 
## step_dummy (dummy_q3sI4): 
##  new (223): MS_SubClass_One_Story_1945_and_Older, ...
##  removed (40): MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, ...
## 
## step_interact (interact_xjtSG): 
##  new (1): Central_Air_Y_x_Year_Built
## 
## step_ns (ns_pfpld): 
##  new (10): Longitude_ns_1, Longitude_ns_2, Longitude_ns_3, ...
##  removed (2): Longitude, Latitude

Another important change was behind the scenes. Before, there were problems with using PSOCK clusters on Windows because the worker processes were not aware of all the packages that should be loaded. Now, recipes ensures that all of the packages required by each step will be accessible in parallel. A similar change is coming soon to the parsnip package.

Acknowledgements

Thanks to those users who filed issues or contributed a pull request since the previous release: @AndrewKostandy, @anks7190, @AshesITR, @Bijaelo, @brodz, @DavisVaughan, @dgkf, @EllaKaye, @EmilHvitfeldt, @hamedbh, @hnagaty, @irkaal, @jerome-laurent-pro, @juliasilge, @karaesmen, @kylegilde, @LordRudolf, @lorenzwalthert, @mattwarkentin, @mpettis, @mt-edwards, @nhward, @Nilafhiosagam, @NRaillard, @Paul-Yuchao-Dong, @perluna, @RaminZi, @rorynolan, @Steviey, @topepo, and @ttzhou.