We’re stoked to announce the release of recipes 0.1.14. recipes is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization.
You can install it from CRAN with install.packages("recipes").
You can see a full list of changes in the release notes. There are some improvements and changes to talk about.
An alternative to juice()
Now that we have taught with recipes for a few years, we’ve realized that there is a lot of confusion about the differences between juice() and bake():
juice(recipe) returns the preprocessed training set (at very low computational cost).
bake(recipe, new_data) applies the recipe to any data (e.g., training, testing, unknowns, etc.).
We were not able to find ways to make this distinction clear for many users.
How could we solve this issue? We decided to come up with a better alternative to juice() that would be more intuitive. As a result, all applications of the recipe can now use bake():
bake(recipe, new_data = some_data_set) works as before.
bake(recipe, new_data = NULL) now returns the preprocessed training set.
This has precedent in base R, since many predict() methods re-predict the training set when the newdata argument is NULL or missing. Note that there is no default for new_data; you have to set it to NULL to get the training set.
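The base R behavior mentioned above can be checked with a small linear model. This is a minimal sketch using only base R and the built-in mtcars data; the object names are illustrative.

```r
# Fit a small linear model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)

# With newdata missing, predict() re-predicts the training set
missing_preds <- predict(fit)

# Setting newdata to NULL does the same
null_preds <- predict(fit, newdata = NULL)

# Passing the training set explicitly gives matching predictions
explicit_preds <- predict(fit, newdata = mtcars)

all.equal(missing_preds, null_preds)
all.equal(missing_preds, explicit_preds)
```

One difference from predict(): bake() has no default for new_data, so the training-set case always has to be requested explicitly.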
We felt that this was the best API change that we could make. An external poll showed some agreement:
juice(), which is still my favorite R function name of all time, will not be removed; you can still use it. However, we will not use it in training materials or most documentation.
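To see how the two interfaces line up, here is a minimal sketch on the built-in mtcars data. The recipe and object names are illustrative assumptions, not taken from the release notes.

```r
library(recipes)

# A small prepared recipe; prep() retains the processed training set
rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  prep(training = mtcars)

# Older interface: pull the preprocessed training set
via_juice <- juice(rec)

# Newer interface: the same request goes through bake()
via_bake <- bake(rec, new_data = NULL)

all.equal(via_juice, via_bake)

# bake() on other data works as before
first_rows <- bake(rec, new_data = head(mtcars))
```

With this pattern, code that processes the training set and code that processes new data differ only in the value passed to new_data.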
Imputation with linear models
Tim Zhou contributed a step to use linear models for imputation. This is a nice, compact method for adding an imputation equation for numeric predictors into the recipe. The syntax is similar to the existing imputation steps. Here’s an example from the Ames data:
library(tidymodels)

data(ames)
ames$Sale_Price <- log10(ames$Sale_Price)

# Set some of the latitude values to be missing:
set.seed(393)
ames_missing <- ames
ames_missing$Latitude[sample(1:nrow(ames), 200)] <- NA
We might be able to reasonably approximate the missing values based on the other geographic predictors (Neighborhood) as well as a few aspects of the houses (e.g., Alley). A linear model is created with these predictors in order to estimate the missing Latitude values:
imputed_ames <-
  recipe(Sale_Price ~ ., data = ames_missing) %>%
  step_impute_linear(
    Latitude,
    impute_with = imp_vars(Longitude, Neighborhood, MS_Zoning, Alley),
    id = "lm-imp"
  ) %>%
  prep(ames_missing)
This plot shows the missing data’s true values on the x-axis and their imputed values on the y-axis:
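A comparison of this kind can be sketched on a smaller scale. The example below is an illustration, not the code behind the plot: it assumes the built-in mtcars data in place of Ames, imputes wt from disp and hp (an arbitrary choice), and uses illustrative object names throughout.

```r
library(recipes)
library(ggplot2)

# Copy mtcars and blank out some weights (illustrative; not the Ames data)
set.seed(393)
cars_missing <- mtcars
missing_rows <- sample(nrow(mtcars), 8)
cars_missing$wt[missing_rows] <- NA

# Impute wt with a linear model on disp and hp
imputed_cars <- recipe(mpg ~ ., data = cars_missing) %>%
  step_impute_linear(wt, impute_with = imp_vars(disp, hp)) %>%
  prep(cars_missing)

# The processed training set now holds imputed values in place of the NAs
baked <- bake(imputed_cars, new_data = NULL)

# Pair each held-out true value with its imputed counterpart
comparison <- data.frame(
  truth = mtcars$wt[missing_rows],
  imputed = baked$wt[missing_rows]
)

# True values on the x-axis, imputed values on the y-axis
ggplot(comparison, aes(x = truth, y = imputed)) +
  geom_abline(linetype = "dashed") +
  geom_point() +
  labs(x = "True value", y = "Imputed value")
```

Points close to the dashed identity line correspond to values that the linear model recovered well.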
In future versions, we will standardize on the naming convention step_impute_*(). The existing functions will be soft-deprecated for a reasonable time period to ensure backward compatibility.
This version of the package has an extra logging option for prep() that will print some information on the differences in the data before and after each step is prepared:
ames_rec <-
  recipe(Sale_Price ~ ., data = ames) %>%
  step_BoxCox(Lot_Area, Gr_Liv_Area) %>%
  step_other(Neighborhood, threshold = 0.05) %>%
  step_dummy(all_nominal()) %>%
  step_interact(~ starts_with("Central_Air"):Year_Built) %>%
  step_ns(Longitude, Latitude, deg_free = 5) %>%
  prep(log_changes = TRUE)
## step_BoxCox (BoxCox_KMeZW): same number of columns
##
## step_other (other_b4CM3): same number of columns
##
## step_dummy (dummy_q3sI4):
##  new (223): MS_SubClass_One_Story_1945_and_Older, ...
##  removed (40): MS_SubClass, MS_Zoning, Street, Alley, Lot_Shape, ...
##
## step_interact (interact_xjtSG):
##  new (1): Central_Air_Y_x_Year_Built
##
## step_ns (ns_pfpld):
##  new (10): Longitude_ns_1, Longitude_ns_2, Longitude_ns_3, ...
##  removed (2): Longitude, Latitude
Another important change was behind the scenes. Before, there were problems with using PSOCK clusters on Windows because the worker processes were not aware of all the packages that should be loaded. Now, recipes ensures that all of the packages required by each step will be accessible in parallel. A similar change is coming soon to the parsnip package.
Thanks to those users who filed issues or contributed a pull request since the previous release: @AndrewKostandy, @anks7190, @AshesITR, @Bijaelo, @brodz, @DavisVaughan, @dgkf, @EllaKaye, @EmilHvitfeldt, @hamedbh, @hnagaty, @irkaal, @jerome-laurent-pro, @juliasilge, @karaesmen, @kylegilde, @LordRudolf, @lorenzwalthert, @mattwarkentin, @mpettis, @mt-edwards, @nhward, @Nilafhiosagam, @NRaillard, @Paul-Yuchao-Dong, @perluna, @RaminZi, @rorynolan, @Steviey, @topepo, and @ttzhou.