recipes 0.1.16

  tidymodels, recipes

  Max Kuhn

We’re tickled pink to announce the release of recipes 0.1.16. recipes is a package for preprocessing data for modeling and data analysis.

You can install it from CRAN with:

install.packages("recipes")

This blog post discusses several improvements to the package. Before we get to the new features, please note that the package license has changed from GPL-2 to MIT.

You can see a full list of changes in the release notes.

New column selectors

We do our best to keep track of recurring problems that show up in our teaching, Stack Overflow posts, RStudio Community posts, the R4DS Tidy Modeling Book Club, and other venues. When the same problem keeps appearing, we try to improve the programming interface to address it.

Mine Çetinkaya-Rundel had a good idea for one such recurring problem related to creating dummy variables. For classification data where one or more predictors are categorical, users might accidentally capture both the outcome and the predictors when creating dummy variables. For example:

library(tidymodels)

data(scat, package = "modeldata")

scat_rec <- 
  recipe(Species ~ Location + Age + Mass + Diameter, data = scat) %>% 
  step_dummy(all_nominal(), one_hot = TRUE) %>% 
  prep()

scat_rec %>% 
  bake(new_data = NULL) %>% 
  names()
## [1] "Age"               "Mass"              "Diameter"         
## [4] "Location_edge"     "Location_middle"   "Location_off_edge"
## [7] "Species_bobcat"    "Species_coyote"    "Species_gray_fox"

Note that the outcome column (Species) was made into binary indicators. Most classification models expect the outcome to be a factor vector, so this would cause errors. The fix would be to remember to exclude Species from the step selector.
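With the existing selectors, one way to apply that fix is to subtract the outcome from the selection. The sketch below assumes the same scat data and uses all_outcomes() to exclude the outcome; writing -Species instead should behave the same way here.

```r
library(recipes)

data(scat, package = "modeldata")

# Select every nominal column, then drop the outcome from that selection
scat_fix_rec <-
  recipe(Species ~ Location + Age + Mass + Diameter, data = scat) %>%
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>%
  prep()

fixed_names <- scat_fix_rec %>%
  bake(new_data = NULL) %>%
  names()

# Species should still be a single column, not a set of indicators
fixed_names
```

This works, but it relies on remembering the exclusion every time a nominal selector is used.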

Most selectors in recipes are used to capture predictor columns. The new version of recipes contains new selectors that combine the role and the data type: all_nominal_predictors() and all_numeric_predictors(). Using these:

scat_rec <- 
  recipe(Species ~ Location + Age + Mass + Diameter, data = scat) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  prep()

scat_rec %>% 
  bake(new_data = NULL) %>% 
  names()
## [1] "Age"               "Mass"              "Diameter"         
## [4] "Species"           "Location_edge"     "Location_middle"  
## [7] "Location_off_edge"

The existing selectors will remain. We’ll be converting our documentation, books, and training to use these new selectors when we select predictors of a specific type.
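The numeric counterpart works the same way. As a sketch, suppose Mass were the outcome (a choice made here only for illustration, not part of the example above): all_numeric_predictors() lets a normalization step skip it, whereas all_numeric() would rescale it along with the predictors.

```r
library(recipes)

data(scat, package = "modeldata")

# Mass is treated as the outcome here purely for illustration
mass_rec <-
  recipe(Mass ~ Age + Diameter, data = scat) %>%
  step_normalize(all_numeric_predictors()) %>%
  prep()

mass_baked <- bake(mass_rec, new_data = NULL)

# The predictors are centered and scaled; the outcome is left untouched
summary(mass_baked$Diameter)
all.equal(mass_baked$Mass, scat$Mass)
```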

New steps

A new step was added to complement step_rm() (which removes columns). The new step_select() declares which columns to retain and emulates dplyr::select().
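As a brief sketch of how step_select() might be used with the scat data (the choice of columns is arbitrary and only for illustration, and the call is written against the release described in this post):

```r
library(recipes)

data(scat, package = "modeldata")

# Keep only the outcome and two predictors; every other column is dropped
select_rec <-
  recipe(Species ~ ., data = scat) %>%
  step_select(Species, Location, Mass) %>%
  prep()

selected_names <- select_rec %>%
  bake(new_data = NULL) %>%
  names()

selected_names
```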

In cases where there are missing data, some data analysis methods complement the existing predictors with missing value indicators for the covariates that have incomplete values. Thanks to Konrad Semsch, step_indicate_na() can be used to create these. Using the previous example:

scat_rec <- 
  recipe(Species ~ Location + Age + Mass + Diameter, data = scat) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_indicate_na(Mass, Diameter) %>% 
  prep()

scat_rec %>% 
  bake(new_data = scat[!complete.cases(scat),],
       contains("Mass"), contains("Diameter")) 
## # A tibble: 19 x 4
##     Mass na_ind_Mass Diameter na_ind_Diameter
##    <dbl>       <int>    <dbl>           <int>
##  1  2.51           0     NA                 1
##  2 18.1            0     NA                 1
##  3  8.17           0     NA                 1
##  4  3.43           0     NA                 1
##  5  5.53           0     NA                 1
##  6 26.9            0     24.1               0
##  7  5.38           0     17.8               0
##  8 14.9            0     19.3               0
##  9  9.51           0     17.9               0
## 10 18.3            0     18.1               0
## 11  8.73           0     25.8               0
## 12 25.9            0     22.2               0
## 13 14.5            0     20.1               0
## 14 10.3            0     17.8               0
## 15 14.6            0     19.3               0
## 16  5.66           0     24.8               0
## 17 NA              1     14.9               0
## 18  6.77           0     17.3               0
## 19 20.3            0     NA                 1

Speaking of missing data, we’ve decided to rename the current eight imputation steps:

  • step_impute_knn() is favored over step_knnimpute()
  • step_impute_median() is favored over step_medianimpute()
  • and so on…

The new names are a lot better since they share a common step_impute_ prefix and work well with tab-completion. The old steps will go through a gradual deprecation process before being removed at some point in the future.
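For instance, a median imputation under the new naming scheme might look like the sketch below. The column choice is illustrative; since this is a rename, the older step_medianimpute() is expected to take the same arguments.

```r
library(recipes)

data(scat, package = "modeldata")

# Fill missing Mass and Diameter values with their training-set medians
impute_rec <-
  recipe(Species ~ Mass + Diameter, data = scat) %>%
  step_impute_median(Mass, Diameter) %>%
  prep()

imputed <- bake(impute_rec, new_data = scat)

# Count the missing values remaining in the imputed columns
colSums(is.na(imputed[, c("Mass", "Diameter")]))
```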

Keeping columns used in other features

A fair number of steps take one or more columns of the data and convert them to artificial features. For example, principal component regression represents a set of columns as artificial features that are amalgamations of the original data. By default, such steps typically replace the columns that they consume, but in some cases users want to keep the original columns as well.

The following steps now have an option called keep_original_cols: step_date(), step_dummy(), step_holiday(), step_ica(), step_isomap(), step_kpca_poly(), step_kpca_rbf(), step_nnmf(), step_pca(), step_pls(), and step_ratio().

For example:

scat_rec <- 
  recipe(Species ~ Location + d13C + d15N + CN, data = scat) %>% 
  step_impute_mean(d13C, d15N, CN) %>% 
  step_dummy(all_nominal_predictors(), one_hot = TRUE) %>% 
  step_pca(d13C, d15N, CN, keep_original_cols = TRUE) %>% 
  prep()

scat_rec %>% 
  bake(new_data = scat) %>% 
  names()
##  [1] "d13C"              "d15N"              "CN"               
##  [4] "Species"           "Location_edge"     "Location_middle"  
##  [7] "Location_off_edge" "PC1"               "PC2"              
## [10] "PC3"
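The same option is available in the other steps listed above. As a small sketch with step_dummy() (the two-predictor formula is chosen only for illustration), setting keep_original_cols = TRUE should retain the Location factor alongside its indicator columns:

```r
library(recipes)

data(scat, package = "modeldata")

# Create indicators for Location but keep the original factor as well
dummy_rec <-
  recipe(Species ~ Location + Mass, data = scat) %>%
  step_dummy(Location, one_hot = TRUE, keep_original_cols = TRUE) %>%
  prep()

dummy_names <- dummy_rec %>%
  bake(new_data = NULL) %>%
  names()

dummy_names
```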

Acknowledgements

Thanks to everyone who contributed since the previous version: @AshesITR, @BenoitLondon, @CelloJuan, @dfalbel, @EmilHvitfeldt, @gregdenay, @gustavomodelli, @hfrick, @hsbadr, @jake-mason, @jjcurtin, @juliasilge, @konradsemsch, @kylegilde, @LePeti, @LordRudolf, @lukasal, @mattwarkentin, @mikemc, @mine-cetinkaya-rundel, @paudel-arjun, @renanxcortes, @rorynolan, @saadaslam, @schoonees, @topepo, @uriahf, @vadimus202, and @zenggyu.