recipes 0.2.0

We’re very excited to announce the release of recipes 0.2.0. recipes is a package for preprocessing data before using it in models or visualizations. You can think of it as a mash-up of model.matrix() and dplyr.

You can install it from CRAN with:

install.packages("recipes")

This blog post will describe the highlights of what’s new. You can see a full list of changes in the release notes.

New Steps

step_nnmf_sparse() was added to produce features using non-negative matrix factorization (via the RcppML package). This will supersede the existing step_nnmf() since that step was difficult to support and use. The new step allows for a sparse representation via regularization and, from our initial testing, is much faster than the original NNMF step.

The new step step_dummy_extract() creates indicator variables from text data, and is especially useful when a single value holds several choices. For example, if a row of a variable had a value of "red,black,brown", the step can separate these values and make all of the required binary dummy variables.

Here’s a real example from Episode 8 of Sliced where a column of data from Spotify had the artist(s) of a song:

library(recipes)
spotify <- 
  tibble::tribble(
    ~ artists,
    "['Genesis']",
    "['Billie Holiday', 'Teddy Wilson']",
    "['Jimmy Barnes', 'INXS']"
  )
recipe(~ artists, data = spotify) %>% 
  step_dummy_extract(artists, pattern = "(?<=')[^',]+(?=')") %>% 
  prep() %>% 
  bake(new_data = NULL) %>% 
  glimpse()

## Rows: 3
## Columns: 6
## $ artists_Billie.Holiday <dbl> 0, 1, 0
## $ artists_Genesis        <dbl> 1, 0, 0
## $ artists_INXS           <dbl> 0, 0, 1
## $ artists_Jimmy.Barnes   <dbl> 0, 0, 1
## $ artists_Teddy.Wilson   <dbl> 0, 1, 0
## $ artists_other          <dbl> 0, 0, 0

Note that this step produces an “other” column and has arguments similar to step_other() and step_dummy_multi_choice().
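The simpler "red,black,brown" case mentioned earlier does not need a regular expression. Here is a minimal sketch, assuming a made-up `colors` data set (not from the original analysis) and using the step's `sep` argument to split on commas:

```r
library(recipes)

# Hypothetical toy data: each row holds comma-separated color choices
colors <- tibble::tibble(color = c("red,black,brown", "black", "red,brown"))

color_dummies <-
  recipe(~ color, data = colors) %>%
  # Split each value on "," and make one indicator column per distinct piece
  step_dummy_extract(color, sep = ",") %>%
  prep() %>%
  bake(new_data = NULL)

names(color_dummies)
```

Use `sep` when the values are cleanly delimited and `pattern` when, as in the Spotify data, the pieces need to be pulled out of surrounding punctuation.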

step_percentile() is now a full step function; previously, it was only an example in the developer documentation. It determines the empirical distribution of a variable using the training set, then converts any value to its percentile in this distribution.
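As a rough illustration, here is a small sketch that uses the built-in mtcars data (our choice for the example, not part of the release notes) to convert two predictors to percentiles:

```r
library(recipes)

pct_rec <-
  recipe(mpg ~ disp + hp, data = mtcars) %>%
  # The reference distribution is estimated from the training set in prep()
  step_percentile(disp, hp) %>%
  prep()

# Training set values, converted to their percentiles
pct <- bake(pct_rec, new_data = NULL)
head(pct)
```

Since the conversion is monotone, the ordering of the original values is preserved; only the scale changes.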

Finally, a new filtering function (step_filter_missing()) can filter out columns that have too many missing values, where “too many” is set by the step’s threshold argument (the largest acceptable proportion of missing values in a column).
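Here is a minimal sketch of the filter, assuming a small made-up data set and a `threshold` of 0.5, so that any column with more than half of its values missing is dropped:

```r
library(recipes)

# Hypothetical toy data: `mostly_missing` is 75% NA, `complete` has no NAs
toy <- tibble::tibble(
  complete = c(1, 2, 3, 4),
  mostly_missing = c(NA, NA, NA, 1)
)

filtered <-
  recipe(~ ., data = toy) %>%
  # Remove columns whose proportion of missing values exceeds the threshold
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  prep() %>%
  bake(new_data = NULL)

names(filtered)
```

Only the `complete` column should remain after baking.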

Other notable new features

step_zv() now has a group argument. This can be helpful for models such as naive Bayes or quadratic discriminant analysis where the predictors must have at least two unique values within each class.
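To make the within-class behavior concrete, here is a small sketch with made-up data. The predictor `x` varies overall but is constant inside class "a", so we would expect a grouped zero-variance filter to remove it. The grouping column is passed as a character string, which is our reading of the argument's documentation:

```r
library(recipes)

# Hypothetical toy data
toy <- tibble::tibble(
  class = factor(c("a", "a", "b", "b")),
  x = c(1, 1, 2, 3), # a single unique value within class "a"
  y = c(1, 2, 3, 4)  # varies within both classes
)

zv_rec <-
  recipe(class ~ ., data = toy) %>%
  # Look for single-valued predictors within each level of `class`
  step_zv(all_predictors(), group = "class") %>%
  prep()

zv_baked <- bake(zv_rec, new_data = NULL)
names(zv_baked)
```

Without the group argument, `x` would be kept, since it has more than one unique value across the whole training set.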

All recipe steps now officially support empty selections to be more aligned with dplyr and other packages that use tidyselect. For example, if a previous step removed all of the columns needed for a later step, the recipe does not fail when it is estimated (with the exception of step_mutate()). The documentation in ?selections has been updated with advice for writing selectors when filtering steps are used.
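As a sketch of what this looks like in practice, the recipe below (using the built-in iris data, our choice for illustration) removes every nominal predictor and then asks for dummy variables from the nominal predictors. The second selection is empty by the time it is evaluated, and the recipe should still prep and bake:

```r
library(recipes)

empty_rec <-
  recipe(Sepal.Length ~ ., data = iris) %>%
  # Species is the only nominal predictor, and it is removed here
  step_rm(all_nominal_predictors()) %>%
  # Nothing is left for this selector, so the step has nothing to do
  step_dummy(all_nominal_predictors()) %>%
  prep()

empty_baked <- bake(empty_rec, new_data = NULL)
names(empty_baked)
```

Note that this uses a selector function rather than a bare column name; asking for a removed column by name is a different situation, which is what the advice in ?selections addresses.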

There are new extract_parameter_set_dials() and extract_parameter_dials() methods to extract parameter sets and single parameters from a recipe. Since this is related to tuning parameters, the tune package should be loaded before they are used.
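A short sketch of how these might be used, assuming the tune and dials packages are installed and marking the number of PCA components for tuning (the recipe itself is only an illustration):

```r
library(recipes)
library(tune)

tune_rec <-
  recipe(mpg ~ ., data = mtcars) %>%
  # tune() is a placeholder that flags num_comp as a tuning parameter
  step_pca(all_numeric_predictors(), num_comp = tune())

# All tuning parameters in the recipe, as a parameter set
rec_params <- extract_parameter_set_dials(tune_rec)
rec_params

# A single parameter object, looked up by its id
extract_parameter_dials(tune_rec, "num_comp")
```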

Breaking changes

Because of changes to step_ica() and step_kpca*(), recipe objects created with previous versions will now error when applied to new data. To keep using these recipes, you will need to update them with the current version.

Acknowledgements

We’d like to thank everyone who has contributed since the last release: @agwalker82, @albert-ying, @AshesITR, @ddsjoberg, @DoktorMike, @EmilHvitfeldt, @emmansh, @hermandr, @hfrick, @jacekkotowski, @JensPMB, @jkennel, @juliasilge, @lg1000, @lionel-, @markjrieke, @mattwarkentin, @MichaelChirico, @ninohardt, @SewerynGrodny, @SimonCoulombe, @spsanderson, @tedmoorman, @topepo, @tsengj, @walrossker, @williamshell, and @xiaoxi-david.