recipes 0.1.13

We’re very chuffed to announce the release of recipes 0.1.13. recipes is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization.

You can install it from CRAN with:

install.packages("recipes")

You can see a full list of changes in the release notes. This post highlights a few general changes and two improvements to feature extraction.

General changes

First, step_filter(), step_slice(), step_sample(), and step_naomit() had their defaults for skip changed to TRUE. In the vast majority of applications, these steps should not be applied to the test or assessment sets.
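To see what the new default means in practice, here is a minimal sketch using the built-in mtcars data (chosen only for illustration; it is not part of the release notes). With skip = TRUE, the filter is applied when the recipe is prepped but bypassed when bake() processes new data:

```r
library(recipes)

# step_filter() now defaults to skip = TRUE; it is spelled out here for clarity.
filter_rec <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_filter(cyl > 4, skip = TRUE) %>%
  prep()

# Training set: the filter was applied during prep().
n_train <- nrow(juice(filter_rec))

# New data: the skipped step is not applied, so every row is kept.
n_new <- nrow(bake(filter_rec, new_data = mtcars))

c(training = n_train, new_data = n_new)
```

If a row filter really should be applied to new data as well, setting skip = FALSE on that step restores the old behavior.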

Also, step_upsample() and step_downsample() are soft deprecated in recipes as they are now available in the themis package. They will be removed in the next version.
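Migrating should mostly be a matter of loading themis and calling its version of the step. The sketch below assumes themis is installed and uses a small, artificially imbalanced subset of iris (an illustrative choice, not from the release notes). The over_ratio argument is the themis spelling of the sampling ratio:

```r
library(recipes)
library(themis)

# Hypothetical imbalanced data: 50 setosa rows and 10 versicolor rows
imbalanced <- iris[c(1:50, 51:60), ]
imbalanced$Species <- droplevels(imbalanced$Species)

set.seed(1)
up_rec <-
  recipe(Species ~ ., data = imbalanced) %>%
  # Namespaced call so it is clear which package supplies the step
  themis::step_upsample(Species, over_ratio = 1) %>%
  prep()

# Class counts in the processed training set
class_counts <- table(juice(up_rec)$Species)
class_counts
```

With over_ratio = 1, the minority class is sampled with replacement until it matches the majority class.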

Finally, for the new version of dplyr, the selectors all_of() and any_of() can now be used in step selections.
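As a small sketch of how these selectors behave inside a step (the column vectors below are hypothetical and mtcars is used only for illustration): all_of() requires every named column to exist, while any_of() quietly skips names that are absent.

```r
library(recipes)

# Hypothetical column lists, e.g. stored in a script or config file
must_have <- c("disp", "hp")
nice_to_have <- c("wt", "not_a_column")

sel_rec <-
  recipe(mpg ~ ., data = mtcars) %>%
  # all_of() would error if any of these names were missing
  step_normalize(all_of(must_have)) %>%
  # any_of() ignores "not_a_column" and only selects wt
  step_log(any_of(nice_to_have)) %>%
  prep()

processed <- juice(sel_rec)
head(processed[, c("disp", "hp", "wt")])
```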

Feature extraction improvements

In the feature extraction category, there are two improvements. First, the tidy() method for step_pca() can return the percentage of variation accounted for by each PCA component. For example:

library(tidymodels)

# Many highly correlated numeric predictors:
data(meats, package = "modeldata")

set.seed(2383)
split <- initial_split(meats)
meat_tr <- training(split)
meat_te <- testing(split)

pca_rec <- 
  recipe(water + fat + protein ~ ., data = meat_tr) %>% 
  step_normalize(all_predictors()) %>% 
  step_pca(all_predictors(), num_comp = 10, id = "pca") %>% 
  prep()

var_info <- tidy(pca_rec, id = "pca", type = "variance")

table(var_info$terms)
#> 
#> cumulative percent variance         cumulative variance 
#>                         100                         100 
#>            percent variance                    variance 
#>                         100                         100

var_info %>% 
  dplyr::filter(terms == "percent variance") %>% 
  ggplot(aes(x = component, y = value)) + 
  geom_bar(stat = "identity") + 
  xlim(c(0, 10)) + 
  ylab("% of Total Variation")
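The same tidy() output can also guide the choice of num_comp. The following self-contained sketch uses mtcars rather than the meats data, and the 90% threshold is an arbitrary assumption for illustration. It finds the smallest number of components whose cumulative percent variance reaches that threshold:

```r
library(recipes)

pca_small <-
  recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(all_predictors()) %>%
  step_pca(all_predictors(), num_comp = 3, id = "pca") %>%
  prep()

# Variance information is returned for every component, not just num_comp
cum_var <-
  tidy(pca_small, id = "pca", type = "variance") %>%
  dplyr::filter(terms == "cumulative percent variance")

# Smallest number of components reaching the (assumed) 90% threshold
n_needed <- min(cum_var$component[cum_var$value >= 90])
n_needed
```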

Another change in this version of recipes is that step_pls() has received an upgrade. Partial least squares (PLS) is similar to PCA but takes the outcome(s) into account.

Previously, it used the pls package to do the computations. That’s a great package but it lacks two important features: support for a categorical outcome (e.g. “pls-da” for discriminant analysis) and support for sparsity in the coefficients. Sparsity would facilitate simpler and perhaps more robust models.

step_pls() now uses the Bioconductor mixOmics package. As such, the outcome data can now be a factor and a new argument predictor_prop is used for sparsity. That argument specifies the maximum proportion of partial least squares loadings that will be non-zero (per component) during training. Newly prepped recipes will use this package but previously created recipes still use the pls package. For our previous example, let’s look at the protein outcome and build a recipe:

pls_rec <- 
  recipe(water + fat + protein ~ ., data = meat_tr) %>% 
  step_normalize(all_predictors()) %>% 
  step_pls(
    all_predictors(),
    outcome = vars(protein),
    num_comp = 3,
    predictor_prop = 0.75,
    id = "pls"
  ) %>% 
  prep()

# for new data: 
bake(pls_rec, meat_te, protein, starts_with("PLS")) %>%
  tidyr::pivot_longer(cols = c(-protein),
                      names_to = "component",
                      values_to = "values") %>% 
  ggplot(aes(x = values, y = protein)) + 
  geom_point(alpha = 0.5) + 
  facet_wrap(~ component, scales = "free_x") +
  xlab("PLS Score")

What are the PLS coefficients from this?

tidy(pls_rec, id = "pls") %>%
  ggplot(aes(x = component, y = terms, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(
    low = "#B2182B",
    mid = "white",
    high = "#2166AC",
    limits = c(-0.4, 0.4)
  ) + 
  theme(axis.text.y = element_blank()) + 
  ylab("Predictors")

The third component has the largest coefficients and the largest effect on predicting the percentage of protein. This is consistent with the scatter plot above. The blocks of white in the heatmap above are coefficients affected by the sparsity argument.