We’re very chuffed to announce the release of recipes 0.1.13. recipes is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization.
You can install it from CRAN with install.packages("recipes").
You can see a full list of changes in the release notes. There are some improvements and changes to talk about.
Several steps, including step_naomit(), had their default for skip changed to TRUE. In the vast majority of applications, these steps should not be applied to the test or assessment sets.
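To make the effect of skip concrete, here is a minimal sketch. The toy data frame dat is hypothetical and only for illustration, and skip = TRUE is written out explicitly so that the behavior does not depend on the installed version's default.

```r
library(recipes)

# Hypothetical toy data with one missing predictor value
dat <- data.frame(x = c(1, NA, 3), y = c(2, 4, 6))

rec <- recipe(y ~ x, data = dat) %>%
  step_naomit(all_predictors(), skip = TRUE) %>%
  prep()

# The step runs while the recipe is prepped, so the retained
# training set has the incomplete row removed
n_train <- nrow(juice(rec))

# With skip = TRUE the step is not applied by bake(), so every
# row of the new data is returned, including the one with NA
n_new <- nrow(bake(rec, new_data = dat))

c(training = n_train, new_data = n_new)
```

If the rows really should be dropped from new data as well, setting skip = FALSE in the step restores that behavior.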
Sampling steps such as step_downsample() are soft deprecated in recipes as they are now available in the themis package. They will be removed in the next version.
Finally, with the new version of dplyr, selectors such as any_of() can now be used in step selections.
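As a small sketch of what this allows, the example below uses the built-in mtcars data as a stand-in. The column name "not_a_column" is made up on purpose to show that any_of() skips names that are absent from the data.

```r
library(recipes)

# Candidate columns; the last one does not exist in mtcars
wanted <- c("disp", "hp", "not_a_column")

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_normalize(any_of(wanted)) %>%
  prep()

# Columns that the step actually normalized
selected <- unique(tidy(rec, number = 1)$terms)
selected
```

This is handy when the same recipe code is reused across data sets that do not all share the same columns.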
Feature extraction improvements
    library(tidymodels)

    # Many highly correlated numeric predictors:
    data(meats, package = "modeldata")

    set.seed(2383)
    split <- initial_split(meats)
    meat_tr <- training(split)
    meat_te <- testing(split)

    pca_rec <-
      recipe(water + fat + protein ~ ., data = meat_tr) %>%
      step_normalize(all_predictors()) %>%
      step_pca(all_predictors(), num_comp = 10, id = "pca") %>%
      prep()

    var_info <- tidy(pca_rec, id = "pca", type = "variance")
    table(var_info$terms)
    ##
    ## cumulative percent variance         cumulative variance
    ##                         100                         100
    ##            percent variance                    variance
    ##                         100                         100
    var_info %>%
      dplyr::filter(terms == "percent variance") %>%
      ggplot(aes(x = component, y = value)) +
      geom_bar(stat = "identity") +
      xlim(c(0, 10)) +
      ylab("% of Total Variation")
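For readers wondering how the four kinds of terms relate to each other, here is a rough base R sketch. It assumes the quantities are derived from the component standard deviations in the usual way, as returned by stats::prcomp(). The built-in mtcars data is only a stand-in, and this is not meant to reproduce the internals of step_pca().

```r
# Stand-in data: all columns of mtcars are numeric
pc <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Variance of each component is its squared standard deviation
variance <- pc$sdev^2
pct_variance <- 100 * variance / sum(variance)

variance_summary <- data.frame(
  component = seq_along(variance),
  variance = variance,
  cumulative_variance = cumsum(variance),
  percent_variance = pct_variance,
  cumulative_percent_variance = cumsum(pct_variance)
)

head(variance_summary, 3)
```

The cumulative percent column always ends at 100, which makes it a convenient guide for choosing num_comp.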
Another change in this version of recipes is that step_pls() has received an upgrade. Partial least squares (PLS) is similar to PCA but takes the outcome(s) into account.
Previously, it used the pls package to do the computations. That’s a great package, but it lacks two important features: support for a categorical outcome (e.g., “PLS-DA” for discriminant analysis) and sparsity in the coefficients. Sparsity would facilitate simpler and perhaps more robust models.
step_pls() now uses the Bioconductor mixOmics package. As such, the outcome data can now be a factor and a new argument, predictor_prop, is used for sparsity. That argument specifies the maximum proportion of partial least squares loadings that will be non-zero (per component) during training. Newly prepped recipes will use this package but previously created recipes still use the pls package.
For our previous example, let’s look at the protein outcome and build a recipe:
    pls_rec <-
      recipe(water + fat + protein ~ ., data = meat_tr) %>%
      step_normalize(all_predictors()) %>%
      step_pls(
        all_predictors(),
        outcome = vars(protein),
        num_comp = 3,
        predictor_prop = 0.75,
        id = "pls"
      ) %>%
      prep()

    # for new data:
    bake(pls_rec, meat_te, protein, starts_with("PLS")) %>%
      tidyr::pivot_longer(
        cols = c(-protein),
        names_to = "component",
        values_to = "values"
      ) %>%
      ggplot(aes(x = values, y = protein)) +
      geom_point(alpha = 0.5) +
      facet_wrap(~ component, scales = "free_x") +
      xlab("PLS Score")
What are the PLS coefficients from this?
    tidy(pls_rec, id = "pls") %>%
      ggplot(aes(x = component, y = terms, fill = value)) +
      geom_tile() +
      scale_fill_gradient2(
        low = "#B2182B",
        mid = "white",
        high = "#2166AC",
        limits = c(-0.4, 0.4)
      ) +
      theme(axis.text.y = element_blank()) +
      ylab("Predictors")
The third component has the largest coefficients and the largest effect on predicting the percentage of protein. This is consistent with the scatter plot above. The blocks of white in the heatmap above are coefficients affected by the sparsity argument, predictor_prop.