Q3 2021 tidymodels roundup

  tidymodels, recipes, dials, modeldata

  Julia Silge

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles. We now publish regular updates here on the tidyverse blog summarizing recent developments in the tidymodels ecosystem. You can check out the tidymodels tag to find all tidymodels blog posts, including those that focus on a single package or on more major releases. The purpose of these roundup posts is to keep you informed about releases you may have missed and useful new functionality as we maintain these packages.

Since our last roundup post, there have been 9 CRAN releases of tidymodels packages. You can install these updates from CRAN with:

install.packages(c("baguette", "broom", "dials", 
                   "modeldata", "poissonreg", "recipes", 
                   "rules", "stacks", "textrecipes"))

Each package's NEWS file has the details of its release; you'll notice that many of these releases involve small updates for CRAN policy or changes that are not user-facing. We write code in these smaller, modular packages so that we can release frequently, which makes models easier to deploy and our software easier to maintain, but it can be a lot to keep up with as a user! We want to take the opportunity here to highlight a couple of the more important changes in these releases.

New parameter objects

The dials package is one that you might not have thought much about yet, even if you are a regular tidymodels user. It is an infrastructure package for creating and managing hyperparameters as well as grids (both regular and non-regular) of hyperparameters. The most recent release of dials includes several new parameters, and Hannah Frick is now the package maintainer.
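
To give a flavor of that infrastructure (a minimal sketch of our own, not from the release itself), parameter objects can be turned directly into regular or space-filling grids:

library(dials)

# a regular grid over two hyperparameters; mtry() has an unknown upper
# bound by default, so we give it an explicit range here
grid_regular(mtry(range = c(1, 5)), min_n(), levels = 3)

# a non-regular, space-filling grid over the same parameters
set.seed(1)
grid_latin_hypercube(mtry(range = c(1, 5)), min_n(), size = 10)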

One of the new parameters in this release is stop_iter(), the number of iterations without improvement before “early stopping”.

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 0.1.3 ──
#> ✓ broom        0.7.9      ✓ rsample      0.1.0 
#> ✓ dials        0.0.10     ✓ tibble       3.1.4 
#> ✓ dplyr        1.0.7      ✓ tidyr        1.1.3 
#> ✓ infer        1.0.0      ✓ tune         0.1.6 
#> ✓ modeldata    0.1.1      ✓ workflows    0.2.3 
#> ✓ parsnip      0.1.7      ✓ workflowsets 0.1.0 
#> ✓ purrr        0.3.4      ✓ yardstick    0.0.8 
#> ✓ recipes      0.1.17
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
#> • Use tidymodels_prefer() to resolve common conflicts.
stop_iter()
#> # Iterations Before Stopping (quantitative)
#> Range: [3, 20]

You don't typically use parameters from dials like this, though. Instead, the infrastructure these parameters provide is what allows us to fluently tune our hyperparameters. For example, we can use data on high-performance computing jobs to predict the class of those jobs with xgboost, trying out different values for early stopping (along with another hyperparameter, mtry).

data(hpc_data)
hpc_folds <- vfold_cv(hpc_data, strata = class)
hpc_formula <- class ~ compounds + input_fields + iterations + num_pending + hour

stopping_spec <-
  boost_tree(
    trees = 500,
    learn_rate = 0.02,
    mtry = tune(),
    stop_iter = tune()
  ) %>%
  set_engine("xgboost", validation = 0.2) %>%
  set_mode("classification")

early_stop_wf <- workflow(hpc_formula, stopping_spec)

We can now tune this workflow() over the resamples hpc_folds and find out which values for the hyperparameters turned out best.

doParallel::registerDoParallel()
set.seed(123)
early_stop_rs <- tune_grid(early_stop_wf, hpc_folds)
#> i Creating pre-processing data to finalize unknown parameter: mtry
show_best(early_stop_rs, "roc_auc")
#> # A tibble: 5 × 8
#>    mtry stop_iter .metric .estimator  mean     n std_err .config      
#>   <int>     <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>        
#> 1     3        13 roc_auc hand_till  0.887    10 0.00337 Preprocessor…
#> 2     2        11 roc_auc hand_till  0.887    10 0.00414 Preprocessor…
#> 3     2        15 roc_auc hand_till  0.886    10 0.00342 Preprocessor…
#> 4     4         7 roc_auc hand_till  0.885    10 0.00365 Preprocessor…
#> 5     4         9 roc_auc hand_till  0.884    10 0.00353 Preprocessor…

In this case, the best value for the early stopping parameter is 13.
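
From here, you might plug the winning values back into the workflow and refit. A minimal sketch (not part of the original analysis) could look like:

# finalize the workflow with the best mtry and stop_iter values,
# then fit it one more time on the full dataset
best_auc <- select_best(early_stop_rs, "roc_auc")

final_fit <-
  early_stop_wf %>%
  finalize_workflow(best_auc) %>%
  fit(hpc_data)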

This recent screencast demonstrates how to tune early stopping for xgboost as well.

Several of the new parameter objects in this version of dials are prep work for supporting more options in tidymodels, such as tuning yet more hyperparameters of xgboost (like L1 and L2 penalties and whether to balance classes), generalized additive models, discriminant analysis models, recursively partitioned models, and a recipe step for sparse PCA. Stay tuned for more on these new options!

Improvements to recipes

The most recent release of recipes is a robust one, including new recipe steps, bug fixes, documentation improvements, and better performance. You can dig into the NEWS file for more details, but check out a few of the most important changes.

We added a new recipe step for creating sine and cosine features, step_harmonic(). We can use this to analyze data with periodic features, like the annual number of sunspots. The sunspot.year dataset has an observation every year, and the solar cycle is about 11 years long.

data(sunspot.year)
sunspots <-
  tibble(year = 1700:1988,
         n_sunspot = sunspot.year)

sun_split <- initial_time_split(sunspots)
sun_train <- training(sun_split)
sun_test  <- testing(sun_split)

sunspots_rec <- 
  recipe(n_sunspot ~ year, data = sun_train) %>%
  step_harmonic(year, frequency = 1 / 11, cycle_size = 1, 
                role = "predictor",
                keep_original_cols = FALSE) 

lm_spec <- linear_reg()
sunspots_wf <- workflow(sunspots_rec, lm_spec)

sunspots_fit <- fit(sunspots_wf, sun_train)
sunspots_fit
#> ══ Workflow [trained] ════════════════════════════════════════════════
#> Preprocessor: Recipe
#> Model: linear_reg()
#> 
#> ── Preprocessor ──────────────────────────────────────────────────────
#> 1 Recipe Step
#> 
#> • step_harmonic()
#> 
#> ── Model ─────────────────────────────────────────────────────────────
#> 
#> Call:
#> stats::lm(formula = ..y ~ ., data = data)
#> 
#> Coefficients:
#> (Intercept)   year_sin_1   year_cos_1  
#>      42.904        9.691       20.504

We can now predict with this fitted linear model on the most recent years.

sunspots_fit %>%
  augment(sun_test) %>%
  select(year, n_sunspot, .pred) %>%
  pivot_longer(-year) %>%
  ggplot(aes(year, value, color = name)) +
  geom_line(alpha = 0.8, size = 1.2) +
  theme_minimal() +
  labs(x = NULL, y = "Number of sunspots", color = NULL)

[Plot: observed sunspot counts and model predictions over the test years]

Looks like there have been more sunspots in recent decades compared to the past!

Another new recipe step is step_dummy_multi_choice(). In the same release, step_kpca() was “un-deprecated,” and step_spatialsign() and step_geodist() gained improvements.
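
Here is a quick sketch of step_dummy_multi_choice() on made-up toy data (our own example, not from the release notes): when several columns share the same pool of possible values, this step creates a single set of dummy variables across all of them.

# toy data: both columns draw from the same pool of languages
languages <- tibble(
  lang_1 = c("English", "French", "Spanish"),
  lang_2 = c("Spanish", "English", "German")
)

recipe(~ ., data = languages) %>%
  step_dummy_multi_choice(lang_1, lang_2) %>%
  prep() %>%
  bake(new_data = NULL)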

If you build your own recipe steps, the new recipes_eval_select() function is now available, powering the tidyselect semantics specific to recipes. The older terms_select() function is now deprecated in favor of this new helper.
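
For example, a custom step's prep() method can resolve its selectors against the training data like this (a sketch following the pattern of the “Create your own recipe step” article on tidymodels.org, where step_percentile() is the worked example):

prep.step_percentile <- function(x, training, info = NULL, ...) {
  # resolve the user's tidyselect expressions to column names
  col_names <- recipes_eval_select(x$terms, training, info)
  # ... estimate whatever the step needs from training[, col_names] ...
}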

The recipes package is fairly extensive, and we have recently invested time and energy in refining the documentation to make it more navigable and clear, as well as easier to maintain and contribute to. Check out the recipes package site to browse the recently updated documentation pages.

Reexamining example datasets

One of the tidymodels packages is modeldata, where we keep example datasets for vignettes, examples, and other similar uses. We have included two datasets in modeldata, okc and okc_text, based on real user data from the dating website OkCupid.

This dataset was sourced from Kim and Escobedo-Land (2015). Permission to use this dataset was explicitly granted by OkCupid, but since that time, concerns have been raised about the ethics of using this or similar datasets, for example, to identify individuals. Xiao and Ma (2021) specifically address the possible misuse of this particular dataset, and we now agree that it isn't a good option to use for examples or teaching. In the most recent release of modeldata, we have marked these datasets as deprecated; we have removed them from the development version of the package on GitHub, and they will be removed entirely in the next CRAN release. We especially want to thank Albert Kim, one of the authors of the original paper, for his thoughtful and helpful discussion.

One of the reasons we found the OkCupid dataset useful was that it included multiple text columns per observation, so removing these two datasets motivated us to look for a new option to include instead. We landed on metadata for modern artwork from the Tate Gallery; if you used okc_text in the past, we recommend switching to tate_text. For example, we can count how many of these examples of artwork involve paper or canvas.

data(tate_text)

library(textrecipes)
paper_or_canvas <- c("paper", "canvas")

recipe(~ ., data = tate_text) %>%
  step_tokenize(medium) %>%
  step_stopwords(medium, custom_stopword_source = paper_or_canvas, keep = TRUE) %>%
  step_tf(medium) %>%
  prep() %>%
  bake(new_data = NULL) %>%
  select(artist, year, starts_with("tf"))
#> # A tibble: 4,284 × 4
#>    artist              year tf_medium_canvas tf_medium_paper
#>    <fct>              <dbl>            <dbl>           <dbl>
#>  1 Absalon             1990                0               0
#>  2 Auerbach, Frank     1990                0               1
#>  3 Auerbach, Frank     1990                0               1
#>  4 Auerbach, Frank     1990                0               1
#>  5 Auerbach, Frank     1990                1               0
#>  6 Ayres, OBE Gillian  1990                1               0
#>  7 Barlow, Phyllida    1990                0               1
#>  8 Baselitz, Georg     1990                0               1
#>  9 Beattie, Basil      1990                1               0
#> 10 Beuys, Joseph       1990                0               1
#> # … with 4,274 more rows

This artwork metadata was a Tidy Tuesday dataset earlier this year.

Acknowledgements

We’d like to extend our thanks to all of the contributors who helped make these releases during Q3 possible!