Q3 2022 tidymodels digest

  tidymodels, agua, h2o, recipes

  Max Kuhn

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like these from the past month or so:

Since our last roundup post, there have been CRAN releases of 22 tidymodels packages. Here are links to their NEWS files:

We’ll highlight two specific upgrades: one for agua and another for recipes.

A big upgrade for agua

With version 3.38.0.1 of the h2o package, agua can now tune h2o models as if they were any other type of model engine.

h2o has an excellent server-based computational engine for fitting a variety of different machine learning and statistical models. The h2o server can run locally or on some external high performance computing server. The downside is that it is light on tools for feature engineering and interactive data analysis.

Using h2o with tidymodels enables users to leverage the benefits of packages like recipes along with fast, server-based parallel processing.

While the syntax for model fitting and tuning are the same as any other non-h2o model, there are different ways to parallelize the work:

  • The h2o server has the ability to internally parallelize individual model computations. For example, when fitting trees, the search for the best split can be done using multiple threads. The number of threads that each model should be used is set with h2o.init(nthreads). The default (-1) is to use all CPUs on the host.

  • When using grid search, h2o.grid(parallelism) determines how many models the h2o server should process at the same time. The default (1) constrains the server to run the models sequentially.

  • R has external parallelization tools (such as the foreach and future packages) that can start new R processes to simultaneously do work. This would run many models in parallel. For h2o, this determines how many models the agua package could send to the server at once. This does not appear to be constrained by the parallelism argument to h2o.grid().

With h2o and tidymodels, you should probably use h2o’s parallelization. Using multiple approaches can work but only for some technologies. It’s still pretty complicated but we are working on un-complicating it.

To set up h2o parallelization, there is a new control argument called backend_options. If you were doing a grid search, you first define how many threads the h2o server should use:

library(tidymodels)
library(agua)
library(finetune)

h2o_thread_spec <- agua_backend_options(parallelism = 10) 

Then, pass the output to any of the existing control functions:

grid_ctrl <- control_grid(backend_options = h2o_thread_spec)

Now h2o can parallel process 10 models at once.

Here is an example using a simulated data set with a numeric outcome:

library(tidymodels)
library(agua)
library(finetune)

# Simulate the data
n_train <- 200
set.seed(6147)
sim_dat <- sim_regression(n_train)

# Resample using 10-fold cross-validation
set.seed(91)
sim_rs <- vfold_cv(sim_dat)

We’ll use grid search to tune a boosted tree:

boost_spec <-
  boost_tree(
    trees = tune(),
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %>%
  set_engine("h2o") %>%
  set_mode("regression")

Now, let’s parallel process our computations.

# Start the h2o server
h2o::h2o.init()

# Multi-thread the model fits
h2o_thread_spec <- agua_backend_options(parallelism = 10)
grid_ctrl <- control_grid(backend_options = h2o_thread_spec)

We’ll evaluate a very small grid at first:

set.seed(7616)
grid_res <-
  boost_spec %>%
  tune_grid(outcome ~ ., resamples = sim_rs, grid = 10, control = grid_ctrl)
show_best(grid_res, metric = "rmse") %>% select(-.config, -.metric, -.estimator)

#> # A tibble: 5 × 8
#>   trees min_n tree_depth    learn_rate loss_reduction  mean     n std_err
#>   <int> <int>      <int>         <dbl>          <dbl> <dbl> <int>   <dbl>
#> 1  1954    17         13 0.0318               6.08e-8  13.1    10   0.828
#> 2   184    25          4 0.00000000164        6.56e-1  15.7    10   1.03 
#> 3  1068    10          8 0.0000409            1.19e+1  17.4    10   1.08 
#> 4  1500    37         10 0.0000108            9.97e-9  18.3    10   1.03 
#> 5   985    18          7 0.0000000454         1.84e-3  18.4    10   1.04

autoplot(grid_res, metric = "rmse")

It was a small grid and most of the configurations were not especially good. We can further optimize the results by applying simulated annealing search to the best grid results.

sa_ctrl <- control_sim_anneal(backend_options = h2o_thread_spec)

set.seed(4)
sa_res <-
  boost_spec %>%
  tune_sim_anneal(
    outcome ~ .,
    resamples = sim_rs,
    initial = grid_res,
    iter = 25,
    control = sa_ctrl
  )

#> Optimizing rmse
#> Initial best: 13.06400
#>  1 ♥ new best           rmse=12.688  (+/-0.7899)
#>  2 ◯ accept suboptimal  rmse=12.849  (+/-0.8304)
#>  3 ◯ accept suboptimal  rmse=13.129  (+/-0.8266)
#>  4 ◯ accept suboptimal  rmse=13.678  (+/-0.9544)
#>  5 + better suboptimal  rmse=13.433  (+/-0.792)
#>  6 + better suboptimal  rmse=12.99  (+/-0.9031)
#>  7 ─ discard suboptimal rmse=16.531  (+/-1.027)
#>  8 ─ discard suboptimal rmse=13.522  (+/-0.9802)
#>  9 ✖ restart from best  rmse=13.097  (+/-0.8109)
#> 10 ♥ new best           rmse=12.66  (+/-0.8028)
#> 11 ◯ accept suboptimal  rmse=13.116  (+/-0.8135)
#> 12 + better suboptimal  rmse=12.714  (+/-0.7747)
#> 13 ─ discard suboptimal rmse=13.074  (+/-0.6598)
#> 14 ─ discard suboptimal rmse=14.489  (+/-1.028)
#> 15 ◯ accept suboptimal  rmse=12.715  (+/-0.8043)
#> 16 ─ discard suboptimal rmse=13.788  (+/-1.027)
#> 17 ─ discard suboptimal rmse=13.057  (+/-0.7716)
#> 18 ✖ restart from best  rmse=13.064  (+/-0.7095)
#> 19 ♥ new best           rmse=12.645  (+/-0.7706)
#> 20 ◯ accept suboptimal  rmse=12.7  (+/-0.821)
#> 21 ─ discard suboptimal rmse=13.018  (+/-0.8047)
#> 22 ─ discard suboptimal rmse=14.812  (+/-1.017)
#> 23 ─ discard suboptimal rmse=13.098  (+/-0.921)
#> 24 ◯ accept suboptimal  rmse=12.708  (+/-0.7538)
#> 25 ◯ accept suboptimal  rmse=13.054  (+/-0.9046)

Again, h2o is doing all of the computational work for fitting models and tidymodels is proposing new parameter configurations.

One other nice feature of the new agua release is the h2o engine for the auto_ml() model. This builds a stacked ensemble on a set of different models (not unlike our stacks package but with far less code).

There is a great worked example on the agua website so make sure to check this out!

More spline recipe steps

Spline techniques allow linear models to produce nonlinear model curves. These are called basis expansion methods since they take a single numeric predictor and make additional nonlinear feature columns.

If you have ever used geom:smooth(), you have probably used a spline function.

The recipes package now has an expanded set of spline functions (with a common naming convention):

There is also another step to make polynomial functions: step_poly_bernstein()

These functions take different approaches to creating the new set of features. Take a look at the references to see the technical details.

Here is a simple example using the Ames data where we model the sale price as a nonlinear function of the longitude using a convex basis function:

data(ames)

ames$Sale_Price <- log10(ames$Sale_Price)

spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>% 
  step_spline_convex(Longitude, deg_free = 25)

spline_fit <- 
  spline_rec %>% 
  workflow( linear_reg() ) %>% 
  fit(data = ames)

spline_fit %>% 
  augment(ames) %>% 
  ggplot(aes(Longitude)) + 
  geom_point(aes(y = Sale_Price), alpha = 1 / 3) +
  geom_line(aes(y = .pred), col = "red", lwd = 1.5)

Not too bad but the model clearly over-fits on the extreme right tail of the predictor distribution.

Acknowledgements

It’s important that we thank everyone in the community that contributed to tidymodels: