The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.
Since the beginning of 2021, we have been publishing
quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the
tidymodels tag to find all tidymodels blog posts here, including our roundup posts as well as those that are more focused, like these from the past month or so:
- Improvements to model specification checking in tidymodels
- brulee 0.2.0
- Announcing bundle
- censored 0.1.0
- rsample 1.1.0
Since our last roundup post, there have been CRAN releases of 22 tidymodels packages. Here are links to their NEWS files:
- agua (0.1.0)
- applicable (0.1.0)
- bonsai (0.2.0)
- broom (1.0.1)
- brulee (0.2.0)
- butcher (0.3.0)
- censored (0.1.1)
- corrr (0.4.4)
- finetune (1.0.1)
- modeldata (1.0.1)
- modeldb (0.2.3)
- parsnip (1.0.2)
- plsmod (1.0.0)
- poissonreg (1.0.1)
- probably (0.1.0)
- recipes (1.0.2)
- rsample (1.1.0)
- spatialsample (0.2.1)
- textrecipes (1.0.1)
- tune (1.0.1)
- workflows (1.1.0)
- yardstick (1.1.0)
We’ll highlight two specific upgrades: one for agua and another for recipes.
With version 220.127.116.11 of the h2o package, agua can now tune h2o models as if they were any other type of model engine.
h2o has an excellent server-based computational engine for fitting a variety of different machine learning and statistical models. The h2o server can run locally or on some external high performance computing server. The downside is that it is light on tools for feature engineering and interactive data analysis.
Using h2o with tidymodels enables users to leverage the benefits of packages like recipes along with fast, server-based parallel processing.
While the syntax for model fitting and tuning are the same as any other non-h2o model, there are different ways to parallelize the work:
The h2o server has the ability to internally parallelize individual model computations. For example, when fitting trees, the search for the best split can be done using multiple threads. The number of threads that each model should be used is set with h2o.init(nthreads). The default (
-1) is to use all CPUs on the host.
When using grid search, h2o.grid(parallelism) determines how many models the h2o server should process at the same time. The default (
1) constrains the server to run the models sequentially.
R has external parallelization tools (such as the foreach and future packages) that can start new R processes to simultaneously do work. This would run many models in parallel. For h2o, this determines how many models the agua package could send to the server at once. This does not appear to be constrained by the
With h2o and tidymodels, you should probably use h2o’s parallelization. Using multiple approaches can work but only for some technologies. It’s still pretty complicated but we are working on un-complicating it.
To set up h2o parallelization, there is a new control argument called
backend_options. If you were doing a grid search, you first define how many threads the h2o server should use:
library(tidymodels) library(agua) library(finetune) h2o_thread_spec <- agua_backend_options(parallelism = 10)
Then, pass the output to any of the existing control functions:
grid_ctrl <- control_grid(backend_options = h2o_thread_spec)
Now h2o can parallel process 10 models at once.
Here is an example using a simulated data set with a numeric outcome:
We’ll use grid search to tune a boosted tree:
boost_spec <- boost_tree( trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), loss_reduction = tune() ) %>% set_engine("h2o") %>% set_mode("regression")
Now, let’s parallel process our computations.
# Start the h2o server h2o::h2o.init() # Multi-thread the model fits h2o_thread_spec <- agua_backend_options(parallelism = 10) grid_ctrl <- control_grid(backend_options = h2o_thread_spec)
We’ll evaluate a very small grid at first:
show_best(grid_res, metric = "rmse") %>% select(-.config, -.metric, -.estimator) #> # A tibble: 5 × 8 #> trees min_n tree_depth learn_rate loss_reduction mean n std_err #> <int> <int> <int> <dbl> <dbl> <dbl> <int> <dbl> #> 1 1954 17 13 0.0318 6.08e-8 13.1 10 0.828 #> 2 184 25 4 0.00000000164 6.56e-1 15.7 10 1.03 #> 3 1068 10 8 0.0000409 1.19e+1 17.4 10 1.08 #> 4 1500 37 10 0.0000108 9.97e-9 18.3 10 1.03 #> 5 985 18 7 0.0000000454 1.84e-3 18.4 10 1.04 autoplot(grid_res, metric = "rmse")
It was a small grid and most of the configurations were not especially good. We can further optimize the results by applying simulated annealing search to the best grid results.
sa_ctrl <- control_sim_anneal(backend_options = h2o_thread_spec) set.seed(4) sa_res <- boost_spec %>% tune_sim_anneal( outcome ~ ., resamples = sim_rs, initial = grid_res, iter = 25, control = sa_ctrl ) #> Optimizing rmse #> Initial best: 13.06400 #> 1 ♥ new best rmse=12.688 (+/-0.7899) #> 2 ◯ accept suboptimal rmse=12.849 (+/-0.8304) #> 3 ◯ accept suboptimal rmse=13.129 (+/-0.8266) #> 4 ◯ accept suboptimal rmse=13.678 (+/-0.9544) #> 5 + better suboptimal rmse=13.433 (+/-0.792) #> 6 + better suboptimal rmse=12.99 (+/-0.9031) #> 7 ─ discard suboptimal rmse=16.531 (+/-1.027) #> 8 ─ discard suboptimal rmse=13.522 (+/-0.9802) #> 9 ✖ restart from best rmse=13.097 (+/-0.8109) #> 10 ♥ new best rmse=12.66 (+/-0.8028) #> 11 ◯ accept suboptimal rmse=13.116 (+/-0.8135) #> 12 + better suboptimal rmse=12.714 (+/-0.7747) #> 13 ─ discard suboptimal rmse=13.074 (+/-0.6598) #> 14 ─ discard suboptimal rmse=14.489 (+/-1.028) #> 15 ◯ accept suboptimal rmse=12.715 (+/-0.8043) #> 16 ─ discard suboptimal rmse=13.788 (+/-1.027) #> 17 ─ discard suboptimal rmse=13.057 (+/-0.7716) #> 18 ✖ restart from best rmse=13.064 (+/-0.7095) #> 19 ♥ new best rmse=12.645 (+/-0.7706) #> 20 ◯ accept suboptimal rmse=12.7 (+/-0.821) #> 21 ─ discard suboptimal rmse=13.018 (+/-0.8047) #> 22 ─ discard suboptimal rmse=14.812 (+/-1.017) #> 23 ─ discard suboptimal rmse=13.098 (+/-0.921) #> 24 ◯ accept suboptimal rmse=12.708 (+/-0.7538) #> 25 ◯ accept suboptimal rmse=13.054 (+/-0.9046)
Again, h2o is doing all of the computational work for fitting models and tidymodels is proposing new parameter configurations.
One other nice feature of the new agua release is the h2o engine for the
auto_ml() model. This builds a stacked ensemble on a set of different models (not unlike our
stacks package but with far less code).
There is a great worked example on the agua website so make sure to check this out!
Spline techniques allow linear models to produce nonlinear model curves. These are called basis expansion methods since they take a single numeric predictor and make additional nonlinear feature columns.
If you have ever used
geom:smooth(), you have probably used a spline function.
The recipes package now has an expanded set of spline functions (with a common naming convention):
There is also another step to make polynomial functions:
These functions take different approaches to creating the new set of features. Take a look at the references to see the technical details.
Here is a simple example using the Ames data where we model the sale price as a nonlinear function of the longitude using a convex basis function:
data(ames) ames$Sale_Price <- log10(ames$Sale_Price) spline_rec <- recipe(Sale_Price ~ Longitude, data = ames) %>% step_spline_convex(Longitude, deg_free = 25) spline_fit <- spline_rec %>% workflow( linear_reg() ) %>% fit(data = ames) spline_fit %>% augment(ames) %>% ggplot(aes(Longitude)) + geom_point(aes(y = Sale_Price), alpha = 1 / 3) + geom_line(aes(y = .pred), col = "red", lwd = 1.5)
Not too bad but the model clearly over-fits on the extreme right tail of the predictor distribution.
It’s important that we thank everyone in the community that contributed to tidymodels:
- agua: @coforfe, @gouthaman87, @jeliason, @qiushiyan, and @topepo.
- applicable: @topepo.
- bonsai: @barrettlayman, @DesmondChoy, @dfsnow, @dpprdan, @jameslamb, and @simonpcouch.
- broom: @AaronRendahl, @AmeliaMN, @bbolker, @capnrefsmmat, @corybrunson, @ddsjoberg, @friendly, @johanneskoch94, and @simonpcouch.
- brulee: @topepo.
- butcher: @DavisVaughan, @EmilHvitfeldt, @juliasilge, and @topepo.
- censored: @bcjaeger, @hfrick, @mattwarkentin, @simonpcouch, @therneau, and @topepo.
- corrr: @jagodap, and @topepo.
- finetune: @DMozzanica, @EmilHvitfeldt, @qiushiyan, @simonpcouch, and @topepo.
- modeldata: @topepo.
- modeldb: @topepo.
- parsnip: @barnabywalker, @bcjaeger, @ben-e, @CarmenCiardiello, @daniel-althoff, @EmilHvitfeldt, @gustavomodelli, @hfrick, @joeycouse, @john-b-edwards, @juliasilge, @mdneuzerling, @mikemahoney218, @mrkaye97, @qiushiyan, @siegfried, @simonpcouch, @topepo, @TylerGrantSmith, and @zhaoliang0302.
- plsmod: @topepo.
- poissonreg: @hfrick, @mattwarkentin, and @simonpcouch.
- probably: @DavisVaughan.
- recipes: @abichat, @adisarid, @EmilHvitfeldt, @JamesHWade, @juliasilge, @luisDVA, @naveranoc, @nhward, @RMHogervorst, @ruddnr, @simonpcouch, @topepo, and @zhaoliang0302.
- rsample: @DavisVaughan, @EmilHvitfeldt, @hfrick, @juliasilge, @mikemahoney218, and @tjmahr.
- spatialsample: @mikemahoney218.
- textrecipes: @EmilHvitfeldt, @hadley, and @PursuitOfDataScience.
- tune: @Athospd, @EmilHvitfeldt, @frankhezemans, @kevin199011, @misken, @PursuitOfDataScience, @qiushiyan, @simonpcouch, and @topepo.
- workflows: @DavisVaughan, @EmilHvitfeldt, @JosiahParry, @lrossouw, @msberends, @simonpcouch, and @topepo.
- yardstick: @DavisVaughan, @deschen1, @EmilHvitfeldt, @juliasilge, @leonhGeis, @PriitPaluoja, and @topepo.