Improvements to model specification checking in tidymodels

  tidymodels, parsnip

  Simon Couch

We’re stoked to announce the new release of parsnip v1.0.2 on CRAN! parsnip provides a tidy, unified interface to various statistical and machine learning models. This release includes improvements to errors and warnings that proliferate throughout the tidymodels ecosystem. These changes are meant to better anticipate common mistakes and nudge users informatively when defining model specifications. parsnip v1.0.2 includes a number of other changes that you can read about in the release notes.

parsnip and its extension packages

We’ll load parsnip, along with other core packages in tidymodels, using the tidymodels meta-package:

library(tidymodels)

parsnip provides a unified interface to machine learning models, supporting a wide array of modeling approaches implemented across numerous R packages. For instance, the code to specify a linear regression model using the glmnet package:

linear_reg() %>%
  set_engine("glmnet") %>%
  set_mode("regression")
#> Linear Regression Model Specification (regression)
#> 
#> Computational engine: glmnet

…is quite similar to that needed to specify a boosted tree regression model using xgboost:

boost_tree() %>%
  set_engine("xgboost") %>%
  set_mode("regression")
#> Boosted Tree Model Specification (regression)
#> 
#> Computational engine: xgboost

We refer to these objects as model specifications. They have three main components:

  • The model type: In this case, a linear regression or boosted tree.
  • The mode: The learning task, such as regression or classification.
  • The engine: The implementation for the given model type and mode, usually an R package.
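As a minimal sketch, the three components can be read back off a specification object. This assumes that the first class of the object names the model type and that the mode and engine are stored as the list elements mode and engine; the lm engine is used only because it needs no extra packages.

```r
library(parsnip)

# Build a specification with an engine that ships with base R.
spec <- linear_reg() %>%
  set_engine("lm")

# Model type: assumed to be the first class of the specification object.
model_type <- class(spec)[1]

# Mode and engine: assumed to be stored as list elements on the object.
spec_mode <- spec$mode
spec_engine <- spec$engine

c(type = model_type, mode = spec_mode, engine = spec_engine)
```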

This conceptual split of the model specification is what gives parsnip its consistent syntax, and it also makes parsnip extensible. Anyone (including you!) can write a parsnip extension package that tightly integrates with other tidymodels packages out-of-the-box. We maintain a few of these packages ourselves, such as:

  • agua: models from the H2O modeling ecosystem
  • baguette: bootstrap aggregating ensemble models
  • censored: censored regression and survival analysis

Similarly, community members outside of the tidymodels team have written parsnip extension packages of their own.

Much of our work on improving errors and warnings in this release has focused on parsnip’s integration with its extensions.

Improvements to errors and warnings

Two “big ideas” have helped us focus our efforts related to improving errors and messages in the ecosystem.

  • The same kind of mistake should raise the same prompt.
  • Don’t tell the user they did something they didn’t do.

We’ll address both in the sections below!

The same kind of mistake should raise the same prompt

The first problem we sought to address with these changes is that, in some cases, the same conceptual mistake could lead to different kinds of errors from parsnip and the packages that depend on it.

A common mistake that users (and we, as developers) make when defining model specifications is forgetting to load the needed extension package for a given model specification.

For example, parsnip supports bagged decision tree models via the bag_tree() model type, though it requires extension packages for actual implementations of the model. The censored package implements the censored regression mode for bagged decision trees via rpart, and the baguette package implements a few additional engines for regression and classification with this model type.
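One way to check which engine and mode combinations are registered for a model type in the current session is parsnip's show_engines() function. The sketch below only loads parsnip, so the result may well contain no rows; loading an extension package is what registers additional combinations.

```r
library(parsnip)

# List the engine/mode combinations currently registered for bag_tree().
# What appears here depends on which extension packages are loaded.
engines <- show_engines("bag_tree")
engines
```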

In parsnip v1.0.1, if we specified a bag_tree() model without loading any extension packages, we’d see:

bt <-
  bag_tree() %>%
  set_engine("rpart")
  
bt
#> parsnip could not locate an implementation for `bag_tree` model specifications
#> using the `rpart` engine.
#>
#> Bagged Decision Tree Model Specification (unknown)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

After seeing this prompt, we may not remember which extension package was the one that implemented this specification. A reasonable guess might be the censored package:

library(censored)
#> Loading required package: survival

Then, trying again:

bag_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")
#> Error in `stop_incompatible_mode()`:
#> ! Available modes for engine rpart are: 'unknown', 'censored regression'

The censored package clearly wasn’t the right one to load. Strangely, though, a side effect of loading it was that the prompt then became more cryptic, and it was converted from a message to an error. Perhaps even more strangely, if we instead supply an engine that only has an implementation in baguette and not censored, we see a different error:

bag_tree() %>%
  set_engine("C5.0")
#> Error in `check_spec_mode_engine_val()`:
#> ! Engine 'C5.0' is not supported for `bag_tree()`. See `show_engines('bag_tree')`.

Not only is this error different from the one above, but it seems to suggest that there is literally no C5.0 implementation anywhere.

Returning to our bt object, suppose we moved forward with defining tuning parameters, and want to define the grid to optimize over:

bt <- 
  bt %>%
  update(cost_complexity = tune())

extract_parameter_set_dials(bt) %>%
  grid_random(size = 3)
#> Error in `grid_random()`:
#> ! At least one parameter object is required.

So far in this section, we’ve made the same mistake—failing to load the needed parsnip extension package—four times, and received four different prompts.

The good news is that, in each of the above cases, the newest version of parsnip supplies the same kind of prompt, and a much more helpful one.

library(parsnip)

bag_tree() %>%
  set_engine("rpart")
#> ! parsnip could not locate an implementation for `bag_tree` model
#>   specifications using the `rpart` engine.
#> ℹ The parsnip extension packages censored and baguette implement support for
#>   this specification.
#> ℹ Please install (if needed) and load to continue.
#> 
#> Bagged Decision Tree Model Specification (unknown mode)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

Note how the above message now suggests the two possible parsnip extensions that could provide support for this model specification.

If we load censored, this specification becomes valid, since censored implements a censored regression mode for bagged trees:

library(censored)
#> Loading required package: survival

bag_tree() %>%
  set_engine("rpart")
#> Bagged Decision Tree Model Specification (unknown mode)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

The censored package, however, doesn’t implement a regression mode for bagged trees. Thus, if we set the mode to regression but fail to load the package that provides support for that mode, parsnip will again prompt us to load the correct package:

bag_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")
#> ! parsnip could not locate an implementation for `bag_tree` regression model
#>   specifications using the `rpart` engine.
#> ℹ The parsnip extension package baguette implements support for this
#>   specification.
#> ℹ Please install (if needed) and load to continue.
#> 
#> Bagged Decision Tree Model Specification (regression)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

Loading censored no longer changes the kind of prompt we see for the C5.0 engine, either:

bag_tree() %>%
  set_engine("C5.0")
#> ! parsnip could not locate an implementation for `bag_tree` model
#>   specifications using the `C5.0` engine.
#> ℹ The parsnip extension package baguette implements support for this
#>   specification.
#> ℹ Please install (if needed) and load to continue.
#> 
#> Bagged Decision Tree Model Specification (unknown mode)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: C5.0

Finally, with parsnip v1.0.2, if we try to extract information about tuning parameters for a model that’s not well-specified, the message about missing extensions is elevated to an error:

bt <- 
  bt %>%
  update(cost_complexity = tune())

extract_parameter_set_dials(bt) %>%
  grid_random(size = 3)
#> Error:
#> ! parsnip could not locate an implementation for `bag_tree` regression
#>   model specifications using the `rpart` engine.
#> ℹ The parsnip extension package baguette implements support for this
#>   specification.
#> ℹ Please install (if needed) and load to continue.

Given parsnip’s infrastructure, the technical conditions that raise these four prompts are quite different, but the technical reasons don’t matter; the mistake being made is the same, and that’s what ought to determine the prompt raised.
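To illustrate the idea, here is a rough sketch of how one formatter could serve every code path, so that only the severity differs between contexts. The functions missing_extension_prompt() and signal_missing_extension() are hypothetical names invented for this example and are not parsnip's actual internals; the sketch assumes the rlang package for signalling.

```r
library(rlang)

# Hypothetical helper: build the prompt from what the user supplied.
# `engine` and `mode` stay NULL when the user did not set them.
missing_extension_prompt <- function(model, pkgs, engine = NULL, mode = NULL) {
  what <- paste0("`", model, "`")
  if (!is.null(mode)) {
    what <- paste(what, mode)
  }
  what <- paste(what, "model specifications")
  if (!is.null(engine)) {
    what <- paste0(what, " using the `", engine, "` engine")
  }

  plural <- length(pkgs) > 1
  noun <- if (plural) "packages" else "package"
  verb <- if (plural) "implement" else "implements"

  c(
    paste0("parsnip could not locate an implementation for ", what, "."),
    "i" = paste(
      "The parsnip extension", noun, paste(pkgs, collapse = " and "),
      verb, "support for this specification."
    ),
    "i" = "Please install (if needed) and load to continue."
  )
}

# Hypothetical signaller: same text everywhere, message or error by context.
signal_missing_extension <- function(..., as_error = FALSE, call = caller_env()) {
  msg <- missing_extension_prompt(...)
  if (as_error) {
    abort(msg, call = call)
  } else {
    inform(msg)
  }
}

# Printing a specification would inform; building a tuning grid would abort.
signal_missing_extension("bag_tree", pkgs = c("censored", "baguette"), engine = "rpart")
```

Keeping the wording in one place means that a new code path cannot drift into its own phrasing of the same mistake.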

Don’t tell the user they did something they didn’t do

Another consideration that helped us frame these changes is that we feel error messages shouldn’t reference operations that users don’t need to know about. We’ll return to the example of forgetting to load extension packages to elaborate on what we mean here.

With parsnip v1.0.1, if we just load the package and initialize a bag_tree() model, we see:

library(parsnip)

bag_tree()
#> parsnip could not locate an implementation for `bag_tree` model specifications
#> using the `rpart` engine.
#> 
#> Bagged Decision Tree Model Specification (unknown)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

Note the ending of the message: “…using the rpart engine.” We didn’t specify that we wanted to use rpart as an engine, yet that seems to be what went wrong!

Readers who have fitted bagged decision tree models with parsnip before may realize that rpart is the default engine for these models. This shouldn’t be requisite knowledge to interpret the message, though, and it doesn’t help in addressing the issue. With v1.0.2, the message only mentions information that users actually gave us, and it tells them exactly which packages they might need to install or load:

library(parsnip)

bag_tree()
#> ! parsnip could not locate an implementation for `bag_tree` model
#>   specifications.
#> ℹ The parsnip extension packages censored and baguette implement support for
#>   this specification.
#> ℹ Please install (if needed) and load to continue.
#> 
#> Bagged Decision Tree Model Specification (unknown mode)
#> 
#> Main Arguments:
#>   cost_complexity = 0
#>   min_n = 2
#> 
#> Computational engine: rpart

We hinted at another example of this guideline in the previous section; parsnip shouldn’t refer to internal functions when it raises error messages. Above, with parsnip v1.0.1, we saw:

library(censored)
#> Loading required package: survival

bag_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")
#> Error in `stop_incompatible_mode()`:
#> ! Available modes for engine rpart are: 'unknown', 'censored regression'

The error points to stop_incompatible_mode(), a function that parsnip uses internally to check modes. A different internal function, check_spec_mode_engine_val(), flags super silly modes:

library(parsnip)

bag_tree() %>%
  set_engine("rpart") %>%
  set_mode("beep bop boop")
#> Error in `check_spec_mode_engine_val()`:
#> ! 'beep bop boop' is not a known mode for model `bag_tree()`.

The important part, though, is that the technical reasons don’t matter. Users don’t know—and don’t need to know—what stop_incompatible_mode() or check_spec_mode_engine_val() do.

In parsnip v1.0.2, we now point users to the function they actually called that eventually gave rise to the error:

bag_tree() %>%
  set_engine("rpart") %>%
  set_mode("beep bop boop")
#> Error in `set_mode()`:
#> ! 'beep bop boop' is not a known mode for model `bag_tree()`.
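A common way to get this kind of attribution with the rlang package is for internal checkers to accept a call argument that defaults to caller_env() and to pass it along to abort(). The sketch below uses two hypothetical functions, check_mode() and my_set_mode(), to show the pattern; it is not parsnip's actual implementation.

```r
library(rlang)

# Hypothetical internal checker. `call` defaults to the environment of
# whichever function called check_mode(), so the error is attributed there.
check_mode <- function(mode,
                       known = c("regression", "classification"),
                       call = caller_env()) {
  if (!mode %in% known) {
    abort(paste0("'", mode, "' is not a known mode."), call = call)
  }
  invisible(mode)
}

# Hypothetical user-facing function that delegates validation.
my_set_mode <- function(mode) {
  check_mode(mode)
  mode
}

# Capture the error to inspect which call it is attributed to.
err <- tryCatch(my_set_mode("beep bop boop"), error = function(e) e)
err$call
```

With this pattern the internal helper can be renamed or restructured freely, since its name never surfaces in the error the user sees.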

We hope these changes improve folks’ experience when modeling with parsnip in the future!

Acknowledgements

Thanks to the folks who have contributed to this release of parsnip via GitHub: @gustavomodelli, @joeycouse, @mrkaye97, @siegfried.

Contributions from many others, in the form of Stack Overflow and RStudio Community posts, have been greatly helpful in our work on these improvements.