usemodels 0.0.1

  tidymodels, tune, parsnip, recipes

  Max Kuhn

We’re very excited to announce the first release of the usemodels package. The tidymodels packages are designed to provide modeling functions that are highly flexible and modular. This is powerful, but sometimes a template or skeleton showing how to start is helpful. The usemodels package creates templates for tidymodels analyses so you don’t have to write as much new code.

You can install it from CRAN with:

install.packages("usemodels")

This blog post will show how to use the package.

Let’s start by creating a glmnet linear regression model for the mtcars data using tidymodels. This model is usually tuned over the amount and type of regularization. In tidymodels, there are a few intermediate steps for a glmnet model:

  • Create a parsnip model object and define the tuning parameters that we want to optimize.

  • Create a recipe that, at minimum, centers and scales the predictors. For some data sets, we also need to create dummy variables from any factor-encoded predictor columns.

  • Define a resampling scheme for our data (a minimal sketch of this step follows the list).

  • Choose a function from the tune package, such as tune_grid(), to optimize the parameters. For grid search, we’ll also need a grid of candidate parameter values (or let the function choose one for us).
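For the resampling step, a minimal sketch might use 10-fold cross-validation from rsample; the scheme and seed here are placeholder choices rather than recommendations:

library(tidymodels)

set.seed(123)
mtcars_folds <- vfold_cv(mtcars, v = 10)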

We recognize that this is more code than you might have written with a package like caret. However, the tidymodels ecosystem enables a wider variety of modeling techniques and is more versatile.

The new usemodels package can automatically generate much of this code infrastructure. For example:

> library(usemodels)
> use_glmnet(mpg ~ ., data = mtcars)

which prints the following template to the console:

glmnet_recipe <- 
  recipe(formula = mpg ~ ., data = mtcars) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_predictors(), -all_nominal()) 

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("regression") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

glmnet_grid <- tidyr::crossing(penalty = 10^seq(-6, -1, length.out = 20), mixture = c(0.05, 
    0.2, 0.4, 0.6, 0.8, 1)) 

glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = stop("add your rsample object"), grid = glmnet_grid) 

This can be copied to the source window and edited. Some notes:

  • For this model, it is possible to prescribe a default grid of candidate tuning parameter values that work well about 90% of the time. For other models, the grid might be data-driven. In these cases, the tune package functions can estimate an appropriate grid.

  • The extra recipe steps are the recommended preprocessing for this model. Since this varies from model to model, the template contains only the minimal required steps; your data might require additional operations.

  • One thing that should not be automated is the choice of resampling method. The code templates require the user to choose an appropriate rsample function, as sketched below.
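For example, assuming the mtcars_folds object created earlier, the last lines of the template could be edited to:

glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = mtcars_folds, grid = glmnet_grid)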

In case you are unfamiliar with the model and its preprocessing needs, a verbose option prints comments that explain why some steps are included. For the glmnet model, the comments added to the recipe state:

Regularization methods sum up functions of the model slope coefficients. Because of this, the predictor variables should be on the same scale. Before centering and scaling the numeric predictors, any predictors with a single unique value are filtered out.
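These comments are printed inline when the verbose option is enabled; assuming the argument is named verbose, the call would be:

use_glmnet(mpg ~ ., data = mtcars, verbose = TRUE)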

Let’s look at another example. The ad_data data set in the modeldata package has rows for 333 patients and a factor outcome recording their level of cognitive impairment (e.g., Alzheimer’s disease). There is also a categorical predictor in the data, the Apolipoprotein E genotype, which has six levels. Suppose the Genotype column were encoded as character instead of as a factor. This can be a problem if resampling happens to exclude one of the genotypes, since the factor levels inferred from the character values would then differ across resamples.

Let’s use a boosted tree model with the xgboost package and change the default prefix for the objects:

> library(tidymodels)
> data(ad_data)
> 
> ad_data$Genotype <- as.character(ad_data$Genotype)
> 
> use_xgboost(Class ~ ., data = ad_data, prefix = "impairment")
impairment_recipe <- 
  recipe(formula = Class ~ ., data = ad_data) %>% 
  step_string2factor(one_of("Genotype")) %>% 
  step_novel(all_nominal(), -all_outcomes()) %>% 
  step_dummy(all_nominal(), -all_outcomes(), one_hot = TRUE) %>% 
  step_zv(all_predictors()) 

impairment_spec <- 
  boost_tree(trees = tune(), min_n = tune(), tree_depth = tune(), learn_rate = tune(), 
    loss_reduction = tune(), sample_size = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("xgboost") 

impairment_workflow <- 
  workflow() %>% 
  add_recipe(impairment_recipe) %>% 
  add_model(impairment_spec) 

set.seed(64393)
impairment_tune <-
  tune_grid(impairment_workflow, resamples = stop("add your rsample object"), 
    grid = stop("add number of candidate points"))

Notice that the line

step_string2factor(one_of("Genotype")) 

is included in the recipe along with a step to generate one-hot encoded dummy variables. xgboost is one of the few tree ensemble implementations that requires the user to create dummy variables. This step is only added to the template when it is required for that model.

Also, for this particular model, we recommend using a space-filling design for the grid, but the user must choose the number of grid points, as sketched below.
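One option is to pass an integer as the grid argument, in which case tune_grid() generates a space-filling design with that many candidate points. A minimal sketch, assuming a hypothetical rsample object named impairment_folds:

set.seed(64393)
impairment_tune <-
  tune_grid(impairment_workflow, resamples = impairment_folds, grid = 25)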

The templates included in the inaugural version of the package are:

ls("package:usemodels", pattern = "^use_")
## [1] "use_earth"   "use_glmnet"  "use_kknn"    "use_ranger"  "use_xgboost"
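Each of these functions uses the same formula and data interface; for example, assuming the same arguments as use_glmnet(), a ranger random forest template for the first example could be generated with:

use_ranger(mpg ~ ., data = mtcars)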

We’ll likely add more, but please file an issue if there are templates that you would like to see prioritized.