Q1 2022 tidymodels digest

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

library(tidymodels)
#> ── Attaching packages ──────────────────────────── tidymodels 0.2.0 ──
#> ✓ broom        0.7.12     ✓ rsample      0.1.1 
#> ✓ dials        0.1.0      ✓ tibble       3.1.6 
#> ✓ dplyr        1.0.8      ✓ tidyr        1.2.0 
#> ✓ infer        1.0.0      ✓ tune         0.2.0 
#> ✓ modeldata    0.1.1      ✓ workflows    0.2.6 
#> ✓ parsnip      0.2.1      ✓ workflowsets 0.2.1 
#> ✓ purrr        0.3.4      ✓ yardstick    0.0.9 
#> ✓ recipes      0.2.0
#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──
#> x purrr::discard() masks scales::discard()
#> x dplyr::filter()  masks stats::filter()
#> x dplyr::lag()     masks stats::lag()
#> x recipes::step()  masks stats::step()
#> • Dig deeper into tidy modeling with R at https://www.tmwr.org

Since the beginning of last year, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts, including our roundup posts as well as more focused posts from the past month or so.

Since our last roundup post, there have been 21 CRAN releases of tidymodels packages. You can install these updates from CRAN with:

install.packages(c(
  "baguette", "broom", "brulee", "dials", "discrim", "finetune",
  "hardhat", "multilevelmod", "parsnip", "plsmod", "poissonreg",
  "recipes", "rules", "stacks", "textrecipes", "tune",
  "tidymodels", "usemodels", "vetiver", "workflows", "workflowsets"
))

Each package has a NEWS file describing its changes; you’ll notice that there are a lot! We know it may be bothersome to keep up with all these changes, so we want to draw your attention to our recent blog posts and also highlight a few more useful updates in today’s blog post.

We’re really excited about brulee and vetiver but will share more in upcoming blog posts.

Feature hashing

The newest textrecipes release provides support for feature hashing, a feature engineering approach that can be helpful when working with high cardinality categorical data or text. A hashing function takes an input of variable size and maps it to an output of fixed size. Hashing functions are commonly used in cryptography and databases, and we can create a hash in R using rlang::hash():

library(textrecipes)
data(Sacramento)
set.seed(123)
sac_split <- initial_split(Sacramento, strata = price)
sac_train <- training(sac_split)
sac_test  <- testing(sac_split)

tibble(sac_train) %>%
  mutate(zip_hash = map_chr(zip, rlang::hash)) %>%
  select(zip, zip_hash)
#> # A tibble: 698 × 2
#>    zip    zip_hash                        
#>    <fct>  <chr>                           
#>  1 z95838 32cbb7d319c97f062be64075c2ae6c07
#>  2 z95815 55d08d816f0d2e9ec16af15239826e91
#>  3 z95824 235b72b9a37a6154552498eb3f90e9e3
#>  4 z95841 d973597ab5cc48a0dfe54b84a91249e1
#>  5 z95842 c44537f2eecd51707b19e69027228a85
#>  6 z95820 e1b86cbed49c029f9fa25bba94ede11e
#>  7 z95670 60ee71387789bb8c58748e4632089cc4
#>  8 z95838 32cbb7d319c97f062be64075c2ae6c07
#>  9 z95815 55d08d816f0d2e9ec16af15239826e91
#> 10 z95822 8e212bdf9650ef39a1634e6e18529834
#> # … with 688 more rows

The variable zip in this data on home sales in Sacramento, CA is of “high cardinality” (as ZIP codes often are) with 67 unique values. When we hash() the ZIP code, we get out, well, a hash value, and we will always get the same hash value for the same input (as you can see for ZIP code 95838 here). We can choose the fixed size of our hashed output to reduce the number of possible values to whatever we want; it turns out this works well in a lot of situations.

Let’s use a hashing algorithm like this one (with an output size of 16) to create binary indicator variables for this high cardinality zip:

hash_rec <- 
  recipe(price ~ zip + beds + baths, data = sac_train) %>%
  step_dummy_hash(zip, signed = FALSE, num_terms = 16L)

prep(hash_rec) %>% bake(new_data = NULL)
#> # A tibble: 698 × 19
#>    dummyhash_zip_01 dummyhash_zip_02 dummyhash_zip_03 dummyhash_zip_04
#>               <dbl>            <dbl>            <dbl>            <dbl>
#>  1                0                0                0                0
#>  2                0                1                0                0
#>  3                0                0                1                0
#>  4                1                0                0                0
#>  5                0                0                0                0
#>  6                0                0                0                0
#>  7                0                1                0                0
#>  8                0                0                0                0
#>  9                0                1                0                0
#> 10                0                0                0                0
#> # … with 688 more rows, and 15 more variables:
#> #   dummyhash_zip_05 <dbl>, dummyhash_zip_06 <dbl>,
#> #   dummyhash_zip_07 <dbl>, dummyhash_zip_08 <dbl>,
#> #   dummyhash_zip_09 <dbl>, dummyhash_zip_10 <dbl>,
#> #   dummyhash_zip_11 <dbl>, dummyhash_zip_12 <dbl>,
#> #   dummyhash_zip_13 <dbl>, dummyhash_zip_14 <dbl>,
#> #   dummyhash_zip_15 <dbl>, dummyhash_zip_16 <dbl>, beds <int>, …

We now have 16 columns for zip (along with the other predictors and the outcome), instead of the over 60 we would have had by making regular dummy variables.
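
To make the idea of hashing values into a fixed number of indicator columns more concrete, here is a toy sketch in base R. The function toy_hash_bucket() is hypothetical and written only for illustration; it is not the hashing function that step_dummy_hash() uses internally, so the bucket assignments will not match the output above.

```r
# Toy string hash (an assumption for illustration, not what textrecipes uses):
# fold the character codes into one number, then take it modulo num_terms.
toy_hash_bucket <- function(x, num_terms = 16L) {
  h <- 0
  for (code in utf8ToInt(x)) {
    h <- (h * 31 + code) %% 2147483647
  }
  # Shift from 0-based remainders to 1-based column positions.
  as.integer(h %% num_terms) + 1L
}

zips <- c("z95838", "z95815", "z95824", "z95838")
buckets <- vapply(zips, toy_hash_bucket, integer(1), USE.NAMES = FALSE)

# One row per ZIP code, one column per bucket, a single 1 in each row.
indicators <- matrix(0L, nrow = length(zips), ncol = 16L)
indicators[cbind(seq_along(zips), buckets)] <- 1L

buckets
rowSums(indicators)
```

The same ZIP code always lands in the same column, and two different ZIP codes can share a column (a collision), which is the price paid for the fixed output size.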

For more on feature hashing including its benefits (fast and low memory!) and downsides (not directly interpretable!), check out Section 6.7 of Supervised Machine Learning for Text Analysis with R and/or Section 17.4 of Tidy Modeling with R.

More customization for workflow sets

Last year about this time, we introduced workflowsets, a new package for creating, handling, and tuning multiple workflows at once. See Section 7.5 and especially Chapter 15 of Tidy Modeling with R for more on workflow sets. In the latest release of workflowsets, we provide finer control over customizing the workflows you create. First, you can create a standard workflow set by crossing a set of models with a set of preprocessors (let’s just use the feature hashing recipe we already created):

glmnet_spec <- 
  linear_reg(penalty = tune(), mixture = tune()) %>% 
  set_engine("glmnet")

mars_spec <- 
  mars(prod_degree = tune()) %>%
  set_engine("earth") %>% 
  set_mode("regression")

old_set <- 
  workflow_set(
    preproc = list(hash = hash_rec), 
    models = list(MARS = mars_spec, glmnet = glmnet_spec)
  )

old_set
#> # A workflow set/tibble: 2 × 4
#>   wflow_id    info             option    result    
#>   <chr>       <list>           <list>    <list>    
#> 1 hash_MARS   <tibble [1 × 4]> <opts[0]> <list [0]>
#> 2 hash_glmnet <tibble [1 × 4]> <opts[0]> <list [0]>

The option column is a placeholder for any arguments to use when we evaluate the workflow; the possibilities here are any argument to functions like tune_grid() or fit_resamples(). But what about arguments that belong not to the workflow as a whole, but to a recipe or a parsnip model? In the new release, we added support for customizing those kinds of arguments via update_workflow_model() and update_workflow_recipe(). This lets you, for example, say that you want to use a sparse blueprint for fitting:

sparse_bp <- hardhat::default_recipe_blueprint(composition = "dgCMatrix")
new_set <- old_set %>%
  update_workflow_recipe("hash_glmnet", hash_rec, blueprint = sparse_bp)

Now we can tune this workflow set, with the sparse blueprint for the glmnet model, over a set of resampling folds.

set.seed(123)
folds <- vfold_cv(sac_train, strata = price)

new_set %>%
  workflow_map(resamples = folds, grid = 5, verbose = TRUE)
#> i 1 of 2 tuning:     hash_MARS
#> ✓ 1 of 2 tuning:     hash_MARS (2.2s)
#> i 2 of 2 tuning:     hash_glmnet
#> ✓ 2 of 2 tuning:     hash_glmnet (3.9s)
#> # A workflow set/tibble: 2 × 4
#>   wflow_id    info             option    result   
#>   <chr>       <list>           <list>    <list>   
#> 1 hash_MARS   <tibble [1 × 4]> <opts[2]> <tune[+]>
#> 2 hash_glmnet <tibble [1 × 4]> <opts[2]> <tune[+]>
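
The companion function update_workflow_model() works the same way for model specifications. As a minimal sketch, separate from the Sacramento example, the code below uses the built-in mtcars data and an arbitrary polynomial term (both are assumptions chosen only to keep the snippet small) to supply a model-specific formula for one workflow in a set:

```r
library(tidymodels)

# A one-workflow set built from a formula preprocessor and a linear model.
base_set <- workflow_set(
  preproc = list(basic = mpg ~ wt + hp),
  models = list(lm = linear_reg())
)

# Replace the model for "basic_lm" and pass a model formula along with it.
poly_set <- base_set %>%
  update_workflow_model(
    "basic_lm",
    spec = linear_reg(),
    formula = mpg ~ wt + poly(hp, 2)
  )

# Pull the updated workflow back out and fit it.
poly_fit <- poly_set %>%
  extract_workflow("basic_lm") %>%
  fit(data = mtcars)

tidy(poly_fit)
```

The preprocessor formula still controls which columns reach the model, while the model formula controls how the model uses them.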

New parameter objects and parameter handling

Even if you are a regular tidymodels user, you may not have thought much about dials. This is an infrastructure package that is used to create and manage model hyperparameters. In the latest release of dials, we provide a handful of new parameters for various models and feature engineering approaches. These include several parameters for the new parsnip::bart() model (Bayesian additive regression trees):

prior_outcome_range()
#> Prior for Outcome Range (quantitative)
#> Range: (0, 5]
prior_terminal_node_coef()
#> Terminal Node Prior Coefficient (quantitative)
#> Range: (0, 1]
prior_terminal_node_expo()
#> Terminal Node Prior Exponent (quantitative)
#> Range: [0, 3]

This version of dials, along with the new hardhat release, also provides new functions for extracting single parameters and parameter sets from modeling objects.

recipe(price ~ zip + beds + baths, data = sac_train) %>%
  step_dummy_hash(zip, signed = FALSE, num_terms = tune()) %>%
  extract_parameter_set_dials()
#> Collection of 1 parameters for tuning
#> 
#>  identifier      type    object
#>   num_terms num_terms nparam[+]

You can also extract a single parameter by name:

mars_spec %>% extract_parameter_dials("prod_degree")
#> Degree of Interaction (quantitative)
#> Range: [1, 2]
glmnet_spec %>% extract_parameter_dials("penalty")
#> Amount of Regularization (quantitative)
#> Transformer:  log-10 
#> Range (transformed scale): [-10, 0]
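
Once a parameter set is extracted, it can be adjusted and turned into a tuning grid. Here is a minimal sketch; the narrower penalty range and the number of levels are arbitrary choices for illustration, and the model specification is repeated so the snippet runs on its own:

```r
library(tidymodels)

glmnet_spec <-
  linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Extract the parameter set, then narrow the penalty range (log-10 scale).
glmnet_params <-
  glmnet_spec %>%
  extract_parameter_set_dials() %>%
  update(penalty = penalty(range = c(-5, 0)))

# Regular grid: 6 penalty values crossed with 3 mixture values.
glmnet_grid <- grid_regular(glmnet_params, levels = c(penalty = 6, mixture = 3))
glmnet_grid
```

A grid like this could then be passed to the grid argument of tune_grid() or workflow_map() in place of an integer.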

Acknowledgements

We’d like to extend our thanks to all of the contributors who helped make these releases during Q1 possible!

baguette: @EmilHvitfeldt and @hfrick.
broom: @cgoo4, @colinbrislawn, @DanChaltiel, @ddsjoberg, @fschaffner, @grantmcdermott, @hughjonesd, @jennybc, @Marc-Girondot, @MichaelChirico, @mlaviolet, @oliverbothe, @PursuitOfDataScience, @simonpcouch, and @vincentarelbundock.
brulee: @dfalbel, @EmilHvitfeldt, and @topepo.
dials: @EmilHvitfeldt, @hfrick, and @py9mrg.
discrim: @deschen1, @EmilHvitfeldt, @hfrick, @jmarshallnz, and @juliasilge.
finetune: @juliasilge, @Steviey, and @topepo.
hardhat: @DavisVaughan, @ddsjoberg, @EmilHvitfeldt, @hfrick, and @MasterLuke84.
multilevelmod: @EmilHvitfeldt and @sitendug.
parsnip: @brunocarlin, @dietrichson, @edgararuiz, @EmilHvitfeldt, @hfrick, @jmarshallnz, @juliasilge, @mattwarkentin, @nikhilpathiyil, @nvelden, @t-kalinowski, @tiagomaie, @tolliam, and @topepo.
plsmod: @EmilHvitfeldt and @topepo.
poissonreg: @EmilHvitfeldt and @juliasilge.
recipes: @agwalker82, @AndrewKostandy, @aridf, @brunocarlin, @DoktorMike, @duccioa, @EmilHvitfeldt, @FieteO, @hfrick, @joeycouse, @juliasilge, @lionel-, @mattwarkentin, @mdsteiner, @MichaelChirico, @spsanderson, @themichjam, @tmastny, @tomazweiss, @topepo, @walrossker, and @zenggyu.
rules: @EmilHvitfeldt, @juliasilge, and @wdkeyzer.
stacks: @amcmahon17, @py9mrg, @Saarialho, @siegfried, @simonpcouch, @StuieT85, @topepo, and @williamshell.
textrecipes: @EmilHvitfeldt, @lionel-, and @NLDataScientist.
tune: @abichat, @AndrewKostandy, @dax44, @EmilHvitfeldt, @felxcon, @hfrick, @juanydlh, @juliasilge, @mattwarkentin, @mdancho84, @py9mrg, @topepo, @walrossker, @williamshell, and @wtbxsjy.
tidymodels: @EmilHvitfeldt, @exsell-jc, @hardin47, @juliasilge, @PursuitOfDataScience, @RaymondBalise, @scottlyden, and @topepo.
usemodels: @juliasilge and @topepo.
vetiver: @atheriel and @juliasilge.
workflows: @CarstenLange, @DavisVaughan, @dpprdan, @hfrick, and @juliasilge.
workflowsets: @DavisVaughan, @dvanic, @gdmcdonald, @hfrick, @juliasilge, @topepo, and @wdefreitas.