Q3 2023 tidymodels digest

  tidymodels, rsample, tidyclust

  Emil Hvitfeldt

The tidymodels framework is a collection of R packages for modeling and machine learning using tidyverse principles.

Since the beginning of 2021, we have been publishing quarterly updates here on the tidyverse blog summarizing what’s new in the tidymodels ecosystem. The purpose of these regular posts is to share useful new features and any updates you may have missed. You can check out the tidymodels tag to find all tidymodels blog posts, including our roundup posts as well as those that are more focused, like this post from the past couple of months:

Since our last roundup post, there have been CRAN releases of 11 tidymodels packages. Here are links to their NEWS files:

We’ll highlight a few especially notable changes below: updated workshop material, new K-means engines, and quality-of-life improvements in rsample. First, we load the collection of packages:

library(tidymodels)
library(tidyclust)

data("ames", package = "modeldata")

Workshops

One of the biggest areas of work for our team this quarter was getting ready for this year’s posit::conf. Two one-day workshops were offered this year: “Introduction to tidymodels” and “Advanced tidymodels”. All the material is available on our workshop website, workshops.tidymodels.org, where these workshops are archived as the posit::conf 2023 workshops.

Unless otherwise noted (i.e., not an original creation and reused from another source), these educational materials are licensed under the Creative Commons Attribution-ShareAlike 4.0 license (CC BY-SA 4.0).

Tidyclust update

The latest release of tidyclust featured a round of bug fixes, documentation improvements and quality-of-life improvements.

This release adds two new engines to the k_means() model: klaR to run K-modes models and clustMixType to run K-prototypes models. K-modes is the categorical analog to K-means, meaning that it is intended to be used on categorical data only, while K-prototypes is the more general method that works with categorical and numeric data at the same time.
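To make the difference between these methods more concrete, here is a minimal base R sketch of their usual building blocks, assuming the textbook formulation: K-modes compares rows by counting mismatching categories and uses column-wise modes as centroids, while K-prototypes adds a weighted mismatch count to a squared Euclidean distance. The helper names (matching_dist(), col_mode(), proto_dist()), the toy data, and the weight lambda are made up for illustration and are not part of tidyclust, klaR, or clustMixType.

```r
# Simple-matching dissimilarity: the number of positions where two
# categorical vectors disagree.
matching_dist <- function(a, b) {
  sum(as.character(a) != as.character(b))
}

# Column-wise mode: the most frequent level of a categorical vector.
col_mode <- function(x) {
  counts <- table(x)
  names(counts)[which.max(counts)]
}

# Toy categorical data, made up for illustration.
houses <- data.frame(
  roof  = factor(c("Gable", "Gable", "Hip", "Gable")),
  style = factor(c("One_Story", "Two_Story", "One_Story", "One_Story"))
)

# In K-modes, a cluster "centroid" is the vector of column modes.
centroid <- vapply(houses, col_mode, character(1))
centroid

# Distance from each row to that centroid.
dists <- vapply(
  seq_len(nrow(houses)),
  function(i) {
    row_i <- vapply(houses[i, ], as.character, character(1))
    matching_dist(row_i, centroid)
  },
  integer(1)
)
dists

# K-prototypes-style distance: squared Euclidean distance on the numeric
# part plus lambda times the mismatch count on the categorical part.
# lambda is a hypothetical weight chosen by the user.
proto_dist <- function(num_a, num_b, cat_a, cat_b, lambda) {
  sum((num_a - num_b)^2) + lambda * matching_dist(cat_a, cat_b)
}

proto_dist(1990, 2000, "Gable", "Hip", lambda = 50)
```

Because the centroids are built from modes rather than means, they stay on the original categorical scale.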

If we were to fit a K-means model to a mixed-type data set such as ames, it would work, but under the hood, the model would apply a dummy transformation on the categorical predictors.
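As a rough illustration of what a dummy transformation does, the sketch below uses base R's model.matrix() on a small made-up data frame. This is only meant to show the kind of encoding involved; the exact code path tidyclust uses internally is not shown here, and the toy data and column names are assumptions.

```r
# Toy mixed-type data, made up for illustration.
toy <- data.frame(
  year_built = c(1990, 1955, 2001),
  roof_style = factor(c("Gable", "Hip", "Gable"))
)

# model.matrix() expands each factor into 0/1 indicator columns.
# With the default treatment contrasts, the first level ("Gable")
# becomes the reference level and gets no column of its own.
dummies <- model.matrix(~ ., data = toy)
colnames(dummies)

# The mean of an indicator column is the proportion of rows with that
# level, which is how values between 0 and 1 end up in K-means centroids.
colMeans(dummies)
```

A centroid value for an indicator column is therefore a share of observations in the cluster rather than a category, which is why the output is harder to read than the original factor.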

kmeans_spec <- k_means(num_clusters = 3) %>%
  set_engine("stats")

kmeans_fit <- kmeans_spec %>%
  fit(~ ., data = ames)

When we extract the centroids, we see that the means were calculated on the dummy variables, which can make the output harder to interpret.

kmeans_fit %>%
  extract_centroids() %>%
  select(101:112) %>%
  glimpse()
#> Rows: 3
#> Columns: 12
#> $ Overall_CondGood           <dbl> 0.09009009, 0.17594787, 0.01234568
#> $ Overall_CondVery_Good      <dbl> 0.02702703, 0.06694313, 0.01646091
#> $ Overall_CondExcellent      <dbl> 0.01201201, 0.01303318, 0.02880658
#> $ Overall_CondVery_Excellent <dbl> 0, 0, 0
#> $ Year_Built                 <dbl> 1989.645, 1956.471, 1999.572
#> $ Year_Remod_Add             <dbl> 1996.090, 1974.518, 2003.379
#> $ Roof_StyleGable            <dbl> 0.8238238, 0.8234597, 0.4444444
#> $ Roof_StyleGambrel          <dbl> 0.005005005, 0.010071090, 0.000000000
#> $ Roof_StyleHip              <dbl> 0.1531532, 0.1558057, 0.5555556
#> $ Roof_StyleMansard          <dbl> 0.005005005, 0.003554502, 0.000000000
#> $ Roof_StyleShed             <dbl> 0.003003003, 0.001184834, 0.000000000
#> $ Roof_MatlCompShg           <dbl> 0.9759760, 0.9905213, 0.9876543

Fitting a K-prototypes model is done by setting the engine in k_means() to "clustMixType".

kproto_spec <- k_means(num_clusters = 3) %>%
  set_engine("clustMixType")

kproto_fit <- kproto_spec %>%
  fit(~ ., data = ames)

The centroids can now be extracted in the original data format, since categorical predictors are supported.

kproto_fit %>%
  extract_centroids() %>%
  select(11:20) %>%
  glimpse()
#> Rows: 3
#> Columns: 10
#> $ Lot_Config     <fct> Inside, Inside, Inside
#> $ Land_Slope     <fct> Gtl, Gtl, Gtl
#> $ Neighborhood   <fct> College_Creek, North_Ames, Northridge_Heights
#> $ Condition_1    <fct> Norm, Norm, Norm
#> $ Condition_2    <fct> Norm, Norm, Norm
#> $ Bldg_Type      <fct> OneFam, OneFam, OneFam
#> $ House_Style    <fct> Two_Story, One_Story, One_Story
#> $ Overall_Cond   <fct> Average, Average, Average
#> $ Year_Built     <dbl> 1989.977, 1953.793, 1998.765
#> $ Year_Remod_Add <dbl> 1995.934, 1972.973, 2003.035

Stricter rsample functions

Before version 1.2.0 of rsample, misspelled and wrongly used arguments would be swallowed silently by the functions. This could be a big source of confusion, as such mistakes are easy to miss. We have made changes to all rsample functions so that, whenever possible, they alert the user when something is wrong.

Before 1.2.0, if you misspelled strata as stata, for example, everything would carry on as normal, with no indication that stata was ignored.

initial_split(ames, prop = 0.75, stata = Neighborhood)
#> <Training/Testing/Total>
#> <2197/733/2930>

The same code will now error and point to the problematic arguments.

initial_split(ames, prop = 0.75, stata = Neighborhood)
#> Error in `initial_split()`:
#> ! `...` must be empty.
#>  Problematic argument:
#>  stata = Neighborhood
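One common way to get this kind of check in your own functions is rlang::check_dots_empty(), which raises an error of the same form when anything lands in `...`. The sketch below assumes rlang is installed; split_rows() is a hypothetical function written for illustration, not part of rsample, and whether rsample uses exactly this call internally is an assumption.

```r
library(rlang)

# Hypothetical function, not part of rsample. It accepts `...` only so that
# misspelled or unexpected arguments can be caught instead of ignored.
split_rows <- function(data, prop = 0.75, ..., strata = NULL) {
  check_dots_empty()
  n_train <- floor(nrow(data) * prop)
  list(train = n_train, test = nrow(data) - n_train)
}

# Correctly named arguments work as usual.
split_rows(mtcars, prop = 0.75)

# A misspelled argument ends up in `...` and triggers an error.
# The error is caught here so the script keeps running.
result <- tryCatch(
  split_rows(mtcars, prop = 0.75, stata = "cyl"),
  error = function(e) conditionMessage(e)
)
result
```

Placing the optional arguments after `...` means they must be named in full, so a typo cannot be partially matched to them by accident.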

Acknowledgements

We’d like to thank those in the community who contributed to tidymodels in the last quarter:

We’re grateful for all of the tidymodels community, from observers to users to contributors. Happy modeling!