Updates for parsnip packages

We’re delighted to announce the release of parsnip 0.2.1. parsnip is a unified modeling interface for tidymodels.

This release of parsnip precipitated releases of our parsnip extension packages: baguette, discrim, plsmod, poissonreg, and rules. It also allowed us to release an additional package called multilevelmod (see the section below). We’ve kept CRAN busy!

You can see a full list of recent parsnip changes in the release notes. You can install the entire set from CRAN with:

install.packages("parsnip")
install.packages("baguette")
install.packages("discrim")
install.packages("multilevelmod")
install.packages("plsmod")
install.packages("poissonreg")
install.packages("rules")

Let’s look at a summary of the changes, which are almost entirely in parsnip, before looking at multilevelmod.

Major changes to parsnip

There are a lot of improvements in this version of parsnip. The main changes are described below.

BART

We’ve added a model function for the excellent Bayesian Additive Regression Trees (BART) approach and an engine for the dbarts package. The model is an ensemble of trees that is assembled using Bayesian estimation methods. It typically has very good predictive performance and can also generate estimates of the posterior predictive variance as well as prediction intervals.

A good overview of this model is: Bayesian Additive Regression Trees: A Review and Look Forward (pdf).
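As a rough illustration of the new model function, here is a minimal sketch that fits a BART regression to the built-in mtcars data and asks for prediction intervals. The data set, the choice of predictors, and the number of trees are our own assumptions for the example, and the dbarts package must be installed for the engine to work.

```r
library(parsnip)  # the "dbarts" engine needs the dbarts package installed

set.seed(1)  # BART is fit by posterior sampling, so results vary by seed

# Illustrative model only: mtcars and trees = 50 are arbitrary choices
bart_fit <-
  bart(mode = "regression", trees = 50) %>%
  set_engine("dbarts") %>%
  fit(mpg ~ wt + hp, data = mtcars)

new_cars <- mtcars[1:3, c("wt", "hp")]

# Point predictions (a tibble with a .pred column)
bart_pred <- predict(bart_fit, new_cars)

# Prediction intervals (columns .pred_lower and .pred_upper)
bart_int <- predict(bart_fit, new_cars, type = "pred_int")
bart_int
```

The interval columns come from the posterior draws, so expect them to change a little from run to run unless the seed is fixed.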

New engines

Within parsnip, a "glm" engine was added for linear regression. An engine value of "brulee" was added for linear, logistic, and multinomial regression as well as for neural networks. The brulee package is new and fits models using torch (look for a blog post soon on this package).

As discussed below, the multilevelmod package adds a lot more engines for linear(ish) models, such as "gee", "gls", "lme", "lmer", and "stan_glmer". There are similar engines for logistic and Poisson regression.
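To show what switching engines looks like in practice, here is a small sketch that fits the same linear regression with the existing "lm" engine and the new "glm" engine. The mtcars data and the formula are assumptions chosen only for illustration; with the engine's default Gaussian family, the two sets of coefficients should agree up to numerical tolerance.

```r
library(parsnip)

# Same model specification, two different engines
lm_fit <-
  linear_reg() %>%
  set_engine("lm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

glm_fit <-
  linear_reg() %>%
  set_engine("glm") %>%
  fit(mpg ~ wt + hp, data = mtcars)

# Compare the coefficients of the underlying stats::lm() and stats::glm() fits
lm_coef  <- coef(extract_fit_engine(lm_fit))
glm_coef <- coef(extract_fit_engine(glm_fit))
all.equal(lm_coef, glm_coef)
```

Only the set_engine() call changes between the two fits; the model specification and the fit() call stay the same.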

multilevelmod

This package has been simmering for a while on GitHub. Its engines are useful for fitting a variety of models that go by a litany of different names: mixed effects models, random coefficient models, variance component models, hierarchical linear models, and so on.

One aspect of these models is that they mostly work with the formula method, which specifies both the model terms and also which of these are “random effects”.

As an example, let’s look at the measurement system analysis (MSA) data in the package. In these data, 56 separate items were measured twice using a laboratory test. The lab would like to understand how noisy their data are and if different samples can be distinguished from one another. Here’s a plot of the data:

library(ggplot2)
library(parsnip)
library(multilevelmod)

data(msa_data)

msa_data %>% 
  ggplot() + 
  aes(x = reorder(id, value), y = value, col = replicate, pch = replicate) + 
  geom_point(alpha = 1/2, cex = 3) + 
  labs(x = NULL, y = "lab result") +
  theme_bw() + 
  theme(
    axis.text.x = element_text(angle = 90), 
    legend.position = "top"
  )

[Figure: lab result for each sample, ordered by value, with the two replicates distinguished by color and point shape]

With this data set, the goal is to estimate how much of the variation in the lab test is due to differences between the samples (as it should be, since they are different) and how much is due to measurement noise. The latter could be associated with day-to-day differences, person-to-person differences, and so on. It might also be irreducible noise. In any case, we’d like to get estimates of these two sources of variation.

A straightforward way to estimate this is to use a repeated measurements model that treats the samples as randomly selected from a population and independent of one another. We can add a random intercept term that is different for each sample. From this, the sample-to-sample variance can be computed.

There are a lot of packages that can do this, but we’ll use the lme4 package:

msa_model <- 
  linear_reg() %>% 
  set_engine("lmer") %>% 
  # The formula has (1|id) which means that each sample (=id) should
  # have a different intercept (=1)
  fit(value ~ (1|id), data = msa_data)
msa_model

## parsnip model object
## 
## Linear mixed model fit by REML ['lmerMod']
## Formula: value ~ (1 | id)
##    Data: data
## REML criterion at convergence: 163.0314
## Random effects:
##  Groups   Name        Std.Dev.
##  id       (Intercept) 0.6397  
##  Residual             0.2618  
## Number of obs: 112, groups:  id, 56
## Fixed Effects:
## (Intercept)  
##      0.8778

We can see from this output that the sample-to-sample variance is 0.6397^2 ≈ 0.4092, which, as a percentage of the total variance, is:

0.6397 ^ 2 / (0.6397 ^ 2 + 0.2618 ^ 2) * 100

## [1] 85.6539

Pretty good!
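Rather than copying the rounded standard deviations from the printed output, the same percentage can be computed from the fitted object. The sketch below refits the model so that it is self-contained, then assumes that lme4's VarCorr() and its as.data.frame() method are a reasonable way to get at the variance components; the object names are ours.

```r
library(parsnip)
library(multilevelmod)  # provides the "lmer" engine and msa_data; needs lme4

data(msa_data)

msa_model <-
  linear_reg() %>%
  set_engine("lmer") %>%
  fit(value ~ (1 | id), data = msa_data)

# Pull out the underlying lme4 fit, then its variance components
lmer_fit <- extract_fit_engine(msa_model)
var_comp <- as.data.frame(lme4::VarCorr(lmer_fit))

# vcov holds the variances; grp names the source ("id" or "Residual")
sample_var <- var_comp$vcov[var_comp$grp == "id"]
noise_var  <- var_comp$vcov[var_comp$grp == "Residual"]

pct_sample <- sample_var / (sample_var + noise_var) * 100
pct_sample
```

Because this uses the unrounded variances, the result can differ from the hand calculation in the last decimal places.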

There is a lot more that can be done with these models in terms of prediction and inference. If you are interested in more about multilevelmod, take a look at the Get Started vignette.

Acknowledgements

We’d like to thank all of the contributors to these packages since their last releases: @asshah4, @batpigandme, @bshor, @cimentadaj, @daaronr, @davestr2, @DavisVaughan, @deschen1, @dfalbel, @dietrichson, @edgararuiz, @EmilHvitfeldt, @fabrice-rossi, @frequena, @ghost, @gmcmacran, @hfrick, @JB304245, @Jeffrothschild, @jennybc, @jonthegeek, @josefortou, @juliasilge, @kcarnold, @maspotts, @mattwarkentin, @meenakshi-kushwaha, @miepstei, @mmp3, @NickCH-K, @nikhilpathiyil, @nvelden, @p-lemercier, @psads-git, @RaymondBalise, @rmflight, @saadaslam, @Shafi2016, @shuckle16, @sitendug, @ssh352, @stephenhillphd, @stevenpawley, @Steviey, @t-kalinowski, @t-neumann, @tiagomaie, @topepo, @tsengj, @ttrodrigz, @wdkeyzer, @yitao-li, @zenggyu