We’ve sent a few packages to CRAN recently. Here’s a recap of the changes (and some notes at the bottom):
- Since 2018, a warning has been issued when the wrong argument was used in bake(recipe, newdata). The deprecation period is over and new_data is officially required.
- Previously, if step_other() did not collapse any levels, it would still add an “other” level to the factor. This would lump new factor levels into “other” when data were baked (as step_novel() does). This no longer occurs, since it was inconsistent with ?step_other, which said: “If no pooling is done the data are unmodified”.
- step_normalize() centers and scales the data (if you are, like Max, too lazy to use two separate steps).
- step_unknown() will convert missing data in categorical columns to “unknown” and update factor levels.
- If the threshold argument of step_other() is greater than one, it specifies the minimum sample size before the levels of the factor are collapsed into the “other” category. (#289)
- step_knnimpute() can now pass two options to the underlying knn code, including the number of threads (#323).
- Due to changes by CRAN, step_nnmf() only works on versions of R >= 3.6.0 because of dependency issues.
- step_other() is now tolerant of cases where the step’s selectors do not capture any columns. In this case, no modifications to the data are made. (#290, #348)
- step_dummy() can now retain the original columns that are used to make the dummy variables by setting preserve = TRUE. (#328)
- step_other()’s print method only reports the variables with collapsed levels (as opposed to any column that was tested to see if it needed collapsing). (#338)
- step_isomap() now accepts zero components. In this case, the original data are returned. Please use this with great care.
Two new steps were added:
- step_umap() was added for both supervised and unsupervised encodings.
- step_woe() creates weight of evidence encodings. Thanks to Athos Petri Damiani for this.
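As a rough illustration of a few of the recipes changes above, the sketch below prepares a small recipe and bakes it with the now-required new_data argument. The toy data frame and its column names (y, x, g) are made up for this example, and the count-style threshold of 5 is just one way to exercise step_other().

```r
library(recipes)

set.seed(1)

# Hypothetical training data: one numeric predictor and one factor
# with two frequent levels (a, b) and two rare ones (c, d).
train <- data.frame(
  y = rnorm(21),
  x = rnorm(21, mean = 10, sd = 3),
  g = factor(rep(c("a", "b", "c", "d"), times = c(10, 8, 2, 1)))
)

rec <- recipe(y ~ ., data = train) %>%
  # A threshold above one is read as a count: levels seen fewer
  # than 5 times in the training set are pooled into "other".
  step_other(g, threshold = 5) %>%
  # Centers and scales in a single step.
  step_normalize(x)

prepped <- prep(rec, training = train)

# new_data (not newdata) is the required argument name.
baked <- bake(prepped, new_data = train)

levels(baked$g)
```

After baking, x should have mean zero and unit standard deviation on the training set, and the two rare levels of g should have been pooled.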
- Added three functions to compute different bootstrap confidence intervals.
- A new function (add_resample_id()) augments a data frame with columns for the resampling identifier.
- group_vfold_cv() now uses tidyselect on the stratification variable.
- A breaks parameter specifies the number of bins to stratify by for a numeric stratification variable.
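As a sketch of the idea behind one common kind of bootstrap interval, the percentile interval, here is a base R version that does not use the rsample functions themselves. The simulated data, the statistic (the mean), and the number of resamples are arbitrary choices for illustration.

```r
set.seed(123)

# Hypothetical sample whose mean we want an interval for.
x <- rnorm(200, mean = 5, sd = 2)

# Resample with replacement and recompute the statistic each time.
boot_means <- replicate(2000, mean(sample(x, replace = TRUE)))

# Percentile interval: the middle 95% of the bootstrap distribution.
ci <- quantile(boot_means, probs = c(0.025, 0.975))
ci
```

Other bootstrap intervals adjust this basic recipe, for example by studentizing the statistic or correcting for bias and skewness.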
Unplanned release based on CRAN requirements for Solaris.
The method that parsnip uses to store the model information has changed. Any custom models from previous versions will need to use the new method for registering models. The methods are detailed in ?get_model_env and the package vignette for adding models.
The mode needs to be declared for models that can be used for more than one mode prior to fitting and/or translation.
surv_reg(), the engine that uses the survival package is now called
For glmnet models, the full regularization path is always fit regardless of the value given to penalty. Previously, the model was fit by passing penalty to the lambda argument, and the model could only make predictions at those specific values. (#195)
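The difference can be sketched with glmnet directly rather than through parsnip: fitting without a fixed lambda stores the whole path, and predictions can then be requested at any penalty. The use of mtcars and the particular penalty values are choices made for this example only.

```r
library(glmnet)

x <- as.matrix(mtcars[, -1])  # predictors
y <- mtcars$mpg               # outcome

# No lambda supplied, so glmnet fits its full regularization path.
fit <- glmnet(x, y)
length(fit$lambda)

# Predictions at penalty values chosen after fitting; values that are
# not on the fitted path are interpolated by default.
preds <- predict(fit, newx = x[1:3, ], s = c(0.05, 0.5))
preds
```

Each column of preds corresponds to one requested penalty value, which is what makes it possible to evaluate several penalties from a single fit.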
add_rowindex() can add a column called .row to a data frame.
If a computational engine is not explicitly set, a default will be used. Each default is documented on the corresponding model page. A warning is issued at fit time unless verbosity is zero.
multi_predict() documentation is a little better organized.
A suite of internal functions was added to help with upcoming model tuning features.
parsnip objects always save the name(s) of the outcome variable(s) for proper naming of the predicted values.
- New function called
focus(x,..., mirror = TRUE)
- A new
retract() function does the opposite of
- A new argument was added to
remove.dups. It removes duplicates without removing all NAs.
- correlate()’s interface for databases was improved. It now calculates only unique pairs and simplifies the formula that ultimately runs in-database. We also re-added the vignette to the package; it is also available on the site as an article.
The new version is now able to parse the following models:
cubist(), from the
ctree(), from the
- XGBoost trained models, via the
- Integration with
tidy() function. It works with regression models only.
- Adds support for
as_parsed_model() function. It adds the proper class components to the list. This allows any model exported in the correct spec to be read in by tidypredict. See the Save Models and Non-R models articles for more information.
- Now supports classification models from
The package’s official website has been expanded greatly. Here are some highlights:
- An article for each supported model, found under Model List
- A how-to guide on saving and reloading models
- A guide to integrating non-R models with tidypredict
Two new metrics have been added to yardstick:
- iic() is a numeric metric for computing the index of ideality of correlation. It is a potential alternative to the traditional correlation coefficient, and has been used in QSAR models (#115).
- average_precision() is a probability metric that can be used as an alternative to pr_auc(). It has the benefit of avoiding any issues of ambiguity in the edge case where recall == 0 and the current number of false positives is
- pr_curve() (and by extension pr_auc()) has been greatly improved to better handle edge cases when duplicate class probability values are present. Additionally, the first precision value in the curve is now a 1, rather than an NA, which results in a more practical curve and generates a more correct AUC value (#93).
- Each metric function now has a direction attribute, which specifies the direction required for optimization, either minimization or maximization.
- Documentation for class probability metrics has been improved with more informative examples (#100).
- mn_log_loss() now uses the min/max rule before computing the log of the estimated probabilities to avoid problematic undefined log values (#103).
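The idea behind the min/max rule can be sketched in base R. This is an illustration of clamping, not yardstick's implementation; the helper names clamped_log_loss() and naive_log_loss() and the choice of .Machine$double.eps as the bound are assumptions made for this example.

```r
# Binary log loss with no protection against probabilities of 0 or 1.
naive_log_loss <- function(truth, prob) {
  -mean(truth * log(prob) + (1 - truth) * log(1 - prob))
}

# Same loss, but probabilities are first clamped into [eps, 1 - eps]
# using min/max so that log() never sees exactly 0.
clamped_log_loss <- function(truth, prob, eps = .Machine$double.eps) {
  p <- pmin(pmax(prob, eps), 1 - eps)
  -mean(truth * log(p) + (1 - truth) * log(1 - p))
}

truth <- c(1, 0, 1, 0)      # 1 = event, 0 = non-event
prob  <- c(1, 0, 0.8, 0.3)  # predicted probability of the event

naive   <- naive_log_loss(truth, prob)    # 0 * log(0) makes this undefined
clamped <- clamped_log_loss(truth, prob)  # stays finite

c(naive = naive, clamped = clamped)
```

Even predictions that are exactly right (a probability of 1 for a true event) break the naive version, because the unused term multiplies zero by an infinite log.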
Upcoming Changes and Directions
We are currently working on two general use packages:
tune. The former bundles together recipes, model object, and other items so that there can be single
tune will have tools for… um… tuning models. We are hoping to make these public in the next month or so.
There will be some changes to accommodate model tuning. The dials package has been refactored substantially (see the current GH master branch) and there were some small interface changes to recipes too (mostly backwards compatible and also on GH). We are pretty close to the end of “Phase I” of our tidymodels work.