Discovering the tidymodels

After a year and a half of using the tidyverse ecosystem for data science, I can say today that I’m familiar with some of its core packages: {readr}, {dplyr} and {tidyr} for data import, manipulation and transformation; {ggplot2} for data visualization; {purrr} for functional programming; and {stringr} and {lubridate} for working with strings and dates.

The packages I just mentioned are in fact only a subset of the whole tidyverse ecosystem, which contains many other packages that I have never used.

The full list of packages installed with the tidyverse is the following:

tidyverse::tidyverse_packages()
  1. 'broom'
  2. 'cli'
  3. 'crayon'
  4. 'dplyr'
  5. 'dbplyr'
  6. 'forcats'
  7. 'ggplot2'
  8. 'haven'
  9. 'hms'
  10. 'httr'
  11. 'jsonlite'
  12. 'lubridate'
  13. 'magrittr'
  14. 'modelr'
  15. 'purrr'
  16. 'readr'
  17. 'readxl'
  18. 'reprex'
  19. 'rlang'
  20. 'rstudioapi'
  21. 'rvest'
  22. 'stringr'
  23. 'tibble'
  24. 'tidyr'
  25. 'xml2'
  26. 'tidyverse'

In this blog post I would like to present my recent discovery: {tidymodels}.
The tidymodels, like the tidyverse, is a meta-package that bundles several packages designed to work together on modelling tasks in R.
You will need to install the package before you use it.

# install.packages("tidymodels")

Let’s understand what happens to an R session when we load the {tidymodels} package.
Whenever we start an R session, some packages and namespaces are automatically loaded. We can use the search() function to get the names of packages or environments that are attached to the session.

search()
  1. '.GlobalEnv'
  2. 'jupyter:irkernel'
  3. 'package:stats'
  4. 'package:graphics'
  5. 'package:grDevices'
  6. 'package:utils'
  7. 'package:datasets'
  8. 'package:methods'
  9. 'Autoloads'
  10. 'package:base'

library(tidymodels)

After we load the tidymodels meta-package, here are the packages and environments attached to the R session.

search()
  1. '.GlobalEnv'
  2. 'package:yardstick'
  3. 'package:tibble'
  4. 'package:rsample'
  5. 'package:tidyr'
  6. 'package:recipes'
  7. 'package:purrr'
  8. 'package:parsnip'
  9. 'package:infer'
  10. 'package:ggplot2'
  11. 'package:dplyr'
  12. 'package:dials'
  13. 'package:scales'
  14. 'package:broom'
  15. 'package:tidymodels'
  16. 'jupyter:irkernel'
  17. 'package:stats'
  18. 'package:graphics'
  19. 'package:grDevices'
  20. 'package:utils'
  21. 'package:datasets'
  22. 'package:methods'
  23. 'Autoloads'
  24. 'package:base'
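
Comparing the two listings by eye works, but the difference can also be computed. The snippet below is a small sketch of that idea, assuming it runs in a fresh R session where tidymodels is installed but not yet loaded; the variable names are only illustrative.

```r
# Record the search path before and after loading the meta-package.
before <- search()
suppressPackageStartupMessages(library(tidymodels))
after <- search()

# Entries that library(tidymodels) added to the search path.
newly_attached <- setdiff(after, before)
newly_attached
```

The exact list depends on the installed tidymodels version, so it may not match the output shown above.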

Loading tidymodels attaches some of the most used packages of the tidyverse, such as {dplyr}, {tidyr} and {purrr}, along with the modelling packages.
The goal of this article is to present each of these modelling packages: what it is used for and what its main functions are.

1- {yardstick}

This package contains functions to estimate how well models are working using tidy principles. It is the package you’ll need when you want to compute the root mean squared error (RMSE), the accuracy, the precision, the recall and so on. It has over 70 functions just to test your models.
The nice thing about the package is that its metric functions return data frames.

Demo of some functions

Like many R packages, yardstick comes with example data sets for trying out its functions. One of them is two_class_example.

metrics(two_class_example, truth, predicted)
.metric    .estimator  .estimate
accuracy   binary      0.8380000
kap        binary      0.6748764

precision(two_class_example, truth = truth, estimate = predicted)
.metric    .estimator  .estimate
precision  binary      0.8194946

Because all the outputs are data frames, you can easily bind them together.

rbind(
    metrics(two_class_example, truth, predicted),
    precision(two_class_example, truth = truth, estimate = predicted),
    recall(two_class_example, truth = truth, estimate = predicted)
)
.metric    .estimator  .estimate
accuracy   binary      0.8380000
kap        binary      0.6748764
precision  binary      0.8194946
recall     binary      0.8798450

# Draw 10 of the functions exported by yardstick at random
sample(lsf.str("package:yardstick"), size = 10)
  1. 'detection_prevalence_vec'
  2. 'bal_accuracy_vec'
  3. 'rmse'
  4. 'huber_loss_pseudo'
  5. 'mape'
  6. 'mcc'
  7. 'roc_auc'
  8. 'metric_vec_template'
  9. 'smape_vec'
  10. 'roc_auc_vec'
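
To make the numbers above less of a black box, here is a short sketch that recomputes accuracy, precision and recall by hand from the confusion table and compares them with yardstick. It assumes yardstick's default of treating the first factor level (Class1 here) as the event of interest; the names tab, manual and class_metrics are illustrative.

```r
library(yardstick)

# Confusion table: rows are the predictions, columns are the truth.
tab <- table(predicted = two_class_example$predicted,
             truth = two_class_example$truth)

# True positives for the first level, "Class1".
tp <- tab["Class1", "Class1"]

manual <- c(
  accuracy  = sum(diag(tab)) / sum(tab),    # correct predictions / all predictions
  precision = tp / sum(tab["Class1", ]),    # TP / everything predicted as Class1
  recall    = tp / sum(tab[, "Class1"])     # TP / everything that truly is Class1
)

# The same three metrics bundled into a single yardstick function.
class_metrics <- metric_set(accuracy, precision, recall)
tidy_result <- class_metrics(two_class_example, truth = truth, estimate = predicted)

manual
tidy_result
```

metric_set() returns a function, so the bundle can be reused on any data frame that has truth and prediction columns.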

2- {rsample}

The rsample package provides the infrastructure for data splitting and resampling.

rsample contains a set of functions that can create different types of resamples and corresponding classes for their analysis. The goal is to have a modular set of methods that can be used across different R packages for:

• traditional resampling techniques for estimating the sampling distribution of a statistic and
• estimating model performance using a holdout set

The scope of rsample is to provide the basic building blocks for creating and analyzing resamples of a data set. It does not include code for modeling or calculating statistics, which is left to other tidymodels or R packages.

Some resampling methods include: Simple Training/Test Set Splitting, Bootstrap Sampling, V-Fold Cross-Validation (CV), Leave-One-Out CV, Monte Carlo CV, Group V-Fold CV, Rolling Origin Forecast Resampling, Nested or Double Resampling, and Sampling for the Apparent Error Rate.

Demo

Splitting a data set into training and testing sets

splits <- initial_split(data = iris, prop = .7)
train <- training(splits)
test <- testing(splits)

print(
    paste(
        "The data set is split into a training set of", nrow(train),
        "rows and a testing set of", nrow(test), "rows."))
[1] "The data set is split into a training set of 105 rows and a testing set of 45 rows."
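
A single split gives only one estimate of performance. As a sketch of the V-fold cross-validation mentioned in the list of resampling methods, the snippet below fits a simple linear model on each fold's analysis set and measures its RMSE on the matching assessment set. The model (mpg ~ wt on mtcars), the seed and the number of folds are arbitrary choices made for illustration.

```r
library(rsample)

set.seed(123)
folds <- vfold_cv(mtcars, v = 5)

# For each fold: fit on the analysis rows, score on the held-out rows.
rmse_per_fold <- vapply(folds$splits, function(split) {
  fit <- lm(mpg ~ wt, data = analysis(split))
  held_out <- assessment(split)
  sqrt(mean((held_out$mpg - predict(fit, newdata = held_out))^2))
}, numeric(1))

rmse_per_fold
mean(rmse_per_fold)
```

analysis() and assessment() play the same role for a resample that training() and testing() play for the initial split.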

Bootstrap resampling
A bootstrap sample is a sample that is the same size as the original data set that is made using replacement. This results in analysis samples that have multiple replicates of some of the original rows of the data. The assessment set is defined as the rows of the original data that were not included in the bootstrap sample. This is often referred to as the “out-of-bag” (OOB) sample.

For demonstration purposes, let’s say we want to assess the confidence intervals of the coefficients of a linear regression model.

confint(lm(mpg ~ wt, data = mtcars), level = .90)
                  5 %      95 %
(Intercept) 34.098303 40.471950
wt          -6.293412 -4.395531

boot <- bootstraps(mtcars, 100)

# Refit the model on the analysis set of each bootstrap sample
confidence_intervals <- map(boot$splits, ~ confint(lm(mpg ~ wt, data = analysis(.x)), level = .90))
confidence_intervals[1:5]
  1.                   5 %      95 %
     (Intercept) 32.556379 38.914256
     wt          -5.790271 -4.055817
  2.                   5 %      95 %
     (Intercept) 34.632854 39.825161
     wt          -6.309371 -4.653124
  3.                   5 %      95 %
     (Intercept) 26.893473 33.9626
     wt          -4.540461 -2.6207
  4.                   5 %      95 %
     (Intercept) 35.056813 40.567585
     wt          -6.372533 -4.699671
  5.                   5 %      95 %
     (Intercept) 36.526476 44.87956
     wt          -7.697699 -5.19452
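
Another common way to use the same resamples is a percentile interval: collect one coefficient per bootstrap sample and take quantiles of the resulting distribution. The sketch below does this for the slope on wt. The seed is arbitrary and the percentile method is shown as one option among several, not as what the demo above computes.

```r
library(rsample)

set.seed(42)
boot <- bootstraps(mtcars, times = 100)

# One slope estimate per bootstrap sample, fitted on its analysis set.
slopes <- vapply(boot$splits, function(split) {
  coef(lm(mpg ~ wt, data = analysis(split)))[["wt"]]
}, numeric(1))

# 90% percentile interval for the slope.
percentile_interval <- quantile(slopes, probs = c(0.05, 0.95))
percentile_interval
```

With only 100 resamples the interval is still noisy; more resamples would give a steadier estimate.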

3- {recipes}

The recipes package is an alternative method for creating and preprocessing design matrices that can be used for modeling or visualization. The recipes package lets you automate your preprocessing routine with useful step functions.

The package is built around four functions or families of functions:

  • recipe(): this function lets you specify the design matrix. Think of the design matrix as the matrix of features that will be fed into a model.

  • step_*() family of functions: these are the preprocessing steps that will be applied sequentially to the design matrix you specified.

  • prep(): this function performs the computations specified by the step_*() functions on the training data and returns the prepared recipe, from which the processed design matrix can be extracted.

  • bake(): this function lets you apply the blueprint you created to a new data set.

Demo

Create a design matrix from the state.x77 dataset

state <- datasets::state.x77
colnames(state) <- gsub(pattern = " ", replacement = "_", x = colnames(state))
head(state)
           Population Income Illiteracy Life_Exp Murder HS_Grad Frost   Area
Alabama          3615   3624        2.1    69.05   15.1    41.3    20  50708
Alaska            365   6315        1.5    69.31   11.3    66.7   152 566432
Arizona          2212   4530        1.8    70.55    7.8    58.1    15 113417
Arkansas         2110   3378        1.9    70.66   10.1    39.9    65  51945
California      21198   5114        1.1    71.71   10.3    62.6    20 156361
Colorado         2541   4884        0.7    72.06    6.8    63.9   166 103766

For this demo, let’s say we want to explain life expectancy as a function of all the other variables. We want our model to have the following specifications:

• Take the logarithm of the dependent variable
• Scale all the feature columns (step_scale() divides each column by its standard deviation)
• Use polynomials of degree 2 for Income and Illiteracy

blueprint <- recipe(formula = Life_Exp ~ ., data = state)
blueprint <- blueprint %>%
  step_log(all_outcomes()) %>%
  step_scale(all_predictors()) %>%
  step_poly(Income, Illiteracy, options = list(degree = 2))
blueprint
Data Recipe

Inputs:

      role #variables
   outcome          1
 predictor          7

Operations:

Log transformation on all_outcomes()
Scaling for all_predictors()
Orthogonal polynomials on Income, Illiteracy

blueprint <- prep(blueprint)
head(blueprint$template)
Population    Murder   HS_Grad      Frost       Area  Life_Exp  Income_poly_1  Income_poly_2  Illiteracy_poly_1  Illiteracy_poly_2
0.80972269  4.090434  5.113286  0.3847571  0.5942764  4.234831    -0.18873410   0.0983672744         0.21796542         0.01105159
0.08175623  3.061053  8.258019  2.9241539  6.6383444  4.238589     0.43689223   0.7305725135         0.07734257        -0.17129700
0.49546517  2.112939  7.193267  0.2885678  1.3291995  4.256322     0.02190041  -0.0933766675         0.14765400        -0.11655040
0.47261822  2.735986  4.939954  1.2504606  0.6087735  4.257880    -0.24592625   0.2213291785         0.17109114        -0.08211145
4.74813320  2.790164  7.750404  0.3847571  1.8324850  4.272630     0.15767364   0.0007233293        -0.01640600        -0.13096185
0.56915777  1.842050  7.911355  3.1934839  1.2160938  4.277499     0.10420131  -0.0567062476        -0.11015457         0.03889401

From here you can use this design matrix to estimate your model!
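
The demo stops at prep(), so here is a hedged sketch of the last piece, bake(), applied to rows the recipe has never seen. It assumes a recent version of recipes in which the argument is called new_data, and it splits the states into an arbitrary 40/10 train/new partition purely for illustration. The polynomial step is left out to keep the check at the end simple.

```r
library(tidymodels)

state_df <- as.data.frame(datasets::state.x77)
names(state_df) <- gsub(" ", "_", names(state_df))

# Arbitrary split: first 40 states to learn the recipe, last 10 as "new" data.
train_rows <- 1:40
state_train <- state_df[train_rows, ]
state_new <- state_df[-train_rows, ]

rec <- recipe(Life_Exp ~ ., data = state_train) %>%
  step_log(all_outcomes()) %>%
  step_scale(all_predictors())

# prep() learns the scaling factors from the training rows only.
trained <- prep(rec)

# bake() reuses those factors on the new rows.
baked_new <- bake(trained, new_data = state_new)
head(baked_new)

# The new rows are divided by the *training* standard deviation.
all.equal(baked_new$Population,
          state_new$Population / sd(state_train$Population))
```

Keeping the estimation in prep() and the application in bake() is what prevents information from the new data leaking into the preprocessing.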

4- {parsnip}

The parsnip package “is designed to solve a specific problem related to model fitting in R, the interface. Many functions have different interfaces and argument names and parsnip standardizes the interface for fitting models as well as the return values. When using parsnip, you don’t have to remember each interface and its unique set of argument names to easily move between R packages”.

parsnip is a tidy, unified interface to models that can be used to try a range of models without getting bogged down in the syntactical minutiae of the underlying package. The idea is:

• Separate the definition of a model from its evaluation.

• Decouple the model specification from the implementation (whether the implementation is in R, Spark, or something else). For example, the user would call rand_forest() instead of ranger::ranger() or other specific packages.

• Harmonize the argument names (e.g. n.trees, ntrees, trees) so that users can remember a single name. This will help across model types too so that trees will be the same argument across random forest as well as boosting or bagging.

Demo

Fit a regularized regression

constrained_reg <- linear_reg(mode = "regression", penalty = .8, mixture = 1) %>%
  set_engine(engine = "glmnet")
constrained_reg
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.8
  mixture = 1

Computational engine: glmnet 

translate(constrained_reg, engine = "glmnet")
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = 0.8
  mixture = 1

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    lambda = 0.8, alpha = 1, family = "gaussian")

results <- fit(constrained_reg, formula = Life_Exp ~ . , data = blueprint$template)
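
To illustrate the idea of decoupling the specification from the engine, here is a minimal sketch that keeps the same linear_reg() specification but plugs in the base R lm engine, then fits and predicts. The data set (mtcars) and the formula are arbitrary choices for illustration and are not part of the demo above.

```r
library(tidymodels)

# Same kind of specification as above, but with the "lm" engine.
lm_spec <- linear_reg() %>%
  set_engine("lm")

lm_fit <- fit(lm_spec, mpg ~ wt, data = mtcars)

# parsnip predictions come back as a tibble with a .pred column.
preds <- predict(lm_fit, new_data = head(mtcars, 3))
preds

# The underlying engine object is still available in the $fit element.
coef(lm_fit$fit)
```

Switching engines only changes the set_engine() call; the fit() and predict() calls stay the same.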

5- {infer}

The objective of this package is to perform statistical inference using an expressive statistical grammar that coheres with the tidyverse design framework. In short, this package aims to make statistical inference tidy and transparent. The process can be summarized with the following diagram: [Figure: diagram of the infer workflow]

Demo

Let’s walk through an example to understand how it works. We want to test whether there is a significant difference between the petal lengths of setosa and versicolor irises.

options(repr.plot.height = 4)

iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  # drop the unused "virginica" level left over after filtering
  mutate(Species = droplevels(Species)) %>%
  specify(response = Petal.Length, explanatory = Species) %>%
  generate(reps = 100, type = "bootstrap") %>%
  calculate(stat = "diff in means", order = c("versicolor", "setosa")) %>%
  visualize()

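[Figure: histogram produced by visualize(), showing the bootstrap distribution of the difference in means]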
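
The bootstrap distribution shows the variability of the difference, but a test also needs a null hypothesis. The sketch below is one way to phrase it with infer: a permutation test under the null of independence between species and petal length. The seed and the number of permutations are arbitrary, and it assumes a version of infer that provides get_p_value().

```r
library(tidymodels)

set.seed(2024)

two_species <- iris %>%
  filter(Species %in% c("setosa", "versicolor")) %>%
  mutate(Species = droplevels(Species))

# Observed difference in mean petal length (versicolor minus setosa).
observed <- two_species %>%
  specify(response = Petal.Length, explanatory = Species) %>%
  calculate(stat = "diff in means", order = c("versicolor", "setosa"))

# Distribution of that difference when species labels are shuffled.
null_dist <- two_species %>%
  specify(response = Petal.Length, explanatory = Species) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 500, type = "permute") %>%
  calculate(stat = "diff in means", order = c("versicolor", "setosa"))

p_value <- get_p_value(null_dist, obs_stat = observed, direction = "two-sided")
p_value
```

infer may warn when no permuted statistic is as extreme as the observed one; in that case the reported p-value is an approximation limited by the number of permutations.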

6- {dials}

This package contains tools to create and manage values of tuning parameters and is designed to integrate well with the parsnip package.

The name reflects the idea that tuning predictive models can be like turning a set of dials on a complex machine under duress.
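
As a short sketch of what this looks like in practice, the snippet below builds a small regular grid over penalty and mixture, the two tuning parameters that appeared in the parsnip demo. The number of levels is an arbitrary choice, and the default parameter ranges depend on the installed dials version.

```r
library(dials)

# Parameter objects describe a tuning parameter and its default range.
penalty()
mixture()

# A regular grid: every combination of 3 values per parameter.
tuning_grid <- grid_regular(penalty(), mixture(), levels = 3)
tuning_grid
```

Each row of the grid is one candidate pair of values that could be passed to a model specification such as linear_reg(penalty = ..., mixture = ...).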

Axel-Cleris Gailloty
Economics students turning to Data science

I’m interested in quantitative Economics, Econometrics and Data Science.
