Working with tidymodels

class: center, middle, title-slide

# Working with tidymodels
## OCRUG meetup
### Emil Hvitfeldt
### 2019-1-29

---

`tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse.

.left-column[
![](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidymodels.png)
]

.right-column[

```r
library(tidymodels)
```

```r
## ✔ broom     0.5.1          ✔ purrr     0.3.0     
## ✔ dials     0.0.2          ✔ recipes   0.1.4     
## ✔ dplyr     0.7.8          ✔ rsample   0.0.4     
## ✔ ggplot2   3.1.0          ✔ tibble    2.0.1     
## ✔ infer     0.4.0          ✔ yardstick 0.0.2     
## ✔ parsnip   0.0.1.9000
```
]

---

# The packages

.pull-left[
- broom
- dials
- dplyr
- ggplot2
- infer
- parsnip
]

.pull-right[
- purrr
- recipes
- rsample
- tibble
- yardstick
]

---

# The packages (tidyverse)

.pull-left[
- broom
- dials
- **dplyr**
- **ggplot2**
- infer
- parsnip
]

.pull-right[
- **purrr**
- recipes
- rsample
- **tibble**
- yardstick
]

---

# The packages (tidyverse)

.pull-left[
- broom
- dials
- **.light[dplyr]**
- **.light[ggplot2]**
- infer
- parsnip
]

.pull-right[
- **.light[purrr]**
- recipes
- rsample
- **.light[tibble]**
- yardstick
]

---

# The packages

.pull-left[
- **.light[broom]**
- **.light[dials]**
- **.light[dplyr]**
- **.light[ggplot2]**
-  **.light[infer]**
- **parsnip**
]

.pull-right[
- **.light[purrr]**
- **recipes**
- **rsample**
- **.light[tibble]**
- **yardstick**
]

---

# ⚠️ Disclaimer ⚠️

This talk is not designed to give opinions with respect to modeling best practices.

This talk is designed to showcase what packages are available and what they can do.

---

# Consider 32 cars from 1973-74

```r
head(mtcars)
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
```

---

```r
model_glm <- glm(am ~ disp + drat + qsec, data = mtcars, 
 family = "binomial")
```

```r
predict(model_glm)
```

```
##           Mazda RX4       Mazda RX4 Wag          Datsun 710      Hornet 4 Drive 
##           105.97448            69.85214            27.10173          -276.59440 
##   Hornet Sportabout             Valiant          Duster 360           Merc 240D 
##          -240.20025          -313.42683          -158.99087          -123.81721 
##            Merc 230            Merc 280           Merc 280C          Merc 450SE 
##          -284.08255           -20.37716           -59.07966          -167.78204 
##          Merc 450SL         Merc 450SLC  Cadillac Fleetwood Lincoln Continental 
##          -180.68287          -206.48454          -458.77204          -427.72554 
##   Chrysler Imperial            Fiat 128         Honda Civic      Toyota Corolla 
##          -357.75792            27.25037           164.39652            20.76278 
##       Toyota Corona    Dodge Challenger         AMC Javelin          Camaro Z28 
##           -90.84576          -211.90064          -189.27739           -74.78345 
##    Pontiac Firebird           Fiat X1-9       Porsche 914-2        Lotus Europa 
##          -297.35324            63.64819           184.39936           146.50220 
##      Ford Pantera L        Ferrari Dino       Maserati Bora          Volvo 142E 
##            24.28831           162.60217            21.69348            33.80864
```

---

```r
library(glmnet)
model_glmnet <- glmnet(am ~ disp + drat + qsec, data = mtcars, 
 family = "binomial")
```

```
## Loading required package: Matrix
```

```
## 
## Attaching package: 'Matrix'
```

```
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
```

```
## Loaded glmnet 3.0-2
```

```
## Error in drop(y): argument "y" is missing, with no default
```

```r
model_glmnet <- glmnet(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]),
 y = mtcars[, "am"],
 family = "binomial")
```

```r
model_glm <- glm(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]),
 y = mtcars[, "am"],
 family = "binomial")
```

```
## Error in environment(formula): argument "formula" is missing, with no default
```

---

# User-facing problems in modeling in R

- Data must be a matrix (except when it needs to be a data.frame)
- Must use formula or x/y (or both)
- Inconsistent naming of arguments (ntree in randomForest, num.trees in ranger)
- na.omit explicitly or silently
- May or may not accept factors

.center[![](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/240/apple/155/tired-face_1f62b.png)]

---

# Syntax for Computing Predicted Class Probabilities

|Function     |Package      |Code                                       |
|:------------|:------------|:------------------------------------------|
|`lda`        |`MASS`       |`predict(obj)`                             |
|`glm`        |`stats`      |`predict(obj, type = "response")`          |
|`gbm`        |`gbm`        |`predict(obj, type = "response", n.trees)` |
|`mda`        |`mda`        |`predict(obj, type = "posterior")`         |
|`rpart`      |`rpart`      |`predict(obj, type = "prob")`              |
|`Weka`       |`RWeka`      |`predict(obj, type = "probability")`       |
|`logitboost` |`LogitBoost` |`predict(obj, type = "raw", nIter)`        |

blatantly stolen from Max Kuhn

---

![:scale 50%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/parsnip.png)

---

The goals of `parsnip` is...

- Decouple the *model classification* from the *computational engine*
- Separate the definition of a model from its evaluation
- Harmonize argument names
- Make consistent predictions (always tibbles with na.omit=FALSE)

---

```r
model_glm <- glm(am ~ disp + drat + qsec, data = mtcars, 
 family = "binomial")
```

---

```r
library(parsnip)
model_glm <- logistic_reg(mode = "classification") %>%
 set_engine("glm")

model_glm
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glm
```

```r
fit_glm <- model_glm %>%
 fit(factor(am) ~ disp + drat + qsec, data = mtcars)
```

???
decision_tree(), linear_reg(), logistic_reg(), rand_forest(), svm_poly()

---

```r
library(parsnip)
model_glmnet <- logistic_reg(mode = "classification") %>%
 set_engine("glmnet")
model_glmnet
```

```
## Logistic Regression Model Specification (classification)
## 
## Computational engine: glmnet
```

```r
fit_glmnet <- model_glmnet %>%
 fit(factor(am) ~ disp + drat + qsec, data = mtcars)
```

---

### Using both formula and x/y

#### Formula

```r
fit_glm <- model_glm %>%
 fit(factor(am) ~ ., data = mtcars)
```

#### x/y

```r
fit_glm <- model_glm %>%
 fit_xy(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]),
 y = factor(mtcars[, "am"]), 
 data = mtcars)
```

---

# Tidy prediction

```r
predict(fit_glm, mtcars)
```

```
## # A tibble: 32 x 1
## .pred_class
## <fct> 
## 1 1 
## 2 1 
## 3 1 
## 4 0 
## 5 0 
## 6 0 
## 7 0 
## 8 0 
## 9 0 
## 10 0 
## # … with 22 more rows
```

---

Consider now that we wanted to model a more advanded relation ship between variables

```r
fit_glm <- model_glm %>%
 fit(factor(am) ~ poly(mpg, 3) + pca(disp:wt)[1] + pca(disp:wt)[2] + pca(disp:wt)[3], 
 data = mtcars)
```

- Not all inline functions can be used with formulas
- Having to run some calculations many many times
- Connected to the model, calculations are not saved between models

Post by Max Kuhn about the bad sides of formula
https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/

---

![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png)

???

Preprocessing - statistics  
Feature engineering - Computer Science

---

### Preprocessing steps

Some of things you may need to deal with before you can start modeling

- Same unit (center and scale)
- Remove correlation (filter and PCA extraction)
- Missing data (imputation)
- Dummy varibles
- Zero Variance

---

### Same units

```r
library(recipes)
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_center(all_predictors()) %>%
 step_scale(all_predictors())
```

### PCA

```r
library(recipes)
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_pca(all_predictors(), threshold = 0.8)
```

### Any combination of steps

```r
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_knnimpute(drat, wt, neighbors = 5) %>%
 step_zv(all_predictors()) %>%
 step_pca(all_predictors(), threshold = 0.8)
```

---

```r
library(recipes)
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_center(all_predictors()) %>%
 step_scale(all_predictors())
```

```r
car_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         10
## 
## Operations:
## 
## Centering for all_predictors
## Scaling for all_predictors
```

---

```r
library(recipes)
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_center(all_predictors()) %>%
 step_scale(all_predictors())

car_preped <- prep(car_rec, training = mtcars)
```

```r
bake(car_preped, new_data = mtcars)
```

```
## # A tibble: 32 x 11
## cyl disp hp drat wt qsec vs am gear carb
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 -0.105 -0.571 -0.535 0.568 -0.610 -0.777 -0.868 1.19 0.424 0.735
## 2 -0.105 -0.571 -0.535 0.568 -0.350 -0.464 -0.868 1.19 0.424 0.735
## 3 -1.22 -0.990 -0.783 0.474 -0.917 0.426 1.12 1.19 0.424 -1.12 
## 4 -0.105 0.220 -0.535 -0.966 -0.00230 0.890 1.12 -0.814 -0.932 -1.12 
## 5 1.01 1.04 0.413 -0.835 0.228 -0.464 -0.868 -0.814 -0.932 -0.503
## 6 -0.105 -0.0462 -0.608 -1.56 0.248 1.33 1.12 -0.814 -0.932 -1.12 
## 7 1.01 1.04 1.43 -0.723 0.361 -1.12 -0.868 -0.814 -0.932 0.735
## 8 -1.22 -0.678 -1.24 0.175 -0.0278 1.20 1.12 -0.814 0.424 -0.503
## 9 -1.22 -0.726 -0.754 0.605 -0.0687 2.83 1.12 -0.814 0.424 -0.503
## 10 -0.105 -0.509 -0.345 0.605 0.228 0.253 1.12 -0.814 0.424 0.735
## # … with 22 more rows, and 1 more variable: mpg <dbl>
```

---

```r
library(recipes)
car_rec <- recipe(mpg ~ ., mtcars) %>%
 step_center(all_predictors()) %>%
 step_scale(all_predictors())

car_preped <- prep(car_rec, training = mtcars)
```

```r
juice(car_preped)
```

---

.huge[
.center[

```r
recipe -> prepare -> bake/juice

(define) -> (estimate) -> (apply)
```
]
]

---

## Types of data splitting

- Random
- By date
- By outcome
    - Classification: within class
    - regression: within quantile
    
---

## Training and Testing sets

```r
library(rsample)

car_preped <- prep(car_rec, training = mtcars)
```

---

![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/rsample.png)

---

## Training and Testing sets

```r
library(rsample)

set.seed(4595)

# These slides were almost finished and I didn't want to change the data in all the other slides
big_mtcars <- rerun(10, mtcars) %>%
 bind_rows()

data_split <- initial_split(big_mtcars, strata = "mpg", p = 0.80)

# Training and test data
cars_train <- training(data_split)
cars_test <- testing(data_split)

car_prep <- prep(car_rec, training = cars_train)

# Preprocessed data
cars_train_p <- juice(car_prep)
cars_test_p <- bake(car_prep, new_data = cars_test)
```

---

![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/yardstick.png)

---

```r
library(yardstick)
head(two_class_example)
```

```
##    truth      Class1       Class2 predicted
## 1 Class2 0.003589243 0.9964107574    Class2
## 2 Class1 0.678621054 0.3213789460    Class1
## 3 Class2 0.110893522 0.8891064779    Class2
## 4 Class1 0.735161703 0.2648382969    Class1
## 5 Class2 0.016239960 0.9837600397    Class2
## 6 Class1 0.999275071 0.0007249286    Class1
```

```r
metrics(two_class_example, truth = truth, estimate = predicted)
```

```
## # A tibble: 2 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.838
## 2 kap binary 0.675
```

---

```r
library(yardstick)
head(two_class_example)
```

```r
accuracy(two_class_example, truth = truth, estimate = predicted)
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.838
```

---

```r
library(yardstick)
head(two_class_example)
```

```r
j_index(two_class_example, truth = truth, estimate = predicted)
```

```
## # A tibble: 1 x 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 j_index binary 0.673
```

And many more!!

---

```r
library(yardstick)
head(two_class_example)
```

```r
conf_mat(two_class_example, truth = truth, estimate = predicted)
```

```
##           Truth
## Prediction Class1 Class2
##     Class1    227     50
##     Class2     31    192
```

---

```r
roc_curve(two_class_example, truth = truth, Class1) %>%
  autoplot()
```