class: center, middle, title-slide # Working with tidymodels ## OCRUG meetup ### Emil Hvitfeldt ### 2019-1-29 --- `tidymodels` is a "meta-package" for modeling and statistical analysis that share the underlying design philosophy, grammar, and data structures of the tidyverse. .left-column[ ![](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/tidymodels.png) ] .right-column[ ```r library(tidymodels) ``` ```r ## ✔ broom 0.5.1 ✔ purrr 0.3.0 ## ✔ dials 0.0.2 ✔ recipes 0.1.4 ## ✔ dplyr 0.7.8 ✔ rsample 0.0.4 ## ✔ ggplot2 3.1.0 ✔ tibble 2.0.1 ## ✔ infer 0.4.0 ✔ yardstick 0.0.2 ## ✔ parsnip 0.0.1.9000 ``` ] --- # The packages .pull-left[ - broom - dials - dplyr - ggplot2 - infer - parsnip ] .pull-right[ - purrr - recipes - rsample - tibble - yardstick ] --- # The packages (tidyverse) .pull-left[ - broom - dials - **dplyr** - **ggplot2** - infer - parsnip ] .pull-right[ - **purrr** - recipes - rsample - **tibble** - yardstick ] --- # The packages (tidyverse) .pull-left[ - broom - dials - **.light[dplyr]** - **.light[ggplot2]** - infer - parsnip ] .pull-right[ - **.light[purrr]** - recipes - rsample - **.light[tibble]** - yardstick ] --- # The packages .pull-left[ - **.light[broom]** - **.light[dials]** - **.light[dplyr]** - **.light[ggplot2]** - **.light[infer]** - **parsnip** ] .pull-right[ - **.light[purrr]** - **recipes** - **rsample** - **.light[tibble]** - **yardstick** ] --- # ⚠️ Disclaimer ⚠️ This talk is not designed to give opinions with respect to modeling best practices. This talk is designed to showcase what packages are available and what they can do. --- # Consider 32 cars from 1973-74 ```r head(mtcars) ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 ``` --- ```r model_glm <- glm(am ~ disp + drat + qsec, data = mtcars, family = "binomial") ``` -- ```r predict(model_glm) ``` ``` ## Mazda RX4 Mazda RX4 Wag Datsun 710 Hornet 4 Drive ## 105.97448 69.85214 27.10173 -276.59440 ## Hornet Sportabout Valiant Duster 360 Merc 240D ## -240.20025 -313.42683 -158.99087 -123.81721 ## Merc 230 Merc 280 Merc 280C Merc 450SE ## -284.08255 -20.37716 -59.07966 -167.78204 ## Merc 450SL Merc 450SLC Cadillac Fleetwood Lincoln Continental ## -180.68287 -206.48454 -458.77204 -427.72554 ## Chrysler Imperial Fiat 128 Honda Civic Toyota Corolla ## -357.75792 27.25037 164.39652 20.76278 ## Toyota Corona Dodge Challenger AMC Javelin Camaro Z28 ## -90.84576 -211.90064 -189.27739 -74.78345 ## Pontiac Firebird Fiat X1-9 Porsche 914-2 Lotus Europa ## -297.35324 63.64819 184.39936 146.50220 ## Ford Pantera L Ferrari Dino Maserati Bora Volvo 142E ## 24.28831 162.60217 21.69348 33.80864 ``` --- ```r library(glmnet) model_glmnet <- glmnet(am ~ disp + drat + qsec, data = mtcars, family = "binomial") ``` -- ``` ## Loading required package: Matrix ``` ``` ## ## Attaching package: 'Matrix' ``` ``` ## The following objects are masked from 'package:tidyr': ## ## expand, pack, unpack ``` ``` ## Loaded glmnet 3.0-2 ``` ``` ## Error in drop(y): argument "y" is missing, with no default ``` -- <br> ```r model_glmnet <- glmnet(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]), y = mtcars[, "am"], family = "binomial") ``` -- <br> ```r model_glm <- glm(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]), y = mtcars[, "am"], family = "binomial") ``` -- ``` ## Error in environment(formula): argument "formula" is missing, with no default ``` --- # User-facing problems in modeling in R - Data must be a matrix (except when it needs to be a data.frame) - Must use formula or x/y (or both) - Inconsistent naming of arguments (ntree in randomForest, num.trees in ranger) - na.omit explicitly or silently - May or may not accept factors -- .center[![](https://emojipedia-us.s3.dualstack.us-west-1.amazonaws.com/thumbs/240/apple/155/tired-face_1f62b.png)] --- # Syntax for Computing Predicted Class Probabilities |Function |Package |Code | |:------------|:------------|:------------------------------------------| |`lda` |`MASS` |`predict(obj)` | |`glm` |`stats` |`predict(obj, type = "response")` | |`gbm` |`gbm` |`predict(obj, type = "response", n.trees)` | |`mda` |`mda` |`predict(obj, type = "posterior")` | |`rpart` |`rpart` |`predict(obj, type = "prob")` | |`Weka` |`RWeka` |`predict(obj, type = "probability")` | |`logitboost` |`LogitBoost` |`predict(obj, type = "raw", nIter)` | blatantly stolen from Max Kuhn --- ![:scale 50%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/parsnip.png) --- The goals of `parsnip` is... - Decouple the *model classification* from the *computational engine* - Separate the definition of a model from its evaluation - Harmonize argument names - Make consistent predictions (always tibbles with na.omit=FALSE) --- ```r model_glm <- glm(am ~ disp + drat + qsec, data = mtcars, family = "binomial") ``` --- ```r library(parsnip) model_glm <- logistic_reg(mode = "classification") %>% set_engine("glm") model_glm ``` ``` ## Logistic Regression Model Specification (classification) ## ## Computational engine: glm ``` -- ```r fit_glm <- model_glm %>% fit(factor(am) ~ disp + drat + qsec, data = mtcars) ``` ??? decision_tree(), linear_reg(), logistic_reg(), rand_forest(), svm_poly() --- ```r library(parsnip) model_glmnet <- logistic_reg(mode = "classification") %>% set_engine("glmnet") model_glmnet ``` ``` ## Logistic Regression Model Specification (classification) ## ## Computational engine: glmnet ``` ```r fit_glmnet <- model_glmnet %>% fit(factor(am) ~ disp + drat + qsec, data = mtcars) ``` --- ### Using both formula and x/y #### Formula ```r fit_glm <- model_glm %>% fit(factor(am) ~ ., data = mtcars) ``` #### x/y ```r fit_glm <- model_glm %>% fit_xy(x = as.matrix(mtcars[, c("disp", "drat", "qsec")]), y = factor(mtcars[, "am"]), data = mtcars) ``` --- # Tidy prediction ```r predict(fit_glm, mtcars) ``` ``` ## # A tibble: 32 x 1 ## .pred_class ## <fct> ## 1 1 ## 2 1 ## 3 1 ## 4 0 ## 5 0 ## 6 0 ## 7 0 ## 8 0 ## 9 0 ## 10 0 ## # … with 22 more rows ``` --- Consider now that we wanted to model a more advanded relation ship between variables ```r fit_glm <- model_glm %>% fit(factor(am) ~ poly(mpg, 3) + pca(disp:wt)[1] + pca(disp:wt)[2] + pca(disp:wt)[3], data = mtcars) ``` -- - Not all inline functions can be used with formulas - Having to run some calculations many many times - Connected to the model, calculations are not saved between models Post by Max Kuhn about the bad sides of formula https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/ --- ![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/recipes.png) ??? Preprocessing - statistics Feature engineering - Computer Science --- ### Preprocessing steps Some of things you may need to deal with before you can start modeling - Same unit (center and scale) - Remove correlation (filter and PCA extraction) - Missing data (imputation) - Dummy varibles - Zero Variance --- ### Same units ```r library(recipes) car_rec <- recipe(mpg ~ ., mtcars) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) ``` ### PCA ```r library(recipes) car_rec <- recipe(mpg ~ ., mtcars) %>% step_pca(all_predictors(), threshold = 0.8) ``` ### Any combination of steps ```r car_rec <- recipe(mpg ~ ., mtcars) %>% step_knnimpute(drat, wt, neighbors = 5) %>% step_zv(all_predictors()) %>% step_pca(all_predictors(), threshold = 0.8) ``` --- ```r library(recipes) car_rec <- recipe(mpg ~ ., mtcars) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) ``` -- ```r car_rec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 10 ## ## Operations: ## ## Centering for all_predictors ## Scaling for all_predictors ``` --- ```r library(recipes) car_rec <- recipe(mpg ~ ., mtcars) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) car_preped <- prep(car_rec, training = mtcars) ``` -- ```r bake(car_preped, new_data = mtcars) ``` -- ``` ## # A tibble: 32 x 11 ## cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -0.105 -0.571 -0.535 0.568 -0.610 -0.777 -0.868 1.19 0.424 0.735 ## 2 -0.105 -0.571 -0.535 0.568 -0.350 -0.464 -0.868 1.19 0.424 0.735 ## 3 -1.22 -0.990 -0.783 0.474 -0.917 0.426 1.12 1.19 0.424 -1.12 ## 4 -0.105 0.220 -0.535 -0.966 -0.00230 0.890 1.12 -0.814 -0.932 -1.12 ## 5 1.01 1.04 0.413 -0.835 0.228 -0.464 -0.868 -0.814 -0.932 -0.503 ## 6 -0.105 -0.0462 -0.608 -1.56 0.248 1.33 1.12 -0.814 -0.932 -1.12 ## 7 1.01 1.04 1.43 -0.723 0.361 -1.12 -0.868 -0.814 -0.932 0.735 ## 8 -1.22 -0.678 -1.24 0.175 -0.0278 1.20 1.12 -0.814 0.424 -0.503 ## 9 -1.22 -0.726 -0.754 0.605 -0.0687 2.83 1.12 -0.814 0.424 -0.503 ## 10 -0.105 -0.509 -0.345 0.605 0.228 0.253 1.12 -0.814 0.424 0.735 ## # … with 22 more rows, and 1 more variable: mpg <dbl> ``` --- ```r library(recipes) car_rec <- recipe(mpg ~ ., mtcars) %>% step_center(all_predictors()) %>% step_scale(all_predictors()) car_preped <- prep(car_rec, training = mtcars) ``` ```r juice(car_preped) ``` ``` ## # A tibble: 32 x 11 ## cyl disp hp drat wt qsec vs am gear carb ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -0.105 -0.571 -0.535 0.568 -0.610 -0.777 -0.868 1.19 0.424 0.735 ## 2 -0.105 -0.571 -0.535 0.568 -0.350 -0.464 -0.868 1.19 0.424 0.735 ## 3 -1.22 -0.990 -0.783 0.474 -0.917 0.426 1.12 1.19 0.424 -1.12 ## 4 -0.105 0.220 -0.535 -0.966 -0.00230 0.890 1.12 -0.814 -0.932 -1.12 ## 5 1.01 1.04 0.413 -0.835 0.228 -0.464 -0.868 -0.814 -0.932 -0.503 ## 6 -0.105 -0.0462 -0.608 -1.56 0.248 1.33 1.12 -0.814 -0.932 -1.12 ## 7 1.01 1.04 1.43 -0.723 0.361 -1.12 -0.868 -0.814 -0.932 0.735 ## 8 -1.22 -0.678 -1.24 0.175 -0.0278 1.20 1.12 -0.814 0.424 -0.503 ## 9 -1.22 -0.726 -0.754 0.605 -0.0687 2.83 1.12 -0.814 0.424 -0.503 ## 10 -0.105 -0.509 -0.345 0.605 0.228 0.253 1.12 -0.814 0.424 0.735 ## # … with 22 more rows, and 1 more variable: mpg <dbl> ``` --- <br> <br> <br> <br> <br> .huge[ .center[ ```r recipe -> prepare -> bake/juice (define) -> (estimate) -> (apply) ``` ] ] --- ## Types of data splitting - Random - By date - By outcome - Classification: within class - regression: within quantile --- ## Training and Testing sets ```r library(rsample) car_preped <- prep(car_rec, training = mtcars) ``` --- ![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/rsample.png) --- ## Training and Testing sets ```r library(rsample) set.seed(4595) # These slides were almost finished and I didn't want to change the data in all the other slides big_mtcars <- rerun(10, mtcars) %>% bind_rows() data_split <- initial_split(big_mtcars, strata = "mpg", p = 0.80) # Training and test data cars_train <- training(data_split) cars_test <- testing(data_split) car_prep <- prep(car_rec, training = cars_train) # Preprocessed data cars_train_p <- juice(car_prep) cars_test_p <- bake(car_prep, new_data = cars_test) ``` --- ![:scale 45%](https://raw.githubusercontent.com/rstudio/hex-stickers/master/PNG/yardstick.png) --- ```r library(yardstick) head(two_class_example) ``` ``` ## truth Class1 Class2 predicted ## 1 Class2 0.003589243 0.9964107574 Class2 ## 2 Class1 0.678621054 0.3213789460 Class1 ## 3 Class2 0.110893522 0.8891064779 Class2 ## 4 Class1 0.735161703 0.2648382969 Class1 ## 5 Class2 0.016239960 0.9837600397 Class2 ## 6 Class1 0.999275071 0.0007249286 Class1 ``` -- ```r metrics(two_class_example, truth = truth, estimate = predicted) ``` ``` ## # A tibble: 2 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.838 ## 2 kap binary 0.675 ``` --- ```r library(yardstick) head(two_class_example) ``` ``` ## truth Class1 Class2 predicted ## 1 Class2 0.003589243 0.9964107574 Class2 ## 2 Class1 0.678621054 0.3213789460 Class1 ## 3 Class2 0.110893522 0.8891064779 Class2 ## 4 Class1 0.735161703 0.2648382969 Class1 ## 5 Class2 0.016239960 0.9837600397 Class2 ## 6 Class1 0.999275071 0.0007249286 Class1 ``` ```r accuracy(two_class_example, truth = truth, estimate = predicted) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.838 ``` --- ```r library(yardstick) head(two_class_example) ``` ``` ## truth Class1 Class2 predicted ## 1 Class2 0.003589243 0.9964107574 Class2 ## 2 Class1 0.678621054 0.3213789460 Class1 ## 3 Class2 0.110893522 0.8891064779 Class2 ## 4 Class1 0.735161703 0.2648382969 Class1 ## 5 Class2 0.016239960 0.9837600397 Class2 ## 6 Class1 0.999275071 0.0007249286 Class1 ``` ```r j_index(two_class_example, truth = truth, estimate = predicted) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 j_index binary 0.673 ``` And many more!! --- ```r library(yardstick) head(two_class_example) ``` ``` ## truth Class1 Class2 predicted ## 1 Class2 0.003589243 0.9964107574 Class2 ## 2 Class1 0.678621054 0.3213789460 Class1 ## 3 Class2 0.110893522 0.8891064779 Class2 ## 4 Class1 0.735161703 0.2648382969 Class1 ## 5 Class2 0.016239960 0.9837600397 Class2 ## 6 Class1 0.999275071 0.0007249286 Class1 ``` ```r conf_mat(two_class_example, truth = truth, estimate = predicted) ``` ``` ## Truth ## Prediction Class1 Class2 ## Class1 227 50 ## Class2 31 192 ``` --- ```r roc_curve(two_class_example, truth = truth, Class1) %>% autoplot() ``` <img src="index_files/figure-html/unnamed-chunk-42-1.png" width="700px" style="display: block; margin: auto;" />