Modeling in the Tidyverse

---

## Modeling in the Tidyverse

### Applications of Data Science - Class 5

### Giora Simchoni

#### `gsimchoni@gmail.com and add #dsapps in subject`

### Stat. and OR Department, TAU
### 2020-03-28

---

<div class="my-footer">
  <span>
    <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science
    </a>
  </span>
</div>

---

# The Problem

---

### Inconsistency, Inextensibility

```r
n <- 10000
x1 <- runif(n)
x2 <- runif(n)
t <- 1 + 2 * x1 + 3 * x2
y <- rbinom(n, 1, 1 / (1 + exp(-t)))
```

```r
glm(y ~ x1 + x2, family = "binomial")
```

```r
glmnet(as.matrix(cbind(x1, x2)), as.factor(y), family = "binomial")
```

```r
randomForest(as.factor(y) ~ x1 + x2)
```

```r
gbm(y ~ x1 + x2, data = data.frame(x1 = x1, x2 = x2, y = y))
```

😱

---

### Compare this with `sklearn`

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier,
  GradientBoostingClassifier

LogisticRegression(penalty='none').fit(X, y)

LogisticRegression(penalty='l2', C=0.001).fit(X, y)

RandomForestClassifier(n_estimators=100).fit(X, y)

GradientBoostingClassifier(n_estimators=100).fit(X, y)
```

---

# Detour: A Regression Problem

---

### IPF-Lifts: Predicting Bench Lifting

- Dataset was published as part of the [TidyTuesday](https://github.com/rfordatascience/tidytuesday) intiative
- Comes from [Open Powerlifting](https://www.openpowerlifting.org/data)
- [Wikipedia](https://en.wikipedia.org/wiki/Powerlifting): Powerlifting is a strength sport that consists of three attempts at maximal weight on three lifts: squat, bench press, and deadlift

---

The raw data has over 40K rows: for each athlete, for each event, stats about athlete gender, age and weight, and the maximal weight lifted in the 3 types of Powerlifting.

We will be predicting `best3bench_kg` based on a few predictors, no missing values:

```r
library(lubridate)

ipf_lifts <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv")

ipf_lifts <- ipf_lifts %>%
  drop_na(best3bench_kg, age) %>%
  filter(between(age, 18, 100), best3bench_kg > 0, equipment != "Wraps") %>%
  select(sex, event, equipment, age, division, bodyweight_kg, best3bench_kg, date, meet_name) %>%
  drop_na() %>%
  mutate(year = year(date), month = month(date),
         dayofweek = wday(date)) %>%
  select(-date) %>%
  mutate_if(is.character, as.factor)

dim(ipf_lifts)
```

```
## [1] 32047    11
```

---

```r
glimpse(ipf_lifts)
```

```
## Observations: 32,047
## Variables: 11
## $ sex           <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, ...
## $ event         <fct> SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD...
## $ equipment     <fct> Single-ply, Single-ply, Single-ply, Single-ply, ...
## $ age           <dbl> 33.5, 34.5, 23.5, 27.5, 37.5, 25.5, 33.5, 26.0, ...
## $ division      <fct> Open, Open, Open, Open, Open, Open, Open, Open, ...
## $ bodyweight_kg <dbl> 44, 44, 44, 44, 44, 44, 48, 48, 48, 48, 48, 48, ...
## $ best3bench_kg <dbl> 60.0, 62.5, 62.5, 60.0, 65.0, 45.0, 62.5, 77.5, ...
## $ meet_name     <fct> World Powerlifting Championships, World Powerlif...
## $ year          <dbl> 1989, 1989, 1989, 1989, 1989, 1989, 1989, 1989, ...
## $ month         <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ...
## $ dayofweek     <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ...
```

---

See the dependent variable distribution:

```r
ggplot(ipf_lifts, aes(best3bench_kg)) +
  geom_histogram(fill = "red", alpha = 0.5) +
  theme_classic()
```

---

See it vs. say age, facetted by equipment:

```r
ggplot(ipf_lifts, aes(age, best3bench_kg)) +
  geom_point(color = "red", alpha = 0.5) +
  facet_wrap(~ equipment) +
  theme_classic()
```

---

See it vs. year, by gender:

```r
ggplot(ipf_lifts, aes(factor(year), best3bench_kg, fill = sex)) +
  geom_boxplot(outlier.alpha = 0.5) +
  labs(fill = "", x = "", y = "") +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
```

---

Maybe add `$age^2$` and `$year^2$` to make Linear Regression's life easier?

```r
ipf_lifts <- ipf_lifts %>%
  mutate(age2 = age ^ 2, year2 = year ^2)
```

---

# End of Detour

---

# WARNING

.warning[
💡 What you're about to see is not a good modeling/prediction flow. This is just an intro to tidy modeling. Some of the issues with how things are done here will be raised, some will have to wait till later in the course.
]
---

# The Present Solution: `caret`

---

### Split Data

```r
library(caret)

train_idx <- createDataPartition(ipf_lifts$best3bench_kg,
                                 p = 0.6, list = FALSE)

ipf_tr <- ipf_lifts[train_idx, ]
ipf_te <- ipf_lifts[-train_idx, ]

library(glue)
glue("train no. of rows: {nrow(ipf_tr)}
     test no. of rows: {nrow(ipf_te)}")
```

```
## train no. of rows: 19230
## test no. of rows: 12817
```

Here you might consider some preprocessing.

`caret` has some nice documentation [here](https://topepo.github.io/caret/index.html).

---

### Tuning and Modeling

Define general methodology, e.g. 5-fold Cross-Validation:

```r
fit_control <- trainControl(method = "cv", number = 5)

ridge_grid <- expand.grid(alpha=0, lambda = 10^seq(-3, 1, length = 50))
lasso_grid <- expand.grid(alpha=1, lambda = 10^seq(-3, 1, length = 50))
rf_grid <- expand.grid(splitrule = "variance",
                       min.node.size = seq(10, 30, 10),
                       mtry = seq(2, 10, 2))

mod_ridge <- train(best3bench_kg ~ ., data = ipf_tr, method = "glmnet",
                trControl = fit_control, tuneGrid = ridge_grid,
                metric = "RMSE")

mod_lasso <- train(best3bench_kg ~ ., data = ipf_tr, method = "glmnet",
                trControl = fit_control, tuneGrid = lasso_grid,
                metric = "RMSE")

mod_rf <- train(best3bench_kg ~ ., data = ipf_tr, method = "ranger",
                trControl = fit_control, tuneGrid = rf_grid,
                num.trees = 50, metric = "RMSE")
```

---

### Evaluating Models

```r
mod_ridge
```

```
## glmnet 
## 
## 19230 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15384, 15384, 15384, 15383, 15385 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE      Rsquared   MAE     
##    0.001000000  26.62116  0.8131891  20.33717
##    0.001206793  26.62116  0.8131891  20.33717
##    0.001456348  26.62116  0.8131891  20.33717
##    0.001757511  26.62116  0.8131891  20.33717
##    0.002120951  26.62116  0.8131891  20.33717
##    0.002559548  26.62116  0.8131891  20.33717
##    0.003088844  26.62116  0.8131891  20.33717
##    0.003727594  26.62116  0.8131891  20.33717
##    0.004498433  26.62116  0.8131891  20.33717
##    0.005428675  26.62116  0.8131891  20.33717
##    0.006551286  26.62116  0.8131891  20.33717
##    0.007906043  26.62116  0.8131891  20.33717
##    0.009540955  26.62116  0.8131891  20.33717
##    0.011513954  26.62116  0.8131891  20.33717
##    0.013894955  26.62116  0.8131891  20.33717
##    0.016768329  26.62116  0.8131891  20.33717
##    0.020235896  26.62116  0.8131891  20.33717
##    0.024420531  26.62116  0.8131891  20.33717
##    0.029470517  26.62116  0.8131891  20.33717
##    0.035564803  26.62116  0.8131891  20.33717
##    0.042919343  26.62116  0.8131891  20.33717
##    0.051794747  26.62116  0.8131891  20.33717
##    0.062505519  26.62116  0.8131891  20.33717
##    0.075431201  26.62116  0.8131891  20.33717
##    0.091029818  26.62116  0.8131891  20.33717
##    0.109854114  26.62116  0.8131891  20.33717
##    0.132571137  26.62116  0.8131891  20.33717
##    0.159985872  26.62116  0.8131891  20.33717
##    0.193069773  26.62116  0.8131891  20.33717
##    0.232995181  26.62116  0.8131891  20.33717
##    0.281176870  26.62116  0.8131891  20.33717
##    0.339322177  26.62116  0.8131891  20.33717
##    0.409491506  26.62116  0.8131891  20.33717
##    0.494171336  26.62116  0.8131891  20.33717
##    0.596362332  26.62116  0.8131891  20.33717
##    0.719685673  26.62116  0.8131891  20.33717
##    0.868511374  26.62116  0.8131891  20.33717
##    1.048113134  26.62116  0.8131891  20.33717
##    1.264855217  26.62116  0.8131891  20.33717
##    1.526417967  26.62116  0.8131891  20.33717
##    1.842069969  26.62116  0.8131891  20.33717
##    2.222996483  26.62116  0.8131891  20.33717
##    2.682695795  26.62116  0.8131891  20.33717
##    3.237457543  26.62116  0.8131891  20.33717
##    3.906939937  26.62116  0.8131891  20.33717
##    4.714866363  26.65698  0.8129985  20.36312
##    5.689866029  26.75322  0.8125276  20.43353
##    6.866488450  26.87743  0.8119974  20.52643
##    8.286427729  27.03895  0.8113904  20.65026
##   10.000000000  27.24875  0.8106966  20.81558
## 
## Tuning parameter 'alpha' was held constant at a value of 0
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 0 and lambda = 3.90694.
```

---

```r
mod_lasso
```

```
## glmnet 
## 
## 19230 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15385, 15383, 15384, 15384, 15384 
## Resampling results across tuning parameters:
## 
##   lambda        RMSE      Rsquared   MAE     
##    0.001000000  26.17306  0.8175264  20.02156
##    0.001206793  26.17306  0.8175264  20.02156
##    0.001456348  26.17306  0.8175264  20.02156
##    0.001757511  26.17306  0.8175264  20.02156
##    0.002120951  26.17306  0.8175264  20.02156
##    0.002559548  26.17306  0.8175264  20.02156
##    0.003088844  26.17306  0.8175264  20.02156
##    0.003727594  26.17306  0.8175264  20.02156
##    0.004498433  26.17306  0.8175264  20.02156
##    0.005428675  26.17306  0.8175264  20.02156
##    0.006551286  26.17306  0.8175264  20.02156
##    0.007906043  26.17306  0.8175264  20.02156
##    0.009540955  26.17306  0.8175264  20.02156
##    0.011513954  26.17306  0.8175264  20.02156
##    0.013894955  26.17306  0.8175264  20.02156
##    0.016768329  26.17306  0.8175264  20.02156
##    0.020235896  26.17306  0.8175264  20.02156
##    0.024420531  26.17411  0.8175123  20.02235
##    0.029470517  26.17936  0.8174401  20.02631
##    0.035564803  26.18784  0.8173229  20.03279
##    0.042919343  26.19558  0.8172169  20.03818
##    0.051794747  26.20434  0.8170972  20.04438
##    0.062505519  26.22255  0.8168466  20.05858
##    0.075431201  26.24173  0.8165830  20.07350
##    0.091029818  26.26375  0.8162812  20.09027
##    0.109854114  26.29664  0.8158283  20.11522
##    0.132571137  26.34132  0.8152117  20.14845
##    0.159985872  26.39904  0.8144131  20.19180
##    0.193069773  26.44300  0.8138098  20.22349
##    0.232995181  26.45547  0.8136621  20.22995
##    0.281176870  26.47118  0.8134819  20.23896
##    0.339322177  26.49316  0.8132309  20.25172
##    0.409491506  26.52358  0.8128851  20.26987
##    0.494171336  26.56392  0.8124324  20.29394
##    0.596362332  26.61843  0.8118256  20.32849
##    0.719685673  26.69321  0.8109932  20.37825
##    0.868511374  26.78233  0.8100477  20.43633
##    1.048113134  26.89414  0.8089071  20.50883
##    1.264855217  27.04615  0.8073601  20.60935
##    1.526417967  27.25127  0.8052708  20.75002
##    1.842069969  27.51652  0.8026082  20.93213
##    2.222996483  27.88188  0.7988217  21.18328
##    2.682695795  28.36021  0.7938001  21.51756
##    3.237457543  29.00320  0.7867506  21.98933
##    3.906939937  29.83979  0.7771374  22.63028
##    4.714866363  30.87884  0.7647061  23.46954
##    5.689866029  32.29939  0.7455284  24.66346
##    6.866488450  33.96131  0.7213668  26.09524
##    8.286427729  35.45015  0.7018655  27.40190
##   10.000000000  37.33238  0.6739192  29.02023
## 
## Tuning parameter 'alpha' was held constant at a value of 1
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were alpha = 1 and lambda = 0.0202359.
```

---

```r
mod_rf
```

```
## Random Forest 
## 
## 19230 samples
##    12 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15383, 15384, 15385, 15385, 15383 
## Resampling results across tuning parameters:
## 
##   min.node.size  mtry  RMSE      Rsquared   MAE     
##   10              2    44.40672  0.7004199  35.68288
##   10              4    34.03673  0.7885218  26.78077
##   10              6    28.22751  0.8205982  21.77966
##   10              8    25.77867  0.8363904  19.74503
##   10             10    24.59374  0.8435989  18.71429
##   20              2    43.79755  0.7232273  35.13259
##   20              4    33.79883  0.7851855  26.46222
##   20              6    28.84054  0.8154826  22.23134
##   20              8    25.94377  0.8355417  19.81437
##   20             10    24.65925  0.8431627  18.74507
##   30              2    44.56685  0.7097288  35.74805
##   30              4    33.24043  0.7890227  25.97921
##   30              6    28.49393  0.8190458  21.96644
##   30              8    25.70164  0.8368138  19.67986
##   30             10    24.67475  0.8443715  18.76560
## 
## Tuning parameter 'splitrule' was held constant at a value of variance
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were mtry = 10, splitrule =
##  variance and min.node.size = 10.
```

---

```r
plot(mod_ridge)
```

---

```r
plot(mod_lasso)
```

---

```r
plot(mod_rf)
```

---

### Comparing Models

```r
resamps <- resamples(list(Ridge = mod_ridge,
                          Lasso = mod_lasso,
                          RF = mod_rf))
summary(resamps)
```

```
## 
## Call:
## summary.resamples(object = resamps)
## 
## Models: Ridge, Lasso, RF 
## Number of resamples: 5 
## 
## MAE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Ridge 20.08838 20.22315 20.40625 20.33717 20.42442 20.54366    0
## Lasso 19.78475 19.93539 19.95993 20.02156 19.97522 20.45251    0
## RF    18.43908 18.55445 18.60816 18.71429 18.75826 19.21148    0
## 
## RMSE 
##           Min.  1st Qu.   Median     Mean  3rd Qu.     Max. NA's
## Ridge 26.37970 26.38121 26.69809 26.62116 26.70952 26.93727    0
## Lasso 25.85918 26.00483 26.10097 26.17306 26.14174 26.75859    0
## RF    24.25151 24.28312 24.64581 24.59374 24.67083 25.11742    0
## 
## Rsquared 
##            Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Ridge 0.8077323 0.8100824 0.8137347 0.8131891 0.8159804 0.8184157    0
## Lasso 0.8068279 0.8187298 0.8201238 0.8175264 0.8206666 0.8212841    0
## RF    0.8389078 0.8419422 0.8444349 0.8435989 0.8445518 0.8481578    0
```

---

```r
dotplot(resamps, metric = "RMSE")
```

---

### Predicting

```r
pred_ridge <- predict(mod_ridge, newdata = ipf_te)
pred_lasso <- predict(mod_lasso, newdata = ipf_te)
pred_rf <- predict(mod_rf, newdata = ipf_te)

rmse_ridge <- RMSE(pred_ridge, ipf_te$best3bench_kg)
rmse_lasso <- RMSE(pred_lasso, ipf_te$best3bench_kg)
rmse_rf <- RMSE(pred_rf, ipf_te$best3bench_kg)

glue("Test RMSE Ridge: {format(rmse_ridge, digits = 3)}
     Test RMSE Lassoe: {format(rmse_lasso, digits = 3)}
     Test RMSE RF: {format(rmse_rf, digits = 3)}")
```

```
## Test RMSE Ridge: 26.8
## Test RMSE Lassoe: 26.4
## Test RMSE RF: 24.6
```

Ask yourself what is this model good for, if at all 🤭
]

---

```r
bind_rows(
  tibble(method = "Ridge", pred = pred_ridge, true = ipf_te$best3bench_kg),
  tibble(method = "Lasso", pred = pred_lasso, true = ipf_te$best3bench_kg),
  tibble(method = "RF", pred = pred_rf, true = ipf_te$best3bench_kg)) %>%
  ggplot(aes(pred, true)) +
  geom_point(color = "red", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  facet_wrap(~ method) +
  theme_bw()
```

---

# The Future Solution: `tidymodels`

#### Inspired by [Julia Silge](https://juliasilge.com/blog/intro-tidymodels/)

---

### Packages under tidymodels

- `parsnip`: tidy `caret`
- `dials` and `tune`: specifying and tuning model parameters
- `rsample`: sampling, data partitioning
- `recipes` and `embed`: preprocessing and creating model matrices
- `infer`: tidy statistics
- `yardstick`: measuring models performance
- `broom`: convert models output into tidy tibbles

And [more](https://www.tidyverse.org/blog/2018/08/tidymodels-0-0-1/).

---

### Split Data

The `initial_split()` function is from the `rsample` package:

```r
library(tidymodels)

ipf_split_obj <- ipf_lifts %>%
  initial_split(prop = 0.6, strata = equipment)

ipf_tr <- training(ipf_split_obj)
ipf_te <- testing(ipf_split_obj)

glue("train no. of rows: {nrow(ipf_tr)}
     test no. of rows: {nrow(ipf_te)}")
```

```
## train no. of rows: 19229
## test no. of rows: 12818
```

```r
print(ipf_split_obj)
```

```
## <19229/12818/32047>
```

---

### Preprocess .font80percent[(but we're not gonna use it)]

The `recipe()` function is from the `recipes` package. It allows you to specify a python-like pipe you can later apply to any dataset, including all preprocessing steps:

```r
ipf_rec <- recipe(best3bench_kg ~ ., data = ipf_tr)
ipf_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         12
```

`recipes` contains more preprocessing [`step_`s](https://tidymodels.github.io/recipes/reference/index.html) than you imagine:

```r
ipf_rec <-  ipf_rec %>%
  step_normalize(all_numeric())
```

---

After you have your `recipe` you need to `prep()` materials...

```r
ipf_rec <- ipf_rec %>% prep(ipf_tr)

ipf_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor         12
## 
## Training data contained 19229 data points and no missing data.
## 
## Operations:
## 
## Centering and scaling for age, bodyweight_kg, year, month, ... [trained]
```

At this point our `recipe` has all necessary `sd` and `mean`s for numeric variables.

---

```r
ipf_rec$var_info
```

```
## # A tibble: 13 x 4
##    variable      type    role      source  
##    <chr>         <chr>   <chr>     <chr>   
##  1 sex           nominal predictor original
##  2 event         nominal predictor original
##  3 equipment     nominal predictor original
##  4 age           numeric predictor original
##  5 division      nominal predictor original
##  6 bodyweight_kg numeric predictor original
##  7 meet_name     nominal predictor original
##  8 year          numeric predictor original
##  9 month         numeric predictor original
## 10 dayofweek     numeric predictor original
## 11 age2          numeric predictor original
## 12 year2         numeric predictor original
## 13 best3bench_kg numeric outcome   original
```

---

```r
ipf_rec$levels$meet_name
```

```
## $values
##  [1] "3rd World University Powerlifting Cup"                 
##  [2] "48th World Open Championships"                         
##  [3] "6th World Classic Championships"                       
##  [4] "Disabled Bench Press World Championships"              
##  [5] "Men's World Powerlifting Championships"                
##  [6] "Reykjavík International Games"                         
##  [7] "Student's World Cup"                                   
##  [8] "University Powerlifting Cup"                           
##  [9] "Women's World Powerlifting Championship"               
## [10] "Women's World Powerlifting Championships"              
## [11] "World Bench Press Championships"                       
## [12] "World Classic Bench Press Championships"               
## [13] "World Classic Powerlifting Championships"              
## [14] "World Classic Powerlifting Cup"                        
## [15] "World Disabled Bench Press Championships"              
## [16] "World Games"                                           
## [17] "World Juniors & Sub-Juniors Championships"             
## [18] "World Juniors Powerlifting Championships"              
## [19] "World Masters Bench Press Championships"               
## [20] "World Masters Championships"                           
## [21] "World Masters Powerlifting Championships"              
## [22] "World Men's Powerlifting Championship"                 
## [23] "World Open Bench Press Championships"                  
## [24] "World Open Championships"                              
## [25] "World Powerlifting Championships"                      
## [26] "World Students Cup"                                    
## [27] "World Sub-Juniors & Juniors Bench Press Championships" 
## [28] "World Sub-Juniors & Juniors Powerlifting Championships"
## [29] "World Sub-Juniors Powerlifting Championships"          
## [30] "World University Powerlifting Cup"                     
## 
## $ordered
## [1] FALSE
## 
## $factor
## [1] TRUE
```

---

```r
ipf_rec$steps[[1]]$means
```

```
##           age bodyweight_kg          year         month     dayofweek 
##  3.661433e+01  8.146179e+01  2.006976e+03  7.753029e+00  3.576005e+00 
##          age2         year2 best3bench_kg 
##  1.544206e+03  4.028041e+06  1.481720e+02
```

```r
ipf_rec$steps[[1]]$sds
```

```
##           age bodyweight_kg          year         month     dayofweek 
##     14.269106     24.865236      9.326106      2.716649      1.740679 
##          age2         year2 best3bench_kg 
##   1193.043744  37383.445309     60.975633
```

---

And then we `bake()` (or [`juice()`](https://tidymodels.github.io/recipes/reference/juice.html)):

```r
ipf_tr2 <- ipf_rec %>% bake(ipf_tr)
ipf_te2 <- ipf_rec %>% bake(ipf_te)

glue("mean of age in orig training: {format(mean(ipf_tr$age), digits = 3)}, sd: {format(sd(ipf_tr$age), digits = 3)}
     mean of age in baked training: {format(mean(ipf_tr2$age), digits = 0)}, sd: {format(sd(ipf_tr2$age), digits = 3)}")
```

```
## mean of age in orig training: 36.6, sd: 14.3
## mean of age in baked training: 0, sd: 1
```

```r
glue("mean of age in orig testing: {format(mean(ipf_te$age), digits = 3)}, sd: {format(sd(ipf_te$age), digits = 3)}
     mean of age in baked testing: {format(mean(ipf_te2$age), digits = 0)}, sd: {format(sd(ipf_te2$age), digits = 3)}")
```

```
## mean of age in orig testing: 36.6, sd: 14.3
## mean of age in baked testing: 0, sd: 1
```

---

Or you can do it all in a single pipe:

```r
ipf_rec <- recipe(best3bench_kg ~ ., data = ipf_tr) %>%
  step_normalize(all_numeric()) %>%
  prep(ipf_tr)

ipf_tr2 <- ipf_rec %>% bake(ipf_tr)
ipf_te2 <- ipf_rec %>% bake(ipf_te)

```
## mean of age in orig training: 36.6, sd: 14.3
## mean of age in baked training: 0, sd: 1
```

```
## mean of age in orig testing: 36.6, sd: 14.3
## mean of age in baked testing: 0, sd: 1
```

---

### Modeling

For now let us use the original `ipf_tr` data.

Functions `linear_reg()` and `set_engine()` are from the `parsnip` package:

```r
mod_ridge_spec <- linear_reg(mixture = 0, penalty = 0.001) %>%
  set_engine(engine = "glmnet")

mod_ridge_spec
```

```
## Linear Regression Model Specification (regression)
## 
## Main Arguments:
##   penalty = 0.001
##   mixture = 0
## 
## Computational engine: glmnet
```

---

```r
mod_ridge <- mod_ridge_spec %>%
  fit(best3bench_kg ~ ., data = ipf_tr)

mod_ridge
```

```
## parsnip model object
## 
## Fit in:  51ms
## Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian",      alpha = ~0) 
## 
##     Df    %Dev Lambda
## 1   51 0.00000  42890
## 2   51 0.00398  39080
## 3   51 0.00436  35610
## 4   51 0.00478  32450
## 5   51 0.00525  29560
## 6   51 0.00576  26940
## 7   51 0.00632  24540
## 8   51 0.00693  22360
## 9   51 0.00760  20380
## 10  51 0.00833  18570
## 11  51 0.00914  16920
## 12  51 0.01002  15410
## 13  51 0.01098  14050
## 14  51 0.01204  12800
## 15  51 0.01320  11660
## 16  51 0.01447  10620
## 17  51 0.01586   9681
## 18  51 0.01737   8821
## 19  51 0.01904   8037
## 20  51 0.02085   7323
## 21  51 0.02284   6673
## 22  51 0.02501   6080
## 23  51 0.02738   5540
## 24  51 0.02997   5048
## 25  51 0.03279   4599
## 26  51 0.03587   4191
## 27  51 0.03923   3818
## 28  51 0.04289   3479
## 29  51 0.04687   3170
## 30  51 0.05120   2888
## 31  51 0.05591   2632
## 32  51 0.06102   2398
## 33  51 0.06657   2185
## 34  51 0.07258   1991
## 35  51 0.07908   1814
## 36  51 0.08612   1653
## 37  51 0.09371   1506
## 38  51 0.10190   1372
## 39  51 0.11070   1250
## 40  51 0.12020   1139
## 41  51 0.13030   1038
## 42  51 0.14120    946
## 43  51 0.15280    862
## 44  51 0.16520    785
## 45  51 0.17830    716
## 46  51 0.19220    652
## 47  51 0.20700    594
## 48  51 0.22250    541
## 49  51 0.23880    493
## 50  51 0.25590    449
## 51  51 0.27370    409
## 52  51 0.29230    373
## 53  51 0.31140    340
## 54  51 0.33120    310
## 55  51 0.35150    282
## 56  51 0.37230    257
## 57  51 0.39340    234
## 58  51 0.41470    214
## 59  51 0.43620    194
## 60  51 0.45780    177
## 61  51 0.47920    162
## 62  51 0.50050    147
## 63  51 0.52150    134
## 64  51 0.54200    122
## 65  51 0.56200    111
## 66  51 0.58140    101
## 67  51 0.60010     92
## 68  51 0.61800     84
## 69  51 0.63500     77
## 70  51 0.65110     70
## 71  51 0.66630     64
## 72  51 0.68050     58
## 73  51 0.69370     53
## 74  51 0.70600     48
## 75  51 0.71730     44
## 76  51 0.72770     40
## 77  51 0.73720     36
## 78  51 0.74580     33
## 79  51 0.75360     30
## 80  51 0.76060     28
## 81  51 0.76700     25
## 82  51 0.77260     23
## 83  51 0.77760     21
## 84  51 0.78210     19
## 85  51 0.78610     17
## 86  51 0.78960     16
## 87  51 0.79270     14
## 88  51 0.79540     13
## 89  51 0.79780     12
## 90  51 0.79990     11
## 91  51 0.80180     10
## 92  51 0.80340      9
## 93  51 0.80480      8
## 94  51 0.80610      7
## 95  51 0.80720      7
## 96  51 0.80810      6
## 97  51 0.80890      6
## 98  51 0.80970      5
## 99  51 0.81030      5
## 100 51 0.81090      4
```

---

In a single pipe:

```r
mod_lasso <- linear_reg(mixture = 1, penalty = 0.001) %>%
  set_engine(engine = "glmnet") %>%
  fit(best3bench_kg ~ ., data = ipf_tr)

mod_lasso
```

```
## parsnip model object
## 
## Fit in:  50ms
## Call:  glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian",      alpha = ~1) 
## 
##    Df    %Dev Lambda
## 1   0 0.00000 42.890
## 2   1 0.08402 39.080
## 3   2 0.16380 35.610
## 4   2 0.24350 32.450
## 5   2 0.30960 29.560
## 6   2 0.36450 26.940
## 7   2 0.41000 24.540
## 8   2 0.44790 22.360
## 9   2 0.47930 20.380
## 10  2 0.50540 18.570
## 11  2 0.52700 16.920
## 12  2 0.54500 15.410
## 13  2 0.55990 14.050
## 14  2 0.57230 12.800
## 15  4 0.59030 11.660
## 16  4 0.61170 10.620
## 17  6 0.63330  9.681
## 18  6 0.65230  8.821
## 19  6 0.66810  8.037
## 20  7 0.68240  7.323
## 21  7 0.69490  6.673
## 22  8 0.70840  6.080
## 23  8 0.72340  5.540
## 24  8 0.73580  5.048
## 25  9 0.74620  4.599
## 26  9 0.75480  4.191
## 27 12 0.76240  3.818
## 28 12 0.76930  3.479
## 29 13 0.77540  3.170
## 30 13 0.78060  2.888
## 31 13 0.78500  2.632
## 32 14 0.78870  2.398
## 33 15 0.79200  2.185
## 34 15 0.79490  1.991
## 35 16 0.79740  1.814
## 36 18 0.79950  1.653
## 37 19 0.80150  1.506
## 38 19 0.80320  1.372
## 39 19 0.80460  1.250
## 40 19 0.80580  1.139
## 41 19 0.80680  1.038
## 42 21 0.80760  0.946
## 43 24 0.80840  0.862
## 44 24 0.80920  0.785
## 45 25 0.80990  0.716
## 46 26 0.81040  0.652
## 47 26 0.81090  0.594
## 48 27 0.81140  0.541
## 49 27 0.81170  0.493
## 50 28 0.81200  0.449
## 51 29 0.81230  0.409
## 52 30 0.81250  0.373
## 53 30 0.81270  0.340
## 54 31 0.81280  0.310
## 55 34 0.81300  0.282
## 56 35 0.81310  0.257
## 57 36 0.81320  0.234
## 58 38 0.81330  0.214
## 59 40 0.81340  0.194
## 60 40 0.81360  0.177
## 61 40 0.81410  0.162
## 62 40 0.81450  0.147
## 63 42 0.81490  0.134
## 64 44 0.81520  0.122
## 65 41 0.81550  0.111
## 66 41 0.81570  0.101
## 67 42 0.81590  0.092
## 68 42 0.81620  0.084
## 69 44 0.81630  0.077
## 70 45 0.81650  0.070
## 71 47 0.81660  0.064
## 72 48 0.81670  0.058
## 73 49 0.81690  0.053
## 74 46 0.81700  0.048
## 75 47 0.81700  0.044
## 76 47 0.81710  0.040
## 77 49 0.81710  0.036
## 78 48 0.81720  0.033
## 79 48 0.81720  0.030
## 80 49 0.81730  0.028
## 81 50 0.81730  0.025
## 82 50 0.81740  0.023
## 83 50 0.81740  0.021
## 84 50 0.81740  0.019
```

---

Can also use `fit_xy()` a-la `sklearn`:

```r
mod_rf <- rand_forest(mode = "regression", mtry = 4, trees = 50, min_n = 30) %>%
  set_engine("ranger") %>%
  fit_xy(x = ipf_tr[, -7],
         y = ipf_tr$best3bench_kg)

mod_rf
```

```
## parsnip model object
## 
## Fit in:  720msRanger result
## 
## Call:
##  ranger::ranger(formula = formula, data = data, mtry = ~4, num.trees = ~50,      min.node.size = ~30, num.threads = 1, verbose = FALSE, seed = sample.int(10^5,          1)) 
## 
## Type:                             Regression 
## Number of trees:                  50 
## Sample size:                      19229 
## Number of independent variables:  12 
## Mtry:                             4 
## Target node size:                 30 
## Variable importance mode:         none 
## Splitrule:                        variance 
## OOB prediction error (MSE):       560.9305 
## R squared (OOB):                  0.8491322
```

---

Notice how easy it is to get the model's results in a tidy way using the `tidy()` function:

```r
tidy(mod_ridge)
```

```
## # A tibble: 5,200 x 5
##    term                 step  estimate lambda dev.ratio
##    <chr>               <dbl>     <dbl>  <dbl>     <dbl>
##  1 (Intercept)             1  1.48e+ 2 42893.  2.58e-36
##  2 sexM                    1  8.47e-35 42893.  2.58e-36
##  3 eventSB                 1 -8.25e-36 42893.  2.58e-36
##  4 eventSBD                1 -2.53e-35 42893.  2.58e-36
##  5 equipmentSingle-ply     1  2.93e-35 42893.  2.58e-36
##  6 age                     1 -5.11e-37 42893.  2.58e-36
##  7 divisionJuniors         1 -5.96e-36 42893.  2.58e-36
##  8 divisionLight           1 -1.34e-35 42893.  2.58e-36
##  9 divisionMasters 1       1 -9.26e-37 42893.  2.58e-36
## 10 divisionMasters 2       1 -1.42e-35 42893.  2.58e-36
## # ... with 5,190 more rows
```

---

### Predicting

```r
results_test <- mod_ridge %>%
  predict(new_data = ipf_te, penalty = 0.001) %>%
  mutate(
    truth = ipf_te$best3bench_kg,
    method = "Ridge"
  ) %>%
  bind_rows(mod_lasso %>%
    predict(new_data = ipf_te) %>%
    mutate(
      truth = ipf_te$best3bench_kg,
      method = "Lasso"
  )) %>%
  bind_rows(mod_rf %>%
    predict(new_data = ipf_te) %>%
    mutate(
      truth = ipf_te$best3bench_kg,
      method = "RF"
  ))

dim(results_test)
```

```
## [1] 38454     3
```

---

### Comparing Models

The package `yardstick` has tons of performance [metrics](https://tidymodels.github.io/yardstick/articles/metric-types.html):

```r
results_test %>%
  group_by(method) %>%
  yardstick::rmse(truth = truth, estimate = .pred)
```

```
## # A tibble: 3 x 4
##   method .metric .estimator .estimate
##   <chr>  <chr>   <chr>          <dbl>
## 1 Lasso  rmse    standard        26.4
## 2 RF     rmse    standard        23.7
## 3 Ridge  rmse    standard        26.9
```

---

```r
results_test %>%
  ggplot(aes(.pred, truth)) +
  geom_point(color = "red", alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0) +
  facet_wrap(~ method) +
  theme_bw()
```

---

### Tuning

Define your model spec, using `tune()` from the `tune` package for a parameter you wish to tune:

```r
library(tune)

mod_rf_spec <- rand_forest(mode = "regression",
                           mtry = tune(),
                           min_n = tune(),
                           trees = 100) %>%
  set_engine("ranger")
```

---

Define the `grid` on which you train your params, with the `dials` package:

```r
rf_grid <- grid_regular(mtry(range(2, 10)), min_n(range(10, 30)),
                        levels = c(5, 3))

rf_grid
```

```
## # A tibble: 15 x 2
##     mtry min_n
##    <int> <int>
##  1     2    10
##  2     4    10
##  3     6    10
##  4     8    10
##  5    10    10
##  6     2    20
##  7     4    20
##  8     6    20
##  9     8    20
## 10    10    20
## 11     2    30
## 12     4    30
## 13     6    30
## 14     8    30
## 15    10    30
```

---

Split your data into a few folds for Cross Validation with `vfold_cv()` from the `rsample` package:

```r
cv_splits <- vfold_cv(ipf_tr, v = 5)

cv_splits
```

```
## #  5-fold cross-validation 
## # A tibble: 5 x 2
##   splits               id   
##   <named list>         <chr>
## 1 <split [15.4K/3.8K]> Fold1
## 2 <split [15.4K/3.8K]> Fold2
## 3 <split [15.4K/3.8K]> Fold3
## 4 <split [15.4K/3.8K]> Fold4
## 5 <split [15.4K/3.8K]> Fold5
```

---

Now perform cross validation with `tune_grid()` from the `tune` package:

```r
tune_res <- tune_grid(recipe(best3bench_kg ~ ., data = ipf_tr),
                      model = mod_rf_spec,
                      resamples = cv_splits,
                      grid = rf_grid,
                      metrics = metric_set(rmse))
tune_res
```

```
## #  5-fold cross-validation 
## # A tibble: 5 x 4
##   splits               id    .metrics          .notes          
## * <list>               <chr> <list>            <list>          
## 1 <split [15.4K/3.8K]> Fold1 <tibble [15 x 5]> <tibble [0 x 1]>
## 2 <split [15.4K/3.8K]> Fold2 <tibble [15 x 5]> <tibble [0 x 1]>
## 3 <split [15.4K/3.8K]> Fold3 <tibble [15 x 5]> <tibble [0 x 1]>
## 4 <split [15.4K/3.8K]> Fold4 <tibble [15 x 5]> <tibble [0 x 1]>
## 5 <split [15.4K/3.8K]> Fold5 <tibble [15 x 5]> <tibble [0 x 1]>
```

---

```r
tune_res$.metrics[[1]]
```

```
## # A tibble: 15 x 5
##     mtry min_n .metric .estimator .estimate
##    <int> <int> <chr>   <chr>          <dbl>
##  1     2    10 rmse    standard        25.4
##  2     4    10 rmse    standard        24.4
##  3     6    10 rmse    standard        24.6
##  4     8    10 rmse    standard        24.8
##  5    10    10 rmse    standard        24.9
##  6     2    20 rmse    standard        25.4
##  7     4    20 rmse    standard        24.3
##  8     6    20 rmse    standard        24.4
##  9     8    20 rmse    standard        24.5
## 10    10    20 rmse    standard        24.6
## 11     2    30 rmse    standard        25.5
## 12     4    30 rmse    standard        24.2
## 13     6    30 rmse    standard        24.3
## 14     8    30 rmse    standard        24.4
## 15    10    30 rmse    standard        24.4
```

---

Collect the mean metric across folds:

```r
estimates <- collect_metrics(tune_res)

estimates
```

```
## # A tibble: 15 x 7
##     mtry min_n .metric .estimator  mean     n std_err
##    <int> <int> <chr>   <chr>      <dbl> <int>   <dbl>
##  1     2    10 rmse    standard    24.7     5   0.197
##  2     2    20 rmse    standard    24.7     5   0.190
##  3     2    30 rmse    standard    24.8     5   0.192
##  4     4    10 rmse    standard    23.7     5   0.186
##  5     4    20 rmse    standard    23.6     5   0.192
##  6     4    30 rmse    standard    23.5     5   0.181
##  7     6    10 rmse    standard    23.9     5   0.191
##  8     6    20 rmse    standard    23.7     5   0.186
##  9     6    30 rmse    standard    23.6     5   0.186
## 10     8    10 rmse    standard    24.0     5   0.203
## 11     8    20 rmse    standard    23.7     5   0.185
## 12     8    30 rmse    standard    23.6     5   0.185
## 13    10    10 rmse    standard    24.1     5   0.206
## 14    10    20 rmse    standard    23.8     5   0.181
## 15    10    30 rmse    standard    23.7     5   0.183
```

---

Choose best paramter:

```r
estimates %>%
  mutate(min_n = factor(min_n)) %>%
  ggplot(aes(x = mtry, y = mean, color = min_n)) + 
  geom_point() + 
  geom_line() + 
  labs(y = "Mean RMSE") +
  theme_classic()
```

---

There are of course also methods for helping us choose best params and final model.

```r
best_rmse <- tune_res %>% select_best(metric = "rmse", maximize = FALSE)
best_rmse
```

```
## # A tibble: 1 x 2
##    mtry min_n
##   <int> <int>
## 1     4    30
```