class: logo-slide --- class: title-slide ## Modeling in the Tidyverse ### Applications of Data Science - Class 5 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2020-03-28 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # The Problem --- ### Inconsistency, Inextensibility ```r n <- 10000 x1 <- runif(n) x2 <- runif(n) t <- 1 + 2 * x1 + 3 * x2 y <- rbinom(n, 1, 1 / (1 + exp(-t))) ``` ```r glm(y ~ x1 + x2, family = "binomial") ``` ```r glmnet(as.matrix(cbind(x1, x2)), as.factor(y), family = "binomial") ``` ```r randomForest(as.factor(y) ~ x1 + x2) ``` ```r gbm(y ~ x1 + x2, data = data.frame(x1 = x1, x2 = x2, y = y)) ``` 😱 --- ### Compare this with `sklearn` ```python from sklearn.linear_model import LogisticRegression from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier LogisticRegression(penalty='none').fit(X, y) LogisticRegression(penalty='l2', C=0.001).fit(X, y) RandomForestClassifier(n_estimators=100).fit(X, y) GradientBoostingClassifier(n_estimators=100).fit(X, y) ``` --- class: section-slide # Detour: A Regression Problem --- ### IPF-Lifts: Predicting Bench Lifting - Dataset was published as part of the [TidyTuesday](https://github.com/rfordatascience/tidytuesday) intiative - Comes from [Open Powerlifting](https://www.openpowerlifting.org/data) - [Wikipedia](https://en.wikipedia.org/wiki/Powerlifting): Powerlifting is a strength sport that consists of three attempts at maximal weight on three lifts: squat, bench press, and deadlift <img src="images/pl_bench.jpg" style="width: 70%" /> --- The raw data has over 40K rows: for each athlete, for each event, stats about athlete gender, age and weight, and the maximal weight lifted in the 3 types of Powerlifting. We will be predicting `best3bench_kg` based on a few predictors, no missing values: ```r library(lubridate) ipf_lifts <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-08/ipf_lifts.csv") ipf_lifts <- ipf_lifts %>% drop_na(best3bench_kg, age) %>% filter(between(age, 18, 100), best3bench_kg > 0, equipment != "Wraps") %>% select(sex, event, equipment, age, division, bodyweight_kg, best3bench_kg, date, meet_name) %>% drop_na() %>% mutate(year = year(date), month = month(date), dayofweek = wday(date)) %>% select(-date) %>% mutate_if(is.character, as.factor) dim(ipf_lifts) ``` ``` ## [1] 32047 11 ``` --- ```r glimpse(ipf_lifts) ``` ``` ## Observations: 32,047 ## Variables: 11 ## $ sex <fct> F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, F, ... ## $ event <fct> SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD, SBD... ## $ equipment <fct> Single-ply, Single-ply, Single-ply, Single-ply, ... ## $ age <dbl> 33.5, 34.5, 23.5, 27.5, 37.5, 25.5, 33.5, 26.0, ... ## $ division <fct> Open, Open, Open, Open, Open, Open, Open, Open, ... ## $ bodyweight_kg <dbl> 44, 44, 44, 44, 44, 44, 48, 48, 48, 48, 48, 48, ... ## $ best3bench_kg <dbl> 60.0, 62.5, 62.5, 60.0, 65.0, 45.0, 62.5, 77.5, ... ## $ meet_name <fct> World Powerlifting Championships, World Powerlif... ## $ year <dbl> 1989, 1989, 1989, 1989, 1989, 1989, 1989, 1989, ... ## $ month <dbl> 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, 11, ... ## $ dayofweek <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, ... ``` --- See the dependent variable distribution: ```r ggplot(ipf_lifts, aes(best3bench_kg)) + geom_histogram(fill = "red", alpha = 0.5) + theme_classic() ``` <img src="images/Bench-Hist-1.png" width="100%" /> --- See it vs. say age, facetted by equipment: ```r ggplot(ipf_lifts, aes(age, best3bench_kg)) + geom_point(color = "red", alpha = 0.5) + facet_wrap(~ equipment) + theme_classic() ``` <img src="images/Bench-Age-Equipment-1.png" width="100%" /> --- See it vs. year, by gender: ```r ggplot(ipf_lifts, aes(factor(year), best3bench_kg, fill = sex)) + geom_boxplot(outlier.alpha = 0.5) + labs(fill = "", x = "", y = "") + theme_classic() + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5)) ``` <img src="images/Bench-Year-Gender-1.png" width="100%" /> --- Maybe add `\(age^2\)` and `\(year^2\)` to make Linear Regression's life easier? ```r ipf_lifts <- ipf_lifts %>% mutate(age2 = age ^ 2, year2 = year ^2) ``` --- class: section-slide # End of Detour --- # WARNING .warning[ 💡 What you're about to see is not a good modeling/prediction flow. This is just an intro to tidy modeling. Some of the issues with how things are done here will be raised, some will have to wait till later in the course. ] --- class: section-slide # The Present Solution: `caret` --- ### Split Data ```r library(caret) train_idx <- createDataPartition(ipf_lifts$best3bench_kg, p = 0.6, list = FALSE) ipf_tr <- ipf_lifts[train_idx, ] ipf_te <- ipf_lifts[-train_idx, ] library(glue) glue("train no. of rows: {nrow(ipf_tr)} test no. of rows: {nrow(ipf_te)}") ``` ``` ## train no. of rows: 19230 ## test no. of rows: 12817 ``` Here you might consider some preprocessing. `caret` has some nice documentation [here](https://topepo.github.io/caret/index.html). --- ### Tuning and Modeling Define general methodology, e.g. 5-fold Cross-Validation: ```r fit_control <- trainControl(method = "cv", number = 5) ridge_grid <- expand.grid(alpha=0, lambda = 10^seq(-3, 1, length = 50)) lasso_grid <- expand.grid(alpha=1, lambda = 10^seq(-3, 1, length = 50)) rf_grid <- expand.grid(splitrule = "variance", min.node.size = seq(10, 30, 10), mtry = seq(2, 10, 2)) mod_ridge <- train(best3bench_kg ~ ., data = ipf_tr, method = "glmnet", trControl = fit_control, tuneGrid = ridge_grid, metric = "RMSE") mod_lasso <- train(best3bench_kg ~ ., data = ipf_tr, method = "glmnet", trControl = fit_control, tuneGrid = lasso_grid, metric = "RMSE") mod_rf <- train(best3bench_kg ~ ., data = ipf_tr, method = "ranger", trControl = fit_control, tuneGrid = rf_grid, num.trees = 50, metric = "RMSE") ``` --- ### Evaluating Models ```r mod_ridge ``` ``` ## glmnet ## ## 19230 samples ## 12 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 15384, 15384, 15384, 15383, 15385 ## Resampling results across tuning parameters: ## ## lambda RMSE Rsquared MAE ## 0.001000000 26.62116 0.8131891 20.33717 ## 0.001206793 26.62116 0.8131891 20.33717 ## 0.001456348 26.62116 0.8131891 20.33717 ## 0.001757511 26.62116 0.8131891 20.33717 ## 0.002120951 26.62116 0.8131891 20.33717 ## 0.002559548 26.62116 0.8131891 20.33717 ## 0.003088844 26.62116 0.8131891 20.33717 ## 0.003727594 26.62116 0.8131891 20.33717 ## 0.004498433 26.62116 0.8131891 20.33717 ## 0.005428675 26.62116 0.8131891 20.33717 ## 0.006551286 26.62116 0.8131891 20.33717 ## 0.007906043 26.62116 0.8131891 20.33717 ## 0.009540955 26.62116 0.8131891 20.33717 ## 0.011513954 26.62116 0.8131891 20.33717 ## 0.013894955 26.62116 0.8131891 20.33717 ## 0.016768329 26.62116 0.8131891 20.33717 ## 0.020235896 26.62116 0.8131891 20.33717 ## 0.024420531 26.62116 0.8131891 20.33717 ## 0.029470517 26.62116 0.8131891 20.33717 ## 0.035564803 26.62116 0.8131891 20.33717 ## 0.042919343 26.62116 0.8131891 20.33717 ## 0.051794747 26.62116 0.8131891 20.33717 ## 0.062505519 26.62116 0.8131891 20.33717 ## 0.075431201 26.62116 0.8131891 20.33717 ## 0.091029818 26.62116 0.8131891 20.33717 ## 0.109854114 26.62116 0.8131891 20.33717 ## 0.132571137 26.62116 0.8131891 20.33717 ## 0.159985872 26.62116 0.8131891 20.33717 ## 0.193069773 26.62116 0.8131891 20.33717 ## 0.232995181 26.62116 0.8131891 20.33717 ## 0.281176870 26.62116 0.8131891 20.33717 ## 0.339322177 26.62116 0.8131891 20.33717 ## 0.409491506 26.62116 0.8131891 20.33717 ## 0.494171336 26.62116 0.8131891 20.33717 ## 0.596362332 26.62116 0.8131891 20.33717 ## 0.719685673 26.62116 0.8131891 20.33717 ## 0.868511374 26.62116 0.8131891 20.33717 ## 1.048113134 26.62116 0.8131891 20.33717 ## 1.264855217 26.62116 0.8131891 20.33717 ## 1.526417967 26.62116 0.8131891 20.33717 ## 1.842069969 26.62116 0.8131891 20.33717 ## 2.222996483 26.62116 0.8131891 20.33717 ## 2.682695795 26.62116 0.8131891 20.33717 ## 3.237457543 26.62116 0.8131891 20.33717 ## 3.906939937 26.62116 0.8131891 20.33717 ## 4.714866363 26.65698 0.8129985 20.36312 ## 5.689866029 26.75322 0.8125276 20.43353 ## 6.866488450 26.87743 0.8119974 20.52643 ## 8.286427729 27.03895 0.8113904 20.65026 ## 10.000000000 27.24875 0.8106966 20.81558 ## ## Tuning parameter 'alpha' was held constant at a value of 0 ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were alpha = 0 and lambda = 3.90694. ``` --- ```r mod_lasso ``` ``` ## glmnet ## ## 19230 samples ## 12 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 15385, 15383, 15384, 15384, 15384 ## Resampling results across tuning parameters: ## ## lambda RMSE Rsquared MAE ## 0.001000000 26.17306 0.8175264 20.02156 ## 0.001206793 26.17306 0.8175264 20.02156 ## 0.001456348 26.17306 0.8175264 20.02156 ## 0.001757511 26.17306 0.8175264 20.02156 ## 0.002120951 26.17306 0.8175264 20.02156 ## 0.002559548 26.17306 0.8175264 20.02156 ## 0.003088844 26.17306 0.8175264 20.02156 ## 0.003727594 26.17306 0.8175264 20.02156 ## 0.004498433 26.17306 0.8175264 20.02156 ## 0.005428675 26.17306 0.8175264 20.02156 ## 0.006551286 26.17306 0.8175264 20.02156 ## 0.007906043 26.17306 0.8175264 20.02156 ## 0.009540955 26.17306 0.8175264 20.02156 ## 0.011513954 26.17306 0.8175264 20.02156 ## 0.013894955 26.17306 0.8175264 20.02156 ## 0.016768329 26.17306 0.8175264 20.02156 ## 0.020235896 26.17306 0.8175264 20.02156 ## 0.024420531 26.17411 0.8175123 20.02235 ## 0.029470517 26.17936 0.8174401 20.02631 ## 0.035564803 26.18784 0.8173229 20.03279 ## 0.042919343 26.19558 0.8172169 20.03818 ## 0.051794747 26.20434 0.8170972 20.04438 ## 0.062505519 26.22255 0.8168466 20.05858 ## 0.075431201 26.24173 0.8165830 20.07350 ## 0.091029818 26.26375 0.8162812 20.09027 ## 0.109854114 26.29664 0.8158283 20.11522 ## 0.132571137 26.34132 0.8152117 20.14845 ## 0.159985872 26.39904 0.8144131 20.19180 ## 0.193069773 26.44300 0.8138098 20.22349 ## 0.232995181 26.45547 0.8136621 20.22995 ## 0.281176870 26.47118 0.8134819 20.23896 ## 0.339322177 26.49316 0.8132309 20.25172 ## 0.409491506 26.52358 0.8128851 20.26987 ## 0.494171336 26.56392 0.8124324 20.29394 ## 0.596362332 26.61843 0.8118256 20.32849 ## 0.719685673 26.69321 0.8109932 20.37825 ## 0.868511374 26.78233 0.8100477 20.43633 ## 1.048113134 26.89414 0.8089071 20.50883 ## 1.264855217 27.04615 0.8073601 20.60935 ## 1.526417967 27.25127 0.8052708 20.75002 ## 1.842069969 27.51652 0.8026082 20.93213 ## 2.222996483 27.88188 0.7988217 21.18328 ## 2.682695795 28.36021 0.7938001 21.51756 ## 3.237457543 29.00320 0.7867506 21.98933 ## 3.906939937 29.83979 0.7771374 22.63028 ## 4.714866363 30.87884 0.7647061 23.46954 ## 5.689866029 32.29939 0.7455284 24.66346 ## 6.866488450 33.96131 0.7213668 26.09524 ## 8.286427729 35.45015 0.7018655 27.40190 ## 10.000000000 37.33238 0.6739192 29.02023 ## ## Tuning parameter 'alpha' was held constant at a value of 1 ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were alpha = 1 and lambda = 0.0202359. ``` --- ```r mod_rf ``` ``` ## Random Forest ## ## 19230 samples ## 12 predictor ## ## No pre-processing ## Resampling: Cross-Validated (5 fold) ## Summary of sample sizes: 15383, 15384, 15385, 15385, 15383 ## Resampling results across tuning parameters: ## ## min.node.size mtry RMSE Rsquared MAE ## 10 2 44.40672 0.7004199 35.68288 ## 10 4 34.03673 0.7885218 26.78077 ## 10 6 28.22751 0.8205982 21.77966 ## 10 8 25.77867 0.8363904 19.74503 ## 10 10 24.59374 0.8435989 18.71429 ## 20 2 43.79755 0.7232273 35.13259 ## 20 4 33.79883 0.7851855 26.46222 ## 20 6 28.84054 0.8154826 22.23134 ## 20 8 25.94377 0.8355417 19.81437 ## 20 10 24.65925 0.8431627 18.74507 ## 30 2 44.56685 0.7097288 35.74805 ## 30 4 33.24043 0.7890227 25.97921 ## 30 6 28.49393 0.8190458 21.96644 ## 30 8 25.70164 0.8368138 19.67986 ## 30 10 24.67475 0.8443715 18.76560 ## ## Tuning parameter 'splitrule' was held constant at a value of variance ## RMSE was used to select the optimal model using the smallest value. ## The final values used for the model were mtry = 10, splitrule = ## variance and min.node.size = 10. ``` --- ```r plot(mod_ridge) ``` <img src="images/Ridge-CV-1.png" width="80%" /> --- ```r plot(mod_lasso) ``` <img src="images/Lasso-CV-1.png" width="80%" /> --- ```r plot(mod_rf) ``` <img src="images/RF-CV-1.png" width="80%" /> --- ### Comparing Models ```r resamps <- resamples(list(Ridge = mod_ridge, Lasso = mod_lasso, RF = mod_rf)) summary(resamps) ``` ``` ## ## Call: ## summary.resamples(object = resamps) ## ## Models: Ridge, Lasso, RF ## Number of resamples: 5 ## ## MAE ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 20.08838 20.22315 20.40625 20.33717 20.42442 20.54366 0 ## Lasso 19.78475 19.93539 19.95993 20.02156 19.97522 20.45251 0 ## RF 18.43908 18.55445 18.60816 18.71429 18.75826 19.21148 0 ## ## RMSE ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 26.37970 26.38121 26.69809 26.62116 26.70952 26.93727 0 ## Lasso 25.85918 26.00483 26.10097 26.17306 26.14174 26.75859 0 ## RF 24.25151 24.28312 24.64581 24.59374 24.67083 25.11742 0 ## ## Rsquared ## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's ## Ridge 0.8077323 0.8100824 0.8137347 0.8131891 0.8159804 0.8184157 0 ## Lasso 0.8068279 0.8187298 0.8201238 0.8175264 0.8206666 0.8212841 0 ## RF 0.8389078 0.8419422 0.8444349 0.8435989 0.8445518 0.8481578 0 ``` --- ```r dotplot(resamps, metric = "RMSE") ``` <img src="images/Caret-RMSE-Comp-1.png" width="100%" /> --- ### Predicting ```r pred_ridge <- predict(mod_ridge, newdata = ipf_te) pred_lasso <- predict(mod_lasso, newdata = ipf_te) pred_rf <- predict(mod_rf, newdata = ipf_te) rmse_ridge <- RMSE(pred_ridge, ipf_te$best3bench_kg) rmse_lasso <- RMSE(pred_lasso, ipf_te$best3bench_kg) rmse_rf <- RMSE(pred_rf, ipf_te$best3bench_kg) glue("Test RMSE Ridge: {format(rmse_ridge, digits = 3)} Test RMSE Lassoe: {format(rmse_lasso, digits = 3)} Test RMSE RF: {format(rmse_rf, digits = 3)}") ``` ``` ## Test RMSE Ridge: 26.8 ## Test RMSE Lassoe: 26.4 ## Test RMSE RF: 24.6 ``` .warning[ ⚠️ Is using a "regular" regression model the natural approach for these data? Ask yourself what is this model good for, if at all 🤭 ] --- ```r bind_rows( tibble(method = "Ridge", pred = pred_ridge, true = ipf_te$best3bench_kg), tibble(method = "Lasso", pred = pred_lasso, true = ipf_te$best3bench_kg), tibble(method = "RF", pred = pred_rf, true = ipf_te$best3bench_kg)) %>% ggplot(aes(pred, true)) + geom_point(color = "red", alpha = 0.5) + geom_abline(slope = 1, intercept = 0) + facet_wrap(~ method) + theme_bw() ``` <img src="images/Caret-Pred-vs-True-1.png" width="100%" /> --- class: section-slide # The Future Solution: `tidymodels` #### Inspired by [Julia Silge](https://juliasilge.com/blog/intro-tidymodels/) --- ### Packages under tidymodels - `parsnip`: tidy `caret` - `dials` and `tune`: specifying and tuning model parameters - `rsample`: sampling, data partitioning - `recipes` and `embed`: preprocessing and creating model matrices - `infer`: tidy statistics - `yardstick`: measuring models performance - `broom`: convert models output into tidy tibbles And [more](https://www.tidyverse.org/blog/2018/08/tidymodels-0-0-1/). .warning[ ⚠️ All `tidymodels` packages are under development! ] --- ### Split Data The `initial_split()` function is from the `rsample` package: ```r library(tidymodels) ipf_split_obj <- ipf_lifts %>% initial_split(prop = 0.6, strata = equipment) ipf_tr <- training(ipf_split_obj) ipf_te <- testing(ipf_split_obj) glue("train no. of rows: {nrow(ipf_tr)} test no. of rows: {nrow(ipf_te)}") ``` ``` ## train no. of rows: 19229 ## test no. of rows: 12818 ``` ```r print(ipf_split_obj) ``` ``` ## <19229/12818/32047> ``` --- ### Preprocess .font80percent[(but we're not gonna use it)] The `recipe()` function is from the `recipes` package. It allows you to specify a python-like pipe you can later apply to any dataset, including all preprocessing steps: ```r ipf_rec <- recipe(best3bench_kg ~ ., data = ipf_tr) ipf_rec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 12 ``` `recipes` contains more preprocessing [`step_`s](https://tidymodels.github.io/recipes/reference/index.html) than you imagine: ```r ipf_rec <- ipf_rec %>% step_normalize(all_numeric()) ``` --- After you have your `recipe` you need to `prep()` materials... ```r ipf_rec <- ipf_rec %>% prep(ipf_tr) ipf_rec ``` ``` ## Data Recipe ## ## Inputs: ## ## role #variables ## outcome 1 ## predictor 12 ## ## Training data contained 19229 data points and no missing data. ## ## Operations: ## ## Centering and scaling for age, bodyweight_kg, year, month, ... [trained] ``` At this point our `recipe` has all necessary `sd` and `mean`s for numeric variables. --- ```r ipf_rec$var_info ``` ``` ## # A tibble: 13 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 sex nominal predictor original ## 2 event nominal predictor original ## 3 equipment nominal predictor original ## 4 age numeric predictor original ## 5 division nominal predictor original ## 6 bodyweight_kg numeric predictor original ## 7 meet_name nominal predictor original ## 8 year numeric predictor original ## 9 month numeric predictor original ## 10 dayofweek numeric predictor original ## 11 age2 numeric predictor original ## 12 year2 numeric predictor original ## 13 best3bench_kg numeric outcome original ``` --- ```r ipf_rec$levels$meet_name ``` ``` ## $values ## [1] "3rd World University Powerlifting Cup" ## [2] "48th World Open Championships" ## [3] "6th World Classic Championships" ## [4] "Disabled Bench Press World Championships" ## [5] "Men's World Powerlifting Championships" ## [6] "Reykjavík International Games" ## [7] "Student's World Cup" ## [8] "University Powerlifting Cup" ## [9] "Women's World Powerlifting Championship" ## [10] "Women's World Powerlifting Championships" ## [11] "World Bench Press Championships" ## [12] "World Classic Bench Press Championships" ## [13] "World Classic Powerlifting Championships" ## [14] "World Classic Powerlifting Cup" ## [15] "World Disabled Bench Press Championships" ## [16] "World Games" ## [17] "World Juniors & Sub-Juniors Championships" ## [18] "World Juniors Powerlifting Championships" ## [19] "World Masters Bench Press Championships" ## [20] "World Masters Championships" ## [21] "World Masters Powerlifting Championships" ## [22] "World Men's Powerlifting Championship" ## [23] "World Open Bench Press Championships" ## [24] "World Open Championships" ## [25] "World Powerlifting Championships" ## [26] "World Students Cup" ## [27] "World Sub-Juniors & Juniors Bench Press Championships" ## [28] "World Sub-Juniors & Juniors Powerlifting Championships" ## [29] "World Sub-Juniors Powerlifting Championships" ## [30] "World University Powerlifting Cup" ## ## $ordered ## [1] FALSE ## ## $factor ## [1] TRUE ``` --- ```r ipf_rec$steps[[1]]$means ``` ``` ## age bodyweight_kg year month dayofweek ## 3.661433e+01 8.146179e+01 2.006976e+03 7.753029e+00 3.576005e+00 ## age2 year2 best3bench_kg ## 1.544206e+03 4.028041e+06 1.481720e+02 ``` ```r ipf_rec$steps[[1]]$sds ``` ``` ## age bodyweight_kg year month dayofweek ## 14.269106 24.865236 9.326106 2.716649 1.740679 ## age2 year2 best3bench_kg ## 1193.043744 37383.445309 60.975633 ``` --- And then we `bake()` (or [`juice()`](https://tidymodels.github.io/recipes/reference/juice.html)): ```r ipf_tr2 <- ipf_rec %>% bake(ipf_tr) ipf_te2 <- ipf_rec %>% bake(ipf_te) glue("mean of age in orig training: {format(mean(ipf_tr$age), digits = 3)}, sd: {format(sd(ipf_tr$age), digits = 3)} mean of age in baked training: {format(mean(ipf_tr2$age), digits = 0)}, sd: {format(sd(ipf_tr2$age), digits = 3)}") ``` ``` ## mean of age in orig training: 36.6, sd: 14.3 ## mean of age in baked training: 0, sd: 1 ``` ```r glue("mean of age in orig testing: {format(mean(ipf_te$age), digits = 3)}, sd: {format(sd(ipf_te$age), digits = 3)} mean of age in baked testing: {format(mean(ipf_te2$age), digits = 0)}, sd: {format(sd(ipf_te2$age), digits = 3)}") ``` ``` ## mean of age in orig testing: 36.6, sd: 14.3 ## mean of age in baked testing: 0, sd: 1 ``` --- Or you can do it all in a single pipe: ```r ipf_rec <- recipe(best3bench_kg ~ ., data = ipf_tr) %>% step_normalize(all_numeric()) %>% prep(ipf_tr) ipf_tr2 <- ipf_rec %>% bake(ipf_tr) ipf_te2 <- ipf_rec %>% bake(ipf_te) glue("mean of age in orig training: {format(mean(ipf_tr$age), digits = 3)}, sd: {format(sd(ipf_tr$age), digits = 3)} mean of age in baked training: {format(mean(ipf_tr2$age), digits = 0)}, sd: {format(sd(ipf_tr2$age), digits = 2)}") ``` ``` ## mean of age in orig training: 36.6, sd: 14.3 ## mean of age in baked training: 0, sd: 1 ``` ```r glue("mean of age in orig testing: {format(mean(ipf_te$age), digits = 3)}, sd: {format(sd(ipf_te$age), digits = 3)} mean of age in baked testing: {format(mean(ipf_te2$age), digits = 0)}, sd: {format(sd(ipf_te2$age), digits = 3)}") ``` ``` ## mean of age in orig testing: 36.6, sd: 14.3 ## mean of age in baked testing: 0, sd: 1 ``` --- ### Modeling For now let us use the original `ipf_tr` data. Functions `linear_reg()` and `set_engine()` are from the `parsnip` package: ```r mod_ridge_spec <- linear_reg(mixture = 0, penalty = 0.001) %>% set_engine(engine = "glmnet") mod_ridge_spec ``` ``` ## Linear Regression Model Specification (regression) ## ## Main Arguments: ## penalty = 0.001 ## mixture = 0 ## ## Computational engine: glmnet ``` --- ```r mod_ridge <- mod_ridge_spec %>% fit(best3bench_kg ~ ., data = ipf_tr) mod_ridge ``` ``` ## parsnip model object ## ## Fit in: 51ms ## Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~0) ## ## Df %Dev Lambda ## 1 51 0.00000 42890 ## 2 51 0.00398 39080 ## 3 51 0.00436 35610 ## 4 51 0.00478 32450 ## 5 51 0.00525 29560 ## 6 51 0.00576 26940 ## 7 51 0.00632 24540 ## 8 51 0.00693 22360 ## 9 51 0.00760 20380 ## 10 51 0.00833 18570 ## 11 51 0.00914 16920 ## 12 51 0.01002 15410 ## 13 51 0.01098 14050 ## 14 51 0.01204 12800 ## 15 51 0.01320 11660 ## 16 51 0.01447 10620 ## 17 51 0.01586 9681 ## 18 51 0.01737 8821 ## 19 51 0.01904 8037 ## 20 51 0.02085 7323 ## 21 51 0.02284 6673 ## 22 51 0.02501 6080 ## 23 51 0.02738 5540 ## 24 51 0.02997 5048 ## 25 51 0.03279 4599 ## 26 51 0.03587 4191 ## 27 51 0.03923 3818 ## 28 51 0.04289 3479 ## 29 51 0.04687 3170 ## 30 51 0.05120 2888 ## 31 51 0.05591 2632 ## 32 51 0.06102 2398 ## 33 51 0.06657 2185 ## 34 51 0.07258 1991 ## 35 51 0.07908 1814 ## 36 51 0.08612 1653 ## 37 51 0.09371 1506 ## 38 51 0.10190 1372 ## 39 51 0.11070 1250 ## 40 51 0.12020 1139 ## 41 51 0.13030 1038 ## 42 51 0.14120 946 ## 43 51 0.15280 862 ## 44 51 0.16520 785 ## 45 51 0.17830 716 ## 46 51 0.19220 652 ## 47 51 0.20700 594 ## 48 51 0.22250 541 ## 49 51 0.23880 493 ## 50 51 0.25590 449 ## 51 51 0.27370 409 ## 52 51 0.29230 373 ## 53 51 0.31140 340 ## 54 51 0.33120 310 ## 55 51 0.35150 282 ## 56 51 0.37230 257 ## 57 51 0.39340 234 ## 58 51 0.41470 214 ## 59 51 0.43620 194 ## 60 51 0.45780 177 ## 61 51 0.47920 162 ## 62 51 0.50050 147 ## 63 51 0.52150 134 ## 64 51 0.54200 122 ## 65 51 0.56200 111 ## 66 51 0.58140 101 ## 67 51 0.60010 92 ## 68 51 0.61800 84 ## 69 51 0.63500 77 ## 70 51 0.65110 70 ## 71 51 0.66630 64 ## 72 51 0.68050 58 ## 73 51 0.69370 53 ## 74 51 0.70600 48 ## 75 51 0.71730 44 ## 76 51 0.72770 40 ## 77 51 0.73720 36 ## 78 51 0.74580 33 ## 79 51 0.75360 30 ## 80 51 0.76060 28 ## 81 51 0.76700 25 ## 82 51 0.77260 23 ## 83 51 0.77760 21 ## 84 51 0.78210 19 ## 85 51 0.78610 17 ## 86 51 0.78960 16 ## 87 51 0.79270 14 ## 88 51 0.79540 13 ## 89 51 0.79780 12 ## 90 51 0.79990 11 ## 91 51 0.80180 10 ## 92 51 0.80340 9 ## 93 51 0.80480 8 ## 94 51 0.80610 7 ## 95 51 0.80720 7 ## 96 51 0.80810 6 ## 97 51 0.80890 6 ## 98 51 0.80970 5 ## 99 51 0.81030 5 ## 100 51 0.81090 4 ``` --- In a single pipe: ```r mod_lasso <- linear_reg(mixture = 1, penalty = 0.001) %>% set_engine(engine = "glmnet") %>% fit(best3bench_kg ~ ., data = ipf_tr) mod_lasso ``` ``` ## parsnip model object ## ## Fit in: 50ms ## Call: glmnet::glmnet(x = as.matrix(x), y = y, family = "gaussian", alpha = ~1) ## ## Df %Dev Lambda ## 1 0 0.00000 42.890 ## 2 1 0.08402 39.080 ## 3 2 0.16380 35.610 ## 4 2 0.24350 32.450 ## 5 2 0.30960 29.560 ## 6 2 0.36450 26.940 ## 7 2 0.41000 24.540 ## 8 2 0.44790 22.360 ## 9 2 0.47930 20.380 ## 10 2 0.50540 18.570 ## 11 2 0.52700 16.920 ## 12 2 0.54500 15.410 ## 13 2 0.55990 14.050 ## 14 2 0.57230 12.800 ## 15 4 0.59030 11.660 ## 16 4 0.61170 10.620 ## 17 6 0.63330 9.681 ## 18 6 0.65230 8.821 ## 19 6 0.66810 8.037 ## 20 7 0.68240 7.323 ## 21 7 0.69490 6.673 ## 22 8 0.70840 6.080 ## 23 8 0.72340 5.540 ## 24 8 0.73580 5.048 ## 25 9 0.74620 4.599 ## 26 9 0.75480 4.191 ## 27 12 0.76240 3.818 ## 28 12 0.76930 3.479 ## 29 13 0.77540 3.170 ## 30 13 0.78060 2.888 ## 31 13 0.78500 2.632 ## 32 14 0.78870 2.398 ## 33 15 0.79200 2.185 ## 34 15 0.79490 1.991 ## 35 16 0.79740 1.814 ## 36 18 0.79950 1.653 ## 37 19 0.80150 1.506 ## 38 19 0.80320 1.372 ## 39 19 0.80460 1.250 ## 40 19 0.80580 1.139 ## 41 19 0.80680 1.038 ## 42 21 0.80760 0.946 ## 43 24 0.80840 0.862 ## 44 24 0.80920 0.785 ## 45 25 0.80990 0.716 ## 46 26 0.81040 0.652 ## 47 26 0.81090 0.594 ## 48 27 0.81140 0.541 ## 49 27 0.81170 0.493 ## 50 28 0.81200 0.449 ## 51 29 0.81230 0.409 ## 52 30 0.81250 0.373 ## 53 30 0.81270 0.340 ## 54 31 0.81280 0.310 ## 55 34 0.81300 0.282 ## 56 35 0.81310 0.257 ## 57 36 0.81320 0.234 ## 58 38 0.81330 0.214 ## 59 40 0.81340 0.194 ## 60 40 0.81360 0.177 ## 61 40 0.81410 0.162 ## 62 40 0.81450 0.147 ## 63 42 0.81490 0.134 ## 64 44 0.81520 0.122 ## 65 41 0.81550 0.111 ## 66 41 0.81570 0.101 ## 67 42 0.81590 0.092 ## 68 42 0.81620 0.084 ## 69 44 0.81630 0.077 ## 70 45 0.81650 0.070 ## 71 47 0.81660 0.064 ## 72 48 0.81670 0.058 ## 73 49 0.81690 0.053 ## 74 46 0.81700 0.048 ## 75 47 0.81700 0.044 ## 76 47 0.81710 0.040 ## 77 49 0.81710 0.036 ## 78 48 0.81720 0.033 ## 79 48 0.81720 0.030 ## 80 49 0.81730 0.028 ## 81 50 0.81730 0.025 ## 82 50 0.81740 0.023 ## 83 50 0.81740 0.021 ## 84 50 0.81740 0.019 ``` --- Can also use `fit_xy()` a-la `sklearn`: ```r mod_rf <- rand_forest(mode = "regression", mtry = 4, trees = 50, min_n = 30) %>% set_engine("ranger") %>% fit_xy(x = ipf_tr[, -7], y = ipf_tr$best3bench_kg) mod_rf ``` ``` ## parsnip model object ## ## Fit in: 720msRanger result ## ## Call: ## ranger::ranger(formula = formula, data = data, mtry = ~4, num.trees = ~50, min.node.size = ~30, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1)) ## ## Type: Regression ## Number of trees: 50 ## Sample size: 19229 ## Number of independent variables: 12 ## Mtry: 4 ## Target node size: 30 ## Variable importance mode: none ## Splitrule: variance ## OOB prediction error (MSE): 560.9305 ## R squared (OOB): 0.8491322 ``` --- Notice how easy it is to get the model's results in a tidy way using the `tidy()` function: ```r tidy(mod_ridge) ``` ``` ## # A tibble: 5,200 x 5 ## term step estimate lambda dev.ratio ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1 1.48e+ 2 42893. 2.58e-36 ## 2 sexM 1 8.47e-35 42893. 2.58e-36 ## 3 eventSB 1 -8.25e-36 42893. 2.58e-36 ## 4 eventSBD 1 -2.53e-35 42893. 2.58e-36 ## 5 equipmentSingle-ply 1 2.93e-35 42893. 2.58e-36 ## 6 age 1 -5.11e-37 42893. 2.58e-36 ## 7 divisionJuniors 1 -5.96e-36 42893. 2.58e-36 ## 8 divisionLight 1 -1.34e-35 42893. 2.58e-36 ## 9 divisionMasters 1 1 -9.26e-37 42893. 2.58e-36 ## 10 divisionMasters 2 1 -1.42e-35 42893. 2.58e-36 ## # ... with 5,190 more rows ``` --- ### Predicting ```r results_test <- mod_ridge %>% predict(new_data = ipf_te, penalty = 0.001) %>% mutate( truth = ipf_te$best3bench_kg, method = "Ridge" ) %>% bind_rows(mod_lasso %>% predict(new_data = ipf_te) %>% mutate( truth = ipf_te$best3bench_kg, method = "Lasso" )) %>% bind_rows(mod_rf %>% predict(new_data = ipf_te) %>% mutate( truth = ipf_te$best3bench_kg, method = "RF" )) dim(results_test) ``` ``` ## [1] 38454 3 ``` --- ### Comparing Models The package `yardstick` has tons of performance [metrics](https://tidymodels.github.io/yardstick/articles/metric-types.html): ```r results_test %>% group_by(method) %>% yardstick::rmse(truth = truth, estimate = .pred) ``` ``` ## # A tibble: 3 x 4 ## method .metric .estimator .estimate ## <chr> <chr> <chr> <dbl> ## 1 Lasso rmse standard 26.4 ## 2 RF rmse standard 23.7 ## 3 Ridge rmse standard 26.9 ``` --- ```r results_test %>% ggplot(aes(.pred, truth)) + geom_point(color = "red", alpha = 0.5) + geom_abline(slope = 1, intercept = 0) + facet_wrap(~ method) + theme_bw() ``` <img src="images/Tidymodels-Pred-vs-True-1.png" width="100%" /> --- ### Tuning Define your model spec, using `tune()` from the `tune` package for a parameter you wish to tune: ```r library(tune) mod_rf_spec <- rand_forest(mode = "regression", mtry = tune(), min_n = tune(), trees = 100) %>% set_engine("ranger") ``` --- Define the `grid` on which you train your params, with the `dials` package: ```r rf_grid <- grid_regular(mtry(range(2, 10)), min_n(range(10, 30)), levels = c(5, 3)) rf_grid ``` ``` ## # A tibble: 15 x 2 ## mtry min_n ## <int> <int> ## 1 2 10 ## 2 4 10 ## 3 6 10 ## 4 8 10 ## 5 10 10 ## 6 2 20 ## 7 4 20 ## 8 6 20 ## 9 8 20 ## 10 10 20 ## 11 2 30 ## 12 4 30 ## 13 6 30 ## 14 8 30 ## 15 10 30 ``` --- Split your data into a few folds for Cross Validation with `vfold_cv()` from the `rsample` package: ```r cv_splits <- vfold_cv(ipf_tr, v = 5) cv_splits ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 x 2 ## splits id ## <named list> <chr> ## 1 <split [15.4K/3.8K]> Fold1 ## 2 <split [15.4K/3.8K]> Fold2 ## 3 <split [15.4K/3.8K]> Fold3 ## 4 <split [15.4K/3.8K]> Fold4 ## 5 <split [15.4K/3.8K]> Fold5 ``` --- Now perform cross validation with `tune_grid()` from the `tune` package: ```r tune_res <- tune_grid(recipe(best3bench_kg ~ ., data = ipf_tr), model = mod_rf_spec, resamples = cv_splits, grid = rf_grid, metrics = metric_set(rmse)) tune_res ``` ``` ## # 5-fold cross-validation ## # A tibble: 5 x 4 ## splits id .metrics .notes ## * <list> <chr> <list> <list> ## 1 <split [15.4K/3.8K]> Fold1 <tibble [15 x 5]> <tibble [0 x 1]> ## 2 <split [15.4K/3.8K]> Fold2 <tibble [15 x 5]> <tibble [0 x 1]> ## 3 <split [15.4K/3.8K]> Fold3 <tibble [15 x 5]> <tibble [0 x 1]> ## 4 <split [15.4K/3.8K]> Fold4 <tibble [15 x 5]> <tibble [0 x 1]> ## 5 <split [15.4K/3.8K]> Fold5 <tibble [15 x 5]> <tibble [0 x 1]> ``` --- ```r tune_res$.metrics[[1]] ``` ``` ## # A tibble: 15 x 5 ## mtry min_n .metric .estimator .estimate ## <int> <int> <chr> <chr> <dbl> ## 1 2 10 rmse standard 25.4 ## 2 4 10 rmse standard 24.4 ## 3 6 10 rmse standard 24.6 ## 4 8 10 rmse standard 24.8 ## 5 10 10 rmse standard 24.9 ## 6 2 20 rmse standard 25.4 ## 7 4 20 rmse standard 24.3 ## 8 6 20 rmse standard 24.4 ## 9 8 20 rmse standard 24.5 ## 10 10 20 rmse standard 24.6 ## 11 2 30 rmse standard 25.5 ## 12 4 30 rmse standard 24.2 ## 13 6 30 rmse standard 24.3 ## 14 8 30 rmse standard 24.4 ## 15 10 30 rmse standard 24.4 ``` --- Collect the mean metric across folds: ```r estimates <- collect_metrics(tune_res) estimates ``` ``` ## # A tibble: 15 x 7 ## mtry min_n .metric .estimator mean n std_err ## <int> <int> <chr> <chr> <dbl> <int> <dbl> ## 1 2 10 rmse standard 24.7 5 0.197 ## 2 2 20 rmse standard 24.7 5 0.190 ## 3 2 30 rmse standard 24.8 5 0.192 ## 4 4 10 rmse standard 23.7 5 0.186 ## 5 4 20 rmse standard 23.6 5 0.192 ## 6 4 30 rmse standard 23.5 5 0.181 ## 7 6 10 rmse standard 23.9 5 0.191 ## 8 6 20 rmse standard 23.7 5 0.186 ## 9 6 30 rmse standard 23.6 5 0.186 ## 10 8 10 rmse standard 24.0 5 0.203 ## 11 8 20 rmse standard 23.7 5 0.185 ## 12 8 30 rmse standard 23.6 5 0.185 ## 13 10 10 rmse standard 24.1 5 0.206 ## 14 10 20 rmse standard 23.8 5 0.181 ## 15 10 30 rmse standard 23.7 5 0.183 ``` --- Choose best paramter: ```r estimates %>% mutate(min_n = factor(min_n)) %>% ggplot(aes(x = mtry, y = mean, color = min_n)) + geom_point() + geom_line() + labs(y = "Mean RMSE") + theme_classic() ``` <img src="images/Tidymodels-RMSE-Comp-1.png" width="100%" /> --- There are of course also methods for helping us choose best params and final model. ```r best_rmse <- tune_res %>% select_best(metric = "rmse", maximize = FALSE) best_rmse ``` ``` ## # A tibble: 1 x 2 ## mtry min_n ## <int> <int> ## 1 4 30 ``` See also `?select_by_one_std_err`. ```r mod_rf_final <- finalize_model(mod_rf_spec, best_rmse) mod_rf_final ``` ``` ## Random Forest Model Specification (regression) ## ## Main Arguments: ## mtry = 4 ## trees = 100 ## min_n = 30 ## ## Computational engine: ranger ``` --- ```r mod_rf_final %>% fit(best3bench_kg ~ ., data = ipf_tr) %>% predict(new_data = ipf_te) %>% mutate(truth = ipf_te$best3bench_kg) ``` ``` ## # A tibble: 12,818 x 2 ## .pred truth ## <dbl> <dbl> ## 1 62.5 62.5 ## 2 63.0 60 ## 3 65.3 65 ## 4 65.9 62.5 ## 5 72.5 67.5 ## 6 70.3 52.5 ## 7 81.0 82.5 ## 8 77.0 72.5 ## 9 76.5 62.5 ## 10 80.9 105 ## # ... with 12,808 more rows ``` --- class: section-slide # `infer`: Tidy Statistics --- ### Statistical Q1 Is there a relation between men and women and the type of equipment they use in 2019? Assume observations are independent. ```r sex_vs_equipment <- ipf_lifts %>% filter(year == 2019) %>% select(sex, equipment) %>% table() sex_vs_equipment ``` ``` ## equipment ## sex Raw Single-ply ## F 678 186 ## M 854 287 ``` ```r prop.table(sex_vs_equipment, margin = 1) ``` ``` ## equipment ## sex Raw Single-ply ## F 0.7847222 0.2152778 ## M 0.7484663 0.2515337 ``` --- ### Statistical Q2 Is there a difference between men and women age in 2019? Assume observations are independent. ```r ipf_lifts %>% filter(year == 2019) %>% group_by(sex) %>% summarise(avg = mean(age), sd = sd(age), n = n()) ``` ``` ## # A tibble: 2 x 4 ## sex avg sd n ## <fct> <dbl> <dbl> <int> ## 1 F 36.0 15.5 864 ## 2 M 38.8 16.7 1141 ``` --- ### Same Problem! Varied interface, varied output. ```r prop.test(sex_vs_equipment[,1], rowSums(sex_vs_equipment)) ``` ``` ## ## 2-sample test for equality of proportions with continuity ## correction ## ## data: sex_vs_equipment[, 1] out of rowSums(sex_vs_equipment) ## X-squared = 3.3872, df = 1, p-value = 0.0657 ## alternative hypothesis: two.sided ## 95 percent confidence interval: ## -0.001975717 0.074487646 ## sample estimates: ## prop 1 prop 2 ## 0.7847222 0.7484663 ``` --- ```r t.test(age ~ sex, data = ipf_lifts %>% filter(year == 2019)) ``` ``` ## ## Welch Two Sample t-test ## ## data: age by sex ## t = -3.8797, df = 1921.8, p-value = 0.0001081 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -4.228319 -1.388844 ## sample estimates: ## mean in group F mean in group M ## 35.97801 38.78659 ``` --- ### The `generics::tidy()` Approach (Also available when you load several other packages, like `broom` and `yardstick`) ```r tidy(prop.test(sex_vs_equipment[,1], rowSums(sex_vs_equipment))) ``` ``` ## # A tibble: 1 x 9 ## estimate1 estimate2 statistic p.value parameter conf.low conf.high method ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> ## 1 0.785 0.748 3.39 0.0657 1 -0.00198 0.0745 2-sam~ ## # ... with 1 more variable: alternative <chr> ``` ```r tidy(t.test(age ~ sex, data = ipf_lifts %>% filter(year == 2019))) ``` ``` ## # A tibble: 1 x 10 ## estimate estimate1 estimate2 statistic p.value parameter conf.low ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 -2.81 36.0 38.8 -3.88 1.08e-4 1922. -4.23 ## # ... with 3 more variables: conf.high <dbl>, method <chr>, ## # alternative <chr> ``` --- ### The `infer` Approach > infer implements an expressive grammar to perform statistical inference that coheres with the tidyverse design framework 4 main verbs for a typical flow: * `specify()` - dependent/independent variables, formula * `hypothesize()` - declare the null hypothesis * `generate()` - generate data reflecting the null hypothesis (the permutation/bootstrap approach) * `calculate()` - calculate a distribution of statistics from the generated data, from which you can extract conclusion based on a p-value for example --- ### `infer` Diff in Proportions Test Get the observed statistic (here manually in order to not confuse you, there *is* a way via `infer`): ```r # equipment # sex Raw Single-ply # F 678 186 # M 854 287 p_F <- sex_vs_equipment[1, 1] / (sum(sex_vs_equipment[1, ])) p_M <- sex_vs_equipment[2, 1] / (sum(sex_vs_equipment[2, ])) obs_diff <- p_F - p_M obs_diff ``` ``` ## [1] 0.03625596 ``` --- Get distribution of the difference in proportions under null hypothesis ```r diff_null_perm <- ipf_lifts %>% filter(year == 2019) %>% specify(equipment ~ sex, success = "Raw") %>% hypothesize(null = "independence") %>% generate(reps = 200, type = "permute") %>% calculate(stat = "diff in props", order = c("F", "M")) diff_null_perm ``` ``` ## # A tibble: 200 x 2 ## replicate stat ## <int> <dbl> ## 1 1 0.00575 ## 2 2 0.0363 ## 3 3 -0.00645 ## 4 4 0.0118 ## 5 5 -0.0532 ## 6 6 0.00982 ## 7 7 0.00168 ## 8 8 -0.0105 ## 9 9 0.0403 ## 10 10 0.0200 ## # ... with 190 more rows ``` --- Visualize the permuted difference null distribution and the p-value ```r visualize(diff_null_perm) + shade_p_value(obs_stat = obs_diff, direction = "two_sided") ``` <img src="images/Diff-in-Props-Null-1.png" width="50%" /> --- Get the actual p-value: ```r diff_null_perm %>% get_p_value(obs_stat = obs_diff, direction = "two_sided") ``` ``` ## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0.1 ``` --- ### `infer` t Test (independent samples) Get the observed statistic (here via `infer`): ```r obs_t <- ipf_lifts %>% filter(year == 2019) %>% specify(age ~ sex) %>% calculate(stat = "t", order = c("F", "M")) obs_t ``` ``` ## # A tibble: 1 x 1 ## stat ## <dbl> ## 1 -3.88 ``` --- Get distribution of the t statistic under null hypothesis ```r t_null_perm <- ipf_lifts %>% filter(year == 2019) %>% specify(age ~ sex) %>% hypothesize(null = "independence") %>% generate(reps = 100, type = "permute") %>% calculate(stat = "t", order = c("F", "M")) t_null_perm ``` ``` ## # A tibble: 100 x 2 ## replicate stat ## <int> <dbl> ## 1 1 -0.658 ## 2 2 -0.923 ## 3 3 0.0529 ## 4 4 1.67 ## 5 5 0.802 ## 6 6 1.27 ## 7 7 1.20 ## 8 8 1.31 ## 9 9 0.0975 ## 10 10 -2.20 ## # ... with 90 more rows ``` --- Visualize the permuted t statistic null distribution and the two-sided p-value ```r visualize(t_null_perm) + shade_p_value(obs_stat = obs_t, direction = "two_sided") ``` <img src="images/T-Null-1.png" width="50%" /> --- Get the actual p-value: ```r t_null_perm %>% get_p_value(obs_stat = obs_t, direction = "two_sided") ``` ``` ## # A tibble: 1 x 1 ## p_value ## <dbl> ## 1 0 ```