class: logo-slide --- class: title-slide ## Feature Engineering ### Applications of Data Science - Class 15 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2020-05-29 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # What can you possibly teach us about Feature Engineering? --- Our two previous modeling attempts, with [Modeling in the Tidyverse](https://dsapps-2020.github.io/Class_Slides/u1_d05-modeling/u1_d05-modeling.html) and in [The Trees](https://dsapps-2020.github.io/Class_Slides/u3_d01-trees/u3_d01-trees.html) where problematic. - Using a "sinked" split, not using `set.seed()` - reproducible but wasteful - "Lazy" feature engineering, not using the entire data, not creative - No interactions - Test once, on the same validation set, for everything - Data "leakage" from training set to validation set (when imputing missing values) - Sometimes tuning params, sometimes not - No pre-processing of features, if at all by a hunch - Not thinking about Multicollinearity where it hurts (regession) - Almost no treatment of class imbalance in classification - Chaotic feature selection --- Reminder: We have two tasks, Regression and Classification: 1. Predict the income of an OKCupid user (from those who reported income) 2. Predict if an OKCupid user is a Cats or Dogs person, where: ```r cats_categories <- c("has cats", "likes cats", "dislikes dogs and has cats", "dislikes dogs and likes cats") dogs_categories <- c("has dogs", "likes dogs", "has dogs and dislikes cats", "likes dogs and dislikes cats") ``` And 80% of users are Dogs people, so we have class imbalance. --- On our validation set we got... - Regression Task Model | RMSE | R ------------|------|-- LM | 0.35 | 0.5 LM-L2 | 0.35 | 0.5 LM-L1 | 0.35 | 0.5 CART | 0.36 | 0.45 Bagging | 0.37 | 0.44 RF | 0.35 | 0.49 GBT-gbm | 0.35 | 0.5 GBT-XGBOOST | 0.35 | 0.5 --- - Classification Task Model | AUC | Accuracy | Recall Dogs | Precision Dogs | Recall Cats | Precision Cats ----------|-----|----------|-------------|----------------|-------------|---------- GLM | 0.74| 0.68 | 0.68 | 0.92 | 0.66 | 0.26 GLM-EN | 0.74| 0.67 | 0.67 | 0.92 | 0.67 | 0.26 CART | 0.67| 0.75 | 0.79 | 0.9 | 0.49 | 0.28 GBT+Cutoff| 0.74| 0.70 | 0.71 | 0.9 | 0.65 | 0.32 And we were not impressed. --- class: section-slide # Resampling/Cross-Validation --- ### Resampling done right The main goal of resampling: No Surprises. (By generating multiple versions of our data, and watching how our model behaves on unseen fresh data again. And again. And again.) - Resampling is a key principle when developing a ML solution - All ML practitioners are aware of the menace of overfitting - But resampling should also be made for: - Feature engineering and transformation .font80percent[(should I use the Box-Cox or Yeo-Johnson transformation?)] - Feature selection .font80percent[(Should I use 2nd-order intearctions here?)] - Choosing imputation strategy .font80percent[(Simply use the median or regress?)] - Tuning parameters .font80percent[(100 or 500 trees?)] - Model selection .font80percent[(LM or MARS?)] --- - A good resampling scheme will: - Have more than one train/valid/test samples!* - Be stratified (usually by dependent variable distribution) - Use a (paired/repeated/within) statistical test on resamples to reach a decision, or at least a proper plot - Avoid data "leakage" and not include the test set! - Encompass all steps in model building - Ideally (when data is Big) perform each step on a different part *of the training data* .insight[ 💡 * How many resamples? `-` The higher the number the lower the variance of performance metric `-` The higher the number the less data used in each resample, the higher the bias of performance metric ] --- ### Ideally (yes, I know) <img src = "images/resampling.png" style="width: 100%"> --- ### And don't forget Only once we decide on the final pre-processing, features, model and parameters, will we fit the model two more times: 1. On the entire training data, then test on the testing data and get a realistic metric of performance 2. On the *entire* data, right before deployment to production --- ### Many approaches to resampling But two most common: 1. V-fold Cross Validation (+ optionally repeating n times) 2. Bootstrap Samples --- #### V-fold Cross Validation <img src = "images/cv_viz.png" style="width: 90%"> .font80percent[ [Applied Machine Learning Workshop at Rstudio-conf 2019 / Max Kuhn](https://github.com/topepo/rstudio-conf-2019) ] --- #### Bootstrap Samples <img src = "images/boot_viz.png" style="width: 90%"> .insight[ 💡 What is the probability of entering a fold (at least once)? ] --- class: section-slide # Let's Start from the Very Beginning --- #### OkCupid from scratch ```r okcupid <- read_csv("../data/okcupid.csv.zip") cats_categories <- c("has cats", "likes cats", "dislikes dogs and has cats", "dislikes dogs and likes cats") dogs_categories <- c("has dogs", "likes dogs", "has dogs and dislikes cats", "likes dogs and dislikes cats") okcupid <- okcupid %>% mutate(pets = case_when( pets %in% cats_categories ~ "cats", pets %in% dogs_categories ~ "dogs", TRUE ~ NA_character_)) %>% drop_na(pets) %>% na_if(-1) %>% select(-last_online) okcupid[which(is.na(okcupid$height)), "height"] <- 70 # you don't wanna know ``` --- Using `rsample::initial_split(okcupid, strata = pets)` would have been better, but if we do it now we won't have the same test data to compare. ```r idx <- read_rds("../data/okcupid3_idx.rda") train_idx <- idx$train_idx valid_idx <- idx$valid_idx test_idx <- idx$test_idx okcupid_train <- okcupid[train_idx, ] okcupid_valid <- okcupid[valid_idx, ] okcupid_test <- okcupid[test_idx, ] ``` And let us bind `okcupid_train` and `okcupid_valid` together because we are going to use resampling anyway: ```r okcupid_train <- bind_rows(okcupid_train, okcupid_valid) glue("train no. of rows: {nrow(okcupid_train)} test no. of rows: {nrow(okcupid_test)}") ``` ``` ## train no. of rows: 12729 ## test no. of rows: 4000 ``` --- See that the class balance is more or less the same: ```r okcupid_train %>% count(pets) %>% mutate(pct = n / sum(n)) ``` ``` ## # A tibble: 2 x 3 ## pets n pct ## <chr> <int> <dbl> ## 1 cats 2097 0.165 ## 2 dogs 10632 0.835 ``` ```r okcupid_test %>% count(pets) %>% mutate(pct = n / sum(n)) ``` ``` ## # A tibble: 2 x 3 ## pets n pct ## <chr> <int> <dbl> ## 1 cats 693 0.173 ## 2 dogs 3307 0.827 ``` Good enough .font80percent[(we could also drop a few cat people at random if they're too much in the test group)]. We stash our test data aside, kiss it good bye, we shall not see it again before the final assessment of our model. --- #### That manager again. Manager mentions casually: oh, and by the way, this model thingy of yours. Yeah, we need it to run in, like 10 milliseconds per user. Marketing team needs that, don't ask me why, and we can't have the users wait. You: penalized logistic regression it is 😏 No one mentioned cats/dogs recall/precision to be of any particular importance, so we decide to maximize AUC and take care of cutoff later. --- We explore the training data .font80percent[(skipping here at least a couple of hours in which we actually explore the data, I'm assuming we already know quite a bit!)]: ```r glimpse(okcupid_train) ``` ``` ## Rows: 12,729 ## Columns: 30 ## $ age <dbl> 39, 19, 27, 24, 29, 24, 25, 25, 21, 25, 28, 25, 21... ## $ body_type <chr> "fit", "curvy", NA, "fit", "fit", "fit", "athletic... ## $ diet <chr> "mostly anything", "mostly vegetarian", "mostly an... ## $ drinks <chr> "socially", "rarely", NA, "socially", "socially", ... ## $ drugs <chr> "never", "sometimes", NA, NA, "never", NA, "never"... ## $ education <chr> "graduated from college/university", NA, NA, "grad... ## $ essay0 <chr> "i am a very positive person who loves to laugh an... ## $ essay1 <chr> "i work in the health and wellness industry and i ... ## $ essay2 <chr> "making people smile or laugh.", "ummmm everything... ## $ essay3 <chr> "my eyes", "my boobs. definitely.<br />\nand the f... ## $ essay4 <chr> "i enjoy malcolm gladwell, blink, tipping point, o... ## $ essay5 <chr> "water,steak,chocolate,comfy jeans,transportation,... ## $ essay6 <chr> "food, naps, women, and health", "caffeine, my bab... ## $ essay7 <chr> "doing whatever has come up or has been planned. d... ## $ essay8 <chr> "i am totally deaf in my right ear", "i love anima... ## $ essay9 <chr> "you enjoy having fun, love to laugh, have a posit... ## $ ethnicity <chr> "white", "white", NA, "white", "asian, pacific isl... ## $ height <dbl> 70, 60, 72, 65, 69, 72, 70, 67, 66, 67, 63, 63, 62... ## $ income <dbl> 8e+04, NA, NA, NA, 3e+04, 5e+04, 1e+05, NA, 2e+04,... ## $ job <chr> "medicine / health", "sales / marketing / biz dev"... ## $ location <chr> "mountain view, california", "pleasant hill, calif... ## $ offspring <chr> "doesn’t have kids, but might want them", NA... ## $ orientation <chr> "straight", "straight", "straight", "straight", "s... ## $ pets <chr> "dogs", "dogs", "dogs", "dogs", "dogs", "dogs", "d... ## $ religion <chr> "atheism", "agnosticism and laughing about it", "a... ## $ sex <chr> "m", "f", "m", "f", "m", "m", "m", "m", "f", "m", ... ## $ sign <chr> "capricorn", "libra and it matters a lot", NA, NA,... ## $ smokes <chr> "no", "yes", NA, "no", "no", NA, "no", "sometimes"... ## $ speaks <chr> "english", "english (fluently)", "english", "engli... ## $ status <chr> "single", "seeing someone", "single", "single", "s... ``` --- ```r library(naniar) vis_miss(okcupid_train %>% sample_frac(0.2) %>% select(-starts_with("essay"))) ``` <img src="images/Ok_Naniar-1.png" width="100%" /> --- #### Q & A Q: What type of variables do we have? A: Numeric: age, height income; Text: essays; Categorical: all others Q: Can our model(s) handle missing values? A: Probably gonna use `glmnet` so: NO. Q: Are missing values a serious issue here? A: Oh yes. `income` 80+% missing, `offspring` 50+% missing. Gonna have to take care of it. --- #### Q & A Q: Can our model(s) handle categorical values? A: Probably gonna use `glmnet` so: NO. Q: Categorical variables with many levels? A: Oh yes. `speaks` has over 2K. And there are those `essay`s. Q: Numeric variables skewed? A: Income, but all you need is log. Q: Classes imbalance scenario? A: Pretty imbalanced, 5 dogs people for every 1 cat person. Gonna have to take care of it. --- #### You have to have a strategy Strategy: Divide training data into 6 samples for 6 main decision/steps/tunings: 1. Missing data imputation 2. Class imbalance treatment 3. Feature Engineering for `essay` data 4. Looking into interactions 4. Model Tuning 5. Cutoff choice .insight[ 💡 We assume these decisions are additive or independent of each other. Does it have to be and what can we do if we suspect not? (Remember we are Statisticians!) ] --- Looking at our strategy, we decide the 12.7K observations need not split evenly between steps. - Cutoff choice is "peanuts" compared to the other decisions, let's decide in advance that at this stage we would fit the chosen model on ~12K previously seen observations and choose a cutoff by looking at the remaining ~700 observations - Missing data imputation, class imbalance and interaction steps will each be given 2K observations - Model tuning 3K observations - Text feature engineering and selection 3K observations - Total: 12,729 training observations --- We wish to make sure our samples are stratified by `pets`, so we'll use `initial_split()` in stages: ```r library(tidymodels) set.seed(42) ok_cutoff_split <- initial_split(okcupid_train, strata = pets, prop = (729 - 1)/12729) ok_tr_cutoff <- training(ok_cutoff_split) ok_train <- testing(ok_cutoff_split) set.seed(42) ok_missing_split <- initial_split(ok_train, strata = pets, prop = (2000 - 1)/12000) ok_tr_missing <- training(ok_missing_split) ok_train <- testing(ok_missing_split) set.seed(42) ok_imbalance_split <- initial_split(ok_train, strata = pets, prop = (2000 - 1)/10000) ok_tr_imbalance <- training(ok_imbalance_split) ok_train <- testing(ok_imbalance_split) ``` --- ```r set.seed(42) ok_interaction_split <- initial_split(ok_train, strata = pets, prop = (2000 - 1)/8000) ok_tr_interaction <- training(ok_interaction_split) ok_train <- testing(ok_interaction_split) set.seed(42) ok_tuning_split <- initial_split(ok_train, strata = pets, prop = (3000 - 1)/6000) ok_tr_tuning <- training(ok_tuning_split) ok_tr_eng <- testing(ok_tuning_split) ``` ``` ## # A tibble: 7 x 4 ## name df n_rows cats_rate ## <chr> <list> <dbl> <dbl> ## 1 missing <tibble [2,000 x 30]> 2000 0.165 ## 2 imbalance <tibble [2,000 x 30]> 2000 0.165 ## 3 engineering <tibble [3,000 x 30]> 3000 0.164 ## 4 interaction <tibble [2,000 x 30]> 2000 0.165 ## 5 tuning <tibble [3,000 x 30]> 3000 0.165 ## 6 cutoff <tibble [729 x 30]> 729 0.165 ## 7 test <tibble [4,000 x 30]> 4000 0.173 ``` --- class: section-slide # Dealing with Missing Data --- ### Missing Data .font80percent[We could fill an entire semester on this subject!] - What causes a missing datum? - Survey data - Merging data sources - Failure of measurement --- - What is the missing data mechanism? - See [Little & Rubin (2019)](https://www.amazon.com/Statistical-Analysis-Missing-Probability-Statistics-ebook/dp/B07Q25CNSD/ref=sr_1_1?crid=2P6ERH22YRKUL&dchild=1&keywords=statistical+analysis+with+missing+data&qid=1587103544&s=books&sprefix=statistical+ana%2Cstripbooks-intl-ship%2C286&sr=1-1) seminal work - MCAR: "on some days the weighing scale batteries drained and we couldn't take the mice weight" - MAR: "on the 2024 elections poll a massive Facebook campaign called all women to refuse to cooperate with the Seker-Sheker company due to allegations of sexual harassment by the company's head Dani Marom" - MNAR: "some of the drug addicts under study missed the final few meetings in the treatment group and their data were discarded" .insight[ 💡 How would you asses the missing data mechanism in your dataset? ] --- Exploring percent missingness ```r gg_miss_var(ok_tr_missing %>% select(-starts_with("essay")), show_pct = TRUE) ``` <img src="images/Ok-Miss1-1.png" width="100%" /> --- Exploring percent missingness by a predictor ```r gg_miss_var(ok_tr_missing %>% select(-starts_with("essay")), facet = sex, show_pct = TRUE) ``` <img src="images/Ok-Miss2-1.png" width="100%" /> --- Exploring missingness between a predictor and the response (categorical) variable ```r ok_tr_missing %>% mutate(income_missing = is.na(income)) %>% count(income_missing, pets) %>% pivot_wider(id_cols = pets, names_from = income_missing, values_from = n) %>% mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`)) ``` ``` ## # A tibble: 2 x 4 ## pets `FALSE` `TRUE` pct_missing ## <chr> <int> <int> <dbl> ## 1 cats 74 256 0.776 ## 2 dogs 305 1365 0.817 ``` --- But we can do better <img src="images/Ok-Miss3-1.png" width="100%" /> --- Exploring missingness between two (categorical) predictors ```r library(ggmosaic) ok_tr_missing %>% mutate(body_type_missing = is.na(body_type)) %>% ggplot() + geom_mosaic(aes(x = product(body_type_missing, status), fill = body_type_missing)) + labs(x = NULL, y = NULL) + theme_light() ``` <img src="images/Ok-Miss4-1.png" width="90%" /> --- You don't always need a plot... ```r ok_tr_missing %>% mutate(body_type_missing = is.na(body_type)) %>% count(body_type_missing, status) %>% pivot_wider(id_cols = status, names_from = body_type_missing, values_from = n) %>% mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`)) ``` ``` ## # A tibble: 4 x 4 ## status `FALSE` `TRUE` pct_missing ## <chr> <int> <int> <dbl> ## 1 available 46 8 0.148 ## 2 married 7 1 0.125 ## 3 seeing someone 42 5 0.106 ## 4 single 1738 153 0.0809 ``` --- The same with `body_type` missingness by `drugs`: ```r ok_tr_missing %>% mutate(body_type_missing = is.na(body_type), drugs_missing = is.na(drugs)) %>% count(body_type_missing, drugs_missing) %>% pivot_wider(id_cols = drugs_missing, names_from = body_type_missing, values_from = n) %>% mutate(pct_missing = `TRUE` / (`TRUE` + `FALSE`)) ``` ``` ## # A tibble: 2 x 4 ## drugs_missing `FALSE` `TRUE` pct_missing ## <lgl> <int> <int> <dbl> ## 1 FALSE 1440 122 0.0781 ## 2 TRUE 393 45 0.103 ``` --- But we can do better with a co-occurrence plot or a heatmap ```r ok_tr_missing %>% count(drugs, body_type) %>% ggplot(aes(drugs, body_type)) + geom_point(aes(size = n)) + theme_light() ``` <img src="images/Ok-Miss5-1.png" width="100%" /> --- Exploring missingness between numerical and categorical predictors ```r ok_tr_missing %>% mutate(bodytype_missing = is.na(body_type)) %>% ggplot(aes(bodytype_missing, age)) + geom_violin() + facet_wrap(. ~ sex) + theme_light() ``` <img src="images/Ok-Miss6-1.png" width="100%" /> --- Exploring missingness between two numerical predictors (borrowing `airquality` data) ```r ggplot(airquality, aes(Ozone, Solar.R)) + naniar::geom_miss_point() + theme_light() ``` <img src="images/Ok-Miss7-1.png" width="50%" /> --- When the data is large you could perform PCA on the binary missing/non-missing matrix: ```r binary_mat <- ok_tr_missing %>% select(-pets) %>% mutate_all(is.na) prcomp(binary_mat) %>% pluck("rotation") %>% as_tibble() %>% select(PC1, PC2) %>% bind_cols( tibble( predictor = colnames(binary_mat), pct_missing = binary_mat %>% colSums() / nrow(ok_tr_missing) ) ) %>% mutate(label = ifelse(PC1 < -0.3 | PC2 > 0.3, predictor, "")) %>% ggplot(aes(PC1, PC2)) + geom_point(aes(size = pct_missing), color = "lightblue", alpha = 0.5) + ggrepel::geom_text_repel(aes(label = label)) + theme_light() ``` --- <img src="images/Ok-Miss8-1.png" width="80%" /> --- #### Definitely not in MCAR land - Keep the data missing, drop it all or impute values? - Very much depends on the mechanism of missing data - "Drop it all" only if you're "data rich" and assume MCAR - Keep or impute: can your model handle missing values? - "It's not a bug, it's a feature!": a new category or an additional variable - Use resampling to test the best strategy! --- - How to impute? - Very much depends on the mechanism of missing data - Mean/median value if MCAR - kNN imputation - Iterative Regression/Classification algos (EM algorithm) - Use resampling to test the best strategy! - What do you do with missing values in the response variable? - Semi-supervised Learning - Active Learning --- Let's choose between 2 x 3 x 3 = 18 strategies: 1. Missing categorical features - adding an "unknown" category - imputing the most common category 2. The `income` feature: - Dropping it (it is 80% missing!!) - Log transofrmation, imputing the mean - Log transformation, kNN Imputation 3. The `essay` features: - Dropping them - Adding `essay` length where a missing `essay` has length 0 - Adding `essay` length where a missing `essay` has length 0 and adding "is_missing" feature to each `essay` --- Use `step_mutate_*()` or add what you need beforehand. ```r ok_tr_missing2 <- ok_tr_missing %>% select(-starts_with("essay")) %>% bind_cols( ok_tr_missing %>% transmute_at(vars(essay0:essay9), list("len" = ~ifelse(is.na(.x), 0, str_length(.x)))) ) %>% bind_cols( ok_tr_missing %>% transmute_at(vars(essay0:essay9), list("isna" = ~ifelse(is.na(.x), 1, 0))) ) %>% rename_at(vars(contains("_len")), list(~paste("len", gsub("_len", "", .), sep = "_") ) ) %>% rename_at(vars(contains("_isna")), list(~paste("isna", gsub("_isna", "", .), sep = "_") ) ) %>% mutate_if(is.character, as.factor) ``` --- Defining recipes .font80percent[(without `prep()`!)], let's start with 2 to simplify. ```r rec_miss_unk_none_none <- recipe(pets ~ ., data = ok_tr_missing2) %>% update_role(income, contains("essay"), new_role = "discarded") %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_unknown(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) rec_miss_mode_none_none <- recipe(pets ~ ., data = ok_tr_missing2) %>% update_role(income, contains("essay"), new_role = "discarded") %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) ``` .font80percent[(pretty sure `step_rm` is a way to discard predictors altogether)] --- Defining resamples ```r set.seed(42) cv_splits_missing <- vfold_cv(ok_tr_missing2, v = 10, strata = pets) cv_splits_missing ``` ``` ## # 10-fold cross-validation using stratification ## # A tibble: 10 x 2 ## splits id ## <named list> <chr> ## 1 <split [1.8K/200]> Fold01 ## 2 <split [1.8K/200]> Fold02 ## 3 <split [1.8K/200]> Fold03 ## 4 <split [1.8K/200]> Fold04 ## 5 <split [1.8K/200]> Fold05 ## 6 <split [1.8K/200]> Fold06 ## 7 <split [1.8K/200]> Fold07 ## 8 <split [1.8K/200]> Fold08 ## 9 <split [1.8K/200]> Fold09 ## 10 <split [1.8K/200]> Fold10 ``` --- Applying (two) recipes to all splits ```r cv_splits_missing$rec_miss_unk_none_none <- map( cv_splits_missing$splits, prepper, recipe = rec_miss_unk_none_none) cv_splits_missing$rec_miss_mode_none_none <- map( cv_splits_missing$splits, prepper, recipe = rec_miss_mode_none_none) map_dbl(cv_splits_missing$rec_miss_unk_none_none, ~nrow(.$term_info)) ``` ``` ## 1 2 3 4 5 6 7 8 9 10 ## 112 113 113 113 113 113 113 112 112 112 ``` ```r cv_splits_missing$rec_miss_unk_none_none[[1]]$term_info %>% filter(str_detect(variable, "speaks")) ``` ``` ## # A tibble: 4 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 speaks_english..fluently. numeric predictor derived ## 2 speaks_other1 numeric predictor derived ## 3 speaks_new numeric predictor derived ## 4 speaks_unknown numeric predictor derived ``` --- Defining the model specification ```r mod_lr <- logistic_reg() %>% set_engine("glmnet") ``` Fitting the model to all splits ```r fit_lr <- function(rec_obj) fit(mod_lr, pets ~ ., data = juice(rec_obj, all_predictors(), all_outcomes())) cv_splits_missing$mod_lr_unk <- map( cv_splits_missing$rec_miss_unk_none_none, fit_lr) cv_splits_missing$mod_lr_mode <- map( cv_splits_missing$rec_miss_mode_none_none, fit_lr) ``` --- Predicting on assessment data of each split ```r pred_lr <- function(split_obj, rec_obj, model_obj) { mod_data <- bake(rec_obj, new_data = assessment(split_obj), all_predictors(), all_outcomes()) out <- mod_data %>% select(pets) out$predicted <- predict(model_obj, mod_data, type = "prob", penalty = 0.0001)$.pred_cats out } cv_splits_missing$pred_unk <- pmap( lst( split_obj = cv_splits_missing$splits, rec_obj = cv_splits_missing$rec_miss_unk_none_none, model_obj = cv_splits_missing$mod_lr_unk ), pred_lr) cv_splits_missing$pred_mode <- pmap(lst( split_obj = cv_splits_missing$splits, rec_obj = cv_splits_missing$rec_miss_mode_none_none, model_obj = cv_splits_missing$mod_lr_mode ), pred_lr) ``` --- Measuring the AUC of each split ```r cv_splits_missing$step_unknown <- map_dfr( cv_splits_missing$pred_unk, roc_auc, pets, predicted)$.estimate cv_splits_missing$step_mode <- map_dfr( cv_splits_missing$pred_mode, roc_auc, pets, predicted)$.estimate ``` --- Comparing AUCs with a plot ```r cv_splits_missing %>% select(id, step_unknown, step_mode) %>% pivot_longer(cols = c(step_unknown, step_mode), names_to = "recipe", values_to = "auc") %>% ggplot(aes(recipe, auc, group = id, color = id)) + geom_line() + geom_point() + guides(color = FALSE) + theme_light() ``` <img src="images/Ok-Miss9-1.png" width="90%" /> --- Comparing AUCs with a statistical test (paired T test) ```r t.test(cv_splits_missing$step_unknown, cv_splits_missing$step_mode, paired = TRUE) ``` ``` ## ## Paired t-test ## ## data: cv_splits_missing$step_unknown and cv_splits_missing$step_mode ## t = -0.31928, df = 9, p-value = 0.7568 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -0.01364404 0.01026897 ## sample estimates: ## mean of the differences ## -0.001687534 ``` --- Let's move on to 2 x 2 recipes, adding 2 more recipes for treating the `income` variable with mean imputation, with `step_unknown()` and `step_modeimpute()`. ```r rec_miss_unk_mean_none <- recipe(pets ~ ., data = ok_tr_missing2) %>% update_role(contains("essay"), new_role = "discarded") %>% step_log(income, offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_unknown(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) rec_miss_mode_mean_none <- recipe(pets ~ ., data = ok_tr_missing2) %>% update_role(contains("essay"), new_role = "discarded") %>% step_log(income, offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) ``` --- Our approach with 2 recipes does not scale, and we don't need to keep all data just final AUC for each recipe. We are writing a `train_recipe()` function! ```r lst_recs <- list("unknown_none" = rec_miss_unk_none_none, "mode_none" = rec_miss_mode_none_none, "unknown_mean" = rec_miss_unk_mean_none, "mode_mean" = rec_miss_mode_mean_none) train_recipe <- function(rec, rec_name, splits) { splits_prepped <- map(splits, prepper, recipe = rec) splits_fit <- map(splits_prepped, fit_lr) splits_pred <- pmap( lst( split_obj = splits, rec_obj = splits_prepped, model_obj = splits_fit ), pred_lr ) res <- map_dfr(splits_pred, roc_auc, pets, predicted)$.estimate names(res) <- rec_name res } ``` --- Training multiple recipes, saving the AUC for each recipe, each split. ```r cv_splits_missing <- cv_splits_missing %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_missing$splits) ) ``` Processing the results into a nice table: ```r cv_res_missing <- cv_splits_missing %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) %>% separate(recipe, c("categorical", "income")) cv_res_missing ``` ``` ## # A tibble: 40 x 4 ## id categorical income AUC ## <chr> <chr> <chr> <dbl> ## 1 Fold01 unknown none 0.760 ## 2 Fold01 mode none 0.751 ## 3 Fold01 unknown mean 0.770 ## 4 Fold01 mode mean 0.760 ## 5 Fold02 unknown none 0.676 ## 6 Fold02 mode none 0.670 ## 7 Fold02 unknown mean 0.683 ## 8 Fold02 mode mean 0.675 ## 9 Fold03 unknown none 0.670 ## 10 Fold03 mode none 0.693 ## # ... with 30 more rows ``` --- Comparing AUCs with a plot ```r cv_res_missing %>% ggplot(aes(income, AUC, group = id, color = id)) + geom_line() + geom_line(data = cv_res_missing_meds, color = "black", alpha = 0.2, lwd = 2) + geom_point() + geom_point(data = cv_res_missing_meds, color = "black", alpha = 0.2, size = 5, pch = 17) + guides(color = FALSE) + facet_wrap(. ~ categorical) + theme_light() ``` <img src="images/Ok-Miss10-1.png" width="100%" /> --- Comparing AUCs with a statistical test (2x2 Repeated measures ANOVA) ```r aov_missing <- aov( AUC ~ factor(categorical) * factor(income) + Error(factor(id)), data = cv_res_missing) print(summary(aov_missing)) ``` ``` ## ## Error: factor(id) ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 9 0.03699 0.00411 ## ## Error: Within ## Df Sum Sq Mean Sq F value Pr(>F) ## factor(categorical) 1 0.0000178 1.779e-05 0.165 0.688 ## factor(income) 1 0.0001487 1.487e-04 1.379 0.251 ## factor(categorical):factor(income) 1 0.0000013 1.250e-06 0.012 0.915 ## Residuals 27 0.0029116 1.078e-04 ``` --- ```r with(cv_res_missing, interaction.plot(categorical, income, AUC)) ``` <img src="images/Ok-Miss11-1.png" width="50%" /> --- Ok, now let's go back to our original idea of 18 recipes... see code in slides Rmd file. ```r lst_recs <- list("unknown_none_none" = rec_miss_unk_none_none, "mode_none_none" = rec_miss_mode_none_none, "unknown_mean_none" = rec_miss_unk_mean_none, "mode_mean_none" = rec_miss_mode_mean_none, "unknown_knn_none" = rec_miss_unk_knn_none, "mode_knn_none" = rec_miss_mode_knn_none, "unknown_none_len" = rec_miss_unk_none_len, "mode_none_len" = rec_miss_mode_none_len, "unknown_mean_len" = rec_miss_unk_mean_len, "mode_mean_len" = rec_miss_mode_mean_len, "unknown_knn_len" = rec_miss_unk_knn_len, "mode_knn_len" = rec_miss_mode_knn_len, "unknown_none_isna" = rec_miss_unk_none_isna, "mode_none_isna" = rec_miss_mode_none_isna, "unknown_mean_isna" = rec_miss_unk_mean_isna, "mode_mean_isna" = rec_miss_mode_mean_isna, "unknown_knn_isna" = rec_miss_unk_knn_isna, "mode_knn_isna" = rec_miss_mode_knn_isna ) ``` --- ```r set.seed(42) cv_splits_missing <- vfold_cv(ok_tr_missing2, v = 10, strata = pets) ``` ```r cv_splits_missing <- cv_splits_missing %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_missing$splits) ) ``` ```r cv_res_missing <- cv_splits_missing %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) %>% separate(recipe, c("categorical", "income", "essay")) ``` --- Again, see the plot .font80percent[(how would you plot a 2x3x3 experiment results?)]. ```r cv_res_missing %>% group_by(id, income, essay) %>% summarise(AUC = mean(AUC)) %>% ggplot(aes(income, AUC, group = id, color = id)) + geom_line() + geom_line(data = cv_res_missing_meds, color = "black", alpha = 0.2, lwd = 2) + geom_point() + geom_point(data = cv_res_missing_meds, color = "black", alpha = 0.2, size = 5, pch = 17) + guides(color = FALSE) + facet_wrap(. ~ essay) + theme_light() ``` <img src="images/Ok-Miss12-1.png" width="90%" /> --- Repeated measures ANOVA ```r aov_missing <- aov( AUC ~ factor(categorical) * factor(income) + factor(essay) + Error(factor(id)), data = cv_res_missing) summary(aov_missing) ``` ``` ## ## Error: factor(id) ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 9 0.1595 0.01772 ## ## Error: Within ## Df Sum Sq Mean Sq F value Pr(>F) ## factor(categorical) 1 0.000191 0.0001911 1.975 0.16179 ## factor(income) 2 0.000483 0.0002415 2.497 0.08549 ## factor(essay) 2 0.001101 0.0005503 5.689 0.00409 ## factor(categorical):factor(income) 2 0.000007 0.0000036 0.038 0.96300 ## Residuals 163 0.015766 0.0000967 ## ## factor(categorical) ## factor(income) . ## factor(essay) ** ## factor(categorical):factor(income) ## Residuals ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- Or you can just look at the mean numbers... ```r cv_res_missing %>% group_by(categorical, income, essay) %>% summarise(AUC = mean(AUC)) %>% arrange(-AUC) %>% head(10) ``` ``` ## # A tibble: 10 x 4 ## # Groups: categorical, income [6] ## categorical income essay AUC ## <chr> <chr> <chr> <dbl> ## 1 mode mean len 0.719 ## 2 mode knn len 0.719 ## 3 unknown knn len 0.718 ## 4 unknown mean len 0.718 ## 5 mode mean none 0.716 ## 6 mode none len 0.716 ## 7 mode mean isna 0.715 ## 8 unknown mean none 0.715 ## 9 mode knn none 0.715 ## 10 unknown none len 0.715 ``` .insight[ 💡 So what would you do? ] --- class: section-slide # Class Imbalance --- ### Class Imbalance .font80percent[(We already gave a decent review in [last class](https://dsapps-2020.github.io/Class_Slides/u3_d02-classtopics/u3_d02-classtopics.html))] We'll choose between 3 strategies: - Upsample - Downsample - SMOTE and Downsample --- ```r ok_tr_imbalance2 <- ok_tr_imbalance %>% select(-starts_with("essay")) %>% bind_cols( ok_tr_imbalance %>% transmute_at(vars(essay0:essay9), list("len" = ~ifelse(is.na(.x), 0, str_length(.x)))) ) %>% bind_cols( ok_tr_imbalance %>% transmute_at(vars(essay0:essay9), list("isna" = ~ifelse(is.na(.x), 1, 0))) ) %>% rename_at(vars(contains("_len")), list(~paste("len", gsub("_len", "", .), sep = "_"))) %>% rename_at(vars(contains("_isna")), list(~paste("isna", gsub("_isna", "", .), sep = "_"))) %>% mutate_if(is.character, as.factor) ``` --- ```r library(themis) rec_imb_upsample <- recipe(pets ~ ., data = ok_tr_imbalance2) %>% update_role(starts_with("isna"), new_role = "discarded") %>% step_log(income, starts_with("len"), offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) rec_imb_downsample <- recipe(pets ~ ., data = ok_tr_imbalance2) %>% update_role(starts_with("isna"), new_role = "discarded") %>% step_log(income, starts_with("len"), offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_downsample(pets, under_ratio = 1, seed = 42) ``` --- ```r rec_imb_smote <- recipe(pets ~ ., data = ok_tr_imbalance2) %>% update_role(starts_with("isna"), new_role = "discarded") %>% step_log(income, starts_with("len"), offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_downsample(pets, under_ratio = 1.5) %>% step_smote(pets, over_ratio = 1, seed = 42) ``` --- CV splits ```r set.seed(42) cv_splits_imbalance <- vfold_cv(ok_tr_imbalance2, v = 10, strata = pets) lst_recs <- list("upsample" = rec_imb_upsample, "downsample" = rec_imb_downsample, "smote" = rec_imb_smote) ``` ```r cv_splits_imbalance <- cv_splits_imbalance %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_imbalance$splits) ) ``` --- ```r cv_res_imbalance <- cv_splits_imbalance %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) ``` Repeated measures ANOVA: ```r aov_imbalance <- aov( AUC ~ factor(recipe) + Error(factor(id)), data = cv_res_imbalance) print(summary(aov_imbalance)) ``` ``` ## ## Error: factor(id) ## Df Sum Sq Mean Sq F value Pr(>F) ## Residuals 9 0.07173 0.00797 ## ## Error: Within ## Df Sum Sq Mean Sq F value Pr(>F) ## factor(recipe) 2 0.001422 0.0007111 2.202 0.139 ## Residuals 18 0.005812 0.0003229 ``` --- Plotting... ```r cv_res_imbalance %>% ggplot(aes(recipe, AUC, group = id, color = id)) + geom_line() + geom_line(data = cv_res_imbalance_meds, color = "black", alpha = 0.2, lwd = 2) + geom_point() + geom_point(data = cv_res_imbalance_meds, color = "black", alpha = 0.2, size = 5, pch = 17) + guides(color = FALSE) + theme_light() ``` <img src="images/Ok-Imb1-1.png" width="100%" /> --- ```r print(cv_res_imbalance %>% group_by(recipe) %>% summarise(mean_auc = mean(AUC))) ``` ``` ## # A tibble: 3 x 2 ## recipe mean_auc ## <chr> <dbl> ## 1 downsample 0.736 ## 2 smote 0.725 ## 3 upsample 0.741 ``` .insight[ 💡 So what would you do? ] --- class: section-slide # Feat. Engineering and Selection --- ### If anything, remember: 1. Don't be lazy. 2. Be creative. 3. Use resampling! Let's divide to: 1. Categorical Features 2. Continuous Features 3. Interactions --- class: section-slide # Categorical Variables --- ### Making dummy variables .font80percent[(In the case your model can accept categorical levels, skip to the next section - Dealing with categorical variables with many levels)] - One hot encoding: `\(k\)` dummy vars for `\(k\)` levels - Non one hot encoding: `\(k - 1\)` dummy vars for `\(k\)` levels - Different Contrasts schemes .font80percent[(e.g. for ordered categorical variables ("low", "medium", "high"))] - Feature Hashing --- ```r set.seed(42) df <- tibble( x1 = factor(c(rep("a", 4), rep("b", 4))), y = runif(8) ) one_hot <- recipe(y ~ x1, data = df) %>% step_dummy(all_nominal(), one_hot = TRUE) %>% prep(df) %>% juice() non_one_hot <- recipe(y ~ x1, data = df) %>% step_dummy(all_nominal(), one_hot = FALSE) %>% prep(df) %>% juice() ``` --- ```r head(one_hot) ``` ``` ## # A tibble: 6 x 3 ## y x1_a x1_b ## <dbl> <dbl> <dbl> ## 1 0.915 1 0 ## 2 0.937 1 0 ## 3 0.286 1 0 ## 4 0.830 1 0 ## 5 0.642 0 1 ## 6 0.519 0 1 ``` ```r head(non_one_hot) ``` ``` ## # A tibble: 6 x 2 ## y x1_b ## <dbl> <dbl> ## 1 0.915 0 ## 2 0.937 0 ## 3 0.286 0 ## 4 0.830 0 ## 5 0.642 1 ## 6 0.519 1 ``` --- When dealing with linear regression, you should be very careful what you input into the `lm` engine of your choice and how you interpret results! ```r summary(lm(y ~ ., data = non_one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = non_one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.7421 0.1434 5.177 0.00206 ** ## x1_b -0.2341 0.2027 -1.155 0.29212 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.1818, Adjusted R-squared: 0.04546 ## F-statistic: 1.333 on 1 and 6 DF, p-value: 0.2921 ``` ```r print(mean(df$y) + (mean(df$y[df$x1 == "a"]) - mean(df$y))) ``` ``` ## [1] 0.7421172 ``` ```r print(mean(df$y[df$x1 == "a"])) ``` ``` ## [1] 0.7421172 ``` ```r print(mean(df$y[df$x1 == "b"]) - mean(df$y[df$x1 == "a"])) ``` ``` ## [1] -0.2340931 ``` --- This is like: ```r summary(lm(y ~ ., data = df)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.7421 0.1434 5.177 0.00206 ** ## x1b -0.2341 0.2027 -1.155 0.29212 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.1818, Adjusted R-squared: 0.04546 ## F-statistic: 1.333 on 1 and 6 DF, p-value: 0.2921 ``` --- But: ```r summary(lm(y ~ ., data = one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.5080 0.1434 3.544 0.0122 * ## x1_a 0.2341 0.2027 1.155 0.2921 ## x1_b NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.1818, Adjusted R-squared: 0.04546 ## F-statistic: 1.333 on 1 and 6 DF, p-value: 0.2921 ``` --- ```r summary(lm(y ~ . - 1, data = one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ . - 1, data = one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## x1_a 0.7421 0.1434 5.177 0.00206 ** ## x1_b 0.5080 0.1434 3.544 0.01216 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.8677, Adjusted R-squared: 0.8236 ## F-statistic: 19.68 on 2 and 6 DF, p-value: 0.002314 ``` --- This is like: ```r summary(lm(y ~ . - 1, data = df)) ``` ``` ## ## Call: ## lm(formula = y ~ . - 1, data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## x1a 0.7421 0.1434 5.177 0.00206 ** ## x1b 0.5080 0.1434 3.544 0.01216 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.8677, Adjusted R-squared: 0.8236 ## F-statistic: 19.68 on 2 and 6 DF, p-value: 0.002314 ``` --- Plot thickens (check your `options("contrasts")`!): ```r summary(lm(y ~ ., data = df, contrasts = list(x1 = "contr.sum"))) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = df, contrasts = list(x1 = "contr.sum")) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.45598 -0.08504 0.11103 0.17826 0.22856 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.6251 0.1014 6.167 0.000835 *** ## x11 0.1170 0.1014 1.155 0.292118 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2867 on 6 degrees of freedom ## Multiple R-squared: 0.1818, Adjusted R-squared: 0.04546 ## F-statistic: 1.333 on 1 and 6 DF, p-value: 0.2921 ``` --- Two variables... ```r set.seed(42) df <- tibble( x1 = factor(c(rep("a", 6), rep("b", 6))), x2 = factor(rep(c(rep("c", 2), rep("d", 2), rep("e", 2)), 2)), y = runif(12) ) one_hot <- recipe(y ~ x1 + x2, data = df) %>% step_dummy(all_nominal(), one_hot = TRUE) %>% prep(df) %>% juice() non_one_hot <- recipe(y ~ x1 + x2, data = df) %>% step_dummy(all_nominal(), one_hot = FALSE) %>% prep(df) %>% juice() ``` --- ```r head(one_hot) ``` ``` ## # A tibble: 6 x 6 ## y x1_a x1_b x2_c x2_d x2_e ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0.915 1 0 1 0 0 ## 2 0.937 1 0 1 0 0 ## 3 0.286 1 0 0 1 0 ## 4 0.830 1 0 0 1 0 ## 5 0.642 1 0 0 0 1 ## 6 0.519 1 0 0 0 1 ``` ```r head(non_one_hot) ``` ``` ## # A tibble: 6 x 4 ## y x1_b x2_d x2_e ## <dbl> <dbl> <dbl> <dbl> ## 1 0.915 0 0 0 ## 2 0.937 0 0 0 ## 3 0.286 0 1 0 ## 4 0.830 0 1 0 ## 5 0.642 0 0 1 ## 6 0.519 0 0 1 ``` --- ```r summary(lm(y ~ ., data = non_one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = non_one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.48619 -0.08138 0.10650 0.15667 0.19636 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.74071 0.15610 4.745 0.00145 ** ## x1_b -0.11986 0.15610 -0.768 0.46465 ## x2_d -0.06112 0.19118 -0.320 0.75738 ## x2_e -0.09636 0.19118 -0.504 0.62783 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2704 on 8 degrees of freedom ## Multiple R-squared: 0.09602, Adjusted R-squared: -0.243 ## F-statistic: 0.2832 on 3 and 8 DF, p-value: 0.8362 ``` --- Same as: ```r summary(lm(y ~ ., data = df)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = df) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.48619 -0.08138 0.10650 0.15667 0.19636 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.74071 0.15610 4.745 0.00145 ** ## x1b -0.11986 0.15610 -0.768 0.46465 ## x2d -0.06112 0.19118 -0.320 0.75738 ## x2e -0.09636 0.19118 -0.504 0.62783 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2704 on 8 degrees of freedom ## Multiple R-squared: 0.09602, Adjusted R-squared: -0.243 ## F-statistic: 0.2832 on 3 and 8 DF, p-value: 0.8362 ``` --- ```r print(mean(df$y) + (mean(df$y[df$x1 == "a"]) - mean(df$y)) + (mean(df$y[df$x2 == "c"]) - mean(df$y))) ``` ``` ## [1] 0.7407128 ``` ```r print(mean(df$y[df$x1 == "a"]) + mean(df$y[df$x2 == "c"]) - mean(df$y)) ``` ``` ## [1] 0.7407128 ``` ```r print(mean(df$y[df$x1 == "b"]) - mean(df$y[df$x1 == "a"])) ``` ``` ## [1] -0.1198573 ``` ```r print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "c"])) ``` ``` ## [1] -0.06112303 ``` ```r print(mean(df$y[df$x2 == "e"]) - mean(df$y[df$x2 == "c"])) ``` ``` ## [1] -0.09636022 ``` --- ```r summary(lm(y ~ ., data = one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ ., data = one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.48619 -0.08138 0.10650 0.15667 0.19636 ## ## Coefficients: (2 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.52450 0.15610 3.360 0.00993 ** ## x1_a 0.11986 0.15610 0.768 0.46465 ## x1_b NA NA NA NA ## x2_c 0.09636 0.19118 0.504 0.62783 ## x2_d 0.03524 0.19118 0.184 0.85836 ## x2_e NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2704 on 8 degrees of freedom ## Multiple R-squared: 0.09602, Adjusted R-squared: -0.243 ## F-statistic: 0.2832 on 3 and 8 DF, p-value: 0.8362 ``` --- ```r print(mean(df$y[df$x1 == "b"]) + mean(df$y[df$x2 == "e"]) - mean(df$y)) ``` ``` ## [1] 0.5244952 ``` ```r print(mean(df$y[df$x1 == "a"]) - mean(df$y[df$x1 == "b"])) ``` ``` ## [1] 0.1198573 ``` ```r print(mean(df$y[df$x2 == "c"]) - mean(df$y[df$x2 == "e"])) ``` ``` ## [1] 0.09636022 ``` ```r print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "e"])) ``` ``` ## [1] 0.03523718 ``` --- ```r summary(lm(y ~ . - 1, data = one_hot)) ``` ``` ## ## Call: ## lm(formula = y ~ . - 1, data = one_hot) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.48619 -0.08138 0.10650 0.15667 0.19636 ## ## Coefficients: (1 not defined because of singularities) ## Estimate Std. Error t value Pr(>|t|) ## x1_a 0.64435 0.15610 4.128 0.00331 ** ## x1_b 0.52450 0.15610 3.360 0.00993 ** ## x2_c 0.09636 0.19118 0.504 0.62783 ## x2_d 0.03524 0.19118 0.184 0.85836 ## x2_e NA NA NA NA ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.2704 on 8 degrees of freedom ## Multiple R-squared: 0.8914, Adjusted R-squared: 0.8371 ## F-statistic: 16.41 on 4 and 8 DF, p-value: 0.0006356 ``` --- ```r print(mean(df$y[df$x1 == "a"]) + mean(df$y[df$x2 == "e"]) - mean(df$y)) ``` ``` ## [1] 0.6443525 ``` ```r print(mean(df$y[df$x1 == "b"]) + mean(df$y[df$x2 == "e"]) - mean(df$y)) ``` ``` ## [1] 0.5244952 ``` ```r print(mean(df$y[df$x2 == "c"]) - mean(df$y[df$x2 == "e"])) ``` ``` ## [1] 0.09636022 ``` ```r print(mean(df$y[df$x2 == "d"]) - mean(df$y[df$x2 == "e"])) ``` ``` ## [1] 0.03523718 ``` --- ### Categorical variables with many levels ```r ok_tr_eng %>% count(location) ``` ``` ## # A tibble: 71 x 2 ## location n ## <chr> <int> ## 1 alameda, california 53 ## 2 albany, california 9 ## 3 ashland, california 1 ## 4 atherton, california 8 ## 5 belmont, california 12 ## 6 belvedere tiburon, california 7 ## 7 benicia, california 13 ## 8 berkeley, california 186 ## 9 brisbane, california 4 ## 10 burlingame, california 23 ## # ... with 61 more rows ``` --- #### Many dangers lurk - OVERFITTING! - Unseen levels - Zero variance variables - Size of `\(p\)` #### We'll talk about - The "Other" and "New" Strategy - Clustering - Feature Hashing - Supervised Numerical Encoding - Dimensionality Reduction (Embeddings) - Text Features --- #### The "Other" and "New" Strategy - "Other": in many scenarios in which a categorical feature has many levels, some of these are so scarce it is doubtful if they have any predictive power - consider grouping them into one "Other" level - "New": therefore there might be many levels not seen during training, only during testing - consider adding a (zero variance) "New" level --- What % threshold should you use for "Other"? - `step_other()` default is 5% (i.e. any level with less than 5% occurrence is grouped into one super-level "Other") - or tune it! ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_other(all_nominal(), threshold = 0.05) %>% step_novel(all_nominal()) %>% prep(ok_tr_eng) %>% juice() %>% count(location) ``` ``` ## # A tibble: 4 x 2 ## location n ## <fct> <int> ## 1 berkeley, california 186 ## 2 oakland, california 312 ## 3 san francisco, california 1610 ## 4 other 892 ``` --- You can also insert some creativeness (or plain common sense) in your "Other" strategy: ```r education2 = case_when( education == "graduated from high school" ~ "high_school", education == "graduated from two-year college" ~ "college", education == "graduated from college/university" ~ "degree1", education == "graduated from masters program" ~ "degree2", education == "graduated from ph.d program" ~ "degree3", education == "working on two-year college" ~ "student0", education == "working on college/university" ~ "student1", education == "working on masters program" ~ "student2", education == "working on ph.d program" ~ "student3", is.na(education) ~ NA_character_, TRUE ~ "other" ) ``` --- #### Clustering - Manual clustering, using expert knowledge and common sense - Unsupervised clustering, where it makes sense (e.g. Topic Modeling with [LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) on text data) .insight[ 💡 How would you cluster `location`? ] The `sign` feature is also a great example: ```r ok_tr_eng %>% count(sign) ``` ``` ## # A tibble: 47 x 2 ## sign n ## <chr> <int> ## 1 aquarius 61 ## 2 aquarius and it’s fun to think about 77 ## 3 aquarius but it doesn’t matter 83 ## 4 aries 73 ## 5 aries and it’s fun to think about 68 ## 6 aries but it doesn’t matter 69 ## 7 cancer 59 ## 8 cancer and it matters a lot 4 ## 9 cancer and it’s fun to think about 73 ## 10 cancer but it doesn’t matter 67 ## # ... with 37 more rows ``` --- ```r signs <- c("leo", "libra", "cancer", "virgo", "scorpio", "gemini", "taurus", "aries", "pisces", "aquarius", "sagittarius", "capricorn") recipe(pets ~ ., data = ok_tr_eng %>% select(pets, sign)) %>% step_mutate(sign = str_extract(sign, str_c(signs, collapse = "|"))) %>% step_unknown(sign) %>% prep(ok_tr_eng) %>% juice() %>% count(sign) ``` ``` ## # A tibble: 13 x 2 ## sign n ## <fct> <int> ## 1 capricorn 192 ## 2 libra 218 ## 3 taurus 247 ## 4 leo 226 ## 5 aries 210 ## 6 aquarius 221 ## 7 pisces 223 ## 8 sagittarius 176 ## 9 scorpio 214 ## 10 cancer 203 ## 11 gemini 218 ## 12 virgo 214 ## 13 unknown 438 ``` --- #### Feature Hashing Suppose I told you "God, just group the locations by the number of times the letter 'l' appears in them!" ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(location = factor(str_count(location, "l"))) %>% prep(ok_tr_eng) %>% juice() %>% count(location) ``` ``` ## # A tibble: 6 x 2 ## location n ## <fct> <int> ## 1 0 2 ## 2 1 1934 ## 3 2 871 ## 4 3 155 ## 5 4 23 ## 6 5 15 ``` --- ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(locationL = factor(str_count(location, "l"))) %>% step_dummy(locationL) %>% prep(ok_tr_eng) %>% juice() %>% select(starts_with("location")) ``` ``` ## # A tibble: 3,000 x 6 ## location locationL_X1 locationL_X2 locationL_X3 locationL_X4 ## <fct> <dbl> <dbl> <dbl> <dbl> ## 1 mountai~ 1 0 0 0 ## 2 oakland~ 0 1 0 0 ## 3 san fra~ 1 0 0 0 ## 4 san pab~ 0 1 0 0 ## 5 oakland~ 0 1 0 0 ## 6 san fra~ 1 0 0 0 ## 7 san raf~ 0 1 0 0 ## 8 san fra~ 1 0 0 0 ## 9 san fra~ 1 0 0 0 ## 10 alameda~ 0 1 0 0 ## # ... with 2,990 more rows, and 1 more variable: locationL_X5 <dbl> ``` --- It's not a good idea because - Differenet `location`s map to the same bin ("collision"), masking possible effect - The distribution isn't even But there are advantages too! - I *did* reduce the dimensionality from 70 to 6 dummy variables - Fast - Generalizes well: any new `location` can be mapped this way - Scales: I can do this for millions of `location`s, no dictionary needed to store in memory --- - Feature Hashing (or "hashing trick") does something similar more intelligently - It lets you choose the hashing algo and no. of features you'd like (the more - the less chance of collision) ```r library(textrecipes) recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_tokenize(location) %>% step_texthash(location, signed = FALSE, num_terms = 2^4) %>% prep(ok_tr_eng) %>% juice() ``` ``` ## # A tibble: 3,000 x 17 ## location_hash01 location_hash02 location_hash03 location_hash04 ## <dbl> <dbl> <dbl> <dbl> ## 1 0 0 1 0 ## 2 0 1 1 0 ## 3 0 0 1 0 ## 4 0 0 1 1 ## 5 0 1 1 0 ## 6 0 0 1 0 ## 7 0 0 1 1 ## 8 0 0 1 0 ## 9 0 0 1 0 ## 10 0 0 1 0 ## # ... with 2,990 more rows, and 13 more variables: location_hash05 <dbl>, ## # location_hash06 <dbl>, location_hash07 <dbl>, location_hash08 <dbl>, ## # location_hash09 <dbl>, location_hash10 <dbl>, location_hash11 <dbl>, ## # location_hash12 <dbl>, location_hash13 <dbl>, location_hash14 <dbl>, ## # location_hash15 <dbl>, location_hash16 <dbl>, pets <fct> ``` --- - Particularly useful for fast and furious feature engineering of text (see later) - Beware of columns with zero variance, consider binning them to a "New" category - Powerful methods like `xgboost` can handle 2^20+ no. of terms easily .insight[ 💡 Why would this work in spite of collisions? What is still a major disadvantage? Hint: 🎁. ] --- #### Supervised Numerical Encoding Here's another crazy idea for you: - For regression, replace a category with `\(y\)`'s mean in this category - For classification, replace a category with the log odds of `\(y\)` in this category --- Thus, changing a categorical variable with many levels into a single numerical variable (or a few for a few classes). ```r library(embed) recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(location2 = location) %>% step_lencode_glm(location, outcome = vars(pets)) %>% prep(ok_tr_eng) %>% juice() ``` ``` ## # A tibble: 3,000 x 3 ## location pets location2 ## <dbl> <fct> <fct> ## 1 -0.539 dogs mountain view, california ## 2 -1.20 dogs oakland, california ## 3 -1.84 dogs san francisco, california ## 4 -2.40 dogs san pablo, california ## 5 -1.20 dogs oakland, california ## 6 -1.84 dogs san francisco, california ## 7 -0.571 cats san rafael, california ## 8 -1.84 dogs san francisco, california ## 9 -1.84 dogs san francisco, california ## 10 -0.930 cats alameda, california ## # ... with 2,990 more rows ``` --- Make sure you get what just happened ```r mv <- ok_tr_eng$location == "mountain view, california" p <- mean(ok_tr_eng$pets[mv] == "cats") log(p / (1 - p)) ``` ``` ## [1] -0.5389965 ``` --- But... - What if a specific location has just 1-2 samples in training set? - "self-fulfilling prophecy" The first issue can be somewhat alleviated using a Bayesian approach, *shrinking* somewhat coefficients towards their prior when n is small: ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(location2 = location) %>% step_lencode_bayes(location, outcome = vars(pets)) %>% prep(ok_tr_eng %>% head(100)) %>% juice() ``` ``` ## # A tibble: 100 x 3 ## location pets location2 ## <dbl> <fct> <fct> ## 1 -1.50 dogs mountain view, california ## 2 -1.72 dogs oakland, california ## 3 -1.99 dogs san francisco, california ## 4 -1.78 dogs san pablo, california ## 5 -1.72 dogs oakland, california ## 6 -1.99 dogs san francisco, california ## 7 -1.35 cats san rafael, california ## 8 -1.99 dogs san francisco, california ## 9 -1.99 dogs san francisco, california ## 10 -1.36 cats alameda, california ## # ... with 90 more rows ``` --- #### Dimensionality Reduction (Embeddings) See our `word2vec` detour when we learned about [Networks Community Detection](https://dsapps-2020.github.io/Class_Slides/u2_d04-community/u2_d04-community.html). If you have `tensorflow` and `keras` installed, you could train your own embeddings to turn each `location` into a `num_terms` long vector of features: ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(location2 = location) %>% step_embed(location, outcome = vars(pets), num_terms = 10, options = embed_control(epochs = 10)) %>% prep(ok_tr_eng) %>% juice() ``` --- Alternatively you could use pre-trained embeddings for text variables such as `essay`s using `step_word_embedding()` from the `textrecipes` package (where `my_embeddings` are your pre-trained embeddings: ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_tokenize(location) %>% step_word_embeddings(location, embeddings = my_embeddings) %>% prep(ok_tr_eng) %>% juice() ``` --- Or, perform PCA on the dummay variables matrix, keeping only a few of the firt PCs as features: ```r recipe(pets ~ ., data = ok_tr_eng %>% select(pets, location)) %>% step_mutate(old_location = location) %>% step_dummy(location) %>% step_normalize(all_numeric()) %>% step_pca(starts_with("location"), num_comp = 10) %>% prep(ok_tr_eng) %>% juice() ``` ``` ## # A tibble: 3,000 x 12 ## pets old_location PC01 PC02 PC03 PC04 PC05 PC06 ## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 dogs mountain vi~ -1.27 -0.634 -0.985 0.433 -9.37e-2 0.538 ## 2 dogs oakland, ca~ -1.58 2.66 0.711 -0.0657 9.48e-3 -0.0482 ## 3 dogs san francis~ 1.15 0.0528 0.0410 -0.00550 8.27e-4 -0.00425 ## 4 dogs san pablo, ~ -1.26 -0.617 -0.937 0.372 -7.59e-2 0.427 ## 5 dogs oakland, ca~ -1.58 2.66 0.711 -0.0657 9.48e-3 -0.0482 ## 6 dogs san francis~ 1.15 0.0528 0.0410 -0.00550 8.27e-4 -0.00425 ## 7 cats san rafael,~ -1.28 -0.678 -1.13 0.720 -2.19e-1 1.50 ## 8 dogs san francis~ 1.15 0.0528 0.0410 -0.00550 8.27e-4 -0.00425 ## 9 dogs san francis~ 1.15 0.0528 0.0410 -0.00550 8.27e-4 -0.00425 ## 10 cats alameda, ca~ -0.447 -0.0513 -0.0430 0.00613 -9.29e-4 0.00478 ## # ... with 2,990 more rows, and 4 more variables: PC07 <dbl>, PC08 <dbl>, ## # PC09 <dbl>, PC10 <dbl> ``` --- #### Text Features Extracting textual features could be the subject of an entire class or even a course. Just to demonstrate how worthwhile being creative can get here let us just add two textual features here, `n_cat` and `n_dog` in all `essay`s: ```r ok_tr_eng2 <- ok_tr_eng %>% mutate_at(vars(essay0:essay9), list("len" = ~ifelse(is.na(.x), 0, str_length(.x)))) %>% rename_at(vars(contains("_len")), list(~paste("len", gsub("_len", "", .), sep = "_") ) ) ``` --- ```r rec_eng_notext <- recipe(pets ~ ., data = ok_tr_eng2) %>% step_log(income, starts_with("len"), offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) str_count_na <- function(string, pattern = "") { n <- str_count(string, pattern) ifelse(is.na(n), 0, n) } ``` --- ```r rec_eng_catdog <- recipe(pets ~ ., data = ok_tr_eng2) %>% step_mutate(essays = pmap_chr( list(essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9), str_c), n_cat = str_count_na(essays, "cat"), n_dog = str_count_na(essays, "dog")) %>% update_role(starts_with("essay"), new_role = "discarded") %>% step_log(income, starts_with("len"), offset = 1) %>% step_meanimpute(income) %>% step_normalize(all_numeric(), -has_role("discarded")) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- <img src="images/Ok-FES1-1.png" width="100%" /> --- #### Text Features: some ideas for you - not just words: do cat people tend to use more the letter "z"? no. of commas, no. of dot-dot-dots, no. of links, no. of hashtags, no. of extra spaces, no. of digits - to stem or not to stem? - "plays", "playing", ["playa"](https://www.urbandictionary.com/define.php?term=Playa) - sentiment features - not just no. of "hate" vs. "love" - also no. of "you" and "I" - stop words or not to stop words? - TF-IDF or not to TF-IDF - Feature Hashing - n-grams: "I love my cats" - topics --- #### Using `textrecipes` + `textfeatures` - The `textrecipes` package lets you calculate a few features with a single `step_textfeature()` - The `extract_functions` parameter accepts a named list of functions you wish to define on your text data - By default it will accept `count_functions` list from the [`textfeatures`](https://textfeatures.mikewk.com/reference/textfeatures.html) by Mike Kearney, which gets you basic counts like `n_words`, `n_digits` etc. --- But `textfeatures` also has via the `textfeatures()` function: - basic sentiment features like `sent_bing` - first word2vec 2-200 dimensions! ```r print(textfeatures:::sentiment_bing("I hate cats!")) ``` ``` ## [1] -1 ``` ```r print(textfeatures:::sentiment_bing("I love cats! Hate dogs though.")) ``` ``` ## [1] 0 ``` ```r print(textfeatures::word_dims( "Why would Kim Jong-un insult me by calling me \"old,\"")) ``` ``` ## INFO [18:26:47.235] early stopping at 20 iteration ## INFO [18:26:47.294] early stopping at 20 iteration ## w1 w2 w3 w4 w5 w6 w7 w8 w9 w10 ## 1 0 0 0.05 0.05 0.1 0 0.05 0.7 0 0.05 ``` --- ```r ok_tr_eng2 <- ok_tr_eng2 %>% mutate(essays = pmap_chr( list(essay0, essay1, essay2, essay3, essay4, essay5, essay6, essay7, essay8, essay9), str_c)) my_text_funs <- c(textfeatures::count_functions, "sent_bing" = textfeatures:::sentiment_bing) ``` --- ```r recipe(pets ~ ., data = ok_tr_eng2 %>% select(pets, essays)) %>% update_role(essays, new_role = "discarded") %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% prep(ok_tr_eng2 %>% head(10)) %>% juice() %>% select(t_essays_n_uq_words, t_essays_n_hashtags, t_essays_sent_bing) ``` ``` ## # A tibble: 10 x 3 ## t_essays_n_uq_words t_essays_n_hashtags t_essays_sent_bing ## <dbl> <dbl> <dbl> ## 1 114 0 18 ## 2 0 0 0 ## 3 116 0 11 ## 4 250 0 11 ## 5 356 0 8 ## 6 405 0 40 ## 7 0 0 0 ## 8 622 0 36 ## 9 0 0 0 ## 10 465 0 23 ``` --- #### Text Features: but the words are most important We'll take [Kuhn & Johnson (2019)](http://www.feat.engineering/index.html) approach, and: 1. Filter 1-gram tokens (words) with at least X occurrences 2. Calculate the odds ratio `\(\frac{P(w|cats)}{1- P(w|cats)} / \frac{P(w|dogs)}{1- P(w|dogs)}\)` where `\(P(w|cats)\)` is the % of cats people who used the word `\(w\)` at least once in all essays 3. Calculate a p-value with a FDR correction 4. Choose words with ~~OR of at least |2| and~~ FDR value of maximum `\(10^{-4}\)` Then use resampling to see if these have any effect on AUC. --- ```r library(tidytext) tidy_essays <- ok_tr_eng2 %>% mutate(id = 1:n()) %>% select(id, pets, essays) %>% unnest_tokens(word, essays) %>% filter(!word %in% stop_words$word, str_detect(word, "[a-z]")) tidy_essays %>% head() ``` ``` ## # A tibble: 6 x 3 ## id pets word ## <int> <chr> <chr> ## 1 1 dogs positive ## 2 1 dogs person ## 3 1 dogs loves ## 4 1 dogs laugh ## 5 1 dogs enjoy ## 6 1 dogs life ``` --- ```r min_n_words <- tidy_essays %>% count(id, word) %>% count(word) %>% filter(n > 50) %>% pull(word) print(c("cat", "dog", "cats", "dogs") %in% min_n_words) ``` ``` ## [1] TRUE TRUE TRUE TRUE ``` ```r n_cats_people <- sum(ok_tr_eng2$pets == "cats") n_dogs_people <- sum(ok_tr_eng2$pets == "dogs") fisher_test <- function(cats, dogs) { cont_tab <- matrix(c(cats, n_cats_people - cats, dogs, n_dogs_people - dogs), ncol = 2, nrow = 2) ft <- fisher.test(cont_tab) list(OR = ft$estimate, p_val = ft$p.value) } ``` --- ```r or_res <- tidy_essays %>% filter(word %in% min_n_words) %>% count(id, word, pets) %>% count(word, pets) %>% pivot_wider(id_cols = word, names_from = pets, values_from = n, values_fill = list(cats = 0, dogs = 0)) %>% mutate(fisher_obj = map2(cats, dogs, fisher_test)) or_res <- or_res %>% bind_cols( bind_rows(or_res$fisher_obj) ) %>% mutate(fdr = p.adjust(p_val, method = "fdr")) or_res ``` ``` ## # A tibble: 808 x 7 ## word cats dogs fisher_obj OR p_val fdr ## <chr> <int> <int> <list> <dbl> <dbl> <dbl> ## 1 ability 17 46 <named list [2]> 1.91 0.0369 0.112 ## 2 absolutely 18 62 <named list [2]> 1.49 0.167 0.317 ## 3 accent 14 38 <named list [2]> 1.90 0.0561 0.148 ## 4 action 23 96 <named list [2]> 1.23 0.378 0.559 ## 5 active 21 134 <named list [2]> 0.788 0.373 0.557 ## 6 activities 16 91 <named list [2]> 0.891 0.791 0.892 ## 7 add 17 46 <named list [2]> 1.91 0.0369 0.112 ## 8 admit 17 92 <named list [2]> 0.937 0.896 0.950 ## 9 adult 13 38 <named list [2]> 1.76 0.0861 0.195 ## 10 adventure 25 145 <named list [2]> 0.870 0.595 0.749 ## # ... with 798 more rows ``` --- ```r cd_words <- or_res %>% arrange(fdr) %>% filter(fdr < 10e-4) cd_words ``` ``` ## # A tibble: 37 x 7 ## word cats dogs fisher_obj OR p_val fdr ## <chr> <int> <int> <list> <dbl> <dbl> <dbl> ## 1 cat 55 37 <named list [2]> 8.37 1.06e-21 8.53e-19 ## 2 cats 43 21 <named list [2]> 11.3 5.37e-20 2.17e-17 ## 3 class 73 151 <named list [2]> 2.71 4.72e-10 9.54e- 8 ## 4 href 59 106 <named list [2]> 3.08 4.40e-10 9.54e- 8 ## 5 ilink 57 101 <named list [2]> 3.11 6.79e-10 1.10e- 7 ## 6 dog 12 247 <named list [2]> 0.228 2.19e- 9 2.95e- 7 ## 7 writing 62 127 <named list [2]> 2.69 1.08e- 8 1.25e- 6 ## 8 strange 24 28 <named list [2]> 4.53 4.03e- 7 4.07e- 5 ## 9 dr 24 29 <named list [2]> 4.37 6.26e- 7 5.62e- 5 ## 10 dogs 5 138 <named list [2]> 0.176 1.31e- 6 1.03e- 4 ## # ... with 27 more rows ``` --- So these will be our four options for handling text: 1. Not 2. Text features (zero for `NA`) 3. Text features + our flagged words counts (zero for `NA`) 4. Feature Hashing --- ```r count_words <- function(essays) { counts <- map(cd_words$word, ~str_count_na(essays, .x)) names(counts) = janitor::make_clean_names(str_c("w_", cd_words$word)) counts } ok_tr_eng3 <- ok_tr_eng2 %>% bind_cols( map_dfr(ok_tr_eng2$essays, count_words) ) %>% select(-starts_with("essay"), -starts_with("len_essay")) %>% bind_cols(ok_tr_eng2 %>% select(essays)) %>% mutate_if(is.character, as.factor) ``` --- ```r rec_eng_notext <- recipe(pets ~ ., data = ok_tr_eng3) %>% update_role(starts_with("w_"), essays, new_role = "discarded") %>% step_log(income, offset = 1) %>% step_meanimpute(income) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) rec_eng_textfeat <- recipe(pets ~ ., data = ok_tr_eng3) %>% update_role(starts_with("w_"), essays, new_role = "discarded") %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_zv(all_numeric()) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r rec_eng_fulltext <- recipe(pets ~ ., data = ok_tr_eng3) %>% update_role(essays, new_role = "discarded") %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_zv(all_numeric()) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) rec_eng_hashtext <- recipe(pets ~ ., data = ok_tr_eng3) %>% update_role(starts_with("w_"), essays, new_role = "discarded") %>% step_tokenize(essays) %>% step_tokenfilter(essays, min_times = 10) %>% step_texthash(essays, num_terms = 2^11) %>% step_log(income, offset = 1) %>% step_meanimpute(income) %>% step_zv(all_numeric()) %>% step_other(all_nominal(), -has_role("discarded"), other = "other1") %>% step_novel(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r set.seed(42) cv_splits_eng <- vfold_cv(ok_tr_eng3, v = 10, strata = pets) lst_recs <- list("notext" = rec_eng_notext, "textfeat" = rec_eng_textfeat, "fulltext" = rec_eng_fulltext, "hashtext" = rec_eng_hashtext) ``` ```r cv_res_eng <- cv_splits_eng %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_eng$splits) ) %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) cv_res_eng_meds <- cv_res_eng %>% group_by(recipe) %>% summarise(AUC = mean(AUC), id = "All") ``` --- #### Haven't forgot about our other categorical variables Let's use some common sense (see code!). ```r cd_words <- cd_words %>% bind_rows( tibble(word = c("my cat", "my dog", "love cat", "love dog", "kitt", "pupp", "have cat", "have dog", "have a cat", "have a dog"), cats = NULL, dogs = NULL, fisher_obj = NULL, OR = 0, p_val = 0, fdr = 0)) ``` ```r ok_tr_eng4 <- full_features(ok_tr_eng) colnames(ok_tr_eng4) ``` ``` ## [1] "age" "body_type" "diet" ## [4] "drinks" "drugs" "education" ## [7] "ethnicity" "height" "income" ## [10] "job" "location" "orientation" ## [13] "pets" "religion" "sex" ## [16] "smokes" "status" "len_essay0" ## [19] "len_essay1" "len_essay2" "len_essay3" ## [22] "len_essay4" "len_essay5" "len_essay6" ## [25] "len_essay7" "len_essay8" "len_essay9" ## [28] "essays" "offspring_no_kids" "offspring_one_kid" ## [31] "offspring_many_kids" "offspring_wants_kids" "religion_not2serious" ## [34] "religion_laughing" "religion_somewhat" "body_type_perfect" ## [37] "sign_fun" "sign_not_matter" "sign_matters" ## [40] "sign_na" "speaks_spanish" "speaks_french" ## [43] "speaks_german" "speaks_chinese" "speaks_n_fluently" ## [46] "speaks_n_okay" "speaks_n_poorly" "education_kind" ## [49] "w_cat" "w_cats" "w_class" ## [52] "w_href" "w_ilink" "w_dog" ## [55] "w_writing" "w_strange" "w_dr" ## [58] "w_dogs" "w_feel" "w_shy" ## [61] "w_books" "w_br" "w_fantasy" ## [64] "w_games" "w_video" "w_means" ## [67] "w_pretty" "w_geek" "w_computer" ## [70] "w_brain" "w_stuff" "w_constantly" ## [73] "w_science" "w_nerd" "w_write" ## [76] "w_walk" "w_punk" "w_history" ## [79] "w_art" "w_possibly" "w_sci" ## [82] "w_series" "w_form" "w_fi" ## [85] "w_sort" "w_my_cat" "w_my_dog" ## [88] "w_love_cat" "w_love_dog" "w_kitt" ## [91] "w_pupp" "w_have_cat" "w_have_dog" ## [94] "w_have_a_cat" "w_have_a_dog" ``` --- ```r rec_eng_commonsense <- recipe(pets ~ ., data = ok_tr_eng4) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_other( all_nominal(), -has_role("discarded"), -all_outcomes(), other = "all_else", threshold = 0.1) %>% step_novel( all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r set.seed(42) cv_splits_eng2 <- vfold_cv(ok_tr_eng4, v = 10, strata = pets) lst_recs <- list("commonsense" = rec_eng_commonsense) ``` ```r cv_res_commonsense <- cv_splits_eng2 %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_eng2$splits) ) %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) cv_res_cs_meds <- cv_res_commonsense %>% group_by(recipe) %>% summarise(AUC = mean(AUC), id = "All") ``` --- ```r cv_res_eng %>% bind_rows(cv_res_commonsense) %>% mutate(recipe = fct_relevel(recipe, c("notext", "hashtext", "textfeat", "fulltext", "commonsense"))) %>% ggplot(aes(recipe, AUC, group = id, color = id)) + geom_line() + geom_line(data = cv_res_eng_meds %>% bind_rows(cv_res_cs_meds), color = "black", alpha = 0.2, lwd = 2) + geom_point() + geom_point(data = cv_res_eng_meds %>% bind_rows(cv_res_cs_meds), color = "black", alpha = 0.2, size = 5, pch = 17) + guides(color = FALSE) + theme_light() ``` <img src="images/Ok-FES3-1.png" width="100%" /> --- class: section-slide # Numerical Variables --- ### What could the problem be? The problem almost always depends on the combination between models and variables: - Varied scale (and kNN, SVM) --> `step_center()`, `step_scale()`, `step_normalize()`, `step_range()` - Bounded range of outcome (e.g. percentage and, well, almost any model) --> `step_logit()` - Skewed data and outliers (and linear regression, NN, kNN, SVM, but less for Trees) - Non-linearity (and LR) - Co-linearity (and LR, NN) --- ### Skewed Data: Common Transformations #### Box-Cox In its simplified form: `$$x_i^{(\lambda)} = \begin{cases} \frac{x_i^{\lambda} - 1}{\lambda} & \mbox{if }\lambda \ne 0\\ \ln{x_i} & \mbox{if }\lambda = 0 \end{cases}$$` --- Originally the Box-Cox transformation was developed to transform a dependent `\(y\)` variable by maximizing the log-likelihood of seeing `\(y^{(\lambda)}\)` in a `lm` against some `\(X\)`. But `\(\lambda\)` could be estimated in an unsupervised way as well: ```r MASS::boxcox(ok_tr_eng$age ~ 1) ``` <img src="images/Age-BoxCox-1.png" width="50%" /> --- ```r tibble(age = ok_tr_eng$age, age_bc = -1/age) %>% pivot_longer(starts_with("age"), names_to = "age") %>% ggplot(aes(value)) + geom_histogram(bins = 30, fill = "red", alpha = 0.5) + facet_wrap(~ age, scales = "free") + theme_light() ``` <img src="images/Age-BoxCox2-1.png" width="100%" /> --- #### Yeo-Johnson Revised Box-Cox to allow for zero and negative values: `$$x_i^{(\lambda)} = \begin{cases} \frac{(x_i + 1)^{\lambda} - 1}{\lambda} & \mbox{if }\lambda \ne 0,y \ge 0\\ \ln{(x_i + 1)} & \mbox{if }\lambda = 0, y \ge 0 \\ \frac{-[(-x_i + 1)^{2-\lambda} - 1]}{2 - \lambda} & \mbox{if }\lambda \ne 0,y < 0\\ -\ln{(-x_i + 1)} & \mbox{if }\lambda = 0, y < 0 \\ \end{cases}$$` Both transformations have `step`s in `recipes`: ```r recipe(pets ~ ., data = ok_tr_eng) %>% step_BoxCox(income) ``` ```r recipe(pets ~ ., data = ok_tr_eng) %>% step_YeoJohnson(income) ``` --- ### Non-Linearity: Basis Expansions and Splines Let's borrow a continuous `\(y\)` variable like `log(income)` to simplify: ```r ggplot(ok_tr_eng, aes(age, log(income))) + geom_point() + theme_light() ``` <img src="images/Income-Age-Scatter-1.png" width="100%" /> Remember MARS? Perhaps you'd like to *expand* age to two variables: `max(45 - age, 0)` and `max(age - 45, 0)`? --- You could do this manually in `lm()` like so ```r mod_lm <- lm(log(income) ~ I(pmax(0, 45 - age)) + I(pmax(0, age - 45)), data = ok_tr_eng) summary(mod_lm) ``` ``` ## ## Call: ## lm(formula = log(income) ~ I(pmax(0, 45 - age)) + I(pmax(0, age - ## 45)), data = ok_tr_eng) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.3331 -0.6472 -0.0632 0.3654 3.2316 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 11.414705 0.093229 122.437 < 2e-16 *** ## I(pmax(0, 45 - age)) -0.033230 0.005738 -5.791 1.16e-08 *** ## I(pmax(0, age - 45)) -0.029685 0.013992 -2.122 0.0343 * ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.856 on 562 degrees of freedom ## (2435 observations deleted due to missingness) ## Multiple R-squared: 0.05805, Adjusted R-squared: 0.05469 ## F-statistic: 17.32 on 2 and 562 DF, p-value: 5.038e-08 ``` --- Or use `splines::bs()` with `degree = 1` and `knots = 45`. ```r library(splines) mod_lm_bs <- lm(log(income) ~ bs(age, degree = 1, knots = 45), data = ok_tr_eng) summary(mod_lm_bs) ``` ``` ## ## Call: ## lm(formula = log(income) ~ bs(age, degree = 1, knots = 45), data = ok_tr_eng) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.3331 -0.6472 -0.0632 0.3654 3.2316 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 10.51749 0.07891 133.280 < 2e-16 *** ## bs(age, degree = 1, knots = 45)1 0.89722 0.15492 5.791 1.16e-08 *** ## bs(age, degree = 1, knots = 45)2 0.21447 0.27510 0.780 0.436 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.856 on 562 degrees of freedom ## (2435 observations deleted due to missingness) ## Multiple R-squared: 0.05805, Adjusted R-squared: 0.05469 ## F-statistic: 17.32 on 2 and 562 DF, p-value: 5.038e-08 ``` --- How will the variables look like? ```r X <- model.matrix(mod_lm) age_no_na <- ok_tr_eng$age[!is.na(ok_tr_eng$income)] head(cbind(age_no_na, X), 4) ``` ``` ## age_no_na (Intercept) I(pmax(0, 45 - age)) I(pmax(0, age - 45)) ## 1 39 1 6 0 ## 4 29 1 16 0 ## 12 32 1 13 0 ## 15 22 1 23 0 ``` ```r X <- model.matrix(mod_lm_bs) colnames(X)[2:3] <- c("age1", "age2") head(cbind(age_no_na, X), 4) ``` ``` ## age_no_na (Intercept) age1 age2 ## 1 39 1 0.7777778 0 ## 4 29 1 0.4074074 0 ## 12 32 1 0.5185185 0 ## 15 22 1 0.1481481 0 ``` `splines::bs()` re-parametrizes the variables but it is the same model! --- How will the fit look like? ```r ggplot(ok_tr_eng[!is.na(ok_tr_eng$income), ]) + geom_point(aes(age, log(income))) + geom_line(aes(age, mod_lm_bs$fit), color = "red", lwd = 2, alpha = 0.4) + theme_light() ``` <img src="images/Income-Age-Scatter2-1.png" width="100%" /> --- Or with a simple cubic spline with one knot in `bs()` <img src="images/Income-Age-Scatter3-1.png" width="100%" /> --- Manually with `step_relu()` ```r recipe(income ~ age, ok_tr_eng) %>% step_naomit(income) %>% step_relu(age, shift = 45, reverse = TRUE) %>% step_relu(age, shift = 45) %>% prep() %>% juice() %>% head() ``` ``` ## # A tibble: 6 x 4 ## age income left_relu_age right_relu_age ## <dbl> <dbl> <dbl> <dbl> ## 1 39 80000 6 0 ## 2 29 30000 16 0 ## 3 32 1000000 13 0 ## 4 22 20000 23 0 ## 5 51 1000000 0 6 ## 6 40 30000 5 0 ``` --- For splines and polynomial regression use `step_bs()`, `step_ns()` and `step_poly()` ```r recipe(income ~ age, ok_tr_eng) %>% step_naomit(income) %>% step_bs(age, degree = 3) %>% prep() %>% juice() %>% head() ``` ``` ## # A tibble: 6 x 4 ## income age_bs_1 age_bs_2 age_bs_3 ## <dbl> <dbl> <dbl> <dbl> ## 1 80000 0.420 0.315 0.0787 ## 2 30000 0.405 0.117 0.0113 ## 3 1000000 0.437 0.175 0.0233 ## 4 20000 0.207 0.0184 0.000544 ## 5 1000000 0.215 0.444 0.305 ## 6 30000 0.409 0.333 0.0905 ``` --- ### Co-linearity: Consider reducing dimensionality > A conventional thought process is that the more predictors that are included, the better chance the model will have to find the signal. But this is not the case; including irrelevant predictors can have detrimental effects on the final model. The negative impacts include: - increasing the computational time required for training a model - decreasing the predictive performance - complicating the predictor importance calculation. > By including any potential predictor, we end up with a sub-optimal model. [Kuhn & Johnson (2019)](http://www.feat.engineering/index.html) --- Common *dimensionality reduction* methods (some of which we've talked about!): - PCA, Kernel PCA, Sparse PCA --> `step_pca()`, `step_kpca()` - NMF --> `step_nnmf()` - PLS --> `step_pls()` All of these will find some projection matrix `\(B_{pxk}\)` where `\(k\)` is the number of columns (e.g. PCs) to choose and would typically be much lower than `\(p\)`, the original dimension. You would then project your `\(X\)` matrix like so: `\(X^* = XB\)` (in most methods `\(X^*\)` columns would also be linearly independent, or "close" to it), and you would fit your data on this transformed `\(X^*\)` .warning[ ⚠️ No guarantee for improvement in performance! In fact... ] .insight[ 💡 What is an alternative to reduce dimensionality? ] --- class: section-slide # Interaction Effects --- ### What is Interaction? <img src = "images/interaction.png" style="width: 100%"> --- ### But seriously ```r lm_int <- lm(log(income) ~ sex + height + sex:height, data = ok_tr_interaction %>% filter(between(height, 60, 80))) summary(lm_int) ``` ``` ## ## Call: ## lm(formula = log(income) ~ sex + height + sex:height, data = ok_tr_interaction %>% ## filter(between(height, 60, 80))) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1.16063 -0.71711 -0.02872 0.46308 3.06165 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 6.36464 2.23738 2.845 0.00468 ** ## sexm 4.33800 2.61968 1.656 0.09855 . ## height 0.06753 0.03421 1.974 0.04912 * ## sexm:height -0.06277 0.03929 -1.597 0.11098 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.9044 on 386 degrees of freedom ## (1593 observations deleted due to missingness) ## Multiple R-squared: 0.02639, Adjusted R-squared: 0.01883 ## F-statistic: 3.488 on 3 and 386 DF, p-value: 0.0159 ``` --- ```r ok_tr_interaction %>% filter(between(height, 60, 80)) %>% ggplot(aes(height, log(income), color = sex)) + geom_point() + geom_smooth(method = "lm") ``` <img src="images/Height-Sex-Interaction-1.png" width="100%" /> .insight[ 💡 How would you visualize an interaction between two continuous features? Two categorical? Three?... ] --- #### You get an interaction! You get an interaction! With the last `rec_eng_commonsense` recipe we reached over 150 features. So all two-way interactions would mean adding... over 11K features. All three-way interactions would mean adding... over 550K features. .insight[ 💡 What danger comes to mind? How would you defend yourself? Which models surface interactions naturally? What about our model? Should you include interaction terms before or after pre-processing? ] --- Obviously, we need some control for this exploding no. of features: - Expert knowledge .font80percent[(e.g. when it comes to income most variables probably interact with...?)] - Heredity principle - Resampling, Penalize - When `\(p\)` is huge: Two-Stage Modeling combined with your favorite Feature Selection approach - Whatever you choose, visualize! --- So these will be our four options for handling interactions: 1. Not 2. Adding only interactions between `sex` and everything 3. Two Stage: Screening all "main effects" features selected by a single `glmnet` stage then all two-way interactions of those features 4. Two Stage: Screening all two-way interactions by multiple comparisons of a `glm` main effects model vs. main effects + interaction, combined with FDR adjustment for the p-value 5. All two-way interactions 😎 .font80percent[(with a `step_nzv()` though)] --- For stages 3 and 4 let's use previously available data. ```r ok_tr_combined <- bind_rows( ok_tr_missing, ok_tr_imbalance, ok_tr_eng ) ok_tr_combined <- full_features(ok_tr_combined) ``` --- For stage 3 need to screen out "main effects" features which passed `glmnet` ```r main_data <- rec_eng_commonsense %>% prep(ok_tr_combined) %>% juice() mod_lr_res <- mod_lr %>% fit(pets ~ ., data = main_data) %>% pluck("fit") %>% tidy() mod_lr_res ``` ``` ## # A tibble: 6,289 x 5 ## term step estimate lambda dev.ratio ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1 -2.45e-13 0.0889 2.60e-13 ## 2 (Intercept) 2 -4.83e- 3 0.0810 9.36e- 3 ## 3 w_cat 2 -6.84e- 3 0.0810 9.36e- 3 ## 4 w_my_cat 2 -1.34e- 1 0.0810 9.36e- 3 ## 5 w_my_dog 2 6.28e- 2 0.0810 9.36e- 3 ## 6 religion_christian 2 6.28e- 2 0.0810 9.36e- 3 ## 7 (Intercept) 3 -5.75e- 3 0.0738 2.11e- 2 ## 8 w_cat 3 -2.02e- 2 0.0738 2.11e- 2 ## 9 w_cats 3 -3.38e- 2 0.0738 2.11e- 2 ## 10 w_my_cat 3 -2.32e- 1 0.0738 2.11e- 2 ## # ... with 6,279 more rows ``` --- We can plot no. of features selected by `glmnet` by lambda ```r mod_lr_res %>% count(lambda) %>% ggplot(aes(lambda, n)) + geom_line(lwd = 2, color = "red", alpha = 0.5) + labs(y = "N features selected by glmnet") + theme_light() ``` <img src="images/Interaction-Stage3-Glmnet-1.png" width="100%" /> --- We see a plateau around 12-15 features, let's choose those. ```r mod_lr_res %>% filter(step == 6) %>% select(term, estimate) %>% arrange(-abs(estimate)) ``` ``` ## # A tibble: 15 x 2 ## term estimate ## <chr> <dbl> ## 1 w_my_cat -0.493 ## 2 w_my_dog 0.328 ## 3 religion_christian 0.278 ## 4 status_all_else -0.207 ## 5 w_cats -0.147 ## 6 (Intercept) 0.130 ## 7 drinks_all_else -0.100 ## 8 body_type_all_else -0.0823 ## 9 w_dog 0.0698 ## 10 w_cat -0.0430 ## 11 drugs_sometimes -0.0297 ## 12 location_sf 0.00977 ## 13 w_art -0.00600 ## 14 age -0.00189 ## 15 w_sci -0.000986 ``` --- These make sense, let's choose them as our main effects from which we pick all two-way interactions. ```r main_effects <- mod_lr_res %>% filter(step == 6, term != "(Intercept)") %>% pull(term) ``` For stage 4 we need to iterate over all possible two-way interactions, each time adding one into a full main effects `glm` model, getting the p-value, and applying FDR adjustment. --- ```r mod_glm_base <- glm(pets ~ ., data = main_data %>% select(-ends_with("_new")), family = "binomial") extract_interaction_p_val <- function(var1, var2) { pb$tick()$print() int_term <- str_c(var1, ":", var2) int_term2 <- str_c(var2, ":", var1) mod_glm <- quietly(update)( mod_glm_base, as.formula(str_c(".~.+", int_term)) ) if(length(mod_glm$warnings) > 1) { return(NA_real_) } p_val <- tidy(mod_glm$result) %>% filter(term %in% c(int_term, int_term2)) %>% pull(p.value) if (length(p_val) == 0) { return(NA_real_) } p_val } all_vars <- colnames(main_data %>% select(-ends_with("_new"))) all_ints <- t(combn(all_vars[-which(all_vars == "pets")], 2)) colnames(all_ints) <- c("var1", "var2") all_ints <- as_tibble(all_ints) pb <- progress_estimated(nrow(all_ints)) all_ints <- all_ints %>% mutate(p_val = map2_dbl(var1, var2, extract_interaction_p_val), fdr = p.adjust(p_val, method = "fdr")) ``` --- Let's filter only the *most* convincing interactions. ```r pairwise_ints <- all_ints %>% arrange(fdr) %>% filter(fdr < 10e-7) pairwise_ints ``` ``` ## # A tibble: 10 x 4 ## var1 var2 p_val fdr ## <chr> <chr> <dbl> <dbl> ## 1 ethnicity_other location_other 2.68e-14 2.06e-10 ## 2 offspring_one_kid speaks_n_fluently 5.06e-13 1.94e- 9 ## 3 religion_laughing ethnicity_white 1.15e-11 2.93e- 8 ## 4 len_essay0 education_kind_working 8.94e-11 1.53e- 7 ## 5 offspring_one_kid speaks_spanish 9.97e-11 1.53e- 7 ## 6 w_possibly education_all_else 1.23e-10 1.57e- 7 ## 7 w_games w_possibly 1.86e-10 2.04e- 7 ## 8 w_books w_sci 3.98e-10 3.81e- 7 ## 9 w_fi t_essays_n_uq_words 4.54e-10 3.87e- 7 ## 10 education_all_else status_all_else 8.89e-10 6.81e- 7 ``` --- First make sure you understand the difference between `other`, `all_else` etc. ```r ok_tr_combined %>% count(ethnicity, sort = TRUE) ``` ``` ## # A tibble: 6 x 2 ## ethnicity n ## <fct> <int> ## 1 white 3863 ## 2 other 1641 ## 3 asian 752 ## 4 hispanic 368 ## 5 black 248 ## 6 indian 128 ``` ```r main_data %>% select(starts_with("ethnicity")) %>% head() ``` ``` ## # A tibble: 6 x 4 ## ethnicity_other ethnicity_white ethnicity_all_else ethnicity_new ## <dbl> <dbl> <dbl> <dbl> ## 1 0 1 0 0 ## 2 0 0 1 0 ## 3 0 1 0 0 ## 4 1 0 0 0 ## 5 1 0 0 0 ## 6 1 0 0 0 ``` --- What are those people with all `ethnicity` columns with zero? ```r main_data %>% select(starts_with("ethnicity")) %>% filter_all(~.x == 0) ``` ``` ## # A tibble: 978 x 4 ## ethnicity_other ethnicity_white ethnicity_all_else ethnicity_new ## <dbl> <dbl> <dbl> <dbl> ## 1 0 0 0 0 ## 2 0 0 0 0 ## 3 0 0 0 0 ## 4 0 0 0 0 ## 5 0 0 0 0 ## 6 0 0 0 0 ## 7 0 0 0 0 ## 8 0 0 0 0 ## 9 0 0 0 0 ## 10 0 0 0 0 ## # ... with 968 more rows ``` --- Same with `location`. ```r ok_tr_combined %>% count(location, sort = TRUE) ``` ``` ## # A tibble: 4 x 2 ## location n ## <fct> <int> ## 1 sf 3706 ## 2 other 2034 ## 3 oakland 822 ## 4 berkeley 438 ``` ```r main_data %>% select(starts_with("location")) %>% head() ``` ``` ## # A tibble: 6 x 4 ## location_other location_sf location_all_else location_new ## <dbl> <dbl> <dbl> <dbl> ## 1 0 0 1 0 ## 2 1 0 0 0 ## 3 1 0 0 0 ## 4 0 0 0 0 ## 5 0 1 0 0 ## 6 0 0 0 0 ``` --- ```r main_data %>% group_by(ethnicity_other, location_other) %>% summarise(pct_cats = mean(pets == "cats"), n = n()) %>% ungroup() %>% mutate_at(vars("ethnicity_other", "location_other"), as.factor) %>% ggplot(aes(ethnicity_other, pct_cats, color = location_other)) + geom_line(aes(group = location_other)) + geom_point(aes(size = n)) + scale_y_continuous(labels = percent_format()) + theme_light() ``` <img src="images/Ethnicity-Interaction-1.png" width="90%" /> .insight[ 💡 Try stating what you see in words. ] --- ```r main_data %>% mutate(speaks_n_fluently = ifelse(speaks_n_fluently >= 3, "3+", speaks_n_fluently)) %>% group_by(speaks_n_fluently, offspring_one_kid) %>% summarise(pct_cats = mean(pets == "cats"), n = n()) %>% ungroup() %>% mutate_at(vars("speaks_n_fluently", "offspring_one_kid"), as.factor) %>% ggplot(aes(speaks_n_fluently, pct_cats, color = offspring_one_kid)) + geom_line(aes(group = offspring_one_kid)) + geom_point(aes(size = n)) + scale_y_continuous(labels = percent_format()) + theme_light() ``` <img src="images/Fluently-Interaction-1.png" width="90%" /> --- ```r main_data %>% group_by(religion_laughing, ethnicity_white) %>% summarise(pct_cats = mean(pets == "cats"), n = n()) %>% ungroup() %>% mutate_at(vars("religion_laughing", "ethnicity_white"), as.factor) %>% ggplot(aes(religion_laughing, pct_cats, color = ethnicity_white)) + geom_line(aes(group = ethnicity_white)) + geom_point(aes(size = n)) + scale_y_continuous(labels = percent_format()) + theme_light() ``` <img src="images/Religion-Interaction-1.png" width="90%" /> --- ```r ok_tr_interaction2 <- full_features(ok_tr_interaction) rec_int_noint <- rec_eng_commonsense rec_int_sex <- recipe(pets ~ ., data = ok_tr_interaction2) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_other( all_nominal(), -has_role("discarded"), -all_outcomes(), other = "all_else", threshold = 0.1) %>% step_novel( all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_interact(~starts_with("sex"):all_numeric()) %>% step_nzv(all_numeric(), freq_cut = 99/1) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r # pairwise interactions string code from Max Kuhn's workshop: # https://github.com/topepo/FES/blob/master/07_Detecting_Interaction_Effects/7_04_The_Brute-Force_Approach_to_Identifying_Predictive_Interactions/ames_glmnet.R topmain_ints <- t(combn(main_effects, 2)) colnames(topmain_ints) <- c("var1", "var2") topmain_ints <- topmain_ints %>% as_tibble() %>% mutate( term = str_c(var1, ":", var2) ) %>% pull(term) %>% paste(collapse = "+") topmain_ints <- str_c("~", topmain_ints) topmain_ints <- as.formula(topmain_ints) topmain_ints ``` ``` ## ~age:w_cat + age:w_cats + age:w_dog + age:w_art + age:w_sci + ## age:w_my_cat + age:w_my_dog + age:body_type_all_else + age:drinks_all_else + ## age:drugs_sometimes + age:location_sf + age:religion_christian + ## age:status_all_else + w_cat:w_cats + w_cat:w_dog + w_cat:w_art + ## w_cat:w_sci + w_cat:w_my_cat + w_cat:w_my_dog + w_cat:body_type_all_else + ## w_cat:drinks_all_else + w_cat:drugs_sometimes + w_cat:location_sf + ## w_cat:religion_christian + w_cat:status_all_else + w_cats:w_dog + ## w_cats:w_art + w_cats:w_sci + w_cats:w_my_cat + w_cats:w_my_dog + ## w_cats:body_type_all_else + w_cats:drinks_all_else + w_cats:drugs_sometimes + ## w_cats:location_sf + w_cats:religion_christian + w_cats:status_all_else + ## w_dog:w_art + w_dog:w_sci + w_dog:w_my_cat + w_dog:w_my_dog + ## w_dog:body_type_all_else + w_dog:drinks_all_else + w_dog:drugs_sometimes + ## w_dog:location_sf + w_dog:religion_christian + w_dog:status_all_else + ## w_art:w_sci + w_art:w_my_cat + w_art:w_my_dog + w_art:body_type_all_else + ## w_art:drinks_all_else + w_art:drugs_sometimes + w_art:location_sf + ## w_art:religion_christian + w_art:status_all_else + w_sci:w_my_cat + ## w_sci:w_my_dog + w_sci:body_type_all_else + w_sci:drinks_all_else + ## w_sci:drugs_sometimes + w_sci:location_sf + w_sci:religion_christian + ## w_sci:status_all_else + w_my_cat:w_my_dog + w_my_cat:body_type_all_else + ## w_my_cat:drinks_all_else + w_my_cat:drugs_sometimes + w_my_cat:location_sf + ## w_my_cat:religion_christian + w_my_cat:status_all_else + ## w_my_dog:body_type_all_else + w_my_dog:drinks_all_else + ## w_my_dog:drugs_sometimes + w_my_dog:location_sf + w_my_dog:religion_christian + ## w_my_dog:status_all_else + body_type_all_else:drinks_all_else + ## body_type_all_else:drugs_sometimes + body_type_all_else:location_sf + ## body_type_all_else:religion_christian + body_type_all_else:status_all_else + ## drinks_all_else:drugs_sometimes + drinks_all_else:location_sf + ## drinks_all_else:religion_christian + drinks_all_else:status_all_else + ## drugs_sometimes:location_sf + drugs_sometimes:religion_christian + ## drugs_sometimes:status_all_else + location_sf:religion_christian + ## location_sf:status_all_else + religion_christian:status_all_else ``` --- ```r rec_int_topmain <- recipe(pets ~ ., data = ok_tr_interaction2) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_other( all_nominal(), -has_role("discarded"), -all_outcomes(), other = "all_else", threshold = 0.1) %>% step_novel( all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_interact(topmain_ints) %>% step_nzv(all_numeric(), freq_cut = 99/1) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r rec_int_topints <- recipe(pets ~ ., data = ok_tr_interaction2) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_other( all_nominal(), -has_role("discarded"), -all_outcomes(), other = "all_else", threshold = 0.1) %>% step_novel( all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_interact(topint_ints) %>% step_nzv(all_numeric(), freq_cut = 99/1) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r rec_int_all <- recipe(pets ~ ., data = ok_tr_interaction2) %>% step_textfeature(essays, prefix = "t", extract_functions = my_text_funs) %>% update_role(essays, new_role = "discarded") %>% step_mutate_at(starts_with("t_"), fn = ~ifelse(is.na(.x), 0, .x)) %>% step_log(income, starts_with("len_"), starts_with("t_"), -t_essays_sent_bing, offset = 1) %>% step_meanimpute(income) %>% step_other( all_nominal(), -has_role("discarded"), -all_outcomes(), other = "all_else", threshold = 0.1) %>% step_novel( all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_modeimpute(all_nominal(), -has_role("discarded"), -all_outcomes()) %>% step_dummy(all_nominal(), -all_outcomes(), -has_role("discarded"), one_hot = FALSE) %>% step_interact(~all_numeric():all_numeric()) %>% step_nzv(all_numeric(), freq_cut = 95/5) %>% step_upsample(pets, over_ratio = 1, seed = 42) ``` --- ```r set.seed(42) cv_splits_int <- vfold_cv(ok_tr_interaction2, v = 10, strata = pets) lst_recs <- list("noint" = rec_int_noint, "sex" = rec_int_sex, "topmain" = rec_int_topmain, "topints" = rec_int_topints, "all" = rec_int_all) ``` ```r cv_res_int <- cv_splits_int %>% bind_cols( map2_dfc(lst_recs, names(lst_recs), train_recipe, splits = cv_splits_int$splits) ) %>% pivot_longer(cols = names(lst_recs), names_to = "recipe", values_to = "AUC") %>% select(id, recipe, AUC) cv_res_int_meds <- cv_res_int %>% group_by(recipe) %>% summarise(AUC = mean(AUC), id = "All") ``` --- ```r cv_res_int %>% mutate(recipe = fct_relevel(recipe, c("noint", "sex", "topmain", "topints", "all"))) %>% ggplot(aes(recipe, AUC, group = id, color = id)) + geom_line() + geom_line(data = cv_res_int_meds, color = "black", alpha = 0.2, lwd = 2) + geom_point() + geom_point(data = cv_res_int_meds, color = "black", alpha = 0.2, size = 5, pch = 17) + guides(color = FALSE) + theme_light() ``` <img src="images/Ok-Int-1.png" width="100%" /> --- class: section-slide # Tuning Params --- ### Everything is Tunable .insight[ 💡 I can think of at least 5 semi-random quantitative decisions so far, can you? ] But let us simply vary Elastic Net `\(\alpha\)` and `\(\lambda\)`: `\(\min_{\beta_0, \beta}\frac{1}{N}\sum_{i = 1}^N w_i l(y_i,\beta_0 + \beta^Tx_i) + \lambda[(1-\alpha)||\beta||^2_2/2 + \alpha||\beta||_1]\)` ```r library(tune) mod_en_tune <- logistic_reg(penalty = tune(), mixture = tune()) %>% set_engine("glmnet") en_grid <- grid_regular(penalty(range(-6, 0)), mixture(range(0, 1)), levels = c(7, 11)) ``` --- ```r ok_tr_tuning2 <- full_features(ok_tr_tuning) set.seed(42) cv_splits_tune <- vfold_cv(ok_tr_tuning2, v = 10, strata = pets) ``` ```r tune_res <- tune_grid(object = mod_en_tune, preprocessor = rec_eng_commonsense, resamples = cv_splits_tune, grid = en_grid, metrics = metric_set(roc_auc), control = control_grid(verbose = TRUE)) collect_metrics_tune_res <- collect_metrics(tune_res) collect_metrics_tune_res ``` ``` ## # A tibble: 77 x 7 ## penalty mixture .metric .estimator mean n std_err ## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 0.000001 0 roc_auc binary 0.775 10 0.0173 ## 2 0.000001 0.1 roc_auc binary 0.768 10 0.0168 ## 3 0.000001 0.2 roc_auc binary 0.767 10 0.0167 ## 4 0.000001 0.3 roc_auc binary 0.767 10 0.0167 ## 5 0.000001 0.4 roc_auc binary 0.766 10 0.0167 ## 6 0.000001 0.5 roc_auc binary 0.766 10 0.0167 ## 7 0.000001 0.6 roc_auc binary 0.767 10 0.0168 ## 8 0.000001 0.7 roc_auc binary 0.767 10 0.0168 ## 9 0.000001 0.8 roc_auc binary 0.767 10 0.0168 ## 10 0.000001 0.9 roc_auc binary 0.767 10 0.0167 ## # ... with 67 more rows ``` --- ```r collect_metrics_tune_res %>% mutate(auc = mean, mixture = factor(mixture)) %>% ggplot(aes(penalty, auc, color = mixture)) + geom_line() + geom_point() + theme_light() ``` <img src="images/Ok-Tune-1.png" width="100%" /> --- ```r show_best(tune_res, n = 10, metric = "roc_auc") ``` ``` ## # A tibble: 10 x 7 ## penalty mixture .metric .estimator mean n std_err ## <dbl> <dbl> <chr> <chr> <dbl> <int> <dbl> ## 1 0.01 0.6 roc_auc binary 0.783 10 0.0166 ## 2 0.01 0.7 roc_auc binary 0.783 10 0.0165 ## 3 0.01 0.5 roc_auc binary 0.783 10 0.0168 ## 4 0.01 0.8 roc_auc binary 0.783 10 0.0165 ## 5 0.01 0.9 roc_auc binary 0.783 10 0.0163 ## 6 0.01 0.4 roc_auc binary 0.783 10 0.0169 ## 7 0.01 1 roc_auc binary 0.782 10 0.0160 ## 8 0.01 0.3 roc_auc binary 0.781 10 0.0170 ## 9 0.1 0 roc_auc binary 0.781 10 0.0166 ## 10 0.1 0.1 roc_auc binary 0.779 10 0.0157 ``` .insight[ 💡 So what would you do? ] --- class: section-slide # Cutoff choice --- ### The model is there. Remember it is best to predict a score, let the model's "clients" choose what to do with that score. Also remember the "I don't know" strategy! But if we must, let us provide: - a score to maximize precision of `cats` .font80percent[(while maintaing acceptable levels of recall)] - a score to maximize recall of `cats` .font80percent[(while maintaing acceptable levels of precision)] - a middle score --- ```r ok_tr_combined <- full_features( bind_rows( ok_tr_missing, ok_tr_imbalance, ok_tr_eng, ok_tr_interaction, ok_tr_tuning)) ok_tr_cutoff2 <- full_features(ok_tr_cutoff) rec_cutoff <- rec_eng_commonsense %>% prep(ok_tr_combined) ok_tr_for_cutoff <- rec_cutoff %>% juice() ok_te_for_cutoff <- rec_cutoff %>% bake(ok_tr_cutoff2) mod_lr_cutoff <- logistic_reg(penalty = 0.01, mixture = 0.5) %>% set_engine("glmnet") mod_lr_cutoff <- mod_lr_cutoff %>% fit(pets ~ ., data = ok_tr_for_cutoff) pred_cutoff <- mod_lr_cutoff %>% predict(new_data = ok_te_for_cutoff, penalty = 0.01, type = "prob") %>% mutate(truth = ok_te_for_cutoff$pets) ``` --- ```r library(pROC) cutoffs <- seq(0.1, 0.9, 0.1) roc_obj <- roc(pred_cutoff$truth, pred_cutoff$.pred_cats, levels = c("dogs", "cats")) coords(roc_obj, x = cutoffs, ret = c("accuracy", "recall", "precision", "specificity", "npv"), transpose = TRUE) ``` ``` ## 0.1 0.2 0.3 0.4 0.5 0.6 ## accuracy 0.21124829 0.3004115 0.4471879 0.5747599 0.7229081 0.7997257 ## recall 1.00000000 0.9833333 0.8916667 0.7250000 0.5916667 0.4416667 ## precision 0.17266187 0.1884984 0.2152918 0.2390110 0.3169643 0.4015152 ## specificity 0.05582923 0.1658456 0.3596059 0.5451560 0.7487685 0.8702791 ## npv 1.00000000 0.9805825 0.9439655 0.9095890 0.9029703 0.8877722 ## 0.7 0.8 0.9 ## accuracy 0.8312757 0.8504801 0.84362140 ## recall 0.2833333 0.1666667 0.06666667 ## precision 0.4788732 0.6896552 0.80000000 ## specificity 0.9392447 0.9852217 0.99671593 ## npv 0.8693009 0.8571429 0.84422809 ``` --- class: section-slide # Predict on Test set --- ### Long time no see! ```r set.seed(42) random_cats_to_get_rid_of <- okcupid_test %>% mutate(id = 1:n()) %>% filter(pets == "cats") %>% sample_n(40) %>% pull(id) okcupid_test2 <- okcupid_test %>% mutate(id = 1:n()) %>% filter(!id %in% random_cats_to_get_rid_of) %>% select(-id) okcupid_test2 <- full_features(okcupid_test2) okcupid_test2 %>% count(pets) %>% mutate(pct = n / sum(n)) ``` ``` ## # A tibble: 2 x 3 ## pets n pct ## <fct> <int> <dbl> ## 1 cats 653 0.165 ## 2 dogs 3307 0.835 ``` --- ```r okcupid_test_baked <- rec_cutoff %>% bake(okcupid_test2) mod_lr_test <- logistic_reg(penalty = 0.01, mixture = 0.5) %>% set_engine("glmnet") mod_lr_test <- mod_lr_test %>% fit(pets ~ ., data = ok_tr_for_cutoff) pred_test <- mod_lr_test %>% predict(new_data = okcupid_test_baked, penalty = 0.01, type = "prob") %>% mutate(truth = okcupid_test_baked$pets) ``` ```r roc_auc(pred_test, truth, .pred_cats) ``` ``` ## # A tibble: 1 x 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.790 ``` --- ```r roc_lr <- roc_curve( tibble(truth = pred_test$truth, cats = pred_test$.pred_cats), truth, cats) autoplot(roc_lr) + labs(x = "1 - Recall(Dogs)", y = "Recall(Cats)") ``` <img src="images/Ok-Test-ROC-Curve-1.png" width="60%" /> --- ```r roc_obj <- roc(pred_test$truth, pred_test$.pred_cats, levels = c("dogs", "cats")) cutoffs <- c(0.3, 0.5, 0.9) coords(roc_obj, x = cutoffs, ret = c("accuracy", "recall", "precision", "specificity", "npv"), transpose = TRUE) ``` ``` ## 0.3 0.5 0.9 ## accuracy 0.4530303 0.7323232 0.85050505 ## recall 0.9157734 0.6860643 0.09954058 ## precision 0.2207457 0.3438219 0.94202899 ## specificity 0.3616571 0.7414575 0.99879044 ## npv 0.9560352 0.9228453 0.84888204 ``` --- ### What *is* the model? ```r tidy(mod_lr_test) %>% filter(near(lambda, 0.01, tol = 0.0005)) %>% arrange(-abs(estimate)) %>% select(term, estimate, lambda) %>% head(20) ``` ``` ## # A tibble: 20 x 3 ## term estimate lambda ## <chr> <dbl> <dbl> ## 1 w_my_cat -2.08 0.0102 ## 2 w_my_dog 1.60 0.0102 ## 3 w_kitt -1.06 0.0102 ## 4 (Intercept) -0.834 0.0102 ## 5 religion_christian 0.821 0.0102 ## 6 w_cats -0.815 0.0102 ## 7 w_have_cat -0.759 0.0102 ## 8 w_love_dog 0.699 0.0102 ## 9 w_dog 0.612 0.0102 ## 10 body_type_all_else -0.571 0.0102 ## 11 diet_all_else -0.535 0.0102 ## 12 status_all_else -0.505 0.0102 ## 13 w_pupp 0.441 0.0102 ## 14 drugs_sometimes -0.426 0.0102 ## 15 job_all_else -0.418 0.0102 ## 16 ethnicity_white -0.416 0.0102 ## 17 body_type_average -0.391 0.0102 ## 18 drugs_all_else -0.364 0.0102 ## 19 w_have_a_dog 0.350 0.0102 ## 20 religion_all_else 0.345 0.0102 ``` --- ### What did we achieve? - ✔️ Reproducible model - ✔️ Dependable model - ✔️ Explainable model - ✔️ Small and Fast as hell model - ❌ Impressive model ### Is that the end? No. .insight[ 💡 What *didn't* we do? ] --- ### Few Excellent Books .pull-left[ <img src = "images/apm_cover.png" style="width: 85%"> ] .pull-right[ <img src = "images/fes_cover.jpg" style="width: 100%"> ]