class: logo-slide --- class: title-slide ## Tidy Data Wrangling - Part B ### Applications of Data Science - Class 3 ### Giora Simchoni #### `gsimchoni@gmail.com and add #dsapps in subject` ### Stat. and OR Department, TAU ### 2020-02-28 --- layout: true <div class="my-footer"> <span> <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science </a> </span> </div> --- class: section-slide # Joining Tables --- class: section-slide # Detour: The Starwars Dataset(s) --- ## The Starwars Dataset(s) <img src = "images/starwars_db_schema.png" style="width: 100%"> ```r sw_tables <- read_rds("../data/sw_tables.rds") characters <- sw_tables$characters planets <- sw_tables$planets films <- sw_tables$films ``` .insight[ 💡 What are the advantages/disadvantages of storing data in such a way? ] --- ```r glimpse(characters) ``` ``` ## Observations: 173 ## Variables: 14 ## $ character_id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3... ## $ name <chr> "Luke Skywalker", "Luke Skywalker", "Luke Skywalk... ## $ gender <chr> "male", "male", "male", "male", "male", NA, NA, N... ## $ homeworld_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8... ## $ height <dbl> 172, 172, 172, 172, 172, 167, 167, 167, 167, 167,... ## $ mass <dbl> 77, 77, 77, 77, 77, 75, 75, 75, 75, 75, 75, 32, 3... ## $ hair_color <chr> "blond", "blond", "blond", "blond", "blond", NA, ... ## $ skin_color <chr> "fair", "fair", "fair", "fair", "fair", "gold", "... ## $ eye_color <chr> "blue", "blue", "blue", "blue", "blue", "yellow",... ## $ birth_year <dbl> 19.0, 19.0, 19.0, 19.0, 19.0, 112.0, 112.0, 112.0... ## $ film_id <dbl> 1, 2, 3, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6... ## $ species <list> ["http://swapi.co/api/species/1/", "http://swapi... ## $ vehicles <list> [<"http://swapi.co/api/vehicles/14/", "http://sw... ## $ starships <list> [<"http://swapi.co/api/starships/12/", "http://s... ``` --- ```r glimpse(planets) ``` ``` ## Observations: 61 ## Variables: 10 ## $ planet_id <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,... ## $ name <chr> "Tatooine", "Alderaan", "Yavin IV", "Hoth", "D... ## $ rotation_period <dbl> 23, 24, 24, 23, 23, 12, 18, 26, 24, 27, 30, 27... ## $ orbital_period <dbl> 304, 364, 4818, 549, 341, 5110, 402, 312, 368,... ## $ diameter <dbl> 10465, 12500, 10200, 7200, 8900, 118000, 4900,... ## $ climate <chr> "arid", "temperate", "temperate, tropical", "f... ## $ gravity <chr> "1 standard", "1 standard", "1 standard", "1.1... ## $ terrain <chr> "desert", "grasslands, mountains", "jungle, ra... ## $ surface_water <dbl> 1.0, 40.0, 8.0, 100.0, 8.0, 0.0, 8.0, 12.0, NA... ## $ population <dbl> 2.0e+05, 2.0e+09, 1.0e+03, NA, NA, 6.0e+06, 3.... ``` ```r glimpse(films) ``` ``` ## Observations: 7 ## Variables: 7 ## $ film_id <dbl> 1, 2, 3, 4, 5, 6, 7 ## $ title <chr> "A New Hope", "The Empire Strikes Back", "Return... ## $ episode_id <int> 4, 5, 6, 1, 2, 3, 7 ## $ opening_crawl <chr> "It is a period of civil war.\r\nRebel spaceship... ## $ director <chr> "George Lucas", "Irvin Kershner", "Richard Marqu... ## $ producer <chr> "Gary Kurtz, Rick McCallum", "Gary Kutz, Rick Mc... ## $ release_date <chr> "1977-05-25", "1980-05-17", "1983-05-25", "1999-... ``` --- class: section-slide # End of Detour --- #### Q: Which characters appear in SW films directed by George Lucas? ```r unique( characters$name[ characters$film_id %in% films$film_id[films$director == "George Lucas"] ] ) # or dplyr approach: characters %>% filter(film_id %in% (films %>% filter(director=="George Lucas") %>% pull(film_id))) %>% pull(name) %>% unique() ``` First problem with this approach: code gets messier and messier, prone to bugs and hard to debug. --- #### Q: Which characters, whose homeworld is Alderaan, appear in SW films directed by George Lucas? ```r unique( characters$name[ characters$film_id %in% films$film_id[films$director == "George Lucas"] & characters$homeworld_id == planets$planet_id[planets$name == "Alderaan"] ] ) # or dplyr approach: characters %>% filter(film_id %in% (films %>% filter(director=="George Lucas") %>% pull(film_id)), homeworld_id == (planets %>% filter(name == "Alderaan") %>% pull(planet_id))) %>% pull(name) %>% unique() # [1] "Leia Organa" "Bail Prestor Organa" "Raymus Antilles" ``` --- #### Now imagine these two tables `lines`: each film, each scene, each minute, each character, the line ```r lines <- tibble::tribble( ~film_id, ~scene_id, ~minute_id, ~character_id, ~line, 1, 1, 1, 23, "blah blah blah", 1, 1, 2, 15, "something something", 1, 1, 2, 23, "something something to you to", 1, 1, 3, 15, NA, 1, 1, 3, 23, NA, 1, 1, 4, 8, "whatever whatever" ) ``` `locations`: for each film, each scene, each minute, its location ```r locations <- tibble::tribble( ~film_id, ~scene_id, ~minute_id, ~location, 1, 1, 1, "spaceship", 1, 1, 2, "Alderaan, outdoors", 1, 1, 3, "spaceship", 1, 1, 4, "bar" ) ``` --- #### Q: Which characters say "something" in a bar? - filter only `lines` which contain "something" - filter only "bar" `locations` and then... - if the two filtered tables match on film, scene and minute... - we take the unique characters - but how do we match? `for` loop*? ➡️ Clearly, second problem with this approach: it doesn't generalize well to more complex scenarios, where we need to match on multiple criteria And third problem: speed .font80percent[(but only in a complex scenario!)] .font80percent[* There *is* a way without using a for loop, without joining, still not recommended.] --- ## `inner_join()` (Inner) Joining two tables: ```r characters %>% inner_join(films) %>% select(character_id, name, film_id, title, director) %>% head(7) ``` ``` ## Joining, by = "film_id" ``` ``` ## # A tibble: 7 x 5 ## character_id name film_id title director ## <int> <chr> <dbl> <chr> <chr> ## 1 1 Luke Skywalker 1 A New Hope George Lucas ## 2 1 Luke Skywalker 2 The Empire Strikes B~ Irvin Kershner ## 3 1 Luke Skywalker 3 Return of the Jedi Richard Marqua~ ## 4 1 Luke Skywalker 6 Revenge of the Sith George Lucas ## 5 1 Luke Skywalker 7 The Force Awakens J. J. Abrams ## 6 2 C-3PO 1 A New Hope George Lucas ## 7 2 C-3PO 2 The Empire Strikes B~ Irvin Kershner ``` --- (Inner) Joining multiple tables: ```r characters %>% inner_join(films) %>% inner_join(planets, by = c("homeworld_id" = "planet_id")) %>% select(character_id, name.x, film_id, title, director, name.y, climate) %>% head(7) ``` ``` ## Joining, by = "film_id" ``` ``` ## # A tibble: 7 x 7 ## character_id name.x film_id title director name.y climate ## <int> <chr> <dbl> <chr> <chr> <chr> <chr> ## 1 1 Luke Skyw~ 1 A New Hope George Lucas Tatoo~ arid ## 2 1 Luke Skyw~ 2 The Empire S~ Irvin Kersh~ Tatoo~ arid ## 3 1 Luke Skyw~ 3 Return of th~ Richard Mar~ Tatoo~ arid ## 4 1 Luke Skyw~ 6 Revenge of t~ George Lucas Tatoo~ arid ## 5 1 Luke Skyw~ 7 The Force Aw~ J. J. Abrams Tatoo~ arid ## 6 2 C-3PO 1 A New Hope George Lucas Tatoo~ arid ## 7 2 C-3PO 2 The Empire S~ Irvin Kersh~ Tatoo~ arid ``` --- (Inner) Joining on multiple keys: ```r lines %>% inner_join(locations, by = c("film_id", "scene_id", "minute_id")) ``` ``` ## # A tibble: 6 x 6 ## film_id scene_id minute_id character_id line location ## <dbl> <dbl> <dbl> <dbl> <chr> <chr> ## 1 1 1 1 23 blah blah blah spaceship ## 2 1 1 2 15 something something Alderaan, ou~ ## 3 1 1 2 23 something somethin~ Alderaan, ou~ ## 4 1 1 3 15 <NA> spaceship ## 5 1 1 3 23 <NA> spaceship ## 6 1 1 4 8 whatever whatever bar ``` .insight[ 💡 The base R function for joining is `merge()` which tends to be slower. ] --- ## What would `inner_join()` do? Q: Which characters appear in SW films directed by George Lucas? ```r # naive characters %>% inner_join(films, by = "film_id") %>% filter(director == "George Lucas") %>% pull(name) %>% unique() # smarter characters %>% inner_join(films %>% filter(director == "George Lucas"), by = "film_id") %>% pull(name) %>% unique() ``` .insight[ 💡 What else could you do to make `inner_join`'s life easier? ] --- ## What would `inner_join()` do? Q: Which characters, whose homeworld is Alderaan, appear in SW films directed by George Lucas? ```r characters %>% inner_join(films %>% filter(director == "George Lucas"), by = "film_id") %>% inner_join(planets %>% filter(name == "Alderaan"), by = c("homeworld_id" = "planet_id"), suffix = c("_char", "_planet")) %>% pull(name_char) %>% unique() ``` ``` ## [1] "Leia Organa" "Bail Prestor Organa" "Raymus Antilles" ``` --- ### Note of caution: Join is not always faster! ```r baseR_no_join <- function() { unique(characters$name[characters$film_id %in% films$film_id[films$director == "George Lucas"]]) } baseR_with_join <- function() { unique(merge(characters, films[films$director == "George Lucas", ], by = "film_id")$name) } dplyr_no_join <- function() { characters %>% filter(film_id %in% (films %>% filter(director=="George Lucas") %>% pull(film_id))) %>% pull(name) %>% unique() } dplyr_with_join <- function() { characters %>% inner_join(films %>% filter(director=="George Lucas"), by = "film_id") %>% pull(name) %>% unique() } ``` --- ```r library(microbenchmark) res <- microbenchmark(baseR_no_join(), baseR_with_join(), dplyr_no_join(), dplyr_with_join(), times = 20) autoplot(res) ``` <img src="images/Join-Slow2-1.png" width="50%" /> --- ### So when does it shine? - See [this](https://stackoverflow.com/q/59183599/4095235) StackOverflow question and answer for a good example* - But in general: when data gets bigger, when multiple keys are involved .font80percent[* Yes, I know, I answered my own question, 🤦] ### Other types of Joins? Definitely, read [here](https://stat545.com/join-cheatsheet.html), [here](https://r4ds.had.co.nz/relational-data.html) and [here](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins) about: - `left_join()` - `right_join()` - `full_join()` - `anti_join()` - `semi_join()` --- ## All Joins <img src = "images/joins_explained1.png" style="width: 90%"> .font80percent[ Source: [The Tidyverse Cookbook](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins) ] --- ## All Joins <img src = "images/joins_explained2.png" style="width: 90%"> .font80percent[ Source: [The Tidyverse Cookbook](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins) ] --- class: section-slide # Tidying Tables --- class: section-slide # Detour: The Migration Dataset --- ## The Migration Dataset - Straight from the [UN, Dept. of Economic and Social Affairs, Population Division](https://www.un.org/en/development/desa/population/index.asp) - "Monitoring global population trends" - For each country, how many (men, women and total) migrated to each country - In 1990, 1995, 2000, 2005, 2010, 2015, 2019 - How would *you* give access to these data? --- ### You want dirty data? Well dirty data costs! <img src = "images/migration_excel.png" style="width: 100%"> --- ### What's untidy about the migration Excel file? - It's an Excel file 😠 - Multiple sheets - Color coded, font coded (bold), space coded! - Logo, free text in header lines - Merged cells - French letters (anything but [A-Za-z] can break code) - Spaces, parentheses in column names - Different `NA` values: "..", "-" - Variable "country_dest" contains sub-total and totals for categories, continents... - Variable "country_orig" violates Tidy rule no. 1: not in its own column --- ### "Today Only, The Landlord Went Nuts!" - ~~It's an Excel file 😠~~ - ~~Multiple sheets~~ - ~~Color coded, font coded (bold), space coded!~~ - ~~Logo, free text in header lines~~ - ~~Merged cells~~ - ~~French letters (anything but [A-Za-z] can break code)~~ - ~~Spaces, parentheses in column names~~ - ~~Different `NA` values: "..", "-"~~ - ~~Variable "country_dest" contains sub-total and totals for categories, continents...~~ - Variable "country_orig" violates Tidy rule no. 1: not in its own column --- ## The much nicer `migration` table ```r migration <- read_rds("../data/migration.rds") migration ``` ``` ## # A tibble: 3,248 x 236 ## gender year code country_dest afghanistan albania algeria ## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 men 1990 108 burundi 0 0 0 ## 2 men 1990 174 comoros 0 0 0 ## 3 men 1990 262 djibouti 0 0 0 ## 4 men 1990 232 eritrea 0 0 0 ## 5 men 1990 231 ethiopia 0 0 0 ## 6 men 1990 404 kenya 0 0 0 ## 7 men 1990 450 madagascar 0 0 0 ## 8 men 1990 454 malawi 0 0 0 ## 9 men 1990 480 mauritius 0 0 0 ## 10 men 1990 175 mayotte 0 0 0 ## # ... with 3,238 more rows, and 229 more variables: american_samoa <dbl>, ## # andorra <dbl>, angola <dbl>, anguilla <dbl>, ## # antigua_and_barbuda <dbl>, argentina <dbl>, armenia <dbl>, ## # aruba <dbl>, australia <dbl>, austria <dbl>, azerbaijan <dbl>, ## # bahamas <dbl>, bahrain <dbl>, bangladesh <dbl>, barbados <dbl>, ## # belarus <dbl>, belgium <dbl>, belize <dbl>, benin <dbl>, ## # bermuda <dbl>, bhutan <dbl>, bolivia_plurinational_state_of <dbl>, ## # bonaire_sint_eustatius_and_saba <dbl>, bosnia_and_herzegovina <dbl>, ## # botswana <dbl>, brazil <dbl>, british_virgin_islands <dbl>, ## # brunei_darussalam <dbl>, bulgaria <dbl>, burkina_faso <dbl>, ## # burundi <dbl>, cabo_verde <dbl>, cambodia <dbl>, cameroon <dbl>, ## # canada <dbl>, cayman_islands <dbl>, central_african_republic <dbl>, ## # chad <dbl>, channel_islands <dbl>, chile <dbl>, china <dbl>, ## # china_hong_kong_sar <dbl>, china_macao_sar <dbl>, colombia <dbl>, ## # comoros <dbl>, congo <dbl>, cook_islands <dbl>, costa_rica <dbl>, ## # côte_divoire <dbl>, croatia <dbl>, cuba <dbl>, curaçao <dbl>, ## # cyprus <dbl>, czechia <dbl>, dem._peoples_republic_of_korea <dbl>, ## # democratic_republic_of_the_congo <dbl>, denmark <dbl>, djibouti <dbl>, ## # dominica <dbl>, dominican_republic <dbl>, ecuador <dbl>, egypt <dbl>, ## # el_salvador <dbl>, equatorial_guinea <dbl>, eritrea <dbl>, ## # estonia <dbl>, eswatini <dbl>, ethiopia <dbl>, ## # falkland_islands_malvinas <dbl>, faroe_islands <dbl>, fiji <dbl>, ## # finland <dbl>, france <dbl>, french_guiana <dbl>, ## # french_polynesia <dbl>, gabon <dbl>, gambia <dbl>, georgia <dbl>, ## # germany <dbl>, ghana <dbl>, gibraltar <dbl>, greece <dbl>, ## # greenland <dbl>, grenada <dbl>, guadeloupe <dbl>, guam <dbl>, ## # guatemala <dbl>, guinea <dbl>, guinea_bissau <dbl>, guyana <dbl>, ## # haiti <dbl>, holy_see <dbl>, honduras <dbl>, hungary <dbl>, ## # iceland <dbl>, india <dbl>, indonesia <dbl>, ## # iran_islamic_republic_of <dbl>, iraq <dbl>, ireland <dbl>, ... ``` --- ## It's not right, but it's Ok: - How many men immigrated from Russia to Israel in 1990? ```r migration %>% filter(country_dest == "israel", year == 1990, gender == "men") %>% pull(russian_federation) ``` ``` ## [1] 80450 ``` - How many men immigrated from Israel to Russia in 1990? ```r migration %>% filter(country_dest == "russian_federation", year == 1990, gender == "men") %>% pull(israel) ``` ``` ## [1] 1395 ``` --- ## It would have been much nicer to have: ```r migration %>% filter(country_orig == "israel", country_dest == "russian_fedration", year == 1990, gender == "men") %>% pull(n_migrants) ``` Then we could put in a (simpler) function and call the opposite: ```r get_1way_migration <- function(orig, dest, gen, .year) { migration %>% filter(country_orig == orig, country_dest == dest, year == .year, gender == gen) %>% pull(n_migrants) } get_1way_migration("russian_federation", "israel") ``` --- class: section-slide # End of Detour --- ## `pivot_longer()` ```r migration_long <- migration %>% pivot_longer(cols = -c(1:4), names_to = "country_orig", values_to = "n_migrants") migration_long ``` ``` ## # A tibble: 753,536 x 6 ## gender year code country_dest country_orig n_migrants ## <chr> <dbl> <dbl> <chr> <chr> <dbl> ## 1 men 1990 108 burundi afghanistan 0 ## 2 men 1990 108 burundi albania 0 ## 3 men 1990 108 burundi algeria 0 ## 4 men 1990 108 burundi american_samoa 0 ## 5 men 1990 108 burundi andorra 0 ## 6 men 1990 108 burundi angola 0 ## 7 men 1990 108 burundi anguilla 0 ## 8 men 1990 108 burundi antigua_and_barbuda 0 ## 9 men 1990 108 burundi argentina 0 ## 10 men 1990 108 burundi armenia 0 ## # ... with 753,526 more rows ``` --- ## What sorcery is this? <img src = "images/pivot_longer_explained.png" style="width: 100%"> .font80percent[Source: [The Carpentries](http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html)] --- ## `pivot_wider()` ```r migration_wide <- migration_long %>% pivot_wider(id_cols = 1:4, names_from = country_orig, values_from = n_migrants) migration_wide ``` ``` ## # A tibble: 3,248 x 236 ## gender year code country_dest afghanistan albania algeria ## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> ## 1 men 1990 108 burundi 0 0 0 ## 2 men 1990 174 comoros 0 0 0 ## 3 men 1990 262 djibouti 0 0 0 ## 4 men 1990 232 eritrea 0 0 0 ## 5 men 1990 231 ethiopia 0 0 0 ## 6 men 1990 404 kenya 0 0 0 ## 7 men 1990 450 madagascar 0 0 0 ## 8 men 1990 454 malawi 0 0 0 ## 9 men 1990 480 mauritius 0 0 0 ## 10 men 1990 175 mayotte 0 0 0 ## # ... with 3,238 more rows, and 229 more variables: american_samoa <dbl>, ## # andorra <dbl>, angola <dbl>, anguilla <dbl>, ## # antigua_and_barbuda <dbl>, argentina <dbl>, armenia <dbl>, ## # aruba <dbl>, australia <dbl>, austria <dbl>, azerbaijan <dbl>, ## # bahamas <dbl>, bahrain <dbl>, bangladesh <dbl>, barbados <dbl>, ## # belarus <dbl>, belgium <dbl>, belize <dbl>, benin <dbl>, ## # bermuda <dbl>, bhutan <dbl>, bolivia_plurinational_state_of <dbl>, ## # bonaire_sint_eustatius_and_saba <dbl>, bosnia_and_herzegovina <dbl>, ## # botswana <dbl>, brazil <dbl>, british_virgin_islands <dbl>, ## # brunei_darussalam <dbl>, bulgaria <dbl>, burkina_faso <dbl>, ## # burundi <dbl>, cabo_verde <dbl>, cambodia <dbl>, cameroon <dbl>, ## # canada <dbl>, cayman_islands <dbl>, central_african_republic <dbl>, ## # chad <dbl>, channel_islands <dbl>, chile <dbl>, china <dbl>, ## # china_hong_kong_sar <dbl>, china_macao_sar <dbl>, colombia <dbl>, ## # comoros <dbl>, congo <dbl>, cook_islands <dbl>, costa_rica <dbl>, ## # côte_divoire <dbl>, croatia <dbl>, cuba <dbl>, curaçao <dbl>, ## # cyprus <dbl>, czechia <dbl>, dem._peoples_republic_of_korea <dbl>, ## # democratic_republic_of_the_congo <dbl>, denmark <dbl>, djibouti <dbl>, ## # dominica <dbl>, dominican_republic <dbl>, ecuador <dbl>, egypt <dbl>, ## # el_salvador <dbl>, equatorial_guinea <dbl>, eritrea <dbl>, ## # estonia <dbl>, eswatini <dbl>, ethiopia <dbl>, ## # falkland_islands_malvinas <dbl>, faroe_islands <dbl>, fiji <dbl>, ## # finland <dbl>, france <dbl>, french_guiana <dbl>, ## # french_polynesia <dbl>, gabon <dbl>, gambia <dbl>, georgia <dbl>, ## # germany <dbl>, ghana <dbl>, gibraltar <dbl>, greece <dbl>, ## # greenland <dbl>, grenada <dbl>, guadeloupe <dbl>, guam <dbl>, ## # guatemala <dbl>, guinea <dbl>, guinea_bissau <dbl>, guyana <dbl>, ## # haiti <dbl>, holy_see <dbl>, honduras <dbl>, hungary <dbl>, ## # iceland <dbl>, india <dbl>, indonesia <dbl>, ## # iran_islamic_republic_of <dbl>, iraq <dbl>, ireland <dbl>, ... ``` --- ### Where is this going to? .font80percent[ ```r get_1way_migration <- function(.country_dest, .country_orig) { migration_long %>% filter(country_dest == .country_dest, country_orig == .country_orig) %>% group_by(year) %>% tally(n_migrants) %>% mutate(direction = str_c(.country_orig, " to ", .country_dest)) } get_2way_migration <- function(country_a, country_b) { a2b <- get_1way_migration(country_b, country_a) b2a <- get_1way_migration(country_a, country_b) bind_rows(a2b, b2a) } get_2way_migration("israel", "russian_federation") %>% ggplot(aes(year, n)) + geom_line() + geom_point(color = "orange", size = 5) + labs(x = "", y = "", title = "Israel-Russia Yearly No. of immigrants") + theme_dark() + facet_wrap(~direction, scales = "free", labeller = labeller(direction = rus_isr_names)) + scale_y_continuous(labels = scales::comma_format()) + theme(axis.text = element_text(size = 12, hjust=0.9, family = "mono"), strip.text.x = element_text(size = 14, family = "mono"), plot.title = element_text(hjust = 0.5, size = 18, family = "mono")) ``` ] --- <img src="images/Where-Is-This-Going-To2-1.png" width="100%" /> --- ## Some more useful Tidying up verbs Remember? ```r table3 <- read_rds("../data/tidy_tables.rds")$table3 table3 ``` ``` ## # A tibble: 315 x 3 ## religion yob pct_straight ## <chr> <dbl> <chr> ## 1 atheist 1950 26/29 ## 2 buddhist 1950 6/6 ## 3 christian 1950 28/32 ## 4 hindu 1950 0/0 ## 5 jewish 1950 21/24 ## 6 muslim 1950 0/0 ## 7 unspecified 1950 71/76 ## 8 atheist 1951 31/33 ## 9 buddhist 1951 11/11 ## 10 christian 1951 23/24 ## # ... with 305 more rows ``` --- ## `separate()` ```r table3_tidy <- table3 %>% separate(pct_straight, into = c("straight", "total"), sep = "/") table3_tidy ``` ``` ## # A tibble: 315 x 4 ## religion yob straight total ## <chr> <dbl> <chr> <chr> ## 1 atheist 1950 26 29 ## 2 buddhist 1950 6 6 ## 3 christian 1950 28 32 ## 4 hindu 1950 0 0 ## 5 jewish 1950 21 24 ## 6 muslim 1950 0 0 ## 7 unspecified 1950 71 76 ## 8 atheist 1951 31 33 ## 9 buddhist 1951 11 11 ## 10 christian 1951 23 24 ## # ... with 305 more rows ``` --- ## `unite()` (Though see a much more generalizable approach with `purrr`) ```r table3_tidy %>% unite(col = "pct_straight", straight, total, sep = "/") ``` ``` ## # A tibble: 315 x 3 ## religion yob pct_straight ## <chr> <dbl> <chr> ## 1 atheist 1950 26/29 ## 2 buddhist 1950 6/6 ## 3 christian 1950 28/32 ## 4 hindu 1950 0/0 ## 5 jewish 1950 21/24 ## 6 muslim 1950 0/0 ## 7 unspecified 1950 71/76 ## 8 atheist 1951 31/33 ## 9 buddhist 1951 11/11 ## 10 christian 1951 23/24 ## # ... with 305 more rows ``` --- class: section-slide # Iteration without looping --- ## Fun fact: you're already not-looping! - Say you want the lengths of all of these strings: ```r strings_vec <- c("I'm feeling fine", "I'm perfectly OK", "Nothing is wrong!") ``` - Do you `for` loop? ```r strings_len <- numeric(length(strings_vec)) for (i in seq_along(strings_vec)) { strings_len[i] <- nchar(strings_vec[i]) } strings_len ``` ``` ## [1] 16 16 17 ``` --- - No, you use R's vectorized functions nature: ```r nchar(strings_vec) ``` ``` ## [1] 16 16 17 ``` - In case you were wondering: ```r microbenchmark(nchar_loop(), nchar_vectorized()) ``` ``` ## Unit: microseconds ## expr min lq mean median uq max neval cld ## nchar_loop() 2.9 3 47.688 3.1 3.2 4459.8 100 a ## nchar_vectorized() 1.0 1 15.996 1.1 1.1 1470.5 100 a ``` - So why should iterating through a `data.frame` columns or rows be any different? --- ## The problem: ```r okcupid <- read_csv("../data/okcupid.csv.zip", col_types = cols()) okcupid %>% select(essay0:essay2) %>% head(3) ``` ``` ## # A tibble: 3 x 3 ## essay0 essay1 essay2 ## <chr> <chr> <chr> ## 1 "about me:<br />\n<br /~ "currently working as a~ "making people laugh.<~ ## 2 "i am a chef: this is w~ dedicating everyday to ~ "being silly. having r~ ## 3 "i'm not ashamed of muc~ "i make nerdy software ~ "improvising in differ~ ``` - I want to apply some transformation to one or more columns (say the length of each `essay` question for each user) - I want to apply some transformation to one or more rows (say the average length of all `essay` questions for each user) - BTW, you can always use a `for` loop, see how it is done [here](https://r4ds.had.co.nz/iteration.html#for-loops) --- ## The `apply` family - There *is* a solution in base R - `apply()` - `sapply()` - `lapply()` - `tapply()` - `vapply()` - `mapply()` - What do you think is the issue with these? See [this](https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply). --- ## I don't `for`, I `purrr` The `purrr` package provides a set of functions to make iteration easier: - No boilerplate code for looping --> less looping bugs - Focus on the function, the action, not the plumbing - Generally faster (implemented in C) - Definitely more clear, concise and elegant code TBH, I'm addicted 😢 --- ## You get a `map()`, you get a `map()`! | Single | Two | Multiple | Returns | Of | |---------------|-------------|------|--------|-------| | `map()` | `map2()` | `pmap()` | `list` | `?`| | `map_lgl()` | `map2_lgl()` | `pmap_lgl()` |`vector`| `logical` | | `map_chr()` | `map2_chr()` | `pmap_chr()` |`vector`| `character` | | `map_int()` | `map2_int()` | `pmap_int()` |`vector`| `integer` | | `map_dbl()` | `map2_dbl()` | `pmap_dbl()` |`vector`| `double` | | `map_dfr()` | `map2_dfr()` | `pmap_dfr()` | `tibble` | `?` | Where "Single" means "single vector/column input", "Two" means "two vectors/columns input" etc. .font80percent[(Tip of the iceberg really, I want you to survive this slide)] --- ### Example1: Vectorizing a Function Take a clearly not-vectorized function: ```r my_func <- function(x) { if (x %% 2 == 0) return("even") "odd" } my_func(10) ``` ``` ## [1] "even" ``` ```r my_func(1:5) ``` <pre style="color: red;"><code>## Warning in if (x%%2 == 0) return("even"): the condition has length > 1 and ## only the first element will be used </code></pre> ``` ## [1] "odd" ``` .insight[ 💡 This is silly example, do you know how to easily vectorize this function? ] --- `map()` will always return a list: ```r map(1:3, my_func) ``` ``` ## [[1]] ## [1] "odd" ## ## [[2]] ## [1] "even" ## ## [[3]] ## [1] "odd" ``` ```r 1:3 %>% map(my_func) ``` ``` ## [[1]] ## [1] "odd" ## ## [[2]] ## [1] "even" ## ## [[3]] ## [1] "odd" ``` --- `map_chr()` will always return a vector of `character`: ```r map_chr(1:3, my_func) ``` ``` ## [1] "odd" "even" "odd" ``` ```r 1:3 %>% map_chr(my_func) ``` ``` ## [1] "odd" "even" "odd" ``` But here is the beautiful thing: ```r my_func_vectorized <- function(vec) map_chr(vec, my_func) my_func_vectorized(1:3) ``` ``` ## [1] "odd" "even" "odd" ``` Look Ma, no loops! --- ### Example2: Complex `mutate()` Manager: Add me a column, for each OkCupid user, whether he/she's above average height. ```r is_above_average_height <- function(sex, height_cm) { if (sex == "m") height_cm > 180 else height_cm > 165 } okcupid <- okcupid %>% mutate(is_tall = map2_lgl(sex, height_cm, is_above_average_height)) okcupid %>% select(sex, height_cm, is_tall) %>% group_by(sex) %>% sample_n(3) ``` ``` ## # A tibble: 6 x 3 ## # Groups: sex [2] ## sex height_cm is_tall ## <chr> <dbl> <lgl> ## 1 f 170. TRUE ## 2 f 163. FALSE ## 3 f 165. TRUE ## 4 m 175. FALSE ## 5 m 178. FALSE ## 6 m 175. FALSE ``` .font80percent[This specific example might not be *that* bad with `ifelse()`, what will you do with 5 variables?] --- ### Example2: You could even supply args ```r is_above_average_height <- function(sex, height_cm, men_avg, women_avg) { if(sex == "m") height_cm > men_avg else height_cm > women_avg } okcupid <- okcupid %>% mutate(is_tall = map2_lgl(sex, height_cm, is_above_average_height, men_avg = 180, women_avg = 165)) okcupid %>% select(sex, height_cm, is_tall) %>% group_by(sex) %>% sample_n(3) ``` ``` ## # A tibble: 6 x 3 ## # Groups: sex [2] ## sex height_cm is_tall ## <chr> <dbl> <lgl> ## 1 f 168. TRUE ## 2 f 155. FALSE ## 3 f 178. TRUE ## 4 m 178. FALSE ## 5 m 178. FALSE ## 6 m 180. TRUE ``` --- ### Example2: Anonymous Functions ```r okcupid <- okcupid %>% mutate(is_tall = map2_lgl(sex, height_cm, function(x, y) if(x == "m") y > 180 else y > 165)) ``` Heck, you can even: ```r okcupid <- okcupid %>% mutate(is_tall = map2_lgl(sex, height_cm, ~{if(.x == "m") .y > 180 else .y > 165})) ``` .insight[ 💡 There's a thin line between elegance and unreadable undebuggable bragging. ] --- ### Example 3: Remember our problem? For each `essay` column, add a column with its length: ```r okcupid %>% mutate_at(vars(essay0:essay9), list("len" = str_length)) %>% select(starts_with("essay")) %>% head(3) ``` ``` ## # A tibble: 3 x 20 ## essay0 essay1 essay2 essay3 essay4 essay5 essay6 essay7 essay8 essay9 ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 "abou~ "curr~ "maki~ "the ~ "book~ "food~ duali~ "tryi~ "i am~ "you ~ ## 2 "i am~ dedic~ "bein~ <NA> "i am~ "deli~ <NA> <NA> i am ~ <NA> ## 3 "i'm ~ "i ma~ "impr~ "my l~ "okay~ "move~ <NA> viewi~ "when~ "you ~ ## # ... with 10 more variables: essay0_len <int>, essay1_len <int>, ## # essay2_len <int>, essay3_len <int>, essay4_len <int>, ## # essay5_len <int>, essay6_len <int>, essay7_len <int>, ## # essay8_len <int>, essay9_len <int> ``` `purrr` not needed, like I said, you've already been non-looping! --- ### Example 3: Input multiple columns ~~For each user, compute the average `essay` length:~~ Wait, before that, for each user compute the average of the 3 first `essay`s: ```r mean_length_3essay <- function(x, y, z) { mean(str_length(c(x, y, z)), na.rm = TRUE) } okcupid %>% mutate(essay3_avglen = pmap_dbl( list(essay0, essay1, essay2), mean_length_3essay)) %>% select(essay0:essay2, essay3_avglen) %>% head(2) ``` ``` ## # A tibble: 2 x 4 ## essay0 essay1 essay2 essay3_avglen ## <chr> <chr> <chr> <dbl> ## 1 "about me:<br />\n~ "currently working~ "making people lau~ 557. ## 2 "i am a chef: this~ dedicating everyda~ "being silly. havi~ 280. ``` --- OK now, for each user, compute the average `essay` length: ```r mean_length_essay <- function(...) { mean(str_length(c(...)), na.rm = TRUE) } okcupid %>% mutate(essay_avglen = pmap_dbl( select(., starts_with("essay")), mean_length_essay)) %>% select(essay0:essay3, essay_avglen) %>% head(3) ``` ``` ## # A tibble: 3 x 5 ## essay0 essay1 essay2 essay3 essay_avglen ## <chr> <chr> <chr> <chr> <dbl> ## 1 "about me:<br ~ "currently wo~ "making peopl~ "the way i lo~ 264. ## 2 "i am a chef: ~ dedicating ev~ "being silly.~ <NA> 241. ## 3 "i'm not asham~ "i make nerdy~ "improvising ~ "my large jaw~ 612 ``` --- ### Example4: Output multiple columns ```r essay0_features <- function(essay0) { contains_love <- str_detect(essay0, "love") contains_obama <- str_detect(essay0, "obama") contains_rel <- str_detect(essay0, "relationship") list( essay0_love = contains_love, essay0_obama = contains_obama, essay0_ser_rel = contains_rel ) } okcupid %>% select(essay0) %>% map_dfc(essay0_features) ``` ``` ## # A tibble: 59,946 x 3 ## essay0_love essay0_obama essay0_ser_rel ## <lgl> <lgl> <lgl> ## 1 TRUE FALSE FALSE ## 2 TRUE FALSE FALSE ## 3 TRUE FALSE FALSE ## 4 FALSE FALSE FALSE ## 5 FALSE FALSE FALSE ## 6 TRUE FALSE FALSE ## 7 TRUE FALSE FALSE ## 8 NA NA NA ## 9 NA NA NA ## 10 TRUE FALSE FALSE ## # ... with 59,936 more rows ``` ```r # or: map_dfc(okcupid$essay0, essay0_features) ``` --- Or if you want those features as additional columns, you could bind them with `bind_cols()`: ```r okcupid %>% bind_cols( okcupid %>% select(essay0) %>% map_dfc(essay0_features) ) %>% select(age, sex, starts_with("essay0")) ``` ``` ## # A tibble: 59,946 x 6 ## age sex essay0 essay0_love essay0_obama essay0_ser_rel ## <dbl> <chr> <chr> <lgl> <lgl> <lgl> ## 1 22 m "about me:<br />\n<~ TRUE FALSE FALSE ## 2 35 m "i am a chef: this ~ TRUE FALSE FALSE ## 3 38 m "i'm not ashamed of~ TRUE FALSE FALSE ## 4 23 m i work in a library~ FALSE FALSE FALSE ## 5 29 m "hey how's it going~ FALSE FALSE FALSE ## 6 29 m "i'm an australian ~ TRUE FALSE FALSE ## 7 32 f "life is about the ~ TRUE FALSE FALSE ## 8 31 f <NA> NA NA NA ## 9 24 f <NA> NA NA NA ## 10 37 m "my names jake.<br ~ TRUE FALSE FALSE ## # ... with 59,936 more rows ``` --- ## Dealing with failure ```r my_func("a") ``` <pre style="color: red;"><code>## Error in x%%2: non-numeric argument to binary operator </code></pre> Silly example, but: (a) When dealing with big data expect the unexpected (input) (b) You don't want your app to crash Look at: `safely()`, `quietly()` and `possibly()` which wrap your code nicely and protect you from the unexpected (crash). --- ### My favorite: `possibly()` ```r my_func_safe <- possibly(my_func, otherwise = NA) my_func_safe("a") ``` ``` ## [1] NA ``` ```r map_chr(list(1, 2, "3", 4), my_func_safe) ``` ``` ## [1] "odd" "even" NA "even" ``` .insight[ 💡 See a more realistic example when we talk about Web Scraping ] .insight[ 💡 What would `map_chr(c(1, 2, "3", 4), my_func_safe)` return? ] --- ## `walk()`, `walk2()`, `pwalk()` You don't always need to *return* anything, you just wanna loop ```r walk(1:10, ~print("Hey Girrrl")) ``` ``` ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ## [1] "Hey Girrrl" ``` --- ```r plot_norm <- function(mu, sd) { plot(density(rnorm(100, mu, sd)), main = str_c("mu: ", mu, ", sd: ", sd)) } par(mfcol = c(1, 2)) walk2(c(0, 10), c(1, 0.1), plot_norm) ``` <img src="images/Walk2-1.png" width="80%" /> --- ### In the words of Hadley > Once you master these functions, you’ll find it takes much less time to solve iteration problems. <br> > But you should never feel bad about using a for loop instead of a map function. <br> > The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code. <br> > Some people will tell you to avoid for loops because they are slow. They’re wrong!