Tidy Data Wrangling

---

## Tidy Data Wrangling - Part B

### Applications of Data Science - Class 3

### Giora Simchoni

#### `gsimchoni@gmail.com and add #dsapps in subject`

### Stat. and OR Department, TAU
### 2020-02-28

---

<div class="my-footer">
  <span>
    <a href="https://dsapps-2020.github.io/Class_Slides/" target="_blank">Applications of Data Science
    </a>
  </span>
</div>

---

# Joining Tables

---
class: section-slide

# Detour: The Starwars Dataset(s)

---

## The Starwars Dataset(s)

```r
sw_tables <- read_rds("../data/sw_tables.rds")
characters <- sw_tables$characters
planets <- sw_tables$planets
films <- sw_tables$films
```

```r
glimpse(characters)
```

```
## Observations: 173
## Variables: 14
## $ character_id <int> 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3...
## $ name         <chr> "Luke Skywalker", "Luke Skywalker", "Luke Skywalk...
## $ gender       <chr> "male", "male", "male", "male", "male", NA, NA, N...
## $ homeworld_id <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8, 8, 8, 8...
## $ height       <dbl> 172, 172, 172, 172, 172, 167, 167, 167, 167, 167,...
## $ mass         <dbl> 77, 77, 77, 77, 77, 75, 75, 75, 75, 75, 75, 32, 3...
## $ hair_color   <chr> "blond", "blond", "blond", "blond", "blond", NA, ...
## $ skin_color   <chr> "fair", "fair", "fair", "fair", "fair", "gold", "...
## $ eye_color    <chr> "blue", "blue", "blue", "blue", "blue", "yellow",...
## $ birth_year   <dbl> 19.0, 19.0, 19.0, 19.0, 19.0, 112.0, 112.0, 112.0...
## $ film_id      <dbl> 1, 2, 3, 6, 7, 1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6...
## $ species      <list> ["http://swapi.co/api/species/1/", "http://swapi...
## $ vehicles     <list> [<"http://swapi.co/api/vehicles/14/", "http://sw...
## $ starships    <list> [<"http://swapi.co/api/starships/12/", "http://s...
```

---

```r
glimpse(planets)
```

```
## Observations: 61
## Variables: 10
## $ planet_id       <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14,...
## $ name            <chr> "Tatooine", "Alderaan", "Yavin IV", "Hoth", "D...
## $ rotation_period <dbl> 23, 24, 24, 23, 23, 12, 18, 26, 24, 27, 30, 27...
## $ orbital_period  <dbl> 304, 364, 4818, 549, 341, 5110, 402, 312, 368,...
## $ diameter        <dbl> 10465, 12500, 10200, 7200, 8900, 118000, 4900,...
## $ climate         <chr> "arid", "temperate", "temperate, tropical", "f...
## $ gravity         <chr> "1 standard", "1 standard", "1 standard", "1.1...
## $ terrain         <chr> "desert", "grasslands, mountains", "jungle, ra...
## $ surface_water   <dbl> 1.0, 40.0, 8.0, 100.0, 8.0, 0.0, 8.0, 12.0, NA...
## $ population      <dbl> 2.0e+05, 2.0e+09, 1.0e+03, NA, NA, 6.0e+06, 3....
```

```r
glimpse(films)
```

```
## Observations: 7
## Variables: 7
## $ film_id       <dbl> 1, 2, 3, 4, 5, 6, 7
## $ title         <chr> "A New Hope", "The Empire Strikes Back", "Return...
## $ episode_id    <int> 4, 5, 6, 1, 2, 3, 7
## $ opening_crawl <chr> "It is a period of civil war.\r\nRebel spaceship...
## $ director      <chr> "George Lucas", "Irvin Kershner", "Richard Marqu...
## $ producer      <chr> "Gary Kurtz, Rick McCallum", "Gary Kutz, Rick Mc...
## $ release_date  <chr> "1977-05-25", "1980-05-17", "1983-05-25", "1999-...
```

---
class: section-slide

# End of Detour

---

#### Q: Which characters appear in SW films directed by George Lucas?

```r
unique(
  characters$name[
    characters$film_id %in%
      films$film_id[films$director == "George Lucas"]
    ]
  )

# or dplyr approach:
characters %>%
    filter(film_id %in% (films %>%
                           filter(director=="George Lucas") %>%
                           pull(film_id))) %>%
    pull(name) %>%
    unique()
```

First problem with this approach: code gets messier and messier, prone to bugs and hard to debug.
---

#### Q: Which characters, whose homeworld is Alderaan, appear in SW films directed by George Lucas?

```r
unique(
  characters$name[
    characters$film_id %in%
      films$film_id[films$director == "George Lucas"] &
    characters$homeworld_id ==
      planets$planet_id[planets$name == "Alderaan"]
  ]
)

# or dplyr approach:
characters %>%
    filter(film_id %in% (films %>%
                           filter(director=="George Lucas") %>%
                           pull(film_id)),
           homeworld_id == (planets %>%
                              filter(name == "Alderaan") %>%
                              pull(planet_id))) %>%
    pull(name) %>%
    unique()

# [1] "Leia Organa"         "Bail Prestor Organa" "Raymus Antilles"
```

---

#### Now imagine these two tables

`lines`: each film, each scene, each minute, each character, the line

```r
lines <- tibble::tribble(
  ~film_id, ~scene_id, ~minute_id, ~character_id,                           ~line,
         1,         1,          1,            23,                "blah blah blah",
         1,         1,          2,            15,           "something something",
         1,         1,          2,            23, "something something to you to",
         1,         1,          3,            15,                              NA,
         1,         1,          3,            23,                              NA,
         1,         1,          4,             8,             "whatever whatever"
  )
```

`locations`: for each film, each scene, each minute, its location

```r
locations <- tibble::tribble(
  ~film_id, ~scene_id, ~minute_id,            ~location,
         1,         1,          1,          "spaceship",
         1,         1,          2, "Alderaan, outdoors",
         1,         1,          3,          "spaceship",
         1,         1,          4,                "bar"
  )
```

---

#### Q: Which characters say "something" in a bar?

- filter only `lines` which contain "something"
- filter only "bar" `locations` and then...
- if the two filtered tables match on film, scene and minute...
- we take the unique characters
- but how do we match? `for` loop*?

➡️ Clearly, second problem with this approach: it doesn't generalize well to more complex scenarios, where we need to match on multiple criteria

And third problem: speed .font80percent[(but only in a complex scenario!)]

.font80percent[&#42; There *is* a way without using a for loop, without joining, still not recommended.]

---

## `inner_join()`

(Inner) Joining two tables:

```r
characters %>%
  inner_join(films) %>%
  select(character_id, name, film_id, title, director) %>%
  head(7)
```

```
## Joining, by = "film_id"
```

```
## # A tibble: 7 x 5
##   character_id name           film_id title                 director       
##          <int> <chr>            <dbl> <chr>                 <chr>          
## 1            1 Luke Skywalker       1 A New Hope            George Lucas   
## 2            1 Luke Skywalker       2 The Empire Strikes B~ Irvin Kershner 
## 3            1 Luke Skywalker       3 Return of the Jedi    Richard Marqua~
## 4            1 Luke Skywalker       6 Revenge of the Sith   George Lucas   
## 5            1 Luke Skywalker       7 The Force Awakens     J. J. Abrams   
## 6            2 C-3PO                1 A New Hope            George Lucas   
## 7            2 C-3PO                2 The Empire Strikes B~ Irvin Kershner
```

---

(Inner) Joining multiple tables:

```r
characters %>%
  inner_join(films) %>%
  inner_join(planets, by = c("homeworld_id" = "planet_id")) %>%
  select(character_id, name.x, film_id, title, director, name.y, climate) %>% head(7)
```

```
## Joining, by = "film_id"
```

```
## # A tibble: 7 x 7
##   character_id name.x     film_id title         director     name.y climate
##          <int> <chr>        <dbl> <chr>         <chr>        <chr>  <chr>  
## 1            1 Luke Skyw~       1 A New Hope    George Lucas Tatoo~ arid   
## 2            1 Luke Skyw~       2 The Empire S~ Irvin Kersh~ Tatoo~ arid   
## 3            1 Luke Skyw~       3 Return of th~ Richard Mar~ Tatoo~ arid   
## 4            1 Luke Skyw~       6 Revenge of t~ George Lucas Tatoo~ arid   
## 5            1 Luke Skyw~       7 The Force Aw~ J. J. Abrams Tatoo~ arid   
## 6            2 C-3PO            1 A New Hope    George Lucas Tatoo~ arid   
## 7            2 C-3PO            2 The Empire S~ Irvin Kersh~ Tatoo~ arid
```

---

(Inner) Joining on multiple keys:

```r
lines %>%
  inner_join(locations, by = c("film_id", "scene_id", "minute_id"))
```

```
## # A tibble: 6 x 6
##   film_id scene_id minute_id character_id line                location     
##     <dbl>    <dbl>     <dbl>        <dbl> <chr>               <chr>        
## 1       1        1         1           23 blah blah blah      spaceship    
## 2       1        1         2           15 something something Alderaan, ou~
## 3       1        1         2           23 something somethin~ Alderaan, ou~
## 4       1        1         3           15 <NA>                spaceship    
## 5       1        1         3           23 <NA>                spaceship    
## 6       1        1         4            8 whatever whatever   bar
```

---

## What would `inner_join()` do?

Q: Which characters appear in SW films directed by George Lucas?

```r
# naive
characters %>%
  inner_join(films, by = "film_id") %>%
  filter(director == "George Lucas") %>%
  pull(name) %>%
  unique()

# smarter
characters %>%
  inner_join(films %>% filter(director == "George Lucas"),
             by = "film_id") %>%
  pull(name) %>%
  unique()
```

## What would `inner_join()` do?

Q: Which characters, whose homeworld is Alderaan, appear in SW films directed by George Lucas?

```r
characters %>%
  inner_join(films %>% filter(director == "George Lucas"),
             by = "film_id") %>%
  inner_join(planets %>% filter(name == "Alderaan"),
             by = c("homeworld_id" = "planet_id"),
                    suffix = c("_char", "_planet")) %>%
  pull(name_char) %>%
  unique()
```

```
## [1] "Leia Organa"         "Bail Prestor Organa" "Raymus Antilles"
```

---

### Note of caution: Join is not always faster!

```r
baseR_no_join <- function() {
  unique(characters$name[characters$film_id %in% films$film_id[films$director == "George Lucas"]])
}

baseR_with_join <- function() {
  unique(merge(characters, films[films$director == "George Lucas", ], by = "film_id")$name)
}

dplyr_no_join <- function() {
  characters %>%
    filter(film_id %in% (films %>%
                           filter(director=="George Lucas") %>%
                           pull(film_id))) %>%
    pull(name) %>%
    unique()
}

dplyr_with_join <- function() {
  characters %>%
    inner_join(films %>% filter(director=="George Lucas"), by = "film_id") %>%
    pull(name) %>%
    unique()
}
```

---

```r
library(microbenchmark)
res <- microbenchmark(baseR_no_join(), baseR_with_join(),
                      dplyr_no_join(), dplyr_with_join(), times = 20)
autoplot(res)
```

---

### So when does it shine?

- See [this](https://stackoverflow.com/q/59183599/4095235) StackOverflow question and answer for a good example*
- But in general: when data gets bigger, when multiple keys are involved

### Other types of Joins?

Definitely, read [here](https://stat545.com/join-cheatsheet.html), [here](https://r4ds.had.co.nz/relational-data.html) and [here](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins) about:
- `left_join()`
- `right_join()`
- `full_join()`
- `anti_join()`
- `semi_join()`

---

## All Joins

.font80percent[
Source: [The Tidyverse Cookbook](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins)
]
---

## All Joins

.font80percent[
Source: [The Tidyverse Cookbook](https://rstudio-education.github.io/tidyverse-cookbook/transform-tables.html#joins)
]
---
class: section-slide

# Tidying Tables

---
class: section-slide

# Detour: The Migration Dataset

---

## The Migration Dataset

- Straight from the [UN, Dept. of Economic and Social Affairs, Population Division](https://www.un.org/en/development/desa/population/index.asp)
- "Monitoring global population trends"
- For each country, how many (men, women and total) migrated to each country
- In 1990, 1995, 2000, 2005, 2010, 2015, 2019
- How would *you* give access to these data?

---

### You want dirty data? Well dirty data costs!

---

### What's untidy about the migration Excel file?

- It's an Excel file 😠
- Multiple sheets
- Color coded, font coded (bold), space coded!
- Logo, free text in header lines
- Merged cells
- French letters (anything but [A-Za-z] can break code)
- Spaces, parentheses in column names
- Different `NA` values: "..", "-"
- Variable "country_dest" contains sub-total and totals for categories, continents...
- Variable "country_orig" violates Tidy rule no. 1: not in its own column

---

### "Today Only, The Landlord Went Nuts!"

- ~~It's an Excel file 😠~~
- ~~Multiple sheets~~
- ~~Color coded, font coded (bold), space coded!~~
- ~~Logo, free text in header lines~~
- ~~Merged cells~~
- ~~French letters (anything but [A-Za-z] can break code)~~
- ~~Spaces, parentheses in column names~~
- ~~Different `NA` values: "..", "-"~~
- ~~Variable "country_dest" contains sub-total and totals for categories, continents...~~
- Variable "country_orig" violates Tidy rule no. 1: not in its own column

---

## The much nicer `migration` table

```r
migration <- read_rds("../data/migration.rds")

migration
```

```
## # A tibble: 3,248 x 236
##    gender  year  code country_dest afghanistan albania algeria
##    <chr>  <dbl> <dbl> <chr>              <dbl>   <dbl>   <dbl>
##  1 men     1990   108 burundi                0       0       0
##  2 men     1990   174 comoros                0       0       0
##  3 men     1990   262 djibouti               0       0       0
##  4 men     1990   232 eritrea                0       0       0
##  5 men     1990   231 ethiopia               0       0       0
##  6 men     1990   404 kenya                  0       0       0
##  7 men     1990   450 madagascar             0       0       0
##  8 men     1990   454 malawi                 0       0       0
##  9 men     1990   480 mauritius              0       0       0
## 10 men     1990   175 mayotte                0       0       0
## # ... with 3,238 more rows, and 229 more variables: american_samoa <dbl>,
## #   andorra <dbl>, angola <dbl>, anguilla <dbl>,
## #   antigua_and_barbuda <dbl>, argentina <dbl>, armenia <dbl>,
## #   aruba <dbl>, australia <dbl>, austria <dbl>, azerbaijan <dbl>,
## #   bahamas <dbl>, bahrain <dbl>, bangladesh <dbl>, barbados <dbl>,
## #   belarus <dbl>, belgium <dbl>, belize <dbl>, benin <dbl>,
## #   bermuda <dbl>, bhutan <dbl>, bolivia_plurinational_state_of <dbl>,
## #   bonaire_sint_eustatius_and_saba <dbl>, bosnia_and_herzegovina <dbl>,
## #   botswana <dbl>, brazil <dbl>, british_virgin_islands <dbl>,
## #   brunei_darussalam <dbl>, bulgaria <dbl>, burkina_faso <dbl>,
## #   burundi <dbl>, cabo_verde <dbl>, cambodia <dbl>, cameroon <dbl>,
## #   canada <dbl>, cayman_islands <dbl>, central_african_republic <dbl>,
## #   chad <dbl>, channel_islands <dbl>, chile <dbl>, china <dbl>,
## #   china_hong_kong_sar <dbl>, china_macao_sar <dbl>, colombia <dbl>,
## #   comoros <dbl>, congo <dbl>, cook_islands <dbl>, costa_rica <dbl>,
## #   côte_divoire <dbl>, croatia <dbl>, cuba <dbl>, curaçao <dbl>,
## #   cyprus <dbl>, czechia <dbl>, dem._peoples_republic_of_korea <dbl>,
## #   democratic_republic_of_the_congo <dbl>, denmark <dbl>, djibouti <dbl>,
## #   dominica <dbl>, dominican_republic <dbl>, ecuador <dbl>, egypt <dbl>,
## #   el_salvador <dbl>, equatorial_guinea <dbl>, eritrea <dbl>,
## #   estonia <dbl>, eswatini <dbl>, ethiopia <dbl>,
## #   falkland_islands_malvinas <dbl>, faroe_islands <dbl>, fiji <dbl>,
## #   finland <dbl>, france <dbl>, french_guiana <dbl>,
## #   french_polynesia <dbl>, gabon <dbl>, gambia <dbl>, georgia <dbl>,
## #   germany <dbl>, ghana <dbl>, gibraltar <dbl>, greece <dbl>,
## #   greenland <dbl>, grenada <dbl>, guadeloupe <dbl>, guam <dbl>,
## #   guatemala <dbl>, guinea <dbl>, guinea_bissau <dbl>, guyana <dbl>,
## #   haiti <dbl>, holy_see <dbl>, honduras <dbl>, hungary <dbl>,
## #   iceland <dbl>, india <dbl>, indonesia <dbl>,
## #   iran_islamic_republic_of <dbl>, iraq <dbl>, ireland <dbl>, ...
```

---

## It's not right, but it's Ok:

- How many men immigrated from Russia to Israel in 1990?

```r
migration %>%
  filter(country_dest == "israel", year == 1990, gender == "men") %>%
  pull(russian_federation)
```

```
## [1] 80450
```

- How many men immigrated from Israel to Russia in 1990?

```r
migration %>%
  filter(country_dest == "russian_federation", year == 1990, gender == "men") %>%
  pull(israel)
```

```
## [1] 1395
```

---

## It would have been much nicer to have:

```r
migration %>%
  filter(country_orig == "israel", country_dest == "russian_fedration",
         year == 1990, gender == "men") %>%
  pull(n_migrants)
```

Then we could put in a (simpler) function and call the opposite:

```r
get_1way_migration <- function(orig, dest, gen, .year) {
  migration %>%
  filter(country_orig == orig, country_dest == dest,
         year == .year, gender == gen) %>%
  pull(n_migrants)
}

get_1way_migration("russian_federation", "israel")
```

---
class: section-slide

# End of Detour

---

## `pivot_longer()`

```r
migration_long <- migration %>%
  pivot_longer(cols = -c(1:4),
               names_to = "country_orig",
               values_to = "n_migrants")

migration_long
```

```
## # A tibble: 753,536 x 6
##    gender  year  code country_dest country_orig        n_migrants
##    <chr>  <dbl> <dbl> <chr>        <chr>                    <dbl>
##  1 men     1990   108 burundi      afghanistan                  0
##  2 men     1990   108 burundi      albania                      0
##  3 men     1990   108 burundi      algeria                      0
##  4 men     1990   108 burundi      american_samoa               0
##  5 men     1990   108 burundi      andorra                      0
##  6 men     1990   108 burundi      angola                       0
##  7 men     1990   108 burundi      anguilla                     0
##  8 men     1990   108 burundi      antigua_and_barbuda          0
##  9 men     1990   108 burundi      argentina                    0
## 10 men     1990   108 burundi      armenia                      0
## # ... with 753,526 more rows
```

---

## What sorcery is this?

.font80percent[Source: [The Carpentries](http://swcarpentry.github.io/r-novice-gapminder/14-tidyr/index.html)]

---

## `pivot_wider()`

```r
migration_wide <- migration_long %>%
  pivot_wider(id_cols = 1:4,
              names_from = country_orig,
              values_from = n_migrants)

migration_wide
```

---

### Where is this going to?
.font80percent[

```r
get_1way_migration <- function(.country_dest, .country_orig) {
  migration_long %>%
    filter(country_dest == .country_dest, country_orig == .country_orig) %>%
    group_by(year) %>%
    tally(n_migrants) %>%
    mutate(direction = str_c(.country_orig, " to ", .country_dest))
}
get_2way_migration <- function(country_a, country_b) {
  a2b <- get_1way_migration(country_b, country_a)
  b2a <- get_1way_migration(country_a, country_b)
  bind_rows(a2b, b2a)
}

get_2way_migration("israel", "russian_federation") %>%
  ggplot(aes(year, n)) + geom_line() +
  geom_point(color = "orange", size = 5) +
  labs(x = "", y = "", title = "Israel-Russia Yearly No. of immigrants") +
  theme_dark() +
  facet_wrap(~direction, scales = "free", labeller = labeller(direction = rus_isr_names)) +
  scale_y_continuous(labels = scales::comma_format()) +
  theme(axis.text = element_text(size = 12, hjust=0.9, family = "mono"),
        strip.text.x = element_text(size = 14, family = "mono"),
        plot.title = element_text(hjust = 0.5, size = 18, family = "mono"))
```
]

---

---

## Some more useful Tidying up verbs

Remember?

```r
table3 <- read_rds("../data/tidy_tables.rds")$table3

table3
```

```
## # A tibble: 315 x 3
##    religion      yob pct_straight
##    <chr>       <dbl> <chr>       
##  1 atheist      1950 26/29       
##  2 buddhist     1950 6/6         
##  3 christian    1950 28/32       
##  4 hindu        1950 0/0         
##  5 jewish       1950 21/24       
##  6 muslim       1950 0/0         
##  7 unspecified  1950 71/76       
##  8 atheist      1951 31/33       
##  9 buddhist     1951 11/11       
## 10 christian    1951 23/24       
## # ... with 305 more rows
```

---

## `separate()`

```r
table3_tidy <- table3 %>%
  separate(pct_straight, into = c("straight", "total"), sep = "/")

table3_tidy
```

```
## # A tibble: 315 x 4
##    religion      yob straight total
##    <chr>       <dbl> <chr>    <chr>
##  1 atheist      1950 26       29   
##  2 buddhist     1950 6        6    
##  3 christian    1950 28       32   
##  4 hindu        1950 0        0    
##  5 jewish       1950 21       24   
##  6 muslim       1950 0        0    
##  7 unspecified  1950 71       76   
##  8 atheist      1951 31       33   
##  9 buddhist     1951 11       11   
## 10 christian    1951 23       24   
## # ... with 305 more rows
```

---

## `unite()`

(Though see a much more generalizable approach with `purrr`)

```r
table3_tidy %>%
  unite(col = "pct_straight", straight, total, sep = "/")
```

---
class: section-slide

# Iteration without looping

---

## Fun fact: you're already not-looping!

- Say you want the lengths of all of these strings:

```r
strings_vec <- c("I'm feeling fine", "I'm perfectly OK",
                 "Nothing is wrong!")
```

- Do you `for` loop?

```r
strings_len <- numeric(length(strings_vec))
for (i in seq_along(strings_vec)) {
  strings_len[i] <- nchar(strings_vec[i])
}
strings_len
```

```
## [1] 16 16 17
```

---

- No, you use R's vectorized functions nature:

```r
nchar(strings_vec)
```

```
## [1] 16 16 17
```

- In case you were wondering:

```r
microbenchmark(nchar_loop(), nchar_vectorized())
```

```
## Unit: microseconds
##                expr min lq   mean median  uq    max neval cld
##        nchar_loop() 2.9  3 47.688    3.1 3.2 4459.8   100   a
##  nchar_vectorized() 1.0  1 15.996    1.1 1.1 1470.5   100   a
```

- So why should iterating through a `data.frame` columns or rows be any different?

---

## The problem:

```r
okcupid <- read_csv("../data/okcupid.csv.zip", col_types = cols())
okcupid %>% select(essay0:essay2) %>% head(3)
```

```
## # A tibble: 3 x 3
##   essay0                   essay1                   essay2                 
##   <chr>                    <chr>                    <chr>                  
## 1 "about me:<br />\n<br /~ "currently working as a~ "making people laugh.<~
## 2 "i am a chef: this is w~ dedicating everyday to ~ "being silly. having r~
## 3 "i'm not ashamed of muc~ "i make nerdy software ~ "improvising in differ~
```

- I want to apply some transformation to one or more columns (say the length of each `essay` question for each user)
- I want to apply some transformation to one or more rows (say the average length of all `essay` questions for each user)

- BTW, you can always use a `for` loop, see how it is done [here](https://r4ds.had.co.nz/iteration.html#for-loops)

---

## The `apply` family

- There *is* a solution in base R
  - `apply()`
  - `sapply()`
  - `lapply()`
  - `tapply()`
  - `vapply()`
  - `mapply()`
- What do you think is the issue with these? See [this](https://stackoverflow.com/questions/45101045/why-use-purrrmap-instead-of-lapply).

---

## I don't `for`, I `purrr`

The `purrr` package provides a set of functions to make iteration easier:

- No boilerplate code for looping --> less looping bugs
- Focus on the function, the action, not the plumbing
- Generally faster (implemented in C)
- Definitely more clear, concise and elegant code

TBH, I'm addicted 😢

---

## You get a `map()`, you get a `map()`!

| Single | Two | Multiple | Returns | Of |
|---------------|-------------|------|--------|-------|
| `map()`      | `map2()` | `pmap()` | `list` | `?`|
| `map_lgl()`      | `map2_lgl()`      |   `pmap_lgl()` |`vector`| `logical` |
| `map_chr()`      | `map2_chr()`      |   `pmap_chr()` |`vector`| `character` |
| `map_int()`      | `map2_int()`      |   `pmap_int()` |`vector`| `integer` |
| `map_dbl()`      | `map2_dbl()`      |   `pmap_dbl()` |`vector`| `double` |
| `map_dfr()` | `map2_dfr()`      |    `pmap_dfr()` | `tibble` | `?` |

Where "Single" means "single vector/column input", "Two" means "two vectors/columns input" etc.

---

### Example1: Vectorizing a Function

Take a clearly not-vectorized function:

```r
my_func <- function(x) {
  if (x %% 2 == 0) return("even")
  "odd"
}

my_func(10)
```

```
## [1] "even"
```

```r
my_func(1:5)
```

<pre style="color: red;"><code>## Warning in if (x%%2 == 0) return("even"): the condition has length > 1 and
## only the first element will be used
</code></pre>

```
## [1] "odd"
```

---

`map()` will always return a list:

```r
map(1:3, my_func)
```

```
## [[1]]
## [1] "odd"
## 
## [[2]]
## [1] "even"
## 
## [[3]]
## [1] "odd"
```

```r
1:3 %>% map(my_func)
```

```
## [[1]]
## [1] "odd"
## 
## [[2]]
## [1] "even"
## 
## [[3]]
## [1] "odd"
```

---

`map_chr()` will always return a vector of `character`:

```r
map_chr(1:3, my_func)
```

```
## [1] "odd"  "even" "odd"
```

```r
1:3 %>% map_chr(my_func)
```

```
## [1] "odd"  "even" "odd"
```

But here is the beautiful thing:

```r
my_func_vectorized <- function(vec) map_chr(vec, my_func)

my_func_vectorized(1:3)
```

```
## [1] "odd"  "even" "odd"
```

Look Ma, no loops!

---

### Example2: Complex `mutate()`

Manager: Add me a column, for each OkCupid user, whether he/she's above average height.

```r
is_above_average_height <- function(sex, height_cm) {
  if (sex == "m") height_cm > 180 else height_cm > 165
}

okcupid <- okcupid %>%
  mutate(is_tall = map2_lgl(sex, height_cm, is_above_average_height))

okcupid %>% select(sex, height_cm, is_tall) %>%
  group_by(sex) %>% sample_n(3)
```

```
## # A tibble: 6 x 3
## # Groups:   sex [2]
##   sex   height_cm is_tall
##   <chr>     <dbl> <lgl>  
## 1 f          170. TRUE   
## 2 f          163. FALSE  
## 3 f          165. TRUE   
## 4 m          175. FALSE  
## 5 m          178. FALSE  
## 6 m          175. FALSE
```

.font80percent[This specific example might not be *that* bad with `ifelse()`, what will you do with 5 variables?]

---

### Example2: You could even supply args

```r
is_above_average_height <- function(sex, height_cm, men_avg, women_avg) {
  if(sex == "m") height_cm > men_avg else height_cm > women_avg
}

okcupid <- okcupid %>%
  mutate(is_tall = map2_lgl(sex, height_cm, is_above_average_height,
                            men_avg = 180, women_avg = 165))

okcupid %>% select(sex, height_cm, is_tall) %>% group_by(sex) %>% sample_n(3)
```

```
## # A tibble: 6 x 3
## # Groups:   sex [2]
##   sex   height_cm is_tall
##   <chr>     <dbl> <lgl>  
## 1 f          168. TRUE   
## 2 f          155. FALSE  
## 3 f          178. TRUE   
## 4 m          178. FALSE  
## 5 m          178. FALSE  
## 6 m          180. TRUE
```

---

### Example2: Anonymous Functions

```r
okcupid <- okcupid %>%
  mutate(is_tall = map2_lgl(sex, height_cm,
                            function(x, y) if(x == "m") y > 180 else y > 165))
```

Heck, you can even:

```r
okcupid <- okcupid %>%
  mutate(is_tall = map2_lgl(sex, height_cm,
                            ~{if(.x == "m") .y > 180 else .y > 165}))
```

### Example 3: Remember our problem?

For each `essay` column, add a column with its length:

```r
okcupid %>%
  mutate_at(vars(essay0:essay9), list("len" = str_length)) %>%
  select(starts_with("essay")) %>%
  head(3)
```

```
## # A tibble: 3 x 20
##   essay0 essay1 essay2 essay3 essay4 essay5 essay6 essay7 essay8 essay9
##   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
## 1 "abou~ "curr~ "maki~ "the ~ "book~ "food~ duali~ "tryi~ "i am~ "you ~
## 2 "i am~ dedic~ "bein~ <NA>   "i am~ "deli~ <NA>   <NA>   i am ~ <NA>  
## 3 "i'm ~ "i ma~ "impr~ "my l~ "okay~ "move~ <NA>   viewi~ "when~ "you ~
## # ... with 10 more variables: essay0_len <int>, essay1_len <int>,
## #   essay2_len <int>, essay3_len <int>, essay4_len <int>,
## #   essay5_len <int>, essay6_len <int>, essay7_len <int>,
## #   essay8_len <int>, essay9_len <int>
```

`purrr` not needed, like I said, you've already been non-looping!

---

### Example 3: Input multiple columns

~~For each user, compute the average `essay` length:~~

Wait, before that, for each user compute the average of the 3 first `essay`s:

```r
mean_length_3essay <- function(x, y, z) {
  mean(str_length(c(x, y, z)), na.rm = TRUE)
}

okcupid %>%
  mutate(essay3_avglen = pmap_dbl(
    list(essay0, essay1, essay2),
    mean_length_3essay)) %>%
  select(essay0:essay2, essay3_avglen) %>%
  head(2)
```

```
## # A tibble: 2 x 4
##   essay0              essay1              essay2              essay3_avglen
##   <chr>               <chr>               <chr>                       <dbl>
## 1 "about me:<br />\n~ "currently working~ "making people lau~          557.
## 2 "i am a chef: this~ dedicating everyda~ "being silly. havi~          280.
```

---

OK now, for each user, compute the average `essay` length:

```r
mean_length_essay <- function(...) {
  mean(str_length(c(...)), na.rm = TRUE)
}

okcupid %>%
  mutate(essay_avglen = pmap_dbl(
    select(., starts_with("essay")),
    mean_length_essay)) %>%
  select(essay0:essay3, essay_avglen) %>%
  head(3)
```

```
## # A tibble: 3 x 5
##   essay0          essay1         essay2         essay3         essay_avglen
##   <chr>           <chr>          <chr>          <chr>                 <dbl>
## 1 "about me:<br ~ "currently wo~ "making peopl~ "the way i lo~         264.
## 2 "i am a chef: ~ dedicating ev~ "being silly.~ <NA>                   241.
## 3 "i'm not asham~ "i make nerdy~ "improvising ~ "my large jaw~         612
```

---

### Example4: Output multiple columns

```r
essay0_features <- function(essay0) {
 contains_love <- str_detect(essay0, "love")
 contains_obama <- str_detect(essay0, "obama")
 contains_rel <- str_detect(essay0, "relationship")
 list(
   essay0_love = contains_love,
   essay0_obama = contains_obama,
   essay0_ser_rel = contains_rel
 )
}

okcupid %>% select(essay0) %>% map_dfc(essay0_features)
```

```
## # A tibble: 59,946 x 3
##    essay0_love essay0_obama essay0_ser_rel
##    <lgl>       <lgl>        <lgl>         
##  1 TRUE        FALSE        FALSE         
##  2 TRUE        FALSE        FALSE         
##  3 TRUE        FALSE        FALSE         
##  4 FALSE       FALSE        FALSE         
##  5 FALSE       FALSE        FALSE         
##  6 TRUE        FALSE        FALSE         
##  7 TRUE        FALSE        FALSE         
##  8 NA          NA           NA            
##  9 NA          NA           NA            
## 10 TRUE        FALSE        FALSE         
## # ... with 59,936 more rows
```

```r
# or: map_dfc(okcupid$essay0, essay0_features)
```

---

Or if you want those features as additional columns, you could bind them with `bind_cols()`:

```r
okcupid %>%
  bind_cols(
    okcupid %>% select(essay0) %>% map_dfc(essay0_features)
  ) %>%
  select(age, sex, starts_with("essay0"))
```

```
## # A tibble: 59,946 x 6
##      age sex   essay0               essay0_love essay0_obama essay0_ser_rel
##    <dbl> <chr> <chr>                <lgl>       <lgl>        <lgl>         
##  1    22 m     "about me:<br />\n<~ TRUE        FALSE        FALSE         
##  2    35 m     "i am a chef: this ~ TRUE        FALSE        FALSE         
##  3    38 m     "i'm not ashamed of~ TRUE        FALSE        FALSE         
##  4    23 m     i work in a library~ FALSE       FALSE        FALSE         
##  5    29 m     "hey how's it going~ FALSE       FALSE        FALSE         
##  6    29 m     "i'm an australian ~ TRUE        FALSE        FALSE         
##  7    32 f     "life is about the ~ TRUE        FALSE        FALSE         
##  8    31 f     <NA>                 NA          NA           NA            
##  9    24 f     <NA>                 NA          NA           NA            
## 10    37 m     "my names jake.<br ~ TRUE        FALSE        FALSE         
## # ... with 59,936 more rows
```

---

## Dealing with failure

```r
my_func("a")
```

<pre style="color: red;"><code>## Error in x%%2: non-numeric argument to binary operator
</code></pre>

Silly example, but:

(a) When dealing with big data expect the unexpected (input)

(b) You don't want your app to crash

Look at: `safely()`, `quietly()` and `possibly()` which wrap your code nicely and protect you from the unexpected (crash).

---

### My favorite: `possibly()`

```r
my_func_safe <- possibly(my_func, otherwise = NA)

my_func_safe("a")
```

```
## [1] NA
```

```r
map_chr(list(1, 2, "3", 4), my_func_safe)
```

```
## [1] "odd"  "even" NA     "even"
```

## `walk()`, `walk2()`, `pwalk()`

You don't always need to *return* anything, you just wanna loop

```r
walk(1:10, ~print("Hey Girrrl"))
```

```
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
## [1] "Hey Girrrl"
```

---

```r
plot_norm <- function(mu, sd) {
  plot(density(rnorm(100, mu, sd)),
       main = str_c("mu: ", mu, ", sd: ", sd))
}

par(mfcol = c(1, 2))
walk2(c(0, 10), c(1, 0.1), plot_norm)
```

---

### In the words of Hadley

> Once you master these functions, you’ll find it takes much less time to solve iteration problems.

<br>

> But you should never feel bad about using a for loop instead of a map function.

<br>

> The important thing is that you solve the problem that you’re working on, not write the most concise and elegant code.

<br>

> Some people will tell you to avoid for loops because they are slow. They’re wrong!