Deep dive: stats + scales + guides

.title[
# Deep dive: stats + scales + guides
]
.author[
### MACS 40700 <br /> University of Chicago
]

---

# Agenda
* Stats: statistical transformations that summarize data before it is drawn
* Scales: how data values map to aesthetics, including how axes are labeled
* Guides
  * Legends
  * Axes (incl. ticks)
---

# Announcements

* Post on Ed for any questions!

<!-----

- [Project description](https://uc-dataviz.netlify.app/project-1.html)
- Group assignments
- Repos
- Deliverables
    - April 15 - Proposals for peer review
    - April 22 - Revised proposals for instructor review
    - April 28 - write-up and presentation

--->

---

# Setup
* Stats
* Scales

---

## Packages + figures

``` r
# load packages
library(tidyverse)
library(c3s2datasets)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7, # 7"
  fig.asp = 0.618, # the golden ratio
  fig.retina = 2, # dpi multiplier for displaying HTML output on retina
  dpi = 150, # higher dpi, sharper image
  out.width = "50%"
)
```

---

## `scorecard`

``` r
glimpse(scorecard)
```

```
## Rows: 1,719
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.65…
## $ satavg    <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111, …
## $ cost      <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 3645…
## $ netcost   <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 2037…
## $ avgfacsal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 564…
## $ pctpell   <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.21…
## $ comprate  <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 1500…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
```

---

# Stats

---

## Stats != geoms

- A statistical transformation (**stat**) transforms the data, typically by summarizing it
- ggplot2 uses stats behind the scenes to generate many important geoms

|`stat`            | geom              |
|------------------|-------------------|
|`stat_bin()`      | `geom_freqpoly()`, `geom_histogram()` |
|`stat_bin2d()`    | `geom_bin2d()`    |
|`stat_bindot()`   | `geom_dotplot()`  |
|`stat_binhex()`   | `geom_hex()`      |
|`stat_boxplot()`  | `geom_boxplot()`  |
|`stat_contour()`  | `geom_contour()`  |
|`stat_count()`    | `geom_bar()`      |
|`stat_quantile()` | `geom_quantile()` |
|`stat_smooth()`   | `geom_smooth()`   |
|`stat_sum()`      | `geom_count()`    |
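
One way to see a geom and its stat working as a pair is to compare the data each one computes. The sketch below is only an illustration: it assumes the `mpg` data that ships with ggplot2 (not the `scorecard` data used in these slides) and uses `layer_data()` to pull out the values a layer draws from.

```r
library(ggplot2)

# The same histogram, requested once through the geom and once through the stat
p_geom <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 10)
p_stat <- ggplot(mpg, aes(x = hwy)) + stat_bin(bins = 10)

# layer_data() returns the data frame a layer draws from, after its stat has run
counts_geom <- layer_data(p_geom)$count
counts_stat <- layer_data(p_stat)$count

# Both layers compute the same bin counts, and the counts cover every row
identical(counts_geom, counts_stat)
sum(counts_geom) == nrow(mpg)
```

Both comparisons should come back `TRUE`: the geom decides how the bins are drawn, while the stat decides what is counted.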

---

## `stat_boxplot()`

---

## Layering with stats

``` r
ggplot(scorecard, aes(x = type, y = avgfacsal)) + 
  geom_point(alpha = 0.5) + 
  geom_point(stat = "summary", fun = "median", 
             color = "red", size = 5, pch = 4, stroke = 2) 
```

---

## Alternate layering with stats

``` r
ggplot(scorecard, aes(x = type, y = avgfacsal)) + 
  geom_point(alpha = 0.5) + 
  stat_summary(geom = "point", fun = "median", 
               color = "red", size = 5, pch = 4, stroke = 2) 
```

---

## Statistical transformations

.task[
What can you say about the distribution of average faculty salaries from the following QQ plot?
]

``` r
ggplot(scorecard, aes(sample = avgfacsal)) +
  stat_qq() + 
  stat_qq_line() + 
  labs(y = "avgfacsal")
```

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" />

---
# Stats: Exercise

.task[Use the `PlantGrowth` dataset (hint: `data()`) and make a boxplot with the means marked with a green X]

.panelset[
.panel[.panel-name[Stat]
.small[

``` r
data("PlantGrowth")
ggplot(PlantGrowth, aes(x = group, y = weight))  +
     geom_boxplot() +
     stat_summary(geom = "point", fun = "mean", 
                  color = "darkgreen", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Geom]
.small[

``` r
data("PlantGrowth")
ggplot(PlantGrowth, aes(x = group, y = weight))  +
     geom_boxplot() +
     geom_point(stat = "summary", fun = "mean", 
                  color = "darkgreen", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" />
]
]
]

---

# Scales

---

## What is a scale?

- Each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale)

- The axis or legend is the inverse function: it allows you to convert visual properties back to data
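
As a rough illustration of the two directions, the sketch below builds a linear position scale by hand with `scales::rescale()`. The domain of 0 to 50,000 and the helper names `to_position()` and `to_data()` are made up for this example; they are not part of ggplot2.

```r
library(scales)

# Domain (data space): illustrative net costs between $0 and $50,000
domain <- c(0, 50000)
# Range (aesthetic space): a position between 0 and 1 along the axis
aes_range <- c(0, 1)

# The scale: data -> aesthetic (hypothetical helper)
to_position <- function(x) rescale(x, to = aes_range, from = domain)
# What the axis lets a reader do: aesthetic -> data (hypothetical helper)
to_data <- function(p) rescale(p, to = domain, from = aes_range)

positions <- to_position(c(0, 12500, 50000))
positions
to_data(positions)
```

Running the positions back through `to_data()` recovers the original values, which is the job an axis or legend does for the reader.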

---

## Scale specification

Every aesthetic in your plot is associated with exactly one scale:

.panelset[
.panel[.panel-name[Code]

``` r
# automatic scales
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) + 
  geom_point(alpha = 0.8)
```

``` r
# manual scales
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) + 
  geom_point(alpha = 0.8) +
  scale_x_continuous() + 
  scale_y_continuous() + 
  scale_color_discrete() # default color scale for an unordered factor such as type
```
]
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Anatomy of a scale function

- Always starts with `scale` and follows the pattern `scale_A_B()`
- `A`: Name of the primary aesthetic (e.g., `color`, `shape`, `x`)
- `B`: Name of the scale (e.g., `continuous`, `discrete`, `brewer`)
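
A small sketch of how the naming pattern shows up in the objects these functions return. The object names `sc_x` and `sc_col` are arbitrary, and the `Dark2` palette is just an example choice.

```r
library(ggplot2)

# scale_<aesthetic>_<name>(): each call returns a scale object
sc_x <- scale_x_continuous()                     # aesthetic: x, scale: continuous
sc_col <- scale_color_brewer(palette = "Dark2")  # aesthetic: color, scale: brewer

# The aesthetic in the function name is recorded on the scale object
"x" %in% sc_x$aesthetics
sc_col$aesthetics

# The scale name determines whether it handles continuous or discrete data
inherits(sc_x, "ScaleContinuous")
inherits(sc_col, "ScaleDiscrete")
```

Note that ggplot2 stores the aesthetic under its British spelling, so `scale_color_brewer()` reports `"colour"`.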

---

## Guess the output

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) + 
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "pctpell") +
  scale_x_continuous(name = "Percent of students receiving a Pell grant") 
```

---

## "Address" messages

```
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
```
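
The message is emitted at the moment the second scale is added, and the scale added last is the one that is kept. A minimal sketch that captures the message rather than printing it, assuming the built-in `mpg` data and made-up axis names:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_x_continuous(name = "displ")

# Adding a second x scale triggers the message; capture its text
msg <- tryCatch(
  {
    p + scale_x_continuous(name = "Engine displacement (litres)")
    NULL
  },
  message = function(m) conditionMessage(m)
)

grepl("already present", msg)
```

Setting each scale only once, with all of its arguments in a single call, avoids the message entirely.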

---

## Fixing axes

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) + 
  geom_point(alpha = 0.8) +
  scale_y_continuous(
    name = "Average Faculty Salary",
    labels = scales::label_dollar() # scales is not attached by tidyverse
  ) +
  scale_x_continuous(
    name = "Percent of students receiving a Pell grant",
    labels = scales::label_percent()
  )
```

---

## Guess the output

.task[
What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale? Answer in the context of the following plots.
]

``` r
ggplot(scorecard, aes(type)) + geom_bar(stat = "count")
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" width="50%" style="display: block; margin: auto;" />

A discrete variable paired with a continuous scale:

``` r
ggplot(
  data = scorecard,
  mapping = aes(
    x = type,
    y = avgfacsal
  )
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()
```

```
## Error in `scale_x_continuous()`:
## ! Discrete value supplied to a continuous scale.
## ℹ Example values: Public, Private, nonprofit, and Private, for-profit.
```

A continuous variable paired with a discrete scale:

``` r
ggplot(
  data = scorecard,
  mapping = aes(
    x = type,
    y = avgfacsal
  )
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()
```

<img src="index_files/figure-html/incorrect-scale-discrete-1.png" width="50%" style="display: block; margin: auto;" />

<countdown-timer class="countdown" id="timer_40650de0" minutes="3" seconds="0" update-every="1" tabindex="0" style="right:0;bottom:0;"></countdown-timer>

---

## Transformations

When working with continuous data, the default is to map linearly from the data space onto the aesthetic space, but the scale can apply a transformation instead.

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5)
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" width="45%" style="display: block; margin: auto;" />

Transformed:

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")
```

<img src="index_files/figure-html/unnamed-chunk-17-1.png" width="45%" style="display: block; margin: auto;" />

---

## Continuous scale transformations

| Name      | Function `$f(x)$`         | Inverse `$f^{-1}(y)$`
|-----------|---------------------------|------------------------
| asn       | `$2 \sin^{-1}(\sqrt{x})$` | `$\sin^2(y / 2)$`
| exp       | `$e ^ x$`                 | `$\log(y)$`
| identity  | `$x$`                     | `$y$`
| log       | `$\log(x)$`               | `$e ^ y$`
| log10     | `$\log_{10}(x)$`          | `$10 ^ y$`
| log2      | `$\log_2(x)$`             | `$2 ^ y$`
| logit     | `$\log(\frac{x}{1 - x})$` | `$\frac{1}{1 + e^{-y}}$`
| pow10     | `$10^x$`                  | `$\log_{10}(y)$`
| probit    | `$\Phi^{-1}(x)$`          | `$\Phi(y)$`
| reciprocal| `$x^{-1}$`                | `$y^{-1}$`
| reverse   | `$-x$`                    | `$-y$`
| sqrt      | `$x^{1/2}$`               | `$y ^ 2$`
---

## Convenience functions for transformations

.pull-left[

``` r
ggplot(scorecard, 
       aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")
```

<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[

``` r
ggplot(scorecard, 
       aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_y_log10(labels = scales::label_comma())
```

<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Scale transform vs. data transform

.task[
How are the following three plots different, and how are they similar? What does this say about how scale transformations work?
]

.panelset[
.panel[.panel-name[Plot A]
.pull-left[

``` r
scorecard %>%
  mutate(avgfacsal_log10 = 
           log(avgfacsal, base = 10)) %>%
  ggplot(aes(x = pctpell, 
             y = avgfacsal_log10)) + 
  geom_point(alpha = 0.5)
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Plot B]
.pull-left[

``` r
ggplot(scorecard, 
       aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_y_log10()
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Plot C]
.pull-left[

``` r
ggplot(scorecard, 
       aes(x = pctpell, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_y_log10(
    labels = scales::label_comma())
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]
]
]
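
One way to check the comparison numerically is to look at where the points are actually placed. This is a sketch under the assumption that the built-in `mpg` data stands in for `scorecard`; `layer_data()` returns the values a layer draws from.

```r
library(ggplot2)

# Transform the data before plotting
p_data <- ggplot(mpg, aes(x = displ, y = log10(hwy))) +
  geom_point()

# Leave the data alone and transform the scale
p_scale <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  scale_y_log10()

# y positions used for drawing in each plot
y_data <- layer_data(p_data)$y
y_scale <- layer_data(p_scale)$y

all.equal(y_data, y_scale)
```

The points land in the same places either way. What changes is the axis: a scale transformation keeps breaks and labels in the original units, while a data transformation labels the axis in log units.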

---
# Scales exercise

.task[In groups of two-to-three:
* Identify a relevant R dataset that has your assigned variable type (e.g. continuous/continuous)
* Create a simple plot
* Rework the scales using both of the following:
  * Naming
  * Transformation
]

<countdown-timer class="countdown" id="timer_52cfe012" minutes="4" seconds="0" update-every="1" tabindex="0" style="right:0;bottom:0;"></countdown-timer>
---

# Guides

---

## What is a guide?

Guides are the legends and axes that let readers convert visual properties back to data values.

.footnote[
Source: ggplot2: Elegant Graphics for Data Analysis, [Chp 15](https://ggplot2-book.org/scales-guides.html#scale-guide).
]

---

## Customizing axes

``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_x_continuous(name = "Net cost of attendance")
```

---

## Customizing axes

<img src="index_files/figure-html/unnamed-chunk-26-1.png" width="40%" style="display: block; margin: auto;" />

---

## Customizing axes

<img src="index_files/figure-html/unnamed-chunk-27-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing axes

``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000),
    labels = c("$0", "$10,000", "$20,000", "$30,000", "$40,000", "$50,000", "$60,000") 
  )
```

<img src="index_files/figure-html/unnamed-chunk-28-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing axes

``` r
library(scales)

ggplot(scorecard, aes(x = netcost, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000),
    labels = label_currency()
  )
```

<img src="index_files/figure-html/unnamed-chunk-29-1.png" width="45%" style="display: block; margin: auto;" />
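
The `labels` argument accepts either a character vector or a function that turns break values into strings. The label helpers in scales are function factories: calling one returns such a function. A short sketch with made-up break values (it uses `label_dollar()`, which is available in older versions of scales as well):

```r
library(scales)

# Each call returns a labelling function
dollars <- label_dollar()
percent <- label_percent()
comma <- label_comma()

# Apply them to illustrative break values
dollar_labels <- dollars(c(0, 10000, 60000))
percent_labels <- percent(c(0.25, 0.5))
comma_labels <- comma(c(1000, 1000000))

dollar_labels
percent_labels
comma_labels
```

Because the function is applied to whatever breaks the scale ends up with, the labels stay correct if the breaks or limits change later, which a hand-typed character vector does not.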

---

## Customizing axes

``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) + 
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance", breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000), labels = label_currency()) +
  scale_y_continuous( name = "Average faculty salary (USD)", labels = label_dollar()) 
```

<img src="index_files/figure-html/unnamed-chunk-30-1.png" width="50%" style="display: block; margin: auto;" />

---

## Aside: storing a plot
.small[

``` r
set.seed(1234)
p_pctpell_avgfacsal_type <- ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_jitter(aes(color = type, shape = type), size = 2)

p_pctpell_avgfacsal_type
```

<img src="index_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" />
]
---

## Customizing axis and legend labels with `scale_*()`

``` r
p_pctpell_avgfacsal_type +
  scale_x_continuous(name = "Percent of students receiving a Pell grant") + 
  scale_y_continuous(name = "Average faculty salary (USD)") + 
  scale_color_discrete(name = "College type") + 
  scale_shape_discrete(name = "College type") 
```

<img src="index_files/figure-html/unnamed-chunk-32-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing axis and legend labels with `labs()`

``` r
p_pctpell_avgfacsal_type +
  labs(
    x = "Percent of students receiving a Pell grant",
    y = "Average faculty salary (USD)",
    color = "College type"
  )
```

<img src="index_files/figure-html/unnamed-chunk-33-1.png" width="50%" style="display: block; margin: auto;" />

Titling only `color` leaves `shape` with its default title, so the two legends stay separate. Giving both aesthetics the same title combines them:

``` r
p_pctpell_avgfacsal_type +
  labs(
    x = "Percent of students receiving a Pell grant",
    y = "Average faculty salary (USD)",
    color = "College type",
    shape = "College type"
  )
```

<img src="index_files/figure-html/unnamed-chunk-34-1.png" width="50%" style="display: block; margin: auto;" />

---
# Guides exercise

.task[In groups of two-to-three:
* Identify a relevant R dataset (*hint: data()*) that has your assigned variable type (e.g. continuous/continuous)
* Create a simple plot
* Rework the plot using at least one item from each of the following categories:
  * Scales (x-axis): breaks, limits, labels, or transformations
  * Scales (y-axis): breaks, limits, labels, or transformations
  * Color or shape (e.g., `scale_color_discrete()`)
  * Legend and labs: polish the titles
]

<countdown-timer class="countdown" id="timer_850bdebb" minutes="5" seconds="0" update-every="1" tabindex="0" style="right:0;bottom:0;"></countdown-timer>

Post your graph and code on Ed. The graph should be reproducible from the code you submit. Include 2-3 sentences explaining why and how you made the choices you did.

---
# Bonus: Merging Legends!
.panelset[
.panel[.panel-name[Easy Fix]

``` r
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(aes(colour = factor(year), shape = factor(year)), size = 5) + 
    scale_colour_brewer("year", type = "qual", palette = 5) 
```

<img src="index_files/figure-html/easy-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Easy Fixed]

``` r
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(aes(colour = factor(year), shape = factor(year)), size = 5) + 
    scale_colour_brewer("year", type = "qual", palette = 5) +
    labs(colour = "year", shape = "year")
```

<img src="index_files/figure-html/easy-fixed-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Not Easy]

``` r
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(aes(colour = factor(year), shape = factor(cyl)), size = 5) + 
    scale_colour_brewer("year", type = "qual", palette = 5) 
```

<img src="index_files/figure-html/not-easy-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[ARGGHHH2 code]

``` r
legend <- mpg %>% distinct(year, cyl) %>% 
  mutate(
    shape = case_when(  cyl == 4 ~ 15, cyl == 5 ~ 17, cyl == 6 ~ 18, cyl == 8 ~ 19  ),
    colour = case_when( year == 1999 ~ "gray", year == 2008 ~ "black" ),
    label = paste0(year, " Year, ", cyl, " cyl"))
ggplot(mpg, aes(displ, hwy)) + 
    geom_point(aes(colour = interaction(as.factor(year), as.factor(cyl)), 
                   shape = interaction(as.factor(year), as.factor(cyl))), size = 5) + 
  scale_colour_discrete(
    name = "Combined",
    labels = legend %>% distinct(cyl, label) %>% pull(label),
    type = legend %>% distinct(cyl, colour) %>% pull(colour)
  ) +
  scale_shape_manual(name = "Combined", 
    labels = legend %>% distinct(year, label) %>% pull(label),
    values = legend %>% distinct(year, shape) %>% pull(shape)) +
  geom_point()
```
]

.panel[.panel-name[ARGGHHH2 plot]
*(why would we want / not want to do this?)*
<img src="index_files/figure-html/unnamed-chunk-36-1.png" width="55%" style="display: block; margin: auto;" />
]
]

---

# Recap

* Assignment 2 (Next week!)
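* **Stats** transform the data before it is drawn, typically by summarizing it
* **Scales** map data values to aesthetics and control axis names, breaks, labels, and transformations
* **Guides** (axes and legends) let readers convert visual properties back to data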