Looking at data

---

# Agenda

* Faceting
* Data and graphic improvement
* Data-to-ink
* Working with data

---

# Take a sad plot, and make it better

---

The American Association of 
University Professors (AAUP) is a nonprofit membership association of faculty 
and other academic professionals. 
[This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) 
by the AAUP shows trends in instructional staff employees between 1975 
and 2011, and contains an image very similar to the one given below.

---

Each row in this dataset represents a faculty type, and the columns are the years for which we have data. 
The values are the percentages of hires of each faculty type in each year.

Download file: https://github.com/MACS40700/class_ex/blob/main/instructional-staff.csv

``` r
library(tidyverse)  # provides read_csv()

staff <- read_csv("data/instructional-staff.csv")
staff
```

```
## # A tibble: 5 × 12
##   faculty_type    `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007`
##   <chr>            <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
## 1 Full-Time Tenu…   29     27.6   25     24.8   21.8   20.3   19.3   17.8   17.2
## 2 Full-Time Tenu…   16.1   11.4   10.2    9.6    8.9    9.2    8.8    8.2    8  
## 3 Full-Time Non-…   10.3   14.1   13.6   13.6   15.2   15.5   15     14.8   14.9
## 4 Part-Time Facu…   24     30.4   33.1   33.2   35.5   36     37     39.3   40.5
## 5 Graduate Stude…   20.5   16.5   18.1   18.8   18.7   19     20     19.9   19.5
## # ℹ 2 more variables: `2009` <dbl>, `2011` <dbl>
```
.footnote[alt link: https://uchicago.box.com/s/eqk73widao74ysdd172ob81jac38ecjx]
.footnote[alt option: `c3s2datasets`]
---

## Recreate the visualization

To recreate this visualization, we first need to reshape the data so that it has one variable for faculty type and one variable for year. In other words, we will convert the data from wide format to long format.

But before we do so...

.task[
If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have?
]

---

# Brief aside: tidy data

<img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/85520b8f-4629-4763-8a2a-9ceff27458bf_rw_1920.jpg?h=21007b20ac00cf37318dca645c215453" width="80%" style="display: block; margin: auto;" />
---

## `pivot_*()` function

---

## `pivot_longer()`

``` r
pivot_longer(data, cols, names_to = "name", values_to = "value")
```

- The first argument is `data` as usual.
- The second argument, `cols`, is where you specify which columns to pivot 
into longer format -- in this case all columns except for the `faculty_type` column
- The third argument, `names_to`, is a string specifying the name of the column to create from the data stored in the column names of `data` -- in this case `year`
- The fourth argument, `values_to`, is a string specifying the name of the column to create from the data stored in cell values -- in this case `percentage`
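
As a minimal sketch of these four arguments, the toy example below pivots a small made-up table (the `toy` tibble and its values are hypothetical, not the AAUP data). It assumes the `tidyr` and `tibble` packages are installed.

``` r
library(tidyr)
library(tibble)

# Hypothetical toy data: one row per group, one column per year
toy <- tribble(
  ~group, ~`2001`, ~`2002`,
  "a",         10,      20,
  "b",         30,      40
)

toy_long <- pivot_longer(
  toy,
  cols = -group,       # pivot every column except `group`
  names_to = "year",   # the old column names are stored here
  values_to = "value"  # the old cell values are stored here
)

toy_long
```

Because `year` is built from column names, it comes out as a character column; convert it with `as.numeric()` if you need a continuous axis.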

---

## Pivot instructor data

``` r
library(tidyverse)

staff_long <- staff %>%
  pivot_longer(cols = -faculty_type, names_to = "year", 
               values_to = "percentage") %>%
  mutate(percentage = as.numeric(percentage))

staff_long
```

```
## # A tibble: 55 × 3
##    faculty_type              year  percentage
##    <chr>                     <chr>      <dbl>
##  1 Full-Time Tenured Faculty 1975        29  
##  2 Full-Time Tenured Faculty 1989        27.6
##  3 Full-Time Tenured Faculty 1993        25  
##  4 Full-Time Tenured Faculty 1995        24.8
##  5 Full-Time Tenured Faculty 1999        21.8
##  6 Full-Time Tenured Faculty 2001        20.3
##  7 Full-Time Tenured Faculty 2003        19.3
##  8 Full-Time Tenured Faculty 2005        17.8
##  9 Full-Time Tenured Faculty 2007        17.2
## 10 Full-Time Tenured Faculty 2009        16.8
## # ℹ 45 more rows
```

---

``` r
staff_long %>%
  ggplot(aes(x = percentage, y = year, color = faculty_type)) +
  geom_col(position = "dodge")
```

<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" />

---

``` r
staff_long %>%
  ggplot(aes(x = percentage, y = year, fill = faculty_type)) +
  geom_col(position = "dodge")
```

<img src="index_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" />

---

## Some improvement...

``` r
staff_long %>%
  ggplot(aes(x = percentage, y = year, fill = faculty_type)) +
  geom_col()
```

<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" />

---

## More improvement

``` r
staff_long %>%
  ggplot(aes(x = year, y = percentage, group = faculty_type, 
             color = faculty_type)) +
  geom_line() +
  theme_minimal()
```

<img src="index_files/figure-html/unnamed-chunk-11-1.png" width="85%" style="display: block; margin: auto;" />

---

## Goal: even more improvement!

---

## Asking good questions

- Describe what you want
- Describe where you are
- Create a minimal **repr**oducible **ex**ample: `reprex::reprex()`
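
A minimal reproducible example might look like the sketch below. The `toy` data and its values are hypothetical stand-ins; the idea is to keep only enough rows and code to show the problem you are asking about.

``` r
library(ggplot2)
library(tibble)

# Hypothetical stand-in data: just enough rows to show the problem
toy <- tribble(
  ~year,  ~type, ~percentage,
  "2001", "a",            10,
  "2002", "a",            20,
  "2001", "b",            30,
  "2002", "b",            40
)

# The smallest plot that still shows the issue
p <- ggplot(toy, aes(x = year, y = percentage, group = type, color = type)) +
  geom_line()

# To share it: copy the code above, then run reprex::reprex() in the console.
# With no input supplied, reprex() looks for code on the clipboard.
```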

---


``` r
library(scales)

staff_long %>%
* mutate(
*   part_time = if_else(faculty_type == "Part-Time Faculty",
*                       "Part-Time Faculty", "Other Faculty"),
*   year = as.numeric(year)
*   ) %>%
  ggplot(aes(x = year, y = percentage/100, group = faculty_type, 
             color = part_time)) +
  geom_line() +
* scale_color_manual(values = c("gray", "red")) +
* scale_y_continuous(labels = label_percent(accuracy = 1)) +
  theme_minimal() +
  labs(
    title = "Instructional staff employment trends",
    x = "Year", y = "Percentage", color = NULL
  ) +
* theme(legend.position = "bottom")
```


---

# Recap

Parts of a graph:
* aesthetics
  * color
  * shape
  * size
  * alpha (transparency)
* faceting

---
class: middle, inverse

# A/B testing

---

## Data: College education costs

- Data on four year colleges and universities in the United States (2018-19)

- Extracted from College Scorecard API

- Source: `c3s2datasets::scorecard`


<blockquote class="twitter-tweet"><p lang="en" dir="ltr">The 4 stages of a morning lecture <a href="https://t.co/B7v6WLtX3J">pic.twitter.com/B7v6WLtX3J</a></p>&mdash; College Student (@ColIegeStudent) <a href="https://twitter.com/ColIegeStudent/status/829377468595306500?ref_src=twsrc%5Etfw">February 8, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script>

---

## `c3s2datasets::scorecard`

``` r
library(tidyverse)
#install.packages("devtools")
devtools::install_github("MACS40700/c3s2datasets")
```

``` r
library(c3s2datasets)

glimpse(scorecard)
```

```
## Rows: 1,719
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.65…
## $ satavg    <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111, …
## $ cost      <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 3645…
## $ netcost   <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 2037…
## $ avgfacsal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 564…
## $ pctpell   <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.21…
## $ comprate  <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 1500…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
```

---

## A simple visualization

``` r
ggplot(scorecard, aes(x = avgfacsal, y = netcost)) +
  geom_point(alpha = 0.5, size = 2) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.7) +
  labs(
    x = "Average faculty salary (USD)",
    y = "Net cost of attendance (USD)",
    title = "Faculty salaries and net cost of attendance in US universities"
  )
```


---

## New variable: `pctpell_cat`

``` r
scorecard <- scorecard %>%
  mutate(pctpell_cat = cut_interval(x = pctpell, n = 6)) %>%
  drop_na(pctpell_cat)

scorecard %>%
  select(pctpell, pctpell_cat)
```

```
## # A tibble: 1,715 × 2
##    pctpell pctpell_cat  
##      <dbl> <fct>        
##  1   0.685 (0.652,0.816]
##  2   0.325 (0.163,0.326]
##  3   0.238 (0.163,0.326]
##  4   0.720 (0.652,0.816]
##  5   0.171 (0.163,0.326]
##  6   0.482 (0.326,0.489]
##  7   0.130 [0,0.163]    
##  8   0.215 (0.163,0.326]
##  9   0.461 (0.326,0.489]
## 10   0.675 (0.652,0.816]
## # ℹ 1,705 more rows
```

---

## Distribution of `pctpell_cat`

``` r
scorecard <- scorecard %>%
  mutate(pctpell_cat = cut_interval(x = pctpell, n = 6)) %>%
  drop_na(pctpell_cat)

scorecard %>%
  count(pctpell_cat)
```

```
## # A tibble: 6 × 2
##   pctpell_cat       n
##   <fct>         <int>
## 1 [0,0.163]       164
## 2 (0.163,0.326]   657
## 3 (0.326,0.489]   591
## 4 (0.489,0.652]   209
## 5 (0.652,0.816]    68
## 6 (0.816,0.979]    26
```

---

## A slightly more complex visualization


---

.task[
In the next two slides, the same plots are created with different "cosmetic" choices. Examine the two plots given (Plot A and Plot B), and decide which one you prefer.
]

---

## Test 1

.panelset[
.panel[.panel-name[Plot A]
<img src="index_files/figure-html/test-1-a-1.png" width="80%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Plot B]
<img src="index_files/figure-html/test-1-b-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---

## Test 2

.panelset[
.panel[.panel-name[Plot A]
<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Plot B]
<img src="index_files/figure-html/test-2-b-1.png" width="80%" style="display: block; margin: auto;" />
]
]

---

## Minimal theme + viridis scale, default option

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="80%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  facet_wrap(vars(pctpell_cat)) +
  labs(
    x = "Average faculty salary (USD)",
    y = "Net cost of attendance (USD)",
    color = "Percentage of Pell grant recipients",
    title = "Faculty salaries and net cost of attendance in US universities"
  ) +
* theme_minimal(base_size = 16) +
* scale_color_viridis_d(end = 0.9)
```
]
]

---

## Viridis scale, option A (magma)

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  facet_wrap(vars(pctpell_cat)) +
  labs(
    x = "Average faculty salary (USD)",
    y = "Net cost of attendance (USD)",
    color = "Percentage of Pell grant recipients",
    title = "Faculty salaries and net cost of attendance in US universities"
  ) +
  theme_minimal(base_size = 16) +
* scale_color_viridis_d(end = 0.8, option = "A")
```
]
]

---

## Dark theme + further theme customization

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-23-1.png" width="80%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) +
  geom_point(alpha = 0.5, show.legend = FALSE) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.5, show.legend = FALSE) +
  facet_wrap(vars(pctpell_cat)) +
  labs(
    x = "Average faculty salary (USD)",
    y = "Net cost of attendance (USD)",
    color = "Percentage of Pell grant recipients",
    title = "Faculty salaries and net cost of attendance in US universities"
  ) +
* theme_dark(base_size = 16) +
* scale_color_manual(values = c("yellow", "blue", "orange", "red", "green", "white")) +
* theme(
*   text = element_text(color = "red", face = "bold.italic"),
*   plot.background = element_rect(fill = "yellow")
* )
```
]
]

---

# What makes bad figures bad?

---

## Bad taste

---

## Data-to-ink ratio

Tufte strongly recommends maximizing the **data-to-ink ratio** in *The Visual Display of Quantitative Information* (Tufte, 1983).

>Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51).

---

.panelset[
.panel[.panel-name[Plot A]
<img src="index_files/figure-html/mean-netcost-pctpell-a-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Plot B]
<img src="index_files/figure-html/mean-netcost-pctpell-b-1.png" width="60%" style="display: block; margin: auto;" />
]
]

---

## Summary statistics

``` r
mean_netcost_pctpell <- scorecard %>%
  group_by(pctpell_cat) %>%
  summarise(mean_netcost = mean(netcost, na.rm = TRUE))

mean_netcost_pctpell
```

```
## # A tibble: 6 × 2
##   pctpell_cat   mean_netcost
##   <fct>                <dbl>
## 1 [0,0.163]           28713.
## 2 (0.163,0.326]       21525.
## 3 (0.326,0.489]       18660.
## 4 (0.489,0.652]       17087.
## 5 (0.652,0.816]       14771.
## 6 (0.816,0.979]        8466.
```

---

## Barplot

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) +
* geom_col() +
  labs(
    x = "Mean net cost of attendance", y = "Pell grant recipients",
    title = "Mean net cost of attendance, by Pell grant recipients"
  ) +
  theme_minimal(base_size = 16)
```
]
]

---

## Scatterplot

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) +
* geom_point(size = 4) +
  labs(
    x = "Mean net cost of attendance", y = "Pell grant recipients",
    title = "Mean net cost of attendance, by Pell grant recipients"
  ) +
  theme_minimal(base_size = 16)
```
]
]

---

## Lollipop plot -- a happy medium?

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/mean-netcost-pctpell-lollipop-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) +
  geom_point(size = 4) +
* geom_segment(
*   aes(
*     x = 0, xend = mean_netcost,
*     y = pctpell_cat, yend = pctpell_cat
*   )
* ) +
  labs(
    x = "Mean net cost of attendance", y = "Pell grant recipients",
    title = "Mean net cost of attendance, by Pell grant recipients"
  ) +
  theme_minimal(base_size = 16)
```
]
]

---

## Activity: Napoleon’s retreat

.pull-left-wide[
.task[
.small[
This is Minard’s visualization of Napoleon’s retreat. Discuss in a pair (or group) the features of the following visualization. What are the variables plotted? Which aesthetics are they mapped to?
]
]
]

.footnote[Source: [Wikipedia](https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png)]

<countdown-timer class="countdown" id="timer_cabd47d1" minutes="5" seconds="0" update-every="1" tabindex="0" style="top:0;right:0;"></countdown-timer>

---

## Bad data

.panelset[
.panel[.panel-name[Original]
<img src="images/healy-democracy-nyt-version.png" alt="A crisis of faith in democracy? New York Times." width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Improved]
<img src="images/healy-democracy-voeten-version-2.png" alt="A crisis of faith in democracy? Voeten's version." width="50%" style="display: block; margin: auto;" />
]
]

.footnote[
Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figures 1.8 and 1.9.
]

---

## Bad perception

.footnote[
Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figure 1.12.
]

---

# Aesthetic mappings in ggplot2

---

## A second look: lollipop plot

.panelset[
.panel[.panel-name[Plot]
<img src="index_files/figure-html/mean-netcost-pctpell-lollipop-layer-1.png" width="60%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Code]

``` r
ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) +
  geom_point(size = 4) +
  geom_segment(
    aes(
      x = 0, xend = mean_netcost,
      y = pctpell_cat, yend = pctpell_cat
    )
  ) +
  labs(
    x = "Mean net cost of attendance", y = "Pell grant recipients",
    title = "Mean net cost of attendance, by Pell grant recipients"
  ) +
  theme_minimal(base_size = 16)
```
]
]

---

## Global vs. layer-specific aesthetics

- Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both.

- Within each layer, you can add, override, or remove mappings.

- If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers.
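
The sketch below illustrates the difference. It uses `mpg`, a dataset that ships with ggplot2, as a stand-in rather than the scorecard data. The two plots differ only in where `color` is mapped.

``` r
library(ggplot2)

# color mapped globally: every layer inherits it,
# so geom_smooth() fits one line per drive type
p_global <- ggplot(mpg, aes(x = displ, y = hwy, color = drv)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)

# color mapped only in the point layer: points are colored,
# but geom_smooth() sees no grouping and fits a single line
p_layer <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(method = "lm", se = FALSE)
```

With a single layer the two specifications would give the same plot; the second layer is what makes the placement of `color` matter.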

---

## What's memorable?

.task[
Think back to all the plots you saw in the lecture, without flipping back through the slides. Which plot first comes to mind? Describe it in words.
]

---
class: center, middle, inverse
# Finding data: how to identify data for this course / your research questions

---
## Finding data!

* Post on Ed (sometimes people have seen relevant data!)
* 'Data is plural' newsletter
* Google 'thing-you-care-about' + 'dataset'

---
class: inverse

# Assignment 1

You need to find a graph and **critique** it (don't totally trash it -- this is an academic exercise). If you want, you can try to improve it, provided you can get your hands on similar data. But if not, that's OK!

---

# Speaking of: doing well on assignments