class: center, middle, inverse, title-slide .title[ # Looking at data ] .author[ ### MACS 40700
University of Chicago ] --- # Agenda * Faceting * Data and graphic improvement * Data-to-ink * Working with data --- class: middle, inverse # Take a sad plot, and make it better --- The American Association of University Professors (AAUP) is a nonprofit membership association of faculty and other academic professionals. [This report](https://www.aaup.org/sites/default/files/files/AAUP_Report_InstrStaff-75-11_apr2013.pdf) by the AAUP shows trends in instructional staff employees between 1975 and 2011, and contains an image very similar to the one given below. <img src="images/staff-employment.png" width="70%" style="display: block; margin: auto;" /> --- Each row in this dataset represents a faculty type, and the columns are the years for which we have data. The values are percentage of hires of that type of faculty for each year. Download file: https://github.com/MACS40700/class_ex/blob/main/instructional-staff.csv ``` r staff <- read_csv("data/instructional-staff.csv") staff ``` ``` ## # A tibble: 5 × 12 ## faculty_type `1975` `1989` `1993` `1995` `1999` `2001` `2003` `2005` `2007` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Full-Time Tenu… 29 27.6 25 24.8 21.8 20.3 19.3 17.8 17.2 ## 2 Full-Time Tenu… 16.1 11.4 10.2 9.6 8.9 9.2 8.8 8.2 8 ## 3 Full-Time Non-… 10.3 14.1 13.6 13.6 15.2 15.5 15 14.8 14.9 ## 4 Part-Time Facu… 24 30.4 33.1 33.2 35.5 36 37 39.3 40.5 ## 5 Graduate Stude… 20.5 16.5 18.1 18.8 18.7 19 20 19.9 19.5 ## # ℹ 2 more variables: `2009` <dbl>, `2011` <dbl> ``` .footnote[alt link: https://uchicago.box.com/s/eqk73widao74ysdd172ob81jac38ecjx] .footnote[alt option: `c3s2datasets`] --- ## Recreate the visualization In order to recreate this visualization we need to first reshape the data to have one variable for faculty type and one variable for year. In other words, we will convert the data from the long format to wide format. But before we do so... .task[ If the long data will have a row for each year/faculty type combination, and there are 5 faculty types and 11 years of data, how many rows will the data have? ] --- class: center, middle <img src="images/pivot.gif" width="80%" style="display: block; margin: auto;" /> --- # Brief aside: tidy data <img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/85520b8f-4629-4763-8a2a-9ceff27458bf_rw_1920.jpg?h=21007b20ac00cf37318dca645c215453" width="80%" style="display: block; margin: auto;" /> --- # Brief aside: tidy data <img src="https://cdn.myportfolio.com/45214904-6a61-4e23-98d6-b140f8654a40/fc9b748b-db96-4ed4-aa23-f6e0ffc866ee_rw_1920.jpg?h=8fa394b572089354aa87b1d602b0f887" width="80%" style="display: block; margin: auto;" /> --- ## `pivot_*()` function <img src="https://github.com/gadenbuie/tidyexplain/raw/main/images/tidyr-pivoting.gif" width="50%" style="display: block; margin: auto;" /> --- ## `pivot_longer()` ``` r pivot_longer(data, cols, names_to = "name", values_to = "value") ``` - The first argument is `data` as usual. - The second argument, `cols`, is where you specify which columns to pivot into longer format -- in this case all columns except for the `faculty_type` - The third argument, `names_to`, is a string specifying the name of the column to create from the data stored in the column names of data -- in this case `year` - The fourth argument, `values_to`, is a string specifying the name of the column to create from the data stored in cell values, in this case `percentage` --- ## Pivot instructor data .small[ ``` r library(tidyverse) staff_long <- staff %>% pivot_longer(cols = -faculty_type, names_to = "year", values_to = "percentage") %>% mutate(percentage = as.numeric(percentage)) staff_long ``` ``` ## # A tibble: 55 × 3 ## faculty_type year percentage ## <chr> <chr> <dbl> ## 1 Full-Time Tenured Faculty 1975 29 ## 2 Full-Time Tenured Faculty 1989 27.6 ## 3 Full-Time Tenured Faculty 1993 25 ## 4 Full-Time Tenured Faculty 1995 24.8 ## 5 Full-Time Tenured Faculty 1999 21.8 ## 6 Full-Time Tenured Faculty 2001 20.3 ## 7 Full-Time Tenured Faculty 2003 19.3 ## 8 Full-Time Tenured Faculty 2005 17.8 ## 9 Full-Time Tenured Faculty 2007 17.2 ## 10 Full-Time Tenured Faculty 2009 16.8 ## # ℹ 45 more rows ``` ] --- .question[ This doesn't look quite right, how would you fix it? ] .small[ ``` r staff_long %>% ggplot(aes(x = percentage, y = year, color = faculty_type)) + geom_col(position = "dodge") ``` <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="80%" style="display: block; margin: auto;" /> ] --- .small[ ``` r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col(position = "dodge") ``` <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="80%" style="display: block; margin: auto;" /> ] --- ## Some improvement... .small[ ``` r staff_long %>% ggplot(aes(x = percentage, y = year, fill = faculty_type)) + geom_col() ``` <img src="index_files/figure-html/unnamed-chunk-10-1.png" width="60%" style="display: block; margin: auto;" /> ] --- ## More improvement .small[ ``` r staff_long %>% ggplot(aes(x = year, y = percentage, group = faculty_type, color = faculty_type)) + geom_line() + theme_minimal() ``` <img src="index_files/figure-html/unnamed-chunk-11-1.png" width="85%" style="display: block; margin: auto;" /> ] --- ## Goal: even more improvement! .task[ I want to achieve the following look but I have no idea how! ] <img src="images/sketch.png" width="60%" style="display: block; margin: auto;" /> --- ## Asking good questions - Describe what you want - Describe where you are - Create a minimal **repr**oducible **ex**ample: `reprex::reprex()` --- .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/instructor-lines-1.png" width="100%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r library(scales) staff_long %>% * mutate( * part_time = if_else(faculty_type == "Part-Time Faculty", * "Part-Time Faculty", "Other Faculty"), * year = as.numeric(year) * ) %>% ggplot(aes(x = year, y = percentage/100, group = faculty_type, color = part_time)) + geom_line() + * scale_color_manual(values = c("gray", "red")) + * scale_y_continuous(labels = label_percent(accuracy = 1)) + theme_minimal() + labs( title = "Instructional staff employment trends", x = "Year", y = "Percentage", color = NULL ) + * theme(legend.position = "bottom") ``` ]] --- # Recap Parts of a graph: * aesthetics * color * shape * size * alpha (transparency) * faceting --- class: middle, inverse # A/B testing --- ## Data: College education costs .pull-left[ - Data on four year colleges and universities in the United States (2018-19) - Extracted from College Scorecard API - Source: `c3s2datasets::scorecard` ] .pull-right[ <blockquote class="twitter-tweet"><p lang="en" dir="ltr">The 4 stages of a morning lecture <a href="https://t.co/B7v6WLtX3J">pic.twitter.com/B7v6WLtX3J</a></p>— College Student (@ColIegeStudent) <a href="https://twitter.com/ColIegeStudent/status/829377468595306500?ref_src=twsrc%5Etfw">February 8, 2017</a></blockquote> <script async src="https://platform.twitter.com/widgets.js" charset="utf-8"></script> ] --- ## `c3s2datasets::scorecard` ``` r library(tidyverse) #install.packages("devtools") devtools::install_github("MACS40700/c3s2datasets") ``` ``` ## ── R CMD build ───────────────────────────────────────────────────────────────── ## checking for file ‘/private/var/folders/g1/9qqnjd1x21dbm_sdrj7lhrlw0000gn/T/RtmpCFDU1N/remotescd8358c1866e/MACS40700-c3s2datasets-a0c952e/DESCRIPTION’ ... ✔ checking for file ‘/private/var/folders/g1/9qqnjd1x21dbm_sdrj7lhrlw0000gn/T/RtmpCFDU1N/remotescd8358c1866e/MACS40700-c3s2datasets-a0c952e/DESCRIPTION’ (522ms) ## ─ preparing ‘c3s2datasets’: ## checking DESCRIPTION meta-information checking DESCRIPTION meta-information ... ✔ checking DESCRIPTION meta-information ## ─ checking for LF line-endings in source and make files and shell scripts ## ─ checking for empty or unneeded directories ## ─ building ‘c3s2datasets_0.0.4.tar.gz’ ## ## ``` ``` r library(c3s2datasets) glimpse(scorecard) ``` ``` ## Rows: 1,719 ## Columns: 14 ## $ unitid <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009… ## $ name <chr> "Alabama A & M University", "University of Alabama at Birmin… ## $ state <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", … ## $ type <fct> "Public", "Public", "Public", "Public", "Public", "Public", … ## $ admrate <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.65… ## $ satavg <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111, … ## $ cost <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 3645… ## $ netcost <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 2037… ## $ avgfacsal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 564… ## $ pctpell <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.21… ## $ comprate <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.69… ## $ firstgen <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381… ## $ debt <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 1500… ## $ locale <fct> City, City, City, City, City, City, City, City, City, Suburb… ``` --- ## A simple visualization .panelset[ .panel[.panel-name[Code] ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(alpha = 0.5, size = 2) + geom_smooth(method = "lm", se = FALSE, size = 0.7) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", title = "Faculty salaries and net cost of attendance in US universities" ) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## New variable: `pctpell_cat` ``` r scorecard <- scorecard %>% mutate(pctpell_cat = cut_interval(x = pctpell, n = 6)) %>% drop_na(pctpell_cat) scorecard %>% select(pctpell, pctpell_cat) ``` ``` ## # A tibble: 1,715 × 2 ## pctpell pctpell_cat ## <dbl> <fct> ## 1 0.685 (0.652,0.816] ## 2 0.325 (0.163,0.326] ## 3 0.238 (0.163,0.326] ## 4 0.720 (0.652,0.816] ## 5 0.171 (0.163,0.326] ## 6 0.482 (0.326,0.489] ## 7 0.130 [0,0.163] ## 8 0.215 (0.163,0.326] ## 9 0.461 (0.326,0.489] ## 10 0.675 (0.652,0.816] ## # ℹ 1,705 more rows ``` --- ## Distribution of `pctpell_cat` ``` r scorecard <- scorecard %>% mutate(pctpell_cat = cut_interval(x = pctpell, n = 6)) %>% drop_na(pctpell_cat) scorecard %>% count(pctpell_cat) ``` ``` ## # A tibble: 6 × 2 ## pctpell_cat n ## <fct> <int> ## 1 [0,0.163] 164 ## 2 (0.163,0.326] 657 ## 3 (0.326,0.489] 591 ## 4 (0.489,0.652] 209 ## 5 (0.652,0.816] 68 ## 6 (0.816,0.979] 26 ``` --- ## A slightly more complex visualization .panelset[ .panel[.panel-name[Code] ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.5, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(vars(pctpell_cat)) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", color = "Percentage of Pell grant recipients", title = "Faculty salaries and net cost of attendance in US universities" ) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-17-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: middle .task[ In the next two slides, the same plots are created with different "cosmetic" choices. Examine the plots two given (Plot A and Plot B), and decide whcih one you prefer. ] --- ## Test 1 .panelset[ .panel[.panel-name[Plot A] <img src="index_files/figure-html/test-1-a-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot B] <img src="index_files/figure-html/test-1-b-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- ## Test 2 .panelset[ .panel[.panel-name[Plot A] <img src="index_files/figure-html/unnamed-chunk-18-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot B] <img src="index_files/figure-html/test-2-b-1.png" width="80%" style="display: block; margin: auto;" /> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... ] ] --- ## Minimal theme + viridis scale, default option .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-19-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.5, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(vars(pctpell_cat)) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", color = "Percentage of Pell grant recipients", title = "Faculty salaries and net cost of attendance in US universities" ) + * theme_minimal(base_size = 16) + * scale_color_viridis_d(end = 0.9) ``` ] ] --- ## Viridis scale, option A (magma) .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-21-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.5, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(vars(pctpell_cat)) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", color = "Percentage of Pell grant recipients", title = "Faculty salaries and net cost of attendance in US universities" ) + theme_minimal(base_size = 16) + * scale_color_viridis_d(end = 0.8, option = "A") ``` ] ] --- ## Dark theme + further theme customization .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-23-1.png" width="80%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.5, show.legend = FALSE) + geom_smooth(method = "lm", se = FALSE, size = 0.5, show.legend = FALSE) + facet_wrap(vars(pctpell_cat)) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", color = "Percentage of Pell grant recipients", title = "Faculty salaries and net cost of attendance in US universities" ) + * theme_dark(base_size = 16) + * scale_color_manual(values = c("yellow", "blue", "orange", "red", "green", "white")) + * theme( * text = element_text(color = "red", face = "bold.italic"), * plot.background = element_rect(fill = "yellow") * ) ``` ] ] --- class: middle, inverse # What makes bad figures bad? --- ## Bad taste <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="80%" style="display: block; margin: auto;" /> --- ## Data-to-ink ratio Tufte strongly recommends maximizing the **data-to-ink ratio** this in the Visual Display of Quantitative Information (Tufte, 1983). >Graphical excellence is the well-designed presentation of interesting data—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space … [It] is nearly always multivariate … And graphical excellence requires telling the truth about the data. (Tufte, 1983, p. 51). <img src="images/tufte-visual-display-cover.png" alt="Cover of Visual Display of Quantitative Information, Tufte (1983)." width="15%" style="display: block; margin: auto 0 auto auto;" /> --- .task[ Which of the plots has higher data-to-ink ratio? ] .panelset[ .panel[.panel-name[Plot A] <img src="index_files/figure-html/mean-netcost-pctpell-a-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot B] <img src="index_files/figure-html/mean-netcost-pctpell-b-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- class: middle .large[ .hand[ a deeper look at the plotting code... ] ] --- ## Summary statistics ``` r mean_netcost_pctpell <- scorecard %>% group_by(pctpell_cat) %>% summarise(mean_netcost = mean(netcost, na.rm = TRUE)) mean_netcost_pctpell ``` ``` ## # A tibble: 6 × 2 ## pctpell_cat mean_netcost ## <fct> <dbl> ## 1 [0,0.163] 28713. ## 2 (0.163,0.326] 21525. ## 3 (0.326,0.489] 18660. ## 4 (0.489,0.652] 17087. ## 5 (0.652,0.816] 14771. ## 6 (0.816,0.979] 8466. ``` --- ## Barplot .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) + * geom_col() + labs( x = "Mean net cost of attendance", y = "Pell grant recipients", title = "Mean net cost of attendance, by Pell grant recipients" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Scatterplot .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) + * geom_point(size = 4) + labs( x = "Mean net cost of attendance", y = "Pell grant recipients", title = "Mean net cost of attendance, by Pell grant recipients" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Lollipop plot -- a happy medium? .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/mean-netcost-pctpell-lollipop-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) + geom_point(size = 4) + * geom_segment( * aes( * x = 0, xend = mean_netcost, * y = pctpell_cat, yend = pctpell_cat * ) * ) + labs( x = "Mean net cost of attendance", y = "Pell grant recipients", title = "Mean net cost of attendance, by Pell grant recipients" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Activity: Napoleon’s retreat .pull-left-wide[ .task[ .small[ This is Minard’s visualization of Napoleon’s retreat. Discuss in a pair (or group) the features of the following visualization. What are the variables plotted? Which aesthetics are they mapped to? ] ] ] <img src="images/minard.jpg" alt="Minard’s visualization of Napoleon’s retreat" width="83%" style="display: block; margin: auto;" /> .footnote[Source: [Wikipedia](https://en.wikipedia.org/wiki/Charles_Joseph_Minard#/media/File:Minard.png)]
--- ## Bad data .panelset[ .panel[.panel-name[Original] <img src="images/healy-democracy-nyt-version.png" alt="A crisis of faith in democracy? New York Times." width="50%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Improved] <img src="images/healy-democracy-voeten-version-2.png" alt="A crisis of faith in democracy? New York Times." width="50%" style="display: block; margin: auto;" /> ] ] .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figures 1.8 and 1.9. ] --- ## Bad perception <img src="images/healy-perception-curves.png" alt="Aspect ratios affect our perception of rates of change, modeled after an example by William S. Cleveland." width="80%" style="display: block; margin: auto;" /> .footnote[ Healy, Data Visualization: A practical introduction. [Chapter 1](https://socviz.co/lookatdata.html). Figure 1.12. ] --- class: middle, inverse # Aesthetic mappings in ggplot2 --- ## A second look: lollipop plot .panelset[ .panel[.panel-name[Plot] <img src="index_files/figure-html/mean-netcost-pctpell-lollipop-layer-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Code] ``` r ggplot(mean_netcost_pctpell, aes(y = pctpell_cat, x = mean_netcost)) + geom_point(size = 4) + geom_segment( aes( x = 0, xend = mean_netcost, y = pctpell_cat, yend = pctpell_cat ) ) + labs( x = "Mean net cost of attendance", y = "Pell grant recipients", title = "Mean net cost of attendance, by Pell grant recipients" ) + theme_minimal(base_size = 16) ``` ] ] --- ## Global vs. layer-specific aesthetics - Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both. - Within each layer, you can add, override, or remove mappings. - If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers. --- ## What's memorable? .task[ Think back to all the plots you saw in the lecture, without flipping back through the slides. Which plot first comes to mind? Describe it in words. ] --- class: center, middle, inverse # Finding data: how to identify data for this course / your reesarch questions --- ## Finding data! * Post on Ed (sometimes people have seen relevant data!) * 'Data is plural' newsletter * Google 'thing-you-care-about' + 'dataset' --- class: inverse # Assignment 1 You need to find a graph and **critique** it (don't totally trash it -- this is an academic exercise). If you want you can work to make it better if you can get your hands on similar data. But if not, that's OK! --- # Speaking of: doing well on assignments <img src="https://geekd-out.com/wp-content/uploads/2018/08/sugar-rush-featured-image.jpg" width="65%" style="display: block; margin: auto;" />