class: center, middle, inverse, title-slide

.title[
# Deep dive: stats + scales + guides
]
.author[
### MACS 40700
University of Chicago
]

---

class: middle, inverse

# Agenda:

* Scales: control how the axes display numerically
* Guides
  * Legends
  * Axes (incl. ticks)

---

# Announcements

* Post on Ed for any questions!

<!----- - [Project description](https://uc-dataviz.netlify.app/project-1.html) - Group assignments - Repos - Deliverables - April 15 - Proposals for peer review - April 22 - Revised proposals for instructor review - April 28 - write-up and presentation --->

---

class: inverse, middle

# Setup

* Stats
* Scales

---

## Packages + figures

``` r
# load packages
library(tidyverse)
library(c3s2datasets)

# set default theme for ggplot2
ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16))

# set default figure parameters for knitr
knitr::opts_chunk$set(
  fig.width = 7,      # 7"
  fig.asp = 0.618,    # the golden ratio
  fig.retina = 2,     # dpi multiplier for displaying HTML output on retina
  dpi = 150,          # higher dpi, sharper image
  out.width = "50%"
)
```

---

## `scorecard`

``` r
glimpse(scorecard)
```

```
## Rows: 1,719
## Columns: 14
## $ unitid    <dbl> 100654, 100663, 100706, 100724, 100751, 100830, 100858, 1009…
## $ name      <chr> "Alabama A & M University", "University of Alabama at Birmin…
## $ state     <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", …
## $ type      <fct> "Public", "Public", "Public", "Public", "Public", "Public", …
## $ admrate   <dbl> 0.7160, 0.8854, 0.7367, 0.9799, 0.7890, 0.9680, 0.7118, 0.65…
## $ satavg    <dbl> 954, 1266, 1300, 955, 1244, 1069, NA, 1214, 1042, NA, 1111, …
## $ cost      <dbl> 21924, 26248, 24869, 21938, 31050, 20621, 32678, 33920, 3645…
## $ netcost   <dbl> 13057, 16585, 17250, 13593, 21534, 13689, 23258, 21098, 2037…
## $ avgfacsal <dbl> 79011, 104310, 88380, 69309, 94581, 70965, 99837, 68724, 564…
## $ pctpell   <dbl> 0.6853, 0.3253, 0.2377, 0.7205, 0.1712, 0.4821, 0.1301, 0.21…
## $ comprate  <dbl> 0.2807, 0.6245, 0.6072, 0.2843, 0.7223, 0.3569, 0.8088, 0.69…
## $ firstgen  <dbl> 0.3658281, 0.3412237, 0.3101322, 0.3434343, 0.2257127, 0.381…
## $ debt      <dbl> 16600, 15832, 13905, 17500, 17986, 13119, 17750, 16000, 1500…
## $ locale    <fct> City, City, City, City, City, City, City, City, City, Suburb…
```

---

class: middle, inverse

# Stats

---

## Stats != geoms

- Statistical transformation (**stat**) transforms the data, typically by summarizing
- ggplot2 uses many stats behind the scenes to generate important geoms

| `stat`            | geom                                  |
|-------------------|---------------------------------------|
| `stat_bin()`      | `geom_freqpoly()`, `geom_histogram()` |
| `stat_count()`    | `geom_bar()`                          |
| `stat_bin2d()`    | `geom_bin2d()`                        |
| `stat_bindot()`   | `geom_dotplot()`                      |
| `stat_binhex()`   | `geom_hex()`                          |
| `stat_boxplot()`  | `geom_boxplot()`                      |
| `stat_contour()`  | `geom_contour()`                      |
| `stat_quantile()` | `geom_quantile()`                     |
| `stat_smooth()`   | `geom_smooth()`                       |
| `stat_sum()`      | `geom_count()`                        |

---

## `stat_boxplot()`

<iframe src="https://ggplot2.tidyverse.org/reference/geom_boxplot.html#summary-statistics" width="85%" height="400px" data-external="1"></iframe>

---

## Layering with stats

``` r
ggplot(scorecard, aes(x = type, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  geom_point(stat = "summary", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-4-1.png" width="50%" style="display: block; margin: auto;" />

---

## Alternate layering with stats

``` r
ggplot(scorecard, aes(x = type, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  stat_summary(geom = "point", fun = "median", color = "red", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-5-1.png" width="50%" style="display: block; margin: auto;" />

---

## Statistical transformations

.task[
What can you say about the distribution of average faculty salaries from the following QQ plot?
]

.small[
``` r
ggplot(scorecard, aes(sample = avgfacsal)) +
  stat_qq() +
  stat_qq_line() +
  labs(y = "avgfacsal")
```

<img src="index_files/figure-html/unnamed-chunk-6-1.png" width="50%" style="display: block; margin: auto;" />
]

---

# Stats: Exercise:

.task[Use the `PlantGrowth` dataset (hint: data()) and make a boxplot with the means marked with a green X]

--

.panelset[
.panel[.panel-name[Stat]
.small[
``` r
data("PlantGrowth")
ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot() +
  stat_summary(geom = "point", fun = "mean", color = "darkgreen", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-7-1.png" width="50%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Geom]
.small[
``` r
data("PlantGrowth")
ggplot(PlantGrowth, aes(x = group, y = weight)) +
  geom_boxplot() +
  geom_point(stat = "summary", fun = "mean", color = "darkgreen", size = 5, pch = 4, stroke = 2)
```

<img src="index_files/figure-html/unnamed-chunk-8-1.png" width="50%" style="display: block; margin: auto;" />
]
]
]

---

class: middle, inverse

# Scales

---

## What is a scale?

- Each scale is a function from a region in data space (the domain of the scale) to a region in aesthetic space (the range of the scale)
- The axis or legend is the inverse function: it allows you to convert visual properties back to data

---

## Scale specification

Every aesthetic in your plot is associated with exactly one scale:

.panelset[
.panel[.panel-name[Code]
``` r
# automatic scales
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) +
  geom_point(alpha = 0.8)
```

``` r
# manual scales
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous() +
  scale_y_continuous() +
  scale_color_discrete() # the default for an unordered factor such as `type`
```
]
.panel[.panel-name[Plot]
<img src="index_files/figure-html/unnamed-chunk-10-1.png" width="50%" style="display: block; margin: auto;" />
]
]

---

## Anatomy of a scale function

.large[
.center[
`scale_A_B()`
]
]

- Always starts with `scale`
- `A`: Name of the primary aesthetic (e.g., `color`, `shape`, `x`)
- `B`: Name of the scale (e.g., `continuous`, `discrete`, `brewer`)

---

## Guess the output

.task[
What will the x-axis label of the following plot say?
]

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "pctpell") +
  scale_x_continuous(name = "Percent of students receiving a Pell grant")
```

---

## "Address" messages

``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) +
  geom_point(alpha = 0.8) +
  scale_x_continuous(name = "pctpell") +
  scale_x_continuous(name = "Percent of students receiving a Pell grant")
```

```
## Scale for x is already present.
## Adding another scale for x, which will replace the existing scale.
```

<img src="index_files/figure-html/unnamed-chunk-12-1.png" width="50%" style="display: block; margin: auto;" />

---

## Fixing axes

``` r
# scales:: prefix is needed here because the scales package is not attached yet
ggplot(scorecard, aes(x = pctpell, y = avgfacsal, color = type)) +
  geom_point(alpha = 0.8) +
  scale_y_continuous(name = "Average Faculty Salary", labels = scales::label_dollar()) +
  scale_x_continuous(name = "Percent of students receiving a Pell grant", labels = scales::label_percent())
```

<img src="index_files/figure-html/unnamed-chunk-13-1.png" width="50%" style="display: block; margin: auto;" />

---

## Guess the output

.task[
What happens if you pair a discrete variable with a continuous scale? What happens if you pair a continuous variable with a discrete scale? Answer in the context of the following plots.
]

.panelset[
.panel[.panel-name[Type]
``` r
ggplot(scorecard, aes(type)) +
  geom_bar(stat = "count")
```

<img src="index_files/figure-html/unnamed-chunk-14-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Plots]
.pull-left[
``` r
ggplot(
  data = scorecard,
  mapping = aes(
    x = type,
    y = avgfacsal
  )
) +
  geom_point(alpha = 0.5) +
  scale_x_continuous()
```
]
.pull-right[
``` r
ggplot(
  data = scorecard,
  mapping = aes(
    x = type,
    y = avgfacsal
  )
) +
  geom_point(alpha = 0.5) +
  scale_y_discrete()
```
]
]
.panel[.panel-name[Results]
.pull-left[
```
## Error in `scale_x_continuous()`:
## ! Discrete value supplied to a continuous scale.
## ℹ Example values: Public, Private, nonprofit, and Private, for-profit.
```
]
.pull-right[
<img src="index_files/figure-html/incorrect-scale-discrete-1.png" width="50%" style="display: block; margin: auto;" />
]
]
]
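---

## Matching the scale to the variable type

A minimal sketch of the pairing that does work: a discrete variable on a discrete position scale and a continuous variable on a continuous one. It assumes the `mpg` data that ships with ggplot2 rather than `scorecard`, so the variables and axis titles here are illustrative only.

```r
library(ggplot2)

# `class` is discrete (character) and `hwy` is continuous (integer)
p <- ggplot(mpg, aes(x = class, y = hwy)) +
  geom_point(alpha = 0.5) +
  scale_x_discrete(name = "Vehicle class") +
  scale_y_continuous(name = "Highway miles per gallon")

p

# inspect which kind of scale each position aesthetic ended up with
class(layer_scales(p)$x)
class(layer_scales(p)$y)
```

Swapping `scale_x_discrete()` for `scale_x_continuous()` in this sketch should trigger the same "Discrete value supplied to a continuous scale" error when the plot is drawn.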
---

## Transformations

When working with continuous data, the default is to map linearly from the data space onto the aesthetic space, but this scale can be transformed.

.panelset[
.panel[.panel-name[Linear]
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5)
```

<img src="index_files/figure-html/unnamed-chunk-16-1.png" width="45%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Transformed]
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")
```

<img src="index_files/figure-html/unnamed-chunk-17-1.png" width="45%" style="display: block; margin: auto;" />
]
]

---

## Continuous scale transformations

| Name       | Function `\(f(x)\)`         | Inverse `\(f^{-1}(y)\)`      |
|------------|-----------------------------|------------------------------|
| asn        | `\(2\sin^{-1}(\sqrt{x})\)`  | `\(\sin^2(y / 2)\)`          |
| exp        | `\(e ^ x\)`                 | `\(\log(y)\)`                |
| identity   | `\(x\)`                     | `\(y\)`                      |
| log        | `\(\log(x)\)`               | `\(e ^ y\)`                  |
| log10      | `\(\log_{10}(x)\)`          | `\(10 ^ y\)`                 |
| log2       | `\(\log_2(x)\)`             | `\(2 ^ y\)`                  |
| logit      | `\(\log(\frac{x}{1 - x})\)` | `\(\frac{1}{1 + e^{-y}}\)`   |
| pow10      | `\(10^x\)`                  | `\(\log_{10}(y)\)`           |
| probit     | `\(\Phi^{-1}(x)\)`          | `\(\Phi(y)\)`                |
| reciprocal | `\(x^{-1}\)`                | `\(y^{-1}\)`                 |
| reverse    | `\(-x\)`                    | `\(-y\)`                     |
| sqrt       | `\(x^{1/2}\)`               | `\(y ^ 2\)`                  |

---

## Convenience functions for transformations

.pull-left[
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(trans = "log10")
```

<img src="index_files/figure-html/unnamed-chunk-18-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_y_log10(labels = scales::label_comma())
```

<img src="index_files/figure-html/unnamed-chunk-19-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## Scale transform vs. data transform

.task[
How are the following three plots different, and how are they similar? What does this say about how scale transformations work?
]

.panelset[
.panel[.panel-name[Plot A]
.pull-left[
``` r
scorecard %>%
  mutate(avgfacsal_log10 = log(avgfacsal, base = 10)) %>%
  ggplot(aes(x = pctpell, y = avgfacsal_log10)) +
  geom_point(alpha = 0.5)
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-20-1.png" width="100%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Plot B]
.pull-left[
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_y_log10()
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Plot C]
.pull-left[
``` r
ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_y_log10(
    labels = scales::label_comma())
```
]
.pull-right[
<img src="index_files/figure-html/unnamed-chunk-22-1.png" width="100%" style="display: block; margin: auto;" />
]
]
]

---

# Scales exercise

.task[In groups of two-to-three:

* Identify a relevant R dataset that has your assigned variable type (e.g. continuous/continuous)
* Create a simple plot
* Rework the scales using both of the following:
  * Naming
  * Transformation
]
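---

## Scales exercise: one possible approach

A hedged sketch of what an answer could look like for a continuous/continuous pair. It assumes the `diamonds` data from ggplot2; the axis names and the choice of a log10 transformation are illustrative, not the required solution. The last two lines check the point from the previous slides: a scale transformation moves the plotted positions onto the log10 scale, while the labels stay in the original units.

```r
library(ggplot2)

p <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1) +
  # naming
  scale_x_continuous(name = "Weight (carats)") +
  # naming + transformation, with labels kept in dollars
  scale_y_log10(name = "Price (USD, log scale)", labels = scales::label_dollar())

p

# positions in the built layer are log10(price), not price
range(layer_data(p)$y)
range(log10(diamonds$price))
```

Transforming the data with `mutate()` instead would give the same point positions but axis labels in log10 units.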
---

class: middle, inverse

# Guides

---

## What is a guide?

Guides are legends and axes:

<img src="images/scale-guides.png" alt="Common components of axes and legends" width="80%" style="display: block; margin: auto;" />

.footnote[
Source: ggplot2: Elegant Graphics for Data Analysis, [Chp 15](https://ggplot2-book.org/scales-guides.html#scale-guide).
]

---

## Customizing axes

``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance"
  )
```

<img src="index_files/figure-html/unnamed-chunk-25-1.png" width="50%" style="display: block; margin: auto;" />

---

## Customizing axes

.task[
Why does 60000 not appear on the x-axis?
]

.small[
``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000)
  )
```

<img src="index_files/figure-html/unnamed-chunk-26-1.png" width="40%" style="display: block; margin: auto;" />
]

---

## Customizing axes

.small[
``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000)
  )
```

<img src="index_files/figure-html/unnamed-chunk-27-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Customizing axes

.small[
``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000),
    labels = c("$0", "$10,000", "$20,000", "$30,000", "$40,000", "$50,000", "$60,000")
  )
```

<img src="index_files/figure-html/unnamed-chunk-28-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Customizing axes

.small[
``` r
library(scales)

ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000),
    labels = label_currency()
  )
```

<img src="index_files/figure-html/unnamed-chunk-29-1.png" width="45%" style="display: block; margin: auto;" />
]

---

## Customizing axes

.small[
``` r
ggplot(scorecard, aes(x = netcost, y = avgfacsal)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(
    name = "Net cost of attendance",
    breaks = seq(from = 0, to = 60000, by = 10000),
    limits = c(0, 60000),
    labels = label_currency()) +
  scale_y_continuous(
    name = "Average faculty salary (USD)",
    labels = label_dollar())
```

<img src="index_files/figure-html/unnamed-chunk-30-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Aside: storing a plot

.small[
``` r
set.seed(1234)

p_pctpell_avgfacsal_type <- ggplot(scorecard, aes(x = pctpell, y = avgfacsal)) +
  geom_jitter(aes(color = type, shape = type), size = 2)

p_pctpell_avgfacsal_type
```

<img src="index_files/figure-html/unnamed-chunk-31-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Customizing axis and legend labels with `scale_*()`

.small[
``` r
p_pctpell_avgfacsal_type +
  scale_x_continuous(name = "Percent of students receiving a Pell grant") +
  scale_y_continuous(name = "Average faculty salary (USD)") +
  scale_color_discrete(name = "College type") +
  scale_shape_discrete(name = "College type")
```

<img src="index_files/figure-html/unnamed-chunk-32-1.png" width="50%" style="display: block; margin: auto;" />
]

---

## Customizing axis and legend labels with `labs()`

.panelset[
.panel[.panel-name[Separate]
.small[
``` r
p_pctpell_avgfacsal_type +
  labs(
    x = "Percent of students receiving a Pell grant",
    y = "Average faculty salary (USD)",
    color = "College type")
```

<img src="index_files/figure-html/unnamed-chunk-33-1.png" width="50%" style="display: block; margin: auto;" />
]
]
.panel[.panel-name[Combined]
.small[
``` r
p_pctpell_avgfacsal_type +
  labs(
    x = "Percent of students receiving a Pell grant",
    y = "Average faculty salary (USD)",
    color = "College type",
    shape = "College type"
  )
```

<img src="index_files/figure-html/unnamed-chunk-34-1.png" width="50%" style="display: block; margin: auto;" />
]
]
]

---

# Guides exercise

.task[In groups of two-to-three:

* Identify a relevant R dataset (*hint: data()*) that has your assigned variable type (e.g. continuous/continuous)
* Create a simple plot
* Rework the scales using at least one option from each of the following categories:
  * Scales (x axis): adding breaks, limits, labels, transformations
  * Scales (y axis): adding breaks, limits, labels, transformations
  * Add color or shape (e.g. `scale_color_discrete()`)
  * Legend and labs: polish
]

Post this graph and code on Ed. It should be reproducible from the code you submit! Provide approximately 2-3 sentences explaining why and how you made the choices you did.

---

# Bonus: Merging Legends!

.panelset[
.panel[.panel-name[Easy Fix]
``` r
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = factor(year), shape = factor(year)), size = 5) +
  scale_colour_brewer("year", type = "qual", palette = 5)
```

<img src="index_files/figure-html/easy-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[Easy Fixed]
``` r
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = factor(year), shape = factor(year)), size = 5) +
  scale_colour_brewer("year", type = "qual", palette = 5) +
  labs(colour = "year", shape = "year")
```

<img src="index_files/figure-html/easy-fixed-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[ARGGHHH]
``` r
ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = factor(year), shape = factor(cyl)), size = 5) +
  scale_colour_brewer("year", type = "qual", palette = 5)
```

<img src="index_files/figure-html/not-easy-1.png" width="50%" style="display: block; margin: auto;" />
]
.panel[.panel-name[ARGGHHH2]
``` r
# lookup table: one row per year/cyl combination, with its shape, colour, and label
legend <- mpg %>%
  distinct(year, cyl) %>%
  # sort to match the level order of interaction(year, cyl), where year varies fastest
  arrange(cyl, year) %>%
  mutate(
    shape = case_when(
      cyl == 4 ~ 15,
      cyl == 5 ~ 17,
      cyl == 6 ~ 18,
      cyl == 8 ~ 19
    ),
    colour = case_when(
      year == 1999 ~ "gray",
      year == 2008 ~ "black"
    ),
    label = paste0(year, " Year, ", cyl, " cyl"))

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(colour = interaction(as.factor(year), as.factor(cyl)),
                 shape = interaction(as.factor(year), as.factor(cyl))),
             size = 5) +
  # identical names and labels on both scales let ggplot2 merge the legends
  scale_colour_discrete(name = "Combined",
                        labels = legend %>% distinct(cyl, label) %>% pull(label),
                        type = legend %>% distinct(cyl, colour) %>% pull(colour)) +
  scale_shape_manual(name = "Combined",
                     labels = legend %>% distinct(year, label) %>% pull(label),
                     values = legend %>% distinct(year, shape) %>% pull(shape)) +
  geom_point()
```
]
.panel[.panel-name[ARGGHHH2 plot]
*(why would we want / not want to do this?)*

<img src="index_files/figure-html/unnamed-chunk-36-1.png" width="55%" style="display: block; margin: auto;" />
]
]

---

# Recap

* Assignment 2 (Next week!)
* **Scales** can help you rework how the axis labels appear
* **Guides**: pulling together legends / axes
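
---

# Recap in code

A compact sketch that pulls the recap together in one plot, assuming the `mpg` data from ggplot2. The axis titles, break spacing, and legend title are illustrative choices, not part of the original slides.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # colour and shape are mapped to the same variable
  geom_point(aes(color = drv, shape = drv), alpha = 0.6) +
  # scales: name, breaks, and limits for the x-axis
  scale_x_continuous(
    name = "Engine displacement (litres)",
    breaks = seq(from = 1, to = 7, by = 1),
    limits = c(1, 7)
  ) +
  scale_y_continuous(name = "Highway miles per gallon") +
  # guides: giving both aesthetics the same title merges the two legends
  labs(color = "Drive train", shape = "Drive train")

p
```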