class: center, middle, inverse, title-slide .title[ # Deep dive: geometric objects (AKA geoms) ] .author[ ### MACS 40700
University of Chicago ] --- class: middle, inverse # Agenda: * Aesthetic mappings: WHERE TO PUT! * Geoms: making a layer cake * Scales --- ## Announcements * Assignment deadlines (often BUT NOT ALWAYS Weds / Fridays) --- class: middle, inverse # Setup --- ## Packages ``` r # load packages library(tidyverse) # remotes::install_github("MACS40700/c3s2datasets") library(c3s2datasets) ``` --- ## ggplot2 theme ``` r # set default theme for ggplot2 ggplot2::theme_set(ggplot2::theme_minimal(base_size = 16)) ``` --- ## Figure sizing For more on including figures in R Markdown documents with the right size, resolution, etc. the following resources are great: - [R for Data Science - Graphics for communication](https://r4ds.had.co.nz/graphics-for-communication.html) - [Tips and tricks for working with images and figures in R Markdown documents](https://www.zevross.com/blog/2017/06/19/tips-and-tricks-for-working-with-images-and-figures-in-r-markdown-documents/) ``` r # set default figure parameters for knitr knitr::opts_chunk$set( fig.width = 8, # 8" fig.asp = 0.618, # the golden ratio fig.retina = 2, # dpi multiplier for displaying HTML output on retina dpi = 150, # higher dpi, sharper image out.width = "70%" ) ``` --- ## Data prep: new variables From last class... ``` r scorecard <- scorecard %>% mutate(pctpell_cat = cut_interval(x = pctpell, n = 6)) %>% drop_na(pctpell_cat) scorecard %>% count(pctpell_cat) ``` ``` ## # A tibble: 6 × 2 ## pctpell_cat n ## <fct> <int> ## 1 [0,0.163] 164 ## 2 (0.163,0.326] 657 ## 3 (0.326,0.489] 591 ## 4 (0.489,0.652] 209 ## 5 (0.652,0.816] 68 ## 6 (0.816,0.979] 26 ``` --- ## Data prep: summary table From last class... ``` r mean_netcost_pctpell <- scorecard %>% group_by(pctpell_cat) %>% summarise(mean_netcost = mean(netcost, na.rm = TRUE)) mean_netcost_pctpell ``` ``` ## # A tibble: 6 × 2 ## pctpell_cat mean_netcost ## <fct> <dbl> ## 1 [0,0.163] 28713. ## 2 (0.163,0.326] 21525. ## 3 (0.326,0.489] 18660. ## 4 (0.489,0.652] 17087. ## 5 (0.652,0.816] 14771. ## 6 (0.816,0.979] 8466. ``` --- class: middle, inverse # Aesthetic mappings in ggplot2 --- ## Global vs. layer-specific aesthetics - Aesthetic mappings can be supplied in the initial `ggplot()` call, in individual layers, or in some combination of both. - Within each layer, you can add, override, or remove mappings. - If you only have one layer in the plot, the way you specify aesthetics doesn’t make any difference. However, the distinction is important when you start adding additional layers. --- ## Activity: Spot the difference I .task[ Do you expect the following plots to be the same or different? If different, how? Discuss in a pair (or group) without running the code and sketch the resulting plots based on what you think the code will produce. ] ``` r # Plot A ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ``` ``` r # Plot B ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(aes(color = pctpell_cat), alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ```
--- ## Activity: Spot the difference I .panelset[ .panel[.panel-name["Plot A"] ``` r # Plot A ggplot(scorecard, aes(x = avgfacsal, y = netcost, color = pctpell_cat)) + geom_point(alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ``` <img src="index_files/figure-html/unnamed-chunk-8-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name["Plot B"] ``` r # Plot B ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(aes(color = pctpell_cat), alpha = 0.7) + geom_smooth(method = "lm", se = FALSE, size = 0.5) ``` <img src="index_files/figure-html/unnamed-chunk-9-1.png" width="60%" style="display: block; margin: auto;" /> ]] --- ## Activity: Spot the difference II .task[ Do you expect the following plots to be the same or different? If different, how? Discuss in a pair (or group) without running the code and sketch the resulting plots based on what you think the code will produce. ] ``` r # Plot A ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(aes(color = pctpell_cat)) ``` ``` r # Plot B ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(color = "blue") ``` ``` r # Plot C ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(color = "#a493ba") ```
--- ## Activity: Spot the difference II .panelset[ .panel[.panel-name["Plot A"] ``` r # Plot A ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(aes(color = pctpell_cat)) ``` <img src="index_files/figure-html/unnamed-chunk-12-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name["Plot B"] ``` r # Plot B ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(color = "blue") ``` <img src="index_files/figure-html/unnamed-chunk-13-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name["Plot C"] ``` r # Plot C ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(color = "#800000") ``` <img src="index_files/figure-html/unnamed-chunk-14-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- class: middle, inverse # Geoms --- ## Geoms - Geometric objects, or geoms for short, perform the actual rendering of the layer, controlling the type of plot that you create - You can think of them as "the geometric shape used to represent the data" --- ## One variable - Discrete: - `geom_bar()`: display distribution of discrete variable. - Continuous - `geom_histogram()`: bin and count continuous variable, display with bars - `geom_density()`: smoothed density estimate - `geom_dotplot()`: stack individual points into a dot plot - `geom_freqpoly()`: bin and count continuous variable, display with lines --- ## .hand[aside...] Always use "typewriter text" (monospace font) when writing function names, and follow with `()`, e.g., - `geom_freqpoly()` - `mean()` - `lm()` --- ## `geom_bar()` ``` r ggplot(scorecard, aes(x = pctpell_cat)) + geom_bar() ``` <img src="index_files/figure-html/unnamed-chunk-15-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_bar()` ``` r ggplot(scorecard, aes(y = pctpell_cat)) + geom_bar() ``` <img src="index_files/figure-html/unnamed-chunk-16-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_histogram()` ``` r ggplot(scorecard, aes(x = netcost)) + geom_histogram() ``` <img src="index_files/figure-html/unnamed-chunk-17-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_histogram()` and `binwidth` .panelset[ .panel[.panel-name[2K] ``` r ggplot(scorecard, aes(x = netcost)) + geom_histogram(binwidth = 2000) ``` <img src="index_files/figure-html/unnamed-chunk-18-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[5K] ``` r ggplot(scorecard, aes(x = netcost)) + geom_histogram(binwidth = 5000) ``` <img src="index_files/figure-html/unnamed-chunk-19-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[10K] ``` r ggplot(scorecard, aes(x = netcost)) + geom_histogram(binwidth = 10000) ``` <img src="index_files/figure-html/unnamed-chunk-20-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[20K] ``` r ggplot(scorecard, aes(x = netcost)) + geom_histogram(binwidth = 20000) ``` <img src="index_files/figure-html/unnamed-chunk-21-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## `geom_density()` ``` r ggplot(scorecard, aes(x = netcost)) + geom_density() ``` <img src="index_files/figure-html/unnamed-chunk-22-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_density()` and bandwidth (`bw`) .panelset[ .panel[.panel-name[1] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(bw = 1) ``` <img src="index_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[100] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(bw = 100) ``` <img src="index_files/figure-html/unnamed-chunk-24-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[1000] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(bw = 1000) ``` <img src="index_files/figure-html/unnamed-chunk-25-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[10000] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(bw = 10000) ``` <img src="index_files/figure-html/unnamed-chunk-26-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## `geom_density()` outlines .panelset[ .panel[.panel-name[full] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(outline.type = "full") ``` <img src="index_files/figure-html/unnamed-chunk-27-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[both] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(outline.type = "both") ``` <img src="index_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[upper] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(outline.type = "upper") ``` <img src="index_files/figure-html/unnamed-chunk-29-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[lower] ``` r ggplot(scorecard, aes(x = netcost)) + geom_density(outline.type = "lower") ``` <img src="index_files/figure-html/unnamed-chunk-30-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## `geom_dotplot()` .task[ What does each point represent? How are their locations determined? What do the x and y axes represent? ] ``` r ggplot(scorecard, aes(x = netcost)) + geom_dotplot(binwidth = 418) ``` <img src="index_files/figure-html/unnamed-chunk-31-1.png" width="80%" style="display: block; margin: auto;" />
--- ## `geom_freqpoly()` ``` r ggplot(scorecard, aes(x = netcost)) + geom_freqpoly(binwidth = 1000) ``` <img src="index_files/figure-html/unnamed-chunk-33-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_freqpoly()` for comparisons .panelset[ .panel[.panel-name[Histogram] ``` r ggplot(scorecard, aes(x = netcost, fill = pctpell_cat)) + geom_histogram(binwidth = 5000) ``` <img src="index_files/figure-html/unnamed-chunk-34-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Frequency polygon] ``` r ggplot(scorecard, aes(x = netcost, color = pctpell_cat)) + geom_freqpoly(binwidth = 5000, size = 1) ``` <img src="index_files/figure-html/unnamed-chunk-35-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Hisogram (identity)] ``` r ggplot(scorecard, aes(x = netcost, fill = pctpell_cat)) + geom_histogram(binwidth = 5000, position = "identity") ``` <img src="index_files/figure-html/unnamed-chunk-36-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Hisogram (identity + alpha)] ``` r ggplot(scorecard, aes(x = netcost, fill = pctpell_cat)) + geom_histogram(binwidth = 5000, position = "identity", alpha = 0.5) ``` <img src="index_files/figure-html/unnamed-chunk-37-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## Two variables - both continuous - `geom_point()`: scatterplot - `geom_quantile()`: smoothed quantile regression - `geom_rug()`: marginal rug plots - `geom_smooth()`: smoothed line of best fit - `geom_text()`: text labels --- ## `geom_rug()` ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point() + geom_rug(alpha = 0.1) ``` <img src="index_files/figure-html/unnamed-chunk-38-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_rug()` on the outside ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point() + geom_rug(alpha = 0.1, outside = TRUE) + coord_cartesian(clip = "off") ``` <img src="index_files/figure-html/unnamed-chunk-39-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_rug()` on the outside, but better ``` r ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point() + geom_rug(alpha = 0.1, outside = TRUE, sides = "tr") + coord_cartesian(clip = "off") + theme(plot.margin = margin(1, 1, 1, 1, "cm")) ``` <img src="index_files/figure-html/unnamed-chunk-40-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_text()` ``` r sample_n(scorecard, size = 50) %>% ggplot(aes(x = avgfacsal, y = netcost)) + geom_text(aes(label = type)) ``` <img src="index_files/figure-html/unnamed-chunk-41-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_text()` and more ``` r sample_n(scorecard, size = 50) %>% ggplot(aes(x = avgfacsal, y = netcost)) + geom_text(aes(label = type, color = type)) ``` <img src="index_files/figure-html/unnamed-chunk-42-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_text()` and even more ``` r sample_n(scorecard, size = 50) %>% ggplot(aes(x = avgfacsal, y = netcost)) + geom_text(aes(label = type, color = type), show.legend = FALSE ) ``` <img src="index_files/figure-html/unnamed-chunk-43-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Two variables - show distribution - `geom_bin2d()`: bin into rectangles and count - `geom_density2d()`: smoothed 2d density estimate - `geom_hex()`: bin into hexagons and count --- ## `geom_hex()` ``` r library(hexbin) ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_hex() ``` <img src="index_files/figure-html/unnamed-chunk-44-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_hex()` and warnings - Requires installing the [**hexbin**](https://cran.r-project.org/web/packages/hexbin/index.html) package separately! ``` r install.packages("hexbin") ``` - Otherwise you might see ``` Warning: Computation failed in `stat_binhex()` ``` --- ## Two variables - At least one discrete - `geom_count()`: count number of point at distinct locations - `geom_jitter()`: randomly jitter overlapping points - One continuous, one discrete - `geom_col()`: a bar chart of pre-computed summaries - `geom_boxplot()`: boxplots - `geom_violin()`: show density of values in each group --- ## `geom_jitter()` .task[ How are the following three plots different? ] .panelset[ .panel[.panel-name[Plot A] ``` r ggplot(scorecard, aes(x = type, y = netcost)) + geom_point() ``` <img src="index_files/figure-html/unnamed-chunk-46-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot B] ``` r ggplot(scorecard, aes(x = type, y = netcost)) + geom_jitter() ``` <img src="index_files/figure-html/unnamed-chunk-47-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot C] ``` r ggplot(scorecard, aes(x = type, y = netcost)) + geom_jitter() ``` <img src="index_files/figure-html/unnamed-chunk-48-1.png" width="60%" style="display: block; margin: auto;" /> ] ]
--- ## `geom_jitter()` and `set.seed()` .panelset[ .panel[.panel-name[Plot A] ``` r set.seed(123) ggplot(scorecard, aes(x = type, y = netcost)) + geom_jitter() ``` <img src="index_files/figure-html/unnamed-chunk-50-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot B] ``` r set.seed(123) ggplot(scorecard, aes(x = type, y = netcost)) + geom_jitter() ``` <img src="index_files/figure-html/unnamed-chunk-51-1.png" width="60%" style="display: block; margin: auto;" /> ] .panel[.panel-name[Plot C] ``` r set.seed(234) ggplot(scorecard, aes(x = type, y = netcost)) + geom_jitter() ``` <img src="index_files/figure-html/unnamed-chunk-52-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## Two variables - One time, one continuous - `geom_area()`: area plot - `geom_line()`: line plot - `geom_step()`: step plot - Display uncertainty: - `geom_crossbar()`: vertical bar with center - `geom_errorbar()`: error bars - `geom_linerange()`: vertical line - `geom_pointrange()`: vertical line with center - Spatial - `geom_sf()`: for map data (more on this later...) --- ## Average net cost per Pell grant recipient ``` r mean_netcost_pctpell <- scorecard %>% mutate(pctpell = round(pctpell, digits = 2)) %>% group_by(pctpell) %>% summarise( n = n(), mean_netcost = mean(netcost), sd_netcost = sd(netcost) ) %>% drop_na(mean_netcost) mean_netcost_pctpell %>% head(6) ``` ``` ## # A tibble: 6 × 4 ## pctpell n mean_netcost sd_netcost ## <dbl> <int> <dbl> <dbl> ## 1 0.01 1 2640 NA ## 2 0.07 3 17447. 12883. ## 3 0.08 3 16578 11709. ## 4 0.1 8 32568. 10281. ## 5 0.12 11 31959 8336. ## 6 0.13 14 29326. 9716. ``` --- ## `geom_line()` ``` r ggplot(mean_netcost_pctpell, aes(x = pctpell, y = mean_netcost)) + geom_line() ``` <img src="index_files/figure-html/unnamed-chunk-54-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_area()` ``` r ggplot(mean_netcost_pctpell, aes(x = pctpell, y = mean_netcost)) + geom_area() ``` <img src="index_files/figure-html/unnamed-chunk-55-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_step()` ``` r ggplot(mean_netcost_pctpell, aes(x = pctpell, y = mean_netcost)) + geom_step() ``` <img src="index_files/figure-html/unnamed-chunk-56-1.png" width="60%" style="display: block; margin: auto;" /> --- ## `geom_errorbar()` .task[ Describe how this plot is constructed and what the points and the lines (error bars) correspond to. ] .panelset[ .panel[.panel-name[Code] ``` r ggplot(mean_netcost_pctpell, aes(x = pctpell, y = mean_netcost)) + geom_point() + geom_errorbar(aes( ymin = mean_netcost - sd_netcost, ymax = mean_netcost + sd_netcost )) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-57-1.png" width="60%" style="display: block; margin: auto;" /> ] ]
--- class: center, inverse, middle ## Let's clean things up a bit! Meet your new best friend, the [**scales**](https://scales.r-lib.org/) package! ``` r library(scales) ``` --- # Scales We need one mapping from our aesthetic layer to our mapping function (e.g. aes). Scales can help with that transformation (use of guides)! https://scales.r-lib.org/ --- ## Let's clean things up a bit! .panelset[ .panel[.panel-name[Code] ``` r library(tidyverse) library(c3s2datasets) ggplot(scorecard, aes(x = avgfacsal, y = netcost)) + geom_point(alpha = 0.4, size = 2, color = "#012169") + scale_x_continuous(labels = label_number(big.mark = ",")) + scale_y_continuous(labels = label_dollar()) + labs( x = "Average faculty salary (USD)", y = "Net cost of attendance (USD)", title = "Faculty salaries and net cost of attendance in US universities", subtitle = "2018-19", caption = "Source: College Scorecard" ) ``` ] .panel[.panel-name[Plot] <img src="index_files/figure-html/unnamed-chunk-60-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- class: middle, inverse # geoms --- ## Three variables - `geom_tile()`: tile the plane with rectangles - `geom_raster()`: fast version of `geom_tile()` for equal sized tiles --- ## `geom_tile()` ``` r ggplot(scorecard, aes(x = type, y = locale)) + geom_tile(aes(fill = netcost)) ``` <img src="index_files/figure-html/unnamed-chunk-61-1.png" width="60%" style="display: block; margin: auto;" /> --- ## Another look at smooth-ish surface .panelset[ .panel[.panel-name[Summarize] .small[ ``` r mean_netcost_type_locale <- scorecard %>% group_by(type, locale) %>% summarize(mean_netcost = mean(netcost, na.rm = TRUE), .groups = "drop") mean_netcost_type_locale %>% head(10) ``` ``` ## # A tibble: 10 × 3 ## type locale mean_netcost ## <fct> <fct> <dbl> ## 1 Public City 13428. ## 2 Public Suburb 14726. ## 3 Public Town 14433. ## 4 Public Rural 13040. ## 5 Public <NA> 3567 ## 6 Private, nonprofit City 23443. ## 7 Private, nonprofit Suburb 22024. ## 8 Private, nonprofit Town 21581. ## 9 Private, nonprofit Rural 22389. ## 10 Private, for-profit City 28311. ``` ] ] .panel[.panel-name[Plot] ``` r ggplot(mean_netcost_type_locale, aes(x = type, y = locale)) + geom_point(aes(color = mean_netcost), size = 30, pch = "square") ``` <img src="index_files/figure-html/unnamed-chunk-63-1.png" width="60%" style="display: block; margin: auto;" /> ] ] --- ## Activity: Pick a geom .task[ For each of the following problems, suggest a useful geom: 1. Display how the value of variable has changed over time 1. Show the detailed distribution of a single continuous variable 1. Focus attention on the overall relationship between two variables in a large dataset 1. Label outlying points in a single variable ]
--- class: inverse, middle # Assignment 2: Refining plots --- ## Assignment 2 .task[ For this assignment, you will submit **SIX plots**: a ‘rough draft and a ‘final’ version for each. ] https://macs40700.netlify.app/assignments/assign2/ -- ### Info * You can use any data for this assignment but the dataset should be the same for both and there should be some connection between the two plots – some narrative of some sort. * This assignment should be able to stand alone, meaning that all information needed to understand and interpret the plots is provided by you in your writeup. * Keep it to 500-750 words * The final plots should be professional quality–something we could expect to see in a final report, thesis, etc. --- ## A2: Refining components For each data type (continuous-y and categorical), you will provide three plots: 1. The final plot. This is the plot that you selected to represent the data most effectively. 2. The rough draft / initial plot. This is the initial version of your final plot with all ‘defaults’ – e.g. not at all customized. 3. The alternative plot. This is a second plot option you tried before committing to the final form. For example, maybe you were debating between histograms, dot plots, and box plots. Include one version you tried. --- # Recap * Geoms: consider the layering options * Jitter: consider different seeds * Scales: way to format axes nicely * Work on graph ideas for A2