
.title[
# Introduction to data visualization
]
.author[
### MACS 40700 <br /> University of Chicago
]

---

# Course details

---

## Teaching team

### Instructor

* Jean Clipperton - clipperton@uchicago.edu

---

## Themes: what, why, and how

- **What:** the plot
  - Specific types of visualizations for a particular purpose (e.g., maps for spatial data, Sankey diagrams for proportions, etc.) 
  - Tooling to produce them (e.g., specific R packages)

--
- **How:** the process
  - Start with a design (sketch + pseudo code)
  - Pre-process data (e.g., wrangle, reshape, join, etc.)
  - Map data to aesthetics
  - Make visual encoding decisions (e.g., address accessibility concerns)
  - Post-process for visual appeal and annotation

--
- **Why:** the theory
  - Tie together "how" and "what" through the grammar of graphics
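
---

## Themes: a sketch of the process

The "how" steps above can be sketched in a few lines of R. This is a minimal illustration, not the course's prescribed workflow: it assumes the `dplyr` and `ggplot2` packages are installed, uses the `mpg` data set bundled with `ggplot2` as a stand-in, and the chart design (mean highway mileage per vehicle class) is a hypothetical example.

```r
library(dplyr)
library(ggplot2)

# Pre-process: summarize the raw data into the shape the design calls for
mpg_by_class <- mpg %>%
  group_by(class) %>%
  summarize(mean_hwy = mean(hwy), .groups = "drop")

# Map data to aesthetics: mean mileage on x, vehicle class (sorted) on y
p <- ggplot(mpg_by_class, aes(x = mean_hwy, y = reorder(class, mean_hwy))) +
  # Visual encoding decision: one fill color taken from the
  # colorblind-friendly Okabe-Ito palette
  geom_col(fill = "#0072B2") +
  # Post-process: informative labels and a cleaner theme
  labs(
    title = "Average highway mileage by vehicle class",
    x = "Mean highway miles per gallon",
    y = NULL
  ) +
  theme_minimal()

print(p)
```

Each `+` adds one layer or setting to the plot object, which is how the grammar of graphics ties the "what" and the "how" together.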

---

# Course components

---

## Course website

.center[
<iframe width="900" height="450" src="https://macs40700.netlify.app/" frameborder="0" style="background:white;"></iframe>  
]

---

## Lectures

- Build on readings
- Attendance *and engagement* expected

- A little bit of everything:
  - Traditional lecture
  - Live coding + demos
  - Short exercises + solution discussion

---

## Announcements

- Posted on Ed; be sure to check regularly

- I'll assume that you've read an announcement by the next "business" day

.center[
<iframe width="900" height="450" src="https://edstem.org/us/courses/70315/discussion/" frameborder="0" style="background:white;"></iframe>  
]

---

# Assessments

---

## Assessments

- Homework assignments
  - Accessed on GitHub, submitted on Canvas, individual 
- Final Project
  - Accessed on GitHub, submitted on Canvas, individual or team-based

---

## Teams: UP TO YOU

- Final project
  - You can opt in to group work OR work independently
  - Teams may have up to three people. Each person must clearly explain their contributions to the project, both descriptively and as a percentage (e.g., "I did x, my partner did y, and I contributed z% of the project", but with more detail). The percentages don't have to match exactly, but they should be in the same ballpark.
- Expectations and roles
  - Everyone is expected to contribute equal *effort*
  - Everyone is expected to understand *all* code turned in
  - Individual contribution evaluated by peer evaluation, commits, etc.

---

## Grading

|Assignment    |Type               |Value |n        |Due                |
|:-------------|:------------------|:-----|:--------|:------------------|
|Assignments   |Individual-ish     |60%   |5*       |~ Every other week |
|Final project |Individual or team |40%   |1 .fn[*] |Exam week          |

---

# Course policies

---

## Collaboration policy

- Only work that is clearly designated as team work should be completed collaboratively (Projects)

- Homework assignments must be completed individually. You may not directly share answers or code with others, but you are welcome to discuss the problems in general terms and ask for advice.

---

## Sharing / reusing code policy

- A huge volume of code is available on the web, and many tasks may have solutions posted

- Unless explicitly stated otherwise, this course's policy is that you may make use of any online resources (e.g., RStudio Community, Stack Overflow, etc.), but you must explicitly cite where you obtained any code you directly use or use as inspiration in your solution(s).

- Any recycled code that is discovered and is not explicitly cited will be treated as plagiarism, regardless of its source

- AI: if you use AI, you need to include a statement about what you asked, your original code, and the issues you fixed / resolved.

.task[**If you don't understand what the code is doing and are not prepared to explain it in detail, you should not submit it.**]

---

# Course tools

---

## RStudio

- Local R installations

- [Software setup instructions](https://macs40700.netlify.app/setup/#option-2---install-the-software-locally)

---

## GitHub

- GitHub classroom for the course

- All of your work and your membership (enrollment) in the organization are private

- Each assignment is a private repo on GitHub; I distribute the assignments on GitHub and you submit them there

- Feedback on assignments is in Canvas

---

## Username advice

Some brief advice about selecting your account names (particularly for GitHub):

- Incorporate your actual name! People like to know who they’re dealing with, and it makes your username easier for people to guess or remember

- Reuse your username from other PROFESSIONAL contexts, e.g., Twitter or Slack

- Pick a username you will be comfortable revealing to your future boss

- Shorter is better than longer, but be as unique as possible

- Make it timeless. Avoid highlighting your current university, employer, place of residence, or a year (birth, graduation, etc.)

---

## Ed Discussions

<br>

- Online forum for asking and answering questions

- Allows for code snippets

- Connected to Canvas

- Ask **and answer** questions related to course logistics, assignments, etc. here

- Personal questions (e.g., extensions, illnesses, specific code, etc.) should be sent via private message

---

# Workflow

Here is your workflow for the class:

* Start on the website: read over the assignment description.
* Click the GitHub link to accept the assignment: this will create a repo with the proper format/setup for the assignment. The permissions are set so that it is private to everyone except you and me/the TA.
* Connect to the GitHub repo and push your work to it.
* When complete, go to Canvas and submit the GitHub repo link. (We can't push grades from GitHub to Canvas, unfortunately.)

---

# Data, truth, and beauty

---

# Just show me the data!

``` r
head(my_data, 10)
```

```
## # A tibble: 10 × 2
##        x     y
##    <dbl> <dbl>
##  1  55.4  97.2
##  2  51.5  96.0
##  3  46.2  94.5
##  4  42.8  91.4
##  5  40.8  88.3
##  6  38.7  84.9
##  7  35.6  79.9
##  8  33.1  77.6
##  9  29.0  74.5
## 10  26.2  71.4
```

``` r
mean(my_data$x)
```

```
*## [1] 54.26327
```

``` r
mean(my_data$y)
```

```
*## [1] 47.83225
```

``` r
cor(my_data$x, my_data$y)
```

```
*## [1] -0.06447185
```

---

# Oh no: all these data have the same summary stats!
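
The data set behind the numbers on the previous slide is not identified here, so as a stand-in this sketch uses Anscombe's quartet (the `anscombe` data frame that ships with base R). It is a different data set, but it makes the same point: four sets of points with nearly identical means and correlations that look nothing alike once plotted.

```r
# Anscombe's quartet: columns x1..x4 and y1..y4 hold four separate x/y data sets
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i, mean_x = mean(x), mean_y = mean(y), cor_xy = cor(x, y))
}))

# Round so that the near-identical summary values are easy to compare
print(round(quartet_stats, 2))

# Plot each set in its own panel to see how different they really are
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = "x", ylab = "y", main = paste("Set", i))
}
par(op)  # restore the previous plotting layout
```

All four sets report a mean of 9 for x, a mean of about 7.5 for y, and a correlation of about 0.82, yet the four panels show clearly different shapes.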

---

# Raw data is not enough

---

# Humans love patterns

<div class="figure" style="text-align: center">
<img src="images/01/pattern-processing.png" alt="Pattern processing" width="80%" />
<p class="caption">Pattern processing</p>
</div>


---

# (Sometimes we love them too much)


---

# Beauty is necessary to see patterns


---

# Beautiful visualizations

---

# What makes a great visualization?

- Truthful
- Functional
- Beautiful
- Insightful
- Enlightening


---

Alberto Cairo, *The Truthful Art*:

> 1. It is truthful, as it’s based on thorough and honest research.
> 
> 2. It is functional, as it constitutes an accurate depiction of the data, and it’s built in a way that lets people do meaningful operations based on it (seeing change in time).
> 
> 3. It is beautiful, in the sense of being attractive, intriguing, and even aesthetically pleasing for its intended audience—scientists, in the first place, but the general public, too.
> 
> 4. It is insightful, as it reveals evidence that we would have a hard time seeing otherwise.
> 
> 5. It is enlightening because if we grasp and accept the evidence it depicts, it will change our minds for the better.

---

# What makes a great visualization?

> Graphical excellence is the **well-designed presentation of interesting data**—a matter of substance, of statistics, and of design … [It] consists of complex ideas communicated with clarity, precision, and efficiency. … [It] is that which **gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space** … [It] is nearly always multivariate … And graphical excellence requires **telling the truth about the data**.

---

# What makes a great visualization?

- Good aesthetics
- No substantive issues
- No perceptual issues
- Honesty + good judgment


---

# What's wrong?