---
title: "Assignment 4 Notes"
output:
  html_document:
    toc: yes
    code_download: true
    code_folding: "hide"
---

```{r global_options, include = FALSE}
knitr::opts_chunk$set(collapse = TRUE, fig.align = "center")
```

## General Issues

* Make sure you name your files as requested, including matching the
  specified use of upper and lower case. This matters on file systems
  that are case-sensitive.

* Make sure to commit your work to your local repository and push your
  commits to GitLab. We can only see what is on GitLab, not what is on
  your computer. You can check what we see by going to the GitLab web
  interface.
 
* Include your name and the date in the header of your `.Rmd` file
  using `author:` and `date:` tags.

* Your HTML file should be a report of your findings.

    * Any graph you show should be discussed in your narrative.

    * Any code you show should be discussed in your narrative.

    * If you do not need to discuss a piece of code in the narrative,
      use the chunk option `echo = FALSE` to avoid showing it.

* If you load a file that you have included in your repository or that
  you download to your repository then you need to make sure the code
  in your Rmarkdown document uses a relative path, not an absolute
  one.  Absolute paths will only make sense on your computer, not on
  the computer of someone else who downloads your repository.

* If you want to check that your work is reproducible, you can download
  your work to a computer other than the one you use for developing it.
  One option is the CLAS Linux systems accessed via
  [FastX](https://linux.clas.uiowa.edu/help/fastx). You can use
  RStudio there to set up a clean copy of your repository and then
  just pull your changes and check that they knit successfully.
  Using `STAT4580::checkHW` is a convenient way to do this.
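
One way to catch absolute paths before pushing is to scan the `.Rmd`
source for quoted strings that start at a file system root. The
following is a minimal sketch in base R. `find_absolute_paths` is a
hypothetical helper written for these notes, not part of `STAT4580` or
any package, and its pattern only covers a few common cases (a leading
`/`, a leading `~`, or a Windows drive letter).

```r
# Hypothetical helper: return the lines that appear to contain a quoted
# absolute path (leading "/", leading "~", or a Windows drive letter).
find_absolute_paths <- function(lines) {
    pattern <- "[\"'](/|~|[A-Za-z]:)"
    lines[grepl(pattern, lines)]
}

# Example source lines; only the first uses a relative path.
src <- c('d1 <- read.csv("dfunds.csv")',
         'd2 <- read.csv("/home/me/hw4/dfunds.csv")',
         'd3 <- read.csv("~/hw4/dfunds.csv")',
         'd4 <- read.csv("C:/Users/me/hw4/dfunds.csv")')
flagged <- find_absolute_paths(src)
print(flagged)
```

Applied to the result of `readLines()` on your own `.Rmd` file, an
empty result suggests, but does not guarantee, that no absolute paths
remain.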


## 1. Evaluate a Visualization

The Vox visualization attracted some attention on the internet; some
examples:

- A [post](https://www.iflscience.com/infographic-shows-differences-between-diseases-we-donate-and-diseases-kill-us-25489)
  on <https://www.iflscience.com/>.

- A [post](https://nonprofitquarterly.org/2014/09/05/infographic-compares-donations-to-disease-and-finds-big-disparities/)
  on <https://nonprofitquarterly.org> with a link to one [alternative
  visualization](http://themendozaline.org/post/95757674381/this-bubble-chart-is-killing-me).

<!-- not available anymore
- Another [alternative visualization](<http://www.visualmagnetic.com/portfolio/donations-vs-deaths-where-should-our-money-go/>).
-->

- The [original visualization](img/orig-vox-chart.jpg) made the very
  common mistake of mapping magnitudes to circle radius, which
  distorts the perceived magnitudes since perception focuses on
  area. The main change in the [revision](img/new-vox-chart.jpg) is to
  map magnitude to area.

- The revision also changed some color assignments, but kept the
  traditional assignment of pink for Breast Cancer.
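
The size of the radius distortion can be sketched with a little
arithmetic in base R. The values below are made up for illustration
and are not taken from the Vox chart.

```r
# Illustrative magnitudes only (not data from the chart).
value <- c(1, 2, 4)

# Radius proportional to value: the drawn area grows with value^2.
radius_wrong <- value
area_wrong <- pi * radius_wrong^2

# Area proportional to value: the radius must grow with sqrt(value).
radius_right <- sqrt(value / pi)
area_right <- pi * radius_right^2

# Area ratios relative to the smallest value.
ratio_wrong <- area_wrong / area_wrong[1]
ratio_right <- area_right / area_right[1]
print(rbind(value, ratio_wrong, ratio_right))
```

With radius mapping, a value four times as large is drawn with sixteen
times the area. In `ggplot2`, `scale_size()` and `scale_size_area()`
scale the area of points, while `scale_radius()` scales the radius.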

Analysis of the visualization:

- Items: diseases and associated measurements.

- Attributes: disease, money raised, deaths.

- Marks: circles, text.

- Channels: vertical position, area, color (hue), text.

- Mappings:

    * Ranks within the numeric variables are mapped to vertical position.

    * Magnitudes of numeric variables are mapped to circle areas.

    * Magnitudes are also mapped to text labels.

    * Disease is mapped to color (hue).

A goal of the visualization is to show the discrepancy between the
relative amounts raised and the relative numbers of deaths.  To see
this relation, the reader has to compare the positions or sizes of
corresponding circles, and the only way to match the two circles for
a disease is by color, a weaker channel.

One good alternative, used in one of the links above, is a scatter plot:

```{r, echo = FALSE}
library(ggplot2)
if (! file.exists("dfunds.csv"))
    download.file("https://stat.uiowa.edu/~luke/data/dfunds.csv",
                  "dfunds.csv")
dfunds <- read.csv("dfunds.csv")
ggplot(dfunds, aes(x = Deaths, y = Funding, color = Disease)) +
    geom_point(size = 4)
```

Other options:

- a Tufte-style slope graph using standardized variables or ranks
  (essentially a parallel coordinates plot; used in another of the
  links above);

- visualizing a derived variable, such as funds per death.
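
Both alternatives start from simple derived columns. A minimal sketch,
assuming a data frame with the `Disease`, `Deaths`, and `Funding`
columns used in the scatter plot code. The rows below are placeholder
values, not the contents of `dfunds.csv`.

```r
# Placeholder rows with the same column names as dfunds; values are invented.
toy <- data.frame(Disease = c("A", "B", "C"),
                  Deaths = c(1000, 200, 50),
                  Funding = c(10, 40, 5))

# Derived variable: funding per death.
toy$FundsPerDeath <- toy$Funding / toy$Deaths

# Ranks (1 = largest) for a slope graph comparing the two orderings.
toy$DeathRank <- rank(-toy$Deaths)
toy$FundingRank <- rank(-toy$Funding)
print(toy)
```

A disease whose two ranks differ would appear as a sloped line in the
slope graph, which makes the discrepancy visible through position
rather than color.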

There are issues with the data; some of these are discussed in the
articles linked to above.


## 2. EPA Fuel Economy Data

```{r, message = FALSE}
library(lubridate)
library(readr)
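# Re-download if the file is missing or more than six months old;
# %m+% avoids an NA when the target day of the month does not exist.
if (! file.exists("vehicles.csv.zip") ||
    file.mtime("vehicles.csv.zip") %m+% months(6) < now())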
    download.file("http://www.stat.uiowa.edu/~luke/data/vehicles.csv.zip",
                  "vehicles.csv.zip")
newmpg <- read_csv("vehicles.csv.zip", guess_max = 100000)
```

From the [documentation for the
data](https://www.fueleconomy.gov/feg/ws/index.shtml#vehicle) the
appropriate variables seem to be:

  * `fuelType1`, the primary fuel type, corresponds to `fl` in `mpg`;
  * `highway08` corresponds to `hwy` in `mpg`;
  * `cylinders` corresponds to `cyl` in `mpg`;
  * `displ` corresponds to `displ` in `mpg`.
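
If it helps to reuse code written for `mpg`, this correspondence can
be applied as a renaming step. A sketch in base R on a placeholder
data frame: `mpg_names` is a hypothetical lookup vector, the rows are
invented, and the real `newmpg` has many more columns.

```r
# Lookup from EPA column names to the corresponding mpg names.
mpg_names <- c(fuelType1 = "fl", highway08 = "hwy",
               cylinders = "cyl", displ = "displ")

# Placeholder rows standing in for newmpg; values are invented.
toy <- data.frame(fuelType1 = c("Regular Gasoline", "Diesel"),
                  highway08 = c(30, 40),
                  cylinders = c(4, 4),
                  displ = c(2.0, 2.0),
                  year = c(2020, 2021))

# Keep the mapped columns and rename them; other columns are dropped.
renamed <- toy[names(mpg_names)]
names(renamed) <- unname(mpg_names)
print(renamed)
```

Only the names are matched here. The coding of the values still
differs; for example, `mpg$fl` uses single-letter codes rather than
full fuel names.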

The primary fuel type counts are:

```{r, message = FALSE}
library(dplyr)
tbl <- count(newmpg, fuelType1)
kbl <- knitr::kable(tbl, format = "html")
kableExtra::kable_styling(kbl, full_width = FALSE)
```

A bar chart of these numbers:

```{r}
thm <- theme_minimal() + theme(text = element_text(size = 16))
ggplot(tbl, aes(x = n, y = reorder(fuelType1, n))) +
    geom_col() +
    scale_x_continuous(expand = expansion(mult = c(0, .1))) +
    thm +
    ylab(NULL)
```

Regular gas is the dominant fuel type over all years, with premium second.
All other fuel types, including electricity, make up a small fraction.
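
The shares behind this statement can be computed directly from the
counts. The numbers below are the counts the table showed when these
notes were rendered; they are a snapshot and will change as the EPA
file is updated.

```r
# Counts from the rendered fuel type table (a snapshot, not recomputed here).
n <- c("Diesel" = 1310, "Electricity" = 1329, "Hydrogen" = 39,
       "Midgrade Gasoline" = 168, "Natural Gas" = 60,
       "Premium Gasoline" = 15580, "Regular Gasoline" = 31094)

# Share of each fuel type, as a percentage of all models.
share <- 100 * n / sum(n)
print(round(sort(share, decreasing = TRUE), 1))

# Combined share of everything other than regular and premium gasoline.
is_main <- names(share) %in% c("Regular Gasoline", "Premium Gasoline")
other <- sum(share[!is_main])
print(round(other, 1))
```

For this snapshot, regular accounts for roughly 63% of models, premium
for roughly 31%, and all other fuel types together for about 6%.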


## 3. Fuel Type Over the Years

A filled bar chart shows changes in the primary fuel type used over
the years:
  
```{r}
newmpg2 <- filter(newmpg, year <= 2025) |>
    mutate(year = factor(year))
ggplot(newmpg2, aes(y = year, fill = fuelType1)) +
    geom_bar(position = "fill") +
    scale_x_continuous(expand = c(0, 0)) +
    labs(x = "Proportion", y = NULL)
```

Regular gas was the predominant fuel type in the mid 1980s. Premium's
share then gradually increased to the point where almost as many
models used premium as regular, but it has decreased recently. Diesel's
popularity declined early and has had a small resurgence recently. The
market share for electricity is still small but is growing.
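
The proportions drawn by `position = "fill"` are within-year shares.
A sketch of the same computation in base R on placeholder rows; the
years and fuel types below are invented, not taken from `newmpg`.

```r
# Placeholder rows: one row per model, as in newmpg.
toy <- data.frame(year = c(1985, 1985, 1985, 1985, 2020, 2020, 2020, 2020),
                  fuelType1 = c("Regular", "Regular", "Regular", "Premium",
                                "Regular", "Regular", "Premium", "Premium"))

# Count models by year and fuel type, then convert each row to proportions.
counts <- table(toy$year, toy$fuelType1)
shares <- prop.table(counts, margin = 1)
print(shares)
```

Each row of `shares` sums to one, which is why every bar in the filled
chart has the same length and only the split between fuel types varies.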


## 4. Highway Fuel Economy Over the Years

```{r}
newmpg3 <- filter(newmpg, year <= 2025, year >= 2000) |>
    mutate(year = factor(year))
alpha <- 0.2
size <- 0.3
nyear <- length(levels(newmpg3$year))
```

A strip chart is a useful way to look at the full data for a numeric
variable at several different levels of a discrete variable, but some
tuning is needed for larger data sets. For the `r nyear` years of
highway gas mileage data from the EPA data set, using
`alpha` = `r alpha` and `size` = `r size` along with jittering seems to
work reasonably well:

```{r}
ggplot(newmpg3, aes(x = highway08, y = year)) +
    geom_point(position = "jitter", size = size, alpha = alpha) +
    ylab(NULL) +
    thm
```

Over time the highway gas mileage distributions are shifting slightly
toward higher values, with the upper tails becoming gradually longer and
an increasing number of very high efficiency models (mostly electric).
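
A numerical summary can back up what the strip chart suggests. A
sketch using base R on placeholder values (invented, not EPA data);
with the real data, `toy` would be replaced by `newmpg3`.

```r
# Placeholder highway mileage values for two years; invented numbers.
toy <- data.frame(year = factor(rep(c(2000, 2025), each = 5)),
                  highway08 = c(20, 22, 24, 26, 30,
                                22, 25, 28, 35, 110))

# Median and an upper-tail quantile of highway mileage for each year.
qs <- sapply(split(toy$highway08, toy$year),
             quantile, probs = c(0.5, 0.95))
print(qs)
```

A median that rises slowly while the upper quantile rises quickly would
match the pattern described above: a small shift in the bulk of the
distribution and a lengthening upper tail.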

<!--
Local Variables: 
mode: poly-markdown+R
mode: flyspell
End:
-->
