---
title: "Scatter Plot Scalability and Enhancements"
output:
html_document:
toc: yes
code_folding: show
code_download: true
---
```{r setup, include = FALSE, message = FALSE}
source(here::here("setup.R"))
knitr::opts_chunk$set(collapse = TRUE, message = FALSE,
fig.height = 5, fig.width = 6, fig.align = "center")
set.seed(12345)
library(dplyr)
library(tidyr)
library(ggplot2)
library(lattice)
library(gridExtra)
library(patchwork)
source(here::here("datasets.R"))
theme_set(theme_minimal() +
theme(text = element_text(size = 16),
panel.border = element_rect(color = "grey30", fill = NA)))
```
## Scalability
Scatter plots work well for hundreds of observations
Over-plotting becomes an issue once the number of observations gets
into tens of thousands.
For some output formats storage also becomes an issue as the number of
points plotted increases.
Some simulated data:
```{r, class.source = "fold-hide"}
n <- 50000
## n50K <- data.frame(x = rnorm(n), y = rnorm(n))
x <- rnorm(n)
y <- x + 0.4 * x ^ 2 + rnorm(n)
y <- y / sd(y)
n50K <- data.frame(x = x, y = y)
n10K <- n50K[1 : 10000, ]
n1K <- n50K[1 : 1000, ]
p1 <- ggplot(n1K, aes(x, y)) +
geom_point() +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n1K)))
p2 <- ggplot(n10K, aes(x, y)) +
geom_point() +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n10K)))
library(patchwork)
p1 | p2
```
## Some Simple Options
Simple options to address over-plotting:
* sampling
* reducing the point size
* alpha blending
Reducing the point size helps when the number of points is in the low
tens of thousands:
```{r, class.source = "fold-hide"}
ggplot(n10K, aes(x, y)) +
geom_point(size = 0.1) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n10K)))
```
Alpha blending can also be effective, on its own or in combination
with point size adjustment:
```{r, class.source = "fold-hide"}
ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.01, size = 0.5) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n50K)))
```
Experimentation is usually needed to identify a good point size and
alpha level.
Both alpha blending and point size reduction inhibit the use of color
for encoding a grouping variable.
The best choices may vary from one output format to another.
## Density Estimation Methods
Some methods based on density estimation or binning:
* Displaying contours of a 2D density estimate.
* Encoding density estimates in point size.
* Hexagonal binning.
### Density Contours
A 2D density estimate can be displayed in terms of its _contours_, or
_level curves_.
```{r, fig.height = 4.75, class.source = "fold-hide"}
p <- ggplot(n50K, aes(x, y)) +
coord_equal() +
ggtitle(sprintf("%d Points", nrow(n50K)))
pp <- geom_point(alpha = 0.01, size = 0.5)
dd <- geom_density_2d(color = "red")
p + dd
```
These are the contours of the estimated density surface:
```{r, fig.height = 4.75, class.source = "fold-hide"}
d <- MASS::kde2d(n50K$x, n50K$y, n = 50)
v <- expand.grid(x = d$x, y = d$y)
v$z <- as.numeric(d$z)
lattice::wireframe(z ~ x + y,
data = v,
screen = list(z = 70,
x = -60))
```
2D density estimate contours can be superimposed on a set of points or
placed beneath a set of points:
```{r, fig.width = 8, fig.height = 6, class.source = "fold-hide"}
p1 <- p + list(pp, dd)
p2 <- p + list(dd, pp)
p1 | p2
```
### Density Levels Encoded with Point Size
Density levels can also be encoded in point size in a grid of points:
```{r, class.source = "fold-hide"}
p + stat_density_2d(
aes(size = after_stat(density)),
geom = "point",
n = 30,
contour = FALSE) +
scale_size(range = c(0, 6))
```
This scales well computationally
It does not easily support encoding a grouping with color or shape.
It introduces some distracting visual artifacts that are related to
some optical illusions [seen
earlier](percep.html#some-optical-illusions).
This effect can be reduced somewhat by jittering:
```{r, class.source = "fold-hide"}
jit_amt <- 0.03
p + stat_density_2d(
aes(size = after_stat(density)),
geom = "point",
position = position_jitter(jit_amt,
jit_amt),
n = 30, contour = FALSE) +
scale_size(range = c(0, 6))
```
### Hexagonal Binning
Hexagonal binning divides the plane into hexagonal bins and displays
the number of points in each bin:
```{r, class.source = "fold-hide"}
p + geom_hex()
```
This also scales very well to larger data sets.
The default color scheme seems less than ideal.
An alternative fill color choice:
```{r, class.source = "fold-hide"}
p + geom_hex() +
scale_fill_gradient(low = "gray",
high = "black")
```
Again it is possible to use a scaled point representation:
```{r, class.source = "fold-hide"}
p + stat_bin_hex(
geom = "point",
aes(size = after_stat(density)),
fill = NA)
```
This representation still produces some visual distractions, but less
than a rectangular grid.
Again, jittering can help:
```{r, class.source = "fold-hide"}
jit_amt <- 0.05
p + stat_bin_hex(
aes(size = after_stat(density)),
geom = "point",
position = position_jitter(jit_amt,
jit_amt),
fill = NA)
```
### Other Density Plots
The `hdrcde` package computes and plots density contours containing
specified proportions of the data.
The `hdr.boxplot.2d` function plots these contours and shows the
points not in the outermost contour:
```{r, class.source = "fold-hide"}
library(hdrcde)
with(n50K,
hdr.boxplot.2d(x, y,
prob = c(0.1, 0.5, 0.75, 0.9)))
```
## Some Enhancements
### Encoding Additional Variables
Scatter plots can encode information about other variables using
* symbol color
* symbol size
* symbol shape
An example using the `mpg` data set:
```{r, class.source = "fold-hide"}
p <- ggplot(mpg, aes(cty, hwy,
color = factor(cyl)))
p + geom_point(aes(size = drv))
```
Some encodings work better than others:
* Size is not a good fit for a discrete variable
* Even though `cyl` is numeric, it is best encoded as categorical.
* Size and color interfere with each other as the color of smaller
objects is harder to perceive.
* Similarly, size and shape interfere with each other.
Using `shape` instead of `size` for `drv`:
```{r, class.source = "fold-hide"}
p + geom_point(aes(shape = drv))
```
Increasing the size makes shapes and the colors easier to distinguish:
```{r}
p + geom_point(aes(shape = drv), size = 3)
```
### Marginal Plots
The `ggMargin` function in the `ggExtra` package attaches marginal
histograms to (some) plots produced by `ggplot`:
```{r, class.source = "fold-hide"}
library(ggExtra)
p <- ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.01, size = 0.5)
ggMarginal(p,
type = "histogram",
bins = 50,
fill = "lightgrey")
```
The default type is `"density"` for a marginal density plot:
```{r, class.source = "fold-hide"}
ggMarginal(p, fill = "lightgrey")
```
For data sets of more modest size rug plots along the axes can be
useful:
```{r, class.source = "fold-hide"}
ggplot(faithful,
aes(eruptions, waiting)) +
geom_point() +
geom_rug(alpha = 0.2)
```
### Adding a Smooth Curve
When the variables on the $y$ axis is a response variable a useful
enhancement is to add a smooth curve to a plot.
The default method is a form of local averaging, and includes a
representation of uncertainty:
```{r, fig.width = 9, fig.height = 4.25, class.source = "fold-hide"}
p1 <- ggplot(faithful, aes(eruptions, waiting)) +
geom_point() +
geom_smooth() +
ggtitle("Old Faithful Eruptions")
p2 <- ggplot(n50K, aes(x, y)) +
pp +
geom_smooth() +
ggtitle(sprintf("%d Points", nrow(n50K)))
p1 | p2
```
### Axis Transformation
Variable transformations are often helpful.
Variable transformations can be applied by
* plotting the transformed data
* plotting on transformed axes
Axis transformations `ggplot` supports:
* `log10` with `scale_x_log10`, scale_y_log10
* square root with `scale_x_sqrt`, `scale_y_sqrt`
* reversed axes with `scale_x_reverse`, `scale_y_reverse`
Changing scales can make plot shapes simpler, but non-linear axes are
harder to interpret.
```{r, fig.width = 9, class.source = "fold-hide"}
library(gapminder)
gap <- filter(gapminder, year == 2007)
p <- ggplot(gap,
aes(x = gdpPercap,
y = lifeExp,
color = continent,
size = pop)) +
geom_point() +
scale_size_area(max_size = 8)
p1 <- p +
guides(color = "none",
size = "none")
p2 <- p +
scale_x_log10() +
guides(size = "none")
p1 + p2
```
## Conditioning
When the $y$ variable is a response it can also be useful to explore
the conditional distribution of $y$ given different values of the $x$
variable.
One way to do this is to focus on the data for narrow ranges of the
conditioning variable.
Two approaches for creating groups to focus on:
* Use the `base` function `cut`, or one of the `gglot2` functions
* `cut_interval`
* `cut_number`
* `cut_width`
* Use the `round` function.
Rounding to an integer or using
```{r, eval = FALSE}
cut_width(x, width = 1, center = 0)
```
produces these bins:
```{r, class.source = "fold-hide"}
p <- ggplot(n50K, aes(x, y)) +
geom_point(alpha = 0.05, size = 0.5)
p + geom_vline(xintercept =
seq(-3.5, 3.5, by = 1),
linetype = 2,
color = "red")
```
Data within each group can then be shown using box plots (you need to
use the `group` aesthetic to plot on a continuous axis):
```{r}
n50K_trm <- filter(n50K, x > -2.5, x < 2.5)
mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(x, y)) +
geom_boxplot(aes(group = xrnd))
```
Using `cut_width` and violin plots:
```{r, class.source = "fold-hide"}
mutate(n50K_trm,
xcut = cut_width(x,
width = 1,
center = 0)) %>%
group_by(xcut) %>%
mutate(xmed = median(x)) %>%
ungroup() %>%
ggplot(aes(x, y)) +
geom_violin(aes(group = xmed),
draw_quantiles = 0.5)
```
Another option for visualizing the conditional distributions is
faceted density plots:
```{r, class.source = "fold-hide"}
mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(x)) +
geom_density(aes(y)) +
facet_wrap(~ xrnd, ncol = 1)
```
Density ridges work as well:
```{r, class.source = "fold-hide"}
#| warning: false
library(ggridges)
mutate(n50K_trm, xrnd = round(x)) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE,
quantiles = 2)
```
Using a larger set of narrower bins centered on multiples of 1/2:
```{r, class.source = "fold-hide"}
mutate(n50K_trm, xrnd = round(2 * x) / 2) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE,
quantiles = 2)
```
Sometimes it is useful to use a small number of narrow bins:
```{r, fig.width = 9, fig.height = 4, class.source = "fold-hide"}
rdata <- data.frame(xmin = (-2 : 2) - 0.1,
xmax = (-2 : 2) + 0.1,
ymin = -Inf,
ymax = Inf)
p1 <- p + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax),
data = rdata,
inherit.aes = FALSE,
fill = "red",
alpha = 0.2)
p2 <- mutate(n50K_trm, xrnd = round(x)) %>%
filter(abs(x - xrnd) < 0.1) %>%
ggplot(aes(y, xrnd, group = xrnd)) +
geom_density_ridges(quantile_lines = TRUE, quantiles = 2)
p1 | p2
```
For examining conditional distributions, bins need to be:
* narrow enough to focus on a particular $x$ value;
* wide enough to include enough observations for a useful visualization.
For smaller data sets it can also be useful to allow bins to overlap.
* `ggplot` does not provide support for this.
* `lattice` does support this with _shingles_.
The lattice functions `shingle` and `equal.count` create shingles that
can be used in `lattice`-style faceting.
## Example: Diamond Prices
The `diamonds` data set contains prices and other attributes for
`r scales::comma(nrow(diamonds))` diamonds.
The `cut` variable indicates the _quality_ of a diamond's cut.
You might expect 'better' cuts to cost more, but that is not true on
average:
```{r, class.source = "fold-hide"}
mm <- group_by(diamonds, cut) %>%
summarize(med_price = median(price),
avg_price = mean(price),
n = length(price)) %>%
pivot_longer(2 : 3,
names_to = "which",
values_to = "price")
ggplot(mm, aes(x = price,
y = cut,
color = which)) +
geom_point(, size = 3)
```
The proportion of diamonds at each `cut` level is also perhaps surprising:
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_bar(aes(x = cut),
fill = "deepskyblue3")
```
And the price distributions within each `cut` level differ in shape:
```{r, class.source = "fold-hide"}
ggplot(diamonds, aes(x = price, y = cut)) +
geom_violin() +
geom_point(aes(color = which), data = mm)
```
Another important factor is the size, measured in `carat`.
```{r, class.source = "fold-hide"}
ggplot(diamonds, aes(x = carat, y = cut)) +
geom_violin()
```
Histograms with a narrow bin width may help understand the unusual shapes.
```{r, class.source = "fold-hide"}
ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01)
```
Try faceting on `cut`:
```{r, fig.height = 6.5, class.source = "fold-hide"}
ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01) +
facet_wrap(~ cut, ncol = 1)
```
To focus on the shapes of the conditional distributions of carat given
cut use `y = after_stat(density)`:
```{r, fig.height = 6.5, class.source = "fold-hide"}
ggplot(diamonds) +
geom_histogram(
aes(x = carat,
y = after_stat(density)),
binwidth = 0.01) +
facet_wrap(~ cut, ncol = 1)
```
Alternatively, you can facet with `scales = "free_y"`:
```{r, fig.height = 6.5, class.source = "fold-hide"}
ggplot(diamonds) +
geom_histogram(aes(carat),
binwidth = 0.01) +
facet_wrap(~ cut,
scales = "free_y",
ncol = 1)
```
Larger diamonds have a higher price:
```{r, class.source = "fold-hide"}
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point()
```
There is a lot of over-plotting, so a good opportunity to try some of the
techniques we have learned.
Some explorations:
Reduced point size:
```{r, class.source = "fold-hide"}
p <- ggplot(diamonds,
aes(x = carat, y = price))
p + geom_point(size = 0.1)
```
Reduced point size and alpha level:
```{r, class.source = "fold-hide"}
p + geom_point(size = 0.1, alpha = 0.1)
```
Try log scale for price:
```{r, class.source = "fold-hide"}
p + geom_point(size = 0.1, alpha = 0.1) +
scale_y_log10()
```
Log scale for both:
```{r, class.source = "fold-hide"}
p + geom_point(size = 0.1, alpha = 0.1) +
scale_x_log10() +
scale_y_log10()
```
Add a smooth:
```{r, class.source = "fold-hide"}
p + geom_point(size = 0.1, alpha = 0.1) +
geom_smooth() +
scale_x_log10() +
scale_y_log10()
```
Separate smooths for each cut:
```{r, class.source = "fold-hide"}
p + geom_point(size = 0.1, alpha = 0.1) +
geom_smooth(aes(color = cut)) +
scale_x_log10() +
scale_y_log10()
```
Exploring a sample:
```{r}
d500 <- sample_n(diamonds, 500)
```
Distribution of `cut` in the sample:
```{r, fig.height = 4, class.source = "fold-hide"}
ggplot(d500, aes(x = cut)) +
geom_bar()
```
```{r, class.source = "fold-hide"}
ggplot(d500, aes(x = carat, y = price)) +
geom_point()
```
You can also take a _stratified sample_ stratified on cut using
`group_by`, `sample_frac`, and `ungroup`.
Explorations using facets:
```{r, class.source = "fold-hide"}
p0 <- ggplot(diamonds, aes(x = carat, y = price))
dNoCut <- mutate(diamonds, cut = NULL)
p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut)
p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05)
## p + facet_wrap(~ cut)
p + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10()
```
Facets with a sample:
```{r, class.source = "fold-hide"}
p500 <- p1 +
geom_point(data = d500,
color = "red",
size = 1)
## p500
## p500 + facet_wrap(~ cut)
p500 +
facet_wrap(~ cut) +
scale_x_log10() +
scale_y_log10()
```
Muted data with smooths:
```{r, class.source = "fold-hide"}
p1 +
geom_smooth(aes(color = cut),
se = FALSE) +
scale_x_log10() +
scale_y_log10()
```
Conditioning, `carat` values near 1, 1.5, or 2:
```{r, class.source = "fold-hide"}
drnd <- mutate(diamonds,
crnd = round(2 * carat) / 2)
## Look at a bar chart of count(drnd, crnd)
## Drop higher carat values from density ridges
## map cut to color or fill (need to use group = interaction(crnd, cut))
## try log scale for price
filter(drnd,
crnd <= 2, crnd >= 1,
carat >= crnd, carat <= crnd + 0.1) %>%
ggplot(aes(x = price, y = cut)) +
geom_density_ridges(quantile_lines = TRUE,
quantiles = 2) +
facet_wrap(~crnd, ncol = 1)
```
## Reading
Chapter [_Visualizing associations among two or more quantitative
variables_](https://clauswilke.com/dataviz/visualizing-associations.html)
in [_Fundamentals of Data
Visualization_](https://clauswilke.com/dataviz/).
## Exercises
1. A plot of arrival delay against departure delay for the NYC flight
data shows a lot of over-plotting, even for a 10% sample.
```{r}
library(dplyr)
library(ggplot2)
library(nycflights13)
fl <- filter(flights, dep_delay < 120) %>%
sample_frac(0.1)
p <- ggplot(fl, aes(x = dep_delay, arr_delay)) +
geom_point()
p
```
This masks the fact that three quarters of the flights in this
sample have departure delays of less than 10
minutes. Superimposing density contours is one way to address this
issue.
Which of the following adds four red density contours to the plot?
a. `p + geom_density_2d(color = "red")`
b. `p + geom_density_2d(bins = 4, color = "red")`
c. `p + geom_hex(bins = 5)`
d. `p + geom_density_2d(bins = 5)`
2. The `diamonds` data set is large enough for a scatter plot of
`price` against `carat` to suffer from a significant amount of
over-plotting. With `p` created as
```{r}
library(ggplot2)
p <- ggplot(diamonds, aes(x = carat, y = price))
```
which of the following is the best choice to address the
over-plotting issue?
a. `p + geom_point()`
b. `p + geom_point(size = 0.5)`
c. `p + geom_point(size = 0.1, alpha = 0.1)`
d. `p + geom_point(position = "jitter")`
3. Consider the code
```{r, eval = FALSE}
library(dplyr)
library(ggplot2)
filter(diamonds, carat < 3) %>%
mutate(crnd = round(carat)) %>%
filter(---) %>%
ggplot(aes(x = price, y = crnd, group = crnd)) +
geom_density_ridges()
```
Which replacement for `---` produces a plot showing the
conditional density of `price` given `carat` for `carat` values
near one and the conditional density of `price` given `carat` for
`carat` values near two?
a. `abs(carat - crnd) < 0.1`
b. `carat < 0.1`
c. `abs(crnd) < 1`
d. `carat + crnd < 2`