--- title: "Scatter Plot Scalability and Enhancements" output: html_document: toc: yes code_folding: show code_download: true --- ```{r setup, include = FALSE, message = FALSE} source(here::here("setup.R")) knitr::opts_chunk$set(collapse = TRUE, message = FALSE, fig.height = 5, fig.width = 6, fig.align = "center") set.seed(12345) library(dplyr) library(tidyr) library(ggplot2) library(lattice) library(gridExtra) library(patchwork) source(here::here("datasets.R")) theme_set(theme_minimal() + theme(text = element_text(size = 16), panel.border = element_rect(color = "grey30", fill = NA))) ``` ## Scalability Scatter plots work well for hundreds of observations Over-plotting becomes an issue once the number of observations gets into tens of thousands. For some output formats storage also becomes an issue as the number of points plotted increases. Some simulated data: ```{r, class.source = "fold-hide"} n <- 50000 ## n50K <- data.frame(x = rnorm(n), y = rnorm(n)) x <- rnorm(n) y <- x + 0.4 * x ^ 2 + rnorm(n) y <- y / sd(y) n50K <- data.frame(x = x, y = y) n10K <- n50K[1 : 10000, ] n1K <- n50K[1 : 1000, ] p1 <- ggplot(n1K, aes(x, y)) + geom_point() + coord_equal() + ggtitle(sprintf("%d Points", nrow(n1K))) p2 <- ggplot(n10K, aes(x, y)) + geom_point() + coord_equal() + ggtitle(sprintf("%d Points", nrow(n10K))) library(patchwork) p1 | p2 ``` ## Some Simple Options Simple options to address over-plotting: * sampling * reducing the point size * alpha blending Reducing the point size helps when the number of points is in the low tens of thousands: ```{r, class.source = "fold-hide"} ggplot(n10K, aes(x, y)) + geom_point(size = 0.1) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n10K))) ``` Alpha blending can also be effective, on its own or in combination with point size adjustment: ```{r, class.source = "fold-hide"} ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.01, size = 0.5) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n50K))) ``` Experimentation is usually needed to identify a good point size and alpha level. Both alpha blending and point size reduction inhibit the use of color for encoding a grouping variable. The best choices may vary from one output format to another. ## Density Estimation Methods Some methods based on density estimation or binning: * Displaying contours of a 2D density estimate. * Encoding density estimates in point size. * Hexagonal binning. ### Density Contours A 2D density estimate can be displayed in terms of its _contours_, or _level curves_. ```{r, fig.height = 4.75, class.source = "fold-hide"} p <- ggplot(n50K, aes(x, y)) + coord_equal() + ggtitle(sprintf("%d Points", nrow(n50K))) pp <- geom_point(alpha = 0.01, size = 0.5) dd <- geom_density_2d(color = "red") p + dd ``` These are the contours of the estimated density surface: ```{r, fig.height = 4.75, class.source = "fold-hide"} d <- MASS::kde2d(n50K$x, n50K$y, n = 50) v <- expand.grid(x = d$x, y = d$y) v$z <- as.numeric(d$z) lattice::wireframe(z ~ x + y, data = v, screen = list(z = 70, x = -60)) ``` 2D density estimate contours can be superimposed on a set of points or placed beneath a set of points: ```{r, fig.width = 8, fig.height = 6, class.source = "fold-hide"} p1 <- p + list(pp, dd) p2 <- p + list(dd, pp) p1 | p2 ``` ### Density Levels Encoded with Point Size Density levels can also be encoded in point size in a grid of points: ```{r, class.source = "fold-hide"} p + stat_density_2d( aes(size = after_stat(density)), geom = "point", n = 30, contour = FALSE) + scale_size(range = c(0, 6)) ``` This scales well computationally It does not easily support encoding a grouping with color or shape. It introduces some distracting visual artifacts that are related to some optical illusions [seen earlier](percep.html#some-optical-illusions). This effect can be reduced somewhat by jittering: ```{r, class.source = "fold-hide"} jit_amt <- 0.03 p + stat_density_2d( aes(size = after_stat(density)), geom = "point", position = position_jitter(jit_amt, jit_amt), n = 30, contour = FALSE) + scale_size(range = c(0, 6)) ``` ### Hexagonal Binning Hexagonal binning divides the plane into hexagonal bins and displays the number of points in each bin: ```{r, class.source = "fold-hide"} p + geom_hex() ``` This also scales very well to larger data sets. The default color scheme seems less than ideal. An alternative fill color choice: ```{r, class.source = "fold-hide"} p + geom_hex() + scale_fill_gradient(low = "gray", high = "black") ``` Again it is possible to use a scaled point representation: ```{r, class.source = "fold-hide"} p + stat_bin_hex( geom = "point", aes(size = after_stat(density)), fill = NA) ``` This representation still produces some visual distractions, but less than a rectangular grid. Again, jittering can help: ```{r, class.source = "fold-hide"} jit_amt <- 0.05 p + stat_bin_hex( aes(size = after_stat(density)), geom = "point", position = position_jitter(jit_amt, jit_amt), fill = NA) ``` ### Other Density Plots The `hdrcde` package computes and plots density contours containing specified proportions of the data. The `hdr.boxplot.2d` function plots these contours and shows the points not in the outermost contour: ```{r, class.source = "fold-hide"} library(hdrcde) with(n50K, hdr.boxplot.2d(x, y, prob = c(0.1, 0.5, 0.75, 0.9))) ``` ## Some Enhancements ### Encoding Additional Variables Scatter plots can encode information about other variables using * symbol color * symbol size * symbol shape An example using the `mpg` data set: ```{r, class.source = "fold-hide"} p <- ggplot(mpg, aes(cty, hwy, color = factor(cyl))) p + geom_point(aes(size = drv)) ``` Some encodings work better than others: * Size is not a good fit for a discrete variable * Even though `cyl` is numeric, it is best encoded as categorical. * Size and color interfere with each other as the color of smaller objects is harder to perceive. * Similarly, size and shape interfere with each other. Using `shape` instead of `size` for `drv`: ```{r, class.source = "fold-hide"} p + geom_point(aes(shape = drv)) ``` Increasing the size makes shapes and the colors easier to distinguish: ```{r} p + geom_point(aes(shape = drv), size = 3) ``` ### Marginal Plots The `ggMargin` function in the `ggExtra` package attaches marginal histograms to (some) plots produced by `ggplot`: ```{r, class.source = "fold-hide"} library(ggExtra) p <- ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.01, size = 0.5) ggMarginal(p, type = "histogram", bins = 50, fill = "lightgrey") ``` The default type is `"density"` for a marginal density plot: ```{r, class.source = "fold-hide"} ggMarginal(p, fill = "lightgrey") ``` For data sets of more modest size rug plots along the axes can be useful: ```{r, class.source = "fold-hide"} ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_rug(alpha = 0.2) ``` ### Adding a Smooth Curve When the variables on the $y$ axis is a response variable a useful enhancement is to add a smooth curve to a plot. The default method is a form of local averaging, and includes a representation of uncertainty: ```{r, fig.width = 9, fig.height = 4.25, class.source = "fold-hide"} p1 <- ggplot(faithful, aes(eruptions, waiting)) + geom_point() + geom_smooth() + ggtitle("Old Faithful Eruptions") p2 <- ggplot(n50K, aes(x, y)) + pp + geom_smooth() + ggtitle(sprintf("%d Points", nrow(n50K))) p1 | p2 ``` ### Axis Transformation Variable transformations are often helpful. Variable transformations can be applied by * plotting the transformed data * plotting on transformed axes Axis transformations `ggplot` supports: * `log10` with `scale_x_log10`, scale_y_log10 * square root with `scale_x_sqrt`, `scale_y_sqrt` * reversed axes with `scale_x_reverse`, `scale_y_reverse` Changing scales can make plot shapes simpler, but non-linear axes are harder to interpret. ```{r, fig.width = 9, class.source = "fold-hide"} library(gapminder) gap <- filter(gapminder, year == 2007) p <- ggplot(gap, aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) + geom_point() + scale_size_area(max_size = 8) p1 <- p + guides(color = "none", size = "none") p2 <- p + scale_x_log10() + guides(size = "none") p1 + p2 ``` ## Conditioning When the $y$ variable is a response it can also be useful to explore the conditional distribution of $y$ given different values of the $x$ variable. One way to do this is to focus on the data for narrow ranges of the conditioning variable. Two approaches for creating groups to focus on: * Use the `base` function `cut`, or one of the `gglot2` functions * `cut_interval` * `cut_number` * `cut_width` * Use the `round` function. Rounding to an integer or using ```{r, eval = FALSE} cut_width(x, width = 1, center = 0) ``` produces these bins: ```{r, class.source = "fold-hide"} p <- ggplot(n50K, aes(x, y)) + geom_point(alpha = 0.05, size = 0.5) p + geom_vline(xintercept = seq(-3.5, 3.5, by = 1), linetype = 2, color = "red") ``` Data within each group can then be shown using box plots (you need to use the `group` aesthetic to plot on a continuous axis): ```{r} n50K_trm <- filter(n50K, x > -2.5, x < 2.5) mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(x, y)) + geom_boxplot(aes(group = xrnd)) ``` Using `cut_width` and violin plots: ```{r, class.source = "fold-hide"} mutate(n50K_trm, xcut = cut_width(x, width = 1, center = 0)) %>% group_by(xcut) %>% mutate(xmed = median(x)) %>% ungroup() %>% ggplot(aes(x, y)) + geom_violin(aes(group = xmed), draw_quantiles = 0.5) ``` Another option for visualizing the conditional distributions is faceted density plots: ```{r, class.source = "fold-hide"} mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(x)) + geom_density(aes(y)) + facet_wrap(~ xrnd, ncol = 1) ``` Density ridges work as well: ```{r, class.source = "fold-hide"} #| warning: false library(ggridges) mutate(n50K_trm, xrnd = round(x)) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) ``` Using a larger set of narrower bins centered on multiples of 1/2: ```{r, class.source = "fold-hide"} mutate(n50K_trm, xrnd = round(2 * x) / 2) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) ``` Sometimes it is useful to use a small number of narrow bins: ```{r, fig.width = 9, fig.height = 4, class.source = "fold-hide"} rdata <- data.frame(xmin = (-2 : 2) - 0.1, xmax = (-2 : 2) + 0.1, ymin = -Inf, ymax = Inf) p1 <- p + geom_rect(aes(xmin = xmin, xmax = xmax, ymin = ymin, ymax = ymax), data = rdata, inherit.aes = FALSE, fill = "red", alpha = 0.2) p2 <- mutate(n50K_trm, xrnd = round(x)) %>% filter(abs(x - xrnd) < 0.1) %>% ggplot(aes(y, xrnd, group = xrnd)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) p1 | p2 ``` For examining conditional distributions, bins need to be: * narrow enough to focus on a particular $x$ value; * wide enough to include enough observations for a useful visualization. For smaller data sets it can also be useful to allow bins to overlap. * `ggplot` does not provide support for this. * `lattice` does support this with _shingles_. The lattice functions `shingle` and `equal.count` create shingles that can be used in `lattice`-style faceting. ## Example: Diamond Prices The `diamonds` data set contains prices and other attributes for `r scales::comma(nrow(diamonds))` diamonds. The `cut` variable indicates the _quality_ of a diamond's cut. You might expect 'better' cuts to cost more, but that is not true on average: ```{r, class.source = "fold-hide"} mm <- group_by(diamonds, cut) %>% summarize(med_price = median(price), avg_price = mean(price), n = length(price)) %>% pivot_longer(2 : 3, names_to = "which", values_to = "price") ggplot(mm, aes(x = price, y = cut, color = which)) + geom_point(, size = 3) ``` The proportion of diamonds at each `cut` level is also perhaps surprising: ```{r, class.source = "fold-hide"} ggplot(diamonds) + geom_bar(aes(x = cut), fill = "deepskyblue3") ``` And the price distributions within each `cut` level differ in shape: ```{r, class.source = "fold-hide"} ggplot(diamonds, aes(x = price, y = cut)) + geom_violin() + geom_point(aes(color = which), data = mm) ``` Another important factor is the size, measured in `carat`. ```{r, class.source = "fold-hide"} ggplot(diamonds, aes(x = carat, y = cut)) + geom_violin() ``` Histograms with a narrow bin width may help understand the unusual shapes. ```{r, class.source = "fold-hide"} ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) ``` Try faceting on `cut`: ```{r, fig.height = 6.5, class.source = "fold-hide"} ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) + facet_wrap(~ cut, ncol = 1) ``` To focus on the shapes of the conditional distributions of carat given cut use `y = after_stat(density)`: ```{r, fig.height = 6.5, class.source = "fold-hide"} ggplot(diamonds) + geom_histogram( aes(x = carat, y = after_stat(density)), binwidth = 0.01) + facet_wrap(~ cut, ncol = 1) ``` Alternatively, you can facet with `scales = "free_y"`: ```{r, fig.height = 6.5, class.source = "fold-hide"} ggplot(diamonds) + geom_histogram(aes(carat), binwidth = 0.01) + facet_wrap(~ cut, scales = "free_y", ncol = 1) ``` Larger diamonds have a higher price: ```{r, class.source = "fold-hide"} ggplot(diamonds, aes(x = carat, y = price)) + geom_point() ``` There is a lot of over-plotting, so a good opportunity to try some of the techniques we have learned. Some explorations: Reduced point size: ```{r, class.source = "fold-hide"} p <- ggplot(diamonds, aes(x = carat, y = price)) p + geom_point(size = 0.1) ``` Reduced point size and alpha level: ```{r, class.source = "fold-hide"} p + geom_point(size = 0.1, alpha = 0.1) ``` Try log scale for price: ```{r, class.source = "fold-hide"} p + geom_point(size = 0.1, alpha = 0.1) + scale_y_log10() ``` Log scale for both: ```{r, class.source = "fold-hide"} p + geom_point(size = 0.1, alpha = 0.1) + scale_x_log10() + scale_y_log10() ``` Add a smooth: ```{r, class.source = "fold-hide"} p + geom_point(size = 0.1, alpha = 0.1) + geom_smooth() + scale_x_log10() + scale_y_log10() ``` Separate smooths for each cut: ```{r, class.source = "fold-hide"} p + geom_point(size = 0.1, alpha = 0.1) + geom_smooth(aes(color = cut)) + scale_x_log10() + scale_y_log10() ``` Exploring a sample: ```{r} d500 <- sample_n(diamonds, 500) ``` Distribution of `cut` in the sample: ```{r, fig.height = 4, class.source = "fold-hide"} ggplot(d500, aes(x = cut)) + geom_bar() ``` ```{r, class.source = "fold-hide"} ggplot(d500, aes(x = carat, y = price)) + geom_point() ``` You can also take a _stratified sample_ stratified on cut using `group_by`, `sample_frac`, and `ungroup`. Explorations using facets: ```{r, class.source = "fold-hide"} p0 <- ggplot(diamonds, aes(x = carat, y = price)) dNoCut <- mutate(diamonds, cut = NULL) p1 <- p0 + geom_point(alpha = 0.01, color = "grey", data = dNoCut) p <- p1 + geom_point(color = "red", size = 1, alpha = 0.05) ## p + facet_wrap(~ cut) p + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10() ``` Facets with a sample: ```{r, class.source = "fold-hide"} p500 <- p1 + geom_point(data = d500, color = "red", size = 1) ## p500 ## p500 + facet_wrap(~ cut) p500 + facet_wrap(~ cut) + scale_x_log10() + scale_y_log10() ``` Muted data with smooths: ```{r, class.source = "fold-hide"} p1 + geom_smooth(aes(color = cut), se = FALSE) + scale_x_log10() + scale_y_log10() ``` Conditioning, `carat` values near 1, 1.5, or 2: ```{r, class.source = "fold-hide"} drnd <- mutate(diamonds, crnd = round(2 * carat) / 2) ## Look at a bar chart of count(drnd, crnd) ## Drop higher carat values from density ridges ## map cut to color or fill (need to use group = interaction(crnd, cut)) ## try log scale for price filter(drnd, crnd <= 2, crnd >= 1, carat >= crnd, carat <= crnd + 0.1) %>% ggplot(aes(x = price, y = cut)) + geom_density_ridges(quantile_lines = TRUE, quantiles = 2) + facet_wrap(~crnd, ncol = 1) ``` ## Reading Chapter [_Visualizing associations among two or more quantitative variables_](https://clauswilke.com/dataviz/visualizing-associations.html) in [_Fundamentals of Data Visualization_](https://clauswilke.com/dataviz/). ## Exercises 1. A plot of arrival delay against departure delay for the NYC flight data shows a lot of over-plotting, even for a 10% sample. ```{r} library(dplyr) library(ggplot2) library(nycflights13) fl <- filter(flights, dep_delay < 120) %>% sample_frac(0.1) p <- ggplot(fl, aes(x = dep_delay, arr_delay)) + geom_point() p ``` This masks the fact that three quarters of the flights in this sample have departure delays of less than 10 minutes. Superimposing density contours is one way to address this issue. Which of the following adds four red density contours to the plot? a. `p + geom_density_2d(color = "red")` b. `p + geom_density_2d(bins = 4, color = "red")` c. `p + geom_hex(bins = 5)` d. `p + geom_density_2d(bins = 5)` 2. The `diamonds` data set is large enough for a scatter plot of `price` against `carat` to suffer from a significant amount of over-plotting. With `p` created as ```{r} library(ggplot2) p <- ggplot(diamonds, aes(x = carat, y = price)) ``` which of the following is the best choice to address the over-plotting issue? a. `p + geom_point()` b. `p + geom_point(size = 0.5)` c. `p + geom_point(size = 0.1, alpha = 0.1)` d. `p + geom_point(position = "jitter")` 3. Consider the code ```{r, eval = FALSE} library(dplyr) library(ggplot2) filter(diamonds, carat < 3) %>% mutate(crnd = round(carat)) %>% filter(---) %>% ggplot(aes(x = price, y = crnd, group = crnd)) + geom_density_ridges() ``` Which replacement for `---` produces a plot showing the conditional density of `price` given `carat` for `carat` values near one and the conditional density of `price` given `carat` for `carat` values near two? a. `abs(carat - crnd) < 0.1` b. `carat < 0.1` c. `abs(crnd) < 1` d. `carat + crnd < 2`