General Issues

General Comments

1. Choosing Between Faceting and Color

The faceted plot shows each of the seven groups in a sub-plot, or facet, using the same axis scales for all plots.

library(ggplot2)
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~ class, nrow = 2)

The plots are small and there is some over-plotting, which a smaller point size would reduce.
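As a minimal sketch of the smaller point size mentioned above: the value 0.8 and the name p_small are illustrative choices, not taken from the original plots.

```r
library(ggplot2)

# size = 0.8 is an illustrative value; the default point size is larger
p_small <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(size = 0.8) +
    facet_wrap(~ class, nrow = 2)
p_small
```

Smaller points overlap less within each facet, at the cost of making individual points harder to see.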

A single plot that maps class to color benefits from a larger point size to improve discriminability of the colors:

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5)

The number of colors is large, which makes discrimination more difficult, even with the increased point size. But once groups are identified, their relative positions are easier to see in the colored plot as all comparisons are within a common set of scales.

Faceting reduces plot size and thus increases over-plotting for larger data sets. Reducing point size is an option that can be effective if color and shape are not being used as channels. A significant drawback of faceting is that some group comparisons are moved from common scale comparisons to unaligned scale comparisons. Showing a muted image of the complete data in the background can sometimes alleviate this.

Overall, color may have a slight edge in this data set. But it should be kept in mind that color is not effective on all display devices or for all viewers.
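One way to make the colored plot somewhat more robust for viewers with color vision deficiencies is a palette designed for that purpose. The sketch below assumes R 4.0 or later, where palette.colors() provides the Okabe-Ito palette. The choice of palette and the name okabe_ito are assumptions for illustration, not part of the original analysis.

```r
library(ggplot2)

# Okabe-Ito colors from base R (R >= 4.0). unname() is needed so that
# scale_color_manual() assigns the colors by position, not by name.
okabe_ito <- unname(palette.colors(7, palette = "Okabe-Ito"))

ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
    geom_point(size = 2.5) +
    scale_color_manual(values = okabe_ito)
```

With seven classes the colors remain hard to tell apart in crowded regions, so this does not remove the discrimination problem described above.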

In larger data sets color becomes less effective as there will be a considerable amount of over-plotting, given the point size needed to support good color discrimination. Faceting will also suffer from more over-plotting in larger data sets for a given point size, but there is more flexibility to reduce point size. The shape of the data also plays a role, so both approaches are worth considering.

2. Faceting with Muted Full Data

The full data can be added as a background layer in a muted color, such as a light grey:

library(ggplot2)
library(dplyr)
ggplot(mpg, aes(x = displ, y = hwy)) +
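    # dropping class makes the grey layer show all points in every facet
    geom_point(data = mutate(mpg, class = NULL), color = "lightgrey") +
    # each facet's own points are drawn on top in the default color
    geom_point() +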
    facet_wrap(~ class, nrow = 2)

With the full data in the background, group-to-whole comparisons are again on aligned scales. For example, it is easy to see that the 2-seaters are quite different from the other cars. Seeing this in the basic faceted plot shown above is also possible, but it requires some work.
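The background layer works because a layer whose data lacks the faceting variable is drawn in every facet. A minimal sketch that shows this without dplyr, using base R subsetting (the names mpg_all and p_bg are illustrative):

```r
library(ggplot2)

# copy of mpg without the faceting variable
mpg_all <- mpg[, setdiff(names(mpg), "class")]

p_bg <- ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point(data = mpg_all, color = "lightgrey") +
    geom_point() +
    facet_wrap(~ class, nrow = 2)
p_bg

# rows drawn by each layer: the grey layer repeats the full data once
# per facet, the second layer draws each car only in its own facet
nrow(layer_data(p_bg, 1))
nrow(layer_data(p_bg, 2))
```

The first count should be the number of rows in mpg times the number of classes; the second should be the number of rows in mpg.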

3. Gun Murders in US States

# download the data once and reuse the local copy afterwards
if (! file.exists("murders.csv"))
    download.file("https://www.stat.uiowa.edu/~luke/data/murders.csv",
                  "murders.csv")
murders <- read.csv("murders.csv")

The following graph shows a plot of the total number of gun murders against the population of each state and the District of Columbia. Log axes are used as the distributions of both variables are highly skewed. The points are colored to show the region associated with each state.

ggplot(murders, aes(x = population, y = total, color = region)) +
    geom_point(size = 2.5) +
    scale_x_log10() +
    scale_y_log10()

The relationship between the number of murders and the population size appears to be close to linear on the log-log scale. The states in the southern region are mostly towards the top of the set of points: for a given population size the number of murders in southern states appears to be higher than in others.
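Both visual impressions can be checked numerically. The sketch below defines two hypothetical helper functions, loglog_slope and rate_by_region, which are not part of the original analysis. They assume a data frame with columns population, total, and region, as used in the plot, and that every total is positive so the logarithm is finite.

```r
# slope of log10(total) on log10(population); a slope near 1 means the
# number of murders is roughly proportional to population
loglog_slope <- function(d) {
    fit <- lm(log10(total) ~ log10(population), data = d)
    unname(coef(fit)[2])
}

# murders per 100,000 residents within each region, highest rate first
rate_by_region <- function(d) {
    sums <- aggregate(cbind(total, population) ~ region, data = d, FUN = sum)
    sums$rate <- 1e5 * sums$total / sums$population
    sums[order(-sums$rate), ]
}
```

With the data read above, these would be called as loglog_slope(murders) and rate_by_region(murders). A regional rate aggregates over states, so it complements the state-level view in the plot rather than replacing it.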

4. Comparing Some Visualizations

All three plots clearly show that the 5 cylinder group is the smallest. Distinguishing the sizes of the other groups is more challenging.

Plot B uses aligned scales. It is easy to see the ordering, even though the values for 8, 6, and 4 cylinders are quite close.

Plot A relies on length comparisons; it seems possible to recognize that the 8 cylinder group is the smallest among the 4, 6, and 8 cylinder groups, but determining which of the 4 and 6 cylinder groups is smaller is very hard.

Plot C relies on area comparisons. The sizes of the 4, 6, and 8 cylinder groups are very hard to distinguish.

For comparing the group sizes in this data set, Plot B is best, followed by Plot A, and then Plot C.
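The ordering read from the plots can be checked against the actual counts. This is a minimal sketch, assuming the three plots show the number of cars for each value of cyl in the mpg data; the name cyl_counts is illustrative.

```r
library(ggplot2)

# number of cars in each cylinder group, smallest group first
cyl_counts <- sort(table(mpg$cyl))
cyl_counts
```

A table like this gives the exact values, which explains why the three larger groups are hard to rank in plots that rely on length or area comparisons.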

