Some Features to Look For

Some things to keep an eye out for when looking at data on a numeric variable:

Strip Plots

Basics

A variant of the dot plot is known as a strip plot. A strip plot for the city temperature data is

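library(lattice)    # stripplot
library(ggplot2)    # ggplot, geom_point
library(gridExtra)  # grid.arrange
p1 <- stripplot(~ temp, data = citytemps)
p2 <- ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
grid.arrange(p1, p2, nrow = 1)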

A strip plot of a single sample needs little vertical space. One way to reduce the height of the figure is to use the chunk option fig.height = 2.
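In an R Markdown document this option is usually written in the header of the chunk that draws the plot. As a minimal sketch, assuming the document is rendered with knitr, the same height can also be made the default for all later chunks from a setup chunk:

```r
# Assumes rendering with knitr; the setting applies to the chunks that follow.
knitr::opts_chunk$set(fig.height = 2)
```

Setting the option on the individual chunk is usually preferable when only the strip plot should be short.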

The strip plot can reveal gaps and outliers.

After looking at the plot we might want to examine the high and low values:

library(dplyr)  # filter; without dplyr attached, filter refers to stats::filter
filter(citytemps, temp > 85)
##             city temp
## 1       Asuncion   95
## 2        Caracas   90
## 3  Dar es Salaam   86
## 4       Kinshasa   86
## 5          Lagos   91
## 6        Managua   88
## 7 Rio de Janeiro   88
## 8      São Paulo   90
filter(citytemps, temp < 10)
##       city temp
## 1   Anadyr    3
## 2  Calgary    1
## 3 Edmonton  -11
## 4     Kyiv    9
## 5    Minsk    0
## 6   Moscow    3
## 7 Winnipeg    8
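
The same kind of check can be done without dplyr. The sketch below uses a small excerpt rebuilt from a few of the rows printed above (the data frame name temps is hypothetical, and it is not the full citytemps data) to show base R subsetting and sorting by temperature:

```r
# Excerpt rebuilt from rows printed above; NOT the full citytemps data.
temps <- data.frame(
    city = c("Asuncion", "Lagos", "Kyiv", "Edmonton", "Minsk"),
    temp = c(95, 91, 9, -11, 0)
)

# Base R equivalent of filter(temps, temp > 85)
high <- temps[temps$temp > 85, ]
high

# Ordering by temperature shows both tails at once
temps[order(temps$temp), ]

# Smallest and largest values
range(temps$temp)
```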

Multiple Samples

The strip plot is most useful for showing subsets corresponding to a categorical variable.

A strip plot for the yields for different varieties in the barley data is

ggplot(barley) + geom_point(aes(x = yield, y = variety))

Scalability

In this basic form the strip plot does not scale well to large data sets because of over-plotting.

A simple strip plot of price within the different cut levels in the diamonds data is not very helpful:

ggplot(diamonds) + geom_point(aes(x = price, y = cut))

Several approaches are available to reduce the impact of over-plotting:

  • reducing the point size;

  • displacing the points randomly, called jittering;

  • making the points translucent, called alpha blending.

Combining all three produces

ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)

Skewness of the price distributions can be seen in this plot, though other approaches will show this more clearly.
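
With position = "jitter", ggplot2 may displace points in both directions. Since price is the variable of interest, it can be preferable to jitter only along the cut axis. The following is a sketch of the combined plot with that change; the jitter height of 0.3 and the seed are illustrative assumptions, not values from the notes:

```r
library(ggplot2)  # provides ggplot, position_jitter, and the diamonds data

# width = 0 leaves price untouched; height = 0.3 and seed = 1 are assumed values.
p <- ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, alpha = 0.2,
               position = position_jitter(width = 0, height = 0.3, seed = 1))
p
```

Fixing the seed makes the random displacement reproducible, so the plot looks the same each time the document is rendered.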

A peculiar feature revealed by this plot is the gap below 2000. Examining the subset with price < 2000 shows the gap is roughly symmetric around 1500:

ggplot(filter(diamonds, price < 2000)) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, position = "jitter", alpha = 0.2)
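
The gap can also be checked numerically. This sketch counts prices below 2000 in bins of width 50 (the bin width is an arbitrary choice) and lists the distinct prices near 1500, so the edges of the gap can be read off directly:

```r
library(ggplot2)  # provides the diamonds data

low <- diamonds$price[diamonds$price < 2000]

# Counts per bin of width 50; an empty or nearly empty bin marks the gap.
bins <- table(cut(low, breaks = seq(0, 2000, by = 50), right = FALSE))
bins

# Distinct prices on either side of 1500
sort(unique(low[low > 1400 & low < 1600]))
```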

Some Notes

  • With a good combination of point size, jittering, and alpha blending, the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty groups.

  • Strip plots can reveal gaps, outliers, and data outside of the expected range.

  • Skewness and multi-modality can be seen, but other visualizations show these more clearly.

  • Storage needed for vector graphics images grows linearly with the number of observations.
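
The storage point in the last note can be explored with a small experiment. This sketch writes scatter plots of random points to temporary PDF files and reports the file sizes; the helper name pdf_size and the sample sizes are hypothetical choices made for illustration:

```r
# Write a plot of n random points to a temporary PDF and return its size in bytes.
pdf_size <- function(n) {
    file <- tempfile(fileext = ".pdf")
    on.exit(unlink(file))            # remove the temporary file afterwards
    pdf(file)
    plot(runif(n), runif(n), pch = 16, cex = 0.2)
    dev.off()
    file.size(file)
}

sizes <- sapply(c(1000, 10000, 50000), pdf_size)
sizes
```

A raster format such as PNG stores pixels rather than individual points, so its file size depends mainly on the image dimensions; this is one reason raster output is often used for plots with very many points.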

Base graphics provides stripchart:

stripchart(yield ~ variety, data = barley)

Lattice provides stripplot:

stripplot(variety ~ yield, data = barley)
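
Both functions support jittering, and translucent colors can be used for alpha blending. The following sketch applies the three over-plotting remedies to the barley plots; the point size and alpha values are illustrative assumptions:

```r
library(lattice)  # provides stripplot and the barley data

# Base graphics: method = "jitter" displaces the points;
# a translucent color from adjustcolor gives alpha blending.
stripchart(yield ~ variety, data = barley, method = "jitter",
           pch = 16, cex = 0.5, col = adjustcolor("black", alpha.f = 0.3))

# Lattice: jitter.data = TRUE jitters the points within each variety.
p <- stripplot(variety ~ yield, data = barley, jitter.data = TRUE,
               pch = 16, cex = 0.5, alpha = 0.3)
print(p)
```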