Some things to keep an eye out for when looking at data on a numeric variable:
skewness, multimodality
gaps, outliers
rounding, e.g. to integer values, or heaping, i.e. a few particular values occur very frequently
impossible or suspicious values
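Rounding and heaping can also be checked numerically before plotting. The following is a minimal sketch on simulated data; the vector x and its heaped values 40, 50, and 60 are made up for illustration and are not part of the data sets used in these notes.

```r
# Simulated data (hypothetical): continuous values plus heaping at 40, 50, 60
set.seed(1)
x <- c(rnorm(200, mean = 50, sd = 10), rep(c(40, 50, 60), each = 30))

# Most frequent values; a few values with very large counts suggest heaping
counts <- sort(table(x), decreasing = TRUE)
head(counts, 5)

# Proportion of whole-number values; a proportion near 1 suggests rounding
prop_integer <- mean(x == round(x))
prop_integer
```

With real data, x would be replaced by the numeric variable of interest.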
A variant of the dot plot is known as a strip plot. A strip plot for the city temperature data is
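library(lattice)    # stripplot
library(ggplot2)    # ggplot, geom_point
library(gridExtra)  # grid.arrange
p1 <- stripplot(~ temp, data = citytemps)
p2 <- ggplot(citytemps) + geom_point(aes(x = temp, y = "All"))
grid.arrange(p1, p2, nrow = 1)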
One way to reduce the vertical space is to use the chunk option fig.height = 2, which produces
The strip plot can reveal gaps and outliers.
After looking at the plot we might want to examine the high and low values:
filter(citytemps, temp > 85)
## city temp
## 1 Asuncion 95
## 2 Caracas 90
## 3 Dar es Salaam 86
## 4 Kinshasa 86
## 5 Lagos 91
## 6 Managua 88
## 7 Rio de Janeiro 88
## 8 São Paulo 90
filter(citytemps, temp < 10)
## city temp
## 1 Anadyr 3
## 2 Calgary 1
## 3 Edmonton -11
## 4 Kyiv 9
## 5 Minsk 0
## 6 Moscow 3
## 7 Winnipeg 8
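When the cutoffs are not known in advance, the most extreme rows can be requested directly. This sketch assumes dplyr 1.0.0 or later, which provides slice_min and slice_max; the data frame temps is a hypothetical stand-in built from a few rows of the output above, not the full citytemps data.

```r
library(dplyr)

# Hypothetical stand-in for citytemps, using a few rows from the output above
temps <- data.frame(
    city = c("Asuncion", "Lagos", "Calgary", "Edmonton", "Minsk"),
    temp = c(95, 91, 1, -11, 0)
)

# The two coldest and the two warmest cities
coldest <- slice_min(temps, temp, n = 2)
warmest <- slice_max(temps, temp, n = 2)
coldest
warmest
```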
The strip plot is most useful for showing subsets corresponding to a categorical variable.
A strip plot for the yields for different varieties in the barley data is
ggplot(barley) + geom_point(aes(x = yield, y = variety))
Scalability in this form is limited due to over-plotting.
A simple strip plot of price within the different cut levels in the diamonds data is not very helpful:
ggplot(diamonds) + geom_point(aes(x = price, y = cut))
Several approaches are available to reduce the impact of over-plotting:
reducing the point size;
randomly displacing the points, called jittering;
making the points translucent, called alpha blending.
Combining all three produces
ggplot(diamonds) +
geom_point(aes(x = price, y = cut),
size = 0.2, position = "jitter", alpha = 0.2)
Skewness of the price distributions can be seen in this plot, though other approaches will show this more clearly.
A peculiar feature revealed by this plot is the gap below 2000. Examining the subset with price < 2000 shows the gap is roughly symmetric around 1500:
ggplot(filter(diamonds, price < 2000)) +
geom_point(aes(x = price, y = cut),
size = 0.2, position = "jitter", alpha = 0.2)
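The location of the gap can also be checked numerically. One possible sketch, assuming the largest jump between consecutive distinct prices below 2000 corresponds to the gap seen in the plot:

```r
library(ggplot2)  # provides the diamonds data

# Distinct prices below 2000, in increasing order
prices <- sort(unique(diamonds$price[diamonds$price < 2000]))

# The largest jump between consecutive distinct prices marks the gap
i <- which.max(diff(prices))
gap <- c(lower = prices[i], upper = prices[i + 1])
gap
```

The two values returned are the last price observed before the gap and the first price observed after it.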
With a good combination of point size, jittering, and alpha blending, the strip plot for groups of data can scale to several hundred thousand observations and ten to twenty groups.
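The amount of jittering can be set explicitly. With position = "jitter" the default amounts are used in both directions; the sketch below uses position_jitter to displace points only vertically, so the plotted prices are not perturbed. The height, size, alpha, and seed values are illustrative choices, not tuned recommendations.

```r
library(ggplot2)

# Jitter vertically only, within each cut level; the seed makes the
# random displacement reproducible
p <- ggplot(diamonds) +
    geom_point(aes(x = price, y = cut),
               size = 0.2, alpha = 0.2,
               position = position_jitter(width = 0, height = 0.3, seed = 123))
p
```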
Strip plots can reveal gaps, outliers, and data outside of the expected range.
Skewness and multimodality can be seen, but other visualizations show these more clearly.
Storage needed for vector graphics images grows linearly with the number of observations.
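One way to see the storage issue is to save the same kind of plot at different sizes and in different formats and compare the file sizes. This is a rough sketch: plot_size is a hypothetical helper name, the row counts and figure dimensions are arbitrary, and the sizes obtained will depend on the graphics devices available.

```r
library(ggplot2)

# Hypothetical helper: write a jittered strip plot of the first n rows of
# diamonds to a temporary file and return the file size in bytes
plot_size <- function(n, ext) {
    p <- ggplot(diamonds[seq_len(n), ]) +
        geom_point(aes(x = price, y = cut),
                   size = 0.2, alpha = 0.2, position = "jitter")
    f <- tempfile(fileext = ext)
    ggsave(f, p, width = 6, height = 3)  # device is chosen from the extension
    size <- file.size(f)
    unlink(f)
    size
}

pdf_small <- plot_size(5000, ".pdf")   # vector format, fewer points
pdf_large <- plot_size(50000, ".pdf")  # vector format, more points
png_large <- plot_size(50000, ".png")  # raster format, more points
c(pdf_small = pdf_small, pdf_large = pdf_large, png_large = png_large)
```

A raster format such as PNG stores pixels rather than individual points, so it can be a reasonable choice when the number of observations is large.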
Base graphics provides stripchart:
stripchart(yield ~ variety, data = barley)
Lattice provides stripplot:
stripplot(variety ~ yield, data = barley)
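Both functions have options for reducing over-plotting. As a sketch, stripchart accepts method = "jitter" and lattice's stripplot accepts jitter.data = TRUE; the jitter amount, plotting symbol, and alpha level below are illustrative choices.

```r
library(lattice)  # provides stripplot and the barley data

# Base graphics: jitter the points within each variety
stripchart(yield ~ variety, data = barley,
           method = "jitter", jitter = 0.2, pch = 16)

# Lattice: jitter the points and make them translucent
p <- stripplot(variety ~ yield, data = barley,
               jitter.data = TRUE, alpha = 0.5, pch = 16)
p
```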