The gghdr
imports the hdrcde
package and provides tools for plotting highest density regions with ggplot2
.
The following vignette assumes familiarity with the grammar of graphics as implemented in ggplot2.
Let’s begin by using ggplot2 to create a boxplot of the eruptions
variable of the faithful
dataset, built into the hdrcde
library.
In a standard boxplot:
library(ggplot2)
theme_set(theme_minimal())
ggplot(data = faithful, aes(y = eruptions)) +
geom_boxplot()
If we look more closely at our data, however, we can see that it’s actually multimodal, but our choice of visualisation technique limited our ability to see this:
ggplot(data = faithful, aes(x = eruptions)) + geom_density()
The hdrcde package
is an R package that allows us to address this. It provides a way of calculating and visualising so-called high-density regions (HDRs), which can intuitively be understood as boxplot-like plots, with darker boxes showing regions of higher data density. For example, a light rectangle might be used to show the region in which over 99% of the data is contained, whereas a smaller, darker rectangle would highlight a region within that which contains the most closely co-localised 50% of the dataset.
Formally, HDRs are defined in the following manner (Hyndman 1996 American Statistician, 50, 120-126):
If \(f(x)\) is the density function of a random variable X, then the \(100 (1 -\alpha)\) HDR is the subset R(\(f_\alpha\)) of the sample space of X such that \(R(f_\alpha) = \{x: f(x) \geq f_\alpha\}\), where \(f_\alpha\) is the largest constant such that Pr(X \(\in R(f_\alpha)) \geq 1-\alpha\))
The hdrcde package
provides tools for computing these highest density regions in one and two dimensions, kernel estimates of univariate density functions conditional on one covariate, and multimodal regression.
We can use this package to create a density plot of the eruptions
variable.
library(hdrcde)
hdr_info <- hdr.den(
faithful$eruptions,
col = c("skyblue", "slateblue2", "slateblue4")
)
This can be visualised more compactly using the hdr.boxplot
function, in which:
hdr.boxplot(faithful$eruptions,
prob = c(99, 95, 50),
col = c("skyblue", "slateblue2", "slateblue4"))
hdr.boxplot()
uses base plotting tools to create the visualisation, limiting users’ ability to customise and combine it with other ggplot2
visualisations.
geom_hdr_boxplot()
We can visualise the same dataset using geom_hdr_boxplot()
. By default, it will plot respectively darker boxes for 99%, 95% or 50% of the data.
library(gghdr)
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot()
We can specify the base colour we’d like to use.
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot(fill = c("blue"))
We can also specify which set of probability coverages we’d like to display HDRs for. For example, we can visualise the 25%, 50%, 75%, 95%, and 99% HDR of the faithful eruptions.
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot(prob = c(0.25, 0.5, 0.75, 0.95, 0.99),
fill = c("blue"))
We can also specify the probabilities as percentages, i.e. out of 100.
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot(prob = c(25, 50, 75, 95, 99),
fill = c("blue"))
#> Warning: Probability values should be on a scale between 0 to 1. If not, values
#> will be converted to decimal values.
This will produce a warning, but achieve the same result. Note that mixing the two will result in all of your probabilities \(\leq1\) being divided by 100.
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot(prob = c(2.55, 50, 75, 95, 99),
fill = c("blue"))
#> Warning: Probability values should be on a scale between 0 to 1. If not, values
#> will be converted to decimal values.
geom_hdr_boxplot()
with geom_jitter()
to visualise both the raw data and HDRs
We can combine all sorts of ggplot2 geometries with geom_hdr_boxplot()
. For example, we can use geom_jitter()
to showcase the underlying data.
ggplot(data = faithful,
aes(y = eruptions)) +
geom_hdr_boxplot(fill = c("blue")) +
geom_jitter(aes(x = 0))
geom_hdr_boxplot()
with geom_rug()
to visualise the HDRs and distribution of the data
We can use geom_rug()
to show this distribution in the margin along the y axis.
ggplot(data = faithful, aes(y = eruptions)) +
geom_hdr_boxplot(fill = c("blue")) +
geom_rug()
geom_hdr_rug()
Inspired by this feature of geom_rug()
, we provide geom_hdr_rug()
, which enables us to plot HDRs in the margins of other plots.
ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point() +
geom_hdr_rug(fill = "blue")
geom_hdr_rug()
supports multiple probability specifications, just like geom_hdr_boxplot()
. It expects probabilities between 0 and 1, but, similarly to geom_hdr_boxplot()
, it will complain but accept probabilities between 0 and 100.
ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point() +
geom_hdr_rug(prob = c(10, 20, 50, 99), fill = "blue")
#> Warning: Probability values should be on a scale between 0 to 1.
#> If not, values will be converted to decimal values.
One of the most exciting features geom_hdr_boxplot()
supports (unlike hdr.boxplot
) is the ability to create separate HDR boxplots for different groups in a dataset. We use ggplot2’s native syntax to do this. Let’s explore the ggplot2::mpg
dataset. First, let’s look at a histogram of the data, faceted by number of cylinders
ggplot(data = mpg,
aes(x = hwy, fill = as.factor(cyl))) +
facet_grid(as.factor(cyl)~.) +
geom_histogram(bins = 50)
We can see that cars with 6 cylinders seem to be bimodally distributed. As expected, geom_hdr_boxplot()
nicely shows us both modes in the data.
ggplot(data = mpg,
# make sure to change x to y from geom_density to geom_hdr_boxplot
aes(y = hwy, fill = as.factor(cyl))) +
geom_hdr_boxplot()
gghdr
provides the hdr_bin()
function for annotating data with the HDR they fall under. Providing this to the colour
aesthetic produces scatterplots with the HDR information.
ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point(aes(colour = hdr_bin(x = waiting, y = eruptions))) +
scale_colour_viridis_d(direction = -1)
Feel free to combine this with a rug for more gghdr
goodness.
ggplot(data = faithful, aes(x = waiting, y = eruptions)) +
geom_point(aes(colour = hdr_bin(x = waiting, y = eruptions))) +
geom_hdr_rug() +
scale_colour_viridis_d(direction = -1)