Comparing Distributions Using Violin Plots

This tutorial shows how small multiples of a simple Toyplot visualization can be combined into a more complex one, in this case a set of violin plots.

First, we will generate eight datasets, each drawn from a normal distribution with a randomly chosen mean, scale, and sample count:

[1]:

import numpy
import toyplot

numpy.random.seed(1234)

# Generate 8 sets of samples, each with different counts and distributions
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))
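
Before plotting anything, it can help to check what was generated. The following is a minimal sketch that regenerates the same eight datasets and prints the sample count, sample mean, and sample standard deviation of each. The summary format is purely illustrative and is not part of Toyplot.

```python
import numpy

numpy.random.seed(1234)

# Regenerate the eight sample sets exactly as in the cell above.
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))

# Print one summary row per dataset: count, sample mean, sample standard deviation.
for index, dataset in enumerate(datasets):
    print("Series {}: n={:4d}  mean={:+.3f}  std={:.3f}".format(
        index, dataset.size, dataset.mean(), dataset.std()))
```

Because the sample counts differ from one dataset to the next, the raw histogram counts plotted below are not on a common scale, which is one reason the comparisons in this tutorial are qualitative.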

If we wanted to look at the distribution of the first dataset, we could use a simple one-liner histogram plot:

[2]:

dataset = datasets[0]
toyplot.bars(numpy.histogram(dataset, 25), width=600, height=300);
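
The cell above passes the result of `numpy.histogram` straight to `toyplot.bars`. As a small sketch of what that result contains, the snippet below uses illustrative data (the `samples` array is an assumption and not one of the tutorial's datasets) and unpacks the counts and bin edges.

```python
import numpy

numpy.random.seed(1234)

# Illustrative samples only; any one-dimensional array would do.
samples = numpy.random.normal(0.5, 0.2, size=1000)

# numpy.histogram returns a pair: per-bin counts and the bin edges.
counts, edges = numpy.histogram(samples, 25)

# 25 bins are delimited by 26 edges, and every sample lands in exactly one bin.
print(len(counts), len(edges))
print(counts.sum() == samples.size)
```

By default the edges span the minimum to the maximum of the data, so each dataset gets its own bin width.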

However, unlike multiple line plots, multiple histograms drawn on the same axes don’t work very well, because the bars overlap and hide one another:

[3]:

canvas = toyplot.Canvas(width=600, height=300)
axes = canvas.cartesian()
for dataset in datasets:
    axes.bars(numpy.histogram(dataset, 25))

Instead, let’s plot the histograms separately, but in a single canvas:

[4]:

canvas = toyplot.Canvas(width=400, height=1000)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(len(datasets), 1, index))
    axes.bars(numpy.histogram(dataset, 25))

This plot is too tall to be useful, so let’s turn it on its side by orienting each plot along the Y axis:

[5]:

canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.bars(numpy.histogram(dataset, 25), along="y")

The X axis of each plot is too short to be useful, and our goal is to make purely qualitative comparisons of the shapes of the distributions anyway. So let’s hide the X axes and replace them with human-readable labels, and add a single Y axis label along the left side of the graph:

[6]:

canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.bars(numpy.histogram(dataset, 25), along="y")

Notice that the Y axis values vary from plot to plot, since each dataset covers a different range of values. To make meaningful comparisons between distributions, the axes need to be consistent, so let’s force each Y axis to display a range of values that covers all of the datasets:

[7]:

canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    axes.bars(numpy.histogram(dataset, 25), along="y")
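
The cell above recomputes `numpy.concatenate(datasets)` twice on every pass through the loop. A possible alternative, sketched here under the assumption that the datasets do not change during plotting, is to compute the shared limits once up front. The names `y_min` and `y_max` are illustrative.

```python
import numpy

numpy.random.seed(1234)

# Regenerate the eight sample sets exactly as in the first cell.
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))

# Compute the shared limits once, without building a concatenated copy.
y_min = min(dataset.min() for dataset in datasets)
y_max = max(dataset.max() for dataset in datasets)

print(y_min, y_max)
```

Inside the plotting loop, `y_min` and `y_max` could then be assigned to `axes.y.domain.min` and `axes.y.domain.max` in place of the two `numpy.concatenate` calls.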

Now that the Y axis domains are aligned with one another, the bar widths vary from plot to plot, because each histogram divides its own dataset’s range into 25 bins. This can be an unwanted distraction. To eliminate it, we will replace the bar plots with fill plots:

[8]:

canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    counts, bins = numpy.histogram(dataset, 25)
    centers = (bins[:-1] + bins[1:]) / 2
    axes.fill(centers, counts, along="y")
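
To see where the varying bar widths came from, and what the fill plot is given instead, the following sketch prints each dataset's bin width and computes the bin centers the same way as the cell above. The `bin_widths` list is an illustrative name, not part of the tutorial.

```python
import numpy

numpy.random.seed(1234)

# Regenerate the eight sample sets exactly as in the first cell.
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))

bin_widths = []
for index, dataset in enumerate(datasets):
    counts, bins = numpy.histogram(dataset, 25)
    # All 25 bins of one dataset share a width: that dataset's range divided by 25.
    width = (bins[-1] - bins[0]) / len(counts)
    # Midpoints of adjacent edges give one coordinate per count.
    centers = (bins[:-1] + bins[1:]) / 2
    bin_widths.append(width)
    print("Series {}: bin width {:.4f}, {} centers".format(index, width, len(centers)))
```

Plotting counts against centers drops the explicit bin edges, so the differing widths no longer appear as differently sized bars.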

Finally, we mirror each fill plot around the Y axis to emphasize its shape, producing a classic violin plot:

[9]:

canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    counts, bins = numpy.histogram(dataset, 25)
    centers = (bins[:-1] + bins[1:]) / 2
    axes.fill(centers, counts * 2, baseline=-counts, along="y")
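
The arithmetic behind the mirrored fill can be checked without plotting. This sketch assumes that, when a `baseline` is supplied, Toyplot treats the series as a magnitude stacked on top of that baseline, so the filled region runs from `baseline` to `baseline + series`. The `samples`, `lower`, and `upper` names are illustrative.

```python
import numpy

numpy.random.seed(1234)

# Illustrative samples only; any one-dimensional array would do.
samples = numpy.random.normal(0.5, 0.2, size=1000)
counts, bins = numpy.histogram(samples, 25)

# Lower boundary is the baseline; the upper boundary adds the magnitude to it.
lower = -counts
upper = lower + counts * 2

# The two boundaries mirror each other around zero.
print(numpy.array_equal(upper, -lower))
```

Under that assumption, the region spans `-counts` to `+counts` at every bin center, which is what gives each violin its symmetric outline.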

The final visualization makes it easy to compare the shapes and domains of the distributions.

As always, don’t forget to render the same figure as a high-quality PDF for publication:

[10]:

import toyplot.pdf
toyplot.pdf.render(canvas, "violin-plot.pdf")