Comparing Distributions Using Violin Plots

This section shows how to combine small multiples of a simple Toyplot visualization into a more complex one, in this case violin plots.

First, we will generate eight datasets, each drawn from a normal distribution with a randomly chosen mean, scale, and sample count:

[1]:
import numpy
import toyplot

numpy.random.seed(1234)

# Generate 8 sets of samples, each with different counts and distributions
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))
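
Before plotting, it can help to confirm what was generated. The following is a minimal sketch that repeats the generation step so it runs on its own, then summarizes each dataset. The `summary` list and the printed layout are illustrative and not part of the original notebook.

```python
import numpy

numpy.random.seed(1234)

# Repeat the generation step so this sketch is self-contained.
datasets = []
for _ in range(8):
    mean = numpy.random.uniform()
    scale = numpy.random.uniform()
    size = numpy.random.randint(100, 2000)
    datasets.append(numpy.random.normal(mean, scale, size=size))

# Summarize each dataset: sample count, observed minimum, observed maximum.
summary = [(len(d), d.min(), d.max()) for d in datasets]
for index, (count, low, high) in enumerate(summary):
    print("Series {}: n={}, min={:.2f}, max={:.2f}".format(index, count, low, high))
```

Because both the sample counts and the ranges differ from series to series, the plots that follow need some care before they can be compared side by side.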

If we wanted to look at the distribution of the first dataset, we could use a simple one-liner histogram plot:

[2]:
dataset = datasets[0]
toyplot.bars(numpy.histogram(dataset, 25), width=600, height=300);
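
The call above passes the result of `numpy.histogram` straight to `toyplot.bars`. As a minimal sketch of what that result contains, the example below uses stand-in data (an assumption for illustration, not the notebook's first dataset) and unpacks the pair of arrays that `numpy.histogram` returns.

```python
import numpy

numpy.random.seed(1234)
dataset = numpy.random.normal(0.0, 1.0, size=500)  # stand-in data for illustration

# numpy.histogram returns the per-bin counts and the bin edges.
counts, edges = numpy.histogram(dataset, 25)

# 25 bins are delimited by 26 edges, and every sample lands in exactly one bin.
print(len(counts), len(edges), counts.sum())
```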

However, unlike multiple line plots, multiple histograms drawn on the same axes overlap and obscure one another:

[3]:
canvas = toyplot.Canvas(width=600, height=300)
axes = canvas.cartesian()
for dataset in datasets:
    axes.bars(numpy.histogram(dataset, 25))

Instead, let’s plot the histograms separately, but in a single canvas:

[4]:
canvas = toyplot.Canvas(width=400, height=1000)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(len(datasets), 1, index))
    axes.bars(numpy.histogram(dataset, 25))
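
The `grid=(rows, columns, index)` argument places each set of axes in one cell of a regular grid. The sketch below illustrates how a single index can map to a cell, assuming cells are numbered in row-major order (left to right, then top to bottom). That ordering is an assumption here, and `grid_cell` is a hypothetical helper, not a Toyplot function. For a single column or a single row the ordering makes no difference.

```python
def grid_cell(rows, columns, index):
    """Return the (row, column) of a cell, assuming row-major numbering."""
    if not 0 <= index < rows * columns:
        raise ValueError("index out of range for grid")
    return divmod(index, columns)

# Eight plots stacked vertically: one column, so the row equals the index.
print([grid_cell(8, 1, i) for i in range(8)])

# Eight plots in a single row: one row, so the column equals the index.
print([grid_cell(1, 8, i) for i in range(8)])
```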

This plot is too tall to be useful, so let’s turn it on its side by arranging the plots in a single row and orienting each one along the Y axis:

[5]:
canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.bars(numpy.histogram(dataset, 25), along="y")

The X axis of each plot is now too short to be useful, and our goal is to make purely qualitative comparisons of the shapes of the distributions anyway. So let’s hide the X axes and replace them with human-readable labels, and add a single Y axis label along the left side of the graph:

[6]:
canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.bars(numpy.histogram(dataset, 25), along="y")
[Figure: eight side-by-side histograms labeled "Series 0" through "Series 7"; the leftmost Y axis is labeled "Range", and each Y axis shows its own tick values.]

Notice that the Y axis values vary from plot to plot, since each dataset covers a different range of values. To make meaningful comparisons between distributions, the axes must be consistent, so let’s force every Y axis to display a domain that spans all of the datasets:

[7]:
canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    axes.bars(numpy.histogram(dataset, 25), along="y")
[Figure: eight side-by-side histograms labeled "Series 0" through "Series 7"; the leftmost Y axis is labeled "Range", and every Y axis shows the same tick values (-2, 0, 2, 5).]
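
The shared domain does not depend on the loop variable, so it can be computed once before the loop rather than on every iteration. The following is a minimal sketch of that step using stand-in datasets; the names `domain_min` and `domain_max` are illustrative.

```python
import numpy

numpy.random.seed(1234)

# Stand-in datasets for illustration, with differing centers and spreads.
parameters = [(0.0, 1.0), (0.5, 0.2), (0.8, 2.0)]
datasets = [numpy.random.normal(m, s, size=200) for m, s in parameters]

# Compute the shared domain once, outside any plotting loop.
domain_min = min(d.min() for d in datasets)
domain_max = max(d.max() for d in datasets)

# The same values come from concatenating every sample first.
combined = numpy.concatenate(datasets)
print(domain_min == combined.min(), domain_max == combined.max())
```

Inside the loop, the two values would then be assigned to `axes.y.domain.min` and `axes.y.domain.max` exactly as in the cell above.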

Now that the Y axis domains are aligned with one another, the bar widths vary from plot to plot, because each histogram splits its own range into 25 bins. This can be an unwanted distraction. To eliminate it, we will replace the bar plots with fill plots:

[8]:
canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    counts, bins = numpy.histogram(dataset, 25)
    centers = (bins[:-1] + bins[1:]) / 2
    axes.fill(centers, counts, along="y")
[Figure: eight side-by-side fill plots labeled "Series 0" through "Series 7"; the leftmost Y axis is labeled "Range", and every Y axis shows the same tick values (-2, 0, 2, 5).]
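
The fill plot needs one position per count, which is why the cell above converts the 26 bin edges into 25 bin centers. The sketch below, using two stand-in datasets chosen for illustration, shows that conversion and also shows why the bar widths differed: a dataset with a wider spread gets wider bins.

```python
import numpy

numpy.random.seed(1234)
samples = {
    "narrow": numpy.random.normal(0.5, 0.2, size=500),  # stand-in data
    "wide": numpy.random.normal(0.5, 2.0, size=500),    # stand-in data
}

bin_widths = {}
for name, dataset in samples.items():
    counts, bins = numpy.histogram(dataset, 25)
    widths = numpy.diff(bins)             # width of each bin, in data units
    centers = (bins[:-1] + bins[1:]) / 2  # midpoint of each bin
    bin_widths[name] = float(widths[0])
    print(name, "bin width:", round(bin_widths[name], 3), "centers:", len(centers))
```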

Finally, we mirror each fill plot around the Y axis to emphasize its shape, producing a classic violin plot:

[9]:
canvas = toyplot.Canvas(width=1000, height=400)
for index, dataset in enumerate(datasets):
    axes = canvas.cartesian(grid=(1, len(datasets), index))
    axes.x.spine.show = False
    axes.x.ticks.labels.show = False
    axes.x.label.text = "Series {}".format(index)
    if index == 0:
        axes.y.show = True
        axes.y.label.text = "Range"
    axes.y.domain.min = numpy.min(numpy.concatenate(datasets))
    axes.y.domain.max = numpy.max(numpy.concatenate(datasets))
    counts, bins = numpy.histogram(dataset, 25)
    centers = (bins[:-1] + bins[1:]) / 2
    axes.fill(centers, counts * 2, baseline=-counts, along="y")
[Figure: eight side-by-side violin plots labeled "Series 0" through "Series 7"; the leftmost Y axis is labeled "Range", and every Y axis shows the same tick values (-2, 0, 2, 5).]
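
The arithmetic behind the mirrored fill can be checked on its own. The sketch below assumes the filled region runs from the baseline to the baseline plus the plotted values, which is how the call above appears to be used; that reading is an assumption, and the names `lower` and `upper` are illustrative.

```python
import numpy

counts = numpy.array([1, 4, 9, 4, 1])  # stand-in histogram counts

baseline = -counts         # one boundary of the filled region
values = counts * 2        # total extent added on top of the baseline
lower = baseline
upper = baseline + values  # opposite boundary, equal to +counts

# The two boundaries are mirror images about zero.
print(lower, upper)
```

Under that assumption, doubling the counts is what keeps each half of the violin as wide as the original one-sided fill.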

The final visualization makes it easy to compare the shapes and domains of the distributions.

As always, don’t forget to render a high-quality PDF of the same figure for publication:

[10]:
import toyplot.pdf
toyplot.pdf.render(canvas, "violin-plot.pdf")