Python Charts

Python visualization demystified

Histograms in Matplotlib

Plotting histograms in Matplotlib

Histograms are useful for visualizing the distribution of data, and they are simple to create in Matplotlib.

The Basics

Let's get the tips data from Seaborn:

from matplotlib import pyplot as plt
import numpy as np
import seaborn as sns

df = sns.load_dataset('tips')
df.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
# Plot a simple, default histogram.
# plt.hist() returns a tuple of three objects describing the histogram:
# the bin counts, the bin edges, and the drawn bars (patches).
# The default number of bins is 10.
n, bins, patches = plt.hist(df['total_bill'])

simple histogram
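To make the three return values concrete, here is a minimal sketch. It uses synthetic data so it runs without downloading anything; the `values` array is an assumption standing in for `df['total_bill']`.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in for df['total_bill'] (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=5.0, size=244)

n, bins, patches = plt.hist(values)

print(len(n))        # number of bins (10 by default)
print(len(bins))     # number of bin edges, always one more than the bins
print(n.sum())       # the counts add up to the number of data points
print(len(patches))  # one drawn bar per bin
```

The counts in `n` and the bars in `patches` are in the same left-to-right order as the edges in `bins`.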

# Let's plot again but using 6 bins instead.
n, bins, patches = plt.hist(df['total_bill'], bins=6)

simple histogram binned
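The `bins` argument also accepts a sequence of explicit bin edges instead of a bin count, which is handy when the bins should not all be the same width. A small sketch, using synthetic data and a hypothetical `custom_edges` list:

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic, hypothetical data in the range [0, 50).
rng = np.random.default_rng(0)
values = rng.uniform(0, 50, size=244)

# An integer asks for that many equal-width bins...
n6, edges6, _ = plt.hist(values, bins=6)

# ...while a sequence is used as the bin edges themselves.
plt.figure()
custom_edges = [0, 10, 20, 30, 50]
n_custom, edges_custom, _ = plt.hist(values, bins=custom_edges)

print(len(n6), len(edges6))              # 6 bins, 7 edges
print(len(n_custom), len(edges_custom))  # 4 bins, 5 edges
```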

# Plot the density instead
n, bins, patches = plt.hist(df['total_bill'], density=True)

simple histogram with density
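With density=True the bar heights are no longer counts. Each height is the count divided by the total count times the bin width, so the bar areas sum to 1. The sketch below checks that on synthetic data (the `values` array is an assumption, not the tips dataset).

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=5.0, size=244)

n, bins, patches = plt.hist(values, density=True)

# Area of each bar is height * width; together they should sum to 1.
widths = np.diff(bins)
area = np.sum(n * widths)
print(area)

# The same heights, computed by hand from the raw counts.
counts, _ = np.histogram(values, bins=bins)
by_hand = counts / (counts.sum() * widths)
```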

Styling the Histogram

By default the histogram bars have no visible separation, because the bar edges are not drawn in a contrasting color. Setting edgecolor to the background color creates a thin gap between the bars.

n, bins, patches = plt.hist(df['total_bill'], edgecolor='white')

histogram with edgecolor

An alternative is to make the bars narrower using rwidth, which sets the bar width as a fraction of the bin width.

# We adjust the color as well as the relative width of the bars.
n, bins, patches = plt.hist(df['total_bill'], color='teal', rwidth=0.9)

histogram with rwidth
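The returned patches can also be styled one at a time, since there is one rectangle per bin. As a rough sketch on synthetic data (the `values` array and the choice of orange are assumptions for illustration), this highlights the tallest bar:

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=5.0, size=244)

n, bins, patches = plt.hist(values, color='teal', edgecolor='white')

# patches holds one rectangle per bin, in the same order as n.
tallest = int(np.argmax(n))
patches[tallest].set_facecolor('orange')
```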

Binned / Aggregated Data

Have data that is already aggregated and binned? You can still use plt.hist: pass the left edge of each bin as the data, the bin edges as bins, and the counts as weights.

# If you have data that is already binned, e.g.:
counts, bins = np.histogram(df['total_bill'])
# You can plot it directly with plt.hist: each left bin edge stands in for
# the bin, and its count is applied as the weight.
n, bins, patches = plt.hist(bins[:-1], bins, weights=counts)

histogram with binned data
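Another option for pre-binned data is an ordinary bar chart. The sketch below uses plt.bar with synthetic data; this is one possible alternative rather than necessarily the one the text had in mind, and the `values` array is an assumption.

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=5.0, size=244)

# Pretend this is all we were given: counts per bin and the bin edges.
counts, edges = np.histogram(values)

# Draw one bar per bin, starting at its left edge and spanning its width.
bars = plt.bar(edges[:-1], counts, width=np.diff(edges), align='edge',
               edgecolor='white')
```

Because the widths come from np.diff(edges), this also works when the bins are not all the same width.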

Cumulative Histograms

Cumulative histograms are simple as well:

n, bins, patches = plt.hist(
    df['total_bill'], cumulative=True, edgecolor='white')

cumulative histogram
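In a cumulative histogram each bar is the running total of the counts up to and including that bin, so the last bar equals the number of data points. A minimal check on synthetic data (the `values` array is an assumption):

```python
import numpy as np
from matplotlib import pyplot as plt

# Synthetic stand-in data (an assumption for this sketch).
rng = np.random.default_rng(0)
values = rng.gamma(shape=4.0, scale=5.0, size=244)

n, bins, patches = plt.hist(values, cumulative=True)

# The same numbers, built by hand from the plain per-bin counts.
counts, _ = np.histogram(values, bins=bins)
running_total = np.cumsum(counts)

# Combined with density=True, the bars climb to 1 instead of the total count.
plt.figure()
n_frac, _, _ = plt.hist(values, cumulative=True, density=True)
print(n[-1], n_frac[-1])
```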

Understanding Bin Borders

Histograms separate data into bins, each with a start value and an end value. The start value is included in the bin; the end value is not, because it belongs to the next bin. This is true for every bin except the last, which includes its end value as well (since there is no next bin).

Here we show the bin values on the histogram.

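# Bins are half-open, [start, end), except the last one, which is closed: [start, end].
n, bins, patches = plt.hist(df['total_bill'], align='mid', edgecolor='black')
# Label each bin edge with its value, rounded to one decimal place.
for num in bins:
    plt.text(num, 1, str(round(num, 1)), ha='center', color='white')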


histogram with bin values
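The edge rule is easiest to see with a tiny example where every value sits exactly on a bin edge. The data and edges below are made up for illustration:

```python
from matplotlib import pyplot as plt

edges = [0, 1, 2, 3]
data = [0, 1, 2, 3]  # every value lies exactly on a bin edge

n, bins, patches = plt.hist(data, bins=edges)

# Bins are [0, 1), [1, 2) and [2, 3].
# 0 lands in the first bin, 1 in the second, and both 2 and 3 in the last.
print(list(n))  # counts are 1, 1 and 2
```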

Comparing Histograms Across Data

Let's load the iris dataset and see how to plot histograms for several groups of data on the same axes so they can be compared.

df = sns.load_dataset('iris')
b, bins, patches = plt.hist(
    [df.loc[df['species'] == 'setosa', 'sepal_length'],
     df.loc[df['species'] == 'versicolor', 'sepal_length'],
     df.loc[df['species'] == 'virginica', 'sepal_length']],
    label=['Setosa', 'Versicolor', 'Virginica'])
plt.legend()  # the labels only appear once a legend is drawn

histogram with multiple data
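When plt.hist is given a list of datasets, all of them are binned on one shared set of edges, and the return values grow accordingly. A sketch with three synthetic groups (hypothetical stand-ins for the three species):

```python
import numpy as np
from matplotlib import pyplot as plt

# Three synthetic groups standing in for the species (hypothetical data).
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 0.4, size=50) for loc in (5.0, 5.9, 6.6)]

n, bins, patches = plt.hist(groups, label=['A', 'B', 'C'])
plt.legend()

n = np.asarray(n)
print(n.shape)       # one row of counts per group
print(len(bins))     # a single set of bin edges shared by all groups
print(len(patches))  # one container of bars per group
```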

You can either plot them side by side, as above, or stacked, as below.

b, bins, patches = plt.hist(
    [df.loc[df['species'] == 'setosa', 'sepal_length'],
     df.loc[df['species'] == 'versicolor', 'sepal_length'],
     df.loc[df['species'] == 'virginica', 'sepal_length']],
    label=['Setosa', 'Versicolor', 'Virginica'],
    stacked=True)
plt.legend()  # the labels only appear once a legend is drawn

histogram with multiple data stacked
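With stacked=True the returned counts are running totals across the datasets, so the last row gives the full height of each stack. The sketch below checks this on three synthetic groups (hypothetical data, not the iris dataset):

```python
import numpy as np
from matplotlib import pyplot as plt

# Three synthetic groups standing in for the species (hypothetical data).
rng = np.random.default_rng(0)
groups = [rng.normal(loc, 0.4, size=50) for loc in (5.0, 5.9, 6.6)]

n, bins, patches = plt.hist(groups, stacked=True, label=['A', 'B', 'C'])
plt.legend()

# Row i holds the combined counts of groups 0..i for each bin.
n = np.asarray(n)

# The top of each stack equals the histogram of all the data pooled together.
total_counts, _ = np.histogram(np.concatenate(groups), bins=bins)
```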