Plotting with matplotlib

We all explore matplotlib together!

Outline:

  1. What is matplotlib?
    • Library of functions to represent data (plots, charts, images, etc.)
    • Making plots is easy, but...
    • Utilizing the full power of mpl requires understanding what's going on under the hood (a little, at least)
  2. Pythonic vs. Matlabby
    • How python likes to do things: explicitly reference figure you want to work on (state-independent)
    • How matlab does things: work on active figure (state-dependent)
    • matplotlib allows both (with incomplete overlap). We will try to be pythonic.
  3. Understanding matplotlib objects
    • plt.subplots() -- a not-so-intuitive way to create a figure
    • The top-level object created by mpl is the figure
    • A figure object contains one or more axes objects (shitty name...these are like subplots or panels)
    • Axes objects contain all the things to make a plot: axis, colors, labels, etc.
    • Anatomy of a figure (picture)
  4. Let's plot!
    • Make a 2-panel figure using python and matlab styles
    • Complex figures are made using separate calls to create the various parts (data, axes, titles, etc.)
    • Infinite customization: example of gridspec + image display

What is matplotlib?

Matplotlib is a bunch of code (Python calls them "modules") that allows us to plot and visualize data using Python. Many of us have used it, and making a basic plot is really easy:

In [263]:
# Import matplotlib's pyplot and numpy
import matplotlib.pyplot as plt
import numpy as np
In [264]:
# Use numpy's random number generation to invent some data that will make a nifty plot
rng = np.arange(50)
rnd = np.random.randint(0,10,size=(3, rng.size))
yrs = 1950 + rng
In [265]:
# Look up the syntax on the internet, and plot!
plt.stackplot(yrs, rng + rnd)
plt.tight_layout() # makes a 'cleaner' plot with less whitespace (no fig object exists yet, so use the pyplot version)

We can then look up a bunch of different plots, ways to add titles, format axis ticks, colors, etc. etc. etc. This works, and it's probably as far as many of us get. But if any of you have dug a little farther, you start running into some confusing stuff. For example, these functions produce some "objects"--what the hell are those? If I make a multi-panel figure, how do I control the different parts? When I google stuff, why does it seem like different people are using totally different syntax to do the same thing (this isn't Perl, dammit!)

If you're going to do anything beyond basic plotting, it's super useful to understand a little bit of "theory" (used loosely here) behind matplotlib. So let's dig in and try to clear up some of the confusion and understand what's going on. Luckily, it's pretty straightforward--you just need a little info.

Python vs. Matlab: a little history to explain some matplotlib quirks

History is useful here. John D. Hunter (a neurobiologist) started matplotlib in 2003 as an attempt to basically bring the plotting/visualization tools from Matlab (paid software) to Python (open-source). He did a great job, and it has since expanded beyond its Matlab origins. BUT, those origins lead to some of the confusion experienced while using matplotlib. The problem is that Python and Matlab have significantly different styles. To illustrate, let's say you have a two-panel figure and are working on two plots, which we'll call plot1 and plot2, and you want to call some function on each of them. Using PSEUDOCODE, the "pythonic" way to work on these would be:

plot1.somefunction()  
plot2.somefunction()

The "matlabby" way to do this would be:

activate(plot1)  
somefunction()  
activate(plot2)  
somefunction()

Matlab prefers functions that "belong" to the system, and which plot they work on depends on the state of the system--which plot is active. One plot is active at a time, and the functions you call work on the active plot, so to work on a plot you first have to activate it. Simple enough.

In Python, each plot is a distinct object, and you use object syntax to refer to each plot explicitly and call functions that "belong" to the object. Critically, these commands don't reference any "state" of the system--the function simply works on its parent object and doesn't need to know which plot is active.

FYI: The Matlab way of doing things is referred to as "stateful" or "state-based", and the python style is "stateless", "object-oriented", or "OO". Just in case you see those terms around.

Which does matplotlib use?

Both! In matplotlib, you can use object-oriented style to work directly on the objects that you generate, or you can use system-level functions. This is great, right? It just means you can pick the style you prefer. Except...the two styles aren't completely overlapping. Most methods can be called both ways, but there are some methods that can only be called one way.

If you're thinking "well that's just super fucking dumb what in the actual hell?" the answer is "yes".

But have no fear! Once you understand that both of these exist, it clears up confusion and it's easy to handle. I think best practice (assuming you aren't an old Matlab user) is to use the pythonic style wherever possible, though managing active figures is super easy and we will learn that too.
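
To make the "incomplete overlap" a bit more concrete, here's a minimal sketch of the same tiny plot built both ways. The data and the variable names (ax_stateful, ax_oo) are made up for illustration. Notice that even where both styles can do the same job, the names often differ (plt.title vs. set_title, plt.xlim vs. set_xlim):

```python
import matplotlib.pyplot as plt

# Stateful style: pyplot functions act on whichever axes is currently active
fig1 = plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])
plt.title('Stateful')
plt.xlabel('x')
plt.xlim(0, 2)
ax_stateful = plt.gca()   # grab the active axes so we can inspect it later

# Object-oriented style: methods are called on an explicit axes object
fig2, ax_oo = plt.subplots()
ax_oo.plot([0, 1, 2], [0, 1, 4])
ax_oo.set_title('Object-oriented')
ax_oo.set_xlabel('x')
ax_oo.set_xlim(0, 2)
```

Both figures end up with the same line, label, and limits; only the way we addressed the plot changed.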

What's in a name? Headaches, that's what

You may have a different experience, but I have found that one of the barriers to using matplotlib is that many of the names are terribly unintuitive. I'll try to explain the names in a way that makes more sense to me and hopefully some sense to you...

Creating and understanding matplotlib objects

In the above example, we used matlab style, calling the system function plt.stackplot() on our data. What we didn't see is that this function actually created some "objects", and learning to work with these objects is the key to really getting good at making matplotlib do what you want.

So we're going to start over and plot in the python way, which will make the object structures much clearer. To start plotting, we use the subplots function from pyplot, which is the not-terribly-intuitive standard way to get started plotting in the python way:

In [266]:
fig, ax = plt.subplots() # this is, in effect, the "make new figure" function

The subplots function generated two things: a figure object, which we named fig, and an axes object, which we named ax. The relationship between these is hierarchical: figure objects are the highest level, and they contain one or more axes objects. So here, subplots() created a new figure object that contains one axes object, which is currently blank.

What are these? Figure is essentially the whole area in which you're going to plot. axes objects are really poorly named, and would probably be better described as subplots or panels. To better illustrate this, let's try making a figure with multiple panels, or multiple axes:

In [267]:
fig, ax = plt.subplots(1,2, figsize=(7, 3))

So we still have one figure, but now we've got two axes. The variable ax is now a NumPy array containing the two axes objects:

In [268]:
print(type(ax))
print(type(ax[0]))
<class 'numpy.ndarray'>
<class 'matplotlib.axes._subplots.AxesSubplot'>

Because they are hierarchical and the figure contains the axes, we can also use the figure to get to the individual axes objects:

In [269]:
print(type(fig.axes))
print(type(fig.axes[0]))
<class 'list'>
<class 'matplotlib.axes._subplots.AxesSubplot'>
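
Once you ask for more than one row, the array you get back is two-dimensional and you index it as [row, column]. Here's a small sketch of that (the 2x2 layout, the made-up data, and the names axs and top_right are mine, just for illustration):

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axs = plt.subplots(2, 2, figsize=(6, 4))
print(axs.shape)            # the grid of panels comes back as a 2-D array

top_right = axs[0, 1]       # index as [row, column]
top_right.set_title('top right')

# .flat walks over every panel, which is handy for looping
for i, panel in enumerate(axs.flat):
    panel.plot(np.arange(5), np.arange(5) * i)

# The very same objects are reachable through the figure's flat list
print(fig.axes[1] is top_right)
```

So ax[...] and fig.axes[...] are just two routes to the same axes objects; use whichever is more convenient.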

The axes object is the heart of matplotlib

Most of the actual work you do will be on axes objects. These are essentially the plots: they contain the data, the plot methods, and the parameters. When you want to create a certain type of plot with certain colors and tick marks and axis labels and whatnot, you'll be calling a bunch of stuff on an axes object.

It can't be denied that a central confusion about these is their name. I think if they were just called "panel" or something, it would be super easy. Your figure contains multiple panels, if you want to work on panel 1 you do panel1.function() and life is grand. axes makes you think of x-axis and y-axis (wouldn't you know...axes objects contain objects called xaxis and yaxis!). So the name doesn't work great, but if you just remember that each axes object is a panel or subplot, you've got it.

Below the axes class are the guts of the plot: the data, the x- and y-axis. They can be accessed using hierarchical "dot" notation of python objects. Here for example is code to get the first tick of the y axis:

first_tick = fig.axes[0].yaxis.get_major_ticks()[0]

You'll only rarely need to get so granular that you're digging down into individual axis ticks, but this structure gives you the power to make the plots exactly what you want for, say, publication figures. As always, the most straightforward way to find what you're looking for is to Google it, but the documentation for the axes class is also a useful reference.
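
As a rough sketch of what that kind of drilling-down can look like, here's one way to control tick placement and styling. The data and the choice of a tick every 10 units are made up; MultipleLocator comes from matplotlib's ticker module:

```python
import matplotlib.pyplot as plt
from matplotlib.ticker import MultipleLocator

fig, ax = plt.subplots()
ax.plot([0, 10, 20, 30], [0, 1, 4, 9])

# Drill down: figure -> axes -> x-axis -> tick locator
ax.xaxis.set_major_locator(MultipleLocator(10))        # major ticks at multiples of 10

# tick_params is a shortcut on the axes for restyling ticks and tick labels
ax.tick_params(axis='y', labelsize=8, colors='gray')

print(ax.get_xticks())
```

The locator decides where ticks go; tick_params only changes how they look.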

Here are a couple graphical representations of the structure of matplotlib figures, lifted from the tutorial linked above:

[Figure: structure of a matplotlib figure, diagram 1]

[Figure: structure of a matplotlib figure, diagram 2]

Let's plot!

So what does this look like in practice? Here's a simple example of managing a two-panel figure in the python style:

In [270]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(7, 3)) #create a figure with 2 axes objects (subplots)
ax1.stackplot(yrs, rng + rnd) #call a method belonging to ax1
ax2.scatter(yrs, (rng + rnd)[0]) #call a method belonging to ax2
Out[270]:
<matplotlib.collections.PathCollection at 0x120286a90>

That's the pythonic way. How about the matlab way?

In [271]:
fig, (ax1, ax2) = plt.subplots(1,2, figsize=(7, 3))
plt.sca(ax1) # sca = set current axes (its useful cousin is gca, 'get current axes')
plt.stackplot(yrs, rng + rnd) #system method stackplot works on the active axes object (ax1)
plt.sca(ax2)
plt.scatter(yrs, (rng + rnd)[0])
Out[271]:
<matplotlib.collections.PathCollection at 0x1206b8438>

Both ways produce the same result. For more complex plotting, I think the pythonic style is generally preferable, but honestly I'm pretty new to matplotlib so who knows.
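
If you ever lose track of which panel is "active", you can just ask. A quick sketch using gca ('get current axes') and gcf ('get current figure'); the panel title is made up:

```python
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(7, 3))

plt.sca(ax1)                 # make ax1 the active axes
plt.title('left panel')      # pyplot functions now act on ax1
print(plt.gca() is ax1)      # gca hands back the active axes object
print(plt.gcf() is fig)      # gcf hands back the active figure
```

This is also a handy bridge between the two styles: grab the active object with gca() or gcf(), then carry on in the pythonic way.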

Adding the details: **kwargs and set_ methods

Building complete, detailed plots works a little differently in matplotlib than in some other languages. Whereas in R or Matlab you might have a single plot function and supply information about the titles, axes, etc. as arguments to that function, in matplotlib you build the different parts of the plot with distinct functions. In the following example:

* the data are plotted by the stackplot() function
* the legend is created by the legend() function 
* the axis labels are created by set_xlabel() and set_ylabel()  


Each of these functions can take a large number of arguments to specify exactly how you want to make the structure in question. These arguments are generally of the form "size=12" and are referred to as keyword arguments, or kwargs, in Python parlance.

In [315]:
fig, ax = plt.subplots(figsize=(5, 3))
ax.stackplot(yrs, rng + rnd, colors=["blue", "gold", "green"], labels=['Eastasia', 'Eurasia', 'Oceania'], alpha=0.7)
ax.set_title('Combined debt growth over time')
ax.legend(loc='upper left')
ax.set_ylabel('$Total$ $debt$', size=12)
ax.set_xlabel('$Year$', size=12)
ax.set_xlim(xmin=yrs[0], xmax=yrs[-1])
fig.tight_layout()
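
If all those separate set_ calls feel like a lot of typing, axes objects also have a general set() method that takes many of the same properties as keyword arguments in one go. A minimal sketch with made-up data (the series and labels are placeholders, not the debt data above):

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(1950, 2000)
y = np.linspace(0, 50, x.size)   # made-up data for illustration

fig, ax = plt.subplots(figsize=(5, 3))
ax.plot(x, y, color='green', linewidth=2, label='made-up series')  # kwargs style the line

# ax.set() bundles several set_* calls into one
ax.set(title='Several properties at once',
       xlabel='Year',
       ylabel='Value',
       xlim=(x[0], x[-1]))
ax.legend(loc='upper left')
fig.tight_layout()
```

The individual set_ methods are still the way to go when you need per-label kwargs like size=12, since set() only passes along the property values themselves.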

Endless possibilities

So there's just an absolutely endless supply of matplotlib packages, methods, etc. to visualize data in any way you can think of (and a whole bunch you can't). I'm sure we will explore a lot of this in coding club. Hopefully this introduction will clear up some confusing things about matplotlib, and set you up with a basic understanding of how it does things. I really recommend taking a look at the tutorial I based this on, as it goes a little deeper and farther than I have here. I leave you with an example of a fairly complex figure from that tutorial:

In [318]:
# Load some data about California housing
from io import BytesIO
import tarfile
from urllib.request import urlopen

url = 'http://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.tgz'
b = BytesIO(urlopen(url).read())
fpath = 'CaliforniaHousing/cal_housing.data'

with tarfile.open(mode='r', fileobj=b) as archive:
    housing = np.loadtxt(archive.extractfile(fpath), delimiter=',')
    
# Arrange the data to plot later    
y = housing[:, -1]
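# Column order assumed from the dataset's cal_housing.domain file:
# housingMedianAge is column 2 and population is column 5
pop, age = housing[:, [5, 2]].T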


# Make a custom function for adding a titlebox
def add_titlebox(ax, text):
    ax.text(.55, .8, text,
        horizontalalignment='center',
        transform=ax.transAxes,
        bbox=dict(facecolor='white', alpha=0.6),
        fontsize=12.5)
    return 

# Use pyplot's subplot2grid (built on matplotlib's gridspec module) to make subplots of unequal size
gridsize = (3, 2)
fig = plt.figure(figsize=(12, 8))
ax1 = plt.subplot2grid(gridsize, (0, 0), colspan=2, rowspan=2)
ax2 = plt.subplot2grid(gridsize, (2, 0))
ax3 = plt.subplot2grid(gridsize, (2, 1))

# Throw down some plots
ax1.set_title('Home value as a function of home age & area population',
              fontsize=14)
sctr = ax1.scatter(x=age, y=pop, c=y, cmap='RdYlGn')
plt.colorbar(sctr, ax=ax1, format='$%d')
ax1.set_yscale('log')
ax2.hist(age, bins='auto')
ax3.hist(pop, bins='auto', log=True)

add_titlebox(ax2, 'Histogram: home age')
add_titlebox(ax3, 'Histogram: area population (log scl.)')
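
If you'd like to play with that unequal-panel layout without downloading anything, here's a stripped-down sketch of the same idea using a GridSpec directly, in the pythonic style. The random data is a stand-in (NOT the housing data), and the names ax_main, ax_left, and ax_right are mine:

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic stand-in data, just to have something to draw
gen = np.random.default_rng(0)
x = gen.uniform(1, 50, 500)
y = gen.lognormal(mean=6, sigma=1, size=500)

fig = plt.figure(figsize=(8, 6))
gs = fig.add_gridspec(3, 2)               # a 3-row x 2-column grid

ax_main = fig.add_subplot(gs[0:2, :])     # top two rows, both columns
ax_left = fig.add_subplot(gs[2, 0])       # bottom-left cell
ax_right = fig.add_subplot(gs[2, 1])      # bottom-right cell

ax_main.scatter(x, y, s=8)
ax_main.set_yscale('log')
ax_left.hist(x, bins='auto')
ax_right.hist(y, bins='auto', log=True)
fig.tight_layout()
```

Slicing the grid (gs[0:2, :]) plays the same role as the colspan and rowspan arguments to subplot2grid above.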