Introduction to Python


Data Visualization with Python

Today’s goals:

  • Introduce Python’s utilities for data visualization
  • Learn how to customize several types of plots for publication-worthy data visualizations
  • Explore online documentation

Workflow for Generating Figures

In general, the recipe for creating a figure is as follows:

  1. Generate or import the data to be plotted. This is generally a list of x values and a list of y values.
  2. Import matplotlib.pyplot to have access to the library and its functions.
  3. Generate the figure and axes.
  4. Plot the data on the appropriate axis or axes using functions such as plot(), scatter(), bar(), etc.
  5. Customize your plot using the built-in options for the function you used to plot (found in documentation).
  6. Render your plot using show() and save or export it if you wish.
Data visualization workflow
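
The recipe above can be sketched end to end as follows. This is only a minimal illustration: the data are made up, the file name workflow_example.png is an arbitrary choice, and the Agg backend line is an assumption so that the sketch runs without a display (drop it and call plt.show() in an interactive session).

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt  # step 2: import the plotting library

# Step 1: generate the data (made-up values, for illustration only)
x = [0, 1, 2, 3, 4]
y = [0, 1, 4, 9, 16]

# Step 3: generate the figure and axes
fig, ax = plt.subplots()

# Step 4: plot the data on the axes
line, = ax.plot(x, y)

# Step 5: customize the plot
ax.set_title("Example figure")
ax.set_xlabel("x")
ax.set_ylabel("y")

# Step 6: save the figure (plt.show() would display it in an interactive session)
out_path = os.path.join(tempfile.gettempdir(), "workflow_example.png")
fig.savefig(out_path)
```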

Matplotlib

Matplotlib is Python’s most widely used library for visualization. It has extensive documentation available online, including many tutorials. Within Matplotlib, you will mostly be working with pyplot to generate simple plots. You can view the documentation for pyplot here. Each function within pyplot comes with a detailed description of the arguments it takes, which will be very useful when you would like to customize your plots.

Importing matplotlib.pyplot

To import matplotlib.pyplot, simply type import matplotlib.pyplot as plt at the top of your code. You can then refer to the library as plt in your code as needed. Note that the plt alias isn’t strictly necessary, but it is an almost-universal naming convention (other libraries follow similar conventions too).

import matplotlib.pyplot as plt

Figures, Plots, and Subplots

In Python, the best way to make a figure is by using the subplots() function to define a figure and set(s) of axes. The reason we use the subplots() function is that it makes it easy to add multiple plots/axes to a figure, which is commonly done in the visualization of scientific data. To define a figure, you can write:

fig, ax = plt.subplots()

The First Plot

fig, ax = plt.subplots()

A Closer Look

fig, ax = plt.subplots()
  • fig is your figure (think: shape and size of your plot)
  • ax is your set of axes where you will plot your data and customize how the plot looks
  • You can have multiple axes in a single figure (which we will see later)
  • subplots() is a function which has multiple arguments that you can use to specify the size and shape of your figure, as well as other parameters for your axes.

Let’s visit the documentation and take a look at the options.

Figure Size

  • You can specify the size of the figure using the figsize=(width, height) option, with both dimensions given in inches. The size on screen depends on the figure’s dpi (100 by default) and on your monitor.
  • At the bottom of your code, type plt.show() to render your plot
fig, ax = plt.subplots(figsize=(4, 3))
plt.show()
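
As a rough sketch of how inches and dpi relate, the pixel dimensions of a figure are its size in inches multiplied by its dpi. Here dpi is passed explicitly (an assumption made so the arithmetic does not depend on local defaults), and the Agg backend is used only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Set dpi explicitly so the result does not depend on local configuration
fig, ax = plt.subplots(figsize=(4, 3), dpi=100)

# Figure size in inches, as (width, height)
width_in, height_in = fig.get_size_inches()

# Pixel size = inches * dots per inch
width_px = width_in * fig.dpi
height_px = height_in * fig.dpi

print(f"{width_in} x {height_in} in at {fig.dpi} dpi -> {width_px:.0f} x {height_px:.0f} px")
```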

Axis Limits

  • You can set the limits of your x and y axes by applying the set_xlim() or set_ylim() function to your axes, ax.
fig, ax = plt.subplots(figsize=(3, 2))
ax.set_xlim(-3, 4)
## (-3.0, 4.0)
plt.show()

Multiple Subplots

Looking at the subplots() documentation again, we can see that we can specify the number and arrangement of subplots we want:

  • ncols for the number of columns
  • nrows for the number of rows

You can then define a corresponding axis for each subplot. Below is the code to generate two side-by-side subplots (fig1) and two vertically stacked subplots (fig2).

fig1, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)

fig2, (ax3, ax4) = plt.subplots(nrows=2, ncols=1)

Example with 4 Subplots

Let’s make a 2x2 grid of subplots. Note that we use nested brackets to specify the positions of each subplot within the figure.

fig, ((ax_nw, ax_ne), (ax_sw, ax_se)) = plt.subplots(nrows=2, ncols=2)

plt.show()
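
Unpacking with nested brackets is one option. As an alternative sketch, you can keep what subplots() returns in a single variable (called axes here, an arbitrary name): for a grid with more than one row and column it is a 2D NumPy array that you can index by row and column or loop over.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Keep all four axes in one array instead of unpacking them
fig, axes = plt.subplots(nrows=2, ncols=2)

print(axes.shape)  # (number of rows, number of columns)

# Index as axes[row, column]; row 0 is the top row
ax_ne = axes[0, 1]
ax_ne.set_title("top right")

# axes.flat visits every subplot, row by row
for index, ax in enumerate(axes.flat):
    ax.set_xlabel(f"panel {index}")
```

Indexing is handy when the number of subplots is large or computed at run time, while named unpacking reads better for small, fixed layouts.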

Arranging Subplots

Advanced Subplot Options

  • Other subplots() options:
    • width_ratios and height_ratios to adjust the relative sizes of rows and columns
    • sharex and sharey to force subplots to share an x or y axis
  • gridspec for arbitrary/custom subplots (ex: different number of plots in each row)
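
A minimal sketch of these options, under a few assumptions: the width ratios are passed through gridspec_kw (recent Matplotlib versions also accept width_ratios directly in subplots()), the custom layout uses fig.add_gridspec(), and all variable names are illustrative.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Two side-by-side panels: the left one twice as wide, both sharing a y axis
fig1, (ax_left, ax_right) = plt.subplots(
    nrows=1, ncols=2, sharey=True, gridspec_kw={"width_ratios": [2, 1]}
)
ax_left.set_ylim(0, 5)  # also applies to ax_right because the y axis is shared

# Custom layout: one wide panel on top, two panels underneath
fig2 = plt.figure()
grid = fig2.add_gridspec(nrows=2, ncols=2)
ax_top = fig2.add_subplot(grid[0, :])          # top row, spanning both columns
ax_bottom_left = fig2.add_subplot(grid[1, 0])  # bottom row, first column
ax_bottom_right = fig2.add_subplot(grid[1, 1]) # bottom row, second column
```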

Importing Data

Now that we can make figures and axes, let’s grab some data to plot. We are going to use some patient data that contains sex (‘Male’ or ‘Female’), weight (kg) and height (cm). We will then have 3 arrays of data to work with.

import numpy as np
sex, height, weight = np.genfromtxt('data/height-weight.csv', unpack = True, delimiter = ",", skip_header=True, dtype=None, encoding='UTF-8')
print(sex)
## ['Male' 'Male' 'Male' 'Male' 'Female' 'Female' 'Female' 'Female' 'Male'
##  'Male' 'Female' 'Female' 'Male' 'Female' 'Female' 'Male' 'Male' 'Male'
##  'Male' 'Male' 'Male' 'Female' 'Male' 'Female' 'Female' 'Male' 'Male'
##  'Male' 'Male' 'Male' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female'
##  'Male' 'Female' 'Male' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Female' 'Female' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Female' 'Male' 'Female' 'Female' 'Female' 'Male' 'Male'
##  'Female' 'Male' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female' 'Male'
##  'Female' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female' 'Male' 'Male'
##  'Female' 'Female' 'Female' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Male' 'Female' 'Male' 'Male' 'Male' 'Female' 'Male' 'Female'
##  'Female' 'Female' 'Female' 'Female' 'Female']

Manipulating our Data

Let’s separate our weight and height data by sex. Can you see what the code below does?

height_m = []
height_f = []
weight_m = []
weight_f = []

for i in range(len(sex)):
    if sex[i] == 'Male':
        height_m.append(height[i])
        weight_m.append(weight[i])
    else:
        height_f.append(height[i])
        weight_f.append(weight[i])
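
The loop above splits the data into male and female groups. As an alternative sketch, the same split can be written with NumPy boolean masks. The small arrays below are hypothetical stand-ins for the data loaded from the CSV file, and note that this version produces NumPy arrays rather than lists.

```python
import numpy as np

# Hypothetical stand-ins for the arrays loaded from the CSV file
sex = np.array(["Male", "Female", "Female", "Male"])
height = np.array([180.0, 165.0, 170.0, 175.0])
weight = np.array([80.0, 60.0, 65.0, 75.0])

# Boolean array: True where the patient is male
is_male = sex == "Male"

# Indexing with the mask keeps the True entries; ~ inverts the mask
height_m, weight_m = height[is_male], weight[is_male]
height_f, weight_f = height[~is_male], weight[~is_male]
```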

Scatter Plots

Scatter plots are used for displaying discrete data points, where each point has a set of coordinates \((x,y)\). If you want to plot data points \((x_1, y_1), (x_2, y_2) ... (x_n, y_n)\) from lists \(x = (x_1, x_2,...,x_n)\) and \(y = (y_1, y_2,...,y_n)\), you can use the scatter() function, applied to the axis you want to plot on.

Let’s create a plot of height vs weight using our patient data. We are going to generate a figure with 1 row and 2 columns of subplots and create a scatter plot on the left subplot.

# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m)
ax1.scatter(weight_f, height_f)

plt.show()

Customizing Scatter Plots

Take a look at the options in the scatter() documentation. Some common parameters you could use to customize your scatter plot are:

  • s: marker size in points ^2 (don’t ask why…).
  • color (c): marker color. Enter a string that could include a named color, RGB code, or hex color code. Find a full guide to specifying colors here.
  • marker: marker style. Choose between a variety of preset options, the default being ‘o’ for circles. View the full list of options here.
  • linewidths: width of the marker outline. Enter number in pts.
  • edgecolors: color of the marker outline. Enter as a string, similar to the value of c.
  • alpha: transparency (0 = transparent, 1 = opaque)

Named Colors

Matplotlib has a number of named colors.

Matplotlib named colors

Named Colors

You can use these color names as strings to define colors within the parameters of scatter(), or you can specify hex codes as strings or RGB values as tuples of numbers between 0 and 1.

  • Named colors: color = 'magenta'
  • HEX: color = '#1D7308'
  • RGB: color = (0.1, 0.2, 0.5)
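
As a small sketch of how these formats relate, the matplotlib.colors module provides to_hex() and to_rgb() to convert between them. The color values are the ones from the list above.

```python
from matplotlib import colors

# A named color converted to its hex code
magenta_hex = colors.to_hex("magenta")
print(magenta_hex)

# A hex code converted to an RGB tuple with components between 0 and 1
green_rgb = colors.to_rgb("#1D7308")
print(green_rgb)

# An RGB tuple converted to a hex code
blue_hex = colors.to_hex((0.1, 0.2, 0.5))
print(blue_hex)
```

Any of the three forms can be passed wherever a plotting function expects a single color.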

Customize Your Plot

Using these colors and the list of parameters below, take a second to customize your plot of weight vs height.

  • s: marker size in points ^2 (don’t ask why…).
  • color (c): marker color. Enter a string that could include a named color, RGB code, or hex color code. Find a full guide to specifying colors here.
  • marker: marker style. Choose between a variety of preset options, the default being ‘o’ for circles. View the full list of options here.
  • linewidths: width of the marker outline. Enter number in pts.
  • edgecolors: color of the marker outline. Enter as a string, similar to the value of c.
  • alpha: transparency (0 = transparent, 1 = opaque)

Customized Plot

# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m, c = 'royalblue', alpha = 0.5, marker = 's')
ax1.scatter(weight_f, height_f, c = 'magenta', alpha = 0.5, marker = 'o')

plt.show()

Titles and Axis Labels

Let’s add a title and some axis labels to our plot. To do this, we can use the following functions:

ax1.set_title("Height vs Weight")
ax1.set_xlabel("Weight (kg)")
ax1.set_ylabel("Height (cm)")

Be sure to add all of this code before the plt.show() line, which renders the plot. Anything after show() will not be applied to the figure you see.

Add Your Labels

# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m, c = 'royalblue', alpha = 0.5, marker = 's')
ax1.scatter(weight_f, height_f, c = 'magenta', alpha = 0.5, marker = 'o')

ax1.set_title("Height vs Weight")
ax1.set_xlabel("Weight (kg)")
ax1.set_ylabel("Height (cm)")

plt.show()

Resulting Plot

Histograms

Another common plot type is a histogram. We are going to put a histogram of height distributions by sex in the blank subplot. We will do this using the hist() function, applied to the axes we want to plot on (ax2). You can find the documentation here.

At a minimum, hist() requires the data points you wish to plot as an argument. You may also specify the bins argument as an integer (the default is 10).

Let’s take our previous plot and add histograms for the male and female height distributions to the subplot axes on the right, each with 5 bins. Also, add a title and axis labels for the histogram. Note: You may have to change the figsize parameter so that the labels all fit.

Histograms

ax2.hist(height_m, bins=5, color = 'royalblue', alpha = 0.5)
ax2.hist(height_f, bins=5, color = 'magenta', alpha = 0.5)


ax2.set_title('Height Distributions by Sex')
ax2.set_ylabel('Count')
ax2.set_xlabel('Height (cm)')
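
To see what the bins argument does, it can help to look at what hist() returns: the counts per bin, the bin edges, and the drawn bars. The sketch below uses made-up heights, not the patient data.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up heights in cm, for illustration only
heights = [150, 155, 160, 162, 165, 168, 170, 172, 175, 180, 185, 190]

fig, ax = plt.subplots()

# hist() returns the counts per bin, the bin edges, and the bar objects
counts, bin_edges, patches = ax.hist(heights, bins=5, color="royalblue", alpha=0.5)

print(counts)     # number of data points falling in each bin
print(bin_edges)  # 5 bins are delimited by 6 edges, from min to max of the data
```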

Legends

We can add a legend to a set of axes by using the legend() function. You can find the documentation here.

We automatically generate a legend by adding a parameter called label as a string in each plot we would like to include in the legend, and then calling the legend() function.

You can pass arguments to this function to specify the formatting and location of the legend, but we’ll skip that part today. Check out the documentation for full details!

In the box below, add a legend for each set of axes.

Legends

# Add legend (each plotted dataset must first be given a label=... argument)
ax1.legend()
ax2.legend()
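
A minimal sketch of the two steps together, label first and legend() second, using made-up data rather than the patient dataset:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

# Made-up data, for illustration only
weight_m, height_m = [70, 80, 90], [172, 178, 185]
weight_f, height_f = [55, 62, 70], [160, 165, 171]

fig, ax = plt.subplots()

# Step 1: give each dataset a label when plotting it
ax.scatter(weight_m, height_m, c="royalblue", marker="s", label="Male")
ax.scatter(weight_f, height_f, c="magenta", marker="o", label="Female")

# Step 2: legend() collects the labeled datasets into a legend
legend = ax.legend()
```

Datasets plotted without a label are left out of the legend.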

Line Plots

Another common plot is a line plot. This uses the plot() function from matplotlib.pyplot (check out the documentation here). The minimum arguments for plot() are the x- and y-coordinates to be plotted, which will be output with a line connecting them. Let’s plot the function \(y = x^2\).

The first thing we need to do is to define the list of coordinates to be plotted. Remember that even though the function we are plotting is continuous mathematically, we will still be plotting a discrete set of points joined by line segments. In the box below:

  1. Define the x-values using numpy’s linspace() function to create an array of 100 evenly spaced points between 0 and 2.
  2. Define a function \(f(x) = x^2\)
  3. Use it to generate a list of y-values.

Line Plots


x_values = np.linspace(0, 2, 100)

def f(x):
    return x**2

y_values = f(x_values)
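
To make the discreteness concrete, here is a small sketch of what linspace() produces. It includes both endpoints, so 100 points give 99 equal steps. The 5-point version (x_coarse, a name chosen for illustration) is small enough to read off by hand.

```python
import numpy as np

# 100 evenly spaced points from 0 to 2, both endpoints included
x_values = np.linspace(0, 2, 100)
step = x_values[1] - x_values[0]  # spacing is (2 - 0) / 99
print(len(x_values), x_values[0], x_values[-1], step)

# A coarse version with only 5 points: 0, 0.5, 1, 1.5, 2
x_coarse = np.linspace(0, 2, 5)
y_coarse = x_coarse**2  # element-wise square
print(x_coarse)
print(y_coarse)
```

Plotting the coarse version would show visible corners, while 100 points look smooth at normal figure sizes.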

Line Plots

Next, plot your function on a fresh set of axes. Add any labels and other customizations you would like.

fig, ax = plt.subplots()

ax.plot(x_values, y_values, c = 'teal', linestyle = '--')

ax.set_title("Quadratic Function $y = x^2$")
ax.set_xlabel("x")
ax.set_ylabel("y")

plt.show()

Seaborn

matplotlib.pyplot is the bread and butter of data visualization in Python, and allows you near-arbitrary degrees of customization for your plots.

However, the seaborn library is built on top of matplotlib and is designed to make nice-looking plots with less code.

We are going to use it to make a violin plot, because that is something that matplotlib.pyplot does not do a nice job of.

Going back to our patient data, we are going to make a violin plot of the patient weight distributions by sex.

We will use the violinplot() function from the seaborn library, which we will import as sns.

Seaborn

import seaborn as sns

fig, ax = plt.subplots(figsize = (4, 4))

sns.violinplot(weight_m, ax = ax)

plt.show()

Seaborn

fig, ax = plt.subplots(figsize = (4, 4))

sns.violinplot(weight_m, ax = ax, color = 'royalblue', alpha = 0.5, linewidth=0, label = "Male")

sns.violinplot(weight_f, ax = ax, color = 'magenta', alpha = 0.5, linewidth=0, label = "Female")

ax.set_ylabel('Weight (kg)')

ax.set_title("Weight Distribution by Sex")

ax.legend()

plt.show()

Saving and Exporting

Now that we have created several figures, we may want to save and export them. To do this, we will apply the savefig() function to our figure. This function takes your desired filepath as an input, as well as other optional parameters such as dpi (resolution), sizing, and transparency. Let’s save our most recent figure. We will also use fig.tight_layout() to remove any added white space and ensure that nothing is cut off.

fig.tight_layout()
fig.savefig("my_violin_plot.pdf", bbox_inches = "tight")
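
As a sketch of the optional parameters, the example below saves one figure twice: as a PNG with an explicit dpi and a transparent background, and as a PDF. The figure contents, the file names, and the temporary output directory are all illustrative choices.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(4, 4))
ax.plot([0, 1, 2], [0, 1, 4])
ax.set_title("Example")

# Adjust spacing so titles and labels fit inside the figure
fig.tight_layout()

# Throwaway output directory for this sketch
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "example.png")
pdf_path = os.path.join(out_dir, "example.pdf")

# Raster output: dpi sets the resolution, transparent removes the background
fig.savefig(png_path, dpi=300, transparent=True, bbox_inches="tight")

# Vector output: the format is inferred from the file extension
fig.savefig(pdf_path, bbox_inches="tight")
```

Vector formats such as PDF stay sharp at any zoom level, which is often what journals ask for, while PNG is convenient for slides and web pages.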

Colormaps

Rather than using a single color to plot your data, you may want to use a color map. This is particularly true for things like heatmaps, or when you are displaying an image.

To do this, you can use existing colormaps within matplotlib, or create your own.
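
A minimal sketch of both routes: plt.get_cmap() looks up an existing colormap by name, and LinearSegmentedColormap.from_list() builds a simple custom one. The name white_to_blue and its two colors are arbitrary choices for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap

# Existing colormap: calling it with a number in [0, 1] returns an RGBA color
viridis = plt.get_cmap("viridis")
print(viridis(0.0))  # color at the low end of the scale
print(viridis(1.0))  # color at the high end of the scale

# Custom colormap that blends linearly from white to royalblue
white_to_blue = LinearSegmentedColormap.from_list(
    "white_to_blue", ["white", "royalblue"]
)
low_color = white_to_blue(0.0)
high_color = white_to_blue(1.0)
```

Either object can be passed as the cmap argument of plotting functions that accept one.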

Choosing a Colormap

It’s important to choose a colormap that:

  • Is visually faithful to the scale
  • Translates well to greyscale (printing)
  • Is accessible to those with common forms of color blindness

It turns out that people have thought about this problem a lot and have come up with some color maps that do a great job at maximizing these properties.

Viridis

My personal favourite is called viridis (watch the launch video here - surprisingly interesting), but there is actually a selection of these schemes available.

Perceptually uniform colormaps.

Don’t Use Jet!

Some color schemes that may seem natural to use (especially rainbow/jet) actually tend to skew our perceptions of the data values, as seen in the photo below (sourced from here), and therefore are not recommended.
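
One rough way to check this yourself is to estimate how bright each color along a colormap would look in greyscale. The sketch below uses the Rec. 601 luma weights as a crude proxy for perceived brightness, which is an assumption and not how Matplotlib designed its colormaps. The helper approx_luma is hypothetical.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # assumption: non-interactive backend so this runs without a display
import matplotlib.pyplot as plt


def approx_luma(cmap_name, n=256):
    """Approximate greyscale brightness along a colormap (Rec. 601 weights)."""
    rgba = plt.get_cmap(cmap_name)(np.linspace(0, 1, n))  # shape (n, 4)
    return rgba[:, :3] @ np.array([0.299, 0.587, 0.114])


luma_viridis = approx_luma("viridis")
luma_jet = approx_luma("jet")

# Position of the brightest color along each map (0 = low end, 1 = high end)
peak_viridis = np.argmax(luma_viridis) / (len(luma_viridis) - 1)
peak_jet = np.argmax(luma_jet) / (len(luma_jet) - 1)

print(f"viridis: brightest color at position {peak_viridis:.2f}")
print(f"jet:     brightest color at position {peak_jet:.2f}")
```

If the brightest color sits in the interior of the scale, as it does for jet, mid-range values can look more prominent than the largest ones, and the ordering is lost when printed in greyscale.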

Colors and Accessibility in Other Plots

When you are creating any plot with multiple datasets/colors, keep colorblindness and black-and-white conversion in mind. Using different line styles (such as dashes) and different marker shapes is a good way to keep datasets distinguishable without relying on color alone!

Accessibility matters for all types of data visualization!

Colormaps in Seaborn

Let’s use the heatmap() function from Seaborn to generate a plot of the time progression of several genes.

data = np.genfromtxt('data/gene_data.csv', unpack = True, delimiter = ",", skip_header=True)

print(data)
## [[0.2 0.3 0.5 0.6 0.7]
##  [0.3 0.4 0.4 0.5 0.4]
##  [0.  0.1 0.2 0.2 0.1]
##  [0.9 0.7 0.6 0.5 0.4]
##  [0.6 0.3 0.5 0.7 0.4]]

Heatmap Plot

fig, ax = plt.subplots(figsize = (5,4))
sns.heatmap(data, linewidths = 0.5, cmap = 'viridis', annot = True, ax = ax)
ax.set_xlabel("Time")
ax.set_ylabel("Gene")
ax.set_title("Gene Progression")
plt.show()

Color Maps in Matplotlib

Let’s make a scatter plot with weight vs height again, but make the color of the points defined by the ratio of weight to height.

# Generate figure and axes
fig, ax = plt.subplots(figsize=(6, 4))
ratio = weight/height
im = ax.scatter(weight, height, c=ratio, cmap='viridis')
fig.colorbar(im, ax=ax)

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("Height (cm)")
ax.set_title("Weight vs Height")

plt.show()

Time for an exercise!

Exercises around plotting can be found here

Answers to exercise

Answers can be found here