class: middle, inverse, title-slide .title[ # Intro to Python - Session 3 ] .subtitle[ ##
Bioinformatics Resource Center - Rockefeller University ] .author[ ###
http://rockefelleruniversity.github.io/Intro_To_Python/
] .author[ ###
brc@rockefeller.edu
]

---

## Data Visualization with Python

Today's goals:

- Introduce Python's utilities for data visualization
- Learn how to customize several types of plots for publication-worthy data visualizations
- Explore online documentation

---

## Workflow for Generating Figures

In general, the recipe for creating a figure is as follows:

1. Generate or import the data to be plotted. This is generally a list of x values and a list of y values.
2. Import `matplotlib.pyplot` to have access to the library and its functions.
3. Generate the figure and axes.
4. Plot the data on the appropriate axis or axes using functions such as `plot()`, `scatter()`, `bar()`, etc.
5. Customize your plot using the built-in options for the function you used to plot (found in the documentation).
6. Render your plot using `show()` and save or export it if you wish.

---

## Matplotlib

Matplotlib is Python's most widely used library for visualization. It has extensive documentation available [online](https://matplotlib.org/stable/), including many [tutorials](https://matplotlib.org/stable/tutorials/index.html).

Within Matplotlib, you will mostly be working with `pyplot` to generate simple plots. You can view the documentation for pyplot [here](https://matplotlib.org/stable/api/pyplot_summary.html#module-matplotlib.pyplot). Each function within pyplot has a detailed description of the arguments it takes - these will be very useful when you would like to customize your plots.

---

## Importing matplotlib.pyplot

To import `matplotlib.pyplot`, type `import matplotlib.pyplot as plt` at the top of your code. You can then refer to the library as `plt` wherever you need it. The alias isn't strictly necessary, but it is an almost-universal naming convention (other libraries follow similar conventions too).
``` python
import matplotlib.pyplot as plt
```

---

## Figures, Plots, and Subplots

In Matplotlib, a convenient and widely used way to make a figure is the [`subplots()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) function, which defines a figure and one or more sets of axes.

We use `subplots()` because it makes it easy to add multiple plots/axes to a figure, which is commonly done in the visualization of scientific data.

To define a figure, you can write:

``` python
fig, ax = plt.subplots()
```

---

## The First Plot

``` python
fig, ax = plt.subplots()
```

---

## A Closer Look

``` python
fig, ax = plt.subplots()
```

- `fig` is your figure (think: shape and size of your plot)
- `ax` is your set of axes, where you will plot your data and customize how the plot looks
- You can have multiple axes in a single figure (which we will see later)
- [`subplots()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) is a function with multiple arguments that you can use to specify the size and shape of your figure, as well as other parameters for your axes. Let's visit the [documentation](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) and take a look at the options.

---

## Figure Size

- You can specify the size of the figure using the `figsize=([width], [height])` option. The dimensions are in inches; the size in pixels depends on the figure's dpi, which defaults to 100.
- At the bottom of your code, type `plt.show()` to render your plot.

``` python
fig, ax = plt.subplots(figsize=(4, 3))
plt.show()
```

---

## Axis Limits

- You can set the limits of your x and y axes by calling the `set_xlim()` or `set_ylim()` method on your axes, `ax`.
``` python
fig, ax = plt.subplots(figsize=(3, 2))
ax.set_xlim(-3, 4)
```

```
## (-3.0, 4.0)
```

``` python
plt.show()
```

---

## Multiple Subplots

Looking at the [`subplots()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) documentation again, we can see that we can specify the number and arrangement of subplots we want:

- `ncols` for the number of columns
- `nrows` for the number of rows

You can then unpack one axes object for each subplot. Below is the code to generate two side-by-side subplots (`fig1`) and two vertically stacked subplots (`fig2`).

``` python
fig1, (ax1, ax2) = plt.subplots(nrows=1, ncols=2)
fig2, (ax3, ax4) = plt.subplots(nrows=2, ncols=1)
```

---

## Example with 4 Subplots

Let's make a 2x2 grid of subplots. Note that we use nested parentheses to unpack the axes: one inner tuple per row, with one name per column.

``` python
fig, ((ax_nw, ax_ne), (ax_sw, ax_se)) = plt.subplots(nrows=2, ncols=2)
plt.show()
```

---

## Advanced Subplot Options

- Other [`subplots()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html) options:
  - `width_ratios` and `height_ratios` to adjust the relative sizes of columns and rows
  - `sharex` and `sharey` to force subplots to share an x or y axis
- [`gridspec`](https://matplotlib.org/stable/users/explain/axes/arranging_axes.html) for arbitrary/custom subplots (e.g. a different number of plots in each row)

---

## Importing Data

Now that we can make figures and axes, let's grab some data to plot. We are going to use some patient data that contains sex ('Male' or 'Female'), weight (kg) and height (cm). Loading it will give us 3 arrays of data to work with.
``` python
import numpy as np

# skip_header=1 skips the header row; unpack=True returns one array per column
sex, height, weight = np.genfromtxt('data/height-weight.csv', unpack = True,
                                    delimiter = ",", skip_header=1,
                                    dtype=None, encoding='UTF-8')
print(sex)
```

```
## ['Male' 'Male' 'Male' 'Male' 'Female' 'Female' 'Female' 'Female' 'Male'
##  'Male' 'Female' 'Female' 'Male' 'Female' 'Female' 'Male' 'Male' 'Male'
##  'Male' 'Male' 'Male' 'Female' 'Male' 'Female' 'Female' 'Male' 'Male'
##  'Male' 'Male' 'Male' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female'
##  'Male' 'Female' 'Male' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Female' 'Female' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Female' 'Male' 'Female' 'Female' 'Female' 'Male' 'Male'
##  'Female' 'Male' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female' 'Male'
##  'Female' 'Female' 'Male' 'Male' 'Female' 'Male' 'Female' 'Male' 'Male'
##  'Female' 'Female' 'Female' 'Female' 'Male' 'Female' 'Female' 'Female'
##  'Female' 'Male' 'Female' 'Male' 'Male' 'Male' 'Female' 'Male' 'Female'
##  'Female' 'Female' 'Female' 'Female' 'Female']
```

---

## Manipulating our Data

Let's separate our weight and height data by sex. Can you see what the code below does?

``` python
height_m = []
height_f = []
weight_m = []
weight_f = []

for i in range(len(sex)):
    if sex[i] == 'Male':
        height_m.append(height[i])
        weight_m.append(weight[i])
    else:
        height_f.append(height[i])
        weight_f.append(weight[i])
```

---

## Scatter Plots

Scatter plots are used for displaying discrete data points, where each point has a set of coordinates `\((x,y)\)`.

If you want to plot data points `\((x_1, y_1), (x_2, y_2) ... (x_n, y_n)\)` from lists `\(x = (x_1, x_2,...,x_n)\)` and `\(y = (y_1, y_2,...,y_n)\)`, you can use the [`scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) function, applied to the axis you want to plot on.

Let's create a plot of height vs weight using our patient data. We are going to generate a figure with two side-by-side subplots (1 row, 2 columns) and create a scatter plot on the left subplot.
``` python
# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m)
ax1.scatter(weight_f, height_f)

plt.show()
```

---

## Customizing Scatter Plots

Take a look at the options in the [`scatter()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html) documentation. Some common parameters you could use to customize your scatter plot are:

- `s`: marker size in points^2, i.e. the marker's area rather than its width.
- `c` (or `color`): marker color. Accepts a named color, an RGB tuple, or a hex color code. Find a full guide to specifying colors [here](https://matplotlib.org/stable/users/explain/colors/colors.html#colors-def).
- `marker`: marker style. Choose between a variety of preset options, the default being 'o' for circles. View the full list of options [here](https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers).
- `linewidths`: width of the marker outline, in points.
- `edgecolors`: color of the marker outline, specified in the same way as `c`.
- `alpha`: transparency (0 = transparent, 1 = opaque)

---

## Named Colors

Matplotlib has a number of [named colors](https://matplotlib.org/stable/gallery/color/named_colors.html).

---

## Named Colors

You can use these color names as strings to define colors within the parameters of `scatter()`, or you can specify a hex color code as a string or an RGB color as a tuple.

* Named colors: `color = 'magenta'`
* HEX: `color = '#1D7308'`
* RGB: `color = (0.1, 0.2, 0.5)`

---

## Customize Your Plot

Using these colors and the list of parameters below, take a second to customize your plot of weight vs height.

- `s`: marker size in points^2, i.e. the marker's area rather than its width.
- `c` (or `color`): marker color. Accepts a named color, an RGB tuple, or a hex color code. Find a full guide to specifying colors [here](https://matplotlib.org/stable/users/explain/colors/colors.html#colors-def).
- `marker`: marker style. Choose between a variety of preset options, the default being 'o' for circles. View the full list of options [here](https://matplotlib.org/stable/api/markers_api.html#module-matplotlib.markers).
- `linewidths`: width of the marker outline, in points.
- `edgecolors`: color of the marker outline, specified in the same way as `c`.
- `alpha`: transparency (0 = transparent, 1 = opaque)

---

## Customized Plot

``` python
# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m, c = 'royalblue', alpha = 0.5, marker = 's')
ax1.scatter(weight_f, height_f, c = 'magenta', alpha = 0.5, marker = 'o')

plt.show()
```

---

## Titles and Axis Labels

Let's add a title and some axis labels to our plot. To do this, we can use the following methods:

``` python
ax1.set_title("Height vs Weight")
ax1.set_xlabel("Weight (kg)")
ax1.set_ylabel("Height (cm)")
```

Be sure to add all of this code before the `plt.show()` line, which renders the plot. Anything after `show()` will not be applied to the figure you see.

---

## Add Your Labels

``` python
# Generate figure and axes
fig, (ax1, ax2) = plt.subplots(ncols = 2, nrows = 1, figsize=(10, 4))

# Plot data on ax1
ax1.scatter(weight_m, height_m, c = 'royalblue', alpha = 0.5, marker = 's')
ax1.scatter(weight_f, height_f, c = 'magenta', alpha = 0.5, marker = 'o')

# Add title and axis labels
ax1.set_title("Height vs Weight")
ax1.set_xlabel("Weight (kg)")
ax1.set_ylabel("Height (cm)")

plt.show()
```

---

## Histograms

Another common plot type is a histogram. We are going to put a histogram of the height distributions by sex in the blank subplot.
We will do this using the `hist()` function, applied to the axis we want to plot on (`ax2.hist()`). You can find the documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html).

At a minimum, `hist()` requires the data points you wish to plot as an argument. You may also specify the `bins` argument as an integer (the default is 10).

Let's take our previous plot and add histograms for the male and female height distributions to the subplot axes on the right, each with 5 bins. Also, add a title and axis labels for the histogram.

Note: You may have to change the `figsize` parameter so that the labels all fit.

---

## Histograms

``` python
ax2.hist(height_m, bins=5, color = 'royalblue', alpha = 0.5)
ax2.hist(height_f, bins=5, color = 'magenta', alpha = 0.5)

ax2.set_title('Height Distributions by Sex')
ax2.set_ylabel('Count')
ax2.set_xlabel('Height (cm)')
```

---

## Legends

We can add a legend to a set of axes by using the `legend()` function. You can find the details in the Matplotlib `legend()` documentation.

A legend is generated automatically when we pass a string as the `label` parameter in each plotting call we would like to include, and then call `legend()` on the axes.

You can pass arguments to `legend()` to specify the formatting and location of the legend, but we'll skip that part today. Check out the documentation for full details!

Add a legend for each set of axes.

---

## Legends

``` python
# Add legend (each scatter()/hist() call above needs a label= argument)
ax1.legend()
ax2.legend()
```

---

## Line Plots

Another common plot is a line plot. This uses the `plot()` function from `matplotlib.pyplot` (check out the documentation [here](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.plot.html)). The minimum arguments for `plot()` are the x- and y-coordinates to be plotted, which will be drawn with a line connecting them.

Let's plot the function `\(y = x^2\)`. The first thing we need to do is to define the list of coordinates to be plotted.
Remember that even though the function we are plotting is mathematically continuous, we will still be plotting a discrete set of points joined by line segments.

To do this:

1. Define the list of x-values using `numpy`'s `linspace()` function to create an array of 100 evenly spaced points between 0 and 2.
2. Define a function `\(f(x) = x^2\)`.
3. Use it to generate a list of y-values.

---

## Line Plots

``` python
x_values = np.linspace(0, 2, 100)

def f(x):
    return x**2

y_values = f(x_values)
```

---

## Line Plots

Next, plot your function on a fresh set of axes. Add any labels and other customizations you would like.

``` python
fig, ax = plt.subplots()

ax.plot(x_values, y_values, c = 'teal', linestyle = '--')
ax.set_title("Quadratic Function $y = x^2$")
ax.set_xlabel("x")
ax.set_ylabel("y")

plt.show()
```

---

## Seaborn

`matplotlib.pyplot` is the bread and butter of data visualization in Python, and allows you near-arbitrary degrees of customization for your plots.

The `seaborn` library is built on top of `matplotlib` and aims to make nice-looking plots with less code. We are going to use it to make a violin plot, because a polished violin plot takes noticeably more effort with `matplotlib.pyplot` alone.

Going back to our patient data, we are going to make a violin plot of the patient weight distributions by sex. We will use the `violinplot()` function from the `seaborn` library, which we will import as `sns`.
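
For comparison, here is a minimal sketch of a violin plot drawn with Matplotlib's own `violinplot()` method. The two samples are synthetic stand-ins generated with NumPy (an assumption for illustration, not the patient data), and the group names are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (hypothetical values, not the patient dataset)
rng = np.random.default_rng(0)
weights_a = rng.normal(loc=80, scale=10, size=50)
weights_b = rng.normal(loc=65, scale=8, size=50)

fig, ax = plt.subplots(figsize=(4, 4))

# violinplot() takes a sequence of datasets and returns a dict of artists
parts = ax.violinplot([weights_a, weights_b], showmedians=True)

# Styling is manual: color each violin body one at a time
for body, color in zip(parts['bodies'], ['royalblue', 'magenta']):
    body.set_facecolor(color)
    body.set_alpha(0.5)

# Violins are drawn at x = 1, 2, ... so label those positions by hand
ax.set_xticks([1, 2])
ax.set_xticklabels(['Group A', 'Group B'])
ax.set_ylabel('Weight (kg)')
```

Call `plt.show()` afterwards to render the figure. The loop over `parts['bodies']` and the hand-placed tick labels are the kind of styling work that `seaborn` takes care of for you.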
---

## Seaborn

``` python
import seaborn as sns

fig, ax = plt.subplots(figsize = (4, 4))
sns.violinplot(weight_m, ax = ax)

plt.show()
```

---

## Seaborn

``` python
fig, ax = plt.subplots(figsize = (4, 4))

sns.violinplot(weight_m, ax = ax, color = 'royalblue', alpha = 0.5, linewidth=0, label = "Male")
sns.violinplot(weight_f, ax = ax, color = 'magenta', alpha = 0.5, linewidth=0, label = "Female")

ax.set_ylabel('Weight (kg)')
ax.set_title("Weight Distribution by Sex")
ax.legend()

plt.show()
```

---

## Saving and Exporting

Now that we have created several figures, we may want to save and export them.

To do this, we will apply the [`savefig()`](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.savefig.html) function to our figure. This function takes your desired filepath as an input, as well as other optional parameters such as `dpi` (resolution), sizing, and transparency.

Let's save our most recent figure. We will also use `fig.tight_layout()` to trim excess white space and ensure that nothing is cut off.

``` python
fig.tight_layout()
fig.savefig("my_violin_plot.pdf", bbox_inches = "tight")
```

---

## Colormaps

Rather than using a single color to plot your data, you may want to use a colormap. This is particularly true for things like heatmaps, or when you are displaying an image.

To do this, you can use existing [colormaps](https://matplotlib.org/stable/gallery/color/colormap_reference.html) within `matplotlib`, or [create your own](https://matplotlib.org/stable/users/explain/colors/colormap-manipulation.html#colormap-manipulation).

---

## Choosing a Colormap

It's important to choose a [colormap](https://matplotlib.org/stable/users/explain/colors/colormaps.html) that:

- Is visually faithful to the scale
- Translates well to greyscale (for printing)
- Is accessible to those with common forms of color blindness
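
One rough way to check the greyscale criterion is to sample a colormap and see whether its brightness rises steadily from one end to the other. The sketch below does this for `viridis` and `jet`. The helper name `greyscale_profile()` and the luma weights (0.299, 0.587, 0.114) are assumptions chosen for illustration, and luma is only an approximation of perceived lightness.

```python
import numpy as np
import matplotlib.pyplot as plt

def greyscale_profile(cmap_name, n=256):
    """Approximate greyscale brightness along a colormap (hypothetical helper)."""
    cmap = plt.get_cmap(cmap_name)
    rgb = cmap(np.linspace(0, 1, n))[:, :3]  # drop the alpha channel
    # Weighted sum of R, G, B (a common luma approximation)
    return rgb @ np.array([0.299, 0.587, 0.114])

for name in ['viridis', 'jet']:
    lum = greyscale_profile(name)
    # Count how often brightness drops when moving along the colormap
    n_drops = int(np.sum(np.diff(lum) < 0))
    print(f"{name}: start={lum[0]:.2f}, end={lum[-1]:.2f}, drops={n_drops}")
```

A colormap whose brightness both rises and falls maps different data values to the same shade of grey, so the ordering of the data is lost when the figure is printed in black and white.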
It turns out that people have thought about this problem *a lot* and have come up with some colormaps that do a great job of satisfying these properties.

---

## Viridis

My personal favourite is called *viridis* (watch the launch video [here](https://www.youtube.com/watch?v=xAoljeRJ3lU&ab_channel=Enthought) - surprisingly interesting), but there is actually a selection of these schemes available.

---

### Don't Use Jet!

Some color schemes that may seem natural to use (especially rainbow/jet) actually tend to skew our perception of the data values, as illustrated by [this comparison](https://www.domestic-engineering.com/drafts/viridis/viridis.html), and are therefore not recommended.

---

## Colors and Accessibility in Other Plots

When you are creating any plot with multiple datasets/colors, keep colorblindness and black-and-white conversion in mind. Varying the line styles (dashes) and marker shapes is also a good way to distinguish datasets without relying on color alone!

<img src="imgs/scatter_accessible.png" alt="Accessibility matters for all types of data visualization!" height="400" width="400">

---

## Colormaps in Seaborn

Let's use the [`heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html) function from Seaborn to generate a plot of the time progression of several genes.

``` python
data = np.genfromtxt('data/gene_data.csv', unpack = True, delimiter = ",", skip_header=True)
print(data)
```

```
## [[0.2 0.3 0.5 0.6 0.7]
##  [0.3 0.4 0.4 0.5 0.4]
##  [0.  0.1 0.2 0.2 0.1]
##  [0.9 0.7 0.6 0.5 0.4]
##  [0.6 0.3 0.5 0.7 0.4]]
```

---

## Heatmap Plot

``` python
fig, ax = plt.subplots(figsize = (5,4))

# annot=True prints each value inside its cell
sns.heatmap(data, ax = ax, linewidths = 0.5, cmap = 'viridis', annot = True)

ax.set_xlabel("Time")
ax.set_ylabel("Gene")
ax.set_title("Gene Progression")

plt.show()
```

---

## Colormaps in Matplotlib

Let's make a scatter plot of weight vs height again, but this time color each point by its ratio of weight to height.
``` python
# Generate figure and axes
fig, ax = plt.subplots(figsize=(6, 4))

# Color each point by its weight-to-height ratio
ratio = weight/height
im = ax.scatter(weight, height, c=ratio, cmap='viridis')

# Add a colorbar that maps colors back to ratio values
fig.colorbar(im, ax=ax)

ax.set_xlabel("Weight (kg)")
ax.set_ylabel("Height (cm)")
ax.set_title("Weight vs Height")

plt.show()
```

---

## Time for an exercise!

Exercises around plotting can be found [here](https://rockefelleruniversity.github.io/Intro_To_Python/exercises/exercises/MyExercise6_exercise.html)

---

## Answers to exercise

Answers can be found [here](https://rockefelleruniversity.github.io/Intro_To_Python/exercises/answers/MyExercise6_answers.html)

---

## Data Visualization Resources

- [Matplotlib Cheatsheets and Handouts](https://matplotlib.org/cheatsheets/)
- Matplotlib [Tutorials](https://matplotlib.org/stable/tutorials/index.html) and [User Guide](https://matplotlib.org/stable/users/index.html)
- [Fundamentals of Data Visualization by Claus O. Wilke](https://clauswilke.com/dataviz/)
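
---

## Recap: Labels and Legends

As a recap of the session's workflow, the sketch below puts a scatter plot and a histogram side by side and adds a legend to each. It is self-contained, so the data are synthetic stand-ins generated with NumPy and the group names are hypothetical. The point to notice is the `label` argument in each plotting call, which is what `legend()` picks up.

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic stand-in data (hypothetical values, not the patient dataset)
rng = np.random.default_rng(1)
height_a = rng.normal(175, 7, 40)
height_b = rng.normal(162, 6, 40)
weight_a = rng.normal(80, 10, 40)
weight_b = rng.normal(65, 8, 40)

fig, (ax1, ax2) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

# label= gives each dataset the name that legend() will display
ax1.scatter(weight_a, height_a, c='royalblue', alpha=0.5, marker='s', label='Group A')
ax1.scatter(weight_b, height_b, c='magenta', alpha=0.5, marker='o', label='Group B')
ax1.set_xlabel('Weight (kg)')
ax1.set_ylabel('Height (cm)')
ax1.legend()

ax2.hist(height_a, bins=5, color='royalblue', alpha=0.5, label='Group A')
ax2.hist(height_b, bins=5, color='magenta', alpha=0.5, label='Group B')
ax2.set_xlabel('Height (cm)')
ax2.set_ylabel('Count')
ax2.legend()

fig.tight_layout()
```

Finish with `plt.show()` to render the figure, or `fig.savefig()` to export it.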