Introduction to Python

Recap

Session 1 covered an introduction to Python data types and import/export of data.

Session Overview

Reading and Writing Files

Reading/writing files

Most of the time, you will not be generating data in Python but will be importing data from external files.

A standard format for this data is a table:

Gene_Name	Sample_1.hi	Sample_2.hi	Sample_3.hi
Gene_a	3.676307	3.682774	2.976411
Gene_b	5.309033	5.543664	2.966018
Gene_c	2.865454	4.093604	5.648436
Gene_d	3.950261	4.979192	3.563585
Gene_e	10.470795	10.601351	10.012884
Gene_f	9.020501	8.526737	10.610281
Gene_g	11.617416	9.318539	9.223805
Gene_h	9.730001	8.579512	10.906766

Working Directory

If you are reading in data, you need to know from which vantage point your computer is looking. The view point of your computer is the Working Directory.

os package

There are several utilities in the os (operating system) package. First, let's check our current Working Directory using getcwd.

import os

os.getcwd()

## '/Users/mattpaul/'

Setting my Working Directory

The working directory is based on your VS Code workspace. If you have not set this, you can do so using the Explorer tab on the left of VS Code. If the workspace is in the wrong place, you can open a new window, set up a new workspace, then open a new Python console. Let’s navigate to the course material that we downloaded. There are some example data sets there: Intro_To_Python-master/r_course/

Working Directory

You can also change your Working Directory directly in Python using the chdir function.

os.chdir("/Users/mattpaul/Downloads/Intro_To_Python-master/r_course/")

os.listdir()

## ['presRaw', 'imgs', 'height-weight_female.csv', 'customCSS', 'data', 'GeneNames_highexpression.txt', '_course.yml', 'Descriptions']

os.listdir("data")

## ['gene_expression.csv', 'ToRead.csv', 'subset_dataset.py', 'Data Visualization', 'myresult.RData', 'gene_data.csv', 'GeneNames.txt', 'height-weight_female.csv', 'Final Project', 'height-weight.csv', 'GeneExpression.txt', 'subset_dataset_args.py', 'GeneExpressionWithMethods.txt', 'readThisTable.csv']
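As a minimal sketch of how getcwd, chdir and listdir fit together, the example below moves into a throwaway directory and back again. The temporary directory and the file name example.csv are hypothetical stand-ins so the code runs on any computer; they are not part of the course material.

```python
import os
import tempfile

# Remember where we started so we can return afterwards.
start_dir = os.getcwd()

# Create a throwaway directory holding one small file (hypothetical example data).
with tempfile.TemporaryDirectory() as tmp_dir:
    with open(os.path.join(tmp_dir, "example.csv"), "w") as handle:
        handle.write("a,b\n1,2\n")

    os.chdir(tmp_dir)          # move the working directory
    contents = os.listdir()    # list what is visible from the new location
    print(contents)

    os.chdir(start_dir)        # move back before the directory is deleted
```

Storing the starting directory first makes it easy to return to it, which avoids later relative paths being resolved from an unexpected place.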

Paths

When you give a path to Python, it can be either relative or absolute.

Paths

To use our map analogy:

Relative paths are like directions, e.g. take a left, go straight, then take a right. The context of where you start is essential.
Absolute paths are like an address. They give the final location without needing any other information.

Both have their benefits.

Paths in use

The command we used before was using an absolute path. Absolute paths typically start with “/”, the top level of your computer’s file structure.

os.chdir("/Users/mattpaul/Downloads/Intro_To_Python-master/r_course/")

This absolute path is very precise, but it contains information specific to my computer. This means it will only work on this computer.

Given that we started at: /Users/mattpaul, we could have also used a relative path. This uses the knowledge that we are in that start position to find where we are going.

os.chdir("Downloads/Intro_To_Python-master/r_course/")

If we ensure the code is run from an equivalent position, then we can use relative paths across computers.
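The difference between the two kinds of path can be checked from Python itself. This is a small sketch using os.path; the file does not need to exist for these checks, and the path data/ToRead.csv is simply borrowed from the course folder as an illustration.

```python
import os

# A relative path: only meaningful in combination with the working directory.
relative = os.path.join("data", "ToRead.csv")
print(os.path.isabs(relative))    # False

# abspath resolves the relative path against the current working directory.
absolute = os.path.abspath(relative)
print(os.path.isabs(absolute))    # True
print(absolute)
```

The absolute path printed at the end will differ between computers, which is exactly why relative paths are easier to share.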

Back to Reading

There are many ways to read in data. Most of the time we are importing simple 2D tables, so we want to generate a NumPy array using the genfromtxt() function.

Here we are reading a csv file (Comma-Separated Values). This means each value in our file is separated by a comma. So when we read the file we will specify that the delimiter is a comma.

We can use the relative path from our new working directory to find it.

import numpy as np
my_table = np.genfromtxt("data/ToRead.csv", delimiter=",")

type(my_table)

## <class 'numpy.ndarray'>

my_table

## array([[ 4.5702372 ,  3.23046698,  3.35182734,  3.93087741,  4.09824666,
##          4.41872599],
##        [ 3.56173302,  3.63228533,  3.58752332,  4.18528704,  1.38097605,
##          5.93699012],
##        [ 3.79727358,  2.87446167,  4.01691555,  4.17577191,  1.98826299,
##          3.78091724],
##        [ 3.39824235,  4.41520211,  4.89356109,  8.4323419 ,  9.60915099,
##          9.01986467],
##        [10.12878671, 10.22407116,  8.94581254,  2.93617444,  3.89292402,
##          0.91920785],
##        [ 8.4743968 ,  8.612628  ,  7.17082986,  3.29935093,  2.57586964,
##          2.54277726],
##        [10.01002009, 10.31235402, 11.60328984,  9.93070417,  7.74879528,
##          9.79882378],
##        [ 9.39999242, 10.33284375,  9.37821714, 10.06519963, 10.78861857,
##         10.24532579]])
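To try genfromtxt without needing the course files, the sketch below reads comma-separated text from memory instead of from disk. The values in csv_text are made up for illustration; io.StringIO simply makes a string behave like an open file.

```python
import io
import numpy as np

# Two rows of three comma-separated numbers (made-up values).
csv_text = "1.5,2.0,3.25\n4.0,5.5,6.75\n"

# genfromtxt accepts a file-like object as well as a path.
small_table = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(small_table.shape)    # (2, 3)
print(small_table.dtype)    # float64
print(small_table[1, 2])    # 6.75
```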

Reading

Here we try to read in a more complex file. This data is a mixture of characters and numbers. The dtype argument can be used to specify the format of the data we are reading in. This data also has column titles, so we will skip over those with skip_header.

The data set contains:

Sex - Unicode string
Height - float
Weight - float


dtype = ['U10' , 'f', 'f' ]

my_table2 = np.genfromtxt('data/height-weight.csv', delimiter = ",", skip_header=True, dtype=dtype)

Reading

When you read in data with multiple data formats, genfromtxt does not make a 2D array. Instead it makes a 1D structured array in which each element is one row of the file, stored as a record with a field for each column.

my_table2

## array([('Male', 182.87, 76.57), ('Male', 179.12, 80.43),
##        ('Male', 169.15, 75.48), ('Male', 175.66, 94.54),
##        ('Female', 164.47, 71.78), ('Female', 158.27, 69.9 ),
##        ('Female', 161.69, 68.85), ('Female', 165.84, 70.44),
##        ('Male', 181.32, 76.9 ), ('Male', 167.37, 79.06),
##        ('Female', 160.06, 72.37), ('Female', 166.48, 67.34),
##        ('Male', 175.39, 92.22), ('Female', 164.7 , 75.69),
##        ('Female', 163.79, 65.76), ('Male', 181.13, 72.33),
##        ('Male', 169.24, 73.3 ), ('Male', 176.22, 97.67),
##        ('Male', 174.09, 72.2 ), ('Male', 180.11, 75.72),
##        ('Male', 179.24, 75.54), ('Female', 161.92, 69.92),
##        ('Male', 169.85, 90.63), ('Female', 160.57, 63.54),
##        ('Female', 168.24, 69.57), ('Male', 177.75, 74.84),
##        ('Male', 183.21, 83.36), ('Male', 167.75, 82.06),
##        ('Male', 181.15, 83.93), ('Male', 181.56, 79.54),
##        ('Female', 160.03, 64.3 ), ('Male', 165.62, 76.72),
##        ('Male', 181.64, 96.91), ('Female', 159.67, 71.88),
##        ('Male', 177.03, 74.04), ('Female', 163.35, 70.46),
##        ('Male', 175.21, 83.65), ('Female', 160.8 , 64.77),
##        ('Male', 166.46, 76.83), ('Female', 157.95, 67.41),
##        ('Male', 180.61, 83.59), ('Female', 159.52, 67.99),
##        ('Female', 163.01, 65.19), ('Female', 165.8 , 71.77),
##        ('Female', 170.03, 66.68), ('Female', 157.16, 69.64),
##        ('Female', 164.58, 72.99), ('Female', 163.47, 72.89),
##        ('Male', 185.43, 87.23), ('Female', 165.34, 70.84),
##        ('Female', 163.45, 67.67), ('Female', 163.97, 66.71),
##        ('Female', 161.38, 73.55), ('Female', 160.09, 65.93),
##        ('Male', 178.64, 97.05), ('Female', 159.78, 68.31),
##        ('Female', 161.57, 67.92), ('Female', 161.83, 66.03),
##        ('Male', 169.66, 77.3 ), ('Male', 166.84, 88.25),
##        ('Female', 159.32, 64.92), ('Male', 170.51, 84.35),
##        ('Female', 161.84, 69.97), ('Male', 171.41, 81.7 ),
##        ('Male', 166.75, 79.06), ('Female', 166.19, 67.46),
##        ('Male', 169.16, 90.08), ('Female', 157.01, 66.56),
##        ('Male', 167.51, 84.15), ('Female', 160.47, 68.2 ),
##        ('Female', 162.33, 66.47), ('Male', 175.67, 88.82),
##        ('Male', 174.25, 80.93), ('Female', 158.94, 65.14),
##        ('Male', 172.72, 67.62), ('Female', 159.23, 69.96),
##        ('Male', 176.54, 90.76), ('Male', 184.34, 90.41),
##        ('Female', 163.94, 71.47), ('Female', 160.09, 68.94),
##        ('Female', 162.32, 72.72), ('Female', 162.59, 69.76),
##        ('Male', 171.94, 82.11), ('Female', 158.07, 69.8 ),
##        ('Female', 158.35, 69.72), ('Female', 162.18, 67.81),
##        ('Female', 159.38, 70.37), ('Male', 171.45, 84.29),
##        ('Female', 163.17, 64.47), ('Male', 183.1 , 82.47),
##        ('Male', 177.14, 88.7 ), ('Male', 171.08, 72.51),
##        ('Female', 159.33, 70.68), ('Male', 185.43, 73.63),
##        ('Female', 162.65, 73.99), ('Female', 159.44, 66.21),
##        ('Female', 164.11, 70.66), ('Female', 159.13, 66.96),
##        ('Female', 160.58, 71.49), ('Female', 164.88, 68.07)],
##       dtype=[('f0', '<U10'), ('f1', '<f4'), ('f2', '<f4')])
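The field names f0, f1 and f2 shown in the dtype above can be used to pull out a single column. The sketch below uses a two-row, in-memory copy of the data (the variable names csv_text and records are only for illustration) so that it runs without the course files.

```python
import io
import numpy as np

# A tiny stand-in for height-weight.csv, including the header row.
csv_text = "Sex,Height,Weight\nMale,182.87,76.57\nFemale,164.47,71.78\n"
dtype = ["U10", "f", "f"]

records = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                        skip_header=1, dtype=dtype)

print(records.shape)          # (2,) : one entry per row
print(records.dtype.names)    # ('f0', 'f1', 'f2')
print(records["f0"])          # the Sex column
print(records[0])             # the first row as a single record
```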

Reading

While reading we can also assign our data into multiple objects using the unpack argument. Here we assign each column into its own separate array.

sex, height, weight = np.genfromtxt('data/height-weight.csv', unpack = True, delimiter = ",", skip_header=True, dtype=dtype)

sex

## array(['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
##        'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female',
##        'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male',
##        'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
##        'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
##        'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male',
##        'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male',
##        'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female',
##        'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female',
##        'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female'],
##       dtype='<U10')

Reading

As an alternative to setting the data type explicitly, you can let genfromtxt determine it automatically by setting dtype to None and encoding to None. This works pretty well most of the time.

sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter = ",", skip_header=True, unpack = True, dtype=None, encoding=None)

sex

## array(['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
##        'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female',
##        'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male',
##        'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
##        'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
##        'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male',
##        'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male',
##        'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female',
##        'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
##        'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female',
##        'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female'],
##       dtype='<U6')
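To see what genfromtxt inferred, you can inspect the dtype of the result. This is a sketch on a two-row, in-memory stand-in for the file; the names csv_text and inferred are hypothetical.

```python
import io
import numpy as np

# A tiny stand-in for height-weight.csv, including the header row.
csv_text = "Sex,Height,Weight\nMale,182.87,76.57\nFemale,164.47,71.78\n"

inferred = np.genfromtxt(io.StringIO(csv_text), delimiter=",",
                         skip_header=1, dtype=None, encoding=None)

# Each column gets its own inferred type: text for Sex, floats for the rest.
print(inferred.dtype)
print(inferred["f1"].dtype)    # float64
```

Note that the inferred string width matches the longest entry, which is why the output above reports '<U6' rather than the '<U10' we requested by hand.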

Quick Recap - Array Filtering

Once you have performed some analysis you will want to export it into a new file.

Let’s do some quick filtering of the data based on what we learned yesterday. We can then export the results later.

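First we will make a 2D array. Because a NumPy array holds a single data type, the heights and weights are stored as strings alongside the sex labels.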


new_array = np.array([sex, height, weight])

Next we do a logical test to see which entries in our first row are Female.

to_subset = new_array[0,:]=="Female"
to_subset

## array([False, False, False, False,  True,  True,  True,  True, False,
##        False,  True,  True, False,  True,  True, False, False, False,
##        False, False, False,  True, False,  True,  True, False, False,
##        False, False, False,  True, False, False,  True, False,  True,
##        False,  True, False,  True, False,  True,  True,  True,  True,
##         True,  True,  True, False,  True,  True,  True,  True,  True,
##        False,  True,  True,  True, False, False,  True, False,  True,
##        False, False,  True, False,  True, False,  True,  True, False,
##        False,  True, False,  True, False, False,  True,  True,  True,
##         True, False,  True,  True,  True,  True, False,  True, False,
##        False, False,  True, False,  True,  True,  True,  True,  True,
##         True])

Quick Recap

We can use the boolean array we created to subset our array, based on whether they are Female or not.

subset_array = new_array[:,to_subset]
subset_array

## array([['Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
##         'Female'],
##        ['164.47', '158.27', '161.69', '165.84', '160.06', '166.48',
##         '164.7', '163.79', '161.92', '160.57', '168.24', '160.03',
##         '159.67', '163.35', '160.8', '157.95', '159.52', '163.01',
##         '165.8', '170.03', '157.16', '164.58', '163.47', '165.34',
##         '163.45', '163.97', '161.38', '160.09', '159.78', '161.57',
##         '161.83', '159.32', '161.84', '166.19', '157.01', '160.47',
##         '162.33', '158.94', '159.23', '163.94', '160.09', '162.32',
##         '162.59', '158.07', '158.35', '162.18', '159.38', '163.17',
##         '159.33', '162.65', '159.44', '164.11', '159.13', '160.58',
##         '164.88'],
##        ['71.78', '69.9', '68.85', '70.44', '72.37', '67.34', '75.69',
##         '65.76', '69.92', '63.54', '69.57', '64.3', '71.88', '70.46',
##         '64.77', '67.41', '67.99', '65.19', '71.77', '66.68', '69.64',
##         '72.99', '72.89', '70.84', '67.67', '66.71', '73.55', '65.93',
##         '68.31', '67.92', '66.03', '64.92', '69.97', '67.46', '66.56',
##         '68.2', '66.47', '65.14', '69.96', '71.47', '68.94', '72.72',
##         '69.76', '69.8', '69.72', '67.81', '70.37', '64.47', '70.68',
##         '73.99', '66.21', '70.66', '66.96', '71.49', '68.07']],
##       dtype='<U32')
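The heights and weights in the subset above are strings (dtype '<U32'), so they cannot be used in calculations directly. The sketch below shows two ways around this on made-up three-person data: masking the numeric array itself, or converting back to floats after subsetting. Here the numbers are converted to text explicitly with astype(str) to keep the example unambiguous.

```python
import numpy as np

sex = np.array(["Male", "Female", "Female"])
height = np.array([182.87, 164.47, 158.27])

# Option 1: apply the boolean mask to the numeric array, so it stays numeric.
is_female = sex == "Female"
print(height[is_female].mean())

# Option 2: if the values already sit in a string array, convert after subsetting.
as_text = np.array([sex, height.astype(str)])
female_heights = as_text[1, as_text[0, :] == "Female"].astype(float)
print(female_heights.dtype)    # float64
```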

Writing files

We can export the subset array with the savetxt function. This has similar arguments to genfromtxt. A key addition is fmt, the format in which the data is saved. You can specify scientific notation, significant digits or, in our case, simply that we will use a string.


np.savetxt("height-weight_female.csv", subset_array, delimiter=',', fmt='%s')
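For numeric arrays, fmt takes printf-style format strings. The sketch below writes the same made-up values with two decimal places and then in scientific notation; it writes to an in-memory buffer (io.StringIO) rather than a file so the result can be printed straight away.

```python
import io
import numpy as np

values = np.array([[3.676307, 10.470795],
                   [5.309033, 9.020501]])

# Fixed-point with two digits after the decimal point.
fixed = io.StringIO()
np.savetxt(fixed, values, delimiter=",", fmt="%.2f")
print(fixed.getvalue())

# Scientific notation with three digits after the decimal point.
sci = io.StringIO()
np.savetxt(sci, values, delimiter=",", fmt="%.3e")
print(sci.getvalue())
```

Passing a file name instead of the buffer writes the same text to disk.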

Complex reading/writing

Though a lot of the time we may use these NumPy approaches, there are specific functions dedicated to a variety of data types.

The most common is pandas. This is a specialized library for managing data frames. These are similar to arrays but a little more flexible for multiple data types. It also has the ability to read/write Excel spreadsheets.

BioPython and BioNumPy have a range of utilities for managing biological data types, e.g. FASTA, FASTQ, etc.

Running Scripts

Scripts

So far we have predominantly been working interactively with the console: asking it questions and getting answers back immediately.

When you want to run all your code at once, or if you want to start working on automation, you can run a whole script instead. In this case we have all our code written out in a *.py document. This is good practice for a mature analysis that you have finalized, as it ensures that you have everything properly documented.

Let's have a look at an example Python script.

data/subset_dataset.py

Example Script

This is our script:

import numpy as np

# Read in dataset
sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter = ",", skip_header=True, unpack = True, encoding=None, dtype=None)
new_array = np.array([sex, height, weight])
# Subset array
to_subset = new_array[0,:]=="Female"
subset_array = new_array[:,to_subset]
# Print out number of Females in data set and export subset
print(str(sum(to_subset)) + " are female")
np.savetxt("data/height-weight_female.csv", subset_array, delimiter=',', fmt='%s')

Comments

We can use the number/pound/hash sign (#) to indicate that everything after it on that line is “commented out”. This means that Python will not evaluate it. This gives us room to annotate our code.

Comments are a core part of good coding etiquette. If you are sharing scripts or need to figure out what you did at a later date it helps to have a short statement to explain each step of what you are doing.

The longer and more complex your code gets, the longer and more complex your comments should get.

Running scripts

There are a couple of options for running scripts. In VS Code we can simply press the Run button. It looks like a Play icon.

Running scripts

VS Code is helping us out here by running the script without us having to work directly in the terminal to start Python. You can still see what it is doing in the terminal when this runs.

More traditionally we would invoke Python directly and provide the script. You will have to do this yourself if you want to run a more complex script, e.g. one that takes arguments.

/Users/mattpaul/Desktop/miniconda3/envs/intro_to_python/bin/python /Users/mattpaul/Documents/RU/Training/Intro_To_Python/r_course/data/subset_dataset.py

Passing Arguments

To use arguments we first have to modify our script. Command-line arguments are made available by the sys library, which stores them in the list sys.argv. The first entry is the script's name. Subsequent entries are the arguments.

import sys
import numpy as np

print("My Script Name:", sys.argv[0]) 
print("My Argument:", sys.argv[1])
arg1 = sys.argv[1]

# Read in dataset
sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter = ",", skip_header=True, unpack = True, encoding=None, dtype=None)
new_array = np.array([sex, height, weight])
# Subset array
to_subset = new_array[0,:]==arg1
subset_array = new_array[:,to_subset]
# Print out number of matching entries in data set and export subset
print(str(sum(to_subset)) + " are " + arg1)
np.savetxt("data/height-weight_"+arg1+".csv", subset_array, delimiter=',', fmt='%s')

Passing Arguments

To pass the argument to our script when we run it, we simply add it after our python script.

/Users/mattpaul/Desktop/miniconda3/envs/intro_to_python/bin/python /Users/mattpaul/Documents/RU/Training/Intro_To_Python/r_course/data/subset_dataset_args.py Male
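If the script is run without an argument, sys.argv[1] does not exist and Python stops with an IndexError. One possible guard is sketched below; the helper name get_group and the wording of the usage message are illustrative assumptions, not part of the course script.

```python
import sys


def get_group(argv):
    """Return the first command-line argument, or exit with a usage message."""
    if len(argv) < 2:
        # sys.exit with a string prints it to the terminal and stops the script.
        sys.exit("Usage: python " + argv[0] + " <group>")
    return argv[1]


# In a real script this would be called as: group = get_group(sys.argv)
# Here we pass a hand-made list so the example runs anywhere.
print(get_group(["subset_dataset_args.py", "Male"]))    # Male
```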

Script Messages

You should see that when this runs we create a new file. Also, even though we do not have a Python console open, the output of the print statements is returned to the terminal.

When you have long and complex scripts these statements become even more important as checkpoints to make sure your code is working as expected. Any important result should be saved to a file, though, as these screen messages do not persist.

An extra note

Keeping your code tidy can be tedious. There are many ways to store your code. Most of the time we are not writing scripts that are production level. Instead you will be doing an analysis of a data set and making decisions in an interactive manner.

Notebooks give you a means to tie the code, the analysis decisions and the result of the code together into a single file.

There are several notebook options for Python. Jupyter Notebook is the best known. Quarto is growing in popularity as it is language agnostic.

Time for an exercise!

Exercise on the data types we have covered so far can be found here

Answers to the exercise

Answers can be found here

Further Support

When you hit bugs:
* Google/ChatGPT/Claude, etc.
* Stack Overflow
* Biostars
* Reach out on GitHub

Other Reference Material:
* Harvard’s Python Course
* Geeks For Geeks

Intro to Python - Session 2

Bioinformatics Resource Center - Rockefeller University

http://rockefelleruniversity.github.io/Intro_To_Python/

brc@rockefeller.edu
