Python is a high-level, general-purpose programming language.
Guido van Rossum began working on Python in the late 1980s as a successor to the ABC programming language and first released it in 1991.
One of the most noticeable differences comes from the emphasis on code readability:
it uses significant indentation.
Python 3.0, released in 2008, was a major revision not completely backward-compatible with earlier versions. Python 2.7 and older is officially unsupported but some tools require it.
Python has a huge user base across many disciplines. These users have developed a wide range of libraries which you can use.
Because Python is used in so many fields, it is a useful language for adapting novel approaches to bioinformatics, such as deep learning methods for scRNAseq or image analysis.
Though R has long been the cornerstone of bioinformatics, Python is growing in use.
Core utility packages such as Biopython have been around a long time, but it is the newer technique-specific packages, e.g. Scanpy, that are driving the surging popularity of Python.
The strengths discussed when considering these languages are clear. That said, in both cases Python and R can handle their supposed weaknesses.
| R | Python |
|---|---|
| Plotting | Large Data |
| Statistics | Machine Learning |
| Bioconductor | Most Popular Language |
Realms of Python specifically relevant to those who are interested in Bioinformatics include Biopython, PyMOL, sciKit, scanPy, and Image Analysis.
All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Once the zip file is unarchived, all presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.
Many laptops come with Python installed, so getting started is easy. You can simply type python into Terminal or Command Prompt and a Python console will open. This is an interactive Python session. This will work for most of what we will do in the next few sessions, as we are working on the basics.
Each version of python and python packages will respond slightly differently to commands given. We therefore want to make sure when you use python for analysis that you control which version you use. This is important for reproducibility.
We will run a custom install of python using conda. Conda manages software and packages. Using Conda should make installations much easier and allow us to keep track of software versions.
We can install Conda from here. We will specifically want to use Miniconda:
https://conda.io/projects/conda/en/latest/user-guide/install/index.html
If you have installed this correctly, running this on terminal/command prompt should give you a list of conda commands
Conda is built on the idea of environments. An environment is a directory that contains a specific collection of Conda packages that you have installed. For today we will make sure we have python and the python packages we need for the training.
The first step is we create a new environment. We will then activate it to expose the environment.
Lastly we can install python.
We also want some specific Python packages. These mostly contain functions that are not present in the base Python distribution.
You can run python from within your terminal/command prompt. This will give you access to the python console.
Many people prefer to use an Integrated Development Environment to augment their experience while coding. They allow easy writing of scripts, visualization of plots, file navigation, and access to many customization features in additional panes.
Examples include RStudio, pyCharm, Xcode etc.
We will be using Visual Studio Code from Microsoft.
We are now all set up. We will now create a python script. Just click
New File... on the Welcome Page (or File > New File…),
then choose python script.
This scripting panel is where we will type code, write notes and build scripts.
As we mentioned before, we will be working with Python interactively. This means we need a Python console to work with. This is where the code is actually evaluated.
We can manually open one up by Terminal > New Terminal.
Once the Terminal is open, you can then just open Python, by typing
python into the new terminal window.
There is an easier shortcut to do this. Once you start developing
code, the easiest way to open it is to use the Shift + Enter
shortcut.
VS Code is an IDE, so as well as letting us access and run Python it lets us do several other things from the same portal, i.e.
We won’t dive fully into everything, but getting familiar early is best.
As with many languages you can work with python in two main ways: interactive or scripts.
When people think about coding they are often thinking about the interactive console. This is what we saw earlier when we first opened python. When you work in this way lines of code are submitted as you enter them. You often do this when you are developing an analysis and trying out parameters. This is mostly how we will work in the training.
When you want to automate something, i.e. an analysis workflow, you will write a script. You can then run this script with python and it will run every line of code for you sequentially.
At its simplest you can just use python as a fancy calculator:
## 2
## 10
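The input lines that produced these two outputs are not shown above. As a minimal sketch, expressions such as the following would print the same values. The specific expressions and variable names are assumptions chosen for illustration.

```python
# Basic arithmetic, as you might type it at the Python console.
# These expressions are illustrative; the original inputs are not shown.
total = 1 + 1      # addition
product = 2 * 5    # multiplication

print(total)       # 2
print(product)     # 10
```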
To take things further, there are many functions built in to Python. These are saved chunks of code that will do a task based on the arguments you provide. You can recognize a function call by a name immediately followed by a set of parentheses:
myfunction()
Here we use the round function:
## 3
To get help with a function you can use the help()
function. This will open up the help page for this function. Hopefully
this will contain information about what arguments are accepted and
what is returned by the function. In this case we can see there is an
additional optional argument ndigits. This has a default
value of None.
## Help on built-in function round in module builtins:
##
## round(number, ndigits=None)
## Round a number to a given precision in decimal digits.
##
## The return value is an integer if ndigits is omitted or None. Otherwise
## the return value has the same type as the number. ndigits may be negative.
If we want to update our rounding result to allow for more decimal places we can add the second argument. For a simple function like this, as long as the order is correct we do not need to name the arguments.
## 3.142
We can still run a function with disordered arguments by naming them.
## 3.142
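The rounding calls described above can be sketched as follows. The input value 3.14159 is taken from the outputs shown later in this section, and the variable names are hypothetical.

```python
# With one argument, round() returns an integer.
rounded_int = round(3.14159)

# Positional arguments: order matters (number first, then ndigits).
rounded_pos = round(3.14159, 3)

# Named (keyword) arguments can be supplied in any order.
rounded_kw = round(ndigits=3, number=3.14159)

print(rounded_int)  # 3
print(rounded_pos)  # 3.142
print(rounded_kw)   # 3.142
```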
Often you will want to save something in your environment for use
later on. We do this by creating a variable by assignment with the
= sign.
## 'Hello!'
## 3.14159
When assigning a variable there are certain things you can and cannot do.
Good:
Bad:
The best thing to do is name it something short and simple, that makes sense.
Once we have a variable we can then use it inside functions. The variable name is acting as an alias for what it contains.
## 3.14159
## 3
There are many kinds of variables. The most basic types are:
str, float, int and
bool. You can always check what kind you have with the
type() function.
## <class 'str'>
## <class 'float'>
## 3
## <class 'int'>
## <class 'bool'>
We can manually set the type using: str(),
float(), int() and bool().
## '3.14159'
## 3.14159
## 3
## True
These functions do not always work: if there is no clear way to convert the value to the requested type, you will get an error.
## could not convert string to float: 'Hello!'
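A minimal sketch of checking and converting types, including the failing conversion shown above. The variable names are assumptions; the values match the outputs in this section.

```python
my_string = "Hello!"
my_float = 3.14159

print(type(my_string))  # <class 'str'>
print(type(my_float))   # <class 'float'>

# Converting between types.
as_text = str(my_float)   # '3.14159'
as_int = int(my_float)    # 3 (the decimal part is dropped, not rounded)
as_bool = bool(my_float)  # True (any non-zero number is True)

# Some conversions have no sensible answer and raise an error.
try:
    float(my_string)
except ValueError as err:
    message = str(err)
    print(message)  # could not convert string to float: 'Hello!'
```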
You will run into errors coding.
DON’T PANIC.
Most of the time the error messages are very clear. And if they are not a quick google will often clear it up.
In this case we can break it down:
Strings can be concatenated easily.
## 'Hi there'
## 'Hi there'
## 'Hi there'
## 'HiHiHiHiHi'
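The code behind these outputs is not shown. The sketch below gives three common ways to build 'Hi there', plus string repetition. Whether these are the same three approaches used originally is an assumption, as are the variable names.

```python
greeting = "Hi"
name = "there"

# The + operator joins strings end to end.
joined_plus = greeting + " " + name

# str.join() glues a list of strings together with a separator.
joined_join = " ".join([greeting, name])

# f-strings insert variables directly into text.
joined_fstring = f"{greeting} {name}"

# The * operator repeats a string.
repeated = greeting * 5

print(joined_plus)     # Hi there
print(joined_join)     # Hi there
print(joined_fstring)  # Hi there
print(repeated)        # HiHiHiHiHi
```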
Exercise on the data types we have covered so far can be found here
Answers can be found here
Python has many options for storing data. The simplest is a list.
A list has a few key characteristics:
* the order of the elements matters and can be used for indexing
* they are mutable and dynamic (elements can be modified and length can be changed)
* they can hold mixed types of data
Lists are denoted with square brackets.
## ['a', 'b', 'c', 'd', 'e']
## [1, 2, 3, 4, 5]
## [1.1, 2.2, 3.3, 4.4, 5.5]
We can also use the square brackets to extract specific values from our list.
## 'c'
Here we get the third value from our list using the number 2. That is because python uses zero indexing; counting in python starts at 0, not 1.
## 'a'
Sometimes we have a long list, but we know we want the final value.
We can use a - to indicate how far from the end we want to
index.
## 'e'
We can also create a sublist by slicing our list with the
:.
## ['c', 'd']
## ['c', 'd']
## ['c', 'd', 'e']
Key point: You’ll notice that slicing in Python is inclusive of the
first element, but exclusive of the last element. So in our example,
my_strs[2:4] starts with my_strs[2] but does
not include my_strs[4].
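The indexing and slicing rules above can be sketched with the my_strs list. The exact slices used originally are not shown, so the ones below are assumptions chosen to give the same outputs.

```python
my_strs = ['a', 'b', 'c', 'd', 'e']

third = my_strs[2]         # zero indexing: position 2 is the third element
first = my_strs[0]
last = my_strs[-1]         # negative indexes count back from the end

middle = my_strs[2:4]      # includes index 2, excludes index 4
from_end = my_strs[-3:-1]  # the same slice, counted from the end
to_end = my_strs[2:]       # leaving out the stop runs to the end of the list

print(third, first, last)  # c a e
print(middle)              # ['c', 'd']
print(from_end)            # ['c', 'd']
print(to_end)              # ['c', 'd', 'e']
```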
Lists are general containers for a variety of data types. This means you can make a list of lists!
## ['a', ['b1', 'b2'], ['c1', ['c2']]]
We can still use indexing to deal with this mess of lists
## 'c2'
We can use the assignment we have been using this whole time to break open this nested list structure.
## ['a', ['b1', 'b2'], ['c1', ['c2']]]
## 'a'
## ['b1', 'b2']
## ['c1', ['c2']]
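A short sketch of indexing into and unpacking the nested list shown above. The variable names are hypothetical.

```python
nested = ['a', ['b1', 'b2'], ['c1', ['c2']]]

# Chain square brackets to step down one level at a time:
# nested[2] is ['c1', ['c2']], nested[2][1] is ['c2'], nested[2][1][0] is 'c2'.
deep_value = nested[2][1][0]
print(deep_value)  # c2

# Assignment can unpack one level of the list into separate variables.
# The number of names on the left must match the number of elements.
part_a, part_b, part_c = nested
print(part_a)  # a
print(part_b)  # ['b1', 'b2']
print(part_c)  # ['c1', ['c2']]
```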
We can concatenate lists, just as we did with strings.
## [1, 2, 3, 4, 5, 'a', 'b', 'c', 'd', 'e']
## [1, 2, 3, 4, 5, 'a', 'b', 'c', 'd', 'e']
## ['a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e', 'a', 'b', 'c', 'd', 'e']
There are many useful functions for working with lists. Many of these functions work directly on the list. This means you don’t need to assign the result back to the object. The structure is VARIABLE.function(). Functions attached to an object in this way are called methods, which are a kind of attribute. Here we use the append function to extend our list.
## ['a', 'b', 'c', 'd', 'e', 'f']
## ['a', 'b', 'c', 'd', 'e', 'f', 1]
There are many useful functions built into the base version of python.
.insert() inserts an argument into a specific position in the list
## ['a', 'b', 'c', 'c', 'd', 'e', 'f', 1]
.remove() removes something from the list, but will only remove the first instance
## ['a', 'b', 'c', 'd', 'e', 'f', 1]
.index() reveals which position in the list is the supplied argument
## 2
The del statement removes the element at a specific position in the list
## ['a', 'b', 'c', 'e', 'f', 1]
The in operator tests whether a value is present in the list
## True
.sort() will sort your list. This works both with numerical and string data.
## [1, 4, 4, 6, 9, 11, 12]
## [12, 11, 9, 6, 4, 4, 1]
## ['a', 'b', 'c']
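The list operations above can be chained into one runnable sketch. The specific arguments (for example inserting 'c' at position 2) are assumptions chosen to reproduce the outputs shown.

```python
my_strs = ['a', 'b', 'c', 'd', 'e']

my_strs.append('f')     # add to the end
my_strs.append(1)       # lists can hold mixed types
my_strs.insert(2, 'c')  # insert 'c' at position 2
print(my_strs)          # ['a', 'b', 'c', 'c', 'd', 'e', 'f', 1]

my_strs.remove('c')     # removes only the first 'c'
position = my_strs.index('c')
print(position)         # 2

del my_strs[3]          # delete the element at position 3 ('d')
print(my_strs)          # ['a', 'b', 'c', 'e', 'f', 1]

present = 'a' in my_strs
print(present)          # True

numbers = [1, 4, 9, 4, 11, 12, 6]
result = numbers.sort() # sorts in place and returns None
print(numbers)          # [1, 4, 4, 6, 9, 11, 12]
numbers.sort(reverse=True)
print(numbers)          # [12, 11, 9, 6, 4, 4, 1]
```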
You may have noticed that the sort function does not
actually return the sorted list. It returns ‘None’ and modifies the list
in place. This might be different from other languages you have
used.
Some python functions do return the modified object and leave the
original object unchanged. We can try this with the sorted
function.
new_list = [1,4,9,4,11,12,6]
sorted_result = sorted(new_list)
sorted_result # a new, sorted list is returned
## [1, 4, 4, 6, 9, 11, 12]
new_list # the original list is unchanged
## [1, 4, 9, 4, 11, 12, 6]
Tuples are another type of object in python. They look and behave a lot like lists. But where lists are dynamic and mutable, tuples cannot be changed. As a result tuples are more memory efficient than lists.
When making a tuple you use parentheses instead of square brackets.
## ['z', 'b', 'c', 'd', 'e']
## 'tuple' object does not support item assignment
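A minimal sketch of the difference in mutability between a list and a tuple. The variable names are assumptions.

```python
my_list = ['a', 'b', 'c', 'd', 'e']   # square brackets: list
my_tuple = ('a', 'b', 'c', 'd', 'e')  # parentheses: tuple

# Lists are mutable, so an element can be replaced.
my_list[0] = 'z'
print(my_list)  # ['z', 'b', 'c', 'd', 'e']

# Tuples are immutable, so the same operation raises a TypeError.
try:
    my_tuple[0] = 'z'
except TypeError as err:
    tuple_error = str(err)
    print(tuple_error)  # 'tuple' object does not support item assignment
```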
As with str/int/float/bool you can easily convert a list to a tuple
and vice versa. Simply use the list() and
tuple() functions.
## ('a', 'b', 'c', 'd', 'e')
## <class 'tuple'>
## ['a', 'b', 'c', 'd', 'e']
## <class 'list'>
Remember!
Square brackets [] for indexing and lists.
Parentheses () for functions and tuples.
Another data type is the dictionary. Dictionaries are made of key:value pairs.
This structure allows the organization of your data, and gives you the ability to grab out values using the key.
Dictionaries are made with curly brackets. Each entry consists of
a pair of objects: the key identifier and the
value. Here you can see we have multiple types and shapes
of data contained in our values.
my_dict = {
    'my_list': [1,2,3],
    'my_tuple': (4,5,6),
    'language': 'python',
    'technique': 'scRNAseq'
}
my_dict
## {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6), 'language': 'python', 'technique': 'scRNAseq'}
There are attribute functions that we can use to access the keys and values from our dictionary.
## dict_keys(['my_list', 'my_tuple', 'language', 'technique'])
## dict_values([[1, 2, 3], (4, 5, 6), 'python', 'scRNAseq'])
We can index our dictionary using the key values and the square brackets, similar to other objects.
## [1, 2, 3]
We can also use the .get() attribute.
## 'python'
## 'python'
Unlike lists, dictionaries cannot be subset with a numeric index and must be indexed with a key value.
## 0
It is easy to add additional entries with the
.setdefault() attribute. We just provide a new
key/value pair.
## True
## {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6), 'language': 'python', 'technique': 'scRNAseq', 'metadata': True}
We check our addition using the in operator. This
performs a logical test. We can test specifically on the keys, or the
dictionary as a whole.
## True
## True
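The dictionary operations described in this section can be sketched together as follows. The dictionary is the one defined above. The fallback example with a missing key is an addition for illustration.

```python
my_dict = {
    'my_list': [1, 2, 3],
    'my_tuple': (4, 5, 6),
    'language': 'python',
    'technique': 'scRNAseq',
}

keys = list(my_dict.keys())      # all keys
values = list(my_dict.values())  # all values

by_key = my_dict['my_list']      # square brackets with a key
by_get = my_dict.get('language') # .get() does the same lookup

# .get() can return a fallback instead of raising KeyError for a missing key.
fallback = my_dict.get('not_a_key', 'unknown')

# .setdefault() adds the pair if the key is absent and returns the value.
added = my_dict.setdefault('metadata', True)

# The in operator tests keys, either directly or via .keys().
in_dict = 'metadata' in my_dict
in_keys = 'metadata' in my_dict.keys()

print(by_key, by_get, fallback)  # [1, 2, 3] python unknown
print(added, in_dict, in_keys)   # True True True
```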
Often we want to stick multiple dictionaries together.
There are 3 options:
## {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6), 'a': 1, 'b': 2}
## {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6), 'a': 1, 'b': 2}
## {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6), 'a': 1, 'b': 2}
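The three options are not listed above. As a sketch, here are three common ways to merge dictionaries that all give the output shown. Whether these match the original three is an assumption, as are the names dict1 and dict2.

```python
dict1 = {'my_list': [1, 2, 3], 'my_tuple': (4, 5, 6)}
dict2 = {'a': 1, 'b': 2}

# Option 1: .update() modifies a dictionary in place, so work on a copy
# if you want to keep the original.
merged_update = dict1.copy()
merged_update.update(dict2)

# Option 2: unpack both dictionaries into a new one with **.
merged_unpack = {**dict1, **dict2}

# Option 3: the | operator (requires Python 3.9 or later).
merged_pipe = dict1 | dict2

print(merged_update)
print(merged_unpack)
print(merged_pipe)
```

In all three cases, if the same key appears in both dictionaries the value from the second one wins.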
The last of the base Python object types we will look at is the set. Sets are unordered
and each entry is unique. Sets can be created using curly brackets,
or by coercing another object using the set() function.
## {'c', 'b', 'a'}
## {'c', 'b', 'a'}
The .add() and .remove() attributes allow the easy
modification of sets.
## {'c', 'b', 'a', 'd'}
## {'c', 'b', 'a'}
As sets have no order they can’t be subset in the same way that other objects can be.
## 'set' object is not subscriptable
Even if you provide duplicate entries to a set, the set will only contain unique values.
## {'c', 'b', 'a'}
Sets are really useful for checking intersections between two objects.
## {3, 4}
## {1, 2, 3, 4, 5, 6}
## {1, 2}
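A sketch of the set operations covered above. The example sets are assumptions chosen to reproduce the three outputs (intersection, union and difference).

```python
# Create a set directly, or coerce a list (duplicates are dropped).
my_set = {'a', 'b', 'c'}
from_list = set(['a', 'b', 'c', 'a'])

my_set.add('d')     # add an element
my_set.remove('d')  # remove it again

set1 = {1, 2, 3, 4}
set2 = {3, 4, 5, 6}

common = set1 & set2      # intersection, same as set1.intersection(set2)
combined = set1 | set2    # union, same as set1.union(set2)
only_first = set1 - set2  # difference, same as set1.difference(set2)

print(common)      # {3, 4}
print(combined)    # {1, 2, 3, 4, 5, 6}
print(only_first)  # {1, 2}
```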
Exercise on the data types we have covered so far can be found here
Answers can be found here
Many of the data types we have looked at thus far are either one-dimensional, or get quite complex when built up into multidimensional data frames. These can become relatively slow and cumbersome to work with if you have large datasets.
NumPy is a Python library used for working with arrays, that are
common in biological data. It is not included in the base distribution
of Python so it has to be installed and loaded in separately. We
installed NumPy earlier. Here we load it into our python session with
import.
Often an alias is used when you import a library. Here we are
importing NumPy as np.
Within our imported NumPy library we have many different functions.
Here we will use the array() function to create an array.
In this case we are essentially creating a list (square brackets),
then coercing it into an array.
## <class 'numpy.ndarray'>
## array([1, 2, 3, 4, 5])
In most data objects we have looked at so far there are limited types
of data: str, float, int and
bool. Arrays accept all of these. We can always check
the type with the dtype attribute.
## dtype('int64')
## dtype('bool')
When you create the array you can specify what type you want the data to be. This can coerce the input data … within reason.
## array([b'34', b'29', b'40'], dtype='|S2')
## invalid literal for int() with base 10: 'a'
Arrays contain only one data type. They will accept lists of different data types, but these elements will be coerced into a common type.
## array(['a', '2', '3'], dtype='<U21')
Typically we think about 2D arrays, as this is often the rectangular data we deal with. It is possible to create many kinds of arrays, with differing dimensionality. They can be 1D, 2D, 3D etc.
Here we again create a list to coerce into an array. This time we have a list of lists. Each list will become equivalent to a row in our array.
## array([['Patient1', '34', 'True'],
## ['Patient2', '29', 'True'],
## ['Patient3', '41', 'False']], dtype='<U21')
We can find out the dimension attribute by using ndim.
In this case we have rectangular data so it is two dimensions.
## 2
We can confirm the shape of the array using the shape
attribute.
## (3, 3)
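A minimal sketch of building the 2D array shown above and inspecting it. The variable name patients is an assumption; the values are taken from the printed array.

```python
import numpy as np

# A list of lists: each inner list becomes one row of the array.
# The mixed types are coerced to a common string type.
patients = np.array([
    ['Patient1', 34, True],
    ['Patient2', 29, True],
    ['Patient3', 41, False],
])

print(patients.ndim)   # number of dimensions: 2
print(patients.shape)  # (rows, columns): (3, 3)
```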
The shape of an array can easily be changed with the
reshape method. Note that you will get an error if the
number of values in the array doesn’t fit into the dimensions
specified.
## array([['Patient1'],
## ['34'],
## ['True'],
## ['Patient2'],
## ['29'],
## ['True'],
## ['Patient3'],
## ['41'],
## ['False']], dtype='<U21')
We can use the same square brackets we used for other data objects to index our arrays. The big difference is we now have 2 dimensions. We therefore need to provide 2 indexes, separated by a comma.
The first number will correspond to row. The second number will correspond to column.
## np.str_('True')
We can also do more complex indexing operations like slicing, to get ranges of values from our array.
## array(['True', 'True', 'False'], dtype='<U21')
## array(['34', 'True'], dtype='<U21')
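The indexing calls behind these outputs are not shown. The sketch below uses indexes that are assumptions chosen to return the same values, on the same hypothetical patients array.

```python
import numpy as np

patients = np.array([
    ['Patient1', 34, True],
    ['Patient2', 29, True],
    ['Patient3', 41, False],
])

# Row index first, column index second.
single = patients[0, 2]      # row 0, column 2

# A bare colon means "everything along this dimension".
column = patients[:, 2]      # every row, column 2

# Slices work as they do for lists: start included, stop excluded.
row_part = patients[0, 1:3]  # row 0, columns 1 and 2

print(single)    # True (stored as a string)
print(column)    # ['True' 'True' 'False']
print(row_part)  # ['34' 'True']
```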
Booleans can be used to directly subset arrays. True
entries are kept.
## array([['Patient1', '34', 'True'],
## ['Patient2', '29', 'True'],
## ['Patient3', '41', 'False']], dtype='<U21')
## array([['Patient1', '34', 'True'],
## ['Patient3', '41', 'False']], dtype='<U21')
We can use this along with logical testing to subset our arrays. Let’s look back at our 2D array. We want to subset this based on the patient age (the second column) i.e. all patients over 30.
## array([['Patient1', '34', 'True'],
## ['Patient2', '29', 'True'],
## ['Patient3', '41', 'False']], dtype='<U21')
It was read in as ‘<U21’. This is a type of string. We need to coerce the second column to an integer to be able to run a logical expression.
## array([34, 29, 41], dtype=int32)
## array([ True, False, True])
## array([['Patient1', '34', 'True'],
## ['Patient3', '41', 'False']], dtype='<U21')
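The steps above can be sketched as follows, again assuming the array is called patients. The names ages and over_30 are also hypothetical.

```python
import numpy as np

patients = np.array([
    ['Patient1', 34, True],
    ['Patient2', 29, True],
    ['Patient3', 41, False],
])

# Step 1: pull out the age column and convert it from strings to integers.
ages = patients[:, 1].astype(int)

# Step 2: a logical test on an array gives one boolean per element.
over_30 = ages > 30

# Step 3: use the boolean array to keep only the rows where it is True.
older_patients = patients[over_30]

print(ages)            # [34 29 41]
print(over_30)         # [ True False  True]
print(older_patients)  # the Patient1 and Patient3 rows
```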
Doing this kind of logical operation and subsetting on other data objects can be tricky. Many objects do not support working like this. Instead they use a process called list comprehension.
We will not go into this here, but it is a useful tool for performing a repeated action for each data point across an entire list i.e. checking if it is equal to a given value.
NumPy arrays can easily be joined together with the
concatenate function. These are simple 1D arrays.
2D arrays can be merged just as easily. The orientation of the merge
can be controlled using the axis argument.
arr1 = np.array([[1, 2], [3, 4]])
arr2 = np.array([[5, 6], [7, 8]])
arr = np.concatenate((arr1, arr2), axis=1)
arr
## array([[1, 2, 5, 6],
## [3, 4, 7, 8]])
## array([[1, 2],
## [3, 4],
## [5, 6],
## [7, 8]])
NumPy is not just for arrays. Many mathematical functions are already included in base Python (we met round() earlier). When you import NumPy you gain access to a lot more.
Mathematical constants:
* pi - np.pi

Mathematical functions:
* Exponents and logs - np.exp(my_array), np.log(my_array)
* Powers and roots - np.sqrt(my_array)
* Trigonometry - np.sin(my_array)
* Element-wise operators - np.add(my_array1, my_array2)
So which data objects do you use?
As with most of programming there are often multiple ways to do things and often the optimal data object will be very context dependent. Most of the time you will be working with different Python libraries and functions, each with different preferences for the input/output data object. This will help define which object is appropriate for you.
Here is a rough guide:
Often we will want to define our own functions. This way we can easily repeat the same process.
Key parts:
* function name
* arguments
* code
* return values (sometimes)

In the example:
* def - indicates you will define a function.
* function name - in this case myFirstFunction.
* arguments - any arguments that the function expects, in parentheses.
* code - the start of the code is indicated by a colon, new line and indentation.
* return - this is the result that you wish to be delivered from the function.

NOTE: In Python functions, indentation after defining the function is required. Otherwise you will get an error.
def myFirstFunction(num1, num2):
    sumNum = num1 + num2
    return sumNum
myResult = myFirstFunction(num1=2, num2=3)
myResult
## 5
A return statement hands back a single object from a function. Here we calculate the product of our numbers and try to return it alongside our sum by writing both names after return. Python won’t even let us define the function.
def myFirstFunction(num1, num2):
    sumNum = num1 + num2
    multipleNum = num1*num2
    return sumNum multipleNum
## File "<string>", line 1
## return sumNum multipleNum
## IndentationError: unexpected indent
A simple solution is to pass back an object that contains both results. Here we create a quick list.
def myFirstFunction(num1, num2):
    sumNum = num1 + num2
    multipleNum = num1*num2
    return [sumNum, multipleNum]
myResult = myFirstFunction(num1=2, num2=3)
myResult
## [5, 6]
In a function containing a return statement, the code up until the return statement is evaluated and anything after the return statement is not evaluated.
def myFirstFunction(num1, num2):
    sumNum = num1 + num2
    multipleNum = num1*num2
    print("Before return")
    return [sumNum, multipleNum]
    print("After return")
myFirstFunction(num1=2, num2=3)
## Before return
## [5, 6]
If a function does not contain a return statement, it returns None. In some other languages this is not the case.
Variables that are defined in the arguments or within the function exist only within the environment of the function. If we try and use the argument outside of the function it will not work.
def myFirstFunction(num1, num2):
    sumNum = num1 + num2
    multipleNum = num1*num2
    return [sumNum, multipleNum]
myFirstFunction(num1=2, num2=3)
## [5, 6]
sumNum
## name 'sumNum' is not defined
If a function reassigns an argument that shares its name with a variable in the global environment, the global variable is not updated.
num3 = 4
def myFirstFunction(num1, num2, num3):
    num3 = num1+num2+num3
    return num3
myFirstFunction(num1=2, num2=3, num3=num3)
## 9
num3
## 4
Functions have local scope. This means they have access to global variables (which can be used anywhere) and local variables which were made within the function.
Once you exit the function you are back to a global scope. Local variables from the function can not be accessed at this point.
Code in a function’s local scope cannot use variables in any other local scope i.e. between functions.
Though it is possible to have local and global variables with the same name, try and give everything unique names so you can keep track of everything.
Functions can have defaults for their arguments which will be used when arguments are not specified.
## 4
## 15
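The function behind these two outputs is not shown. As a minimal sketch, a hypothetical function with one default argument could produce them as follows. The function name, arguments and default value are all assumptions.

```python
# num2 has a default value of 2, so it can be left out when calling.
def multiplyNumbers(num1, num2=2):
    return num1 * num2

# Only num1 is supplied, so the default num2=2 is used.
with_default = multiplyNumbers(num1=2)

# Supplying num2 overrides the default.
with_override = multiplyNumbers(num1=3, num2=5)

print(with_default)   # 4
print(with_override)  # 15
```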
Once I have made a function and I want to keep reusing it I can easily save it i.e. I have a function that I want to use regularly to process some data in the same way. To do this you save it as a script.
First open a new script: File > New File > Python Script. We can then add our original function and save the python script as myFirstFunction.py.
Exercise on the data types we have covered so far can be found here
Answers can be found here
There are several ways to control how your code is evaluated. There are two main classes:
While I’m analyzing data, if I need to execute complex statistical procedures on the data I will use Python else I will use a calculator.
Conditional branching is the evaluation of a logical to determine whether a chunk of code is executed.
In Python, we use the if statement with the logical to be evaluated immediately after. The dependent code is indicated by a colon, new line and indentation.
## x is true
More often, we construct the logical value within the if statement itself. This can be termed the condition.
## The value of x is 10 which is greater than 4
The message is printed above because x is greater than y.
x is now no longer greater than y, so no message is printed.
We really still want a message telling us what was the result of the condition.
If we want to perform an operation when the condition is false we can
follow the if statement with an else
statement.
## 3 is less than 5
## 10 is greater than or equal to 5
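The if/else pattern described above can be sketched as below. Wrapping the test in a function named compareToFive is an assumption made so that the same code can be run on both example values.

```python
def compareToFive(x):
    # The condition after "if" is evaluated first.
    if x < 5:
        # Runs only when the condition is True.
        return f"{x} is less than 5"
    else:
        # Runs only when the condition is False.
        return f"{x} is greater than or equal to 5"

print(compareToFive(3))   # 3 is less than 5
print(compareToFive(10))  # 10 is greater than or equal to 5
```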
We may wish to execute different procedures under multiple
conditions. This can be controlled using the elif following
an initial if statement.
x = 5
if x < 5:
    print(x, "is less than 5")
elif x > 5:
    print(x, "is greater than 5")
else:
    print(x, "is 5")
## 5 is 5
While and for loops iterate over a block of code, and keep rerunning it.
While loops do this while a specific condition is met
(or until that condition is not met).
For loops will do this for a given number of
iterations.
While loops have a similar structure to if statements. We start by designating the while loop, then follow with the logical to be evaluated immediately after. The dependent code is indicated by a colon, new line and indentation.
## x is 1
## x is 2
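A minimal while loop that would print the two lines above. The starting value and the condition are assumptions, since the original code is not shown.

```python
x = 1
# The indented block reruns for as long as the condition is True.
while x < 3:
    print("x is", x)
    # Update x on every pass; without this the condition would never
    # become False and the loop would run forever.
    x = x + 1
```

When the loop finishes, x is 3, which is the first value that fails the condition.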
For loops do not have a conditional. Instead you supply an object that you want to iterate over. This can be a list, tuple, dictionary, set or string. Here we use a list.
## Alpha
## Bravo
## Charlie
The range() function provides us with a nice input for our for loops. It returns a sequence of numbers, starting from 0 and stops before the specified number.
## i is 0
## i is 1
## i is 2
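The two for loops described above can be sketched as follows. The list name and loop variable names are assumptions.

```python
words = ["Alpha", "Bravo", "Charlie"]

# Loop directly over the elements of a list.
for word in words:
    print(word)

# range(3) yields 0, 1 and 2: it starts at 0 and stops before 3.
indexes = []
for i in range(3):
    print("i is", i)
    indexes.append(i)
```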
When we have a numeric range, we can use it to index out from existing objects. This often allows for more complex code evaluation.
geneName = ["Ikzf1","Myc","Igll1"]
expression = [10.4, 4.3, 6.5]
iterations = len(geneName)
for i in range(iterations):
    print(geneName[i], "has a TPM of", expression[i])
## Ikzf1 has a TPM of 10.4
## Myc has a TPM of 4.3
## Igll1 has a TPM of 6.5
Loops can be combined with conditional statements to allow for complex control of their execution over Python objects.
To help us write complex code we often use pseudocode as a starting point.
When we write pseudocode we are trying to write out each computational step in a human readable way.
It is important to be specific, simple, concise and include the control structures that would be in your final code.
for 0 to 7
    if value is greater than 5
        print the value and a statement saying it is greater than 5
    else if value is 5
        print the value and a statement saying it is equal to 5
    else if value is less than 5
        print the value and a statement saying it is less than 5
Though these can be tough to read and create, by starting with pseudocode and keeping an eye on the hierarchy of indentation we can follow the logic.
for i in range(8):
    if i > 5:
        print("Number",i,"is greater than 5")
    elif i == 5:
        print("Number",i,"is 5")
    else:
        print("Number",i,"is less than 5")
## Number 0 is less than 5
## Number 1 is less than 5
## Number 2 is less than 5
## Number 3 is less than 5
## Number 4 is less than 5
## Number 5 is 5
## Number 6 is greater than 5
## Number 7 is greater than 5
We can use conditionals together with break to exit a loop early once a condition is satisfied, much as a while loop stops when its condition fails.
x = range(8)
for i in x:
    if i > 5:
        print("Number",i,"is greater than 5")
    elif i == 5:
        print("Number",i,"is 5")
        break
    else:
        print("Number",i,"is less than 5")
## Number 0 is less than 5
## Number 1 is less than 5
## Number 2 is less than 5
## Number 3 is less than 5
## Number 4 is less than 5
## Number 5 is 5
Exercises around control structures can be found here
Answers can be found here
When you hit bugs:
* Google/ChatGPT/Claude, etc.
* Stackoverflow
* Biostars
* Reach out on GitHub

Other Reference Material:
* Harvard’s Python Course
* Geeks For Geeks