Session 1 covered introduction to R data types and import/export of data.
Most of the time, you will not be generating data in Python but will be importing data from external files.
A standard format for this data is a table:
| Gene_Name | Sample_1.hi | Sample_2.hi | Sample_3.hi |
|---|---|---|---|
| Gene_a | 3.676307 | 3.682774 | 2.976411 |
| Gene_b | 5.309033 | 5.543664 | 2.966018 |
| Gene_c | 2.865454 | 4.093604 | 5.648436 |
| Gene_d | 3.950261 | 4.979192 | 3.563585 |
| Gene_e | 10.470795 | 10.601351 | 10.012884 |
| Gene_f | 9.020501 | 8.526737 | 10.610281 |
| Gene_g | 11.617416 | 9.318539 | 9.223805 |
| Gene_h | 9.730001 | 8.579512 | 10.906766 |
If you are reading in data, you need to know from which vantage point your computer is looking. The view point of your computer is the Working Directory.
The os (operating system) package has several utilities for working with directories.
First, let's check our current Working Directory using the getcwd function.
## '/Users/mattpaul/'
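As a minimal sketch of the check described above, the call looks roughly like this. The path printed on your machine will differ from the one shown here.

```python
import os

# Ask Python which directory it is currently "standing" in.
current_dir = os.getcwd()
print(current_dir)
```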
The working directory is based on your VS Code workspace. If you have not set this, you can do so using the Explorer tab on the left of VS Code. If the workspace is in the wrong place, you can open a new window, set up a new workspace, then open a new Python console.
Let’s navigate to the course material that we downloaded. There are some example data sets there: Intro_To_Python-master/r_course/
You can also directly change your Working Directory in Python using the chdir function. The listdir function then shows what is inside a directory.
## ['presRaw', 'imgs', 'height-weight_female.csv', 'customCSS', 'data', 'GeneNames_highexpression.txt', '_course.yml', 'Descriptions']
## ['gene_expression.csv', 'ToRead.csv', 'subset_dataset.py', 'Data Visualization', 'myresult.RData', 'gene_data.csv', 'GeneNames.txt', 'height-weight_female.csv', 'Final Project', 'height-weight.csv', 'GeneExpression.txt', 'subset_dataset_args.py', 'GeneExpressionWithMethods.txt', 'readThisTable.csv']
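A small sketch of changing and inspecting the working directory is given below. To keep it runnable anywhere, a temporary folder stands in for the course folder; that folder and its `data` subfolder are assumptions made for illustration, not the course material itself.

```python
import os
import tempfile

start_dir = os.getcwd()  # remember where we started

# A temporary folder stands in for the course folder (hypothetical stand-in).
with tempfile.TemporaryDirectory() as course_dir:
    os.mkdir(os.path.join(course_dir, "data"))

    os.chdir(course_dir)       # move the working directory
    contents = os.listdir()    # list what is visible from the new location
    print(contents)

    os.chdir(start_dir)        # move back before the temporary folder is removed
```

In the course itself you would pass the path of the downloaded material to `os.chdir` instead of a temporary folder.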
When you give a path to Python it can either be relative or absolute.
To use our map analogy:
Relative paths are like directions, e.g. take a left, go straight, then take a right. The context of where you start is essential.
Absolute paths are like an address. They give the final location in absence of any other external information.
Both have their benefits.
The command we used before was using an absolute path. Typically they start with “/” to get to the top level of your computer's file structure.
This absolute pathway is very precise but it contains specific information about my computer. This means this will only work on this computer.
Given that we started at: /Users/mattpaul, we could have also used a relative path. This uses the knowledge that we are in that start position to find where we are going.
If we ensure the code is run from an equivalent position, then we can use relative paths across computers.
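The difference between the two kinds of path can be sketched with the os.path utilities. The file name below is only used as a label; the file does not need to exist for these calls to work.

```python
import os

# A relative path: only meaningful together with the current working directory.
relative_path = os.path.join("data", "height-weight.csv")

# The same location written as an absolute path (built from the working directory).
absolute_path = os.path.abspath(relative_path)

print(os.path.isabs(relative_path))
print(os.path.isabs(absolute_path))

# Going the other way: express the absolute path relative to where we are now.
print(os.path.relpath(absolute_path, start=os.getcwd()))
```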
There are many ways to read in data. Most of the time we are importing simple 2D tables, so we want to generate a NumPy array using the genfromtxt() function.
Here we are reading a csv file (Comma-Separated Values). This means each value in our file is separated by a comma. So when we read the file we will specify that the delimiter is a comma.
We can use the relative path from our new working directory to find it.
## <class 'numpy.ndarray'>
## array([[ 4.5702372 , 3.23046698, 3.35182734, 3.93087741, 4.09824666,
## 4.41872599],
## [ 3.56173302, 3.63228533, 3.58752332, 4.18528704, 1.38097605,
## 5.93699012],
## [ 3.79727358, 2.87446167, 4.01691555, 4.17577191, 1.98826299,
## 3.78091724],
## [ 3.39824235, 4.41520211, 4.89356109, 8.4323419 , 9.60915099,
## 9.01986467],
## [10.12878671, 10.22407116, 8.94581254, 2.93617444, 3.89292402,
## 0.91920785],
## [ 8.4743968 , 8.612628 , 7.17082986, 3.29935093, 2.57586964,
## 2.54277726],
## [10.01002009, 10.31235402, 11.60328984, 9.93070417, 7.74879528,
## 9.79882378],
## [ 9.39999242, 10.33284375, 9.37821714, 10.06519963, 10.78861857,
## 10.24532579]])
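A minimal sketch of reading comma-separated numbers with genfromtxt follows. To keep it self-contained, the text is held in memory with made-up values rather than read from the course file; with a real file you would pass its path as the first argument.

```python
import io
import numpy as np

# Three rows of made-up values (not the course data).
csv_text = "4.57,3.23,3.35\n3.56,3.63,3.59\n3.80,2.87,4.02\n"

# delimiter="," tells genfromtxt that values are separated by commas.
expression = np.genfromtxt(io.StringIO(csv_text), delimiter=",")

print(type(expression))
print(expression.shape)
```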
Here we try to read in a more complex file. This data is a mixture of characters and numbers. The dtype argument can be used to specify the format of the data we are reading in. This data also has column titles, so we will skip over those with skip_header.
The data set contains three columns: sex, height and weight.
When you read in a file with multiple data formats, genfromtxt does not automatically make a 2D array. Instead it returns a 1D structured array, in which each entry is one row holding values of different types.
## array([('Male', 182.87, 76.57), ('Male', 179.12, 80.43),
## ('Male', 169.15, 75.48), ('Male', 175.66, 94.54),
## ('Female', 164.47, 71.78), ('Female', 158.27, 69.9 ),
## ('Female', 161.69, 68.85), ('Female', 165.84, 70.44),
## ('Male', 181.32, 76.9 ), ('Male', 167.37, 79.06),
## ('Female', 160.06, 72.37), ('Female', 166.48, 67.34),
## ('Male', 175.39, 92.22), ('Female', 164.7 , 75.69),
## ('Female', 163.79, 65.76), ('Male', 181.13, 72.33),
## ('Male', 169.24, 73.3 ), ('Male', 176.22, 97.67),
## ('Male', 174.09, 72.2 ), ('Male', 180.11, 75.72),
## ('Male', 179.24, 75.54), ('Female', 161.92, 69.92),
## ('Male', 169.85, 90.63), ('Female', 160.57, 63.54),
## ('Female', 168.24, 69.57), ('Male', 177.75, 74.84),
## ('Male', 183.21, 83.36), ('Male', 167.75, 82.06),
## ('Male', 181.15, 83.93), ('Male', 181.56, 79.54),
## ('Female', 160.03, 64.3 ), ('Male', 165.62, 76.72),
## ('Male', 181.64, 96.91), ('Female', 159.67, 71.88),
## ('Male', 177.03, 74.04), ('Female', 163.35, 70.46),
## ('Male', 175.21, 83.65), ('Female', 160.8 , 64.77),
## ('Male', 166.46, 76.83), ('Female', 157.95, 67.41),
## ('Male', 180.61, 83.59), ('Female', 159.52, 67.99),
## ('Female', 163.01, 65.19), ('Female', 165.8 , 71.77),
## ('Female', 170.03, 66.68), ('Female', 157.16, 69.64),
## ('Female', 164.58, 72.99), ('Female', 163.47, 72.89),
## ('Male', 185.43, 87.23), ('Female', 165.34, 70.84),
## ('Female', 163.45, 67.67), ('Female', 163.97, 66.71),
## ('Female', 161.38, 73.55), ('Female', 160.09, 65.93),
## ('Male', 178.64, 97.05), ('Female', 159.78, 68.31),
## ('Female', 161.57, 67.92), ('Female', 161.83, 66.03),
## ('Male', 169.66, 77.3 ), ('Male', 166.84, 88.25),
## ('Female', 159.32, 64.92), ('Male', 170.51, 84.35),
## ('Female', 161.84, 69.97), ('Male', 171.41, 81.7 ),
## ('Male', 166.75, 79.06), ('Female', 166.19, 67.46),
## ('Male', 169.16, 90.08), ('Female', 157.01, 66.56),
## ('Male', 167.51, 84.15), ('Female', 160.47, 68.2 ),
## ('Female', 162.33, 66.47), ('Male', 175.67, 88.82),
## ('Male', 174.25, 80.93), ('Female', 158.94, 65.14),
## ('Male', 172.72, 67.62), ('Female', 159.23, 69.96),
## ('Male', 176.54, 90.76), ('Male', 184.34, 90.41),
## ('Female', 163.94, 71.47), ('Female', 160.09, 68.94),
## ('Female', 162.32, 72.72), ('Female', 162.59, 69.76),
## ('Male', 171.94, 82.11), ('Female', 158.07, 69.8 ),
## ('Female', 158.35, 69.72), ('Female', 162.18, 67.81),
## ('Female', 159.38, 70.37), ('Male', 171.45, 84.29),
## ('Female', 163.17, 64.47), ('Male', 183.1 , 82.47),
## ('Male', 177.14, 88.7 ), ('Male', 171.08, 72.51),
## ('Female', 159.33, 70.68), ('Male', 185.43, 73.63),
## ('Female', 162.65, 73.99), ('Female', 159.44, 66.21),
## ('Female', 164.11, 70.66), ('Female', 159.13, 66.96),
## ('Female', 160.58, 71.49), ('Female', 164.88, 68.07)],
## dtype=[('f0', '<U10'), ('f1', '<f4'), ('f2', '<f4')])
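The following sketch shows one way a mixed-type table could be read. The in-memory text and the `column_types` string are assumptions for illustration; the type specification is chosen to match the dtype shown in the output above.

```python
import io
import numpy as np

# A tiny stand-in for the file: a header line followed by mixed-type rows.
csv_text = "Sex,Height,Weight\nMale,182.87,76.57\nFemale,164.47,71.78\nFemale,158.27,69.90\n"

# One type per column: text of up to 10 characters, then two 32-bit floats.
column_types = "U10,f4,f4"

records = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,       # skip the line of column titles
    dtype=column_types,
)

print(records.shape)        # one entry per row, so the array is 1D
print(records.dtype.names)  # default field names given by NumPy
print(records["f0"])        # a single column is accessed by its field name
```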
While reading we can also assign our data into multiple objects using the unpack argument. Here we assign each column into its own separate array.
# dtype is the column type specification defined in the previous step
sex, height, weight = np.genfromtxt('data/height-weight.csv', unpack=True, delimiter=",", skip_header=True, dtype=dtype)
sex
## array(['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
## 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female',
## 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male',
## 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
## 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male',
## 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male',
## 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female',
## 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female',
## 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female'],
## dtype='<U10')
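As a simplified sketch of unpack, the example below reads only the two numeric columns, so the result is a plain 2D array that is transposed and split into one array per column. The in-memory text and the use of usecols are choices made for this illustration.

```python
import io
import numpy as np

csv_text = "Sex,Height,Weight\nMale,182.87,76.57\nFemale,164.47,71.78\nFemale,158.27,69.90\n"

# usecols picks columns 1 and 2 (the numbers); unpack=True returns one array per column.
height, weight = np.genfromtxt(
    io.StringIO(csv_text),
    delimiter=",",
    skip_header=1,
    usecols=(1, 2),
    unpack=True,
)

print(height)
print(weight)
```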
As an alternative to setting the data type explicitly, you can let genfromtxt determine it automatically by setting dtype to None and encoding to None. This works pretty well most of the time.
sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter=",", skip_header=True, unpack=True, dtype=None, encoding=None)
sex
## array(['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Male', 'Male', 'Female', 'Female', 'Male', 'Female',
## 'Female', 'Male', 'Male', 'Male', 'Male', 'Male', 'Male', 'Female',
## 'Male', 'Female', 'Female', 'Male', 'Male', 'Male', 'Male', 'Male',
## 'Female', 'Male', 'Male', 'Female', 'Male', 'Female', 'Male',
## 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Male', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Male', 'Male', 'Female', 'Male', 'Female', 'Male', 'Male',
## 'Female', 'Male', 'Female', 'Male', 'Female', 'Female', 'Male',
## 'Male', 'Female', 'Male', 'Female', 'Male', 'Male', 'Female',
## 'Female', 'Female', 'Female', 'Male', 'Female', 'Female', 'Female',
## 'Female', 'Male', 'Female', 'Male', 'Male', 'Male', 'Female',
## 'Male', 'Female', 'Female', 'Female', 'Female', 'Female', 'Female'],
## dtype='<U6')
Once you have performed some analysis you will want to export it into a new file.
Let’s do some quick filtering of the data based on what we learned yesterday. We can then export the results later.
First we will make a 2D array.
Next we do a logical test to see which entries in our first row are Female.
## array([False, False, False, False, True, True, True, True, False,
## False, True, True, False, True, True, False, False, False,
## False, False, False, True, False, True, True, False, False,
## False, False, False, True, False, False, True, False, True,
## False, True, False, True, False, True, True, True, True,
## True, True, True, False, True, True, True, True, True,
## False, True, True, True, False, False, True, False, True,
## False, False, True, False, True, False, True, True, False,
## False, True, False, True, False, False, True, True, True,
## True, False, True, True, True, True, False, True, False,
## False, False, True, False, True, True, True, True, True,
## True])
We can use the boolean array we created to subset our array, keeping only the columns where the entry in the first row is Female.
## array([['Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female', 'Female', 'Female', 'Female', 'Female', 'Female',
## 'Female'],
## ['164.47', '158.27', '161.69', '165.84', '160.06', '166.48',
## '164.7', '163.79', '161.92', '160.57', '168.24', '160.03',
## '159.67', '163.35', '160.8', '157.95', '159.52', '163.01',
## '165.8', '170.03', '157.16', '164.58', '163.47', '165.34',
## '163.45', '163.97', '161.38', '160.09', '159.78', '161.57',
## '161.83', '159.32', '161.84', '166.19', '157.01', '160.47',
## '162.33', '158.94', '159.23', '163.94', '160.09', '162.32',
## '162.59', '158.07', '158.35', '162.18', '159.38', '163.17',
## '159.33', '162.65', '159.44', '164.11', '159.13', '160.58',
## '164.88'],
## ['71.78', '69.9', '68.85', '70.44', '72.37', '67.34', '75.69',
## '65.76', '69.92', '63.54', '69.57', '64.3', '71.88', '70.46',
## '64.77', '67.41', '67.99', '65.19', '71.77', '66.68', '69.64',
## '72.99', '72.89', '70.84', '67.67', '66.71', '73.55', '65.93',
## '68.31', '67.92', '66.03', '64.92', '69.97', '67.46', '66.56',
## '68.2', '66.47', '65.14', '69.96', '71.47', '68.94', '72.72',
## '69.76', '69.8', '69.72', '67.81', '70.37', '64.47', '70.68',
## '73.99', '66.21', '70.66', '66.96', '71.49', '68.07']],
## dtype='<U32')
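The filtering steps above can be sketched on a few made-up values. Here the numbers are converted to strings explicitly before stacking, which makes visible why the output above shows every value in quotes; the variable names are chosen for this illustration.

```python
import numpy as np

sex = np.array(["Male", "Female", "Female", "Male"])
height = np.array([182.87, 164.47, 158.27, 179.12])
weight = np.array([76.57, 71.78, 69.90, 80.43])

# A NumPy array holds one data type, so the numbers are stored as text here.
table = np.array([sex, height.astype(str), weight.astype(str)])

# Logical test on the first row: True where the entry is "Female".
is_female = table[0, :] == "Female"

# Keep every row, but only the columns where the test was True.
female_only = table[:, is_female]

print(is_female)
print(female_only)
```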
To export the subset array we use the savetxt function. This has similar arguments to genfromtxt. A key addition is fmt, the format in which the data is saved. You can specify scientific notation, significant digits or, in our case, simply that we will use a string.
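A short sketch of how fmt changes what is written is shown below. The arrays and file names are made up, and a temporary folder is used so that nothing is left behind.

```python
import os
import tempfile
import numpy as np

text_table = np.array([["Female", "Female"], ["164.47", "158.27"], ["71.78", "69.9"]])
numbers = np.array([[3.676307, 5.309033]])

with tempfile.TemporaryDirectory() as tmp_dir:
    # fmt="%s" writes each value as a plain string.
    text_path = os.path.join(tmp_dir, "text_example.csv")
    np.savetxt(text_path, text_table, delimiter=",", fmt="%s")
    with open(text_path) as handle:
        text_lines = handle.read().splitlines()

    # fmt="%.2f" writes numbers rounded to two decimal places.
    number_path = os.path.join(tmp_dir, "number_example.csv")
    np.savetxt(number_path, numbers, delimiter=",", fmt="%.2f")
    with open(number_path) as handle:
        number_lines = handle.read().splitlines()

print(text_lines)
print(number_lines)
```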
Though a lot of the time we may use these NumPy approaches, there are specific functions dedicated to a variety of data types.
The most common is pandas. This is a specialized library for managing data frames. These are similar to arrays but a little more flexible for multiple data types. It also has the ability to read from and write to Excel spreadsheets.
BioPython and BioNumPy have a range of utilities for managing biological data types such as fasta and fastq.
So far we have predominately been working interactively with the console: asking it questions and getting answers back immediately.
When you want to run all your code, or if you want to start working on automation, you can run a whole script instead. In this case we have all our code written out in a *.py document. This is good practice for mature analyses that you have finalized, as it ensures that everything is properly documented.
Let's have a look at an example Python script.
data/subset_dataset.py
This is our script:
import numpy as np

# Read in dataset
sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter=",", skip_header=True, unpack=True, encoding=None, dtype=None)
new_array = np.array([sex, height, weight])

# Subset array
to_subset = new_array[0, :] == "Female"
subset_array = new_array[:, to_subset]

# Print out number of Females in data set and export subset
print(str(sum(to_subset)) + " are female")
np.savetxt("data/height-weight_female.csv", subset_array, delimiter=',', fmt='%s')

There are a couple of options for running scripts. In VS Code we can simply press the Run button. It looks like a Play icon.
VS Code is helping us out here by running the script without us having to work directly in the terminal to start Python. You can still see what it is doing in the terminal when this runs.
More traditionally we would invoke Python directly and provide the script. You will have to do this if you want to run a more complex script, e.g. one that takes arguments.
/Users/mattpaul/Desktop/miniconda3/envs/intro_to_python/bin/python /Users/mattpaul/Documents/RU/Training/Intro_To_Python/r_course/data/subset_dataset.py
To use arguments we first have to modify our script. Command-line arguments are made available through the sys library, which stores them in the list sys.argv. The first entry is the script's name. Subsequent entries are the arguments, in the order they were given.
import sys
import numpy as np

print("My Script Name:", sys.argv[0])
print("My Argument:", sys.argv[1])
arg1 = sys.argv[1]

# Read in dataset
sex, height, weight = np.genfromtxt('data/height-weight.csv', delimiter=",", skip_header=True, unpack=True, encoding=None, dtype=None)
new_array = np.array([sex, height, weight])

# Subset array
to_subset = new_array[0, :] == arg1
subset_array = new_array[:, to_subset]

# Print out number of entries matching the argument and export subset
print(str(sum(to_subset)) + " are " + arg1)
np.savetxt("data/height-weight_" + arg1 + ".csv", subset_array, delimiter=',', fmt='%s')

To pass the argument to our script when we run it, we simply add it after our Python script.
/Users/mattpaul/Desktop/miniconda3/envs/intro_to_python/bin/python /Users/mattpaul/Documents/RU/Training/Intro_To_Python/r_course/data/subset_dataset_args.py Male
You should see that when this runs we create a new file. Also, even though we do not have a Python console open, the output of the print statements is returned to the terminal.
When you have long and complex scripts these statements become even more important as checkpoints to make sure your code is working as expected. Any important result should be saved as a document though as these screen messages do not persist.
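The script above assumes an argument is always supplied; running it without one would stop with an IndexError at sys.argv[1]. One possible safeguard is sketched below. The helper name `read_group` and the default value are hypothetical and not part of the course scripts.

```python
import sys


def read_group(argv, default="Female"):
    """Return the first command-line argument, or a default if none was given."""
    # argv[0] is the script name, so the first real argument sits at index 1.
    if len(argv) > 1:
        return argv[1]
    return default


if __name__ == "__main__":
    group = read_group(sys.argv)
    print("Subsetting for:", group)
```

Keeping the check in a small function means it can be tried with any list, for example `read_group(["script.py", "Male"])`, without having to rerun the script from the terminal.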
Keeping your code tidy can be tedious. There are many ways to store your code. Most of the time we are not writing production-level scripts. Instead you will be analyzing a data set and making decisions in an interactive manner.
Notebooks give you a means to tie the code, the analysis decisions and the result of the code together into a single file.
There are several notebook options for Python. Jupyter Notebook is the best known, while Quarto is growing in popularity as it is universal and language agnostic.
Exercises on the data types we have covered so far can be found here
Answers can be found here
When you hit bugs:
* Google/ChatGPT/Claude, etc.
* Stack Overflow
* Biostars
* Reach out on GitHub

Other Reference Material:
* Harvard’s Python Course
* Geeks For Geeks
Comments
We can use the number/pound/hash sign (#) to indicate that everything after it on that line is “commented out”. This means that Python will not evaluate these sections. This gives us room to annotate our code.
Comments are a core part of good coding etiquette. If you are sharing scripts or need to figure out what you did at a later date it helps to have a short statement to explain each step of what you are doing.
The longer and more complex your code gets, the longer and more complex your comments should get.
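A brief sketch of both comment styles, using made-up values:

```python
# Heights in the same style as the example data (made-up values).
heights = [182.87, 164.47, 158.27]

# A whole line can be a comment. The next line is commented out, so it never runs:
# heights = []

mean_height = sum(heights) / len(heights)  # a comment can also follow code on the same line
print(round(mean_height, 2))
```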