Exercise 5

These exercises cover logical operators, reading and writing files, and scripts from Session 2.

Exercise 1 - Reading/Writing

Read in the tab-delimited file “GeneExpression.txt”. (Hint: The delimiter for a tab-delimited file is “\t”.)

import numpy as np

geneExpression = np.genfromtxt("data/GeneExpression.txt", delimiter="\t", skip_header=1)

Read in the tab-delimited file “GeneExpressionWithMethods.txt”. This file contains information on the analysis steps used to produce the file. We will want to skip those lines. (Hint: Check the skip_header argument in the help page.)

geneExpression = np.genfromtxt("data/GeneExpressionWithMethods.txt", delimiter="\t", skip_header=4)

geneExpression

## array([[ 5.74251  ,  3.214303 ,  4.11682  ,  3.212353 ,  5.742333 ,
##          5.9350948],
##        [ 6.444368 ,  5.896076 ,  2.592581 ,  5.089549 ,  3.624812 ,
##          2.6313925],
##        [ 3.083392 ,  3.414723 ,  3.706069 ,  4.535536 ,  5.104273 ,
##          5.7149521],
##        [ 4.726498 ,  3.023746 ,  3.033173 ,  8.017895 ,  8.0988   ,
##          8.1964109],
##        [ 9.909185 ,  9.174323 ,  9.957153 ,  2.053501 ,  3.276533 ,
##          0.7332521],
##        [10.680459 ,  9.951243 ,  8.985412 ,  3.360963 ,  3.566663 ,
##          3.8519471],
##        [10.516534 , 10.176163 ,  9.778173 , 11.78152  ,  9.005437 ,
##         11.1733928],
##        [ 9.01702  ,  9.342291 ,  9.895636 , 12.046704 , 11.00324  ,
##          9.90325  ]])
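As a small illustration of what skip_header does, the sketch below reads a made-up tab-delimited table from memory. The two "methods" lines, the column names and the values are hypothetical and are not the contents of the course file; only np.genfromtxt and its delimiter and skip_header arguments come from the exercise.

```python
import io
import numpy as np

# Hypothetical file contents: two free-text lines, one column-name line, then data.
text = (
    "Method: normalised\n"
    "Software: example\n"
    "Sample1\tSample2\n"
    "1.5\t2.5\n"
    "3.5\t4.5\n"
)

# skip_header=3 drops the first three lines before any parsing happens,
# so only the numeric rows are converted to floats.
data = np.genfromtxt(io.StringIO(text), delimiter="\t", skip_header=3)

print(data.shape)  # (2, 2)
```

The number passed to skip_header must match the number of non-data lines at the top of the file, so it is worth opening the file in a text editor first.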

Find the mean expression of each row. (Hint: NumPy arrays have a mean method.)

geneExpression[0,].mean()

## np.float64(4.660568966666667)

geneExpression[1,].mean()

## np.float64(4.379796416666666)

geneExpression[2,].mean()

## np.float64(4.259824183333333)

geneExpression[3,].mean()

## np.float64(5.849420483333333)

geneExpression[4,].mean()

## np.float64(5.85065785)

geneExpression[5,].mean()

## np.float64(6.732781183333333)

geneExpression[6,].mean()

## np.float64(10.405203300000002)

geneExpression[7,].mean()

## np.float64(10.201356833333334)

# Alternatively we could use the axis argument.
geneExpression.mean(axis=1)

## array([ 4.66056897,  4.37979642,  4.25982418,  5.84942048,  5.85065785,
##         6.73278118, 10.4052033 , 10.20135683])
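The axis argument is easy to get the wrong way round, so here is a minimal sketch on a made-up 2 x 3 array (the values are hypothetical). With axis=1 the mean is taken along each row, giving one value per gene; with axis=0 it is taken down each column, giving one value per sample.

```python
import numpy as np

# Hypothetical toy matrix: 2 genes (rows) x 3 samples (columns).
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

row_means = toy.mean(axis=1)     # one mean per gene, shape (2,)
column_means = toy.mean(axis=0)  # one mean per sample, shape (3,)

print(row_means)     # [2. 5.]
print(column_means)  # [2.5 3.5 4.5]
```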

The corresponding gene names are stored in “GeneNames.txt”. Read this in and subset it to the genes whose mean expression is over 6. Write the result out to a new text file.


sub_idx = geneExpression.mean(axis=1) > 6
geneNames = np.genfromtxt("data/GeneNames.txt", delimiter="\t", dtype="U6")
geneNames_sub = geneNames[sub_idx]
np.savetxt("GeneNames_highexpression.txt", geneNames_sub, delimiter="\t", fmt='%s')
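The solution relies on boolean indexing: comparing an array with a number gives an array of True/False values, and indexing with it keeps only the True positions. The sketch below shows this on made-up data (gene names and values are hypothetical) and writes to an in-memory buffer instead of a file so that the written text can be inspected.

```python
import io
import numpy as np

# Hypothetical stand-ins for the expression matrix and the gene names.
expression = np.array([[5.0, 7.0, 9.0],   # mean 7 -> kept
                       [1.0, 2.0, 3.0],   # mean 2 -> dropped
                       [6.0, 6.0, 6.0]])  # mean 6 -> dropped (not strictly over 6)
names = np.array(["Gene_x", "Gene_y", "Gene_z"])

keep = expression.mean(axis=1) > 6  # boolean mask, one entry per gene
selected = names[keep]              # keeps names where the mask is True

# np.savetxt accepts a file name or an open file-like object.
buffer = io.StringIO()
np.savetxt(buffer, selected, fmt="%s")
written = buffer.getvalue()

print(keep)      # [ True False False]
print(selected)  # ['Gene_x']
```

Note that > excludes a gene whose mean is exactly 6; use >= if such genes should be kept.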

Exercise 2 - Scripts

Let's try to put together as much as we can of what we have learnt so far. This will be a multistep challenge. Break it down and use pseudocode to help. Start by working through the code interactively, then turn it into a script.

Read in the “data/GeneExpressionWithMethods.txt” dataset.
Use a for loop to calculate the Z-score for each gene (per row). The Z-score is (gene_expression - mean) / standard deviation. You should use a function to do this calculation.
Save the result as a NumPy array. The vstack function might be useful.
Find which gene has the highest maximum absolute Z-score. This is a very rough proxy for the variability of that gene.
Print out the name of the gene with the highest value.
Turn this into a script and run the script.
Think about what modifications you would need to make in order to accept a different data set as input.


geneExpression = np.genfromtxt("data/GeneExpressionWithMethods.txt", delimiter="\t", skip_header=4)

my_mean = geneExpression.mean(axis=1)
my_std = geneExpression.std(axis=1)


def zscore(value, mean, std):
  my_z = (value-mean)/std
  return my_z

for i in range(geneExpression.shape[0]):
  geneExpression_zscore = zscore(geneExpression[i], my_mean[i], my_std[i])
  if i==0:
    my_zscore=np.array(geneExpression_zscore)
  else:
    my_zscore=np.vstack((my_zscore,geneExpression_zscore))
  
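my_abs = abs(my_zscore)
# Largest absolute Z-score within each gene (row)
top_values = my_abs.max(axis=1)
# Largest of those values across all genes
top_value = top_values.max()
# Boolean mask selecting the gene(s) that reach that value
my_top_index = top_values == top_value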

geneNames = np.genfromtxt("data/GeneNames.txt", delimiter="\t", dtype="U6")
geneNames_sub = geneNames[my_top_index]
print(geneNames_sub)

## ['Gene_h']
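For the last two steps (turning the code into a script and accepting a different data set), one possible sketch is shown below. It is not the course's reference solution: the function names, the command-line argument names and the use of argparse are assumptions made for illustration. It also swaps the for loop and vstack for a vectorised calculation using keepdims=True, and uses argmax instead of a boolean mask, so only the first gene is reported if several tie.

```python
import argparse
import numpy as np


def row_zscores(expression):
    """Z-score every row at once: (value - row mean) / row standard deviation."""
    # keepdims=True keeps shape (n_genes, 1) so the arithmetic broadcasts per row.
    mean = expression.mean(axis=1, keepdims=True)
    std = expression.std(axis=1, keepdims=True)
    return (expression - mean) / std


def top_gene(expression, names):
    """Return the name of the gene with the largest absolute Z-score."""
    top_per_gene = np.abs(row_zscores(expression)).max(axis=1)
    return names[top_per_gene.argmax()]  # first gene wins if there is a tie


def build_parser():
    # Hypothetical command-line interface; the argument names are illustrative.
    parser = argparse.ArgumentParser(
        description="Report the gene with the largest absolute Z-score.")
    parser.add_argument("expression_file", help="tab-delimited expression matrix")
    parser.add_argument("names_file", help="one gene name per line")
    parser.add_argument("--skip-header", type=int, default=0,
                        help="number of leading lines to skip")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    expression = np.genfromtxt(args.expression_file, delimiter="\t",
                               skip_header=args.skip_header)
    # dtype=str avoids truncating names longer than a fixed width such as "U6".
    names = np.genfromtxt(args.names_file, delimiter="\t", dtype=str)
    print(top_gene(expression, names))
```

Saved to a file (say, a hypothetical zscore_top_gene.py) with a call to main() at the bottom, this could be run as: python zscore_top_gene.py data/GeneExpressionWithMethods.txt data/GeneNames.txt --skip-header 4. Because the file paths and the number of header lines are arguments rather than hard-coded values, the same script can be pointed at a different data set.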


Rockefeller University, Bioinformatics Resource Centre

https://rockefelleruniversity.github.io/Intro_To_Python/