These exercises cover logical operators, reading and writing files, and scripts from Session 2.

Exercise 1 - Reading/Writing

import numpy as np

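# Read the expression matrix, skipping the single header line.
geneExpression = np.genfromtxt("data/GeneExpression.txt", delimiter="\t", skip_header=1)
# Read the version that includes methods text, skipping the first 4 lines.
geneExpression = np.genfromtxt("data/GeneExpressionWithMethods.txt", delimiter="\t", skip_header=4)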

geneExpression
## array([[ 5.74251  ,  3.214303 ,  4.11682  ,  3.212353 ,  5.742333 ,
##          5.9350948],
##        [ 6.444368 ,  5.896076 ,  2.592581 ,  5.089549 ,  3.624812 ,
##          2.6313925],
##        [ 3.083392 ,  3.414723 ,  3.706069 ,  4.535536 ,  5.104273 ,
##          5.7149521],
##        [ 4.726498 ,  3.023746 ,  3.033173 ,  8.017895 ,  8.0988   ,
##          8.1964109],
##        [ 9.909185 ,  9.174323 ,  9.957153 ,  2.053501 ,  3.276533 ,
##          0.7332521],
##        [10.680459 ,  9.951243 ,  8.985412 ,  3.360963 ,  3.566663 ,
##          3.8519471],
##        [10.516534 , 10.176163 ,  9.778173 , 11.78152  ,  9.005437 ,
##         11.1733928],
##        [ 9.01702  ,  9.342291 ,  9.895636 , 12.046704 , 11.00324  ,
##          9.90325  ]])
# Mean expression of each gene, one row at a time.
geneExpression[0,].mean()
## np.float64(4.660568966666667)
geneExpression[1,].mean()
## np.float64(4.379796416666666)
geneExpression[2,].mean()
## np.float64(4.259824183333333)
geneExpression[3,].mean()
## np.float64(5.849420483333333)
geneExpression[4,].mean()
## np.float64(5.85065785)
geneExpression[5,].mean()
## np.float64(6.732781183333333)
geneExpression[6,].mean()
## np.float64(10.405203300000002)
geneExpression[7,].mean()
## np.float64(10.201356833333334)
# Alternatively, use the axis argument to compute all the row means in one call.
geneExpression.mean(axis=1)
## array([ 4.66056897,  4.37979642,  4.25982418,  5.84942048,  5.85065785,
##         6.73278118, 10.4052033 , 10.20135683])
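To make the meaning of the axis argument concrete, here is a minimal sketch on a made-up matrix (the `toy` array and its values are invented for illustration, not taken from the dataset). With rows as genes and columns as samples, `axis=1` collapses the columns and gives one value per row, while `axis=0` gives one value per column.

```python
import numpy as np

# Toy matrix (made-up values): 2 genes (rows) x 3 samples (columns).
toy = np.array([[1.0, 2.0, 3.0],
                [4.0, 5.0, 6.0]])

row_means = toy.mean(axis=1)  # one mean per gene (row)
col_means = toy.mean(axis=0)  # one mean per sample (column)

print(row_means)  # [2. 5.]
print(col_means)  # [2.5 3.5 4.5]
```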

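# Boolean mask: True for genes whose mean expression is above 6.
sub_idx = geneExpression.mean(axis=1) > 6
geneNames = np.genfromtxt("data/GeneNames.txt", delimiter="\t", dtype="U6")
# Keep only the names of the highly expressed genes and write them to a file.
geneNames_sub = geneNames[sub_idx]
np.savetxt("GeneNames_highexpression.txt", geneNames_sub, delimiter="\t", fmt="%s")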
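The subsetting step relies on boolean indexing: a comparison on an array returns an array of True/False values, and indexing a second array of the same length with it keeps only the entries where the mask is True. A small sketch with invented values (`toy_means` and `toy_names` are hypothetical, not the real data):

```python
import numpy as np

# Made-up row means and matching hypothetical gene names.
toy_means = np.array([4.7, 6.7, 10.4])
toy_names = np.array(["Gene_x", "Gene_y", "Gene_z"])

mask = toy_means > 6        # element-wise comparison gives a boolean array
selected = toy_names[mask]  # keeps only the names where mask is True

print(mask)      # [False  True  True]
print(selected)  # ['Gene_y' 'Gene_z']
```

The mask and the array being indexed must have the same length, which is why the gene names file needs one name per row of the expression matrix.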

Exercise 2 - Scripts

Let's put together as much as possible of what we have learnt so far. This is a multistep challenge: break it down and use pseudocode to help. Start by working through the code interactively, then turn it into a script.

  1. Read in the “data/GeneExpressionWithMethods.txt” dataset.
  2. Use a for loop to calculate the Z-score for each gene (per row). The Z-score is (gene_expression - mean) / standard deviation. You should use a function to do this calculation.
  3. Save the result as a NumPy array. The vstack function might be useful.
  4. Find which gene has the highest maximum absolute Z-score. This is a very rough proxy for the variability of that gene.
  5. Print out the name of the gene with the highest value.
  6. Turn this into a script and run the script.
  7. Think about what modifications you would need to make in order to accept a different dataset as input.
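As a quick check of the Z-score formula from step 2, here is a minimal worked example on a single made-up row (the `toy_gene` values and this one-argument `zscore` helper are illustrative assumptions). Note that NumPy's `std` computes the population standard deviation by default (`ddof=0`).

```python
import numpy as np

def zscore(values):
    """Standardise a 1-D array: subtract its mean, divide by its standard deviation."""
    return (values - values.mean()) / values.std()

# Made-up expression values for one gene: mean is 4, std is sqrt(8/3).
toy_gene = np.array([2.0, 4.0, 6.0])
z = zscore(toy_gene)

print(z)         # roughly -1.22, 0, 1.22
print(z.mean())  # approximately 0
print(z.std())   # approximately 1
```

After standardising, the values have mean 0 and standard deviation 1 (up to floating-point error), so Z-scores from different genes are on a comparable scale.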

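import numpy as np

geneExpression = np.genfromtxt("data/GeneExpressionWithMethods.txt", delimiter="\t", skip_header=4)
# Per-gene (row-wise) mean and standard deviation.
my_mean = geneExpression.mean(axis=1)
my_std = geneExpression.std(axis=1)

def zscore(value, mean, std):
  # Z-score: distance from the mean in units of standard deviation.
  my_z = (value - mean) / std
  return my_z

# Compute the Z-scores one gene at a time and stack the rows into a 2-D array.
for i in range(geneExpression.shape[0]):
  geneExpression_zscore = zscore(geneExpression[i], my_mean[i], my_std[i])
  if i == 0:
    my_zscore = np.array(geneExpression_zscore)
  else:
    my_zscore = np.vstack((my_zscore, geneExpression_zscore))

# Largest absolute Z-score per gene, then the largest of those overall.
my_abs = np.abs(my_zscore)
top_values = my_abs.max(axis=1)
top_value = top_values.max()
# Boolean mask marking the gene(s) that reach the overall maximum.
my_top_index = top_values == top_value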

geneNames = np.genfromtxt("data/GeneNames.txt", delimiter="\t", dtype="U6")
geneNames_sub = geneNames[my_top_index]
print(geneNames_sub)
## ['Gene_h']
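For steps 6 and 7, one possible way to accept a different dataset is to wrap the analysis in a function and read the file paths from the command line. The sketch below is only one option: the names `top_zscore_gene`, `parse_args` and `main` are hypothetical, and it swaps the for loop for NumPy broadcasting (`keepdims=True`) to keep the example short, so it is not the loop-based approach the exercise asks for.

```python
import argparse
import numpy as np

def top_zscore_gene(expression_path, names_path, skip_header=4):
    """Return the name(s) of the gene(s) with the largest absolute Z-score."""
    expression = np.genfromtxt(expression_path, delimiter="\t", skip_header=skip_header)
    names = np.genfromtxt(names_path, delimiter="\t", dtype=str)
    # keepdims=True keeps a column shape (n_genes, 1) so the subtraction
    # and division broadcast across each row.
    mean = expression.mean(axis=1, keepdims=True)
    std = expression.std(axis=1, keepdims=True)
    top_per_gene = np.abs((expression - mean) / std).max(axis=1)
    return names[top_per_gene == top_per_gene.max()]

def parse_args(argv=None):
    """Read the input paths from the command line (argument names are assumptions)."""
    parser = argparse.ArgumentParser(description="Find the gene with the largest absolute Z-score.")
    parser.add_argument("expression_path", help="tab-delimited expression matrix")
    parser.add_argument("names_path", help="file with one gene name per row")
    parser.add_argument("--skip-header", type=int, default=4,
                        help="number of lines to skip at the top of the expression file")
    return parser.parse_args(argv)

def main(argv=None):
    args = parse_args(argv)
    print(top_zscore_gene(args.expression_path, args.names_path, args.skip_header))
```

Saved to a file (say `top_gene.py`, a hypothetical name) with a call to `main()` at the bottom, this could be run as `python top_gene.py data/GeneExpressionWithMethods.txt data/GeneNames.txt`. A different dataset then only needs different arguments, plus `--skip-header` if its header length differs.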