These exercises cover the sections of Data wrangling with tidy.

All files can be found in the “dataset” directory.

Exercise 7

Calculate CPMs for our expressed genes (tidy_counts_expressed)
Calculate TPMs for our expressed genes
Draw a plot to compare the two

Hint:

Counts per million (CPM) are the gene counts normalized to total counts in a sample, multiplied by a million to give you a sensible number.

gene_A_CPM = (gene_A_counts / sum(all_genes_counts)) * 1,000,000

Transcripts per million (TPM) are the gene counts first divided by gene length, then normalized to the total of these length-normalized counts in a sample, multiplied by a million to give you a sensible number.

gene_A_TPM = (gene_A_counts / sum(all_genes_counts / all_genes_lengths)) * 1/gene_A_length * 1,000,000

More info on RNAseq counts quantification here: http://luisvalesilva.com/datasimple/rna-seq_units.html
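As a rough illustration of the two formulas, here is a small base R sketch on made-up numbers for a single sample. The gene names, counts and lengths are hypothetical and are not taken from the course dataset.

```r
# Hypothetical toy values for one sample (not from the course dataset)
counts  <- c(gene_A = 10, gene_B = 20, gene_C = 70)
lengths <- c(gene_A = 1000, gene_B = 2000, gene_C = 500)  # gene lengths in bases

# CPM: each gene's share of the sample's total counts, scaled to one million
cpm <- counts / sum(counts) * 1e6

# TPM: divide by gene length first, then scale so the sample sums to one million
rate <- counts / lengths
tpm  <- rate / sum(rate) * 1e6

cpm
tpm
```

In this toy sample gene_B has twice the counts of gene_A, so its CPM is twice as large, but it is also twice as long, so both genes end up with the same TPM (62,500).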

ANSWERS

Load in dataset and packages

load(file='dataset/my_tidy.Rdata')
library(tidyverse)

Answer 1

# Use group_by to run the computation within each sample. Use mutate to make
# a new variable that is the CPM
tidy_counts_expressed_norm <- tidy_counts_expressed  %>% 
  group_by(Sample) %>% 
  mutate(CPM=(counts/sum(counts))*1000000)

Answer 2

# Join our tidy data to the metadata, then make a new variable that is the TPM.
# Group by Sample explicitly so that sum() is taken within each sample
tidy_counts_expressed_norm <- tidy_counts_expressed_norm %>% 
  inner_join(counts_metadata, by = c("ENTREZ" = "ID")) %>%  
  group_by(Sample) %>% 
  mutate(TPM=(counts/sum(counts/LENGTH))*(1000000/LENGTH))
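One way to sanity-check the normalization is to confirm that CPM and TPM each add up to one million within every sample. The sketch below does this on a hypothetical toy table (the `toy` data frame and its values are invented for illustration; only the column names mirror the ones used above).

```r
library(dplyr)

# Hypothetical toy data: two samples, three genes, made-up counts and lengths
toy <- tibble(
  Sample = rep(c("S1", "S2"), each = 3),
  ENTREZ = rep(c("g1", "g2", "g3"), times = 2),
  counts = c(10, 20, 70, 5, 15, 30),
  LENGTH = rep(c(1000, 2000, 500), times = 2)
)

# Same CPM and TPM formulas as in the answers, computed within each sample
toy_norm <- toy %>%
  group_by(Sample) %>%
  mutate(
    CPM = (counts / sum(counts)) * 1e6,
    TPM = (counts / sum(counts / LENGTH)) * (1e6 / LENGTH)
  ) %>%
  ungroup()

# Per-sample totals: both columns should be one million for every sample
totals <- toy_norm %>%
  group_by(Sample) %>%
  summarise(CPM_total = sum(CPM), TPM_total = sum(TPM))
totals
```

If the totals come out as one million for the whole table rather than for each sample, the grouping by Sample was probably lost before the mutate step.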

Answer 3

# Simple X-Y plot comparing TPM and CPM. 
p <- tidy_counts_expressed_norm %>% 
  ggplot(aes(x=CPM, y=TPM)) + 
  geom_point() + 
  scale_x_continuous(name="log2(CPM)",trans='log2') + 
  scale_y_continuous(name="log2(TPM)",trans='log2')
p

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous y-axis
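These warnings most likely mean that some genes have zero counts in at least one sample: log2(0) is -Inf, so those points cannot be placed on a log2 axis. A minimal base R sketch of the effect, using a hypothetical vector of CPM values, and of one common workaround (adding a pseudocount of 1 before taking the log):

```r
# Hypothetical CPM values, including a zero
cpm <- c(0, 1, 3, 7)

# The zero becomes -Inf on a log2 scale
log_raw <- log2(cpm)
is.infinite(log_raw)

# Adding a pseudocount of 1 keeps every value finite
log_pseudo <- log2(cpm + 1)
log_pseudo
```

Whether to add a pseudocount or simply drop the zero rows before plotting is a judgment call; either way the choice should be stated on the axis label.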

Rockefeller University, Bioinformatics Resource Centre

https://rockefelleruniversity.github.io/RU_tidyverse/
