These exercises cover the sections of Data wrangling with tidy.
All files can be found in the “dataset” directory.
Exercise 7
Hint:
Counts per million (CPM) are the gene counts normalized to total counts in a sample, multiplied by a million to give you a sensible number.
gene_A_CPM = (gene_A_counts / sum(all_genes_counts)) * 1,000,000
Transcripts per million (TPM) are the gene counts first normalized to gene length, then normalized to the sum of these length-normalized counts in a sample, multiplied by a million to give you a sensible number.
gene_A_TPM = ((gene_A_counts / gene_A_length) / sum(all_genes_counts / all_genes_lengths)) * 1,000,000
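As a quick check of the two formulas, here is a minimal sketch in base R for a single sample. The three genes, their counts and their lengths are made up for illustration and are not taken from the exercise data.

```r
# Toy example: three hypothetical genes in one sample (made-up numbers).
counts  <- c(gene_A = 100,  gene_B = 300,  gene_C = 600)   # raw read counts
lengths <- c(gene_A = 1000, gene_B = 2000, gene_C = 3000)  # gene lengths in bp

# CPM: divide each count by the total counts in the sample, times a million.
cpm <- counts / sum(counts) * 1e6

# TPM: divide each count by its gene length first,
# then divide by the sum of these length-normalized counts, times a million.
rate <- counts / lengths
tpm  <- rate / sum(rate) * 1e6

round(cpm)
round(tpm)

# Both units add up to one million within a sample.
sum(cpm)
sum(tpm)
```

In this toy sample gene_C has the most reads, so it gets the highest CPM; because it is also the longest gene, its share of the total is smaller in TPM than in CPM.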
More info on RNAseq counts quantification here: http://luisvalesilva.com/datasimple/rna-seq_units.html
# Simple X-Y plot comparing TPM and CPM.
p <- tidy_counts_expressed_norm %>%
  ggplot(aes(x = CPM, y = TPM)) +
  geom_point() +
  scale_x_continuous(name = "log2(CPM)", trans = "log2") +
  scale_y_continuous(name = "log2(TPM)", trans = "log2")
p
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous y-axis
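These warnings most likely come from genes with a count of zero in some sample: log2(0) is -Inf, so those points cannot be drawn on a log2 axis. The sketch below shows the effect on a small made-up data frame (`toy` is a hypothetical stand-in for tidy_counts_expressed_norm) and one possible way to drop such rows before plotting.

```r
# Hypothetical stand-in for tidy_counts_expressed_norm (made-up values).
toy <- data.frame(CPM = c(0, 10, 250), TPM = c(0, 22, 480))

# The first value is -Inf, which is what triggers the warning.
log2(toy$CPM)

# Keep only rows where both values are positive.
toy_pos <- toy[toy$CPM > 0 & toy$TPM > 0, ]
toy_pos
```

In the pipeline above, the dplyr equivalent would be a `filter(CPM > 0, TPM > 0)` step before `ggplot()`. Whether to drop the zeros or keep them, for example by adding a small pseudocount, depends on what the plot is meant to show.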