RNAseq (part 2)


What we will cover

  • Session 1: Alignment and counting

  • Session 2: Differential gene expression analysis

  • Session 3: Visualizing results through PCA and clustering

  • Session 4: Gene Set Analysis

  • Session 5: Differential Transcript Utilization analysis

The data

Over these sessions we will review data from Christina Leslie’s lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data here and activated T-Reg data here.

The data

Several intermediate files have already been made for you to use. You can download the course content from GitHub here. Once downloaded you will need to unzip the folder and then navigate into the r_course directory of the download. This will mean the paths in the code are correct.

setwd("Path/to/Download/RU_RNAseq-master")

The data

All Gene and Exon counts can be found as .RData obects in data directory

  • Counts in genes can be found at - data/GeneCounts.Rdata

  • Counts in disjoint exons can be found at - data/ExonCounts.Rdata

  • Salmon transcript quantification output directories can be found under - data/Salmon/

Counting with multiple RNAseq datasets


Counts from SummarizeOverlaps

In our last session we counted our reads in genes using the summarize overlaps function to generate our RangedSummarizedExperiment object. We did this with a single BAM file.

To be able to do differntial gene expression analysis we need replicates. We can use similar approaches for handle multiple files. First I am using a BamFileList of all the T-Reg samples to easily organize my samples and save memory with the yieldSize parameter.

library(Rsamtools)
bamFilesToCount <- c("Sorted_Treg_1.bam", "Sorted_Treg_2.bam", "Sorted_Treg_act_1.bam",
    "Sorted_Treg_act_2.bam", "Sorted_Treg_act_3.bam")
names(bamFilesToCount) <- c("Sorted_Treg_1", "Sorted_Treg_2", "Sorted_Treg_act_1",
    "Sorted_Treg_act_2", "Sorted_Treg_act_3")
myBams <- BamFileList(bamFilesToCount, yieldSize = 10000)

Counts from SummarizeOverlaps

We can count over this BamFileList in the same way we did for a single sample to produce our RangedSummarizedExperiment object of counts in genes across all samples.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicAlignments)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
geneCounts <- summarizeOverlaps(geneExons, myBams, ignore.strand = TRUE)
geneCounts
## class: RangedSummarizedExperiment 
## dim: 24100 5 
## metadata(0):
## assays(1): ''
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(0):

Speeding up SummarizeOverlaps

Bioconductor has a simple, unified system for parallelization through the BiocParallel package. The BiocParallel package allows for differing parallel approaches including multiple cores on the same machine and distributed computing across large cluster environments (i.e a HPC).

Simply by installing and loading library, we are ready for parallelization.

library(BiocParallel)

Speeding up SummarizeOverlaps

We can control the parallelization in BiocParallel often choosing either a serial mode (no parallelization) with SerialParam() function or the number of cores to use with MulticoreParam(workers=NUMBEROFCORES).

We use the register() function to set the desired parallelization. Once registered you can then start counting using a parallelization.

paramMulti <- MulticoreParam(workers = 2)
paramSerial <- SerialParam()
register(paramSerial)

Counts from SummarizeOverlaps

The counts object has already been made for you to save time. It is in the data directory

The counrs object is a RangedSummarizedExperiment, which contains our counts as a matrix accessible by the assay() function.

Each row is named by its Entrez gene ID.

load("data/GeneCounts.RData")
assay(geneCounts)[1:2, ]
##        Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1 Sorted_T_reg_act_2
## 497097              0              1                  0                  0
## 19888               0              0                  0                  0
##        Sorted_T_reg_act_3
## 497097                  0
## 19888                  11

Counts from SummarizeOverlaps

We can retrieve the GRangesList we counted by the rowRanges() accessor function.

Each GRangesList element contains exons used in counting, has Entrez gene ID and is in matching order as rows in our count table.

rowRanges(geneCounts)[1:2, ]
## GRangesList object of length 2:
## $`497097`
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames          ranges strand |   exon_id   exon_name
##          <Rle>       <IRanges>  <Rle> | <integer> <character>
##   [1]     chr1 3214482-3216968      - |      8008        <NA>
##   [2]     chr1 3421702-3421901      - |      8009        <NA>
##   [3]     chr1 3670552-3671498      - |      8012        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
## 
## $`19888`
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames          ranges strand |   exon_id   exon_name
##          <Rle>       <IRanges>  <Rle> | <integer> <character>
##   [1]     chr1 4290846-4293012      - |      8013        <NA>
##   [2]     chr1 4343507-4350091      - |      8014        <NA>
##   [3]     chr1 4351910-4352081      - |      8015        <NA>
##   [4]     chr1 4352202-4352837      - |      8016        <NA>
##   [5]     chr1 4360200-4360314      - |      8017        <NA>
##   [6]     chr1 4409170-4409241      - |      8018        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Differential gene expression analysis


Differential expression with RNAseq

To perform differential expression between our sample groups we will need to perform some essential steps.

  • Transform and/or normalize our counts.

  • Calculate significance of differences between groups from normalized/transformed counts.

Log transform or as counts

Two broadly different approaches to RNAseq analysis could be.

  • Log2 transform, normalize, use standard test for normal distribution to find differential expression.

  • Normalize and use test suitable for count data to find differential expression.

Normalization

As with microarrays, for the majority of normalization approaches we assume that the composition of RNA species across samples is similar and total RNA levels as the same.

To account for technical differences in library composition, many of these approached identify scaling factors for each sample which may be used to normalized observed/measured counts.

Simple measures such as reads per million (RPM) are scaled to total mapped reads. More sophisticated methods, such as used in the package DESeq2 find scaling factors from the distribution of genes relative log expression versus their respective average across all samples.

offset

offset

Mean and variance relationship

Most approaches seek to model the relationship between mean expression and variance of expression in order to model and shrink variance of individual genes. This will assist us to detect differential expression with a small number of replicates.

Differential Expression Packages

Options for differential expression analysis in RNAseq data are available in R including DESeq2, EdgeR and Voom/Limma.

Both EdgeR and DESeq2 works with non-transformed count data to detect differential expression.

Voom transforms count data into log2 values with associated mean dependent weighting for use in Limma differential expression functions.

We will use the DESeq2 software to identify differential expression.

DESeq2

To use DESeq2 we must first produce a dataframe of our Sample groups.

The DESeq2 package contains a workflow for assessing local changes in fragment/read abundance between replicated conditions. This workflow includes normalization, variance estimation, outlier removal/replacement as well as significance testing suited to high throughput sequencing data (i.e. integer counts).

To use DESeq2 we must first produce a dataframe of our Sample groups.

metaData <- data.frame(Group = c("Naive", "Naive", "Act", "Act", "Act"), row.names = colnames(geneCounts))
metaData
##                    Group
## Sorted_T_reg_1     Naive
## Sorted_T_reg_2     Naive
## Sorted_T_reg_act_1   Act
## Sorted_T_reg_act_2   Act
## Sorted_T_reg_act_3   Act

Creating a DESeq2 object

We can use the DESeqDataSetFromMatrix() function to create a DESeq2 object.

We must provide our matrix of counts to countData parameter, our metadata data.frame to colData parameter and we include to an optional parameter of rowRanges the non-redundant peak set we can counted on.

Finally we provide the name of the column in our metadata data.frame within which we wish to test to the design parameter.

countMatrix <- assay(geneCounts)
countGRanges <- rowRanges(geneCounts)
dds <- DESeqDataSetFromMatrix(countMatrix, colData = metaData, design = ~Group, rowRanges = countGRanges)
dds
## class: DESeqDataSet 
## dim: 24100 5 
## metadata(1): version
## assays(1): counts
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(1): Group

Creating a DESeq2 object

As an alternative input, we can first update our RangedSummarizedExperiment object to include our metadata using the colData accessor.

colData(geneCounts)$Group <- metaData$Group
geneCounts
## class: RangedSummarizedExperiment 
## dim: 24100 5 
## metadata(0):
## assays(1): ''
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(1): Group

Creating a DESeq2 object

Now we can make use of the DESeqDataSet() function to build directly from our RangedSummarizedExperiment object.

We simply have to specify the design parameter to specify metadata column to test on.

dds <- DESeqDataSet(geneCounts, design = ~Group)
## renaming the first element in assays to 'counts'
dds
## class: DESeqDataSet 
## dim: 24100 5 
## metadata(1): version
## assays(1): counts
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(1): Group

DESeq function

We can now run the DESeq2 workflow on our DESeq2 object using the DESeq() function. This function will normalize library sizes, estimate and shrink variance and test our data in a single step.

Our DESeq2 object is updated to include useful statistics such our normalized values and variance of signal within each gene.

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Normalization

In its first step the DEseq() function normalizes our data by evaluating the median expression of genes across all samples to produce a per sample normalization factor (library size).

We can retrieve normalized and unnormalized values from our DESeq2 object using the counts() function and specifying the normalized parameter as TRUE. Normalized counts are counts divided by the library scaling factor.

normCounts <- counts(dds, normalized = TRUE)
normCounts[1:2, ]
##        Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1 Sorted_T_reg_act_2
## 497097              0      0.9202987                  0                  0
## 19888               0      0.0000000                  0                  0
##        Sorted_T_reg_act_3
## 497097            0.00000
## 19888            10.12247

Variance Estimation

In the second step the DEseq() function estimates variances and importantly shrinks variance depending on the mean. This shrinking of variance allows us to detect significant changes with low replicate number.

We can review the Variance/mean relationship and shrinkage using plotDispEsts() function and our DESeq2 object. We can see the adjusted dispersions in blue and original dispersion for genes in black.

plotDispEsts(dds)

DESeq results

Finally we can extract our contrast of interest using the results() function.

We must specify the contrast parameter, listing the metadata column of interest and the two groups to compare.

The resulting DESeqResults can be sorted by pvalue to list most significant changes at top of table.

myRes <- results(dds, contrast = c("Group", "Act", "Naive"))
myRes <- myRes[order(myRes$pvalue), ]
myRes[1:3, ]
## log2 fold change (MLE): Group Act vs Naive 
## Wald test p-value: Group Act vs Naive 
## DataFrame with 3 rows and 6 columns
##        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## 54167  20595.52        2.22416 0.0513373   43.3244         0         0
## 20198   4896.29        3.41735 0.0782338   43.6812         0         0
## 20200   6444.62        3.19637 0.0748192   42.7212         0         0

DESeqResults objects

The DESeqResults object contains important information from our differential expression testing.

  • baseMean - Mean normalised values across all samples.
  • log2FoldChange - Log2 of fold change between groups being compared.
  • pvalue - Significance of change between groups.
  • padj - Significance of change between groups corrected for multiple testing.

DESeqResults objects

With our DESeqResults object we can get a very quick overview of differential expression changes using the summary function.

summary(myRes)
## 
## out of 19776 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2819, 14%
## LFC < 0 (down)     : 2373, 12%
## outliers [1]       : 0, 0%
## low counts [2]     : 6443, 33%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results

DESeqResults objects

We can review the relationship between fold-changes and expression levels with a MA-Plot by using the plotMA() function and our DESeqResults object.

plotMA(myRes)

DESeqResults objects

We downweight genes with high fold change but low significance (due to low counts/high dispersion) by using the lfcShrink() function as we have the results() function. This allows us to now use the log2FC as a measure of significance of change in our ranking for analysis and in programs such as GSEA.

myRes_lfc <- lfcShrink(dds, coef = "Group_Naive_vs_Act")
DESeq2::plotMA(myRes_lfc)

DESeqResults objects

We can also convert our DESeqResults objects into a standard data frame using the as.data.frame function.

myResAsDF <- as.data.frame(myRes)
myResAsDF[1:2, ]
##         baseMean log2FoldChange    lfcSE       stat    pvalue padj
## 497097 0.1840597      -1.353988 4.558285 -0.2970389 0.7664368   NA
## 19888  2.0244933       4.254714 3.557199  1.1960856 0.2316632   NA

Significance and Multiple Testing


Multiple testing

In our analysis we are testing 1000s of genes and identifing significance of each gene separately at a set confidence interval.

If we are 95% confident a gene is differentially expressed when we look at all genes we can expect 5% to be false.

When looking across all genes we apply a multiple testing correction to account for this.

offset

Multiple testing

We could apply a multiple testing to our p-values ourselves using either the Bonferroni or Benjamini-Hockberg correction.

  • Bonferonni = (p-value of gene)*(Total genes tested)
  • Benjamini-Hockberg = ((p-value of gene)*(Total genes tested))/(rank of p-value of gene)

Multiple testing

The total number of genes tested has a large effect on these multiple correction methods.

Independent filtering may be applied to reduce number of genes tested such as removing genes with low expression and/or variance.

We can not simply filter to differentially expressed genes and reapply correction. This is not independent!

NA in padj

By default DEseq2 filters out low expressed genes to assist in multiple testing correction. This results in NA values in padj column for low expressed genes.

table(is.na(myResAsDF$padj))
## 
## FALSE  TRUE 
## 13333 10767

NA in padj

We can add a column of our own adjusted p-values for all genes with a pvalue using the p.adjust() function.

myResAsDF$newPadj <- p.adjust(myResAsDF$pvalue)
myResAsDF[1:3, ]
##         baseMean log2FoldChange    lfcSE       stat    pvalue padj newPadj
## 497097 0.1840597      -1.353988 4.558285 -0.2970389 0.7664368   NA       1
## 19888  2.0244933       4.254714 3.557199  1.1960856 0.2316632   NA       1
## 20671  2.0122042      -3.633158 2.891907 -1.2563189 0.2090004   NA       1

NA in padj

Typcially genes with NA padj values should be filtered from the table for later evaluation and functional testing.

myResAsDF <- myResAsDF[!is.na(myResAsDF$padj), ]
myResAsDF <- myResAsDF[order(myResAsDF$pvalue), ]
myResAsDF[1:3, ]
##        baseMean log2FoldChange      lfcSE     stat pvalue padj newPadj
## 54167 20595.525       2.224161 0.05133731 43.32445      0    0       0
## 20198  4896.288       3.417348 0.07823384 43.68120      0    0       0
## 20200  6444.617       3.196367 0.07481916 42.72124      0    0       0

DEseq2 and Salmon


DESeq2 from Salmon

We saw at the end of last session how we can gain transcript quantification from FASTQ using Salmon.

Salmon offers a fast alternative to alignment and counting for transcript expression estimation so we would be keen to make use of this in our differential expression analysis.

To make use of Salmon transcript quantifications for differential gene expression changes we must however summarize our transcript expression estimates to gene expression estimates.

tximport

The tximport package, written by same author as DESeq2, offers functions to import data from a wide range of transcript counting and quantification software and summarizes transcript expression to genes.

We can first load the Bioconductor tximport package.

library(tximport)

tximport

In order to summarize our transcripts to genes we will need to provide a data.frame of transcripts to their respective gene names.

First we read in a single Salmon quantification report to get a table containing all transcript names.

temp <- read.delim("data/Salmon/TReg_2_Quant/quant.sf")

temp[1:3, ]
##                   Name Length EffectiveLength     TPM NumReads
## 1 ENSMUST00000193812.1   1070         820.000 0.00000        0
## 2 ENSMUST00000082908.1    110           3.749 0.00000        0
## 3 ENSMUST00000192857.1    480         230.000 0.24391        2

tximport

We can now use the select() function from AnnotationDbi to retrieve a transcript names (TXNAME) to gene ids (GENEID) map from our TxDb.Mmusculus.UCSC.mm10.knownGene object.

Remember that available columns and keys can be found using the columns() and keytype functions respectively.

Tx2Gene <- select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys = as.vector(temp[, 1]),
    keytype = "TXNAME", columns = c("GENEID", "TXNAME"))
Tx2Gene <- Tx2Gene[!is.na(Tx2Gene$GENEID), ]
Tx2Gene[1:10, ]
##                   TXNAME GENEID
## 14  ENSMUST00000134384.7  18777
## 15 ENSMUST00000027036.10  18777
## 16  ENSMUST00000150971.7  18777
## 17  ENSMUST00000155020.1  18777
## 18  ENSMUST00000119612.8  18777
## 19  ENSMUST00000137887.7  18777
## 20  ENSMUST00000115529.7  18777
## 21  ENSMUST00000131119.1  18777
## 22  ENSMUST00000141278.1  18777
## 23 ENSMUST00000081551.13  21399

tximport

Now we can use the tximport function to import and summarize our Salmon quantification files to gene level expression estimates.

We must provide the paths to Salmon quant.sf files, the type of file to import (here “salmon”) to the type argument and our data.frame of transcript to gene mapping using the tx2gene argument.

salmonQ <- dir("data/Salmon/", recursive = T, pattern = "quant.sf", full.names = T)
salmonCounts <- tximport(salmonQ, type = "salmon", tx2gene = Tx2Gene)
## reading in files with read.delim (install 'readr' package for speed up)
## 1 2 3 4 5 
## transcripts missing from tx2gene: 38851
## summarizing abundance
## summarizing counts
## summarizing length

tximport

The result from tximport is a list containing our summarized gene expression estimates

Abundance - Number of transcripts per million transcripts (TPM).
Counts - Estimated counts.

salmonCounts$abundance[1:2, ]
##               [,1]     [,2]     [,3]     [,4]     [,5]
## 100009600 1.193915 1.664778 0.623252 1.117103 1.255239
## 100009609 0.079162 0.090154 0.062317 0.067593 0.041142
salmonCounts$counts[1:2, ]
##             [,1]    [,2]   [,3]   [,4]   [,5]
## 100009600 73.602 100.600 34.117 74.385 75.221
## 100009609 27.280  30.453 19.069 25.160 13.782

DESeq2 from Tximport

We can now use the DESeqDataSetFromTximport() to build our DESeq2 object from salmon counts. We must specify the metadata to colData parameter and specify the column of interest in design parameter as we have done for DESeqDataSetFromMatrix().

ddsSalmon <- DESeqDataSetFromTximport(salmonCounts, colData = metaData, design = ~Group)
## using counts and average transcript lengths from tximport

DESeq2 from Tximport

We can then proceed as we did for previous analysis using summarizeOverlaps counts.

ddsSalmon <- DESeq(ddsSalmon)
myResS <- results(ddsSalmon, contrast = c("Group", "Act", "Naive"))
myResS <- myResS[order(myResS$pvalue), ]
myResS[1:3, ]
## log2 fold change (MLE): Group Act vs Naive 
## Wald test p-value: Group Act vs Naive 
## DataFrame with 3 rows and 6 columns
##         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## 110454  20032.40        2.81696 0.0569895   49.4295         0         0
## 12772    4452.24        3.87109 0.1007898   38.4075         0         0
## 14939    6154.51        5.75391 0.1076499   53.4502         0         0

Salmon and Rsubread

We can now compare the differential following Salmon quantification and Rsubread/summarizeOverlaps. Here we are showing the overlaps between genes that have a padj <0.05 and also a fold change greater then 2.

Adding Annotation


Annotation of results table

By whatever method we have chosen to analyze differential expression we will want to add some sensible gene names or symbols to our data frame of results.

We can use the org.db packages to retrieve Gene Symbols for our Entrez IDs using the select function. Here we use the org.db package for mouse, org.db

library(org.Mm.eg.db)
eToSym <- select(org.Mm.eg.db,
                 keys = rownames(myResAsDF),
                 keytype = "ENTREZID",
                 columns="SYMBOL")
eToSym[1:10,]
##    ENTREZID    SYMBOL
## 1     54167      Icos
## 2     20198    S100a4
## 3     20200    S100a6
## 4     12772      Ccr2
## 5     16407     Itgae
## 6     14939      Gzmb
## 7    110454      Ly6a
## 8     20715 Serpina3g
## 9     93692      Glrx
## 10    12766     Cxcr3

Annotation of results table

Now we can merge the Entrez ID to Symbol table into our table of differential expression results.

annotatedRes <- merge(eToSym,myResAsDF,
                      by.x=1,
                      by.y=0,
                      all.x=FALSE,
                      all.y=TRUE)
annotatedRes <- annotatedRes[order(annotatedRes$pvalue),]
annotatedRes[1:3,]
##      ENTREZID SYMBOL  baseMean log2FoldChange      lfcSE     stat pvalue padj
## 1027   110454   Ly6a 17468.571       2.809022 0.04994700 56.24006      0    0
## 1690    12772   Ccr2  8529.015       4.011062 0.08476422 47.32022      0    0
## 2339    14939   Gzmb  6260.651       5.733767 0.09678177 59.24428      0    0
##      newPadj
## 1027       0
## 1690       0
## 2339       0

Time for an exercise

Link_to_exercises

Link_to_answers