Analysis of RNAseq data in R and Bioconductor (part 5)

class: middle, inverse, title-slide

.title[
# Analysis of RNAseq data in R and Bioconductor (part 5)
]
.subtitle[
## <html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html><br />
Bioinformatics Resource Center - Rockefeller University
]
.author[
### <a href="http://rockefelleruniversity.github.io/RU_RNAseq/" class="uri">http://rockefelleruniversity.github.io/RU_RNAseq/</a>
]
.author[
### <a href="mailto:brc@rockefeller.edu" class="email">brc@rockefeller.edu</a>
]

---

## What we will cover

* Session 1: Alignment and counting

* Session 2: Differential gene expression analysis

* Session 3: Visualizing results through PCA and clustering

* Session 4: Gene Set Analysis

* _Session 5: Differential Transcript Utilization analysis_

---
## The data

In this session we will be reviewing data from  Jesús Gil's lab at MRC from a splicing factor knockdown (PTBP1 RNAi) RNAseq experiment. Metadata and raw files can be found on GEO [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101763).

I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps().

---
## The data

All Gene and Exon counts can be found as .RData obects in **data** directory

- Counts in disjoint exons can be found at - **data/senescence_ExonCounts.RData**

---
## What we will cover

In our first session we looked at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files.

In our second session we imported our expression estimates to identify differences in the expression of genes between two conditions.

In this session we will look at how we can identify differences in the expression of genes between multiple conditions and how we can identify changes in exon usage to infer changes in transcript usage.

---
## Counts from SummarizeOverlaps

As we saw in session 2, we can use a **BamFileList** to handle multiple BAM files, and **yieldSize** to control memory usage.

``` r
library(Rsamtools)
bamFilesToCount <- c("Sorted_shPTBP1_53_rep1.bam","Sorted_shPTBP1_53_rep2.bam","Sorted_shPTBP1_53_rep3.bam",
                     "Sorted_vec_4OHT_rep1.bam","Sorted_vec_4OHT_rep2.bam","Sorted_vec_4OHT_rep3.bam" )
myBams <- BamFileList(bamFilesToCount,yieldSize = 10000)
```

---
## Counts from SummarizeOverlaps

For differential exon usage we must use a non-overlapping, disjoint set of exons to count on.

We can produce a non-overlapping set of exons by using the **exonicParts()** function. This is equivalent of us having to create GFF file and using the python script wrapped up in DEXseq package (*dexseq_prepare_annotation.py*).

``` r
library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
nonOverlappingExons <- exonicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)

names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part,
    sep = "_")
nonOverlappingExons[1:2, ]
```

---
## Disjoint Exons

![](imgs/collapsedE1.png)

---
## Disjoint Exons

![](imgs/collapsedE2.png)

---
## Counts from SummarizeOverlaps

Now we have our disjoint exons to count on we can use the **summarizeOverlaps()**.

``` r
library(GenomicAlignments)
senescence_ExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE,
    inter.feature = FALSE)
senescence_ExonCounts
```

```
## class: RangedSummarizedExperiment 
## dim: 269121 6 
## metadata(0):
## assays(1): counts
## rownames(269121): 1_1 1_2 ... 9997_3 9997_4
## rowData names(3): gene_id tx_name exonic_part
## colnames(6): Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ...
##   Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## colData names(1): condition
```

---
## Counts from SummarizeOverlaps

Our **RangedSummarizedExperiment** contains our counts as a matrix accessible by the **assay()** function.

Each row is named by our Exon ID, geneID + exon number.

``` r
assay(senescence_ExonCounts)[1:2, ]
```

```
##     Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam
## 1_1                        400                        489
## 1_2                        573                        970
##     Sorted_shPTBP1_53_rep3.bam Sorted_vec_4OHT_rep1.bam
## 1_1                        317                      326
## 1_2                        455                      452
##     Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## 1_1                      650                      508
## 1_2                      801                      885
```

---
## Counts from SummarizeOverlaps

We can retrieve the GRanges we counted by the **rowRanges()** accessor function.

The GRanges rows contains our disjoint exons used in counting and is in matching order as rows in our count table.

``` r
rowRanges(senescence_ExonCounts)[1:5, ]
```

```
## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames            ranges strand |         gene_id               tx_name
##          <Rle>         <IRanges>  <Rle> | <CharacterList>       <CharacterList>
##   1_1    chr19 58858172-58858395      - |               1            uc002qsd.4
##   1_2    chr19 58858719-58859006      - |               1            uc002qsd.4
##   1_3    chr19 58859832-58860494      - |               1            uc002qsf.2
##   1_4    chr19 58860934-58861735      - |               1            uc002qsf.2
##   1_5    chr19 58861736-58862017      - |               1 uc002qsd.4,uc002qsf.2
##       exonic_part
##         <integer>
##   1_1           1
##   1_2           2
##   1_3           3
##   1_4           4
##   1_5           5
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
```

---
class: inverse, center, middle

# Differential transcript usage

---

## Differential transcript usage

Previously we have investigated how we may identify changes in the overall abundance on gene using RNAseq data.

![](imgs/dge.png)

---
## Differential transcript usage

We may also be interested in identifying any evidence of a changes in transcript usage within a gene.

![](imgs/dtu.png)

---
## Differential transcript usage

Due to the uncertainty in the assignment of reads to overlapping transcripts from the same genes, a common approach to investigate differential transcript usage is to define non-overlapping disjoint exons and identify changes in their usage.

![](imgs/deu2.png)

---
class: inverse, center, middle

# DEXSeq

---

## DEXSeq

Options for differential transcript usage are available in R including DEXseq and Voom/Limma.

DEXseq can be a little slow when fitting models but it has options for parallelization.

Voom/Limma is significantly faster by comparison, but provides a less informative output.

---
## DEXSeq inputs

The [**DEXseq** package](https://bioconductor.org/packages/release/bioc/html/DEXSeq.html) has a similar set up to DESeq2.

We first must define our metadata data.frame of sample groups, with sample rownames matching colnames of our **RangedSummarizedExperiment** object.

``` r
metaData <- data.frame(condition = c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence",
    "senescence", "senescence"), row.names = colnames(senescence_ExonCounts))
metaData
```

```
##                             condition
## Sorted_shPTBP1_53_rep1.bam shPTBP1_53
## Sorted_shPTBP1_53_rep2.bam shPTBP1_53
## Sorted_shPTBP1_53_rep3.bam shPTBP1_53
## Sorted_vec_4OHT_rep1.bam   senescence
## Sorted_vec_4OHT_rep2.bam   senescence
## Sorted_vec_4OHT_rep3.bam   senescence
```

---
## DEXseq inputs

Previously we have been using **DESeqDataSetFromMatrix()** constructor for DESeq2. In DEXSeq we can use the **DEXSeqDataSet** constructor in a similar manner.

As with the **DESeqDataSetFromMatrix()** function, we must provide the counts matrix, our metadata dataframe and our design specifying with metadata columns to test.

In addition to this we must provide our exon IDs and gene IDs to the allow for the association of exons to the genes they constitute.

``` r
countMatrix <- assay(senescence_ExonCounts)
countGRanges <- rowRanges(senescence_ExonCounts)
geneIDs <- as.vector(unlist(countGRanges$gene_id))
exonIDs <- rownames(countMatrix)
```

---
## DEXseq inputs

We must specify our design to include a sample and exon effect to distinguish change in exon expression from a change in exon usage.

Previously we simply added a design formula with our group of interest.

```
~ GROUPOFINTEREST
```

Now we specify.

```
~ sample + exon + GROUPOFINTEREST:exon
```

---
## Running DEXseq

Now we can use the **DEXSeqDataSet** constructor as we have **DESeqDataSetFromMatrix()** but in addition we specify the exon ids to the **featureID** argument and the gene ids to the **groupID** argument.

``` r
library(DEXSeq)
ddx <- DEXSeqDataSet(countMatrix, metaData, design = ~sample + exon + condition:exon,
    featureID = exonIDs, groupID = geneIDs)
ddx
```

```
## class: DEXSeqDataSet 
## dim: 269121 12 
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:1_1 1:1_2 ... 9997:9997_3 9997:9997_4
## rowData names(4): featureID groupID exonBaseMean exonBaseVar
## colnames: NULL
## colData names(3): sample condition exon
```

---
## Other DEXseq inputs

As we have generated a **RangedSummarizedExperiment** for exon counting, we can use the more straightforward  **DEXSeqDataSetFromSE** function as we used the **DESeqDataSet** function.

We can simply provide our **RangedSummarizedExperiment** and design to the **DEXSeqDataSetFromSE** function.

``` r
colData(senescence_ExonCounts)$condition <- factor(c("shPTBP1_53", "shPTBP1_53",
    "shPTBP1_53", "senescence", "senescence", "senescence"))
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
```

```
## class: DEXSeqDataSet 
## dim: 269121 12 
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:E001 1:E002 ... 9997:E003 9997:E004
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
##   transcripts
## colnames: NULL
## colData names(3): sample condition exon
```

---
## DEXseq (filtering low exon counts)

DEXseq is fairly computationally expensive so we can reduce the time taken by removing exons with little expression up front.

Here we can filter any exons which do not have 10 reads in at least two samples (our smallest group size is 3).

``` r
ToFilter <- apply(assay(senescence_ExonCounts), 1, function(x) sum(x > 10)) >= 2
senescence_ExonCounts <- senescence_ExonCounts[ToFilter, ]
table(ToFilter)
```

```
## ToFilter
##  FALSE   TRUE 
## 108125 160996
```

``` r
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
```

```
## class: DEXSeqDataSet 
## dim: 160996 12 
## metadata(1): version
## assays(1): counts
## rownames(160996): 1:E001 1:E002 ... 9997:E002 9997:E003
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
##   transcripts
## colnames: NULL
## colData names(3): sample condition exon
```

---
## Running DEXseq in steps

We can normalize, estimate dispersion and test for differential exon usage with the **estimateSizeFactors()**, **estimateDispersions()** and **testForDEU()** functions.

Finally we can estimate fold changes for exons using the **estimateExonFoldChanges()** function and specifying metadata column of interest to **fitExpToVar** parameter.

``` r
ddx <- estimateSizeFactors(ddx)
ddx <- estimateDispersions(ddx)
ddx <- testForDEU(ddx, reducedModel = ~sample + exon)
ddx = estimateExonFoldChanges(ddx, fitExpToVar = "condition")
```

---
## DEXseq results

Once have processed our data through the **DEXseq** function we can use the **DEXSeqResults()** function with our **DEXseq** object. We order table to arrange by pvalue.

``` r
dxr1 <- DEXSeqResults(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]
```

---
## DEXseq function

As with DESeq2, the DEXseq has a single workflow function which will normalize to library sizes, estimate dispersion and perform differential exon usage in one go. The returned object is our **DEXSeqResults** results object.

``` r
dxr1 <- DEXSeq(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]
```

---
## DEXseq results

The **DEXSeqResults** results object contains significantly more information than the DESeq2 results objects we have seen so far.

``` r
dxr1[1, ]
```

```
## 
## LRT p-value: full vs reduced
## 
## DataFrame with 1 row and 13 columns
##                groupID   featureID exonBaseMean dispersion      stat    pvalue
##            <character> <character>    <numeric>  <numeric> <numeric> <numeric>
## 57142:E008       57142        E008      21482.1 0.00455953   2036.23         0
##                 padj senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
##            <numeric>  <numeric>  <numeric>                      <numeric>
## 57142:E008         0    30.1855    63.8085                        4.12122
##                         genomicData             countData           transcripts
##                           <GRanges>              <matrix>                <list>
## 57142:E008 chr2:55252222-55254621:- 40618:35656:54755:... uc002ryd.3,uc002rye.3
```

---
## DEXseq results

The **DEXSeqResults** results object can be converted to a data.frame as with **DESeqResults** objects

``` r
as.data.frame(dxr1)[1, ]
```

```
##            groupID featureID exonBaseMean dispersion     stat pvalue padj
## 57142:E008   57142      E008     21482.14 0.00455953 2036.225      0    0
##            senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## 57142:E008   30.18552   63.80855                       4.121217
##            genomicData.seqnames genomicData.start genomicData.end
## 57142:E008                 chr2          55252222        55254621
##            genomicData.width genomicData.strand
## 57142:E008              2400                  -
##            countData.Sorted_shPTBP1_53_rep1.bam
## 57142:E008                                40618
##            countData.Sorted_shPTBP1_53_rep2.bam
## 57142:E008                                35656
##            countData.Sorted_shPTBP1_53_rep3.bam
## 57142:E008                                54755
##            countData.Sorted_vec_4OHT_rep1.bam
## 57142:E008                               1007
##            countData.Sorted_vec_4OHT_rep2.bam
## 57142:E008                                747
##            countData.Sorted_vec_4OHT_rep3.bam  transcripts
## 57142:E008                                838 uc002ryd....
```