RNAseq (part 5)


The data

In this session we will be reviewing data from Jesús Gil’s lab at MRC from a splicing factor knockdown (PTBP1 RNAi) RNAseq experiment. Metadata and raw files can be found on GEO here.

I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps().

The data

All Gene and Exon counts can be found as .RData obects in data directory

  • Counts in disjoint exons can be found at - data/senescence_ExonCounts.RData

What we will cover

In our first session we looked at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files.

In our second session we imported our expression estimates to identify differences in the expression of genes between two conditions.

In this session we will look at how we can identify differences in the expression of genes between multiple conditions and how we can identify changes in exon usage to infer changes in transcript usage.

Counts from SummarizeOverlaps

As we saw in session 2, we can use a BamFileList to handle multiple BAM files, and yieldSize to control memory usage.

library(Rsamtools)
bamFilesToCount <- c("Sorted_shPTBP1_53_rep1.bam","Sorted_shPTBP1_53_rep2.bam","Sorted_shPTBP1_53_rep3.bam",
                     "Sorted_vec_4OHT_rep1.bam","Sorted_vec_4OHT_rep2.bam","Sorted_vec_4OHT_rep3.bam" )
myBams <- BamFileList(bamFilesToCount,yieldSize = 10000)

Counts from SummarizeOverlaps

For differential exon usage we must use a non-overlapping, disjoint set of exons to count on.

We can produce a non-overlapping set of exons by using the disjointExons() function. This is equivalent of us having to create GFF file and using the python script wrapped up in DEXseq package (dexseq_prepare_annotation.py).

library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
nonOverlappingExons <- disjointExons(TxDb.Hsapiens.UCSC.hg19.knownGene)
## Warning: disjointExons() is deprecated. Please use exonicParts() instead.
names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part,
    sep = "_")
nonOverlappingExons[1:2, ]
## GRanges object with 2 ranges and 3 metadata columns:
##       seqnames            ranges strand |         gene_id         tx_name
##          <Rle>         <IRanges>  <Rle> | <CharacterList> <CharacterList>
##   1_1    chr19 58858172-58858395      - |               1      uc002qsd.4
##   1_2    chr19 58858719-58859006      - |               1      uc002qsd.4
##       exonic_part
##         <integer>
##   1_1           1
##   1_2           2
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Disjoint Exons

Disjoint Exons

Counts from SummarizeOverlaps

Now we have our disjoint exons to count on we can use the summarizeOverlaps().

library(GenomicAlignments)
senescence_ExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE,
    inter.feature = FALSE)
senescence_ExonCounts
## class: RangedSummarizedExperiment 
## dim: 269121 6 
## metadata(0):
## assays(1): counts
## rownames(269121): 1_1 1_2 ... 9997_3 9997_4
## rowData names(3): gene_id tx_name exonic_part
## colnames(6): Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ...
##   Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## colData names(1): condition

Counts from SummarizeOverlaps

Our RangedSummarizedExperiment contains our counts as a matrix accessible by the assay() function.

Each row is named by our Exon ID, geneID + exon number.

assay(senescence_ExonCounts)[1:2, ]
##     Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam
## 1_1                        400                        489
## 1_2                        573                        970
##     Sorted_shPTBP1_53_rep3.bam Sorted_vec_4OHT_rep1.bam
## 1_1                        317                      326
## 1_2                        455                      452
##     Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## 1_1                      650                      508
## 1_2                      801                      885

Counts from SummarizeOverlaps

We can retrieve the GRanges we counted by the rowRanges() accessor function.

The GRanges rows contains our disjoint exons used in counting and is in matching order as rows in our count table.

rowRanges(senescence_ExonCounts)[1:5, ]
## GRanges object with 5 ranges and 3 metadata columns:
##       seqnames            ranges strand |         gene_id               tx_name
##          <Rle>         <IRanges>  <Rle> | <CharacterList>       <CharacterList>
##   1_1    chr19 58858172-58858395      - |               1            uc002qsd.4
##   1_2    chr19 58858719-58859006      - |               1            uc002qsd.4
##   1_3    chr19 58859832-58860494      - |               1            uc002qsf.2
##   1_4    chr19 58860934-58861735      - |               1            uc002qsf.2
##   1_5    chr19 58861736-58862017      - |               1 uc002qsd.4,uc002qsf.2
##       exonic_part
##         <integer>
##   1_1           1
##   1_2           2
##   1_3           3
##   1_4           4
##   1_5           5
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome

Differential transcript usage

## Differential transcript usage
Previously we have investigated how we may identify changes in the overall abundance on gene using RNAseq data.
## Differential transcript usage
We may also be interested in identifying any evidence of a changes in transcript usage within a gene.
## Differential transcript usage
Due to the uncertainty in the assignment of reads to overlapping transcripts from the same genes, a common approach to investigate differential transcript usage is to define non-overlapping disjoint exons and identify changes in their usage.
# DEXSeq

DEXSeq

Options for differential transcript usage are available in R including DEXseq and Voom/Limma.

DEXseq can be a little slow when fitting models but it has options for parallelization.

Voom/Limma is significantly faster by comparison, but provides a less informative output.

DEXSeq inputs

The DEXseq package has a similar set up to DESeq2.

We first must define our metadata data.frame of sample groups, with sample rownames matching colnames of our RangedSummarizedExperiment object.

metaData <- data.frame(condition = c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence",
    "senescence", "senescence"), row.names = colnames(senescence_ExonCounts))
metaData
##                             condition
## Sorted_shPTBP1_53_rep1.bam shPTBP1_53
## Sorted_shPTBP1_53_rep2.bam shPTBP1_53
## Sorted_shPTBP1_53_rep3.bam shPTBP1_53
## Sorted_vec_4OHT_rep1.bam   senescence
## Sorted_vec_4OHT_rep2.bam   senescence
## Sorted_vec_4OHT_rep3.bam   senescence

DEXseq inputs

Previously we have been using DESeqDataSetFromMatrix() constructor for DESeq2. In DEXSeq we can use the DEXSeqDataSet constructor in a similar manner.

As with the DESeqDataSetFromMatrix() function, we must provide the counts matrix, our metadata dataframe and our design specifying with metadata columns to test.

In addition to this we must provide our exon IDs and gene IDs to the allow for the association of exons to the genes they constitute.

countMatrix <- assay(senescence_ExonCounts)
countGRanges <- rowRanges(senescence_ExonCounts)
geneIDs <- as.vector(unlist(countGRanges$gene_id))
exonIDs <- rownames(countMatrix)

DEXseq inputs

We must specify our design to include a sample and exon effect to distinguish change in exon expression from a change in exon usage.

Previously we simply added a design formula with our group of interest.

~ GROUPOFINTEREST

Now we specify.

~ sample + exon + GROUPOFINTEREST:exon

Running DEXseq

Now we can use the DEXSeqDataSet constructor as we have DESeqDataSetFromMatrix() but in addition we specify the exon ids to the featureID argument and the gene ids to the groupID argument.

library(DEXSeq)
ddx <- DEXSeqDataSet(countMatrix, metaData, design = ~sample + exon + condition:exon,
    featureID = exonIDs, groupID = geneIDs)
ddx
## class: DEXSeqDataSet 
## dim: 269121 12 
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:1_1 1:1_2 ... 9997:9997_3 9997:9997_4
## rowData names(4): featureID groupID exonBaseMean exonBaseVar
## colnames: NULL
## colData names(3): sample condition exon

Other DEXseq inputs

As we have generated a RangedSummarizedExperiment for exon counting, we can use the more straightforward DEXSeqDataSetFromSE function as we used the DESeqDataSet function.

We can simply provide our RangedSummarizedExperiment and design to the DEXSeqDataSetFromSE function.

colData(senescence_ExonCounts)$condition <- factor(c("shPTBP1_53", "shPTBP1_53",
    "shPTBP1_53", "senescence", "senescence", "senescence"))
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
## class: DEXSeqDataSet 
## dim: 269121 12 
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:E001 1:E002 ... 9997:E003 9997:E004
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
##   transcripts
## colnames: NULL
## colData names(3): sample condition exon

DEXseq (filtering low exon counts)

DEXseq is fairly computationally expensive so we can reduce the time taken by removing exons with little expression up front.

Here we can filter any exons which do not have 10 reads in at least two samples (our smallest group size is 3).

ToFilter <- apply(assay(senescence_ExonCounts), 1, function(x) sum(x > 10)) >= 2
senescence_ExonCounts <- senescence_ExonCounts[ToFilter, ]
table(ToFilter)
## ToFilter
##  FALSE   TRUE 
## 108125 160996
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
## class: DEXSeqDataSet 
## dim: 160996 12 
## metadata(1): version
## assays(1): counts
## rownames(160996): 1:E001 1:E002 ... 9997:E002 9997:E003
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
##   transcripts
## colnames: NULL
## colData names(3): sample condition exon

Running DEXseq in steps

We can normalize, estimate dispersion and test for differential exon usage with the estimateSizeFactors(), estimateDispersions() and testForDEU() functions.

Finally we can estimate fold changes for exons using the estimateExonFoldChanges() function and specifying metadata column of interest to fitExpToVar parameter.

ddx <- estimateSizeFactors(ddx)
ddx <- estimateDispersions(ddx)
ddx <- testForDEU(ddx, reducedModel = ~sample + exon)
ddx = estimateExonFoldChanges(ddx, fitExpToVar = "condition")

DEXseq results

Once have processed our data through the DEXseq function we can use the DEXSeqResults() function with our DEXseq object. We order table to arrange by pvalue.

dxr1 <- DEXSeqResults(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]

DEXseq function

As with DESeq2, the DEXseq has a single workflow function which will normalize to library sizes, estimate dispersion and perform differential exon usage in one go. The returned object is our DEXSeqResults results object.

dxr1 <- DEXSeq(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]

DEXseq results

The DEXSeqResults results object contains significantly more information than the DESeq2 results objects we have seen so far.

dxr1[1, ]
## 
## LRT p-value: full vs reduced
## 
## DataFrame with 1 row and 13 columns
##                groupID   featureID exonBaseMean dispersion      stat    pvalue
##            <character> <character>    <numeric>  <numeric> <numeric> <numeric>
## 57142:E008       57142        E008      21482.1 0.00455953   2036.23         0
##                 padj senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
##            <numeric>  <numeric>  <numeric>                      <numeric>
## 57142:E008         0    30.1855    63.8085                        4.12122
##                         genomicData             countData           transcripts
##                           <GRanges>              <matrix>                <list>
## 57142:E008 chr2:55252222-55254621:- 40618:35656:54755:... uc002ryd.3,uc002rye.3

DEXseq results

The DEXSeqResults results object can be converted to a data.frame as with DESeqResults objects

as.data.frame(dxr1)[1, ]
##            groupID featureID exonBaseMean dispersion     stat pvalue padj
## 57142:E008   57142      E008     21482.14 0.00455953 2036.225      0    0
##            senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## 57142:E008   30.18552   63.80855                       4.121217
##            genomicData.seqnames genomicData.start genomicData.end
## 57142:E008                 chr2          55252222        55254621
##            genomicData.width genomicData.strand
## 57142:E008              2400                  -
##            countData.Sorted_shPTBP1_53_rep1.bam
## 57142:E008                                40618
##            countData.Sorted_shPTBP1_53_rep2.bam
## 57142:E008                                35656
##            countData.Sorted_shPTBP1_53_rep3.bam
## 57142:E008                                54755
##            countData.Sorted_vec_4OHT_rep1.bam
## 57142:E008                               1007
##            countData.Sorted_vec_4OHT_rep2.bam
## 57142:E008                                747
##            countData.Sorted_vec_4OHT_rep3.bam  transcripts
## 57142:E008                                838 uc002ryd....

Visualizing DEXseq results

A very useful feature of the DEXSeqResults results objects is the ability to visualise the differential exons along our gene using the plotDEXSeq() function and providing our DEXSeqResults results object and gene of interest to plot.

plotDEXSeq(dxr1, "57142")

Visualizing DEXseq results

We can specify the parameter displayTranscripts to TRUE to see the full gene model.

plotDEXSeq(dxr1, "57142", displayTranscripts = TRUE)

Visualizing DEXseq results

Visualizing DEXseq results

The differential exon usage for Entrez gene 7168 is more complex.

plotDEXSeq(dxr1, "7168", displayTranscripts = TRUE)

Visualizing DEXseq results

Visualizing DEXseq results

We can produce an HTML report for selected genes using the DEXSeqHTML function. By default this will create a file called testForDEU.html within a folder DEXSeqReport in the current working directory.

DEXSeqHTML(dxr1, "57142")

Link to result here

Gene Q values

It may be desirable to obtain a pvalue summarizing the differential exon usage to a gene level.

We can obtain an adjusted p-value per gene with the perGeneQValue() with our DEXSeqResults results object.

The result is a vector of adjusted p-values named by their corresponding gene.

geneQ <- perGeneQValue(dxr1)
geneQ[1:10]
##         1       100      1000     10000     10001 100033414 100037417 100048912 
##         1         1         1         1         1         1         1         1 
## 100049076     10005 
##         1         1

Gene Q values

We could merge this back to our full results table to provides exon and gene level statistics in one report.

gQFrame <- as.data.frame(geneQ, row.names = names(geneQ))
dxDF <- as.data.frame(dxr1)
dxDF <- merge(gQFrame, dxDF, by.x = 0, by.y = 1, all = TRUE)
dxDF <- dxDF[order(dxDF$geneQ, dxDF$pvalue), ]
dxDF[1:2, ]
##        Row.names geneQ featureID exonBaseMean  dispersion     stat
## 97377      57142     0      E008    21482.144 0.004559530 2036.225
## 117732      7168     0      E009     2805.899 0.004776165 1092.222
##               pvalue          padj senescence shPTBP1_53
## 97377   0.000000e+00  0.000000e+00   30.18552   63.80855
## 117732 1.619401e-239 1.246016e-234   49.34291   18.36318
##        log2fold_shPTBP1_53_senescence genomicData.seqnames genomicData.start
## 97377                        4.121217                 chr2          55252222
## 117732                      -4.264727                chr15          63353397
##        genomicData.end genomicData.width genomicData.strand
## 97377         55254621              2400                  -
## 117732        63353472                76                  +
##        countData.Sorted_shPTBP1_53_rep1.bam
## 97377                                 40618
## 117732                                  619
##        countData.Sorted_shPTBP1_53_rep2.bam
## 97377                                 35656
## 117732                                  668
##        countData.Sorted_shPTBP1_53_rep3.bam countData.Sorted_vec_4OHT_rep1.bam
## 97377                                 54755                               1007
## 117732                                  705                               4442
##        countData.Sorted_vec_4OHT_rep2.bam countData.Sorted_vec_4OHT_rep3.bam
## 97377                                 747                                838
## 117732                               4979                               5206
##         transcripts
## 97377  uc002ryd....
## 117732 uc002alh....

Time for an exercise

Link_to_exercises

Link_to_answers