Session 1: Alignment and counting
Session 2: Differential gene expression analysis
Session 3: Visualizing results through PCA and clustering
Session 4: Gene Set Analysis
Session 5: Differential Transcript Utilization analysis
In this session we will be reviewing data from Jesús Gil’s lab at MRC from a splicing factor knockdown (PTBP1 RNAi) RNAseq experiment. Metadata and raw files can be found on GEO here.
I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps().
All Gene and Exon counts can be found as .RData obects in data directory
In our first session we looked at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files.
In our second session we imported our expression estimates to identify differences in the expression of genes between two conditions.
In this session we will look at how we can identify differences in the expression of genes between multiple conditions and how we can identify changes in exon usage to infer changes in transcript usage.
As we saw in session 2, we can use a BamFileList to handle multiple BAM files, and yieldSize to control memory usage.
For differential exon usage we must use a non-overlapping, disjoint set of exons to count on.
We can produce a non-overlapping set of exons by using the exonicParts() function. This is equivalent of us having to create GFF file and using the python script wrapped up in DEXseq package (dexseq_prepare_annotation.py).
Now we have our disjoint exons to count on we can use the summarizeOverlaps().
library(GenomicAlignments)
senescence_ExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE,
inter.feature = FALSE)
senescence_ExonCounts
## class: RangedSummarizedExperiment
## dim: 269121 6
## metadata(0):
## assays(1): counts
## rownames(269121): 1_1 1_2 ... 9997_3 9997_4
## rowData names(3): gene_id tx_name exonic_part
## colnames(6): Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ...
## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## colData names(1): condition
Our RangedSummarizedExperiment contains our counts as a matrix accessible by the assay() function.
Each row is named by our Exon ID, geneID + exon number.
## Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam
## 1_1 400 489
## 1_2 573 970
## Sorted_shPTBP1_53_rep3.bam Sorted_vec_4OHT_rep1.bam
## 1_1 317 326
## 1_2 455 452
## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## 1_1 650 508
## 1_2 801 885
We can retrieve the GRanges we counted by the rowRanges() accessor function.
The GRanges rows contains our disjoint exons used in counting and is in matching order as rows in our count table.
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id tx_name
## <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList>
## 1_1 chr19 58858172-58858395 - | 1 uc002qsd.4
## 1_2 chr19 58858719-58859006 - | 1 uc002qsd.4
## 1_3 chr19 58859832-58860494 - | 1 uc002qsf.2
## 1_4 chr19 58860934-58861735 - | 1 uc002qsf.2
## 1_5 chr19 58861736-58862017 - | 1 uc002qsd.4,uc002qsf.2
## exonic_part
## <integer>
## 1_1 1
## 1_2 2
## 1_3 3
## 1_4 4
## 1_5 5
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
Previously we have investigated how we may identify changes in the overall abundance on gene using RNAseq data.
We may also be interested in identifying any evidence of a changes in transcript usage within a gene.
Due to the uncertainty in the assignment of reads to overlapping transcripts from the same genes, a common approach to investigate differential transcript usage is to define non-overlapping disjoint exons and identify changes in their usage.
Options for differential transcript usage are available in R including DEXseq and Voom/Limma.
DEXseq can be a little slow when fitting models but it has options for parallelization.
Voom/Limma is significantly faster by comparison, but provides a less informative output.
The DEXseq package has a similar set up to DESeq2.
We first must define our metadata data.frame of sample groups, with sample rownames matching colnames of our RangedSummarizedExperiment object.
metaData <- data.frame(condition = c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence",
"senescence", "senescence"), row.names = colnames(senescence_ExonCounts))
metaData
## condition
## Sorted_shPTBP1_53_rep1.bam shPTBP1_53
## Sorted_shPTBP1_53_rep2.bam shPTBP1_53
## Sorted_shPTBP1_53_rep3.bam shPTBP1_53
## Sorted_vec_4OHT_rep1.bam senescence
## Sorted_vec_4OHT_rep2.bam senescence
## Sorted_vec_4OHT_rep3.bam senescence
Previously we have been using DESeqDataSetFromMatrix() constructor for DESeq2. In DEXSeq we can use the DEXSeqDataSet constructor in a similar manner.
As with the DESeqDataSetFromMatrix() function, we must provide the counts matrix, our metadata dataframe and our design specifying with metadata columns to test.
In addition to this we must provide our exon IDs and gene IDs to the allow for the association of exons to the genes they constitute.
We must specify our design to include a sample and exon effect to distinguish change in exon expression from a change in exon usage.
Previously we simply added a design formula with our group of interest.
~ GROUPOFINTEREST
Now we specify.
~ sample + exon + GROUPOFINTEREST:exon
Now we can use the DEXSeqDataSet constructor as we have DESeqDataSetFromMatrix() but in addition we specify the exon ids to the featureID argument and the gene ids to the groupID argument.
library(DEXSeq)
ddx <- DEXSeqDataSet(countMatrix, metaData, design = ~sample + exon + condition:exon,
featureID = exonIDs, groupID = geneIDs)
ddx
## class: DEXSeqDataSet
## dim: 269121 12
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:1_1 1:1_2 ... 9997:9997_3 9997:9997_4
## rowData names(4): featureID groupID exonBaseMean exonBaseVar
## colnames: NULL
## colData names(3): sample condition exon
As we have generated a RangedSummarizedExperiment for exon counting, we can use the more straightforward DEXSeqDataSetFromSE function as we used the DESeqDataSet function.
We can simply provide our RangedSummarizedExperiment and design to the DEXSeqDataSetFromSE function.
colData(senescence_ExonCounts)$condition <- factor(c("shPTBP1_53", "shPTBP1_53",
"shPTBP1_53", "senescence", "senescence", "senescence"))
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
## class: DEXSeqDataSet
## dim: 269121 12
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:E001 1:E002 ... 9997:E003 9997:E004
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(3): sample condition exon
DEXseq is fairly computationally expensive so we can reduce the time taken by removing exons with little expression up front.
Here we can filter any exons which do not have 10 reads in at least two samples (our smallest group size is 3).
ToFilter <- apply(assay(senescence_ExonCounts), 1, function(x) sum(x > 10)) >= 2
senescence_ExonCounts <- senescence_ExonCounts[ToFilter, ]
table(ToFilter)
## ToFilter
## FALSE TRUE
## 108125 160996
## class: DEXSeqDataSet
## dim: 160996 12
## metadata(1): version
## assays(1): counts
## rownames(160996): 1:E001 1:E002 ... 9997:E002 9997:E003
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(3): sample condition exon
We can normalize, estimate dispersion and test for differential exon usage with the estimateSizeFactors(), estimateDispersions() and testForDEU() functions.
Finally we can estimate fold changes for exons using the estimateExonFoldChanges() function and specifying metadata column of interest to fitExpToVar parameter.
Once have processed our data through the DEXseq function we can use the DEXSeqResults() function with our DEXseq object. We order table to arrange by pvalue.
As with DESeq2, the DEXseq has a single workflow function which will normalize to library sizes, estimate dispersion and perform differential exon usage in one go. The returned object is our DEXSeqResults results object.
The DEXSeqResults results object contains significantly more information than the DESeq2 results objects we have seen so far.
##
## LRT p-value: full vs reduced
##
## DataFrame with 1 row and 13 columns
## groupID featureID exonBaseMean dispersion stat pvalue
## <character> <character> <numeric> <numeric> <numeric> <numeric>
## 57142:E008 57142 E008 21482.1 0.00455953 2036.23 0
## padj senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## <numeric> <numeric> <numeric> <numeric>
## 57142:E008 0 30.1855 63.8085 4.12122
## genomicData countData transcripts
## <GRanges> <matrix> <list>
## 57142:E008 chr2:55252222-55254621:- 40618:35656:54755:... uc002ryd.3,uc002rye.3
The DEXSeqResults results object can be converted to a data.frame as with DESeqResults objects
## groupID featureID exonBaseMean dispersion stat pvalue padj
## 57142:E008 57142 E008 21482.14 0.00455953 2036.225 0 0
## senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## 57142:E008 30.18552 63.80855 4.121217
## genomicData.seqnames genomicData.start genomicData.end
## 57142:E008 chr2 55252222 55254621
## genomicData.width genomicData.strand
## 57142:E008 2400 -
## countData.Sorted_shPTBP1_53_rep1.bam
## 57142:E008 40618
## countData.Sorted_shPTBP1_53_rep2.bam
## 57142:E008 35656
## countData.Sorted_shPTBP1_53_rep3.bam
## 57142:E008 54755
## countData.Sorted_vec_4OHT_rep1.bam
## 57142:E008 1007
## countData.Sorted_vec_4OHT_rep2.bam
## 57142:E008 747
## countData.Sorted_vec_4OHT_rep3.bam transcripts
## 57142:E008 838 uc002ryd....
A very useful feature of the DEXSeqResults results objects is the ability to visualise the differential exons along our gene using the plotDEXSeq() function and providing our DEXSeqResults results object and gene of interest to plot.
We can specify the parameter displayTranscripts to TRUE to see the full gene model.
The differential exon usage for Entrez gene 7168 is more complex.
We can produce an HTML report for selected genes using the DEXSeqHTML function. By default this will create a file called testForDEU.html within a folder DEXSeqReport in the current working directory.
It may be desirable to obtain a pvalue summarizing the differential exon usage to a gene level.
We can obtain an adjusted p-value per gene with the perGeneQValue() with our DEXSeqResults results object.
The result is a vector of adjusted p-values named by their corresponding gene.
## 1 100 1000 10000 10001 100033414 100037417 100048912
## 1 1 1 1 1 1 1 1
## 100049076 10005
## 1 1
We could merge this back to our full results table to provides exon and gene level statistics in one report.
gQFrame <- as.data.frame(geneQ, row.names = names(geneQ))
dxDF <- as.data.frame(dxr1)
dxDF <- merge(gQFrame, dxDF, by.x = 0, by.y = 1, all = TRUE)
dxDF <- dxDF[order(dxDF$geneQ, dxDF$pvalue), ]
dxDF[1:2, ]
## Row.names geneQ featureID exonBaseMean dispersion stat
## 97377 57142 0 E008 21482.144 0.004559530 2036.225
## 117732 7168 0 E009 2805.899 0.004776165 1092.222
## pvalue padj senescence shPTBP1_53
## 97377 0.000000e+00 0.000000e+00 30.18552 63.80855
## 117732 1.619401e-239 1.246016e-234 49.34291 18.36318
## log2fold_shPTBP1_53_senescence genomicData.seqnames genomicData.start
## 97377 4.121217 chr2 55252222
## 117732 -4.264727 chr15 63353397
## genomicData.end genomicData.width genomicData.strand
## 97377 55254621 2400 -
## 117732 63353472 76 +
## countData.Sorted_shPTBP1_53_rep1.bam
## 97377 40618
## 117732 619
## countData.Sorted_shPTBP1_53_rep2.bam
## 97377 35656
## 117732 668
## countData.Sorted_shPTBP1_53_rep3.bam countData.Sorted_vec_4OHT_rep1.bam
## 97377 54755 1007
## 117732 705 4442
## countData.Sorted_vec_4OHT_rep2.bam countData.Sorted_vec_4OHT_rep3.bam
## 97377 747 838
## 117732 4979 5206
## transcripts
## 97377 uc002ryd....
## 117732 uc002alh....