+ - 0:00:00
Notes for current slide
Notes for next slide

Analysis of RNAseq data in R and Bioconductor (part 5)






Bioinformatics Resource Center - Rockefeller University
1 / 38

What we will cover

  • Session 1: Alignment and counting

  • Session 2: Differential gene expression analysis

  • Session 3: Visualizing results through PCA and clustering

  • Session 4: Gene Set Analysis

  • Session 5: Differential Transcript Utilization analysis

2 / 38

The data

In this session we will be reviewing data from Jesús Gil's lab at MRC from a splicing factor knockdown (PTBP1 RNAi) RNAseq experiment. Metadata and raw files can be found on GEO here.

I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps().

3 / 38

The data

All Gene and Exon counts can be found as .RData obects in data directory

  • Counts in disjoint exons can be found at - data/senescence_ExonCounts.RData
4 / 38

What we will cover

In our first session we looked at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files.

In our second session we imported our expression estimates to identify differences in the expression of genes between two conditions.

In this session we will look at how we can identify differences in the expression of genes between multiple conditions and how we can identify changes in exon usage to infer changes in transcript usage.

5 / 38

Counts from SummarizeOverlaps

As we saw in session 2, we can use a BamFileList to handle multiple BAM files, and yieldSize to control memory usage.

library(Rsamtools)
bamFilesToCount <- c("Sorted_shPTBP1_53_rep1.bam","Sorted_shPTBP1_53_rep2.bam","Sorted_shPTBP1_53_rep3.bam",
"Sorted_vec_4OHT_rep1.bam","Sorted_vec_4OHT_rep2.bam","Sorted_vec_4OHT_rep3.bam" )
myBams <- BamFileList(bamFilesToCount,yieldSize = 10000)
6 / 38

Counts from SummarizeOverlaps

For differential exon usage we must use a non-overlapping, disjoint set of exons to count on.

We can produce a non-overlapping set of exons by using the exonicParts() function. This is equivalent of us having to create GFF file and using the python script wrapped up in DEXseq package (dexseq_prepare_annotation.py).

library(GenomicFeatures)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
nonOverlappingExons <- exonicParts(TxDb.Hsapiens.UCSC.hg19.knownGene)
names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part,
sep = "_")
nonOverlappingExons[1:2, ]
7 / 38

Disjoint Exons

8 / 38

Disjoint Exons

9 / 38

Counts from SummarizeOverlaps

Now we have our disjoint exons to count on we can use the summarizeOverlaps().

library(GenomicAlignments)
senescence_ExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE,
inter.feature = FALSE)
senescence_ExonCounts
## class: RangedSummarizedExperiment
## dim: 269121 6
## metadata(0):
## assays(1): counts
## rownames(269121): 1_1 1_2 ... 9997_3 9997_4
## rowData names(3): gene_id tx_name exonic_part
## colnames(6): Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ...
## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## colData names(1): condition
10 / 38

Counts from SummarizeOverlaps

Our RangedSummarizedExperiment contains our counts as a matrix accessible by the assay() function.

Each row is named by our Exon ID, geneID + exon number.

assay(senescence_ExonCounts)[1:2, ]
## Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam
## 1_1 400 489
## 1_2 573 970
## Sorted_shPTBP1_53_rep3.bam Sorted_vec_4OHT_rep1.bam
## 1_1 317 326
## 1_2 455 452
## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam
## 1_1 650 508
## 1_2 801 885
11 / 38

Counts from SummarizeOverlaps

We can retrieve the GRanges we counted by the rowRanges() accessor function.

The GRanges rows contains our disjoint exons used in counting and is in matching order as rows in our count table.

rowRanges(senescence_ExonCounts)[1:5, ]
## GRanges object with 5 ranges and 3 metadata columns:
## seqnames ranges strand | gene_id tx_name
## <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList>
## 1_1 chr19 58858172-58858395 - | 1 uc002qsd.4
## 1_2 chr19 58858719-58859006 - | 1 uc002qsd.4
## 1_3 chr19 58859832-58860494 - | 1 uc002qsf.2
## 1_4 chr19 58860934-58861735 - | 1 uc002qsf.2
## 1_5 chr19 58861736-58862017 - | 1 uc002qsd.4,uc002qsf.2
## exonic_part
## <integer>
## 1_1 1
## 1_2 2
## 1_3 3
## 1_4 4
## 1_5 5
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
12 / 38

Differential transcript usage


13 / 38

Differential transcript usage

Previously we have investigated how we may identify changes in the overall abundance on gene using RNAseq data.

14 / 38

Differential transcript usage

We may also be interested in identifying any evidence of a changes in transcript usage within a gene.

15 / 38

Differential transcript usage

Due to the uncertainty in the assignment of reads to overlapping transcripts from the same genes, a common approach to investigate differential transcript usage is to define non-overlapping disjoint exons and identify changes in their usage.

16 / 38

DEXSeq


17 / 38

DEXSeq

Options for differential transcript usage are available in R including DEXseq and Voom/Limma.

DEXseq can be a little slow when fitting models but it has options for parallelization.

Voom/Limma is significantly faster by comparison, but provides a less informative output.

18 / 38

DEXSeq inputs

The DEXseq package has a similar set up to DESeq2.

We first must define our metadata data.frame of sample groups, with sample rownames matching colnames of our RangedSummarizedExperiment object.

metaData <- data.frame(condition = c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence",
"senescence", "senescence"), row.names = colnames(senescence_ExonCounts))
metaData
## condition
## Sorted_shPTBP1_53_rep1.bam shPTBP1_53
## Sorted_shPTBP1_53_rep2.bam shPTBP1_53
## Sorted_shPTBP1_53_rep3.bam shPTBP1_53
## Sorted_vec_4OHT_rep1.bam senescence
## Sorted_vec_4OHT_rep2.bam senescence
## Sorted_vec_4OHT_rep3.bam senescence
19 / 38

DEXseq inputs

Previously we have been using DESeqDataSetFromMatrix() constructor for DESeq2. In DEXSeq we can use the DEXSeqDataSet constructor in a similar manner.

As with the DESeqDataSetFromMatrix() function, we must provide the counts matrix, our metadata dataframe and our design specifying with metadata columns to test.

In addition to this we must provide our exon IDs and gene IDs to the allow for the association of exons to the genes they constitute.

countMatrix <- assay(senescence_ExonCounts)
countGRanges <- rowRanges(senescence_ExonCounts)
geneIDs <- as.vector(unlist(countGRanges$gene_id))
exonIDs <- rownames(countMatrix)
20 / 38

DEXseq inputs

We must specify our design to include a sample and exon effect to distinguish change in exon expression from a change in exon usage.

Previously we simply added a design formula with our group of interest.

~ GROUPOFINTEREST

Now we specify.

~ sample + exon + GROUPOFINTEREST:exon
21 / 38

Running DEXseq

Now we can use the DEXSeqDataSet constructor as we have DESeqDataSetFromMatrix() but in addition we specify the exon ids to the featureID argument and the gene ids to the groupID argument.

library(DEXSeq)
ddx <- DEXSeqDataSet(countMatrix, metaData, design = ~sample + exon + condition:exon,
featureID = exonIDs, groupID = geneIDs)
ddx
## class: DEXSeqDataSet
## dim: 269121 12
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:1_1 1:1_2 ... 9997:9997_3 9997:9997_4
## rowData names(4): featureID groupID exonBaseMean exonBaseVar
## colnames: NULL
## colData names(3): sample condition exon
22 / 38

Other DEXseq inputs

As we have generated a RangedSummarizedExperiment for exon counting, we can use the more straightforward DEXSeqDataSetFromSE function as we used the DESeqDataSet function.

We can simply provide our RangedSummarizedExperiment and design to the DEXSeqDataSetFromSE function.

colData(senescence_ExonCounts)$condition <- factor(c("shPTBP1_53", "shPTBP1_53",
"shPTBP1_53", "senescence", "senescence", "senescence"))
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
## class: DEXSeqDataSet
## dim: 269121 12
## metadata(1): version
## assays(1): counts
## rownames(269121): 1:E001 1:E002 ... 9997:E003 9997:E004
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(3): sample condition exon
23 / 38

DEXseq (filtering low exon counts)

DEXseq is fairly computationally expensive so we can reduce the time taken by removing exons with little expression up front.

Here we can filter any exons which do not have 10 reads in at least two samples (our smallest group size is 3).

ToFilter <- apply(assay(senescence_ExonCounts), 1, function(x) sum(x > 10)) >= 2
senescence_ExonCounts <- senescence_ExonCounts[ToFilter, ]
table(ToFilter)
## ToFilter
## FALSE TRUE
## 108125 160996
ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon)
ddx
## class: DEXSeqDataSet
## dim: 160996 12
## metadata(1): version
## assays(1): counts
## rownames(160996): 1:E001 1:E002 ... 9997:E002 9997:E003
## rowData names(5): featureID groupID exonBaseMean exonBaseVar
## transcripts
## colnames: NULL
## colData names(3): sample condition exon
24 / 38

Running DEXseq in steps

We can normalize, estimate dispersion and test for differential exon usage with the estimateSizeFactors(), estimateDispersions() and testForDEU() functions.

Finally we can estimate fold changes for exons using the estimateExonFoldChanges() function and specifying metadata column of interest to fitExpToVar parameter.

ddx <- estimateSizeFactors(ddx)
ddx <- estimateDispersions(ddx)
ddx <- testForDEU(ddx, reducedModel = ~sample + exon)
ddx = estimateExonFoldChanges(ddx, fitExpToVar = "condition")
25 / 38

DEXseq results

Once have processed our data through the DEXseq function we can use the DEXSeqResults() function with our DEXseq object. We order table to arrange by pvalue.

dxr1 <- DEXSeqResults(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]
26 / 38

DEXseq function

As with DESeq2, the DEXseq has a single workflow function which will normalize to library sizes, estimate dispersion and perform differential exon usage in one go. The returned object is our DEXSeqResults results object.

dxr1 <- DEXSeq(ddx)
dxr1 <- dxr1[order(dxr1$pvalue), ]
27 / 38

DEXseq results

The DEXSeqResults results object contains significantly more information than the DESeq2 results objects we have seen so far.

dxr1[1, ]
##
## LRT p-value: full vs reduced
##
## DataFrame with 1 row and 13 columns
## groupID featureID exonBaseMean dispersion stat pvalue
## <character> <character> <numeric> <numeric> <numeric> <numeric>
## 57142:E008 57142 E008 21482.1 0.00455953 2036.23 0
## padj senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## <numeric> <numeric> <numeric> <numeric>
## 57142:E008 0 30.1855 63.8085 4.12122
## genomicData countData transcripts
## <GRanges> <matrix> <list>
## 57142:E008 chr2:55252222-55254621:- 40618:35656:54755:... uc002ryd.3,uc002rye.3
28 / 38

DEXseq results

The DEXSeqResults results object can be converted to a data.frame as with DESeqResults objects

as.data.frame(dxr1)[1, ]
## groupID featureID exonBaseMean dispersion stat pvalue padj
## 57142:E008 57142 E008 21482.14 0.00455953 2036.225 0 0
## senescence shPTBP1_53 log2fold_shPTBP1_53_senescence
## 57142:E008 30.18552 63.80855 4.121217
## genomicData.seqnames genomicData.start genomicData.end
## 57142:E008 chr2 55252222 55254621
## genomicData.width genomicData.strand
## 57142:E008 2400 -
## countData.Sorted_shPTBP1_53_rep1.bam
## 57142:E008 40618
## countData.Sorted_shPTBP1_53_rep2.bam
## 57142:E008 35656
## countData.Sorted_shPTBP1_53_rep3.bam
## 57142:E008 54755
## countData.Sorted_vec_4OHT_rep1.bam
## 57142:E008 1007
## countData.Sorted_vec_4OHT_rep2.bam
## 57142:E008 747
## countData.Sorted_vec_4OHT_rep3.bam transcripts
## 57142:E008 838 uc002ryd....
29 / 38

Visualizing DEXseq results

A very useful feature of the DEXSeqResults results objects is the ability to visualise the differential exons along our gene using the plotDEXSeq() function and providing our DEXSeqResults results object and gene of interest to plot.

plotDEXSeq(dxr1, "57142")

30 / 38

Visualizing DEXseq results

We can specify the parameter displayTranscripts to TRUE to see the full gene model.

plotDEXSeq(dxr1, "57142", displayTranscripts = TRUE)

31 / 38

Visualizing DEXseq results

32 / 38

Visualizing DEXseq results

The differential exon usage for Entrez gene 7168 is more complex.

plotDEXSeq(dxr1, "7168", displayTranscripts = TRUE)

33 / 38

Visualizing DEXseq results

34 / 38

Visualizing DEXseq results

We can produce an HTML report for selected genes using the DEXSeqHTML function. By default this will create a file called testForDEU.html within a folder DEXSeqReport in the current working directory.

DEXSeqHTML(dxr1, "57142")

Link to result here

35 / 38

Gene Q values

It may be desirable to obtain a pvalue summarizing the differential exon usage to a gene level.

We can obtain an adjusted p-value per gene with the perGeneQValue() with our DEXSeqResults results object.

The result is a vector of adjusted p-values named by their corresponding gene.

geneQ <- perGeneQValue(dxr1)
geneQ[1:10]
## 1 100 1000 10000 10001 100033414 100037417 100048912
## 1 1 1 1 1 1 1 1
## 100049076 10005
## 1 1
36 / 38

Gene Q values

We could merge this back to our full results table to provides exon and gene level statistics in one report.

gQFrame <- as.data.frame(geneQ, row.names = names(geneQ))
dxDF <- as.data.frame(dxr1)
dxDF <- merge(gQFrame, dxDF, by.x = 0, by.y = 1, all = TRUE)
dxDF <- dxDF[order(dxDF$geneQ, dxDF$pvalue), ]
dxDF[1:2, ]
## Row.names geneQ featureID exonBaseMean dispersion stat
## 97377 57142 0 E008 21482.144 0.004559530 2036.225
## 117732 7168 0 E009 2805.899 0.004776165 1092.222
## pvalue padj senescence shPTBP1_53
## 97377 0.000000e+00 0.000000e+00 30.18552 63.80855
## 117732 1.619401e-239 1.246016e-234 49.34291 18.36318
## log2fold_shPTBP1_53_senescence genomicData.seqnames genomicData.start
## 97377 4.121217 chr2 55252222
## 117732 -4.264727 chr15 63353397
## genomicData.end genomicData.width genomicData.strand
## 97377 55254621 2400 -
## 117732 63353472 76 +
## countData.Sorted_shPTBP1_53_rep1.bam
## 97377 40618
## 117732 619
## countData.Sorted_shPTBP1_53_rep2.bam
## 97377 35656
## 117732 668
## countData.Sorted_shPTBP1_53_rep3.bam countData.Sorted_vec_4OHT_rep1.bam
## 97377 54755 1007
## 117732 705 4442
## countData.Sorted_vec_4OHT_rep2.bam countData.Sorted_vec_4OHT_rep3.bam
## 97377 747 838
## 117732 4979 5206
## transcripts
## 97377 uc002ryd....
## 117732 uc002alh....
37 / 38

Time for an exercise

Link_to_exercises

Link_to_answers

38 / 38

What we will cover

  • Session 1: Alignment and counting

  • Session 2: Differential gene expression analysis

  • Session 3: Visualizing results through PCA and clustering

  • Session 4: Gene Set Analysis

  • Session 5: Differential Transcript Utilization analysis

2 / 38
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow