class: center, middle, inverse, title-slide # Analysis of RNAseq data in R and Bioconductor (part 5)
###
http://rockefelleruniversity.github.io/RU_RNAseq/
--- ## The data In this session we will be reviewing data from Jesús Gil's lab at MRC from a splicing factor knockdown (PTBP1 RNAi) RNAseq experiment. Metadata and raw files can be found on GEO [here](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE101763). I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps(). --- ## The data All Gene and Exon counts can be found as .RData obects in **data** directory - Counts in disjoint exons can be found at - **data/senescence_ExonCounts.RData** --- ## What we will cover In our first session we looked at two separate ways we can gain gene expression estimates from raw sequencing data as FASTQ files. In our second session we imported our expression estimates to identify differences in the expression of genes between two conditions. In this session we will look at how we can identify differences in the expression of genes between multiple conditions and how we can identify changes in exon usage to infer changes in transcript usage. --- ## Counts from SummarizeOverlaps As we saw in session 2, we can use a **BamFileList** to handle multiple BAM files, and **yieldSize** to control memory usage. ```r library(Rsamtools) bamFilesToCount <- c("Sorted_shPTBP1_53_rep1.bam","Sorted_shPTBP1_53_rep2.bam","Sorted_shPTBP1_53_rep3.bam", "Sorted_vec_4OHT_rep1.bam","Sorted_vec_4OHT_rep2.bam","Sorted_vec_4OHT_rep3.bam" ) myBams <- BamFileList(bamFilesToCount,yieldSize = 10000) ``` --- ## Counts from SummarizeOverlaps For differential exon usage we must use a non-overlapping, disjoint set of exons to count on. We can produce a non-overlapping set of exons by using the **disjointExons()** function. This is equivalent of us having to create GFF file and using the python script wrapped up in DEXseq package (*dexseq_prepare_annotation.py*). ```r library(GenomicFeatures) library(TxDb.Hsapiens.UCSC.hg19.knownGene) nonOverlappingExons <- disjointExons(TxDb.Hsapiens.UCSC.hg19.knownGene) ``` ``` ## Warning: disjointExons() is deprecated. Please use exonicParts() instead. ``` ```r names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part, sep = "_") nonOverlappingExons[1:2, ] ``` ``` ## GRanges object with 2 ranges and 3 metadata columns: ## seqnames ranges strand | gene_id tx_name ## <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList> ## 1_1 chr19 58858172-58858395 - | 1 uc002qsd.4 ## 1_2 chr19 58858719-58859006 - | 1 uc002qsd.4 ## exonic_part ## <integer> ## 1_1 1 ## 1_2 2 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- ## Disjoint Exons  --- ## Disjoint Exons  --- ## Counts from SummarizeOverlaps Now we have our disjoint exons to count on we can use the **summarizeOverlaps()**. ```r library(GenomicAlignments) senescence_ExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE, inter.feature = FALSE) senescence_ExonCounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 269121 6 ## metadata(0): ## assays(1): counts ## rownames(269121): 1_1 1_2 ... 9997_3 9997_4 ## rowData names(3): gene_id tx_name exonic_part ## colnames(6): Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ... ## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam ## colData names(1): condition ``` --- ## Counts from SummarizeOverlaps Our **RangedSummarizedExperiment** contains our counts as a matrix accessible by the **assay()** function. Each row is named by our Exon ID, geneID + exon number. ```r assay(senescence_ExonCounts)[1:2, ] ``` ``` ## Sorted_shPTBP1_53_rep1.bam Sorted_shPTBP1_53_rep2.bam ## 1_1 400 489 ## 1_2 573 970 ## Sorted_shPTBP1_53_rep3.bam Sorted_vec_4OHT_rep1.bam ## 1_1 317 326 ## 1_2 455 452 ## Sorted_vec_4OHT_rep2.bam Sorted_vec_4OHT_rep3.bam ## 1_1 650 508 ## 1_2 801 885 ``` --- ## Counts from SummarizeOverlaps We can retrieve the GRanges we counted by the **rowRanges()** accessor function. The GRanges rows contains our disjoint exons used in counting and is in matching order as rows in our count table. ```r rowRanges(senescence_ExonCounts)[1:5, ] ``` ``` ## GRanges object with 5 ranges and 3 metadata columns: ## seqnames ranges strand | gene_id tx_name ## <Rle> <IRanges> <Rle> | <CharacterList> <CharacterList> ## 1_1 chr19 58858172-58858395 - | 1 uc002qsd.4 ## 1_2 chr19 58858719-58859006 - | 1 uc002qsd.4 ## 1_3 chr19 58859832-58860494 - | 1 uc002qsf.2 ## 1_4 chr19 58860934-58861735 - | 1 uc002qsf.2 ## 1_5 chr19 58861736-58862017 - | 1 uc002qsd.4,uc002qsf.2 ## exonic_part ## <integer> ## 1_1 1 ## 1_2 2 ## 1_3 3 ## 1_4 4 ## 1_5 5 ## ------- ## seqinfo: 93 sequences (1 circular) from hg19 genome ``` --- class: inverse, center, middle # Differential transcript usage <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Differential transcript usage Previously we have investigated how we may identify changes in the overall abundance on gene using RNAseq data.  --- ## Differential transcript usage We may also be interested in identifying any evidence of a changes in transcript usage within a gene.  --- ## Differential transcript usage Due to the uncertainty in the assignment of reads to overlapping transcripts from the same genes, a common approach to investigate differential transcript usage is to define non-overlapping disjoint exons and identify changes in their usage.  --- class: inverse, center, middle # DEXSeq <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## DEXSeq Options for differential transcript usage are available in R including DEXseq and Voom/Limma. DEXseq can be a little slow when fitting models but it has options for parallelization. Voom/Limma is significantly faster by comparison, but provides a less informative output. --- ## DEXSeq inputs The [**DEXseq** package](https://bioconductor.org/packages/release/bioc/html/DEXSeq.html) has a similar set up to DESeq2. We first must define our metadata data.frame of sample groups, with sample rownames matching colnames of our **RangedSummarizedExperiment** object. ```r metaData <- data.frame(condition = c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence", "senescence", "senescence"), row.names = colnames(senescence_ExonCounts)) metaData ``` ``` ## condition ## Sorted_shPTBP1_53_rep1.bam shPTBP1_53 ## Sorted_shPTBP1_53_rep2.bam shPTBP1_53 ## Sorted_shPTBP1_53_rep3.bam shPTBP1_53 ## Sorted_vec_4OHT_rep1.bam senescence ## Sorted_vec_4OHT_rep2.bam senescence ## Sorted_vec_4OHT_rep3.bam senescence ``` --- ## DEXseq inputs Previously we have been using **DESeqDataSetFromMatrix()** constructor for DESeq2. In DEXSeq we can use the **DEXSeqDataSet** constructor in a similar manner. As with the **DESeqDataSetFromMatrix()** function, we must provide the counts matrix, our metadata dataframe and our design specifying with metadata columns to test. In addition to this we must provide our exon IDs and gene IDs to the allow for the association of exons to the genes they constitute. ```r countMatrix <- assay(senescence_ExonCounts) countGRanges <- rowRanges(senescence_ExonCounts) geneIDs <- as.vector(unlist(countGRanges$gene_id)) exonIDs <- rownames(countMatrix) ``` --- ## DEXseq inputs We must specify our design to include a sample and exon effect to distinguish change in exon expression from a change in exon usage. Previously we simply added a design formula with our group of interest. ``` ~ GROUPOFINTEREST ``` Now we specify. ``` ~ sample + exon + GROUPOFINTEREST:exon ``` --- ## Running DEXseq Now we can use the **DEXSeqDataSet** constructor as we have **DESeqDataSetFromMatrix()** but in addition we specify the exon ids to the **featureID** argument and the gene ids to the **groupID** argument. ```r library(DEXSeq) ddx <- DEXSeqDataSet(countMatrix, metaData, design = ~sample + exon + condition:exon, featureID = exonIDs, groupID = geneIDs) ddx ``` ``` ## class: DEXSeqDataSet ## dim: 269121 12 ## metadata(1): version ## assays(1): counts ## rownames(269121): 1:1_1 1:1_2 ... 9997:9997_3 9997:9997_4 ## rowData names(4): featureID groupID exonBaseMean exonBaseVar ## colnames: NULL ## colData names(3): sample condition exon ``` --- ## Other DEXseq inputs As we have generated a **RangedSummarizedExperiment** for exon counting, we can use the more straightforward **DEXSeqDataSetFromSE** function as we used the **DESeqDataSet** function. We can simply provide our **RangedSummarizedExperiment** and design to the **DEXSeqDataSetFromSE** function. ```r colData(senescence_ExonCounts)$condition <- factor(c("shPTBP1_53", "shPTBP1_53", "shPTBP1_53", "senescence", "senescence", "senescence")) ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon) ddx ``` ``` ## class: DEXSeqDataSet ## dim: 269121 12 ## metadata(1): version ## assays(1): counts ## rownames(269121): 1:E001 1:E002 ... 9997:E003 9997:E004 ## rowData names(5): featureID groupID exonBaseMean exonBaseVar ## transcripts ## colnames: NULL ## colData names(3): sample condition exon ``` --- ## DEXseq (filtering low exon counts) DEXseq is fairly computationally expensive so we can reduce the time taken by removing exons with little expression up front. Here we can filter any exons which do not have 10 reads in at least two samples (our smallest group size is 3). ```r ToFilter <- apply(assay(senescence_ExonCounts), 1, function(x) sum(x > 10)) >= 2 senescence_ExonCounts <- senescence_ExonCounts[ToFilter, ] table(ToFilter) ``` ``` ## ToFilter ## FALSE TRUE ## 108125 160996 ``` ```r ddx <- DEXSeqDataSetFromSE(senescence_ExonCounts, design = ~sample + exon + condition:exon) ddx ``` ``` ## class: DEXSeqDataSet ## dim: 160996 12 ## metadata(1): version ## assays(1): counts ## rownames(160996): 1:E001 1:E002 ... 9997:E002 9997:E003 ## rowData names(5): featureID groupID exonBaseMean exonBaseVar ## transcripts ## colnames: NULL ## colData names(3): sample condition exon ``` --- ## Running DEXseq in steps We can normalize, estimate dispersion and test for differential exon usage with the **estimateSizeFactors()**, **estimateDispersions()** and **testForDEU()** functions. Finally we can estimate fold changes for exons using the **estimateExonFoldChanges()** function and specifying metadata column of interest to **fitExpToVar** parameter. ```r ddx <- estimateSizeFactors(ddx) ddx <- estimateDispersions(ddx) ddx <- testForDEU(ddx, reducedModel = ~sample + exon) ddx = estimateExonFoldChanges(ddx, fitExpToVar = "condition") ``` --- ## DEXseq results Once have processed our data through the **DEXseq** function we can use the **DEXSeqResults()** function with our **DEXseq** object. We order table to arrange by pvalue. ```r dxr1 <- DEXSeqResults(ddx) dxr1 <- dxr1[order(dxr1$pvalue), ] ``` --- ## DEXseq function As with DESeq2, the DEXseq has a single workflow function which will normalize to library sizes, estimate dispersion and perform differential exon usage in one go. The returned object is our **DEXSeqResults** results object. ```r dxr1 <- DEXSeq(ddx) dxr1 <- dxr1[order(dxr1$pvalue), ] ``` --- ## DEXseq results The **DEXSeqResults** results object contains significantly more information than the DESeq2 results objects we have seen so far. ```r dxr1[1, ] ``` ``` ## ## LRT p-value: full vs reduced ## ## DataFrame with 1 row and 13 columns ## groupID featureID exonBaseMean dispersion stat pvalue ## <character> <character> <numeric> <numeric> <numeric> <numeric> ## 57142:E008 57142 E008 21482.1 0.00455953 2036.23 0 ## padj senescence shPTBP1_53 log2fold_shPTBP1_53_senescence ## <numeric> <numeric> <numeric> <numeric> ## 57142:E008 0 30.1855 63.8085 4.12122 ## genomicData countData transcripts ## <GRanges> <matrix> <list> ## 57142:E008 chr2:55252222-55254621:- 40618:35656:54755:... uc002ryd.3,uc002rye.3 ``` --- ## DEXseq results The **DEXSeqResults** results object can be converted to a data.frame as with **DESeqResults** objects ```r as.data.frame(dxr1)[1, ] ``` ``` ## groupID featureID exonBaseMean dispersion stat pvalue padj ## 57142:E008 57142 E008 21482.14 0.00455953 2036.225 0 0 ## senescence shPTBP1_53 log2fold_shPTBP1_53_senescence ## 57142:E008 30.18552 63.80855 4.121217 ## genomicData.seqnames genomicData.start genomicData.end ## 57142:E008 chr2 55252222 55254621 ## genomicData.width genomicData.strand ## 57142:E008 2400 - ## countData.Sorted_shPTBP1_53_rep1.bam ## 57142:E008 40618 ## countData.Sorted_shPTBP1_53_rep2.bam ## 57142:E008 35656 ## countData.Sorted_shPTBP1_53_rep3.bam ## 57142:E008 54755 ## countData.Sorted_vec_4OHT_rep1.bam ## 57142:E008 1007 ## countData.Sorted_vec_4OHT_rep2.bam ## 57142:E008 747 ## countData.Sorted_vec_4OHT_rep3.bam transcripts ## 57142:E008 838 uc002ryd.... ``` --- ## Visualizing DEXseq results A very useful feature of the **DEXSeqResults** results objects is the ability to visualise the differential exons along our gene using the **plotDEXSeq()** function and providing our **DEXSeqResults** results object and gene of interest to plot. ```r plotDEXSeq(dxr1, "57142") ``` <!-- --> --- ## Visualizing DEXseq results We can specify the parameter **displayTranscripts** to TRUE to see the full gene model. ```r plotDEXSeq(dxr1, "57142", displayTranscripts = TRUE) ``` <!-- --> --- ## Visualizing DEXseq results  --- ## Visualizing DEXseq results The differential exon usage for Entrez gene 7168 is more complex. ```r plotDEXSeq(dxr1, "7168", displayTranscripts = TRUE) ``` <!-- --> --- ## Visualizing DEXseq results  --- ## Visualizing DEXseq results We can produce an HTML report for selected genes using the **DEXSeqHTML** function. By default this will create a file called *testForDEU.html* within a folder *DEXSeqReport* in the current working directory. ```r DEXSeqHTML(dxr1, "57142") ``` [Link to result here]("../../data/DEXSeqReport/testForDEU.html") --- ## Gene Q values It may be desirable to obtain a pvalue summarizing the differential exon usage to a gene level. We can obtain an adjusted p-value per gene with the **perGeneQValue()** with our **DEXSeqResults** results object. The result is a vector of adjusted p-values named by their corresponding gene. ```r geneQ <- perGeneQValue(dxr1) geneQ[1:10] ``` ``` ## 1 100 1000 10000 10001 100033414 100037417 100048912 ## 1 1 1 1 1 1 1 1 ## 100049076 10005 ## 1 1 ``` --- ## Gene Q values We could merge this back to our full results table to provides exon and gene level statistics in one report. ```r gQFrame <- as.data.frame(geneQ, row.names = names(geneQ)) dxDF <- as.data.frame(dxr1) dxDF <- merge(gQFrame, dxDF, by.x = 0, by.y = 1, all = TRUE) dxDF <- dxDF[order(dxDF$geneQ, dxDF$pvalue), ] dxDF[1:2, ] ``` ``` ## Row.names geneQ featureID exonBaseMean dispersion stat ## 97377 57142 0 E008 21482.144 0.004559530 2036.225 ## 117732 7168 0 E009 2805.899 0.004776165 1092.222 ## pvalue padj senescence shPTBP1_53 ## 97377 0.000000e+00 0.000000e+00 30.18552 63.80855 ## 117732 1.619401e-239 1.246016e-234 49.34291 18.36318 ## log2fold_shPTBP1_53_senescence genomicData.seqnames genomicData.start ## 97377 4.121217 chr2 55252222 ## 117732 -4.264727 chr15 63353397 ## genomicData.end genomicData.width genomicData.strand ## 97377 55254621 2400 - ## 117732 63353472 76 + ## countData.Sorted_shPTBP1_53_rep1.bam ## 97377 40618 ## 117732 619 ## countData.Sorted_shPTBP1_53_rep2.bam ## 97377 35656 ## 117732 668 ## countData.Sorted_shPTBP1_53_rep3.bam countData.Sorted_vec_4OHT_rep1.bam ## 97377 54755 1007 ## 117732 705 4442 ## countData.Sorted_vec_4OHT_rep2.bam countData.Sorted_vec_4OHT_rep3.bam ## 97377 747 838 ## 117732 4979 5206 ## transcripts ## 97377 uc002ryd.... ## 117732 uc002alh.... ``` --- # Time for an exercise [Link_to_exercises](../../exercises/exercises/RNAseq_part5_exercise.html) [Link_to_answers](../../exercises/answers/RNAseq_part5_answers.html)