class: center, middle, inverse, title-slide # Summarizing Scores In Bioconductor ### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- class: inverse, center, middle # Summarizing Scores <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Raw Sequence data Sequencing data from Illumina sequences machines are often stored in FASTQ format. <div align="center"> <img src="imgs/fq1.png" alt="igv" height="200" width="600"> </div> --- # Aligned sequence data Sequence reads aligned to a genome are most often stored in BAM format. <div align="center"> <img src="imgs/sam.png" alt="igv" height="400" width="600"> </div> --- # Summarizing Alignments A common analysis step in the interpretation of high-throughput sequencing experiments is to quantify the abundance of sequence in/over genomic regions. <div align="center"> <img src="imgs/sum.png" alt="igv" height="400" width="600"> </div> --- # Summarizing for RNAseq * **In RNAseq** experiments we assess the abundance of transcripts/genes under differing conditions by assessing the number of reads aligning with the transcript's/gene's exons. ![](imgs/RNAseq.png) --- # Summarizing for ChIPseq * **In ChIPseq** experiments we compare the number of DNA fragments mapping to a genomic locus after an IP for a specific protein or DNA/histone state versus the number of DNA fragments mapping from a background control/input sample. ![](imgs/chipseq.png) --- # Summarizing Genomic Alignments We can provide a summarization of aligned data per base pair using **wig, bedgraph and bigWig** files. ![](imgs/scoreBP.png) --- # Summarizing alignments in/over regions We will often summarize our aligned data in BAM format over genomic regions of interest. ![](imgs/countss.png) # Summarizing alignments in/over regions The genomic regions may be: * Simple genomic regions such as peaks or TSS regions are often stored as **BED** files. In Bioconductor these are represented as **GRanges** objects. * More complex, multi-part regions such as gene models are often stored as **GTF** or **GFF** files. In Bioconductor, gene models are handled as **TxDb** objects. --- # Summarizing with Bioconductor We will use two main functions to allows us to summarize our aligned data over the genome and within regions. These are the **coverage()** and **summarizeOverlaps()** functions from **GenomicsAlignments** package. * **coverage()** - To create per base pair signal/alignments scores across genome i.e. bigWigs. * **summarizeOverlaps()** - To create counts of signal/alignments within genomic regions. --- # The Data In this session we will work with aligned data as BAM files. I have provided BAM files we saw in our last session as our test data from today. This can be found in **data/liver.bodyMap.bam** and **data/heart.bodyMap.bam**. * **data/liver.bodyMap.bam** * **data/heart.bodyMap.bam** --- # GenomicAlignments packages To make use of a GenomicsAlignments package we must first load the library. ```r library(GenomicAlignments) ``` --- # Coverage The **coverage()** function can accept a Bam file path and will return an [Rlelist](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rle_in_genomics) of the number of reads at every base pair. We should always work with sorted and indexed BAM files so we will perform these steps first. ```r sortedHeart <- sortBam("data/heart.bodyMap.bam","Heart") indexBam(sortedHeart) ``` ``` ## Heart.bam ## "Heart.bam.bai" ``` --- # Coverage Now we simply use the **coverage()** function with the path to our sorted file. ```r heartCoverage <- coverage("Heart.bam") class(heartCoverage) ``` ``` ## [1] "SimpleRleList" ## attr(,"package") ## [1] "IRanges" ``` --- # Coverage as RLElist As we have seen, [Rlelist](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rle_in_genomics) objects contain compressed vectors carrying the scores, here number of alignments, for every base pair in the genome. Here we see scores only on chromosome 12 around the gene we have been reviewing, SLC25A3. ```r heartCoverage ``` ``` ## RleList of length 1 ## $chr12 ## integer-Rle of length 133275309 with 4553 runs ## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799 ## Values : 0 2 0 3 ... 0 1 0 ``` --- # Coverage as RLElist We have seen we can [index, subset, manipulate and plot our RleList and Rle vector](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#indexing_an_rle) in a similar manner to standard lists and vectors. ```r chr12Cov <- heartCoverage[["chr12"]] signalDepth <- chr12Cov[98591400:98608400] signalDepthScaled <- data.frame(Position=98591400:98608400, Signal=signalDepth*1000) library(ggplot2) ggplot(signalDepthScaled,aes(x=Position,y=Signal))+ geom_line()+theme_minimal() ``` ![](Summarising_Scores_In_Bioconductor_files/figure-html/a3-1.png)<!-- --> --- # Coverage function The **coverage()** function will also accept **GAlignments** objects. ```r heartAln <- readGAlignments("Heart.bam") heartCov1 <- coverage(heartAln) ``` ![](Summarising_Scores_In_Bioconductor_files/figure-html/frfga3-1.png)<!-- --> --- # Coverage function As well as **GRanges** and **GRangesList** objects too. For RNAseq we should generate coverage from a [**GRangesList** object to account for reads spannning exons](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignedDataInBioconductor.html#36). ```r heartGR <- granges(heartAln) heartCov2 <- coverage(heartGR) heartGRL <- grglist(heartAln) heartCov3 <- coverage(heartGRL) ``` ![](Summarising_Scores_In_Bioconductor_files/figure-html/frfa3-1.png)<!-- --> --- # Coverage function This means we can perform some operations on our sequence reads in R prior to creating our bigWig. Here we simply select only reads aligning to the positive strand prior to creating our signal scores.. ```r heartAlnPos <- heartAln[strand(heartAln) == "+"] heartAlnPos <- coverage(heartAlnPos) heartAlnPos["chr12"] export.bw(heartAlnPos,con="heartPos.bw") ``` ``` ## RleList of length 1 ## $chr12 ## integer-Rle of length 133275309 with 3135 runs ## Lengths: 33666 50 3194555 35 ... 1530798 40 9963077 ## Values : 0 2 0 3 ... 0 1 0 ``` --- # Coverage function The coverage function allows for the specification of a additional parameter, **weight**. The **weight** parameter applies a scaling factor to the calculated scores and so can be useful to normalize your signal e.g. to total mapped reads. ```r heartCoverageX10 <- coverage("Heart.bam", weight = 10) heartCoverageX10["chr12"] ## RleList of length 1 ## $chr12 ## numeric-Rle of length 133275309 with 4553 runs ## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799 ## Values : 0 20 0 30 ... 0 10 0 heartCoverage["chr12"] ## RleList of length 1 ## $chr12 ## integer-Rle of length 133275309 with 4553 runs ## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799 ## Values : 0 2 0 3 ... 0 1 0 ``` --- # Coverage function In combination with **idxstatsBam()** function, we could use this to normalize your signal to total mapped reads. ```r allChromosomeStats <- idxstatsBam("Heart.bam") allChromosomeStats ``` ``` ## seqnames seqlength mapped unmapped ## 1 chr12 133275309 132381 0 ## 2 * 0 0 223 ``` --- # Coverage function ```r mapped <- sum(allChromosomeStats[,"mapped"]) heartCoverageNorm <- coverage("Heart.bam", weight = (10^6)/mapped) heartCoverageNorm["chr12"] ``` ``` ## RleList of length 1 ## $chr12 ## numeric-Rle of length 133275309 with 4553 runs ## Lengths: 33666 50 3194555 ... 38 2762799 ## Values : 0.00000e+00 1.51079e+01 0.00000e+00 ... 7.55395e+00 -1.10187e-11 ``` **REMEMBER: You can always export the RLE to a BigWig with rtracklayer, so you can then view it in IGV.** --- # Summarizing from scores We can use [RleLists to summarize scores over regions by subsetting with GRanges and using standard arithmetic operations.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rlelists_and_granges) Here we get the mean signal (read depth per base pair) over exons for slc25A3 gene, **Entrez ID 5250**. ```r library(TxDb.Hsapiens.UCSC.hg38.knownGene) exonsOfGenes <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene, by="gene") slc25A3 <- exonsOfGenes[["5250"]] slc25A3 ``` ``` ## GRanges object with 39 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr12 98593591-98593740 + | 381304 <NA> ## [2] chr12 98593625-98593740 + | 381305 <NA> ## [3] chr12 98593625-98594135 + | 381306 <NA> ## [4] chr12 98593650-98593740 + | 381307 <NA> ## [5] chr12 98593655-98593740 + | 381308 <NA> ## ... ... ... ... . ... ... ## [35] chr12 98601368-98601708 + | 381339 <NA> ## [36] chr12 98601368-98601997 + | 381340 <NA> ## [37] chr12 98601368-98602168 + | 381341 <NA> ## [38] chr12 98601368-98606367 + | 381342 <NA> ## [39] chr12 98601368-98606379 + | 381343 <NA> ## ------- ## seqinfo: 595 sequences (1 circular) from hg38 genome ``` --- # Summarizing from scores ```r heartCoverageNorm <- coverage("Heart.bam") myMeanCovOverExons <- mean(heartCoverageNorm[slc25A3]) myMeanCovOverExons ``` ``` ## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12 ## 197.3933 255.2500 633.2740 323.8462 341.8605 345.5647 510.2909 715.6889 ## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12 ## 601.2222 554.0204 756.4282 1757.2919 473.0144 1770.5732 539.0552 4232.1680 ## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12 ## 2317.5153 204.4162 299.2623 2600.1274 6779.2570 6785.8833 6749.0552 5607.1697 ## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12 ## 2104.1342 6218.8696 6158.3462 4092.3000 6375.2486 3771.8433 7370.8919 7180.3787 ## chr12 chr12 chr12 chr12 chr12 chr12 chr12 ## 7139.7459 6667.7147 6648.3255 3626.9619 2858.4856 478.1346 476.9898 ``` --- class: inverse, center, middle # Summarizing counts in regions from alignments <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Counting reads in/over regions To count reads within regions and across gene models we can take advantage of specialist counting tools available within the **summarizeOverlaps()** functions. This is important as it will allow us to: * Handle reads overlapping multiple regions (features or metafeatures). * Assign reads to meta-features (genes instead of exons). --- # Reads over/across exons ![](imgs/exonsAndGenes.png) <!-- --- --> <!-- # summarizeOverlaps - Modes of counting --> <!-- We can control counting behaviour by setting **mode**. Typically we will use the conservative mode **union**. --> <!-- <div align="center"> --> <!-- <img src="imgs/count_modes.png" alt="igv" height="400" width="400"> --> <!-- </div> --> --- # summarizeOverlaps - Genes not exons Often we will want to summarize to genes, not individual exons. <div align="center"> <img src="imgs/withinGenes.png" alt="igv" height="400" width="600"> </div> --- # A read can only assigned to one feature! When using **summarizeOverlaps()** function, a read can only be assigned to one feature. This is important to note when counting features such as overlapping exons. We will learn how to deal with counting across overlapping features in later sessions such as in RNAseq. --- # summarizeOverlaps The **summarizeOverlaps()** function is well intergrated with many of the Bioconductor objects we have seen. As with the **coverage()** function it can count reads from **Bam** files or **GRanges** objects. The **summarizeOverlaps()** function can count over **GRanges (features e.g. exons)** and **GRangesLists (meta-features genes)**. --- # summarizeOverlaps for regions To count non-overlapping regions, we can simply specify the path to our BAM file and a GRanges of regions to count over. The returned object is a **RangedSummarizedExperiment** object. ```r geneBody <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene) ``` ``` ## 228 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. ``` ```r TSS <- promoters(geneBody,500,500) myTSScounts <- summarizeOverlaps(TSS,"Heart.bam") class(myTSScounts) ``` ``` ## [1] "RangedSummarizedExperiment" ## attr(,"package") ## [1] "SummarizedExperiment" ``` --- # SummarizedExperiment object The resulting **SummarizedExperiment** object contains a matrix of our counts over regions and the original GRanges object we used for counting. As with other Bioconductor objects, the object shows some useful summary information ```r myTSScounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 27113 1 ## metadata(0): ## assays(1): counts ## rownames(27113): 1 10 ... 9994 9997 ## rowData names(1): gene_id ## colnames(1): Heart.bam ## colData names(0): ``` --- # SummarizedExperiment object We can access the count matrix using the accessor function **assay()** with our **SummarizedExperiment** object. Here we see total counts for SLC25A3 genes' TSS region ```r countMatrix <- assay(myTSScounts) countMatrix["5250",] ``` ``` ## Heart.bam ## 5250 2934 ``` --- # SummarizedExperiment object The **rowRanges()** function will access the original **GRanges** we used for counting. ```r Granges <- rowRanges(myTSScounts) Granges ``` ``` ## GRanges object with 27113 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 1 chr19 58362252-58363251 - | 1 ## 10 chr8 18390782-18391781 + | 10 ## 100 chr20 44651734-44652733 - | 100 ## 1000 chr18 28177447-28178446 - | 1000 ## 10000 chr1 243850580-243851579 - | 10000 ## ... ... ... ... . ... ## 9991 chr9 112333165-112334164 - | 9991 ## 9992 chr21 34363506-34364505 + | 9992 ## 9993 chr22 19121955-19122954 - | 9993 ## 9994 chr6 89829394-89830393 + | 9994 ## 9997 chr22 50525962-50526961 - | 9997 ## ------- ## seqinfo: 595 sequences (1 circular) from hg38 genome ``` --- # summarizeOverlaps for gene models For gene models, we may want to assign reads to each gene and so count all reads overlapping any exons within a gene. We can count reads in meta-features such as genes using **GRangesLists**. First we extract a GRangesList of exons for every gene. ```r geneExons <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene") geneExons["5250"] ``` ``` ## GRangesList object of length 1: ## $`5250` ## GRanges object with 39 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr12 98593591-98593740 + | 381304 <NA> ## [2] chr12 98593625-98593740 + | 381305 <NA> ## [3] chr12 98593625-98594135 + | 381306 <NA> ## [4] chr12 98593650-98593740 + | 381307 <NA> ## [5] chr12 98593655-98593740 + | 381308 <NA> ## ... ... ... ... . ... ... ## [35] chr12 98601368-98601708 + | 381339 <NA> ## [36] chr12 98601368-98601997 + | 381340 <NA> ## [37] chr12 98601368-98602168 + | 381341 <NA> ## [38] chr12 98601368-98606367 + | 381342 <NA> ## [39] chr12 98601368-98606379 + | 381343 <NA> ## ------- ## seqinfo: 595 sequences (1 circular) from hg38 genome ``` --- # summarizeOverlaps for gene models Now we can summarize counts for every gene by counting in their respective exons. ```r myGeneCounts <- summarizeOverlaps(geneExons,"Heart.bam") myGeneCounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 27341 1 ## metadata(0): ## assays(1): counts ## rownames(27341): 1 10 ... 9994 9997 ## rowData names(0): ## colnames(1): Heart.bam ## colData names(0): ``` --- # summarizeOverlaps for gene models And now we have a count of signal in exons for every gene. ```r countMatrix <- assay(myGeneCounts) countMatrix["5250",] ``` ``` ## Heart.bam ## 5250 65799 ``` --- # summarizeOverlaps for gene models The **rowRanges()** contains the original GRangesList of exons by genes. ```r grgList <- rowRanges(myGeneCounts) grgList ``` ``` ## GRangesList object of length 27341: ## $`1` ## GRanges object with 29 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr19 58345178-58347029 - | 579646 <NA> ## [2] chr19 58345183-58347029 - | 579647 <NA> ## [3] chr19 58346854-58347029 - | 579648 <NA> ## [4] chr19 58346858-58347029 - | 579649 <NA> ## [5] chr19 58346860-58347029 - | 579650 <NA> ## ... ... ... ... . ... ... ## [25] chr19 58357585-58357649 - | 579673 <NA> ## [26] chr19 58357952-58358286 - | 579674 <NA> ## [27] chr19 58358489-58358585 - | 579675 <NA> ## [28] chr19 58359197-58359323 - | 579676 <NA> ## [29] chr19 58362677-58362751 - | 579677 <NA> ## ------- ## seqinfo: 595 sequences (1 circular) from hg38 genome ## ## ... ## <27340 more elements> ``` --- # SummarizeOverlaps for many files The **summarizeOverlaps()** function can count across multiple BAM files. We can specify a vector of BAM file paths we wish to count from to **summarizeOverlaps()** function. ```r allGeneCounts <- summarizeOverlaps(geneExons, c("Heart.bam","Liver.bam")) ``` --- # SummarizeOverlaps for many files Now we can return the matrix of counts with samples by columns and genes as rows. ```r countMatrix <- assay(allGeneCounts) countMatrix["5250",] ``` ``` ## Heart.bam Liver.bam ## 5250 65799 24296 ``` --- # Low memory summarization Counting from BAM files, as with sorting, can be configured to trade of memory usage for speed i.e. less RAM used, but takes longer to count. We can control the amount of memory used by creating a **BamFile** object prior to counting using the **BamFile()** function ```r myBam <- BamFile("Heart.bam") class(myBam) ``` ``` ## [1] "BamFile" ## attr(,"package") ## [1] "Rsamtools" ``` --- # Low memory summarization We can additionally specify the **yieldSize** parameter which controls how many reads are held in memory in one time. Here we set the **yieldSize** to a very low 1000 reads, this will save a lot of memory but slow down processing. We will often need to test the **yieldSize** which best fits our machine. ```r myBam <- BamFile("Heart.bam", yieldSize = 1000) heartGeneCounts <- summarizeOverlaps(geneExons,myBam) heartGeneCounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 27341 1 ## metadata(0): ## assays(1): counts ## rownames(27341): 1 10 ... 9994 9997 ## rowData names(0): ## colnames(1): Heart.bam ## colData names(0): ``` --- # Low memory summarization When counting multiple files we can use the **BamFileList()** function in a similar manner but this time we provide a vector of the two BAM file paths. ```r myBam <- BamFileList(c("Heart.bam","Liver.bam"), yieldSize = 1000) allGeneCounts <- summarizeOverlaps(geneExons,myBam) allGeneCounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 27341 2 ## metadata(0): ## assays(1): counts ## rownames(27341): 1 10 ... 9994 9997 ## rowData names(0): ## colnames(2): Heart.bam Liver.bam ## colData names(0): ``` --- # Time for an exercise. [Link_to_exercises](../../exercises/exercises/summarization_exercise.html) [Link_to_answers](../../exercises/answers/summarization_answers.html)