Summarizing Scores In Bioconductor

class: center, middle, inverse, title-slide

# Summarizing Scores In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

class: inverse, center, middle

# Summarizing Scores

---

# Raw Sequence data

Sequencing data from Illumina sequences machines are often stored in FASTQ format.

---
# Aligned sequence data

Sequence reads aligned to a genome are most often stored in BAM format.

---
# Summarizing Alignments

A common analysis step in the interpretation of high-throughput sequencing experiments is to quantify the abundance of sequence in/over genomic regions.

---
# Summarizing for RNAseq

* **In RNAseq** experiments we assess the abundance of transcripts/genes under differing conditions by assessing the number of reads aligning with the transcript's/gene's exons.

![](imgs/RNAseq.png)
---
# Summarizing for ChIPseq

* **In ChIPseq** experiments we compare the number of DNA fragments mapping to a genomic locus after an IP for a specific protein or DNA/histone state versus the number of DNA fragments mapping from a background control/input sample.

![](imgs/chipseq.png)
---
# Summarizing Genomic Alignments
We can provide a summarization of aligned data per base pair using **wig, bedgraph and bigWig** files.

![](imgs/scoreBP.png)

---
# Summarizing alignments in/over regions

We will often summarize our aligned data in BAM format over genomic regions of interest.

![](imgs/countss.png)

# Summarizing alignments in/over regions

The genomic regions may be:

* Simple genomic regions such as peaks or TSS regions are often stored as **BED** files. In Bioconductor these are represented as **GRanges** objects.

* More complex, multi-part regions such as gene models are often stored as **GTF** or **GFF** files. In Bioconductor, gene models are handled as **TxDb** objects.

---
# Summarizing with Bioconductor

We will use two main functions to allows us to summarize our aligned data over the genome and within regions.

These are the **coverage()** and **summarizeOverlaps()** functions from **GenomicsAlignments** package.

* **coverage()** - To create per base pair signal/alignments scores across genome i.e. bigWigs.
* **summarizeOverlaps()** - To create counts of signal/alignments within genomic regions.

---
# The Data

In this session we will work with aligned data as BAM files.

I have provided BAM files we saw in our last session as our test data from today.

This can be found in **data/liver.bodyMap.bam** and **data/heart.bodyMap.bam**.

* **data/liver.bodyMap.bam**
* **data/heart.bodyMap.bam**

---

# GenomicAlignments packages

To make use of a GenomicsAlignments package we must first load the library.

```r
library(GenomicAlignments)
```

---
# Coverage

The **coverage()** function can accept a Bam file path and will return an [Rlelist](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rle_in_genomics) of the number of reads at every base pair.

We should always work with sorted and indexed BAM files so we will perform these steps first.

```r
sortedHeart <- sortBam("data/heart.bodyMap.bam","Heart")
indexBam(sortedHeart)
```

```
##       Heart.bam 
## "Heart.bam.bai"
```

---
# Coverage

Now we simply use the **coverage()** function with the path to our sorted file.

```r
heartCoverage <- coverage("Heart.bam")
class(heartCoverage)
```

```
## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
```

---
# Coverage as RLElist

As we have seen,  [Rlelist](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rle_in_genomics) objects contain compressed vectors carrying the scores, here number of alignments, for every base pair in the genome.

Here we see scores only on chromosome 12 around the gene we have been reviewing, SLC25A3.

```r
heartCoverage
```

```
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0        2        0        3 ...        0        1        0
```

---
# Coverage as RLElist

We have seen we can [index, subset, manipulate and plot our RleList and Rle vector](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#indexing_an_rle) in a similar manner to standard lists and vectors.

```r
chr12Cov <- heartCoverage[["chr12"]]
signalDepth <- chr12Cov[98591400:98608400]
signalDepthScaled <- data.frame(Position=98591400:98608400,
                                Signal=signalDepth*1000)
library(ggplot2)
ggplot(signalDepthScaled,aes(x=Position,y=Signal))+
  geom_line()+theme_minimal()
```

![](Summarising_Scores_In_Bioconductor_files/figure-html/a3-1.png)

---
# Coverage function

The **coverage()** function will also accept 
**GAlignments** objects.

```r
heartAln <- readGAlignments("Heart.bam")
heartCov1 <- coverage(heartAln)
```

![](Summarising_Scores_In_Bioconductor_files/figure-html/frfga3-1.png)

---
# Coverage function

As well as **GRanges** and **GRangesList** objects too.

For RNAseq we should generate coverage from a [**GRangesList** object to account for reads spannning exons](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignedDataInBioconductor.html#36).

```r
heartGR <- granges(heartAln)
heartCov2 <- coverage(heartGR)
heartGRL <- grglist(heartAln)
heartCov3 <- coverage(heartGRL)
```

![](Summarising_Scores_In_Bioconductor_files/figure-html/frfa3-1.png)

---
# Coverage function

This means we can perform some operations on our sequence reads in R prior to creating our bigWig.

Here we simply select only reads aligning to the positive strand prior to creating our signal scores..

```r
heartAlnPos <- heartAln[strand(heartAln) == "+"]
heartAlnPos <- coverage(heartAlnPos)
heartAlnPos["chr12"]
export.bw(heartAlnPos,con="heartPos.bw")
```

```
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 3135 runs
##   Lengths:    33666       50  3194555       35 ...  1530798       40  9963077
##   Values :        0        2        0        3 ...        0        1        0
```

---
# Coverage function

The coverage function allows for the specification of a additional parameter, **weight**.

The **weight** parameter applies a scaling factor to the calculated scores and so can be useful to normalize your signal e.g. to total mapped reads.

```r
heartCoverageX10 <- coverage("Heart.bam",
                          weight = 10)
heartCoverageX10["chr12"]
## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0       20        0       30 ...        0       10        0
heartCoverage["chr12"]
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0        2        0        3 ...        0        1        0
```

---
# Coverage function

In combination with **idxstatsBam()** function, we could use this to normalize your signal to total mapped reads.

```r
allChromosomeStats <- idxstatsBam("Heart.bam")
allChromosomeStats
```

```
##   seqnames seqlength mapped unmapped
## 1    chr12 133275309 132381        0
## 2        *         0      0      223
```

---
# Coverage function

```r
mapped <- sum(allChromosomeStats[,"mapped"])
heartCoverageNorm <- coverage("Heart.bam",
                          weight = (10^6)/mapped)
heartCoverageNorm["chr12"]
```

```
## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
##   Lengths:        33666           50      3194555 ...           38      2762799
##   Values :  0.00000e+00  1.51079e+01  0.00000e+00 ...  7.55395e+00 -1.10187e-11
```

**REMEMBER: You can always export the RLE to a BigWig with rtracklayer, so you can then view it in IGV.**

---
# Summarizing from scores

We can use [RleLists to summarize scores over regions by subsetting with GRanges and using standard arithmetic operations.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/singlepage/GenomicScores_In_Bioconductor.html#rlelists_and_granges)

Here we get the mean signal (read depth per base pair) over exons for slc25A3 gene, **Entrez ID 5250**.

```r
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
exonsOfGenes <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,
                        by="gene")
slc25A3 <- exonsOfGenes[["5250"]]
slc25A3
```

```
## GRanges object with 39 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr12 98593591-98593740      + |    381304        <NA>
##    [2]    chr12 98593625-98593740      + |    381305        <NA>
##    [3]    chr12 98593625-98594135      + |    381306        <NA>
##    [4]    chr12 98593650-98593740      + |    381307        <NA>
##    [5]    chr12 98593655-98593740      + |    381308        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [35]    chr12 98601368-98601708      + |    381339        <NA>
##   [36]    chr12 98601368-98601997      + |    381340        <NA>
##   [37]    chr12 98601368-98602168      + |    381341        <NA>
##   [38]    chr12 98601368-98606367      + |    381342        <NA>
##   [39]    chr12 98601368-98606379      + |    381343        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
```

---
# Summarizing from scores

```r
heartCoverageNorm <- coverage("Heart.bam")
myMeanCovOverExons <- mean(heartCoverageNorm[slc25A3])
myMeanCovOverExons
```

```
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
##  197.3933  255.2500  633.2740  323.8462  341.8605  345.5647  510.2909  715.6889 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
##  601.2222  554.0204  756.4282 1757.2919  473.0144 1770.5732  539.0552 4232.1680 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 2317.5153  204.4162  299.2623 2600.1274 6779.2570 6785.8833 6749.0552 5607.1697 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 2104.1342 6218.8696 6158.3462 4092.3000 6375.2486 3771.8433 7370.8919 7180.3787 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 7139.7459 6667.7147 6648.3255 3626.9619 2858.4856  478.1346  476.9898
```

---
class: inverse, center, middle

# Summarizing counts in regions from alignments

---

# Counting reads in/over regions

To count reads within regions and across gene models we can take advantage of specialist counting tools available within the **summarizeOverlaps()** functions.

This is important as it will allow us to:

* Handle reads overlapping multiple regions (features or metafeatures).
* Assign reads to meta-features (genes instead of exons).

---
# Reads over/across exons

![](imgs/exonsAndGenes.png)

---
# summarizeOverlaps - Genes not exons

Often we will want to summarize to genes, not individual exons.

---
# A read can only assigned to one feature!

When using **summarizeOverlaps()** function, a read can only be assigned to one feature.

This is important to note when counting features such as overlapping exons.

We will learn how to deal with counting across overlapping features in later sessions such as in RNAseq.

---
# summarizeOverlaps

The **summarizeOverlaps()** function is well intergrated with many of the Bioconductor objects we have seen.

As with the **coverage()** function it can count reads from **Bam** files or **GRanges** objects.

The **summarizeOverlaps()** function can count over **GRanges (features e.g. exons)** and **GRangesLists (meta-features genes)**.

---
# summarizeOverlaps for regions

To count non-overlapping regions, we can simply specify the path to our BAM file and a GRanges of regions to count over.

The returned object is a **RangedSummarizedExperiment** object.

```r
geneBody <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)
```

```
##   228 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
```

```r
TSS <- promoters(geneBody,500,500)
myTSScounts <- summarizeOverlaps(TSS,"Heart.bam")
class(myTSScounts)
```

```
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
```

---
# SummarizedExperiment object

The resulting **SummarizedExperiment** object contains a matrix of our counts over regions and the original GRanges object we used for counting.

As with other Bioconductor objects, the object shows some useful summary information

```r
myTSScounts
```

```
## class: RangedSummarizedExperiment 
## dim: 27113 1 
## metadata(0):
## assays(1): counts
## rownames(27113): 1 10 ... 9994 9997
## rowData names(1): gene_id
## colnames(1): Heart.bam
## colData names(0):
```

---
# SummarizedExperiment object

We can access the count matrix using the accessor function **assay()** with our **SummarizedExperiment** object.

Here we see total counts for SLC25A3 genes' TSS region

```r
countMatrix <- assay(myTSScounts)
countMatrix["5250",]
```

```
##      Heart.bam
## 5250      2934
```

---
# SummarizedExperiment object

The **rowRanges()** function will access the original **GRanges** we used for counting.

```r
Granges <- rowRanges(myTSScounts)
Granges
```

```
## GRanges object with 27113 ranges and 1 metadata column:
##         seqnames              ranges strand |     gene_id
##            <Rle>           <IRanges>  <Rle> | <character>
##       1    chr19   58362252-58363251      - |           1
##      10     chr8   18390782-18391781      + |          10
##     100    chr20   44651734-44652733      - |         100
##    1000    chr18   28177447-28178446      - |        1000
##   10000     chr1 243850580-243851579      - |       10000
##     ...      ...                 ...    ... .         ...
##    9991     chr9 112333165-112334164      - |        9991
##    9992    chr21   34363506-34364505      + |        9992
##    9993    chr22   19121955-19122954      - |        9993
##    9994     chr6   89829394-89830393      + |        9994
##    9997    chr22   50525962-50526961      - |        9997
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
```

---
# summarizeOverlaps for gene models

For gene models, we may want to assign reads to each gene and so count all reads overlapping any exons within a gene.

We can count reads in meta-features such as genes using **GRangesLists**.

First we extract a GRangesList of exons for every gene.

```r
geneExons <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene")
geneExons["5250"]
```

```
## GRangesList object of length 1:
## $`5250`
## GRanges object with 39 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr12 98593591-98593740      + |    381304        <NA>
##    [2]    chr12 98593625-98593740      + |    381305        <NA>
##    [3]    chr12 98593625-98594135      + |    381306        <NA>
##    [4]    chr12 98593650-98593740      + |    381307        <NA>
##    [5]    chr12 98593655-98593740      + |    381308        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [35]    chr12 98601368-98601708      + |    381339        <NA>
##   [36]    chr12 98601368-98601997      + |    381340        <NA>
##   [37]    chr12 98601368-98602168      + |    381341        <NA>
##   [38]    chr12 98601368-98606367      + |    381342        <NA>
##   [39]    chr12 98601368-98606379      + |    381343        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
```

---
# summarizeOverlaps for gene models

Now we can summarize counts for every gene by counting in their respective exons.

```r
myGeneCounts <- summarizeOverlaps(geneExons,"Heart.bam")
myGeneCounts
```

```
## class: RangedSummarizedExperiment 
## dim: 27341 1 
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(1): Heart.bam
## colData names(0):
```

---
# summarizeOverlaps for gene models

And now we have a count of signal in exons for every gene.

```r
countMatrix <- assay(myGeneCounts)
countMatrix["5250",]
```

```
##      Heart.bam
## 5250     65799
```

---
# summarizeOverlaps for gene models

The **rowRanges()**  contains the original GRangesList of exons by genes.

```r
grgList <- rowRanges(myGeneCounts)
grgList
```

```
## GRangesList object of length 27341:
## $`1`
## GRanges object with 29 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr19 58345178-58347029      - |    579646        <NA>
##    [2]    chr19 58345183-58347029      - |    579647        <NA>
##    [3]    chr19 58346854-58347029      - |    579648        <NA>
##    [4]    chr19 58346858-58347029      - |    579649        <NA>
##    [5]    chr19 58346860-58347029      - |    579650        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [25]    chr19 58357585-58357649      - |    579673        <NA>
##   [26]    chr19 58357952-58358286      - |    579674        <NA>
##   [27]    chr19 58358489-58358585      - |    579675        <NA>
##   [28]    chr19 58359197-58359323      - |    579676        <NA>
##   [29]    chr19 58362677-58362751      - |    579677        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
## 
## ...
## <27340 more elements>
```

---
# SummarizeOverlaps for many files

The **summarizeOverlaps()** function can count across multiple BAM files.

We can specify a vector of BAM file paths we wish to count from to **summarizeOverlaps()** function.

```r
allGeneCounts <- summarizeOverlaps(geneExons,
                                   c("Heart.bam","Liver.bam"))
```

---
# SummarizeOverlaps for many files

Now we can return the matrix of counts with samples by columns and genes as rows.

```r
countMatrix <- assay(allGeneCounts)
countMatrix["5250",]
```

```
##      Heart.bam Liver.bam
## 5250     65799     24296
```

---
# Low memory summarization

Counting from BAM files, as with sorting, can be configured to trade of memory usage for speed i.e. less RAM used, but takes longer to count.

We can control the amount of memory used by creating a **BamFile** object prior to counting using the **BamFile()** function

```r
myBam <- BamFile("Heart.bam")
class(myBam)
```

```
## [1] "BamFile"
## attr(,"package")
## [1] "Rsamtools"
```

---
# Low memory summarization

We can additionally specify the **yieldSize** parameter which controls how many reads are held in memory in one time.

Here we set the **yieldSize** to a very low 1000 reads, this will save a lot of memory but slow down processing.

We will often need to test the **yieldSize** which best fits our machine.

```r
myBam <- BamFile("Heart.bam", yieldSize = 1000)
heartGeneCounts <- summarizeOverlaps(geneExons,myBam)
heartGeneCounts
```

---
# Low memory summarization

When counting multiple files we can use the **BamFileList()** function in a similar manner but this time we provide a vector of the two BAM file paths.

```r
myBam <- BamFileList(c("Heart.bam","Liver.bam"),
                     yieldSize = 1000)
allGeneCounts <- summarizeOverlaps(geneExons,myBam)
allGeneCounts
```

```
## class: RangedSummarizedExperiment 
## dim: 27341 2 
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(2): Heart.bam Liver.bam
## colData names(0):
```

---
# Time for an exercise.

[Link_to_exercises](../../exercises/exercises/summarization_exercise.html)

[Link_to_answers](../../exercises/answers/summarization_answers.html)