Summarizing Scores

Raw Sequence data

Sequencing data from Illumina sequences machines are often stored in FASTQ format.

$igv$

Aligned sequence data

Sequence reads aligned to a genome are most often stored in BAM format.

igv

Summarizing Alignments

A common analysis step in the interpretation of high-throughput sequencing experiments is to quantify the abundance of sequence in/over genomic regions.

igv

Summarizing for RNAseq

In RNAseq experiments we assess the abundance of transcripts/genes under differing conditions by assessing the number of reads aligning with the transcript’s/gene’s exons.

Summarizing for ChIPseq

In ChIPseq experiments we compare the number of DNA fragments mapping to a genomic locus after an IP for a specific protein or DNA/histone state versus the number of DNA fragments mapping from a background control/input sample.

Summarizing Genomic Alignments

We can provide a summarization of aligned data per base pair using wig, bedgraph and bigWig files.

Summarizing alignments in/over regions

We will often summarize our aligned data in BAM format over genomic regions of interest.

Summarizing alignments in/over regions

The genomic regions may be:

Simple genomic regions such as peaks or TSS regions are often stored as BED files. In Bioconductor these are represented as GRanges objects.
More complex, multi-part regions such as gene models are often stored as GTF or GFF files. In Bioconductor, gene models are handled as TxDb objects.

Summarizing with Bioconductor

We will use two main functions to allows us to summarize our aligned data over the genome and within regions.

These are the coverage() and summarizeOverlaps() functions from GenomicsAlignments package.

coverage() - To create per base pair signal/alignments scores across genome i.e. bigWigs.
summarizeOverlaps() - To create counts of signal/alignments within genomic regions.

The Data

In this session we will work with aligned data as BAM files.

I have provided BAM files we saw in our last session as our test data from today.

This can be found in data/liver.bodyMap.bam and data/heart.bodyMap.bam.

data/liver.bodyMap.bam
data/heart.bodyMap.bam

GenomicAlignments packages

To make use of a GenomicsAlignments package we must first load the library.

library(GenomicAlignments)

Coverage

The coverage() function can accept a Bam file path and will return an Rlelist of the number of reads at every base pair.

We should always work with sorted and indexed BAM files so we will perform these steps first.

sortedHeart <- sortBam("data/heart.bodyMap.bam","Heart")
indexBam(sortedHeart)

##       Heart.bam 
## "Heart.bam.bai"

Coverage

Now we simply use the coverage() function with the path to our sorted file.

heartCoverage <- coverage("Heart.bam")
class(heartCoverage)

## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"

Coverage as RLElist

As we have seen, Rlelist objects contain compressed vectors carrying the scores, here number of alignments, for every base pair in the genome.

Here we see scores only on chromosome 12 around the gene we have been reviewing, SLC25A3.

heartCoverage

## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0        2        0        3 ...        0        1        0

Coverage as RLElist

We have seen we can index, subset, manipulate and plot our RleList and Rle vector in a similar manner to standard lists and vectors.

chr12Cov <- heartCoverage[["chr12"]]
signalDepth <- chr12Cov[98591400:98608400]
signalDepthScaled <- data.frame(Position=98591400:98608400,
                                Signal=signalDepth*1000)
library(ggplot2)
ggplot(signalDepthScaled,aes(x=Position,y=Signal))+
  geom_line()+theme_minimal()

Coverage function

The coverage() function will also accept GAlignments objects.

heartAln <- readGAlignments("Heart.bam")
heartCov1 <- coverage(heartAln)

Coverage function

As well as GRanges and GRangesList objects too.

For RNAseq we should generate coverage from a GRangesList object to account for reads spannning exons.

heartGR <- granges(heartAln)
heartCov2 <- coverage(heartGR)
heartGRL <- grglist(heartAln)
heartCov3 <- coverage(heartGRL)

Coverage function

This means we can perform some operations on our sequence reads in R prior to creating our bigWig.

Here we simply select only reads aligning to the positive strand prior to creating our signal scores..

heartAlnPos <- heartAln[strand(heartAln) == "+"]
heartAlnPos <- coverage(heartAlnPos)
heartAlnPos["chr12"]
export.bw(heartAlnPos,con="heartPos.bw")

## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 3135 runs
##   Lengths:    33666       50  3194555       35 ...  1530798       40  9963077
##   Values :        0        2        0        3 ...        0        1        0

Coverage function

The coverage function allows for the specification of a additional parameter, weight.

The weight parameter applies a scaling factor to the calculated scores and so can be useful to normalize your signal e.g. to total mapped reads.

heartCoverageX10 <- coverage("Heart.bam",
                          weight = 10)
heartCoverageX10["chr12"]
## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0       20        0       30 ...        0       10        0
heartCoverage["chr12"]
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
##   Lengths:    33666       50  3194555       35 ...  7200240       38  2762799
##   Values :        0        2        0        3 ...        0        1        0

Coverage function

In combination with idxstatsBam() function, we could use this to normalize your signal to total mapped reads.

allChromosomeStats <- idxstatsBam("Heart.bam")
allChromosomeStats

##   seqnames seqlength mapped unmapped
## 1    chr12 133275309 132381        0
## 2        *         0      0      223

Coverage function

mapped <- sum(allChromosomeStats[,"mapped"])
heartCoverageNorm <- coverage("Heart.bam",
                          weight = (10^6)/mapped)
heartCoverageNorm["chr12"]

## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
##   Lengths:        33666           50      3194555 ...           38      2762799
##   Values :  0.00000e+00  1.51079e+01  0.00000e+00 ...  7.55395e+00 -1.10187e-11

REMEMBER: You can always export the RLE to a BigWig with rtracklayer, so you can then view it in IGV.

Summarizing from scores

We can use RleLists to summarize scores over regions by subsetting with GRanges and using standard arithmetic operations.

Here we get the mean signal (read depth per base pair) over exons for slc25A3 gene, Entrez ID 5250.

library(TxDb.Hsapiens.UCSC.hg38.knownGene)
exonsOfGenes <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,
                        by="gene")
slc25A3 <- exonsOfGenes[["5250"]]
slc25A3

## GRanges object with 39 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr12 98593591-98593740      + |    381304        <NA>
##    [2]    chr12 98593625-98593740      + |    381305        <NA>
##    [3]    chr12 98593625-98594135      + |    381306        <NA>
##    [4]    chr12 98593650-98593740      + |    381307        <NA>
##    [5]    chr12 98593655-98593740      + |    381308        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [35]    chr12 98601368-98601708      + |    381339        <NA>
##   [36]    chr12 98601368-98601997      + |    381340        <NA>
##   [37]    chr12 98601368-98602168      + |    381341        <NA>
##   [38]    chr12 98601368-98606367      + |    381342        <NA>
##   [39]    chr12 98601368-98606379      + |    381343        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

Summarizing from scores

heartCoverageNorm <- coverage("Heart.bam")
myMeanCovOverExons <- mean(heartCoverageNorm[slc25A3])
myMeanCovOverExons

##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
##  197.3933  255.2500  633.2740  323.8462  341.8605  345.5647  510.2909  715.6889 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
##  601.2222  554.0204  756.4282 1757.2919  473.0144 1770.5732  539.0552 4232.1680 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 2317.5153  204.4162  299.2623 2600.1274 6779.2570 6785.8833 6749.0552 5607.1697 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 2104.1342 6218.8696 6158.3462 4092.3000 6375.2486 3771.8433 7370.8919 7180.3787 
##     chr12     chr12     chr12     chr12     chr12     chr12     chr12 
## 7139.7459 6667.7147 6648.3255 3626.9619 2858.4856  478.1346  476.9898

Summarizing counts in regions from alignments

Counting reads in/over regions

To count reads within regions and across gene models we can take advantage of specialist counting tools available within the summarizeOverlaps() functions.

This is important as it will allow us to:

Handle reads overlapping multiple regions (features or metafeatures).
Assign reads to meta-features (genes instead of exons).

Reads over/across exons

summarizeOverlaps - Genes not exons

Often we will want to summarize to genes, not individual exons.

igv

A read can only assigned to one feature!

When using summarizeOverlaps() function, a read can only be assigned to one feature.

This is important to note when counting features such as overlapping exons.

We will learn how to deal with counting across overlapping features in later sessions such as in RNAseq.

summarizeOverlaps

The summarizeOverlaps() function is well intergrated with many of the Bioconductor objects we have seen.

As with the coverage() function it can count reads from Bam files or GRanges objects.

The summarizeOverlaps() function can count over GRanges (features e.g. exons) and GRangesLists (meta-features genes).

summarizeOverlaps for regions

To count non-overlapping regions, we can simply specify the path to our BAM file and a GRanges of regions to count over.

The returned object is a RangedSummarizedExperiment object.

geneBody <- genes(TxDb.Hsapiens.UCSC.hg38.knownGene)

##   228 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.

TSS <- promoters(geneBody,500,500)
myTSScounts <- summarizeOverlaps(TSS,"Heart.bam")
class(myTSScounts)

## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"

SummarizedExperiment object

The resulting SummarizedExperiment object contains a matrix of our counts over regions and the original GRanges object we used for counting.

As with other Bioconductor objects, the object shows some useful summary information

myTSScounts

## class: RangedSummarizedExperiment 
## dim: 27113 1 
## metadata(0):
## assays(1): counts
## rownames(27113): 1 10 ... 9994 9997
## rowData names(1): gene_id
## colnames(1): Heart.bam
## colData names(0):

SummarizedExperiment object

We can access the count matrix using the accessor function assay() with our SummarizedExperiment object.

Here we see total counts for SLC25A3 genes’ TSS region

countMatrix <- assay(myTSScounts)
countMatrix["5250",]

##      Heart.bam
## 5250      2934

SummarizedExperiment object

The rowRanges() function will access the original GRanges we used for counting.

Granges <- rowRanges(myTSScounts)
Granges

## GRanges object with 27113 ranges and 1 metadata column:
##         seqnames              ranges strand |     gene_id
##            <Rle>           <IRanges>  <Rle> | <character>
##       1    chr19   58362252-58363251      - |           1
##      10     chr8   18390782-18391781      + |          10
##     100    chr20   44651734-44652733      - |         100
##    1000    chr18   28177447-28178446      - |        1000
##   10000     chr1 243850580-243851579      - |       10000
##     ...      ...                 ...    ... .         ...
##    9991     chr9 112333165-112334164      - |        9991
##    9992    chr21   34363506-34364505      + |        9992
##    9993    chr22   19121955-19122954      - |        9993
##    9994     chr6   89829394-89830393      + |        9994
##    9997    chr22   50525962-50526961      - |        9997
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

summarizeOverlaps for gene models

For gene models, we may want to assign reads to each gene and so count all reads overlapping any exons within a gene.

We can count reads in meta-features such as genes using GRangesLists.

First we extract a GRangesList of exons for every gene.

geneExons <- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene")
geneExons["5250"]

## GRangesList object of length 1:
## $`5250`
## GRanges object with 39 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr12 98593591-98593740      + |    381304        <NA>
##    [2]    chr12 98593625-98593740      + |    381305        <NA>
##    [3]    chr12 98593625-98594135      + |    381306        <NA>
##    [4]    chr12 98593650-98593740      + |    381307        <NA>
##    [5]    chr12 98593655-98593740      + |    381308        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [35]    chr12 98601368-98601708      + |    381339        <NA>
##   [36]    chr12 98601368-98601997      + |    381340        <NA>
##   [37]    chr12 98601368-98602168      + |    381341        <NA>
##   [38]    chr12 98601368-98606367      + |    381342        <NA>
##   [39]    chr12 98601368-98606379      + |    381343        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome

summarizeOverlaps for gene models

Now we can summarize counts for every gene by counting in their respective exons.

myGeneCounts <- summarizeOverlaps(geneExons,"Heart.bam")
myGeneCounts

## class: RangedSummarizedExperiment 
## dim: 27341 1 
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(1): Heart.bam
## colData names(0):

summarizeOverlaps for gene models

And now we have a count of signal in exons for every gene.

countMatrix <- assay(myGeneCounts)
countMatrix["5250",]

##      Heart.bam
## 5250     65799

summarizeOverlaps for gene models

The rowRanges() contains the original GRangesList of exons by genes.

grgList <- rowRanges(myGeneCounts)
grgList

## GRangesList object of length 27341:
## $`1`
## GRanges object with 29 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id   exon_name
##           <Rle>         <IRanges>  <Rle> | <integer> <character>
##    [1]    chr19 58345178-58347029      - |    579646        <NA>
##    [2]    chr19 58345183-58347029      - |    579647        <NA>
##    [3]    chr19 58346854-58347029      - |    579648        <NA>
##    [4]    chr19 58346858-58347029      - |    579649        <NA>
##    [5]    chr19 58346860-58347029      - |    579650        <NA>
##    ...      ...               ...    ... .       ...         ...
##   [25]    chr19 58357585-58357649      - |    579673        <NA>
##   [26]    chr19 58357952-58358286      - |    579674        <NA>
##   [27]    chr19 58358489-58358585      - |    579675        <NA>
##   [28]    chr19 58359197-58359323      - |    579676        <NA>
##   [29]    chr19 58362677-58362751      - |    579677        <NA>
##   -------
##   seqinfo: 595 sequences (1 circular) from hg38 genome
## 
## ...
## <27340 more elements>

SummarizeOverlaps for many files

The summarizeOverlaps() function can count across multiple BAM files.

We can specify a vector of BAM file paths we wish to count from to summarizeOverlaps() function.

allGeneCounts <- summarizeOverlaps(geneExons,
                                   c("Heart.bam","Liver.bam"))

SummarizeOverlaps for many files

Now we can return the matrix of counts with samples by columns and genes as rows.

countMatrix <- assay(allGeneCounts)
countMatrix["5250",]

##      Heart.bam Liver.bam
## 5250     65799     24296

Low memory summarization

Counting from BAM files, as with sorting, can be configured to trade of memory usage for speed i.e. less RAM used, but takes longer to count.

We can control the amount of memory used by creating a BamFile object prior to counting using the BamFile() function

myBam <- BamFile("Heart.bam")
class(myBam)

## [1] "BamFile"
## attr(,"package")
## [1] "Rsamtools"

Low memory summarization

We can additionally specify the yieldSize parameter which controls how many reads are held in memory in one time.

Here we set the yieldSize to a very low 1000 reads, this will save a lot of memory but slow down processing.

We will often need to test the yieldSize which best fits our machine.

myBam <- BamFile("Heart.bam", yieldSize = 1000)
heartGeneCounts <- summarizeOverlaps(geneExons,myBam)
heartGeneCounts

## class: RangedSummarizedExperiment 
## dim: 27341 1 
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(1): Heart.bam
## colData names(0):

Low memory summarization

When counting multiple files we can use the BamFileList() function in a similar manner but this time we provide a vector of the two BAM file paths.

myBam <- BamFileList(c("Heart.bam","Liver.bam"),
                     yieldSize = 1000)
allGeneCounts <- summarizeOverlaps(geneExons,myBam)
allGeneCounts

## class: RangedSummarizedExperiment 
## dim: 27341 2 
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(2): Heart.bam Liver.bam
## colData names(0):

Time for an exercise.

Link_to_exercises

Link_to_answers

Summarizing Scores In Bioconductor

Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Bioconductor_Introduction/

Summarizing Scores

Raw Sequence data

Aligned sequence data

Summarizing Alignments

Summarizing for RNAseq

Summarizing for ChIPseq

Summarizing Genomic Alignments

Summarizing alignments in/over regions

Summarizing alignments in/over regions

Summarizing with Bioconductor

The Data

GenomicAlignments packages

Coverage

Coverage

Coverage as RLElist

Coverage as RLElist

Coverage function

Coverage function

Coverage function

Coverage function

Coverage function

Coverage function

Summarizing from scores

Summarizing from scores

Summarizing counts in regions from alignments

Counting reads in/over regions

Reads over/across exons

summarizeOverlaps - Genes not exons

A read can only assigned to one feature!

summarizeOverlaps

summarizeOverlaps for regions

SummarizedExperiment object

SummarizedExperiment object

SummarizedExperiment object

summarizeOverlaps for gene models

summarizeOverlaps for gene models

summarizeOverlaps for gene models

summarizeOverlaps for gene models

SummarizeOverlaps for many files

SummarizeOverlaps for many files

Low memory summarization

Low memory summarization

Low memory summarization

Time for an exercise.