Sequencing data from Illumina sequences machines are often stored in FASTQ format.
Sequence reads aligned to a genome are most often stored in BAM format.
A common analysis step in the interpretation of high-throughput sequencing experiments is to quantify the abundance of sequence in/over genomic regions.
We can provide a summarization of aligned data per base pair using wig, bedgraph and bigWig files.
We will often summarize our aligned data in BAM format over genomic regions of interest.
The genomic regions may be:
Simple genomic regions such as peaks or TSS regions are often stored as BED files. In Bioconductor these are represented as GRanges objects.
More complex, multi-part regions such as gene models are often stored as GTF or GFF files. In Bioconductor, gene models are handled as TxDb objects.
We will use two main functions to allows us to summarize our aligned data over the genome and within regions.
These are the coverage() and summarizeOverlaps() functions from GenomicsAlignments package.
In this session we will work with aligned data as BAM files.
I have provided BAM files we saw in our last session as our test data from today.
This can be found in data/liver.bodyMap.bam and data/heart.bodyMap.bam.
To make use of a GenomicsAlignments package we must first load the library.
library(GenomicAlignments)
The coverage() function can accept a Bam file path and will return an Rlelist of the number of reads at every base pair.
We should always work with sorted and indexed BAM files so we will perform these steps first.
<- sortBam("data/heart.bodyMap.bam","Heart")
sortedHeart indexBam(sortedHeart)
## Heart.bam
## "Heart.bam.bai"
Now we simply use the coverage() function with the path to our sorted file.
<- coverage("Heart.bam")
heartCoverage class(heartCoverage)
## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
As we have seen, Rlelist objects contain compressed vectors carrying the scores, here number of alignments, for every base pair in the genome.
Here we see scores only on chromosome 12 around the gene we have been reviewing, SLC25A3.
heartCoverage
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799
## Values : 0 2 0 3 ... 0 1 0
We have seen we can index, subset, manipulate and plot our RleList and Rle vector in a similar manner to standard lists and vectors.
<- heartCoverage[["chr12"]]
chr12Cov <- chr12Cov[98591400:98608400]
signalDepth <- data.frame(Position=98591400:98608400,
signalDepthScaled Signal=signalDepth*1000)
library(ggplot2)
ggplot(signalDepthScaled,aes(x=Position,y=Signal))+
geom_line()+theme_minimal()
The coverage() function will also accept GAlignments objects.
<- readGAlignments("Heart.bam")
heartAln <- coverage(heartAln) heartCov1
As well as GRanges and GRangesList objects too.
For RNAseq we should generate coverage from a GRangesList object to account for reads spannning exons.
<- granges(heartAln)
heartGR <- coverage(heartGR)
heartCov2 <- grglist(heartAln)
heartGRL <- coverage(heartGRL) heartCov3
This means we can perform some operations on our sequence reads in R prior to creating our bigWig.
Here we simply select only reads aligning to the positive strand prior to creating our signal scores..
<- heartAln[strand(heartAln) == "+"]
heartAlnPos <- coverage(heartAlnPos)
heartAlnPos "chr12"]
heartAlnPos[export.bw(heartAlnPos,con="heartPos.bw")
## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 3135 runs
## Lengths: 33666 50 3194555 35 ... 1530798 40 9963077
## Values : 0 2 0 3 ... 0 1 0
The coverage function allows for the specification of a additional parameter, weight.
The weight parameter applies a scaling factor to the calculated scores and so can be useful to normalize your signal e.g. to total mapped reads.
<- coverage("Heart.bam",
heartCoverageX10 weight = 10)
"chr12"]
heartCoverageX10[## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799
## Values : 0 20 0 30 ... 0 10 0
"chr12"]
heartCoverage[## RleList of length 1
## $chr12
## integer-Rle of length 133275309 with 4553 runs
## Lengths: 33666 50 3194555 35 ... 7200240 38 2762799
## Values : 0 2 0 3 ... 0 1 0
In combination with idxstatsBam() function, we could use this to normalize your signal to total mapped reads.
<- idxstatsBam("Heart.bam")
allChromosomeStats allChromosomeStats
## seqnames seqlength mapped unmapped
## 1 chr12 133275309 132381 0
## 2 * 0 0 223
<- sum(allChromosomeStats[,"mapped"])
mapped <- coverage("Heart.bam",
heartCoverageNorm weight = (10^6)/mapped)
"chr12"] heartCoverageNorm[
## RleList of length 1
## $chr12
## numeric-Rle of length 133275309 with 4553 runs
## Lengths: 33666 50 3194555 ... 38 2762799
## Values : 0.00000e+00 1.51079e+01 0.00000e+00 ... 7.55395e+00 -1.10187e-11
REMEMBER: You can always export the RLE to a BigWig with rtracklayer, so you can then view it in IGV.
Here we get the mean signal (read depth per base pair) over exons for slc25A3 gene, Entrez ID 5250.
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
<- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,
exonsOfGenes by="gene")
<- exonsOfGenes[["5250"]]
slc25A3 slc25A3
## GRanges object with 39 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr12 98593591-98593740 + | 381304 <NA>
## [2] chr12 98593625-98593740 + | 381305 <NA>
## [3] chr12 98593625-98594135 + | 381306 <NA>
## [4] chr12 98593650-98593740 + | 381307 <NA>
## [5] chr12 98593655-98593740 + | 381308 <NA>
## ... ... ... ... . ... ...
## [35] chr12 98601368-98601708 + | 381339 <NA>
## [36] chr12 98601368-98601997 + | 381340 <NA>
## [37] chr12 98601368-98602168 + | 381341 <NA>
## [38] chr12 98601368-98606367 + | 381342 <NA>
## [39] chr12 98601368-98606379 + | 381343 <NA>
## -------
## seqinfo: 595 sequences (1 circular) from hg38 genome
<- coverage("Heart.bam")
heartCoverageNorm <- mean(heartCoverageNorm[slc25A3])
myMeanCovOverExons myMeanCovOverExons
## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12
## 197.3933 255.2500 633.2740 323.8462 341.8605 345.5647 510.2909 715.6889
## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12
## 601.2222 554.0204 756.4282 1757.2919 473.0144 1770.5732 539.0552 4232.1680
## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12
## 2317.5153 204.4162 299.2623 2600.1274 6779.2570 6785.8833 6749.0552 5607.1697
## chr12 chr12 chr12 chr12 chr12 chr12 chr12 chr12
## 2104.1342 6218.8696 6158.3462 4092.3000 6375.2486 3771.8433 7370.8919 7180.3787
## chr12 chr12 chr12 chr12 chr12 chr12 chr12
## 7139.7459 6667.7147 6648.3255 3626.9619 2858.4856 478.1346 476.9898
To count reads within regions and across gene models we can take advantage of specialist counting tools available within the summarizeOverlaps() functions.
This is important as it will allow us to:
Often we will want to summarize to genes, not individual exons.
When using summarizeOverlaps() function, a read can only be assigned to one feature.
This is important to note when counting features such as overlapping exons.
We will learn how to deal with counting across overlapping features in later sessions such as in RNAseq.
The summarizeOverlaps() function is well intergrated with many of the Bioconductor objects we have seen.
As with the coverage() function it can count reads from Bam files or GRanges objects.
The summarizeOverlaps() function can count over GRanges (features e.g. exons) and GRangesLists (meta-features genes).
To count non-overlapping regions, we can simply specify the path to our BAM file and a GRanges of regions to count over.
The returned object is a RangedSummarizedExperiment object.
<- genes(TxDb.Hsapiens.UCSC.hg38.knownGene) geneBody
## 228 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
<- promoters(geneBody,500,500)
TSS <- summarizeOverlaps(TSS,"Heart.bam")
myTSScounts class(myTSScounts)
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
The resulting SummarizedExperiment object contains a matrix of our counts over regions and the original GRanges object we used for counting.
As with other Bioconductor objects, the object shows some useful summary information
myTSScounts
## class: RangedSummarizedExperiment
## dim: 27113 1
## metadata(0):
## assays(1): counts
## rownames(27113): 1 10 ... 9994 9997
## rowData names(1): gene_id
## colnames(1): Heart.bam
## colData names(0):
We can access the count matrix using the accessor function assay() with our SummarizedExperiment object.
Here we see total counts for SLC25A3 genes’ TSS region
<- assay(myTSScounts)
countMatrix "5250",] countMatrix[
## Heart.bam
## 5250 2934
The rowRanges() function will access the original GRanges we used for counting.
<- rowRanges(myTSScounts)
Granges Granges
## GRanges object with 27113 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58362252-58363251 - | 1
## 10 chr8 18390782-18391781 + | 10
## 100 chr20 44651734-44652733 - | 100
## 1000 chr18 28177447-28178446 - | 1000
## 10000 chr1 243850580-243851579 - | 10000
## ... ... ... ... . ...
## 9991 chr9 112333165-112334164 - | 9991
## 9992 chr21 34363506-34364505 + | 9992
## 9993 chr22 19121955-19122954 - | 9993
## 9994 chr6 89829394-89830393 + | 9994
## 9997 chr22 50525962-50526961 - | 9997
## -------
## seqinfo: 595 sequences (1 circular) from hg38 genome
For gene models, we may want to assign reads to each gene and so count all reads overlapping any exons within a gene.
We can count reads in meta-features such as genes using GRangesLists.
First we extract a GRangesList of exons for every gene.
<- exonsBy(TxDb.Hsapiens.UCSC.hg38.knownGene,by="gene")
geneExons "5250"] geneExons[
## GRangesList object of length 1:
## $`5250`
## GRanges object with 39 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr12 98593591-98593740 + | 381304 <NA>
## [2] chr12 98593625-98593740 + | 381305 <NA>
## [3] chr12 98593625-98594135 + | 381306 <NA>
## [4] chr12 98593650-98593740 + | 381307 <NA>
## [5] chr12 98593655-98593740 + | 381308 <NA>
## ... ... ... ... . ... ...
## [35] chr12 98601368-98601708 + | 381339 <NA>
## [36] chr12 98601368-98601997 + | 381340 <NA>
## [37] chr12 98601368-98602168 + | 381341 <NA>
## [38] chr12 98601368-98606367 + | 381342 <NA>
## [39] chr12 98601368-98606379 + | 381343 <NA>
## -------
## seqinfo: 595 sequences (1 circular) from hg38 genome
Now we can summarize counts for every gene by counting in their respective exons.
<- summarizeOverlaps(geneExons,"Heart.bam")
myGeneCounts myGeneCounts
## class: RangedSummarizedExperiment
## dim: 27341 1
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(1): Heart.bam
## colData names(0):
And now we have a count of signal in exons for every gene.
<- assay(myGeneCounts)
countMatrix "5250",] countMatrix[
## Heart.bam
## 5250 65799
The rowRanges() contains the original GRangesList of exons by genes.
<- rowRanges(myGeneCounts)
grgList grgList
## GRangesList object of length 27341:
## $`1`
## GRanges object with 29 ranges and 2 metadata columns:
## seqnames ranges strand | exon_id exon_name
## <Rle> <IRanges> <Rle> | <integer> <character>
## [1] chr19 58345178-58347029 - | 579646 <NA>
## [2] chr19 58345183-58347029 - | 579647 <NA>
## [3] chr19 58346854-58347029 - | 579648 <NA>
## [4] chr19 58346858-58347029 - | 579649 <NA>
## [5] chr19 58346860-58347029 - | 579650 <NA>
## ... ... ... ... . ... ...
## [25] chr19 58357585-58357649 - | 579673 <NA>
## [26] chr19 58357952-58358286 - | 579674 <NA>
## [27] chr19 58358489-58358585 - | 579675 <NA>
## [28] chr19 58359197-58359323 - | 579676 <NA>
## [29] chr19 58362677-58362751 - | 579677 <NA>
## -------
## seqinfo: 595 sequences (1 circular) from hg38 genome
##
## ...
## <27340 more elements>
The summarizeOverlaps() function can count across multiple BAM files.
We can specify a vector of BAM file paths we wish to count from to summarizeOverlaps() function.
<- summarizeOverlaps(geneExons,
allGeneCounts c("Heart.bam","Liver.bam"))
Now we can return the matrix of counts with samples by columns and genes as rows.
<- assay(allGeneCounts)
countMatrix "5250",] countMatrix[
## Heart.bam Liver.bam
## 5250 65799 24296
Counting from BAM files, as with sorting, can be configured to trade of memory usage for speed i.e. less RAM used, but takes longer to count.
We can control the amount of memory used by creating a BamFile object prior to counting using the BamFile() function
<- BamFile("Heart.bam")
myBam class(myBam)
## [1] "BamFile"
## attr(,"package")
## [1] "Rsamtools"
We can additionally specify the yieldSize parameter which controls how many reads are held in memory in one time.
Here we set the yieldSize to a very low 1000 reads, this will save a lot of memory but slow down processing.
We will often need to test the yieldSize which best fits our machine.
<- BamFile("Heart.bam", yieldSize = 1000)
myBam <- summarizeOverlaps(geneExons,myBam)
heartGeneCounts heartGeneCounts
## class: RangedSummarizedExperiment
## dim: 27341 1
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(1): Heart.bam
## colData names(0):
When counting multiple files we can use the BamFileList() function in a similar manner but this time we provide a vector of the two BAM file paths.
<- BamFileList(c("Heart.bam","Liver.bam"),
myBam yieldSize = 1000)
<- summarizeOverlaps(geneExons,myBam)
allGeneCounts allGeneCounts
## class: RangedSummarizedExperiment
## dim: 27341 2
## metadata(0):
## assays(1): counts
## rownames(27341): 1 10 ... 9994 9997
## rowData names(0):
## colnames(2): Heart.bam Liver.bam
## colData names(0):