Aligned sequence reads are stored in BAM format.
Information on the content and state of BAM file is stored in its header.
The body of the BAM file hold information on the original reads.
As well as on the positions reads map to in the genome.
Aligned data is handled in Bioconductor using the GenomicAlignments package.
GenomicAlignments package builds on tools in other packages we have already encountered such as the Rsamtools, GenomicRanges, ShortRead, BSgenome and rtracklayer packages.
In our previous session we saw how to align data to the genome using an aligner and that we can use a splice aware aligner to map RNAseq reads across splice junctions.
In this session we will work with aligned data as BAM files.
I have provided BAM files we saw in our last session as our test data from today.
This can be found in data/liver.bodyMap.bam
We introduced the Rsamtools package in our last session to help us post process our newly aligned BAM files.
The Rsamtools package is the basis for many R/Bioconductor packages working with alignments in BAM format including another package we will work with today, the GenomicAligments package.
First we can load the Rsamtools package.
library(Rsamtools)
We saw in our last session we can sort files using the sortBam() function. This function returns the name of sorted BAM file.
By default the file is sorted by chromosome name and then by coordinates of reads within these chromosomes.
<- sortBam("data/liver.bodyMap.bam",
coordSorted "Sorted_liver")
coordSorted
## [1] "Sorted_liver.bam"
Some external programs will require reads be sorted by read name, not coordinates. To sort by read name we can set the sortBam arguement byQname to TRUE.
<- sortBam("data/liver.bodyMap.bam",
readnameSorted "SortedByName_liver",
byQname=TRUE)
readnameSorted
## [1] "SortedByName_liver.bam"
We can control how much memory we use with the maxMemory parameter. This allows to sort very large files on smaller memory computers (such as our laptops).
The maxMemory is specified as the maximum MB of RAM which our sortBam() function call can use. In sorting Rsamtools will produce mulitple smaller BAM files, the smaller the maxMemory value the greater then number of temporary files.
Here in this example, we sort our file in 1MB of memory and it will produce several temporary files.
<- sortBam("data/liver.bodyMap.bam",
coordSorted "Sorted_liver",
maxMemory=1)
coordSorted
## [1] "Sorted_liver.bam"
Once we have a coordinate sorted file we can index these files to allow for use in other programs such as IGV.
Importantly for us, an indexed BAM file allows us to access information from a BAM file by genomic location.
indexBam("Sorted_liver.bam")
## Sorted_liver.bam
## "Sorted_liver.bam.bai"
We saw in our last session that we can use quickBamFlagSummary() function to get information on alignnment rates and other BAM flags.
quickBamFlagSummary("Sorted_liver.bam")
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 48401 | 33538 | 1.44 / 2
## o template has single segment.... S | 48401 | 33538 | 1.44 / 2
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 48294 | 33489 | 1.44 / 2
## - primary alignment......... S2 | 48294 | 33489 | 1.44 / 2
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 107 | 106 | 1.01 / 2
Despite the name, the quickBamFlagSummary can take a noticeable amount of time when working with large files.
To get a very quick overview of number of mapped reads we can use the indexed BAM file and the idxstatsBam() function.
The idxstatsBam() function returns a data.frame containing chromosome name, chromosome lengths and number of mapped reads.
idxstatsBam("Sorted_liver.bam")
## seqnames seqlength mapped unmapped
## 1 chr12 133275309 48294 0
## 2 * 0 0 107
Importing and handling of BAM files is handled largely in the GenomicAlignments package.
We first load the package.
::install('GenomicAlignments')
BiocManagerlibrary(GenomicAlignments)
We saw in our earlier sessions on file formats that BAM files contain header information.
This header information provides a list of the chromosomes used and their lengths as well as the current state of BAM sorting (unsorted, coordinate or by name).
Additional information on programs used in processing of BAM file may also be present in BAM header
We can retrieve the BAM header in R using the scanBamHeader() function and the name of the file we wish to access header information from.
Header information is returned as a list.
<- scanBamHeader("Sorted_liver.bam")
myHeader str(myHeader)
## List of 1
## $ Sorted_liver.bam:List of 2
## ..$ targets: Named int 133275309
## .. ..- attr(*, "names")= chr "chr12"
## ..$ text :List of 3
## .. ..$ @HD: chr [1:2] "VN:1.0" "SO:coordinate"
## .. ..$ @SQ: chr [1:2] "SN:chr12" "LN:133275309"
## .. ..$ @PG: chr [1:4] "ID:subread" "PN:subread" "VN:Rsubread 2.0.1" "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_cour"| __truncated__
We can access useful information such as chromosome names and lengths as with other lists using $ accessors. The first level of list is the file names. The second levels of list are
names(myHeader)
## [1] "Sorted_liver.bam"
names(myHeader$Sorted_liver.bam)
## [1] "targets" "text"
The chromosome lengths are stored in a named vector under the targets list.
$Sorted_liver.bam$targets myHeader
## chr12
## 133275309
The text list contains information on sorting and programs.
$Sorted_liver.bam$text myHeader
## $`@HD`
## [1] "VN:1.0" "SO:coordinate"
##
## $`@SQ`
## [1] "SN:chr12" "LN:133275309"
##
## $`@PG`
## [1] "ID:subread"
## [2] "PN:subread"
## [3] "VN:Rsubread 2.0.1"
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
We can see the order by reviewing the HD element. Here we sorted by coordinate.
$Sorted_liver.bam$text["@HD"] myHeader
## $`@HD`
## [1] "VN:1.0" "SO:coordinate"
We can check the order for our name sorted BAM file too.
<- scanBamHeader("SortedByName_liver.bam")
myHeader $SortedByName_liver.bam$text["@HD"] myHeader
## $`@HD`
## [1] "VN:1.0" "SO:queryname"
We can see the program information by reviewing the PG element. Note that PG elements are not always complete and depend on tools used. Here we can see aligner, version used and the command line command itself.
<- scanBamHeader("Sorted_liver.bam")
myHeader $Sorted_liver.bam$text["@PG"] myHeader
## $`@PG`
## [1] "ID:subread"
## [2] "PN:subread"
## [3] "VN:Rsubread 2.0.1"
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
Now we have an idea how our BAM was constructed, we want to start to retrieve some of the data from our BAM file.
We can use the readGAlignments() function to import the BAM data into R.
The returned object is a GAlignments object.
<- readGAlignments("Sorted_liver.bam")
myReads class(myReads)
## [1] "GAlignments"
## attr(,"package")
## [1] "GenomicAlignments"
The resulting GAlignments object contains much of the information we saw in the SAM file earlier on.
This include the chromosome, strand, start and end position of alignment.
1:2,] myReads[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
These objects are similar to GRanges objects, as we have the width of our ranges (end coordinate - start coordinate) in the GAlignments objects.
Additionally we have the qwidth which contains information on the width of the original read.
1:2,] myReads[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
The GAlignments object also contains information on the cigar strings within alignments and the number of junctions a read spans in the njunc column
Cigar strings denote the matches against reference.
75M - This is 75 matches in a row.
1:2,] myReads[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
The GAlignments object inherits much of the functionality we have seen with GRanges object.
We can access information using the same accessors as we saw with GRanges objects.
seqnames(myReads)
## factor-Rle of length 48294 with 1 run
## Lengths: 48294
## Values : chr12
## Levels(1): chr12
start(myReads)[1:2]
## [1] 956642 7243955
The GAlignments object also has some new accessors to access the cigar and njunc information using the cigar and njunc functions.
cigar(myReads)[1:2]
## [1] "31M19S" "20S30M"
njunc(myReads)[1:2]
## [1] 0 0
We can also index and subset the same way as with GRanges objects.
Here we only keep reads on positive strand.
strand(myReads) == "+"] myReads[
## GAlignments object with 24520 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 20S30M 50 7243955 7243984 30
## [2] chr12 + 29M21S 50 7923596 7923624 29
## [3] chr12 + 29M21S 50 7923596 7923624 29
## [4] chr12 + 10S30M10S 50 24010815 24010844 30
## [5] chr12 + 10S30M10S 50 24010815 24010844 30
## ... ... ... ... ... ... ... ...
## [24516] chr12 + 23M27S 50 123412219 123412241 23
## [24517] chr12 + 23M27S 50 123412219 123412241 23
## [24518] chr12 + 23M27S 50 123412219 123412241 23
## [24519] chr12 + 23M27S 50 123412219 123412241 23
## [24520] chr12 + 29M21S 50 132800747 132800775 29
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [24516] 0
## [24517] 0
## [24518] 0
## [24519] 0
## [24520] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
We can alter range positions using the GenomicRanges narrow() function.
Here we resize our reads to be the 5’ 1st base pair at the beginning of every read. Note that narrow() function does not take notice of strand.
The cigar strings and njunc will automatically be altered as well.
<- narrow(myReads, start=1, width = 1)
my5primeReads 1:2] my5primeReads[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 1M 1 956642 956642 1
## [2] chr12 + 1M 1 7243955 7243955 1
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
We can control this ourselves with a little subsetting.
<- narrow(myReads[strand(myReads) == "+"],
myReadsPos start=1, width = 1)
<- narrow(myReads[strand(myReads) == "-"],
myReadsNeg end=-1, width = 1)
<- c(myReadsPos,myReadsNeg)
my5primeReads 1:2] my5primeReads[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 1M 1 7243955 7243955 1
## [2] chr12 + 1M 1 7923596 7923596 1
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
We can convert a GAlignments object to a GRanges to take advantage of other functions using the granges() function.
This is most useful when reads align continously to a genome (WGS, ChIP-seq, ATAC-seq).
<- granges(myReads,use.mcols = TRUE)
myReadAsGRanges myReadAsGRanges
## GRanges object with 48294 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 956642-956672 -
## [2] chr12 7243955-7243984 +
## [3] chr12 7923596-7923624 +
## [4] chr12 7923596-7923624 +
## [5] chr12 12556357-12556386 -
## ... ... ... ...
## [48290] chr12 132082663-132082688 -
## [48291] chr12 132082663-132082688 -
## [48292] chr12 132082663-132082688 -
## [48293] chr12 132800747-132800775 +
## [48294] chr12 132811683-132811711 -
## -------
## seqinfo: 1 sequence from an unspecified genome
To convert RNA-seq reads which may span exons is a little more difficult. Since a read can potentially span multiple exons,a single read may need to be converted to multiple ranges.
To solve this we can use the grglist() function to return a GRangesList with a separate GRanges for each read.
<- grglist(myReads,use.mcols = TRUE)
myReadAsGRangesList njunc(myReads) == 1] myReadAsGRangesList[
## GRangesList object of length 9291:
## [[1]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593676-98593740 +
## [2] chr12 98593975-98593984 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593678-98593740 +
## [2] chr12 98593975-98593984 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## [[3]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593681-98593740 +
## [2] chr12 98593975-98593989 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## ...
## <9288 more elements>
We can convert GRanges back to a GAlignments object using the function as(myGranges, “GAlignments”).
<- granges(myReads, use.mcols = TRUE)
myReadAsGRanges <- as(myReadAsGRanges, "GAlignments")
myReadsAgain 1:2] myReadsAgain[
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M 31 956642 956672 31
## [2] chr12 + 30M 30 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
This allows to perform some complex operations as GRanges and then convert back to a GAlignments object.
<- granges(myReads, use.mcols = TRUE)
myReadAsGRanges <- resize(myReadAsGRanges, fix = "start", width = 1)
my5Prime <- as(my5Prime, "GAlignments")
my5PrimeAsReads my5PrimeAsReads
## GAlignments object with 48294 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 1M 1 956672 956672 1
## [2] chr12 + 1M 1 7243955 7243955 1
## [3] chr12 + 1M 1 7923596 7923596 1
## [4] chr12 + 1M 1 7923596 7923596 1
## [5] chr12 - 1M 1 12556386 12556386 1
## ... ... ... ... ... ... ... ...
## [48290] chr12 - 1M 1 132082688 132082688 1
## [48291] chr12 - 1M 1 132082688 132082688 1
## [48292] chr12 - 1M 1 132082688 132082688 1
## [48293] chr12 + 1M 1 132800747 132800747 1
## [48294] chr12 - 1M 1 132811711 132811711 1
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [48290] 0
## [48291] 0
## [48292] 0
## [48293] 0
## [48294] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
One very good reason to convert our GRanges objects back to a GAlignments object is so we can export our modified reads back to a BAM.
We can use the rtracklayer packages export() function to export our GAlignments file to a BAM file.
library(rtracklayer)
export(my5PrimeAsReads, con="myModifiedReads.bam")
Handling large BAM file can mean we will often need to import only a subset of reads from a BAM file in one go. We can have a fine degree of control over what we import from a BAM file using the ScanBamParam() function.
Most importantly we can specify the what and which parameters to control what read information we import and which regions we import respectively.
We can specify to import information from only specific regions by providing a GRanges of regions of interest to the which parameter in the ScanBamParam() function.
<- GRanges("chr12", IRanges(98591400,98608400))
myRanges <- ScanBamParam(which=myRanges)
myParam myParam
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 1 ranges
## bamWhat:
## bamMapqFilter: NA
We can then use the newly created ScanBamParam object within our readGAlignments() function call.
Now we import only reads which overlap our specified GRanges.
<- readGAlignments("Sorted_liver.bam", param = myParam)
filteredReads filteredReads
## GAlignments object with 48147 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 50M 50 98593646 98593695 50
## [2] chr12 + 50M 50 98593646 98593695 50
## [3] chr12 + 50M 50 98593646 98593695 50
## [4] chr12 + 50M 50 98593647 98593696 50
## [5] chr12 + 50M 50 98593647 98593696 50
## ... ... ... ... ... ... ... ...
## [48143] chr12 - 50M 50 98606306 98606355 50
## [48144] chr12 - 50M 50 98606306 98606355 50
## [48145] chr12 - 50M 50 98606306 98606355 50
## [48146] chr12 - 50M 50 98606311 98606360 50
## [48147] chr12 - 50M 50 98606311 98606360 50
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [48143] 0
## [48144] 0
## [48145] 0
## [48146] 0
## [48147] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
We can also control the information we import using the what parameter in ScanBamParam() function.
Here we import the read name, sequence and qualities.
<- ScanBamParam(what=c("qname", "seq", "qual"))
myParam <- readGAlignments("Sorted_liver.bam", param = myParam)
infoInReads 1] infoInReads[
## GAlignments object with 1 alignment and 3 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## njunc | qname seq
## <integer> | <character> <DNAStringSet>
## [1] 0 | HWI-BRUNOP16X_0001:8.. GCTCAAGCGA...CACAATTGNG
## qual
## <PhredQuality>
## [1] gg^eg]\\[Z[...SSTTTTSSBT
## -------
## seqinfo: 1 sequence from an unspecified genome
We can access this additional information as we would in GRanges objects by using the mcols function.
mcols(infoInReads)
## DataFrame with 48294 rows and 3 columns
## qname seq qual
## <character> <DNAStringSet> <PhredQuality>
## 1 HWI-BRUNOP16X_0001:8.. GCTCAAGCGA...CACAATTGNG gg^eg]\\[Z[...SSTTTTSSBT
## 2 HWI-BRUNOP16X_0001:8.. AAAAATACAA...TAGTTCCAGC aeebeebfgg...gfgbgeeebe
## 3 HWI-BRUNOP16X_0001:8.. AGAGTGAAAC...ATGATTTTTG gggggggggg...gggggggggg
## 4 HWI-BRUNOP16X_0001:8.. AGAGTGAAAC...ATGATTTTTG gggggggggg...gggggggggg
## 5 HWI-BRUNOP16X_0001:8.. TTGGAGACCA...TACAAAAAAT gggggggggg...ffgggggggg
## ... ... ... ...
## 48290 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggggggggg...eggggggggg
## 48291 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggggf^ece...dfggggeggg
## 48292 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggegggggg...gggggggggg
## 48293 HWI-BRUNOP16X_0001:8.. CCAGGAGTTG...CTCTCTCTAC gggggggggg...gggggggggg
## 48294 HWI-BRUNOP16X_0001:8.. GGCTGTTCTT...TCAGCCTCCC BBBBBBBBBB...TTTSKSQXTT
Using the ScanBamParam() function in combination with a loop and information from our header file we can process our BAM files, one chromosome at a time.
<- scanBamHeader("Sorted_liver.bam")
bamHeader <- bamHeader$Sorted_liver.bam$targets
myChromosomes for(i in 1:length(myChromosomes)){
<- GRanges(names(myChromosomes)[i],
grangesForImport IRanges(1,myChromosomes)[i])
<- ScanBamParam(which = grangesForImport)
myParam <- readGAlignments("Sorted_liver.bam",
myReads param=myParam)
print(length(myReads))
}
## [1] 48294