+ - 0:00:00
Notes for current slide
Notes for next slide

Aligned data In Bioconductor


Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Bioconductor_Introduction/

1 / 46

Aligned Sequences


2 / 46

Aligned Sequences

Aligned sequence reads are stored in BAM format.

igv
3 / 46

Aligned Sequences - BAM header

Information on the content and state of BAM file is stored in its header.

igv
4 / 46

Aligned Sequences - BAM reads

The body of the BAM file hold information on the original reads.

igv
5 / 46

Aligned Sequences - BAM reads

As well as on the positions reads map to in the genome.

igv
6 / 46

Aligned data and Bioconductor

Aligned data is handled in Bioconductor using the GenomicAlignments package.

GenomicAlignments package builds on tools in other packages we have already encountered such as the Rsamtools, GenomicRanges, ShortRead, BSgenome and rtracklayer packages.

7 / 46

From unaligned to aligned data

In our previous session we saw how to align data to the genome using an aligner and that we can use a splice aware aligner to map RNAseq reads across splice junctions.

igv
8 / 46

The Data

In this session we will work with aligned data as BAM files.

I have provided BAM files we saw in our last session as our test data from today.

This can be found in data/liver.bodyMap.bam

9 / 46

Rsamtools

We introduced the Rsamtools package in our last session to help us post process our newly aligned BAM files.

The Rsamtools package is the basis for many R/Bioconductor packages working with alignments in BAM format including another package we will work with today, the GenomicAligments package.

First we can load the Rsamtools package.

library(Rsamtools)
10 / 46

Rsamtools - Sorting by coordinate

We saw in our last session we can sort files using the sortBam() function. This function returns the name of sorted BAM file.

By default the file is sorted by chromosome name and then by coordinates of reads within these chromosomes.

coordSorted <- sortBam("data/liver.bodyMap.bam",
"Sorted_liver")
coordSorted
## [1] "Sorted_liver.bam"
11 / 46

Rsamtools - Sorting by read name

Some external programs will require reads be sorted by read name, not coordinates. To sort by read name we can set the sortBam arguement byQname to TRUE.

readnameSorted <- sortBam("data/liver.bodyMap.bam",
"SortedByName_liver",
byQname=TRUE)
readnameSorted
## [1] "SortedByName_liver.bam"
12 / 46

Rsamtools - Optimizing memory

We can control how much memory we use with the maxMemory parameter. This allows to sort very large files on smaller memory computers (such as our laptops).

The maxMemory is specified as the maximum MB of RAM which our sortBam() function call can use. In sorting Rsamtools will produce mulitple smaller BAM files, the smaller the maxMemory value the greater then number of temporary files.

Here in this example, we sort our file in 1MB of memory and it will produce several temporary files.

coordSorted <- sortBam("data/liver.bodyMap.bam",
"Sorted_liver",
maxMemory=1)
coordSorted
## [1] "Sorted_liver.bam"
13 / 46

Rsamtools - Indexing

Once we have a coordinate sorted file we can index these files to allow for use in other programs such as IGV.

Importantly for us, an indexed BAM file allows us to access information from a BAM file by genomic location.

indexBam("Sorted_liver.bam")
## Sorted_liver.bam
## "Sorted_liver.bam.bai"
14 / 46

Rsamtools - Summary

We saw in our last session that we can use quickBamFlagSummary() function to get information on alignnment rates and other BAM flags.

quickBamFlagSummary("Sorted_liver.bam")
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 48401 | 33538 | 1.44 / 2
## o template has single segment.... S | 48401 | 33538 | 1.44 / 2
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 48294 | 33489 | 1.44 / 2
## - primary alignment......... S2 | 48294 | 33489 | 1.44 / 2
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 107 | 106 | 1.01 / 2
15 / 46

Rsamtools - Summary

Despite the name, the quickBamFlagSummary can take a noticeable amount of time when working with large files.

To get a very quick overview of number of mapped reads we can use the indexed BAM file and the idxstatsBam() function.

The idxstatsBam() function returns a data.frame containing chromosome name, chromosome lengths and number of mapped reads.

idxstatsBam("Sorted_liver.bam")
## seqnames seqlength mapped unmapped
## 1 chr12 133275309 48294 0
## 2 * 0 0 107
16 / 46

GenomicAlignments

Importing and handling of BAM files is handled largely in the GenomicAlignments package.

We first load the package.

BiocManager::install('GenomicAlignments')
library(GenomicAlignments)
17 / 46

GenomicAlignments - BAM header

We saw in our earlier sessions on file formats that BAM files contain header information.

This header information provides a list of the chromosomes used and their lengths as well as the current state of BAM sorting (unsorted, coordinate or by name).

Additional information on programs used in processing of BAM file may also be present in BAM header

18 / 46

GenomicAlignments - BAM header

We can retrieve the BAM header in R using the scanBamHeader() function and the name of the file we wish to access header information from.

Header information is returned as a list.

myHeader <- scanBamHeader("Sorted_liver.bam")
str(myHeader)
## List of 1
## $ Sorted_liver.bam:List of 2
## ..$ targets: Named int 133275309
## .. ..- attr(*, "names")= chr "chr12"
## ..$ text :List of 3
## .. ..$ @HD: chr [1:2] "VN:1.0" "SO:coordinate"
## .. ..$ @SQ: chr [1:2] "SN:chr12" "LN:133275309"
## .. ..$ @PG: chr [1:4] "ID:subread" "PN:subread" "VN:Rsubread 2.0.1" "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_cour"| __truncated__
19 / 46

GenomicAlignments - BAM header

We can access useful information such as chromosome names and lengths as with other lists using $ accessors. The first level of list is the file names. The second levels of list are

  • Chromosome names and lengths - named targets
  • Unparsed Text from lines of header - named text
names(myHeader)
## [1] "Sorted_liver.bam"
names(myHeader$Sorted_liver.bam)
## [1] "targets" "text"
20 / 46

GenomicAlignments - BAM header

The chromosome lengths are stored in a named vector under the targets list.

myHeader$Sorted_liver.bam$targets
## chr12
## 133275309
21 / 46

GenomicAlignments - BAM header

The text list contains information on sorting and programs.

myHeader$Sorted_liver.bam$text
## $`@HD`
## [1] "VN:1.0" "SO:coordinate"
##
## $`@SQ`
## [1] "SN:chr12" "LN:133275309"
##
## $`@PG`
## [1] "ID:subread"
## [2] "PN:subread"
## [3] "VN:Rsubread 2.0.1"
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
22 / 46

GenomicAlignments - BAM header

We can see the order by reviewing the HD element. Here we sorted by coordinate.

myHeader$Sorted_liver.bam$text["@HD"]
## $`@HD`
## [1] "VN:1.0" "SO:coordinate"
23 / 46

GenomicAlignments - BAM header

We can check the order for our name sorted BAM file too.

myHeader <- scanBamHeader("SortedByName_liver.bam")
myHeader$SortedByName_liver.bam$text["@HD"]
## $`@HD`
## [1] "VN:1.0" "SO:queryname"
24 / 46

GenomicAlignments - BAM header

We can see the program information by reviewing the PG element. Note that PG elements are not always complete and depend on tools used. Here we can see aligner, version used and the command line command itself.

myHeader <- scanBamHeader("Sorted_liver.bam")
myHeader$Sorted_liver.bam$text["@PG"]
## $`@PG`
## [1] "ID:subread"
## [2] "PN:subread"
## [3] "VN:Rsubread 2.0.1"
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
25 / 46

GenomicAlignments - BAM Alignments

Now we have an idea how our BAM was constructed, we want to start to retrieve some of the data from our BAM file.

We can use the readGAlignments() function to import the BAM data into R.

The returned object is a GAlignments object.

myReads <- readGAlignments("Sorted_liver.bam")
class(myReads)
## [1] "GAlignments"
## attr(,"package")
## [1] "GenomicAlignments"
26 / 46

GAlignment objects

The resulting GAlignments object contains much of the information we saw in the SAM file earlier on.

This include the chromosome, strand, start and end position of alignment.

myReads[1:2,]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
27 / 46

GAlignment objects

These objects are similar to GRanges objects, as we have the width of our ranges (end coordinate - start coordinate) in the GAlignments objects.

Additionally we have the qwidth which contains information on the width of the original read.

myReads[1:2,]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
28 / 46

GAlignment objects

The GAlignments object also contains information on the cigar strings within alignments and the number of junctions a read spans in the njunc column

Cigar strings denote the matches against reference.

75M - This is 75 matches in a row.

myReads[1:2,]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## [2] chr12 + 20S30M 50 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
29 / 46

GAlignment objects - Accesors

The GAlignments object inherits much of the functionality we have seen with GRanges object.

We can access information using the same accessors as we saw with GRanges objects.

seqnames(myReads)
## factor-Rle of length 48294 with 1 run
## Lengths: 48294
## Values : chr12
## Levels(1): chr12
start(myReads)[1:2]
## [1] 956642 7243955
30 / 46

GAlignment objects - Accesors

The GAlignments object also has some new accessors to access the cigar and njunc information using the cigar and njunc functions.

cigar(myReads)[1:2]
## [1] "31M19S" "20S30M"
njunc(myReads)[1:2]
## [1] 0 0
31 / 46

GAlignment objects - Indexing

We can also index and subset the same way as with GRanges objects.

Here we only keep reads on positive strand.

myReads[strand(myReads) == "+"]
## GAlignments object with 24520 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 20S30M 50 7243955 7243984 30
## [2] chr12 + 29M21S 50 7923596 7923624 29
## [3] chr12 + 29M21S 50 7923596 7923624 29
## [4] chr12 + 10S30M10S 50 24010815 24010844 30
## [5] chr12 + 10S30M10S 50 24010815 24010844 30
## ... ... ... ... ... ... ... ...
## [24516] chr12 + 23M27S 50 123412219 123412241 23
## [24517] chr12 + 23M27S 50 123412219 123412241 23
## [24518] chr12 + 23M27S 50 123412219 123412241 23
## [24519] chr12 + 23M27S 50 123412219 123412241 23
## [24520] chr12 + 29M21S 50 132800747 132800775 29
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [24516] 0
## [24517] 0
## [24518] 0
## [24519] 0
## [24520] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
32 / 46

GAlignment objects - Narrow

We can alter range positions using the GenomicRanges narrow() function.

Here we resize our reads to be the 5' 1st base pair at the beginning of every read. Note that narrow() function does not take notice of strand.

The cigar strings and njunc will automatically be altered as well.

my5primeReads <- narrow(myReads, start=1, width = 1)
my5primeReads[1:2]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 1M 1 956642 956642 1
## [2] chr12 + 1M 1 7243955 7243955 1
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
33 / 46

GAlignment objects - Narrow

We can control this ourselves with a little subsetting.

myReadsPos <- narrow(myReads[strand(myReads) == "+"],
start=1, width = 1)
myReadsNeg <- narrow(myReads[strand(myReads) == "-"],
end=-1, width = 1)
my5primeReads <- c(myReadsPos,myReadsNeg)
my5primeReads[1:2]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 1M 1 7243955 7243955 1
## [2] chr12 + 1M 1 7923596 7923596 1
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
34 / 46

GAlignments to GRanges

We can convert a GAlignments object to a GRanges to take advantage of other functions using the granges() function.

This is most useful when reads align continously to a genome (WGS, ChIP-seq, ATAC-seq).

myReadAsGRanges <- granges(myReads,use.mcols = TRUE)
myReadAsGRanges
## GRanges object with 48294 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 956642-956672 -
## [2] chr12 7243955-7243984 +
## [3] chr12 7923596-7923624 +
## [4] chr12 7923596-7923624 +
## [5] chr12 12556357-12556386 -
## ... ... ... ...
## [48290] chr12 132082663-132082688 -
## [48291] chr12 132082663-132082688 -
## [48292] chr12 132082663-132082688 -
## [48293] chr12 132800747-132800775 +
## [48294] chr12 132811683-132811711 -
## -------
## seqinfo: 1 sequence from an unspecified genome
35 / 46

GAlignments to GRanges

To convert RNA-seq reads which may span exons is a little more difficult. Since a read can potentially span multiple exons,a single read may need to be converted to multiple ranges.

To solve this we can use the grglist() function to return a GRangesList with a separate GRanges for each read.

myReadAsGRangesList <- grglist(myReads,use.mcols = TRUE)
myReadAsGRangesList[njunc(myReads) == 1]
## GRangesList object of length 9291:
## [[1]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593676-98593740 +
## [2] chr12 98593975-98593984 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593678-98593740 +
## [2] chr12 98593975-98593984 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## [[3]]
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr12 98593681-98593740 +
## [2] chr12 98593975-98593989 +
## -------
## seqinfo: 1 sequence from an unspecified genome
##
## ...
## <9288 more elements>
36 / 46

GRanges back to GAlignments

We can convert GRanges back to a GAlignments object using the function as(myGranges, "GAlignments").

myReadAsGRanges <- granges(myReads, use.mcols = TRUE)
myReadsAgain <- as(myReadAsGRanges, "GAlignments")
myReadsAgain[1:2]
## GAlignments object with 2 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M 31 956642 956672 31
## [2] chr12 + 30M 30 7243955 7243984 30
## njunc
## <integer>
## [1] 0
## [2] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
37 / 46

Why use GRanges?

This allows to perform some complex operations as GRanges and then convert back to a GAlignments object.

myReadAsGRanges <- granges(myReads, use.mcols = TRUE)
my5Prime <- resize(myReadAsGRanges, fix = "start", width = 1)
my5PrimeAsReads <- as(my5Prime, "GAlignments")
my5PrimeAsReads
## GAlignments object with 48294 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 1M 1 956672 956672 1
## [2] chr12 + 1M 1 7243955 7243955 1
## [3] chr12 + 1M 1 7923596 7923596 1
## [4] chr12 + 1M 1 7923596 7923596 1
## [5] chr12 - 1M 1 12556386 12556386 1
## ... ... ... ... ... ... ... ...
## [48290] chr12 - 1M 1 132082688 132082688 1
## [48291] chr12 - 1M 1 132082688 132082688 1
## [48292] chr12 - 1M 1 132082688 132082688 1
## [48293] chr12 + 1M 1 132800747 132800747 1
## [48294] chr12 - 1M 1 132811711 132811711 1
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [48290] 0
## [48291] 0
## [48292] 0
## [48293] 0
## [48294] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
38 / 46

GAlignments to a BAM

One very good reason to convert our GRanges objects back to a GAlignments object is so we can export our modified reads back to a BAM.

We can use the rtracklayer packages export() function to export our GAlignments file to a BAM file.

library(rtracklayer)
export(my5PrimeAsReads, con="myModifiedReads.bam")
39 / 46

Working with large BAM files

Handling large BAM file can mean we will often need to import only a subset of reads from a BAM file in one go. We can have a fine degree of control over what we import from a BAM file using the ScanBamParam() function.

Most importantly we can specify the what and which parameters to control what read information we import and which regions we import respectively.

  • what - Information we import from reads (sequences, read ids, flags).
  • which - Genomic locations we want to extract reads for.
40 / 46

Working with large BAM files

We can specify to import information from only specific regions by providing a GRanges of regions of interest to the which parameter in the ScanBamParam() function.

myRanges <- GRanges("chr12", IRanges(98591400,98608400))
myParam <- ScanBamParam(which=myRanges)
myParam
## class: ScanBamParam
## bamFlag (NA unless specified):
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 1 ranges
## bamWhat:
## bamMapqFilter: NA
41 / 46

Working with large BAM files

We can then use the newly created ScanBamParam object within our readGAlignments() function call.

Now we import only reads which overlap our specified GRanges.

filteredReads <- readGAlignments("Sorted_liver.bam", param = myParam)
filteredReads
## GAlignments object with 48147 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 + 50M 50 98593646 98593695 50
## [2] chr12 + 50M 50 98593646 98593695 50
## [3] chr12 + 50M 50 98593646 98593695 50
## [4] chr12 + 50M 50 98593647 98593696 50
## [5] chr12 + 50M 50 98593647 98593696 50
## ... ... ... ... ... ... ... ...
## [48143] chr12 - 50M 50 98606306 98606355 50
## [48144] chr12 - 50M 50 98606306 98606355 50
## [48145] chr12 - 50M 50 98606306 98606355 50
## [48146] chr12 - 50M 50 98606311 98606360 50
## [48147] chr12 - 50M 50 98606311 98606360 50
## njunc
## <integer>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [48143] 0
## [48144] 0
## [48145] 0
## [48146] 0
## [48147] 0
## -------
## seqinfo: 1 sequence from an unspecified genome
42 / 46

Working with large BAM files

We can also control the information we import using the what parameter in ScanBamParam() function.

Here we import the read name, sequence and qualities.

myParam <- ScanBamParam(what=c("qname", "seq", "qual"))
infoInReads <- readGAlignments("Sorted_liver.bam", param = myParam)
infoInReads[1]
## GAlignments object with 1 alignment and 3 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr12 - 31M19S 50 956642 956672 31
## njunc | qname seq
## <integer> | <character> <DNAStringSet>
## [1] 0 | HWI-BRUNOP16X_0001:8.. GCTCAAGCGA...CACAATTGNG
## qual
## <PhredQuality>
## [1] gg^eg]\\[Z[...SSTTTTSSBT
## -------
## seqinfo: 1 sequence from an unspecified genome
43 / 46

Working with large BAM files

We can access this additional information as we would in GRanges objects by using the mcols function.

mcols(infoInReads)
## DataFrame with 48294 rows and 3 columns
## qname seq qual
## <character> <DNAStringSet> <PhredQuality>
## 1 HWI-BRUNOP16X_0001:8.. GCTCAAGCGA...CACAATTGNG gg^eg]\\[Z[...SSTTTTSSBT
## 2 HWI-BRUNOP16X_0001:8.. AAAAATACAA...TAGTTCCAGC aeebeebfgg...gfgbgeeebe
## 3 HWI-BRUNOP16X_0001:8.. AGAGTGAAAC...ATGATTTTTG gggggggggg...gggggggggg
## 4 HWI-BRUNOP16X_0001:8.. AGAGTGAAAC...ATGATTTTTG gggggggggg...gggggggggg
## 5 HWI-BRUNOP16X_0001:8.. TTGGAGACCA...TACAAAAAAT gggggggggg...ffgggggggg
## ... ... ... ...
## 48290 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggggggggg...eggggggggg
## 48291 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggggf^ece...dfggggeggg
## 48292 HWI-BRUNOP16X_0001:8.. TTGGCCAGGC...GCCCACCTCA gggegggggg...gggggggggg
## 48293 HWI-BRUNOP16X_0001:8.. CCAGGAGTTG...CTCTCTCTAC gggggggggg...gggggggggg
## 48294 HWI-BRUNOP16X_0001:8.. GGCTGTTCTT...TCAGCCTCCC BBBBBBBBBB...TTTSKSQXTT
44 / 46

Working with large BAM files

Using the ScanBamParam() function in combination with a loop and information from our header file we can process our BAM files, one chromosome at a time.

bamHeader <- scanBamHeader("Sorted_liver.bam")
myChromosomes <- bamHeader$Sorted_liver.bam$targets
for(i in 1:length(myChromosomes)){
grangesForImport <- GRanges(names(myChromosomes)[i],
IRanges(1,myChromosomes)[i])
myParam <- ScanBamParam(which = grangesForImport)
myReads <- readGAlignments("Sorted_liver.bam",
param=myParam)
print(length(myReads))
}
## [1] 48294
45 / 46

Time for an exercise.

Link_to_exercises

Link_to_answers

46 / 46

Aligned Sequences


2 / 46
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow