AlignedDataInBioconductor.knit

class: center, middle, inverse, title-slide

# Aligned data In Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

class: inverse, center, middle

# Aligned Sequences

---

# Aligned Sequences

Aligned sequence reads are stored in BAM format.

---
# Aligned Sequences - BAM header

Information on the content and state of BAM file is stored in its header.

---
# Aligned Sequences - BAM reads

The body of the BAM file hold information on the original reads.

---
# Aligned Sequences - BAM reads

As well as on the positions reads map to in the genome.

---
# Aligned data and Bioconductor

Aligned data is handled in Bioconductor using the **GenomicAlignments** package.

**GenomicAlignments** package builds on tools in other packages we have already encountered such as the 
**Rsamtools**, **GenomicRanges**, **ShortRead**, **BSgenome** and **rtracklayer** packages.

---
# From unaligned to aligned data

In our previous session we saw how to align data to the genome using an aligner and that we can use a splice aware aligner to map RNAseq reads across splice junctions.

---
# The Data

In this session we will work with aligned data as BAM files.

I have provided BAM files we saw in our last session as our test data from today.

This can be found in **data/liver.bodyMap.bam**

---
# Rsamtools

We introduced the **Rsamtools** package in our last session to help us post process our newly aligned BAM files.

The Rsamtools package is the basis for many R/Bioconductor packages working with alignments in BAM format including another package we will work with today, the **GenomicAligments** package.

First we can load the **Rsamtools** package.

```r
library(Rsamtools)
```

---
# Rsamtools - Sorting by coordinate

We saw in our last session we can sort files using the **sortBam()** function.
This function returns the name of sorted BAM file.

By default the file is sorted by chromosome name and then by coordinates of reads within these chromosomes.

```r
coordSorted <- sortBam("data/liver.bodyMap.bam",
                       "Sorted_liver")
coordSorted
```

```
## [1] "Sorted_liver.bam"
```
---

# Rsamtools - Sorting by read name

Some external programs will require reads be sorted by read name, not coordinates. 
To sort by read name we can set the **sortBam** arguement **byQname** to **TRUE**.

```r
readnameSorted <- sortBam("data/liver.bodyMap.bam",
                          "SortedByName_liver",
                          byQname=TRUE)
readnameSorted
```

```
## [1] "SortedByName_liver.bam"
```

---

# Rsamtools - Optimizing memory

We can control how much memory we use with the **maxMemory** parameter. This allows to sort very large files on smaller memory computers (such as our laptops).

The maxMemory is specified as the maximum MB of RAM which our **sortBam()** function call can use. In sorting **Rsamtools** will produce mulitple smaller BAM files, the smaller the maxMemory value the greater then number of temporary files.

Here in this example, we sort our file in 1MB of memory and it will produce several temporary files.

```r
coordSorted <- sortBam("data/liver.bodyMap.bam",
                          "Sorted_liver",
                          maxMemory=1)
coordSorted
```

```
## [1] "Sorted_liver.bam"
```

---

# Rsamtools - Indexing

Once we have a coordinate sorted file we can index these files to allow for use in other programs such as IGV.

Importantly for us, an indexed BAM file allows us to access information from a BAM file by genomic location.

```r
indexBam("Sorted_liver.bam")
```

```
##       Sorted_liver.bam 
## "Sorted_liver.bam.bai"
```

---
# Rsamtools - Summary

We saw in our last session that we can use **quickBamFlagSummary()** function to get information on alignnment rates and other BAM flags.

```r
quickBamFlagSummary("Sorted_liver.bam")
```

```
##                                 group |    nb of |    nb of | mean / max
##                                    of |  records |   unique | records per
##                               records | in group |   QNAMEs | unique QNAME
## All records........................ A |    48401 |    33538 | 1.44 / 2
##   o template has single segment.... S |    48401 |    33538 | 1.44 / 2
##   o template has multiple segments. M |        0 |        0 |   NA / NA
##       - first segment.............. F |        0 |        0 |   NA / NA
##       - last segment............... L |        0 |        0 |   NA / NA
##       - other segment.............. O |        0 |        0 |   NA / NA
## 
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
## 
## Details for group S:
##   o record is mapped.............. S1 |    48294 |    33489 | 1.44 / 2
##       - primary alignment......... S2 |    48294 |    33489 | 1.44 / 2
##       - secondary alignment....... S3 |        0 |        0 |   NA / NA
##   o record is unmapped............ S4 |      107 |      106 | 1.01 / 2
```

---
# Rsamtools  - Summary

Despite the name, the **quickBamFlagSummary** can take a noticeable amount of time when working with large files.

To get a very quick overview of number of mapped reads we can use the indexed BAM file and the **idxstatsBam()** function.

The **idxstatsBam()** function returns a data.frame containing chromosome name, chromosome lengths and number of mapped reads.

```r
idxstatsBam("Sorted_liver.bam")
```

```
##   seqnames seqlength mapped unmapped
## 1    chr12 133275309  48294        0
## 2        *         0      0      107
```

---

# GenomicAlignments

Importing and handling of BAM files is handled largely in the GenomicAlignments package.

We first load the package.

```r
BiocManager::install('GenomicAlignments')
library(GenomicAlignments)
```

---
# GenomicAlignments - BAM header

We saw in our earlier sessions on file formats that BAM files contain header information.

This header information provides a list of the chromosomes used and their lengths as well as the current state of BAM sorting (unsorted, coordinate or by name).

Additional information on programs used in processing of BAM file may also be present in BAM header

---
# GenomicAlignments - BAM header

We can retrieve the BAM header in R using the **scanBamHeader()** function and the name of the file we wish to access header information from.

Header information is returned as a list.

```r
myHeader <- scanBamHeader("Sorted_liver.bam")
str(myHeader)
```

```
## List of 1
##  $ Sorted_liver.bam:List of 2
##   ..$ targets: Named int 133275309
##   .. ..- attr(*, "names")= chr "chr12"
##   ..$ text   :List of 3
##   .. ..$ @HD: chr [1:2] "VN:1.0" "SO:coordinate"
##   .. ..$ @SQ: chr [1:2] "SN:chr12" "LN:133275309"
##   .. ..$ @PG: chr [1:4] "ID:subread" "PN:subread" "VN:Rsubread 2.0.1" "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_cour"| __truncated__
```

---
# GenomicAlignments - BAM header

We can access useful information such as chromosome names and lengths as with other lists using **$** accessors.
The first level of list is the file names.
The second levels of list are

* Chromosome names and lengths - named **targets**
* Unparsed Text from lines of header - named **text**

```r
names(myHeader)
```

```
## [1] "Sorted_liver.bam"
```

```r
names(myHeader$Sorted_liver.bam)
```

```
## [1] "targets" "text"
```

---

# GenomicAlignments - BAM header

The chromosome lengths are stored in a named vector under the **targets list**.

```r
myHeader$Sorted_liver.bam$targets
```

```
##     chr12 
## 133275309
```
---

# GenomicAlignments - BAM header

The **text list** contains information on sorting and programs.

```r
myHeader$Sorted_liver.bam$text
```

```
## $`@HD`
## [1] "VN:1.0"        "SO:coordinate"
## 
## $`@SQ`
## [1] "SN:chr12"     "LN:133275309"
## 
## $`@PG`
## [1] "ID:subread"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "PN:subread"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "VN:Rsubread 2.0.1"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
```

---
# GenomicAlignments - BAM header

We can see the order by reviewing the **HD** element. Here we sorted by coordinate.

```r
myHeader$Sorted_liver.bam$text["@HD"]
```

```
## $`@HD`
## [1] "VN:1.0"        "SO:coordinate"
```

---
# GenomicAlignments - BAM header

We can check the order for our name sorted BAM file too.

```r
myHeader <- scanBamHeader("SortedByName_liver.bam")
myHeader$SortedByName_liver.bam$text["@HD"]
```

```
## $`@HD`
## [1] "VN:1.0"       "SO:queryname"
```

---

# GenomicAlignments - BAM header

We can see the program information by reviewing the **PG** element.
Note that **PG** elements are not always complete and depend on tools used. Here we can see aligner, version used and the command line command itself.

```r
myHeader <- scanBamHeader("Sorted_liver.bam")
myHeader$Sorted_liver.bam$text["@PG"]
```

```
## $`@PG`
## [1] "ID:subread"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [2] "PN:subread"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
## [3] "VN:Rsubread 2.0.1"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      
## [4] "CL:\"subjunc\" \"-r\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.fq\" \"-o\" \"/Users/mattpaul/Documents/Box Sync/RU/Teaching/RU_side/Bioconductor_Introduction/r_course/Data/liver.bodyMap.bam\" \"-i\" \"chr12\" \"-n\" \"14\" \"-m\" \"1\" \"-p\" \"1\" \"-M\" \"3\" \"-T\" \"1\" \"-I\" \"5\" \"--multiMapping\" \"-B\" \"1\" \"-d\" \"50\" \"-D\" \"600\" \"-S\" \"fr\" \"--trim5\" \"0\" \"--trim3\" \"0\" \"-G\" \"-1\" \"-E\" \"0\" \"-X\" \"0\" \"-Y\" \"2\" \"-P\" \"3\" "
```

---

# GenomicAlignments - BAM Alignments

Now we have an idea how our BAM was constructed, we want to start to retrieve some of the data from our BAM file.

We can use the **readGAlignments()** function to import the BAM data into R.

The returned object is a **GAlignments** object.

```r
myReads <- readGAlignments("Sorted_liver.bam")
class(myReads)
```

```
## [1] "GAlignments"
## attr(,"package")
## [1] "GenomicAlignments"
```

---
# GAlignment objects

The resulting **GAlignments** object contains much of the information we saw in the SAM file earlier on.

This include the chromosome, strand, start and end position of alignment.

```r
myReads[1:2,]
```

```
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr12      -      31M19S        50    956642    956672        31
##   [2]    chr12      +      20S30M        50   7243955   7243984        30
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignment objects

These objects are similar to **GRanges** objects, as we have the **width** of our ranges (end coordinate - start coordinate) in the **GAlignments** objects.

Additionally we have the **qwidth** which contains information on the width of the original read.

```r
myReads[1:2,]
```

---
# GAlignment objects

The **GAlignments** object also contains information on the **cigar** strings within alignments and the number of junctions a read spans in the **njunc** column

Cigar strings denote the matches against reference.

75M - This is 75 matches in a row.

```r
myReads[1:2,]
```

---
# GAlignment objects - Accesors

The **GAlignments** object inherits much of the functionality we have seen with **GRanges** object.

We can access information using the same accessors as we saw with GRanges objects.

```r
seqnames(myReads)
```

```
## factor-Rle of length 48294 with 1 run
##   Lengths: 48294
##   Values : chr12
## Levels(1): chr12
```

```r
start(myReads)[1:2]
```

```
## [1]  956642 7243955
```

---
# GAlignment objects - Accesors

The **GAlignments** object also has some new accessors to access the cigar and njunc information using the **cigar** and **njunc** functions.

```r
cigar(myReads)[1:2]
```

```
## [1] "31M19S" "20S30M"
```

```r
njunc(myReads)[1:2]
```

```
## [1] 0 0
```

---
# GAlignment objects - Indexing

We can also index and subset the same way as with **GRanges** objects.

Here we only keep reads on positive strand.

```r
myReads[strand(myReads) == "+"]
```

```
## GAlignments object with 24520 alignments and 0 metadata columns:
##           seqnames strand       cigar    qwidth     start       end     width
##              <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##       [1]    chr12      +      20S30M        50   7243955   7243984        30
##       [2]    chr12      +      29M21S        50   7923596   7923624        29
##       [3]    chr12      +      29M21S        50   7923596   7923624        29
##       [4]    chr12      +   10S30M10S        50  24010815  24010844        30
##       [5]    chr12      +   10S30M10S        50  24010815  24010844        30
##       ...      ...    ...         ...       ...       ...       ...       ...
##   [24516]    chr12      +      23M27S        50 123412219 123412241        23
##   [24517]    chr12      +      23M27S        50 123412219 123412241        23
##   [24518]    chr12      +      23M27S        50 123412219 123412241        23
##   [24519]    chr12      +      23M27S        50 123412219 123412241        23
##   [24520]    chr12      +      29M21S        50 132800747 132800775        29
##               njunc
##           <integer>
##       [1]         0
##       [2]         0
##       [3]         0
##       [4]         0
##       [5]         0
##       ...       ...
##   [24516]         0
##   [24517]         0
##   [24518]         0
##   [24519]         0
##   [24520]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignment objects - Narrow

We can alter range positions using the GenomicRanges **narrow()** function.

Here we resize our reads to be the 5' 1st base pair at the beginning of every read. Note that **narrow()** function does not take notice of strand.

The cigar strings and njunc will automatically be altered as well.

```r
my5primeReads <- narrow(myReads, start=1, width = 1)
my5primeReads[1:2]
```

```
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr12      -          1M         1    956642    956642         1
##   [2]    chr12      +          1M         1   7243955   7243955         1
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignment objects - Narrow

We can control this ourselves with a little subsetting.

```r
myReadsPos <- narrow(myReads[strand(myReads) == "+"],
                     start=1, width = 1)
myReadsNeg <- narrow(myReads[strand(myReads) == "-"],
                     end=-1, width = 1)

my5primeReads <- c(myReadsPos,myReadsNeg)
my5primeReads[1:2]
```

```
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr12      +          1M         1   7243955   7243955         1
##   [2]    chr12      +          1M         1   7923596   7923596         1
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignments to GRanges

We can convert a GAlignments object to a GRanges to take advantage of other functions using the **granges()** function.

This is most useful when reads align continously to a genome (WGS, ChIP-seq, ATAC-seq).

```r
myReadAsGRanges <- granges(myReads,use.mcols = TRUE)
myReadAsGRanges
```

```
## GRanges object with 48294 ranges and 0 metadata columns:
##           seqnames              ranges strand
##              <Rle>           <IRanges>  <Rle>
##       [1]    chr12       956642-956672      -
##       [2]    chr12     7243955-7243984      +
##       [3]    chr12     7923596-7923624      +
##       [4]    chr12     7923596-7923624      +
##       [5]    chr12   12556357-12556386      -
##       ...      ...                 ...    ...
##   [48290]    chr12 132082663-132082688      -
##   [48291]    chr12 132082663-132082688      -
##   [48292]    chr12 132082663-132082688      -
##   [48293]    chr12 132800747-132800775      +
##   [48294]    chr12 132811683-132811711      -
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignments to GRanges

To convert RNA-seq reads which may span exons is a little more difficult. Since a read can potentially span multiple exons,a single read may need to be converted to multiple ranges.

To solve this we can use the **grglist()** function to return a GRangesList with a separate GRanges for each read.

```r
myReadAsGRangesList <- grglist(myReads,use.mcols = TRUE)
myReadAsGRangesList[njunc(myReads) == 1]
```

```
## GRangesList object of length 9291:
## [[1]]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr12 98593676-98593740      +
##   [2]    chr12 98593975-98593984      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome
## 
## [[2]]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr12 98593678-98593740      +
##   [2]    chr12 98593975-98593984      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome
## 
## [[3]]
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames            ranges strand
##          <Rle>         <IRanges>  <Rle>
##   [1]    chr12 98593681-98593740      +
##   [2]    chr12 98593975-98593989      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome
## 
## ...
## <9288 more elements>
```

---

# GRanges back to GAlignments

We can convert **GRanges** back to a **GAlignments** object using the function as(*myGranges*, "GAlignments").

```r
myReadAsGRanges <- granges(myReads, use.mcols = TRUE)
myReadsAgain <- as(myReadAsGRanges, "GAlignments")
myReadsAgain[1:2]
```

```
## GAlignments object with 2 alignments and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr12      -         31M        31    956642    956672        31
##   [2]    chr12      +         30M        30   7243955   7243984        30
##           njunc
##       <integer>
##   [1]         0
##   [2]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# Why use GRanges?

This allows to perform some complex operations as **GRanges** and then convert back to a **GAlignments** object.

```r
myReadAsGRanges <- granges(myReads, use.mcols = TRUE)
my5Prime <- resize(myReadAsGRanges, fix = "start", width = 1)
my5PrimeAsReads <- as(my5Prime, "GAlignments")
my5PrimeAsReads
```

```
## GAlignments object with 48294 alignments and 0 metadata columns:
##           seqnames strand       cigar    qwidth     start       end     width
##              <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##       [1]    chr12      -          1M         1    956672    956672         1
##       [2]    chr12      +          1M         1   7243955   7243955         1
##       [3]    chr12      +          1M         1   7923596   7923596         1
##       [4]    chr12      +          1M         1   7923596   7923596         1
##       [5]    chr12      -          1M         1  12556386  12556386         1
##       ...      ...    ...         ...       ...       ...       ...       ...
##   [48290]    chr12      -          1M         1 132082688 132082688         1
##   [48291]    chr12      -          1M         1 132082688 132082688         1
##   [48292]    chr12      -          1M         1 132082688 132082688         1
##   [48293]    chr12      +          1M         1 132800747 132800747         1
##   [48294]    chr12      -          1M         1 132811711 132811711         1
##               njunc
##           <integer>
##       [1]         0
##       [2]         0
##       [3]         0
##       [4]         0
##       [5]         0
##       ...       ...
##   [48290]         0
##   [48291]         0
##   [48292]         0
##   [48293]         0
##   [48294]         0
##   -------
##   seqinfo: 1 sequence from an unspecified genome
```

---
# GAlignments to a BAM

One very good reason to convert our **GRanges** objects back to a **GAlignments** object is so we can export our modified reads back to a BAM.

We can use the **rtracklayer** packages **export()** function to export our **GAlignments** file to a BAM file.

```r
library(rtracklayer)
export(my5PrimeAsReads, con="myModifiedReads.bam")
```