GenomicScores_In_Bioconductor.knit

class: center, middle, inverse, title-slide

# Genomic Scores In Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

# Genomic Scores

As we have seen earlier, genomic scores are often stored in wiggle format.

---
# Genomic Scores

A perhaps more common human readable format is bedGraph.

---
# Genomic Scores

But in many situations we would want a highly compressed format such as bigWig.

We used a bigWig from Encode in our last session.

---
# Genomic Scores

Genomic Scores are heavily used in Genomics and High throughput sequencing as they offer a simple mechanism to review a defined metric over the linear genome at a specified resolution.

- RNA-seq, ChIP-seq, ATAC-seq signals (as well many other *seq* types).
- Phylogenetic conservation.

---
# Our Genomic Scores data.

In our last session [Genomic Intervals In Bioconductor](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicIntervals_In_Bioconductor.html#1) we reviewed some of the Myc ChIP-seq signal available to us on encode.

This was the data from Experiment [ENCSR000ERN](https://www.encodeproject.org/experiments/ENCSR000ERN/), containing information on Myc ChIP-seq in mouse genome version mm10 CH12 cell line.

If you have not already downloaded the bigWig file then download it from [this link](https://www.encodeproject.org/files/ENCFF940MBK/@@download/ENCFF940MBK.bigWig) for our exercise later

---
# Our Genomic Scores data.

From our last session we identified Myc peaks within the Igfbp2 locus and in IGV compared Myc ChIP-seq signal from Encode over our peaks.

---
# Our Genomic Scores data

For this course I have provided bedGraph and bigWig files from this data for the window with the IGV image.

We will demonstrate how to create this later in today's session but for now data is available in

- **data/TSS_ENCFF940MBK.bedGraph** - bedGraph of region.
- **data/TSS_ENCFF940MBK.bw** - bigWig of region.

---
# Genomic Scores in Bioconductor.

Two popular Bioconductor packages for dealing with Genomics Scores are:

- [**rtracklayer**](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) -- Importing/exporting genomic intervals into/out of R.
- [**GenomicRanges**](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) -- Handling genomic intervals in R.

---

# Genomic Scores in Bioconductor.

Now we have the package installed, we can load the library **rtracklayer** which we will use to import and export from/to bedGraph and bigWig.

```r
library(rtracklayer)
```
---

# Genomic Scores in Bioconductor.

We will also be making use of the functions in the **GenomicRanges** package. We dont need to load **GenomicRanges** directly here because the **rtracklayer** does this for us.

Package dependencies and imports allow one package to make use of functions from another.

---

# Reading in a bedGraph.

The rtracklayer package provides functions to import genomic scores from a bedGraph using the **import.bedGraph()** function.

```r
myBedG <- import.bedGraph("data/TSS_ENCFF940MBK.bedGraph")
```
---
# Reading in a bedGraph.

The genomic scores by default are stored as a familiar [**GRanges object**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicIntervals_In_Bioconductor.html#9) containing the original 4 columns of information, **contig**, **start**, **end** and **score**.

```r
myBedG[1:3]
```

```
## GRanges object with 3 ranges and 1 metadata column:
##       seqnames            ranges strand |     score
##          <Rle>         <IRanges>  <Rle> | <numeric>
##   [1]     chr1        1-72811054      * |   0.00000
##   [2]     chr1 72811055-72811119      * |   0.69040
##   [3]     chr1 72811120-72811145      * |   0.57802
##   -------
##   seqinfo: 54 sequences from an unspecified genome; no seqlengths
```
---
# Reading in a bedGraph.

Because we only have 4 columns in a bedGraph and no strand information, the **GRanges** intervals are unstranded with * in their strand column

```r
strand(myBedG)
```

```
## factor-Rle of length 2161 with 1 run
##   Lengths: 2161
##   Values :    *
## Levels(3): + - *
```

---
# Reading in a bigWig

Much of the genomic scores we will be working with are infact stored in the compressed bigWig format.

We can also import bigWigs into R using the **import.bw** function.

```r
myBigWig <- import.bw("data/TSS_ENCFF940MBK.bw")
```
---
# Reading in a bigWig

The import bigWig's genomic scores are again imported as a **GRanges** object containing the same information as the imported bedGraph.

```r
myBigWig[1:3]
```

---
# GenomicScores as a GRanges

So far we have retrieved our genomic scores from bedGraphs and bigWigs as **GRanges** objects.

This allows us to use **GRanges** accessors and functions we have already [seen in our last session.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicIntervals_In_Bioconductor.html#32)

```r
myGRanges <- GRanges("chr1",IRanges(72823698,72824485))
filteredBigWig <- myBigWig[myBigWig %over% myGRanges]
filteredBigWig[1:3]
```

```
## GRanges object with 3 ranges and 1 metadata column:
##       seqnames            ranges strand |     score
##          <Rle>         <IRanges>  <Rle> | <numeric>
##   [1]     chr1 72823686-72823703      * |   1.34249
##   [2]     chr1 72823704-72823710      * |   1.80605
##   [3]     chr1 72823711-72823715      * |   1.68676
##   -------
##   seqinfo: 54 sequences from an unspecified genome
```

---
# GenomicScores as a RLE

We can however specify the type of objects we would like to return from the import.bedGraph and import.bw functions.

Here we will import the bigWig as a object we have briefly seen, the **Rle** object (run length encoding). Here we have an **Rlelist** (a list of **Rle** objects)

```r
myBigWig <- import.bw("data/TSS_ENCFF940MBK.bw",
                      as = "RleList")
class(myBigWig)
```

```
## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
```
---

# Rle in genomics

Run length encoding allows for a very efficient storage of long stretchs of repeated values.

We have already seen an rle in our cigar string from SAM files.

* 100M - 100 matches to reference for alignment
* 28M1D72M - 28 matches, 1 deletion and 72 matches for aligment

![](imgs/cigar.png)

---
# Rle in genomics

We have also seen **Rle** objects [within the GRanges objects we reviewed last session.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicIntervals_In_Bioconductor.html#18)

Chromosome/contig names in **GRanges** objects are stored as **Rle** objects for instance.

```r
mycPeaks <- import.bed("data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
seqnames(mycPeaks)
```

```
## factor-Rle of length 13910 with 21 runs
##   Lengths:   891   737  1537   498   595 ...   796   646   714   227    10
##   Values : chr1  chr10 chr11 chr12 chr13 ... chr7  chr8  chr9  chrX  chrY 
## Levels(21): chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrX chrY
```

---
# Rle in genomics
For genomic scores we will be storing long stretchs of numbers as an **Rle**.

To store genomic scores across chromosomes/contig we will use an **RleList**.

Creating an **Rle** is straightforward. We can simply supply a numeric vector of numbers we wish to compress to the **Rle()** function.

```r
myNumbers <- c(0,0,0,0,0,1,1,1,0,0,0,0,0)
Rle(myNumbers)
```

```
## numeric-Rle of length 13 with 3 runs
##   Lengths: 5 3 5
##   Values : 0 1 0
```

---
# Rle in genomics

Now can construct a named **RleList** containing the **Rle** objects using the **RleList()** function.

```r
myNumbers2 <- c(0,0,0,0,0,1,1,1,2,2,2,2,2)
chr1Scores <- Rle(myNumbers)
chr2Scores <- Rle(myNumbers2)
myRleList <- RleList(chr1=chr1Scores,chr2=chr2Scores)
myRleList
```

```
## RleList of length 2
## $chr1
## numeric-Rle of length 13 with 3 runs
##   Lengths: 5 3 5
##   Values : 0 1 0
## 
## $chr2
## numeric-Rle of length 13 with 3 runs
##   Lengths: 5 3 5
##   Values : 0 1 2
```

---
# Rle in genomics

**Rle** allows for us to compress space needed to store genomic scores and so the Bioconductor **Rle** and **RleList** objects allows us make use of this in R.

We can see the compression of space in action by reviewing our imported genomic scores as an **RleList** objects.

```r
myBigWig[1:2]
```

```
## RleList of length 2
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
##   Lengths:  72811054        65        26 ...        10        15 122614997
##   Values :   0.00000   0.69040   0.57802 ...   0.21374   0.20965   0.00000
## 
## $chr10
## numeric-Rle of length 130694993 with 1 run
##   Lengths: 130694993
##   Values :         0
```

---
# Indexing an RLElist

To access elements of a **RleList** we can use our regular accessors for lists **$** and **[[]]**.

Here we retrieve the **Rle** object named *chr1* containing all the genomic scores information for chromosome 1.

```r
chr1_rle <- myBigWig$chr1
# Or
chr1_rle <- myBigWig[["chr1"]]
chr1_rle
```

```
## numeric-Rle of length 195471971 with 2108 runs
##   Lengths:  72811054        65        26 ...        10        15 122614997
##   Values :   0.00000   0.69040   0.57802 ...   0.21374   0.20965   0.00000
```

---
# Indexing an RLE

Now we have an **Rle** object of our genomic scores on chromosome 1 we can index it just like we have for standard vectors.
Here i retrieve values for all basepairs between 1 and 10

```r
chr1_rle[1:10]
```

```
## numeric-Rle of length 10 with 1 run
##   Lengths: 10
##   Values :  0
```

---
# Replacement in an RLE.

We can also replace values in a **Rle**, just as we would with a vector.

Here I replace values for all basepairs between 1 and 10 to 100

```r
chr1_rle[1:10] <- 100
chr1_rle
```

```
## numeric-Rle of length 195471971 with 2109 runs
##   Lengths:        10  72811044        65 ...        10        15 122614997
##   Values : 100.00000   0.00000   0.69040 ...   0.21374   0.20965   0.00000
```

---
# Converting to other data types

We can convert **Rle's** to standard vectors.

```r
rleAsVector <- as.vector(chr1_rle[1:10])
rleAsVector
```

```
##  [1] 100 100 100 100 100 100 100 100 100 100
```

---
# Converting to other data types

Or to a data.frame

```r
rleAsDF <- as.data.frame(chr1_rle[1:10])
rleAsDF
```

```
##    value
## 1    100
## 2    100
## 3    100
## 4    100
## 5    100
## 6    100
## 7    100
## 8    100
## 9    100
## 10   100
```

---
# Working with Rle

Many vector operations and function work with **Rle** objects.

These include

* Arithmetic and Mathematical operations.
* Logical operations and comparisons/
* Summary statistics.

---
# Operations on RLE

We can use many simple arithmetic operations such as **+**, **-**, **/** and *****

```r
chr1_rle+1000
```

```
## numeric-Rle of length 195471971 with 2109 runs
##   Lengths:        10  72811044        65 ...        10        15 122614997
##   Values :   1100.00   1000.00   1000.69 ...   1000.21   1000.21   1000.00
```

```r
(chr1_rle+1000)*10
```

```
## numeric-Rle of length 195471971 with 2109 runs
##   Lengths:        10  72811044        65 ...        10        15 122614997
##   Values :   11000.0   10000.0   10006.9 ...   10002.1   10002.1   10000.0
```
---
# Operations on RLE

Logical operations can be used with **Rle** objects just as with vectors.

For **Rle** objects, logical opertations return a logical Rle instead of standard logical vector.

```r
chr1_rle < 10
```

```
## logical-Rle of length 195471971 with 20 runs
##   Lengths:        10  72823081        55 ...        65         2 122627007
##   Values :     FALSE      TRUE     FALSE ...      TRUE     FALSE      TRUE
```
---
# Operations on RLE

We use this **logical Rle** to replace values less than 10 with 0 as we would use a [logical vector with standard vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/slides/introToR_Session1.html#/32).

```r
chr1_rle[chr1_rle < 10] <- 0
chr1_rle
```

```
## numeric-Rle of length 195471971 with 122 runs
##   Lengths:        10  72823081        14 ...        65         2 122627007
##   Values :  100.0000    0.0000   10.0802 ...    0.0000   10.2513    0.0000
```

---
# Operations on RLE

Many functions providing summary statisitics are also available to us including **mean()**, **max()** and **sum()** functions.

```r
mean(chr1_rle)
```

```
## [1] 4.181372e-05
```

```r
max(chr1_rle)
```

```
## [1] 100
```

```r
sum(chr1_rle)
```

```
## [1] 8173.411
```

---
# Operations on RLELists

Very usefully, We can also apply arithmetic and mathematical operations to whole RleLists as imported from the bigWig file.

```r
myBigWig <- import.bw("data/TSS_ENCFF940MBK.bw",as="RleList")
myBigWig+10
```

```
## RleList of length 54
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
##   Lengths:  72811054        65        26 ...        10        15 122614997
##   Values :   10.0000   10.6904   10.5780 ...   10.2137   10.2096   10.0000
## 
## $chr10
## numeric-Rle of length 130694993 with 1 run
##   Lengths: 130694993
##   Values :        10
## 
## $chr11
## numeric-Rle of length 122082543 with 1 run
##   Lengths: 122082543
##   Values :        10
## 
## $chr12
## numeric-Rle of length 120129022 with 1 run
##   Lengths: 120129022
##   Values :        10
## 
## $chr13
## numeric-Rle of length 120421639 with 1 run
##   Lengths: 120421639
##   Values :        10
## 
## ...
## <49 more elements>
```
---
# Summaries on RLELists

Summary functions return a summary for every Rle object (here for chromosomes) within the RleList.

```r
chromosomeMax <- max(myBigWig)
chromosomeMax[1:4]
```

```
##     chr1    chr10    chr11    chr12 
## 23.09015  0.00000  0.00000  0.00000
```

---
# Subsetting RleLists with a GRanges.

We can in fact use **GRanges** objects to index our **RleList** objects.The **GRanges** provides the intervals from which genomic scores are retrieved. The resulting **RleList** object contains an entry with the scores for each interval in the **GRanges** object.

To demonstrate first we can retrieve the Myc Peaks calls which overlap the region we are reviewing.

```r
myRanges <- GRanges("chr1",ranges = IRanges(72811055,72856974))
mycPeaks <- import.bed("data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
mycPeaks <- resize(mycPeaks,50,fix="center")
newMycPeaks <- mycPeaks[mycPeaks %over% myRanges]
newMycPeaks
```

```
## GRanges object with 2 ranges and 2 metadata columns:
##       seqnames            ranges strand |                   name     score
##          <Rle>         <IRanges>  <Rle> |            <character> <numeric>
##   [1]     chr1 72824021-72824070      * | Myc_Ch12_1_withInput..  18.55062
##   [2]     chr1 72844876-72844925      * | Myc_Ch12_1_withInput..   9.27569
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths
```

---
# RleLists and GRanges.

Now we can simply index our RleList object by providing our **GRanges** as an index in **[]** brackets.

```r
rleOverGranges <- myBigWig[newMycPeaks]
rleOverGranges
```

```
## RleList of length 2
## $chr1
## numeric-Rle of length 50 with 14 runs
##   Lengths:       6       3       2       3 ...       2       1       1       1
##   Values : 16.1810 18.4153 20.7201 19.5592 ... 20.7201 21.8973 23.0902 21.8973
## 
## $chr1
## numeric-Rle of length 50 with 9 runs
##   Lengths:        1       10        6        3 ...        8        4        2
##   Values : 11.25152  9.79599 10.51570  9.79599 ...  9.09299  8.40735  7.09102
```
---
# RleLists and GRanges.

With the **RleList** containing our scores over the Myc peaks we can now gather summary statistics as with all RleList objects

```r
sum(rleOverGranges)
```

```
##      chr1      chr1 
## 1009.0643  480.4242
```

---
# Exporting an RLElist

We may wish to export our **RleList** or **Rle** objects back to a bigWig file.

We would need to convert any **Rle** objects to a **RleList** first in order to provide some chromosome/contig names. We can do this with the **RleList()** function and providing a chromosome/contig name to our **Rle** object.

```r
myRleList <- RleList(chr1=chr1_rle)
myRleList
```

```
## RleList of length 1
## $chr1
## numeric-Rle of length 195471971 with 122 runs
##   Lengths:        10  72823081        14 ...        65         2 122627007
##   Values :  100.0000    0.0000   10.0802 ...    0.0000   10.2513    0.0000
```

---

# Exporting an RLElist

Now we have our **RleList** object we export this to a bigWig using the **export.bw()** function.

```r
export.bw(myRleList,con="chr1_Myc.bw")
```
---
# Importing large files

In some cases we will only wish to import portions of a BigWig file for use in our analysis. This will save us memory and time when loading in big files.

To do this we can take advantage of the **selection** parameter in the **import.bw** function and the **BigWigSelection()** function.

---
# Importing large files

When importing only a portion of a bigWig we simply need to specify a GRanges of regions we wish to retrieve to the **BigWigSelection()** function.

Here we will use the Myc Peaks in our window as the GRanges of regions for selection.

```r
newMycPeaks
```

Now we pass the Myc peak **GRanges** object to the **BigWigSelection()** function.

We then supply the resulting **BigWigSelection** selection object to the **selection** parameter of **import.bw()** function.

```r
mySelection <- BigWigSelection(newMycPeaks)
import.bw("data/TSS_ENCFF940MBK.bw", 
          selection=mySelection, 
          as="RleList")
```

```
## RleList of length 54
## $chr1
## numeric-Rle of length 195471971 with 26 runs
##   Lengths:  72824020         6         3 ...         4         2 122627046
##   Values :   0.00000  16.18099  18.41527 ...   8.40735   7.09102   0.00000
## 
## $chr10
## numeric-Rle of length 130694993 with 1 run
##   Lengths: 130694993
##   Values :         0
## 
## $chr11
## numeric-Rle of length 122082543 with 1 run
##   Lengths: 122082543
##   Values :         0
## 
## $chr12
## numeric-Rle of length 120129022 with 1 run
##   Lengths: 120129022
##   Values :         0
## 
## $chr13
## numeric-Rle of length 120421639 with 1 run
##   Lengths: 120421639
##   Values :         0
## 
## ...
## <49 more elements>
```

---
# Time for an exercise.

[Link_to_exercises](../../exercises/exercises/GS_exercise.html)

[Link_to_answers](../../exercises/answers/GS_answers.html)