GenomicIntervals_In_Bioconductor.knit

class: center, middle, inverse, title-slide

# Genomic Intervals in Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

## Genomic Intervals

As we have seen earlier, genomic intervals are often stored in BED format.

---
## Genomic Intervals examples

In Genomics and High Throughput sequencing we make extensive use of Genomic Intervals to store simple annotation and summarised results over regions.

- Transcriptional Start Sites / Enhancers
- Enrichment for ChIP-seq signal i.e. Peak calls.

---
## Genomic Intervals manipulation

We also want to be able to assess spatial relation between such genomic intervals within our linear genome.

- Transcriptional Start Sites and their nearest enhancers.
- Enrichment for ChIP-seq signal within promoters and outside of promoters.

---
class: inverse, center, middle

# Genomic Intervals in Bioconductor

---
## Genomic Intervals in Bioconductor

Two popular Bioconductor packages for dealing with Genomics Intervals are:

- [**rtracklayer**](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) - Importing/exporting genomic intervals into/out of R.
- [**GenomicRanges**](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) - Handling genomic intervals in R.

---
## Genomic Intervals in Bioconductor

The first package we will look at is the GenomicRanges package.

Remember we can install Bioconductor packages (fairly) easily using Bioconductor package [commands](https://www.bioconductor.org/install/).

```r
BiocManager::install(version = "3.10")
BiocManager::install("GenomicRanges")
```

---
# Genomic Intervals in Bioconductor

Now we have the package installed, we can load the library.

```r
library(GenomicRanges)
```

---

class: inverse, center, middle

# Creating Genomic Ranges.

---

# Building a GenomicRanges

To construct a GenomicRanges object we require a set of genomic intervals.

Intervals (genomic and non genomic intervals) are created using the **IRanges()** function. We must specify **start** and **end** positions as numeric vectors to create our IRanges object.

```r
myIntervals <- IRanges(start=c(1,10,20),end=c(2,11,30))
class(myIntervals)
```

```
## [1] "IRanges"
## attr(,"package")
## [1] "IRanges"
```

---
# Building a GenomicRanges

The resulting IRanges object contains our intervals and an additional column describing widths of intervals.

```r
myIntervals
```

```
## IRanges object with 3 ranges and 0 metadata columns:
##           start       end     width
##       <integer> <integer> <integer>
##   [1]         1         2         2
##   [2]        10        11         2
##   [3]        20        30        11
```

---

# Building a GenomicRanges

Now to create a Genomic Ranges interval object or **GRanges** object. We extend our IRanges object to also include chrosomosome/contig information.

To do this we can supply our new IRanges object and a vector of corresponding chromosome locations to the **seqnames** parameter in **GRanges()** function.

```r
myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"),
                              myIntervals)
class(myGenomicIntervals)
```

```
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
```

---
# Building a GenomicRanges

The created **GRanges** object contains our original intervals from the IRanges object and now the chromosomes/contigs on which the intervals exist.

```r
myGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      *
##   [2]     chr1     10-11      *
##   [3]     chr2     20-30      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Indexing a GenomicRanges

We can retrieve genomic intervals from our GRanges using indexing as with [vectors, matrices and dataframes.](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/singlepage/introToR_Session1.html#vectors_-_indexing)

```r
myGenomicIntervals[c(2,3),]
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     10-11      *
##   [2]     chr2     20-30      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---

# Indexing a GenomicRanges

Unlike data frames and matrices, we dont need the second **","** to specify rows not columns.

```r
myGenomicIntervals[c(2,3)]
```

---
# Accessing columns in GenomicRanges

To access the columns of a GRanges object we must use special accessor functions - **seqnames()**,**start()**, **end()** which correspond to the columns we wish to retrieve.

We get start and stop positions using the  **start()** and **end()** functions.

```r
start(myGenomicIntervals)
```

```
## [1]  1 10 20
```

```r
end(myGenomicIntervals)
```

```
## [1]  2 11 30
```

---
# Replacing values in GenomicRanges

We can use the **start()** and **end()** functions to update our start and end positions as well.

```r
start(myGenomicIntervals) <- c(1,5,15)
```

```r
start(myGenomicIntervals)
```

```
## [1]  1  5 15
```

---
# Replacing values in GenomicRanges

We can not however create intervals with negative widths. Our start positions cannot be greater than our end positions.

```r
start(myGenomicIntervals) <- c(10,50,150)
```

```
## Error in validObject(x): invalid class "IRanges" object: 
##     'width(x)' cannot contain negative integers
```

---
# Retrieving chromosome/contig info

We can retrieve the chromosome names from GenomicRanges in a similar manner to **start()** and **end()** functions using the **seqnames()** function.

```r
contigNames <- seqnames(myGenomicIntervals)
contigNames
```

```
## factor-Rle of length 3 with 2 runs
##   Lengths:    2    1
##   Values : chr1 chr2
## Levels(2): chr1 chr2
```

---
# Retrieving chromosome/contig info

The contig names are stored in another special object called a RLE (run length encoded).
This RLE object saves memory and allows for more efficient manipulation of long stretches of repeated names.

We can convert to a character vector using the **as.character()** function

```r
as.character(contigNames)
```

```
## [1] "chr1" "chr1" "chr2"
```

---
# Replacing chromosome/contig info

As with the **start()** and **end()** functions we can update our contig names using the **seqnames()** function.

```r
seqnames(myGenomicIntervals) <- c("chr2","chr2","chr1")
seqnames(myGenomicIntervals)
```

```
## factor-Rle of length 3 with 2 runs
##   Lengths:    2    1
##   Values : chr2 chr1
## Levels(2): chr1 chr2
```

---
# Replacing chromosome/contig info

As with [all factors however](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/singlepage/introToR_Session1.html#factors_-_creating_factors), we can not add contigs not already in use in seqnames column.

```r
seqnames(myGenomicIntervals) <- c("chr3","chr2","chr1")
```

```
## Error in .normalize_seqnames_replacement_value(value, x): levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
```
---
# Replacing chromosome/contig info

To add new contig we must update the seqlevels, like levels for factors but now for contigs in GRanges objects.

The **seqlevels()** identifies which contigs we are using.

```r
seqlevels(myGenomicIntervals)
```

```
## [1] "chr1" "chr2"
```

---
# Replacing chromosome/contig info

We can update the seqlevels in use by again using the **seqlevels()** function.

```r
seqlevels(myGenomicIntervals) <- c("chr1","chr2","chr3")
seqlevels(myGenomicIntervals)
```

```
## [1] "chr1" "chr2" "chr3"
```

---
# Replacing chromosome/contig info

Now we can add contig for chromosome 3 (**chr3**).

```r
seqnames(myGenomicIntervals) <- c("chr1","chr2","chr3")
myGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      *
##   [2]     chr2      5-11      *
##   [3]     chr3     15-30      *
##   -------
##   seqinfo: 3 sequences from an unspecified genome; no seqlengths
```

---
# Strand information in a GenomicRanges

We can include strand information in GRanges objects.

We can set strand to be "+" , "-" , or unstranded with "*".

```r
myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), 
                              myIntervals,
                              strand=c("+","-","*"))
myGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      +
##   [2]     chr1     10-11      -
##   [3]     chr2     20-30      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Strand information in a GenomicRanges

Similarly we can extract and set strand information using the **strand()** function.

```r
strand(myGenomicIntervals) <- c("+","+","-")
strand(myGenomicIntervals)
```

```
## factor-Rle of length 3 with 2 runs
##   Lengths: 2 1
##   Values : + -
## Levels(3): + - *
```

---
# Strand information in a GenomicRanges

The strand information is also stored as a factor RLE. This contains the levels "+", "-", "*" and only these levels.

```r
myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"),
                              myIntervals,
                              strand=c("+","+","+"))
strand(myGenomicIntervals)
```

```
## factor-Rle of length 3 with 1 run
##   Lengths: 3
##   Values : +
## Levels(3): + - *
```

---
# Names in a GenomicRanges

Finally we can set and extract names for our genomic intervals using the **names()** function.

```r
names(myGenomicIntervals) <- c("peak1","peak2","peak3")
myGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##         seqnames    ranges strand
##            <Rle> <IRanges>  <Rle>
##   peak1     chr1       1-2      +
##   peak2     chr1     10-11      +
##   peak3     chr2     20-30      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Names in a GenomicRanges

As with vectors we can use to names to index and retrieve genomic intervals of interest.

```r
myGenomicIntervals["peak2"]
```

```
## GRanges object with 1 range and 0 metadata columns:
##         seqnames    ranges strand
##            <Rle> <IRanges>  <Rle>
##   peak2     chr1     10-11      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Additional genomic interval information

We will often want to store more information on our genomic intervals. We can use the **mcols** (metadata columns) to store this information.

To add additional metadata columns we can simply supply the desired column name and the values we want to the **GRanges()** function as with data frames.

```r
myIntervals <- IRanges(start=c(1,2,2),end=c(2,11,30))
myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"),
                              myIntervals,strand=c("+","+","+"),
                              Score=c(10,20,40),
                              Comment=c("GoodQC","GoodQC","BadQC"))
```

---
# Additional genomic interval information

We can retrieve this metadata column information as with data frames by using the **$** accessor

```r
myGenomicIntervals$Score
```

```
## [1] 10 20 40
```

```r
myGenomicIntervals$Comment
```

```
## [1] "GoodQC" "GoodQC" "BadQC"
```

---
# Additional genomic interval information

We to retrieve all metadata columns as a data frames we can use the **mcols()** function

```r
mcols(myGenomicIntervals)
```

```
## DataFrame with 3 rows and 2 columns
##       Score     Comment
##   <numeric> <character>
## 1        10      GoodQC
## 2        20      GoodQC
## 3        40       BadQC
```

---
# Convert to data frame

We can convert the entire GRanges to a data.frame using the **as.data.frame()** function.

```r
as.data.frame(myGenomicIntervals)
```

```
##   seqnames start end width strand Score Comment
## 1     chr1     1   2     2      +    10  GoodQC
## 2     chr1     2  11    10      +    20  GoodQC
## 3     chr2     2  30    29      +    40   BadQC
```

---
## GRanges from IDs

The GenomicRanges package also has a quick short cut to create GRanges from character strings.

We simply specify a genomic range in a similar format to that used by IGV - **Chromosome:Start-End** _**(chr1:110-120)**_.

```r
GRanges("chr1:110-120")
```

```
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1   110-120      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
## GRanges from IDs

Often genomics programs will export results in table with a similar format for the GenomicIntervals. The BRC pipeline outputs in a format of **Chromosome:Start:End** _**(chr1:110:120)**_ to allow separation into columns in programs such as excel.

To convert the formats we can use some string manipulation tools. In the BRC example we will need to replace last **":"** with a **"-"**. We can use the **stringi** package's **stri_replace_last_fixed** function.

```r
library(stringi)
myRange <- "chr1:110:120"
newRange <- stri_replace_last_fixed(myRange, ':', '-')
newRange
```

```
## [1] "chr1:110-120"
```

---
## GRanges from IDs

Another common format is **IDforRange:Chromosome:Start:End**. We can clean this up in additional step using another **stringi** function.

```r
library(stringi)
myRange <- "MyID:chr1:110:120"
newRange <- stri_replace_last_fixed(myRange, ':', '-')
newerRange <- stri_replace_first_regex(newRange, '\\w*:', '')
newerRange
```

```
## [1] "chr1:110-120"
```

```r
GRanges(newerRange)
```

---
# Special Genomic Ranges functions

One of the big advantage of using GRanges objects to handle genomic data is the set of functions which allows us to manipulate and handle genomic intervals.

The include functions to resize, merge/reduce and shift genomic ranges.

![](imgs/Unknown2.jpg)

---
# Adjusting genomic ranges positions

To adjust the postion of a GRanges object we can manually adjust start and end positions, but GenomicRanges provides some more user friendly methods.

The **shift** function allows us to move by a specified number of basepairs.

```r
myGenomicIntervals[1]
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1       1-2      + |        10      GoodQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

```r
shift(myGenomicIntervals[1],shift = 10)
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr1     11-12      + |        10      GoodQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

The **resize()** function allows to adjust the width of regions.

```r
myGenomicIntervals[3]
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr2      2-30      + |        40       BadQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

```r
resize(myGenomicIntervals[3], width=20)
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr2      2-21      + |        40       BadQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

By default the **resize()** function adjust the width of regions from the **start** of the region. We can control this behavior by specifying the **fix** parameter as **start**, **end** or **center** to specify where we wish to resize from.

```r
myGenomicIntervals[3]
```

```r
resize(myGenomicIntervals[3], width=20, fix="end")
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr2     11-30      + |        40       BadQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

The **resize()** function by default takes into account strand information.

For negative strand intervals, when specifying **fix=start** it considers the start as the end position i.e. Useful for Gene positions.

```r
resize(myGenomicIntervals[3],width=20, fix="start")
```

```r
strand(myGenomicIntervals)[3] <- "-"
resize(myGenomicIntervals[3],width=20, fix="start")
```

```
## GRanges object with 1 range and 2 metadata columns:
##       seqnames    ranges strand |     Score     Comment
##          <Rle> <IRanges>  <Rle> | <numeric> <character>
##   [1]     chr2     11-30      - |        40       BadQC
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

A final function for adjusting genomic ranges which we will review is the **reduce()** function.

The reduce function merges overlapping regions into a single genomic interval.

First lets make some overlapping intervals

```r
myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"),
                              ranges=IRanges(start=c(1,2,2),end=c(2,11,30)),
                              strand=c("+","+","+"))
myGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      +
##   [2]     chr1      2-11      +
##   [3]     chr2      2-30      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

Now we can merge them with **reduce()** function taking into account chromosomes.

```r
mergedGenomicIntervals <- reduce(myGenomicIntervals)
mergedGenomicIntervals
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-11      +
##   [2]     chr2      2-30      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

The **reduce()** function also takes into account the strand information and only merges genomic intervals on same strand.

```r
strand(myGenomicIntervals) <- c("+","-","+")
mergedGenomicIntervals <- reduce(myGenomicIntervals)
mergedGenomicIntervals
```

```
## GRanges object with 3 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      +
##   [2]     chr1      2-11      -
##   [3]     chr2      2-30      +
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Adjusting genomic ranges positions

We can merge intervals across strands by setting **ignore.strand=TRUE**. Note that all regions are now considered strandless.

```r
strand(myGenomicIntervals) <- c("+","-","+")
mergedGenomicIntervals <- reduce(myGenomicIntervals,
                                 ignore.strand=TRUE)
mergedGenomicIntervals
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-11      *
##   [2]     chr2      2-30      *
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Overlapping genomic intervals

Some particularly useful methods/operators in GenomicRanges allows us to assess the relationship between two sets of GRanges objects.

One of the most common is the **%over%** operator which allows us to identify overlapping regions.

First we make two GRanges objects.

```r
myGenomicIntervals1 <- GRanges(seqnames=c("chr1","chr1"),
                              ranges=IRanges(start=c(1,25),
                                             end=c(20,30)),
                              strand=c("+","+"))

myGenomicIntervals2 <- GRanges(seqnames=c("chr1","chr1"),
                              ranges=IRanges(start=c(22,100),
                                             end=c(27,130)),
                              strand=c("+","+"))
```

---
# Overlapping genomic intervals

```r
myGenomicIntervals1
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1      1-20      +
##   [2]     chr1     25-30      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

```r
myGenomicIntervals2
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     22-27      +
##   [2]     chr1   100-130      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Overlapping genomic intervals
Now to find all regions in **myGenomicIntervals1** which overlap with intervals in **myGenomicIntervals2** we simply use the **%over%** operator.

This will return a logical vector specifying if regions in **myGenomicIntervals1** overlap any regions in **myGenomicIntervals2**.

```r
myGenomicIntervals1 %over% myGenomicIntervals2
```

```
## [1] FALSE  TRUE
```

---
# Overlapping genomic intervals

We can use this logical vector then to subset our **myGenomicIntervals1** set [as we have with vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/slides/introToR_Session1.html#factors_-_creating_factors).

```r
myGenomicIntervals1[myGenomicIntervals1 %over% myGenomicIntervals2]
```

```
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     25-30      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Overlapping genomic intervals

Another useful function to find overlaps is the **findOverlaps()** function.

This returns a new object **Hits** object.

```r
myOverlaps <- findOverlaps(myGenomicIntervals1,myGenomicIntervals2)
class(myOverlaps)
```

```
## [1] "SortedByQueryHits"
## attr(,"package")
## [1] "S4Vectors"
```

```r
myOverlaps 
```

```
## Hits object with 1 hit and 0 metadata columns:
##       queryHits subjectHits
##       <integer>   <integer>
##   [1]         2           1
##   -------
##   queryLength: 2 / subjectLength: 2
```

---
# Overlapping genomic intervals

To extract the indices of genomic intervals in **myGenomicIntervals1** overlapping **myGenomicIntervals2** we can use the **queryHits()** function.

To get the corresponding indicies of overlapping **myGenomicIntervals2** regions we can use the **subjectHits()** function.

```r
queryHits(myOverlaps)
```

```
## [1] 2
```

```r
subjectHits(myOverlaps)
```

```
## [1] 1
```

---
# Overlapping genomic intervals

We can use these indices to retrieve the overlapping intervals.

```r
myGenomicIntervals1[queryHits(myOverlaps)]
```

```r
myGenomicIntervals2[subjectHits(myOverlaps)]
```

```
## GRanges object with 1 range and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     22-27      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Finding closest regions

Another common set of operations we often want to perform between our Genomic Intervals is to find regions which are nearest to, precede or follow each other.

We can perform these actions using the **nearest**, **precede** and **follow** functions.

```r
myGenomicIntervals1 <- GRanges(seqnames=c("chr1","chr1"),
                              ranges=IRanges(start=c(10,20),
                                             end=c(25,30)),
                              strand=c("+","+"))

myGenomicIntervals2 <- GRanges(seqnames=c("chr1","chr1"),
                              ranges=IRanges(start=c(1,10000),
                                             end=c(2,10002)),
                              strand=c("+","+"))
```

---
## Finding closest regions

```r
myGenomicIntervals1
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1     10-25      +
##   [2]     chr1     20-30      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

```r
myGenomicIntervals2
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]     chr1         1-2      +
##   [2]     chr1 10000-10002      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Nearest genomics intervals

The **nearest()** function accepts the two GenomicRanges we wish to compare as arguments and returns the index of intervals in second GenomicRanges nearest to those in first GenomicRanges.

```r
indexOfNearest <- nearest(myGenomicIntervals1,myGenomicIntervals2)
indexOfNearest
```

```
## [1] 1 1
```

---
# Nearest genomics intervals

I can use this vector of index positions to then extract intervals in second GenomicRanges nearest to those in first GenomicRanges.

```r
myGenomicIntervals2[indexOfNearest]
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]     chr1       1-2      +
##   [2]     chr1       1-2      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Preceding/following intervals

We can use the **precede()** and **follow()** functions in much the same way to the **nearest()** function. This returns the index of elements in second GenomicRanges argument that are preceeded or followed by elements in first GenomicRanges argument.

```r
precedeIndex <- precede(myGenomicIntervals1,myGenomicIntervals2)
followIndex <- follow(myGenomicIntervals1,myGenomicIntervals2)
myGenomicIntervals2[precedeIndex]
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]     chr1 10000-10002      +
##   [2]     chr1 10000-10002      +
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
# Distance to nearest genomics intervals

Another useful function is **distanceToNearest()** which again accepts arguments like **nearest()** or **findOverlaps()** functions and provides an object containing indexes of nearest intervals including a column of distances between nearest intervals.

```r
distances <- distanceToNearest(myGenomicIntervals1,myGenomicIntervals2)
distances
```

```
## Hits object with 2 hits and 1 metadata column:
##       queryHits subjectHits |  distance
##       <integer>   <integer> | <integer>
##   [1]         1           1 |         7
##   [2]         2           1 |        17
##   -------
##   queryLength: 2 / subjectLength: 2
```

```r
mcols(distances)
```

```
## DataFrame with 2 rows and 1 column
##    distance
##   <integer>
## 1         7
## 2        17
```

---
# Importing and exporting to bed

As we have seen, Genomic Intervals are often stores as BED files.

The **rtracklayer** package provide mechanisms to import and export from BED3/6 files using the **import.bed** and **export.bed** functions.

```r
library(rtracklayer)
mySicerPeaks <- import.bed(con="data/SicerPeaks.bed")
mySicerPeaks
```

```
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |        name     score
##          <Rle>         <IRanges>  <Rle> | <character> <numeric>
##   [1]     chrY 90739201-90744599      * |       peak1         1
##   [2]     chrY 90766601-90767999      * |       peak2        20
##   [3]     chrX   6399201-6401199      * |       peak3        40
##   -------
##   seqinfo: 2 sequences from an unspecified genome; no seqlengths
```

---
# Importing and exporting to bed

Exporting to BED is achieved by using the **export.bed()** function.

This produces a valid BED file for use in IGV and other softwares.

```r
export.bed(mySicerPeaks, con="moreSicerPeaks.bed")
```

---
# Using GRanges to extract sequences

We saw how we can extract sequences from a BSgenome object previously using **subseq()** function.

```r
library(BSgenome.Mmusculus.UCSC.mm10)
subseq(BSgenome.Mmusculus.UCSC.mm10$chr10,1,100)
```

```
## 100-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```
---
# Using GRanges to extract sequences

With GRanges objects we can specify the BSgenome for the sequence and GRanges as regions to be extracted.

Here we can use the **getSeq()** function to extract sequences under our Sicer Peaks on the mm10 genome.

```r
sicerSeq <- getSeq(BSgenome.Mmusculus.UCSC.mm10, names=mySicerPeaks)
sicerSeq
```

```
## DNAStringSet object of length 3:
##     width seq
## [1]  5399 AACAGGGCCTGACTGCGTACAGACATGTCCCCCT...GCCCCGGAGATCGGCGCTGGGCCCTAGCCTAGA
## [2]  1399 CATTCACCCACAGTGATCCCCCCGCCTCAGAAAC...TCAGAAAATCACTCACAGTCATCCCCCTGCCTC
## [3]  1999 TTGGCCAGCTTATCAGTGGGTGTTCTTTCGAGTA...CAAGATTGCCAGGTGGAGGGACTTCATGAACTT
```

---
# Using GRanges to extract sequences

The resulting **DNAStringSet** may be written to a FASTA file for further analysis in external programs.

```r
writeXStringSet(sicerSeq,"sicerSeq.fa")
```

---
# Time for an exercise.

[Link_to_exercises](../../exercises/exercises/GI_exercise.html)

[Link_to_answers](../../exercises/answers/GI_answers.html)