class: center, middle, inverse, title-slide # Genomic Intervals in Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- ## Genomic Intervals As we have seen earlier, genomic intervals are often stored in BED format. <div align="center"> <img src="imgs/bed.png" alt="igv" height="400" width="400"> </div> --- ## Genomic Intervals examples In Genomics and High Throughput sequencing we make extensive use of Genomic Intervals to store simple annotation and summarised results over regions. - Transcriptional Start Sites / Enhancers - Enrichment for ChIP-seq signal i.e. Peak calls. --- ## Genomic Intervals manipulation We also want to be able to assess spatial relation between such genomic intervals within our linear genome. - Transcriptional Start Sites and their nearest enhancers. - Enrichment for ChIP-seq signal within promoters and outside of promoters. --- class: inverse, center, middle # Genomic Intervals in Bioconductor <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Genomic Intervals in Bioconductor Two popular Bioconductor packages for dealing with Genomics Intervals are: - [**rtracklayer**](https://bioconductor.org/packages/release/bioc/html/rtracklayer.html) - Importing/exporting genomic intervals into/out of R. - [**GenomicRanges**](https://bioconductor.org/packages/release/bioc/html/GenomicRanges.html) - Handling genomic intervals in R. --- ## Genomic Intervals in Bioconductor The first package we will look at is the GenomicRanges package. Remember we can install Bioconductor packages (fairly) easily using Bioconductor package [commands](https://www.bioconductor.org/install/). ```r BiocManager::install(version = "3.10") BiocManager::install("GenomicRanges") ``` --- # Genomic Intervals in Bioconductor Now we have the package installed, we can load the library. ```r library(GenomicRanges) ``` --- class: inverse, center, middle # Creating Genomic Ranges. <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- # Building a GenomicRanges To construct a GenomicRanges object we require a set of genomic intervals. Intervals (genomic and non genomic intervals) are created using the **IRanges()** function. We must specify **start** and **end** positions as numeric vectors to create our IRanges object. ```r myIntervals <- IRanges(start=c(1,10,20),end=c(2,11,30)) class(myIntervals) ``` ``` ## [1] "IRanges" ## attr(,"package") ## [1] "IRanges" ``` --- # Building a GenomicRanges The resulting IRanges object contains our intervals and an additional column describing widths of intervals. ```r myIntervals ``` ``` ## IRanges object with 3 ranges and 0 metadata columns: ## start end width ## <integer> <integer> <integer> ## [1] 1 2 2 ## [2] 10 11 2 ## [3] 20 30 11 ``` --- # Building a GenomicRanges Now to create a Genomic Ranges interval object or **GRanges** object. We extend our IRanges object to also include chrosomosome/contig information. To do this we can supply our new IRanges object and a vector of corresponding chromosome locations to the **seqnames** parameter in **GRanges()** function. ```r myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), myIntervals) class(myGenomicIntervals) ``` ``` ## [1] "GRanges" ## attr(,"package") ## [1] "GenomicRanges" ``` --- # Building a GenomicRanges The created **GRanges** object contains our original intervals from the IRanges object and now the chromosomes/contigs on which the intervals exist. ```r myGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 * ## [2] chr1 10-11 * ## [3] chr2 20-30 * ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Indexing a GenomicRanges We can retrieve genomic intervals from our GRanges using indexing as with [vectors, matrices and dataframes.](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/singlepage/introToR_Session1.html#vectors_-_indexing) ```r myGenomicIntervals[c(2,3),] ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 10-11 * ## [2] chr2 20-30 * ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Indexing a GenomicRanges Unlike data frames and matrices, we dont need the second **","** to specify rows not columns. ```r myGenomicIntervals[c(2,3)] ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 10-11 * ## [2] chr2 20-30 * ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Accessing columns in GenomicRanges To access the columns of a GRanges object we must use special accessor functions - **seqnames()**,**start()**, **end()** which correspond to the columns we wish to retrieve. We get start and stop positions using the **start()** and **end()** functions. ```r start(myGenomicIntervals) ``` ``` ## [1] 1 10 20 ``` ```r end(myGenomicIntervals) ``` ``` ## [1] 2 11 30 ``` --- # Replacing values in GenomicRanges We can use the **start()** and **end()** functions to update our start and end positions as well. ```r start(myGenomicIntervals) <- c(1,5,15) ``` ```r start(myGenomicIntervals) ``` ``` ## [1] 1 5 15 ``` --- # Replacing values in GenomicRanges We can not however create intervals with negative widths. Our start positions cannot be greater than our end positions. ```r start(myGenomicIntervals) <- c(10,50,150) ``` ``` ## Error in validObject(x): invalid class "IRanges" object: ## 'width(x)' cannot contain negative integers ``` --- # Retrieving chromosome/contig info We can retrieve the chromosome names from GenomicRanges in a similar manner to **start()** and **end()** functions using the **seqnames()** function. ```r contigNames <- seqnames(myGenomicIntervals) contigNames ``` ``` ## factor-Rle of length 3 with 2 runs ## Lengths: 2 1 ## Values : chr1 chr2 ## Levels(2): chr1 chr2 ``` --- # Retrieving chromosome/contig info The contig names are stored in another special object called a RLE (run length encoded). This RLE object saves memory and allows for more efficient manipulation of long stretches of repeated names. We can convert to a character vector using the **as.character()** function ```r as.character(contigNames) ``` ``` ## [1] "chr1" "chr1" "chr2" ``` --- # Replacing chromosome/contig info As with the **start()** and **end()** functions we can update our contig names using the **seqnames()** function. ```r seqnames(myGenomicIntervals) <- c("chr2","chr2","chr1") seqnames(myGenomicIntervals) ``` ``` ## factor-Rle of length 3 with 2 runs ## Lengths: 2 1 ## Values : chr2 chr1 ## Levels(2): chr1 chr2 ``` --- # Replacing chromosome/contig info As with [all factors however](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/singlepage/introToR_Session1.html#factors_-_creating_factors), we can not add contigs not already in use in seqnames column. ```r seqnames(myGenomicIntervals) <- c("chr3","chr2","chr1") ``` ``` ## Error in .normalize_seqnames_replacement_value(value, x): levels of supplied 'seqnames' must be identical to 'seqlevels(x)' ``` --- # Replacing chromosome/contig info To add new contig we must update the seqlevels, like levels for factors but now for contigs in GRanges objects. The **seqlevels()** identifies which contigs we are using. ```r seqlevels(myGenomicIntervals) ``` ``` ## [1] "chr1" "chr2" ``` --- # Replacing chromosome/contig info We can update the seqlevels in use by again using the **seqlevels()** function. ```r seqlevels(myGenomicIntervals) <- c("chr1","chr2","chr3") seqlevels(myGenomicIntervals) ``` ``` ## [1] "chr1" "chr2" "chr3" ``` --- # Replacing chromosome/contig info Now we can add contig for chromosome 3 (**chr3**). ```r seqnames(myGenomicIntervals) <- c("chr1","chr2","chr3") myGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 * ## [2] chr2 5-11 * ## [3] chr3 15-30 * ## ------- ## seqinfo: 3 sequences from an unspecified genome; no seqlengths ``` --- # Strand information in a GenomicRanges We can include strand information in GRanges objects. We can set strand to be "+" , "-" , or unstranded with "*". ```r myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), myIntervals, strand=c("+","-","*")) myGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 + ## [2] chr1 10-11 - ## [3] chr2 20-30 * ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Strand information in a GenomicRanges Similarly we can extract and set strand information using the **strand()** function. ```r strand(myGenomicIntervals) <- c("+","+","-") strand(myGenomicIntervals) ``` ``` ## factor-Rle of length 3 with 2 runs ## Lengths: 2 1 ## Values : + - ## Levels(3): + - * ``` --- # Strand information in a GenomicRanges The strand information is also stored as a factor RLE. This contains the levels "+", "-", "*" and only these levels. ```r myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), myIntervals, strand=c("+","+","+")) strand(myGenomicIntervals) ``` ``` ## factor-Rle of length 3 with 1 run ## Lengths: 3 ## Values : + ## Levels(3): + - * ``` --- # Names in a GenomicRanges Finally we can set and extract names for our genomic intervals using the **names()** function. ```r names(myGenomicIntervals) <- c("peak1","peak2","peak3") myGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## peak1 chr1 1-2 + ## peak2 chr1 10-11 + ## peak3 chr2 20-30 + ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Names in a GenomicRanges As with vectors we can use to names to index and retrieve genomic intervals of interest. ```r myGenomicIntervals["peak2"] ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## peak2 chr1 10-11 + ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Additional genomic interval information We will often want to store more information on our genomic intervals. We can use the **mcols** (metadata columns) to store this information. To add additional metadata columns we can simply supply the desired column name and the values we want to the **GRanges()** function as with data frames. ```r myIntervals <- IRanges(start=c(1,2,2),end=c(2,11,30)) myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), myIntervals,strand=c("+","+","+"), Score=c(10,20,40), Comment=c("GoodQC","GoodQC","BadQC")) ``` --- # Additional genomic interval information We can retrieve this metadata column information as with data frames by using the **$** accessor ```r myGenomicIntervals$Score ``` ``` ## [1] 10 20 40 ``` ```r myGenomicIntervals$Comment ``` ``` ## [1] "GoodQC" "GoodQC" "BadQC" ``` --- # Additional genomic interval information We to retrieve all metadata columns as a data frames we can use the **mcols()** function ```r mcols(myGenomicIntervals) ``` ``` ## DataFrame with 3 rows and 2 columns ## Score Comment ## <numeric> <character> ## 1 10 GoodQC ## 2 20 GoodQC ## 3 40 BadQC ``` --- # Convert to data frame We can convert the entire GRanges to a data.frame using the **as.data.frame()** function. ```r as.data.frame(myGenomicIntervals) ``` ``` ## seqnames start end width strand Score Comment ## 1 chr1 1 2 2 + 10 GoodQC ## 2 chr1 2 11 10 + 20 GoodQC ## 3 chr2 2 30 29 + 40 BadQC ``` --- ## GRanges from IDs The GenomicRanges package also has a quick short cut to create GRanges from character strings. We simply specify a genomic range in a similar format to that used by IGV - **Chromosome:Start-End** _**(chr1:110-120)**_. ```r GRanges("chr1:110-120") ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 110-120 * ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- ## GRanges from IDs Often genomics programs will export results in table with a similar format for the GenomicIntervals. The BRC pipeline outputs in a format of **Chromosome:Start:End** _**(chr1:110:120)**_ to allow separation into columns in programs such as excel. To convert the formats we can use some string manipulation tools. In the BRC example we will need to replace last **":"** with a **"-"**. We can use the **stringi** package's **stri_replace_last_fixed** function. ```r library(stringi) myRange <- "chr1:110:120" newRange <- stri_replace_last_fixed(myRange, ':', '-') newRange ``` ``` ## [1] "chr1:110-120" ``` --- ## GRanges from IDs Another common format is **IDforRange:Chromosome:Start:End**. We can clean this up in additional step using another **stringi** function. ```r library(stringi) myRange <- "MyID:chr1:110:120" newRange <- stri_replace_last_fixed(myRange, ':', '-') newerRange <- stri_replace_first_regex(newRange, '\\w*:', '') newerRange ``` ``` ## [1] "chr1:110-120" ``` ```r GRanges(newerRange) ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 110-120 * ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Special Genomic Ranges functions One of the big advantage of using GRanges objects to handle genomic data is the set of functions which allows us to manipulate and handle genomic intervals. The include functions to resize, merge/reduce and shift genomic ranges. ![](imgs/Unknown2.jpg) --- # Adjusting genomic ranges positions To adjust the postion of a GRanges object we can manually adjust start and end positions, but GenomicRanges provides some more user friendly methods. The **shift** function allows us to move by a specified number of basepairs. ```r myGenomicIntervals[1] ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr1 1-2 + | 10 GoodQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` ```r shift(myGenomicIntervals[1],shift = 10) ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr1 11-12 + | 10 GoodQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions The **resize()** function allows to adjust the width of regions. ```r myGenomicIntervals[3] ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 2-30 + | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` ```r resize(myGenomicIntervals[3], width=20) ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 2-21 + | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions By default the **resize()** function adjust the width of regions from the **start** of the region. We can control this behavior by specifying the **fix** parameter as **start**, **end** or **center** to specify where we wish to resize from. ```r myGenomicIntervals[3] ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 2-30 + | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` ```r resize(myGenomicIntervals[3], width=20, fix="end") ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 11-30 + | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions The **resize()** function by default takes into account strand information. For negative strand intervals, when specifying **fix=start** it considers the start as the end position i.e. Useful for Gene positions. ```r resize(myGenomicIntervals[3],width=20, fix="start") ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 2-21 + | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` ```r strand(myGenomicIntervals)[3] <- "-" resize(myGenomicIntervals[3],width=20, fix="start") ``` ``` ## GRanges object with 1 range and 2 metadata columns: ## seqnames ranges strand | Score Comment ## <Rle> <IRanges> <Rle> | <numeric> <character> ## [1] chr2 11-30 - | 40 BadQC ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions A final function for adjusting genomic ranges which we will review is the **reduce()** function. The reduce function merges overlapping regions into a single genomic interval. First lets make some overlapping intervals ```r myGenomicIntervals <- GRanges(seqnames=c("chr1","chr1","chr2"), ranges=IRanges(start=c(1,2,2),end=c(2,11,30)), strand=c("+","+","+")) myGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 + ## [2] chr1 2-11 + ## [3] chr2 2-30 + ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions Now we can merge them with **reduce()** function taking into account chromosomes. ```r mergedGenomicIntervals <- reduce(myGenomicIntervals) mergedGenomicIntervals ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-11 + ## [2] chr2 2-30 + ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions The **reduce()** function also takes into account the strand information and only merges genomic intervals on same strand. ```r strand(myGenomicIntervals) <- c("+","-","+") mergedGenomicIntervals <- reduce(myGenomicIntervals) mergedGenomicIntervals ``` ``` ## GRanges object with 3 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 + ## [2] chr1 2-11 - ## [3] chr2 2-30 + ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Adjusting genomic ranges positions We can merge intervals across strands by setting **ignore.strand=TRUE**. Note that all regions are now considered strandless. ```r strand(myGenomicIntervals) <- c("+","-","+") mergedGenomicIntervals <- reduce(myGenomicIntervals, ignore.strand=TRUE) mergedGenomicIntervals ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-11 * ## [2] chr2 2-30 * ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Overlapping genomic intervals Some particularly useful methods/operators in GenomicRanges allows us to assess the relationship between two sets of GRanges objects. One of the most common is the **%over%** operator which allows us to identify overlapping regions. First we make two GRanges objects. ```r myGenomicIntervals1 <- GRanges(seqnames=c("chr1","chr1"), ranges=IRanges(start=c(1,25), end=c(20,30)), strand=c("+","+")) myGenomicIntervals2 <- GRanges(seqnames=c("chr1","chr1"), ranges=IRanges(start=c(22,100), end=c(27,130)), strand=c("+","+")) ``` --- # Overlapping genomic intervals ```r myGenomicIntervals1 ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-20 + ## [2] chr1 25-30 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` ```r myGenomicIntervals2 ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 22-27 + ## [2] chr1 100-130 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Overlapping genomic intervals Now to find all regions in **myGenomicIntervals1** which overlap with intervals in **myGenomicIntervals2** we simply use the **%over%** operator. This will return a logical vector specifying if regions in **myGenomicIntervals1** overlap any regions in **myGenomicIntervals2**. ```r myGenomicIntervals1 %over% myGenomicIntervals2 ``` ``` ## [1] FALSE TRUE ``` --- # Overlapping genomic intervals We can use this logical vector then to subset our **myGenomicIntervals1** set [as we have with vectors](https://rockefelleruniversity.github.io/Intro_To_R_1Day/r_course/presentations/slides/introToR_Session1.html#factors_-_creating_factors). ```r myGenomicIntervals1[myGenomicIntervals1 %over% myGenomicIntervals2] ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 25-30 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Overlapping genomic intervals Another useful function to find overlaps is the **findOverlaps()** function. This returns a new object **Hits** object. ```r myOverlaps <- findOverlaps(myGenomicIntervals1,myGenomicIntervals2) class(myOverlaps) ``` ``` ## [1] "SortedByQueryHits" ## attr(,"package") ## [1] "S4Vectors" ``` ```r myOverlaps ``` ``` ## Hits object with 1 hit and 0 metadata columns: ## queryHits subjectHits ## <integer> <integer> ## [1] 2 1 ## ------- ## queryLength: 2 / subjectLength: 2 ``` --- # Overlapping genomic intervals To extract the indices of genomic intervals in **myGenomicIntervals1** overlapping **myGenomicIntervals2** we can use the **queryHits()** function. To get the corresponding indicies of overlapping **myGenomicIntervals2** regions we can use the **subjectHits()** function. ```r queryHits(myOverlaps) ``` ``` ## [1] 2 ``` ```r subjectHits(myOverlaps) ``` ``` ## [1] 1 ``` --- # Overlapping genomic intervals We can use these indices to retrieve the overlapping intervals. ```r myGenomicIntervals1[queryHits(myOverlaps)] ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 25-30 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` ```r myGenomicIntervals2[subjectHits(myOverlaps)] ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 22-27 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Finding closest regions Another common set of operations we often want to perform between our Genomic Intervals is to find regions which are nearest to, precede or follow each other. We can perform these actions using the **nearest**, **precede** and **follow** functions. ```r myGenomicIntervals1 <- GRanges(seqnames=c("chr1","chr1"), ranges=IRanges(start=c(10,20), end=c(25,30)), strand=c("+","+")) myGenomicIntervals2 <- GRanges(seqnames=c("chr1","chr1"), ranges=IRanges(start=c(1,10000), end=c(2,10002)), strand=c("+","+")) ``` --- ## Finding closest regions ```r myGenomicIntervals1 ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 10-25 + ## [2] chr1 20-30 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` ```r myGenomicIntervals2 ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 + ## [2] chr1 10000-10002 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Nearest genomics intervals The **nearest()** function accepts the two GenomicRanges we wish to compare as arguments and returns the index of intervals in second GenomicRanges nearest to those in first GenomicRanges. ```r indexOfNearest <- nearest(myGenomicIntervals1,myGenomicIntervals2) indexOfNearest ``` ``` ## [1] 1 1 ``` --- # Nearest genomics intervals I can use this vector of index positions to then extract intervals in second GenomicRanges nearest to those in first GenomicRanges. ```r myGenomicIntervals2[indexOfNearest] ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 1-2 + ## [2] chr1 1-2 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Preceding/following intervals We can use the **precede()** and **follow()** functions in much the same way to the **nearest()** function. This returns the index of elements in second GenomicRanges argument that are preceeded or followed by elements in first GenomicRanges argument. ```r precedeIndex <- precede(myGenomicIntervals1,myGenomicIntervals2) followIndex <- follow(myGenomicIntervals1,myGenomicIntervals2) myGenomicIntervals2[precedeIndex] ``` ``` ## GRanges object with 2 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 10000-10002 + ## [2] chr1 10000-10002 + ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- # Distance to nearest genomics intervals Another useful function is **distanceToNearest()** which again accepts arguments like **nearest()** or **findOverlaps()** functions and provides an object containing indexes of nearest intervals including a column of distances between nearest intervals. ```r distances <- distanceToNearest(myGenomicIntervals1,myGenomicIntervals2) distances ``` ``` ## Hits object with 2 hits and 1 metadata column: ## queryHits subjectHits | distance ## <integer> <integer> | <integer> ## [1] 1 1 | 7 ## [2] 2 1 | 17 ## ------- ## queryLength: 2 / subjectLength: 2 ``` ```r mcols(distances) ``` ``` ## DataFrame with 2 rows and 1 column ## distance ## <integer> ## 1 7 ## 2 17 ``` --- # Importing and exporting to bed As we have seen, Genomic Intervals are often stores as BED files. The **rtracklayer** package provide mechanisms to import and export from BED3/6 files using the **import.bed** and **export.bed** functions. ```r library(rtracklayer) mySicerPeaks <- import.bed(con="data/SicerPeaks.bed") mySicerPeaks ``` ``` ## GRanges object with 3 ranges and 2 metadata columns: ## seqnames ranges strand | name score ## <Rle> <IRanges> <Rle> | <character> <numeric> ## [1] chrY 90739201-90744599 * | peak1 1 ## [2] chrY 90766601-90767999 * | peak2 20 ## [3] chrX 6399201-6401199 * | peak3 40 ## ------- ## seqinfo: 2 sequences from an unspecified genome; no seqlengths ``` --- # Importing and exporting to bed Exporting to BED is achieved by using the **export.bed()** function. This produces a valid BED file for use in IGV and other softwares. ```r export.bed(mySicerPeaks, con="moreSicerPeaks.bed") ``` --- # Using GRanges to extract sequences We saw how we can extract sequences from a BSgenome object previously using **subseq()** function. ```r library(BSgenome.Mmusculus.UCSC.mm10) subseq(BSgenome.Mmusculus.UCSC.mm10$chr10,1,100) ``` ``` ## 100-letter DNAString object ## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ``` --- # Using GRanges to extract sequences With GRanges objects we can specify the BSgenome for the sequence and GRanges as regions to be extracted. Here we can use the **getSeq()** function to extract sequences under our Sicer Peaks on the mm10 genome. ```r sicerSeq <- getSeq(BSgenome.Mmusculus.UCSC.mm10, names=mySicerPeaks) sicerSeq ``` ``` ## DNAStringSet object of length 3: ## width seq ## [1] 5399 AACAGGGCCTGACTGCGTACAGACATGTCCCCCT...GCCCCGGAGATCGGCGCTGGGCCCTAGCCTAGA ## [2] 1399 CATTCACCCACAGTGATCCCCCCGCCTCAGAAAC...TCAGAAAATCACTCACAGTCATCCCCCTGCCTC ## [3] 1999 TTGGCCAGCTTATCAGTGGGTGTTCTTTCGAGTA...CAAGATTGCCAGGTGGAGGGACTTCATGAACTT ``` --- # Using GRanges to extract sequences The resulting **DNAStringSet** may be written to a FASTA file for further analysis in external programs. ```r writeXStringSet(sicerSeq,"sicerSeq.fa") ``` --- # Time for an exercise. [Link_to_exercises](../../exercises/exercises/GI_exercise.html) [Link_to_answers](../../exercises/answers/GI_answers.html)