As we have seen earlier, genomic scores are often stored in wiggle format.
A perhaps more common human readable format is bedGraph.
But in many situations we would want a highly compressed format such as bigWig.
We used a bigWig from Encode in our last session.
Genomic Scores are heavily used in Genomics and High throughput sequencing as they offer a simple mechanism to review a defined metric over the linear genome at a specified resolution.
In our last session Genomic Intervals In Bioconductor we reviewed some of the Myc ChIP-seq signal available to us on encode.
This was the data from Experiment ENCSR000ERN, containing information on Myc ChIP-seq in mouse genome version mm10 CH12 cell line.
If you have not already downloaded the bigWig file then download it from this link for our exercise later
From our last session we identified Myc peaks within the Igfbp2 locus and in IGV compared Myc ChIP-seq signal from Encode over our peaks.
For this course I have provided bedGraph and bigWig files from this data for the window with the IGV image.
We will demonstrate how to create this later in today’s session but for now data is available in
Two popular Bioconductor packages for dealing with Genomics Scores are:
Now we have the package installed, we can load the library rtracklayer which we will use to import and export from/to bedGraph and bigWig.
library(rtracklayer)
We will also be making use of the functions in the GenomicRanges package. We dont need to load GenomicRanges directly here because the rtracklayer does this for us.
Package dependencies and imports allow one package to make use of functions from another.
The rtracklayer package provides functions to import genomic scores from a bedGraph using the import.bedGraph() function.
<- import.bedGraph("data/TSS_ENCFF940MBK.bedGraph") myBedG
The genomic scores by default are stored as a familiar GRanges object containing the original 4 columns of information, contig, start, end and score.
1:3] myBedG[
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 1-72811054 * | 0.00000
## [2] chr1 72811055-72811119 * | 0.69040
## [3] chr1 72811120-72811145 * | 0.57802
## -------
## seqinfo: 54 sequences from an unspecified genome; no seqlengths
Because we only have 4 columns in a bedGraph and no strand information, the GRanges intervals are unstranded with * in their strand column
strand(myBedG)
## factor-Rle of length 2161 with 1 run
## Lengths: 2161
## Values : *
## Levels(3): + - *
Much of the genomic scores we will be working with are infact stored in the compressed bigWig format.
We can also import bigWigs into R using the import.bw function.
<- import.bw("data/TSS_ENCFF940MBK.bw") myBigWig
The import bigWig’s genomic scores are again imported as a GRanges object containing the same information as the imported bedGraph.
1:3] myBigWig[
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 1-72811054 * | 0.00000
## [2] chr1 72811055-72811119 * | 0.69040
## [3] chr1 72811120-72811145 * | 0.57802
## -------
## seqinfo: 54 sequences from an unspecified genome
So far we have retrieved our genomic scores from bedGraphs and bigWigs as GRanges objects.
This allows us to use GRanges accessors and functions we have already seen in our last session.
<- GRanges("chr1",IRanges(72823698,72824485))
myGRanges <- myBigWig[myBigWig %over% myGRanges]
filteredBigWig 1:3] filteredBigWig[
## GRanges object with 3 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 72823686-72823703 * | 1.34249
## [2] chr1 72823704-72823710 * | 1.80605
## [3] chr1 72823711-72823715 * | 1.68676
## -------
## seqinfo: 54 sequences from an unspecified genome
We can however specify the type of objects we would like to return from the import.bedGraph and import.bw functions.
Here we will import the bigWig as a object we have briefly seen, the Rle object (run length encoding). Here we have an Rlelist (a list of Rle objects)
<- import.bw("data/TSS_ENCFF940MBK.bw",
myBigWig as = "RleList")
class(myBigWig)
## [1] "SimpleRleList"
## attr(,"package")
## [1] "IRanges"
Run length encoding allows for a very efficient storage of long stretchs of repeated values.
We have already seen an rle in our cigar string from SAM files.
We have also seen Rle objects within the GRanges objects we reviewed last session.
Chromosome/contig names in GRanges objects are stored as Rle objects for instance.
<- import.bed("data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
mycPeaks seqnames(mycPeaks)
## factor-Rle of length 13910 with 21 runs
## Lengths: 891 737 1537 498 595 ... 796 646 714 227 10
## Values : chr1 chr10 chr11 chr12 chr13 ... chr7 chr8 chr9 chrX chrY
## Levels(21): chr1 chr10 chr11 chr12 chr13 chr14 ... chr7 chr8 chr9 chrX chrY
For genomic scores we will be storing long stretchs of numbers as an Rle.
To store genomic scores across chromosomes/contig we will use an RleList.
Creating an Rle is straightforward. We can simply supply a numeric vector of numbers we wish to compress to the Rle() function.
<- c(0,0,0,0,0,1,1,1,0,0,0,0,0)
myNumbers Rle(myNumbers)
## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 0
Now can construct a named RleList containing the Rle objects using the RleList() function.
<- c(0,0,0,0,0,1,1,1,2,2,2,2,2)
myNumbers2 <- Rle(myNumbers)
chr1Scores <- Rle(myNumbers2)
chr2Scores <- RleList(chr1=chr1Scores,chr2=chr2Scores)
myRleList myRleList
## RleList of length 2
## $chr1
## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 0
##
## $chr2
## numeric-Rle of length 13 with 3 runs
## Lengths: 5 3 5
## Values : 0 1 2
Rle allows for us to compress space needed to store genomic scores and so the Bioconductor Rle and RleList objects allows us make use of this in R.
We can see the compression of space in action by reviewing our imported genomic scores as an RleList objects.
1:2] myBigWig[
## RleList of length 2
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 26 ... 10 15 122614997
## Values : 0.00000 0.69040 0.57802 ... 0.21374 0.20965 0.00000
##
## $chr10
## numeric-Rle of length 130694993 with 1 run
## Lengths: 130694993
## Values : 0
To access elements of a RleList we can use our regular accessors for lists $ and [[]].
Here we retrieve the Rle object named chr1 containing all the genomic scores information for chromosome 1.
<- myBigWig$chr1
chr1_rle # Or
<- myBigWig[["chr1"]]
chr1_rle chr1_rle
## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 26 ... 10 15 122614997
## Values : 0.00000 0.69040 0.57802 ... 0.21374 0.20965 0.00000
Now we have an Rle object of our genomic scores on chromosome 1 we can index it just like we have for standard vectors. Here i retrieve values for all basepairs between 1 and 10
1:10] chr1_rle[
## numeric-Rle of length 10 with 1 run
## Lengths: 10
## Values : 0
We can also replace values in a Rle, just as we would with a vector.
Here I replace values for all basepairs between 1 and 10 to 100
1:10] <- 100
chr1_rle[ chr1_rle
## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 65 ... 10 15 122614997
## Values : 100.00000 0.00000 0.69040 ... 0.21374 0.20965 0.00000
We can convert Rle’s to standard vectors.
<- as.vector(chr1_rle[1:10])
rleAsVector rleAsVector
## [1] 100 100 100 100 100 100 100 100 100 100
Or to a data.frame
<- as.data.frame(chr1_rle[1:10])
rleAsDF rleAsDF
## value
## 1 100
## 2 100
## 3 100
## 4 100
## 5 100
## 6 100
## 7 100
## 8 100
## 9 100
## 10 100
Many vector operations and function work with Rle objects.
These include
We can use many simple arithmetic operations such as +, -, / and *****
+1000 chr1_rle
## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 65 ... 10 15 122614997
## Values : 1100.00 1000.00 1000.69 ... 1000.21 1000.21 1000.00
+1000)*10 (chr1_rle
## numeric-Rle of length 195471971 with 2109 runs
## Lengths: 10 72811044 65 ... 10 15 122614997
## Values : 11000.0 10000.0 10006.9 ... 10002.1 10002.1 10000.0
Logical operations can be used with Rle objects just as with vectors.
For Rle objects, logical opertations return a logical Rle instead of standard logical vector.
< 10 chr1_rle
## logical-Rle of length 195471971 with 20 runs
## Lengths: 10 72823081 55 ... 65 2 122627007
## Values : FALSE TRUE FALSE ... TRUE FALSE TRUE
We use this logical Rle to replace values less than 10 with 0 as we would use a logical vector with standard vectors.
< 10] <- 0
chr1_rle[chr1_rle chr1_rle
## numeric-Rle of length 195471971 with 122 runs
## Lengths: 10 72823081 14 ... 65 2 122627007
## Values : 100.0000 0.0000 10.0802 ... 0.0000 10.2513 0.0000
Many functions providing summary statisitics are also available to us including mean(), max() and sum() functions.
mean(chr1_rle)
## [1] 4.181372e-05
max(chr1_rle)
## [1] 100
sum(chr1_rle)
## [1] 8173.411
Very usefully, We can also apply arithmetic and mathematical operations to whole RleLists as imported from the bigWig file.
<- import.bw("data/TSS_ENCFF940MBK.bw",as="RleList")
myBigWig +10 myBigWig
## RleList of length 54
## $chr1
## numeric-Rle of length 195471971 with 2108 runs
## Lengths: 72811054 65 26 ... 10 15 122614997
## Values : 10.0000 10.6904 10.5780 ... 10.2137 10.2096 10.0000
##
## $chr10
## numeric-Rle of length 130694993 with 1 run
## Lengths: 130694993
## Values : 10
##
## $chr11
## numeric-Rle of length 122082543 with 1 run
## Lengths: 122082543
## Values : 10
##
## $chr12
## numeric-Rle of length 120129022 with 1 run
## Lengths: 120129022
## Values : 10
##
## $chr13
## numeric-Rle of length 120421639 with 1 run
## Lengths: 120421639
## Values : 10
##
## ...
## <49 more elements>
Summary functions return a summary for every Rle object (here for chromosomes) within the RleList.
<- max(myBigWig)
chromosomeMax 1:4] chromosomeMax[
## chr1 chr10 chr11 chr12
## 23.09015 0.00000 0.00000 0.00000
We can in fact use GRanges objects to index our RleList objects.The GRanges provides the intervals from which genomic scores are retrieved. The resulting RleList object contains an entry with the scores for each interval in the GRanges object.
To demonstrate first we can retrieve the Myc Peaks calls which overlap the region we are reviewing.
<- GRanges("chr1",ranges = IRanges(72811055,72856974))
myRanges <- import.bed("data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
mycPeaks <- resize(mycPeaks,50,fix="center")
mycPeaks <- mycPeaks[mycPeaks %over% myRanges]
newMycPeaks newMycPeaks
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 72824021-72824070 * | Myc_Ch12_1_withInput.. 18.55062
## [2] chr1 72844876-72844925 * | Myc_Ch12_1_withInput.. 9.27569
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Now we can simply index our RleList object by providing our GRanges as an index in [] brackets.
<- myBigWig[newMycPeaks]
rleOverGranges rleOverGranges
## RleList of length 2
## $chr1
## numeric-Rle of length 50 with 14 runs
## Lengths: 6 3 2 3 ... 2 1 1 1
## Values : 16.1810 18.4153 20.7201 19.5592 ... 20.7201 21.8973 23.0902 21.8973
##
## $chr1
## numeric-Rle of length 50 with 9 runs
## Lengths: 1 10 6 3 ... 8 4 2
## Values : 11.25152 9.79599 10.51570 9.79599 ... 9.09299 8.40735 7.09102
With the RleList containing our scores over the Myc peaks we can now gather summary statistics as with all RleList objects
sum(rleOverGranges)
## chr1 chr1
## 1009.0643 480.4242
We may wish to export our RleList or Rle objects back to a bigWig file.
We would need to convert any Rle objects to a RleList first in order to provide some chromosome/contig names. We can do this with the RleList() function and providing a chromosome/contig name to our Rle object.
<- RleList(chr1=chr1_rle)
myRleList myRleList
## RleList of length 1
## $chr1
## numeric-Rle of length 195471971 with 122 runs
## Lengths: 10 72823081 14 ... 65 2 122627007
## Values : 100.0000 0.0000 10.0802 ... 0.0000 10.2513 0.0000
Now we have our RleList object we export this to a bigWig using the export.bw() function.
export.bw(myRleList,con="chr1_Myc.bw")
In some cases we will only wish to import portions of a BigWig file for use in our analysis. This will save us memory and time when loading in big files.
To do this we can take advantage of the selection parameter in the import.bw function and the BigWigSelection() function.
When importing only a portion of a bigWig we simply need to specify a GRanges of regions we wish to retrieve to the BigWigSelection() function.
Here we will use the Myc Peaks in our window as the GRanges of regions for selection.
newMycPeaks
## GRanges object with 2 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 72824021-72824070 * | Myc_Ch12_1_withInput.. 18.55062
## [2] chr1 72844876-72844925 * | Myc_Ch12_1_withInput.. 9.27569
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Now we pass the Myc peak GRanges object to the BigWigSelection() function.
We then supply the resulting BigWigSelection selection object to the selection parameter of import.bw() function.
<- BigWigSelection(newMycPeaks)
mySelection import.bw("data/TSS_ENCFF940MBK.bw",
selection=mySelection,
as="RleList")
## RleList of length 54
## $chr1
## numeric-Rle of length 195471971 with 26 runs
## Lengths: 72824020 6 3 ... 4 2 122627046
## Values : 0.00000 16.18099 18.41527 ... 8.40735 7.09102 0.00000
##
## $chr10
## numeric-Rle of length 130694993 with 1 run
## Lengths: 130694993
## Values : 0
##
## $chr11
## numeric-Rle of length 122082543 with 1 run
## Lengths: 122082543
## Values : 0
##
## $chr12
## numeric-Rle of length 120129022 with 1 run
## Lengths: 120129022
## Values : 0
##
## $chr13
## numeric-Rle of length 120421639 with 1 run
## Lengths: 120421639
## Values : 0
##
## ...
## <49 more elements>