In todays session we will continue to review the Myc ChIPseq we were working on in our last sessions.
This include Myc ChIPseq for MEL and Ch12 celllines.
Information and files for the Myc ChIPseq in MEL cell line can be found here
Information and files for the Myc ChIPseq in Ch12 cell line can be found here
I have provided peak calls from MACS2 following the processing steps outlined in our last session.
Peak calls for Myc in MEL and Ch12 cellines can be found in
data/peaks/
A common goal in ChIPseq is characterise genome wide transcription factor binding sites or epigenetic states.
The presence of transcription factor binding sites and epigenetics events is often further analyzed in the context of their putative targets genes to characterize the transcription factor’s and epigenetic event’s function and/or biological role.
With the release of ENCODE’s wide scale mapping of transcription factor binding sites or epigenetic states and the advent of multiplexing technologies for high throughput sequencing, in has become common practice to have replicated ChIPseq experiments so as to have higher confidence in identified epigenetic events.
In addition to the genome wide characterization of epigenetic events, ChIPseq had been increasingly used to identify changes in epigenetic events between conditions and/or cell lines.
We have been working to process and a characterize a Myc ChIPseq replicate in the Mel cell line.
In this session we will look at how we can define a high confidence/reproducible set of Myc peaks in the Mel cell line as well as identify Myc binding events unique or common between Mel and Ch12 cell lines.
First we need to read our peak calls from MACS2 into R.
The Myc peak calls we will review are within the peaks directory, so here we list all files matching our expected file pattern using the dir() function.
<- dir("data/peaks/", pattern = "*.peaks", full.names = TRUE)
peakFiles peakFiles
## [1] "data/peaks//ch12_1_peaks.xls" "data/peaks//ch12_2_peaks.xls"
## [3] "data/peaks//Mel_1_peaks.xls" "data/peaks//Mel_2_peaks.xls"
We can loop through our tab separated files (disguised as .xls functions) and import them into R as a list of data.frames using a loop.
<- list()
macsPeaks_DF for (i in 1:length(peakFiles)) {
<- read.delim(peakFiles[i], comment.char = "#")
macsPeaks_DF[[i]]
}length(macsPeaks_DF)
## [1] 4
Now with our list of data.frames of peak calls, we loop through the list and create a GRanges for each of our peak calls.
Remember you can also do this with the import function from rtracklayer.
library(GenomicRanges)
<- list()
macsPeaks_GR for (i in 1:length(macsPeaks_DF)) {
<- macsPeaks_DF[[i]]
peakDFtemp <- GRanges(seqnames = peakDFtemp[, "chr"], IRanges(peakDFtemp[,
macsPeaks_GR[[i]] "start"], peakDFtemp[, "end"]))
}1]] macsPeaks_GR[[
## GRanges object with 13910 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 4490620-4490773 *
## [2] chr1 4671581-4671836 *
## [3] chr1 4785359-4785661 *
## [4] chr1 5283235-5283439 *
## [5] chr1 6262317-6262572 *
## ... ... ... ...
## [13906] chrY 90742039-90742407 *
## [13907] chrY 90742425-90742646 *
## [13908] chrY 90742687-90742878 *
## [13909] chrY 90742905-90743091 *
## [13910] chrY 90825379-90825714 *
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We will want to assign a sensible set of names to our peak calls.
We can use the gsub() and basename() function with our file names to create some samplenames.
The basename() function accepts a file path (such as the path to our bam files) and returns just the file name (removing directory paths).
The gsub() function accepts the text to replace, the replacement text and a character vector to replace within.
<- basename(peakFiles)
fileNames fileNames
## [1] "ch12_1_peaks.xls" "ch12_2_peaks.xls" "Mel_1_peaks.xls" "Mel_2_peaks.xls"
<- gsub("_peaks.xls", "", fileNames)
sampleNames sampleNames
## [1] "ch12_1" "ch12_2" "Mel_1" "Mel_2"
Now we have a named list of our peak calls as GRanges objects.
We can convert our list of GRanges objects to a GRangesList using the GRangesList() function.
<- GRangesList(macsPeaks_GR)
macsPeaks_GRL names(macsPeaks_GRL) <- sampleNames
class(macsPeaks_GRL)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
names(macsPeaks_GRL)
## [1] "ch12_1" "ch12_2" "Mel_1" "Mel_2"
The GRangesList object can behave just as our standard lists. Here we use the lengths() function to a get the number of peaks in each replicate.
lengths(macsPeaks_GRL)
## ch12_1 ch12_2 Mel_1 Mel_2
## 13910 28420 13777 13512
A major advantage of GRangesList objects is that we can apply many of the GRanges accessor and operator functions directly to our GRangesList.
This means there is no need to lapply and convert back to GRangesList if we wish to alter our GRanges by a common method.
library(rtracklayer)
<- resize(macsPeaks_GRL, 10, fix = "center")
macsPeaks_GRLCentred width(macsPeaks_GRLCentred)
## IntegerList of length 4
## [["ch12_1"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["ch12_2"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["Mel_1"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["Mel_2"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
Now we have our GRangesList we can extract the peak calls for the Mel replicates.
<- macsPeaks_GRL$Mel_1
Mel_1_Peaks <- macsPeaks_GRL$Mel_2
Mel_2_Peaks length(Mel_1_Peaks)
## [1] 13777
length(Mel_2_Peaks)
## [1] 13512
We can extract peak calls unique to replicate 1 or 2 using the %over% operator.
<- Mel_1_Peaks[!Mel_1_Peaks %over% Mel_2_Peaks]
Mel_1_Unique <- Mel_2_Peaks[!Mel_2_Peaks %over% Mel_1_Peaks]
Mel_2_Unique length(Mel_1_Unique)
## [1] 4668
length(Mel_2_Unique)
## [1] 4263
export.bed(Mel_1_Unique, "Mel_1_Unique.bed")
export.bed(Mel_2_Unique, "Mel_2_Unique.bed")
Similarly we can extract peak calls common to replicate 1 or 2.
The numbers in common however differ. This is because 2 peak calls in one sample can overlap 1 peak call in the other replicate.
<- Mel_1_Peaks[Mel_1_Peaks %over% Mel_2_Peaks]
Mel_1_Common <- Mel_2_Peaks[Mel_2_Peaks %over% Mel_1_Peaks]
Mel_2_Common length(Mel_1_Common)
## [1] 9109
length(Mel_2_Common)
## [1] 9249
export.bed(Mel_1_Common, "Mel_1_Common.bed")
export.bed(Mel_2_Common, "Mel_2_Common.bed")
Despite overlapping these peaks are not identical. So how do we determine a common consensus peak several samples.
To address this problem, a common operation in ChIPseq is to define a nonredundant set of peaks across all samples.
To do this we first pool all our peaks across all replicates, here Mel and Ch12, into one set of redundant, overlapping peaks.
<- unlist(macsPeaks_GRL)
allPeaksSet_Overlapping allPeaksSet_Overlapping
## GRanges object with 69619 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## ch12_1 chr1 4490620-4490773 *
## ch12_1 chr1 4671581-4671836 *
## ch12_1 chr1 4785359-4785661 *
## ch12_1 chr1 5283235-5283439 *
## ch12_1 chr1 6262317-6262572 *
## ... ... ... ...
## Mel_2 chrX 169107487-169107684 *
## Mel_2 chrX 169299643-169299795 *
## Mel_2 chrY 897622-897763 *
## Mel_2 chrY 1245573-1245845 *
## Mel_2 chrY 1286522-1286823 *
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We can then use the reduce() function to collapse our peaks into nonredundant, distinct peaks representing peaks present in any sample.
<- reduce(allPeaksSet_Overlapping)
allPeaksSet_nR allPeaksSet_nR
## GRanges object with 38937 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 4416806-4417001 *
## [2] chr1 4490620-4490773 *
## [3] chr1 4671488-4671836 *
## [4] chr1 4785341-4785781 *
## [5] chr1 4857628-4857783 *
## ... ... ... ...
## [38933] chrY 90741011-90741723 *
## [38934] chrY 90741934-90743197 *
## [38935] chrY 90744329-90744494 *
## [38936] chrY 90825195-90825796 *
## [38937] chrY 90828711-90828830 *
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
export.bed(allPeaksSet_nR, "allPeaksSet_nR.bed")
With our newly defined nonredundant peak set we can now identify from this set which peaks were present in both our replicates using the %over% operator and a logical expression.
<- allPeaksSet_nR[allPeaksSet_nR %over% Mel_1_Peaks & allPeaksSet_nR %over%
commonPeaks
Mel_2_Peaks] commonPeaks
## GRanges object with 8888 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 4785341-4785781 *
## [2] chr1 7397557-7398093 *
## [3] chr1 9545120-9545648 *
## [4] chr1 9943560-9943856 *
## [5] chr1 9943893-9944649 *
## ... ... ... ...
## [8884] chrX 166535216-166535529 *
## [8885] chrX 168673273-168673855 *
## [8886] chrX 169047634-169047962 *
## [8887] chrY 1245573-1245871 *
## [8888] chrY 1286522-1286823 *
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
export.bed(commonPeaks, "commonPeaks.bed")
Similarly we can identify which peaks are present only in one replicate.
<- allPeaksSet_nR[allPeaksSet_nR %over% Mel_1_Peaks & !allPeaksSet_nR %over%
mel1_Only
Mel_2_Peaks]<- allPeaksSet_nR[!allPeaksSet_nR %over% Mel_1_Peaks & allPeaksSet_nR %over%
mel2_Only
Mel_2_Peaks]length(mel1_Only)
## [1] 4445
length(mel2_Only)
## [1] 4185
export.bed(mel1_Only, "mel1_Only.bed")
export.bed(mel2_Only, "mel2_Only.bed")
When working with larger numbers of peaks we will often define a logical matrix describing in which samples our nonredundant peaks were present.
First then we use a loop to generate a logical vector for the occurence of nonredundant peaks in each sample.
<- list()
overlap for (i in 1:length(macsPeaks_GRL)) {
<- allPeaksSet_nR %over% macsPeaks_GRL[[i]]
overlap[[i]]
}1]][1:2] overlap[[
## [1] FALSE TRUE
We can now use to do.call and cbind function to column bind our list of overlaps into our matrix of peak occurrence.
<- do.call(cbind, overlap)
overlapMatrix colnames(overlapMatrix) <- names(macsPeaks_GRL)
1:2, ] overlapMatrix[
## ch12_1 ch12_2 Mel_1 Mel_2
## [1,] FALSE TRUE FALSE FALSE
## [2,] TRUE FALSE FALSE FALSE
We can add the matrix back into the metadata columns of our GRanges() of nonredundant peaks using the mcols() accessor.
Now we have our nonredundant peaks and the occurence of these peaks in every sample we can easily identify peaks unique or common to replicates and conditions/cell lines.
mcols(allPeaksSet_nR) <- overlapMatrix
1:2, ] allPeaksSet_nR[
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 4416806-4417001 * | FALSE TRUE FALSE FALSE
## [2] chr1 4490620-4490773 * | TRUE FALSE FALSE FALSE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
The limma package is commonly used in the analysis of RNAseq and microarray data and contains many useful helpful functions.
One very useful function is the vennDiagram function which allows us to plot overlaps from a logical matrix, just like the one we created.
library(limma)
vennDiagram(mcols(allPeaksSet_nR))
The limma package’s vennCounts function allows us to retrieve the counts displayed in the Venn diagram as a data.frame.
vennCounts(mcols(allPeaksSet_nR))
## ch12_1 ch12_2 Mel_1 Mel_2 Counts
## 1 0 0 0 0 0
## 2 0 0 0 1 3361
## 3 0 0 1 0 2435
## 4 0 0 1 1 3876
## 5 0 1 0 0 13042
## 6 0 1 0 1 378
## 7 0 1 1 0 1275
## 8 0 1 1 1 1195
## 9 1 0 0 0 1446
## 10 1 0 0 1 46
## 11 1 0 1 0 24
## 12 1 0 1 1 91
## 13 1 1 0 0 6931
## 14 1 1 0 1 400
## 15 1 1 1 0 711
## 16 1 1 1 1 3726
## attr(,"class")
## [1] "VennCounts"
With our nonredundant set of peaks and our matrix of peak occurrence, we can define replicated peaks within conditions.
Here we define the peaks which occur in both the Ch12 replicates.
Since logical matrix is equivalent to a 1 or 0 matrix (1 = TRUE and 0 = FALSE), we can use the rowSums function to extract peaks in at least 2 of the Ch12 replicates.
<- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("ch12_1",
ch12_HC_Peaks "ch12_2")])) >= 2]
export.bed(ch12_HC_Peaks, "ch12_HC_Peaks.bed")
1:2, ] ch12_HC_Peaks[
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE FALSE
## [2] chr1 4785341-4785781 * | TRUE TRUE TRUE TRUE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Similarly we can define peaks which are replicated in Ch12 but absent in Mel samples.
<- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[,
ch12_HC_UniquePeaks c("ch12_1", "ch12_2")])) >= 2 & rowSums(as.data.frame(mcols(allPeaksSet_nR)[,
c("Mel_1", "Mel_2")])) == 0]
export.bed(ch12_HC_UniquePeaks, "ch12_HC_UniquePeaks.bed")
1, ] ch12_HC_UniquePeaks[
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE FALSE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Identifying peaks specific to cell lines or conditions however does not capture the full range of changes in epigenetic events.
To identify differences in epigenetic events we can attempt to quantify the changes in fragment abundance from IP samples across our nonredundant set of peaks.
We first must establish a set of regions within which to quantify IP-ed fragments.
An established technique is to produce a set of nonredundant peaks which occur in the majority of at least one experimental condition under evaluation.
Here we identify peaks which occurred in both replicates in either Mel or Ch12 cell lines.
<- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("ch12_1",
HC_Peaks "ch12_2")])) >= 2 | rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("Mel_1",
"Mel_2")])) >= 2]
HC_Peaks
## GRanges object with 16930 ranges and 4 metadata columns:
## seqnames ranges strand | ch12_1 ch12_2 Mel_1
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical>
## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE
## [2] chr1 4785341-4785781 * | TRUE TRUE TRUE
## [3] chr1 5283217-5283439 * | TRUE TRUE FALSE
## [4] chr1 6262317-6262572 * | TRUE TRUE FALSE
## [5] chr1 6406474-6406849 * | TRUE TRUE FALSE
## ... ... ... ... . ... ... ...
## [16926] chrY 90739739-90740228 * | TRUE TRUE FALSE
## [16927] chrY 90740354-90740695 * | TRUE TRUE FALSE
## [16928] chrY 90741011-90741723 * | TRUE TRUE FALSE
## [16929] chrY 90741934-90743197 * | TRUE TRUE FALSE
## [16930] chrY 90825195-90825796 * | TRUE TRUE FALSE
## Mel_2
## <logical>
## [1] FALSE
## [2] TRUE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [16926] FALSE
## [16927] FALSE
## [16928] FALSE
## [16929] FALSE
## [16930] FALSE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
export.bed(HC_Peaks, "HC_Peaks.bed")
We will count from our aligned BAM files to quantify IP fragments.
As we have seen previously we can use the BamFileList() function to specify which BAMs to count and importantly, to control memory we specify the number of reads to be held in memory at one time using the yield() parameter.
library(Rsamtools)
<- c("~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Ch12_1.bam", "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Ch12_2.bam",
bams "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Mel_1.bam", "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Mel_2.bam")
<- BamFileList(bams, yieldSize = 5e+06)
bamFL bamFL
We can count the number of fragments overlapping our peaks using the summarizeOverlaps function. Since ChIPseq is strandless, we set the ignore.strand parameter to TRUE.
The returned object is a familiar RangedSummarizedExperiment containing our GRanges of nonredundant peaks and the counts in these regions for our BAM files.
library(GenomicAlignments)
<- summarizeOverlaps(HC_Peaks, reads = bamFL, ignore.strand = TRUE)
myMycCounts class(myMycCounts)
save(myMycCounts, file = "data/MycCounts.RData")
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
To assess changes in ChIPseq signal across cell lines we will use the DESeq2 package.
The DESeq2 package contains a workflow for assessing local changes in fragment/read abundance between replicated conditions. This workflow includes normalization, variance estimation, outlier removal/replacement as well as significance testing suited to high throughput sequencing data (i.e. integer counts).
To make use of DESeq2 workflow we must first create a data.frame of conditions of interest with rownames set as our BAM file names.
<- data.frame(CellLine = c("Ch12", "Ch12", "Mel", "Mel"))
metaDataFrame rownames(metaDataFrame) <- colnames(myMycCounts)
metaDataFrame
## CellLine
## Sorted_Myc_Ch12_1.bam Ch12
## Sorted_Myc_Ch12_2.bam Ch12
## Sorted_Myc_Mel_1.bam Mel
## Sorted_Myc_Mel_2.bam Mel
We can use the DESeqDataSetFromMatrix() function to create a DESeq2 object.
We must provide our matrix of counts to countData parameter, our metadata data.frame to colData parameter and we include to an optional parameter of rowRanges the nonredundant peak set we can counted on.
Finally we provide the name of the column in our metadata data.frame within which we wish to test to the design parameter.
library(DESeq2)
<- DESeqDataSetFromMatrix(countData = assay(myMycCounts), colData = metaDataFrame,
deseqMyc design = ~CellLine, rowRanges = HC_Peaks)
We can now run the DESeq2 workflow on our DESeq2 object using the DESeq() function.
<- DESeq(deseqMyc) deseqMyc
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Our DESeq2 object is updated to include useful statistics such our normalised values and variance of signal within each nonredundant peak call.
deseqMyc
## class: DESeqDataSet
## dim: 16930 4
## metadata(1): version
## assays(4): counts mu H cooks
## rownames: NULL
## rowData names(26): ch12_1 ch12_2 ... deviance maxCooks
## colnames(4): Sorted_Myc_Ch12_1.bam Sorted_Myc_Ch12_2.bam
## Sorted_Myc_Mel_1.bam Sorted_Myc_Mel_2.bam
## colData names(2): CellLine sizeFactor
We can extract our information of differential regions using the results() function.
We provide to the results() function the DESeq2 object, the comparison of interest to the contrast parameter and the type of output to return to the format parameter.
The comparison to contrast parameter is provided as a vector of length 3 including the metadata column of interest and groups to test.
We can sort the results by pvalue using the order() function to rank by the most significant changes.
<- results(deseqMyc, contrast = c("CellLine", "Mel", "Ch12"), format = "GRanges")
MelMinusCh12 <- MelMinusCh12[order(MelMinusCh12$pvalue), ]
MelMinusCh12 class(MelMinusCh12)
## [1] "GRanges"
## attr(,"package")
## [1] "GenomicRanges"
The GRanges object contains information on the comparison made in DESeq2.
Most useful it contains the the difference in IP signal as log2 fold change in log2FoldChange, the significance of the change in the pvalue column and an adjusted p-value to address multiple correction in padj column.
1, ] MelMinusCh12[
## GRanges object with 1 range and 6 metadata columns:
## seqnames ranges strand | baseMean log2FoldChange lfcSE
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## [1] chr4 45953273-45954170 * | 613.942 6.09357 0.297353
## stat pvalue padj
## <numeric> <numeric> <numeric>
## [1] 20.4927 2.50058e-93 4.23348e-89
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We can now filter our nonredundant peaks to those with significantly more signal in Mel or Ch12 cell lines by filtering by log2FoldChange and padj (p-value adjusted for multiple correction) less than 0.05.
<- MelMinusCh12[!is.na(MelMinusCh12$pvalue) | !is.na(MelMinusCh12$padj)]
MelMinusCh12Filt <- MelMinusCh12[MelMinusCh12$padj < 0.05 & MelMinusCh12$log2FoldChange >
UpinMel 0]
<- MelMinusCh12[MelMinusCh12$padj < 0.05 & MelMinusCh12$log2FoldChange <
DowninMel 0]
export.bed(UpinMel, "UpinMel.bed")
export.bed(DowninMel, "DowninMel.bed")
Finally we can make our reviewing of sites in IGV a little easier using the tracktables package.
The tracktables package’s makebedtable() function accepts a GRanges object and writes an HTML report contains links to IGV.
An example can be found here
library(tracktables)
<- makebedtable(MelMinusCh12Filt, "MelMinusCh12.html", basedirectory = getwd())
myReport
browseURL(myReport)
Exercise on ChIPseq data can be found here
Answers can be found here