class: center, middle, inverse, title-slide # ChIPseq In Bioconductor (part4)
### Rockefeller University, Bioinformatics Resource Center ###
http://rockefelleruniversity.github.io/RU_ChIPseq/
--- ## Data In todays session we will continue to review the Myc ChIPseq we were working on in our last sessions. This include Myc ChIPseq for MEL and Ch12 celllines. Information and files for the [Myc ChIPseq in MEL cell line can be found here](https://www.encodeproject.org/experiments/ENCSR000EUA/) Information and files for the [Myc ChIPseq in Ch12 cell line can be found here](https://www.encodeproject.org/experiments/ENCSR000ERN/) <!-- --- --> <!-- # Data --> <!-- We will be working with peak calls today, so we can download the MACS2 peak calls from the Encode website. --> <!-- [Myc Mel Rep1](https://www.encodeproject.org/files/ENCFF363WUG/@@download/ENCFF363WUG.bed.gz) --> <!-- [Myc Mel Rep2](https://www.encodeproject.org/files/ENCFF139JHS/@@download/ENCFF139JHS.bed.gz) --> <!-- [Myc Ch12 Rep1](https://www.encodeproject.org/files/ENCFF160KXR/@@download/ENCFF160KXR.bed.gz) --> <!-- [Myc Ch12 Rep2](https://www.encodeproject.org/files/ENCFF962BGJ/@@download/ENCFF962BGJ.bed.gz) --> --- ## Data I have provided peak calls from MACS2 following the processing steps outlined in our last session. Peak calls for Myc in MEL and Ch12 cellines can be found in **data/peaks/** * **data/peaks/Mel_1_peaks.xls** * **data/peaks/Mel_2_peaks.xls** * **data/peaks/Ch12_1_peaks.xls** * **data/peaks/Ch12_1_peaks.xls** --- ## TF binding and epigenetic states A common goal in ChIPseq is characterise genome wide transcription factor binding sites or epigenetic states. The presence of transcription factor binding sites and epigenetics events is often further analyzed in the context of their putative targets genes to characterize the transcription factor's and epigenetic event's function and/or biological role. <div align="center"> <img src="imgs/singleMap.png" alt="offset" height="300" width="600"> </div> --- ## TF binding and epigenetic states With the release of ENCODE's wide scale mapping of transcription factor binding sites or epigenetic states and the advent of multiplexing technologies for high throughput sequencing, in has become common practice to have replicated ChIPseq experiments so as to have higher confidence in identified epigenetic events. <div align="center"> <img src="imgs/MappingEvents.png" alt="offset" height="300" width="600"> </div> --- ## TF binding and epigenetic states In addition to the genome wide characterization of epigenetic events, ChIPseq had been increasingly used to identify changes in epigenetic events between conditions and/or cell lines. <div align="center"> <img src="imgs/hoxDiff2.png" alt="offset" height="300" width="600"> </div> --- ## Myc ChIPseq in Mel and Ch12 cell lines We have been working to process and a characterize a Myc ChIPseq replicate in the Mel cell line. In this session we will look at how we can define a high confidence/reproducible set of Myc peaks in the Mel cell line as well as identify Myc binding events unique or common between Mel and Ch12 cell lines. --- class: inverse, center, middle # Consensus Peaks <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Reading in a set of peaks First we need to read our peak calls from MACS2 into R. The Myc peak calls we will review are within the peaks directory, so here we list all files matching our expected file pattern using the **dir() function.** ```r peakFiles <- dir("data/peaks/", pattern = "*.peaks", full.names = TRUE) peakFiles ``` ``` ## [1] "data/peaks//ch12_1_peaks.xls" "data/peaks//ch12_2_peaks.xls" ## [3] "data/peaks//Mel_1_peaks.xls" "data/peaks//Mel_2_peaks.xls" ``` --- ## Reading in a set of peaks We can loop through our tab separated files (disguised as *.xls* functions) and import them into R as a list of data.frames using a loop. ```r macsPeaks_DF <- list() for (i in 1:length(peakFiles)) { macsPeaks_DF[[i]] <- read.delim(peakFiles[i], comment.char = "#") } length(macsPeaks_DF) ``` ``` ## [1] 4 ``` --- ## Reading in a set of peaks Now with our list of data.frames of peak calls, we loop through the list and create a **GRanges** for each of our peak calls. Remember you can also do this with the [import function from rtracklayer](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor2.html#46). ```r library(GenomicRanges) macsPeaks_GR <- list() for (i in 1:length(macsPeaks_DF)) { peakDFtemp <- macsPeaks_DF[[i]] macsPeaks_GR[[i]] <- GRanges(seqnames = peakDFtemp[, "chr"], IRanges(peakDFtemp[, "start"], peakDFtemp[, "end"])) } macsPeaks_GR[[1]] ``` ``` ## GRanges object with 13910 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 4490620-4490773 * ## [2] chr1 4671581-4671836 * ## [3] chr1 4785359-4785661 * ## [4] chr1 5283235-5283439 * ## [5] chr1 6262317-6262572 * ## ... ... ... ... ## [13906] chrY 90742039-90742407 * ## [13907] chrY 90742425-90742646 * ## [13908] chrY 90742687-90742878 * ## [13909] chrY 90742905-90743091 * ## [13910] chrY 90825379-90825714 * ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## Reading in a set of peaks We will want to assign a sensible set of names to our peak calls. We can use the **gsub()** and **basename()** function with our file names to create some samplenames. The **basename()** function accepts a file path (such as the path to our bam files) and returns just the file name (removing directory paths). The **gsub()** function accepts the text to replace, the replacement text and a character vector to replace within. ```r fileNames <- basename(peakFiles) fileNames ``` ``` ## [1] "ch12_1_peaks.xls" "ch12_2_peaks.xls" "Mel_1_peaks.xls" "Mel_2_peaks.xls" ``` ```r sampleNames <- gsub("_peaks.xls", "", fileNames) sampleNames ``` ``` ## [1] "ch12_1" "ch12_2" "Mel_1" "Mel_2" ``` --- ## Reading in a set of peaks Now we have a named list of our peak calls as **GRanges** objects. We can convert our list of **GRanges** objects to a **GRangesList** using the **GRangesList()** function. ```r macsPeaks_GRL <- GRangesList(macsPeaks_GR) names(macsPeaks_GRL) <- sampleNames class(macsPeaks_GRL) ``` ``` ## [1] "CompressedGRangesList" ## attr(,"package") ## [1] "GenomicRanges" ``` ```r names(macsPeaks_GRL) ``` ``` ## [1] "ch12_1" "ch12_2" "Mel_1" "Mel_2" ``` --- ## GRangesList objects The GRangesList object can behave just as our standard lists. Here we use the **lengths()** function to a get the number of peaks in each replicate. ```r lengths(macsPeaks_GRL) ``` ``` ## ch12_1 ch12_2 Mel_1 Mel_2 ## 13910 28420 13777 13512 ``` --- ## GRangesList objects A major advantage of **GRangesList** objects is that we can apply many of the **GRanges** accessor and operator functions directly to our **GRangesList**. This means there is no need to lapply and convert back to **GRangesList** if we wish to alter our **GRanges** by a common method. ```r library(rtracklayer) macsPeaks_GRLCentred <- resize(macsPeaks_GRL, 10, fix = "center") width(macsPeaks_GRLCentred) ``` ``` ## IntegerList of length 4 ## [["ch12_1"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10 ## [["ch12_2"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10 ## [["Mel_1"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10 ## [["Mel_2"]] 10 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10 ``` --- ## GRangesList objects Now we have our GRangesList we can extract the peak calls for the Mel replicates. ```r Mel_1_Peaks <- macsPeaks_GRL$Mel_1 Mel_2_Peaks <- macsPeaks_GRL$Mel_2 length(Mel_1_Peaks) ``` ``` ## [1] 13777 ``` ```r length(Mel_2_Peaks) ``` ``` ## [1] 13512 ``` --- ## Finding unique peaks We can extract peak calls unique to replicate 1 or 2 using the **%over%** operator. ```r Mel_1_Unique <- Mel_1_Peaks[!Mel_1_Peaks %over% Mel_2_Peaks] Mel_2_Unique <- Mel_2_Peaks[!Mel_2_Peaks %over% Mel_1_Peaks] length(Mel_1_Unique) ``` ``` ## [1] 4668 ``` ```r length(Mel_2_Unique) ``` ``` ## [1] 4263 ``` ```r export.bed(Mel_1_Unique, "Mel_1_Unique.bed") export.bed(Mel_2_Unique, "Mel_2_Unique.bed") ``` --- ## Finding unique peaks ![](imgs/uniquePeaks.png) --- ## Finding common peaks Similarly we can extract peak calls common to replicate 1 or 2. The numbers in common however differ. This is because 2 peak calls in one sample can overlap 1 peak call in the other replicate. ```r Mel_1_Common <- Mel_1_Peaks[Mel_1_Peaks %over% Mel_2_Peaks] Mel_2_Common <- Mel_2_Peaks[Mel_2_Peaks %over% Mel_1_Peaks] length(Mel_1_Common) ``` ``` ## [1] 9109 ``` ```r length(Mel_2_Common) ``` ``` ## [1] 9249 ``` ```r export.bed(Mel_1_Common, "Mel_1_Common.bed") export.bed(Mel_2_Common, "Mel_2_Common.bed") ``` --- ## Finding common peaks ![](imgs/commonP.png) --- ## Finding common peaks Despite overlapping these peaks are not identical. So how do we determine a common consensus peak several samples. ![](imgs/2for1.png) --- ## Define a consensus, redundant set To address this problem, a common operation in ChIPseq is to define a nonredundant set of peaks across all samples. To do this we first pool all our peaks across all replicates, here Mel and Ch12, into one set of redundant, overlapping peaks. ```r allPeaksSet_Overlapping <- unlist(macsPeaks_GRL) allPeaksSet_Overlapping ``` ``` ## GRanges object with 69619 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## ch12_1 chr1 4490620-4490773 * ## ch12_1 chr1 4671581-4671836 * ## ch12_1 chr1 4785359-4785661 * ## ch12_1 chr1 5283235-5283439 * ## ch12_1 chr1 6262317-6262572 * ## ... ... ... ... ## Mel_2 chrX 169107487-169107684 * ## Mel_2 chrX 169299643-169299795 * ## Mel_2 chrY 897622-897763 * ## Mel_2 chrY 1245573-1245845 * ## Mel_2 chrY 1286522-1286823 * ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## Define a consensus, nonredundant set We can then use the **reduce()** function to collapse our peaks into nonredundant, distinct peaks representing peaks present in any sample. ```r allPeaksSet_nR <- reduce(allPeaksSet_Overlapping) allPeaksSet_nR ``` ``` ## GRanges object with 38937 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 4416806-4417001 * ## [2] chr1 4490620-4490773 * ## [3] chr1 4671488-4671836 * ## [4] chr1 4785341-4785781 * ## [5] chr1 4857628-4857783 * ## ... ... ... ... ## [38933] chrY 90741011-90741723 * ## [38934] chrY 90741934-90743197 * ## [38935] chrY 90744329-90744494 * ## [38936] chrY 90825195-90825796 * ## [38937] chrY 90828711-90828830 * ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` ```r export.bed(allPeaksSet_nR, "allPeaksSet_nR.bed") ``` --- ## Define a consensus, nonredundant set ![](imgs/nrp.png) --- ## Defining a common set of peaks With our newly defined nonredundant peak set we can now identify from this set which peaks were present in both our replicates using the **%over%** operator and a logical expression. ```r commonPeaks <- allPeaksSet_nR[allPeaksSet_nR %over% Mel_1_Peaks & allPeaksSet_nR %over% Mel_2_Peaks] commonPeaks ``` ``` ## GRanges object with 8888 ranges and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr1 4785341-4785781 * ## [2] chr1 7397557-7398093 * ## [3] chr1 9545120-9545648 * ## [4] chr1 9943560-9943856 * ## [5] chr1 9943893-9944649 * ## ... ... ... ... ## [8884] chrX 166535216-166535529 * ## [8885] chrX 168673273-168673855 * ## [8886] chrX 169047634-169047962 * ## [8887] chrY 1245573-1245871 * ## [8888] chrY 1286522-1286823 * ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` ```r export.bed(commonPeaks, "commonPeaks.bed") ``` --- ## Defining a common set of peaks ![](imgs/commonPR.png) --- ## Defining a unique set of peaks Similarly we can identify which peaks are present only in one replicate. ```r mel1_Only <- allPeaksSet_nR[allPeaksSet_nR %over% Mel_1_Peaks & !allPeaksSet_nR %over% Mel_2_Peaks] mel2_Only <- allPeaksSet_nR[!allPeaksSet_nR %over% Mel_1_Peaks & allPeaksSet_nR %over% Mel_2_Peaks] length(mel1_Only) ``` ``` ## [1] 4445 ``` ```r length(mel2_Only) ``` ``` ## [1] 4185 ``` ```r export.bed(mel1_Only, "mel1_Only.bed") export.bed(mel2_Only, "mel2_Only.bed") ``` --- ## Defining a unique set of peaks ![](imgs/newMelOnly.png) --- ## Complex overlaps When working with larger numbers of peaks we will often define a logical matrix describing in which samples our nonredundant peaks were present. First then we use a loop to generate a logical vector for the occurence of nonredundant peaks in each sample. ```r overlap <- list() for (i in 1:length(macsPeaks_GRL)) { overlap[[i]] <- allPeaksSet_nR %over% macsPeaks_GRL[[i]] } overlap[[1]][1:2] ``` ``` ## [1] FALSE TRUE ``` --- ## Complex overlaps We can now use to **do.call** and **cbind** function to column bind our list of overlaps into our matrix of peak occurrence. ```r overlapMatrix <- do.call(cbind, overlap) colnames(overlapMatrix) <- names(macsPeaks_GRL) overlapMatrix[1:2, ] ``` ``` ## ch12_1 ch12_2 Mel_1 Mel_2 ## [1,] FALSE TRUE FALSE FALSE ## [2,] TRUE FALSE FALSE FALSE ``` --- ## Complex overlaps We can add the matrix back into the metadata columns of our **GRanges()** of nonredundant peaks using the **mcols()** accessor. Now we have our nonredundant peaks and the occurence of these peaks in every sample we can easily identify peaks unique or common to replicates and conditions/cell lines. ```r mcols(allPeaksSet_nR) <- overlapMatrix allPeaksSet_nR[1:2, ] ``` ``` ## GRanges object with 2 ranges and 4 metadata columns: ## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2 ## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical> ## [1] chr1 4416806-4417001 * | FALSE TRUE FALSE FALSE ## [2] chr1 4490620-4490773 * | TRUE FALSE FALSE FALSE ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## Complex overlaps The **limma** package is commonly used in the analysis of RNAseq and microarray data and contains many useful helpful functions. One very useful function is the **vennDiagram** function which allows us to plot overlaps from a logical matrix, just like the one we created. ```r library(limma) vennDiagram(mcols(allPeaksSet_nR)) ``` ![](ChIPseq_In_Bioconductor4_files/figure-html/unnamed-chunk-22-1.png)<!-- --> --- ## Complex overlaps The **limma** package's **vennCounts** function allows us to retrieve the counts displayed in the Venn diagram as a data.frame. ```r vennCounts(mcols(allPeaksSet_nR)) ``` ``` ## ch12_1 ch12_2 Mel_1 Mel_2 Counts ## 1 0 0 0 0 0 ## 2 0 0 0 1 3361 ## 3 0 0 1 0 2435 ## 4 0 0 1 1 3876 ## 5 0 1 0 0 13042 ## 6 0 1 0 1 378 ## 7 0 1 1 0 1275 ## 8 0 1 1 1 1195 ## 9 1 0 0 0 1446 ## 10 1 0 0 1 46 ## 11 1 0 1 0 24 ## 12 1 0 1 1 91 ## 13 1 1 0 0 6931 ## 14 1 1 0 1 400 ## 15 1 1 1 0 711 ## 16 1 1 1 1 3726 ## attr(,"class") ## [1] "VennCounts" ``` --- ## High confidence peaks With our nonredundant set of peaks and our matrix of peak occurrence, we can define replicated peaks within conditions. Here we define the peaks which occur in both the Ch12 replicates. Since logical matrix is equivalent to a 1 or 0 matrix (1 = TRUE and 0 = FALSE), we can use the rowSums function to extract peaks in at least 2 of the Ch12 replicates. ```r ch12_HC_Peaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("ch12_1", "ch12_2")])) >= 2] export.bed(ch12_HC_Peaks, "ch12_HC_Peaks.bed") ch12_HC_Peaks[1:2, ] ``` ``` ## GRanges object with 2 ranges and 4 metadata columns: ## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2 ## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical> ## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE FALSE ## [2] chr1 4785341-4785781 * | TRUE TRUE TRUE TRUE ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## High confidence peaks ![](imgs/ch12HC.png) --- ## High confidence unique peaks Similarly we can define peaks which are replicated in Ch12 but absent in Mel samples. ```r ch12_HC_UniquePeaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("ch12_1", "ch12_2")])) >= 2 & rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("Mel_1", "Mel_2")])) == 0] export.bed(ch12_HC_UniquePeaks, "ch12_HC_UniquePeaks.bed") ch12_HC_UniquePeaks[1, ] ``` ``` ## GRanges object with 1 range and 4 metadata columns: ## seqnames ranges strand | ch12_1 ch12_2 Mel_1 Mel_2 ## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical> ## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE FALSE ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## High confidence unique peaks ![](imgs/ch12U.png) --- class: inverse, center, middle # Differential Peaks <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Finding differential regions Identifying peaks specific to cell lines or conditions however does not capture the full range of changes in epigenetic events. To identify differences in epigenetic events we can attempt to quantify the changes in fragment abundance from IP samples across our nonredundant set of peaks. ![](imgs/igsss_1.png) --- ## Finding differential regions We first must establish a set of regions within which to quantify IP-ed fragments. An established technique is to produce a set of nonredundant peaks which occur in the majority of at least one experimental condition under evaluation. Here we identify peaks which occurred in both replicates in either Mel or Ch12 cell lines. ```r HC_Peaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("ch12_1", "ch12_2")])) >= 2 | rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("Mel_1", "Mel_2")])) >= 2] HC_Peaks ``` ``` ## GRanges object with 16930 ranges and 4 metadata columns: ## seqnames ranges strand | ch12_1 ch12_2 Mel_1 ## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> ## [1] chr1 4671488-4671836 * | TRUE TRUE FALSE ## [2] chr1 4785341-4785781 * | TRUE TRUE TRUE ## [3] chr1 5283217-5283439 * | TRUE TRUE FALSE ## [4] chr1 6262317-6262572 * | TRUE TRUE FALSE ## [5] chr1 6406474-6406849 * | TRUE TRUE FALSE ## ... ... ... ... . ... ... ... ## [16926] chrY 90739739-90740228 * | TRUE TRUE FALSE ## [16927] chrY 90740354-90740695 * | TRUE TRUE FALSE ## [16928] chrY 90741011-90741723 * | TRUE TRUE FALSE ## [16929] chrY 90741934-90743197 * | TRUE TRUE FALSE ## [16930] chrY 90825195-90825796 * | TRUE TRUE FALSE ## Mel_2 ## <logical> ## [1] FALSE ## [2] TRUE ## [3] FALSE ## [4] FALSE ## [5] FALSE ## ... ... ## [16926] FALSE ## [16927] FALSE ## [16928] FALSE ## [16929] FALSE ## [16930] FALSE ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` ```r export.bed(HC_Peaks, "HC_Peaks.bed") ``` --- ## Finding differential regions ![](imgs/hcPeaks2.png) --- ## Counting regions We will count from our aligned BAM files to quantify IP fragments. As we have seen previously we can use the **BamFileList()** function to specify which BAMs to count and importantly, to control memory we specify the number of reads to be held in memory at one time using the **yield()** parameter. ```r library(Rsamtools) bams <- c("~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Ch12_1.bam", "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Ch12_2.bam", "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Mel_1.bam", "~/Projects/Results/chipseq/testRun/BAMs/Sorted_Myc_Mel_2.bam") bamFL <- BamFileList(bams, yieldSize = 5e+06) bamFL ``` --- ## Counting regions We can count the number of fragments overlapping our peaks using the **summarizeOverlaps** function. Since ChIPseq is strandless, we set the **ignore.strand** parameter to **TRUE**. The returned object is a familiar **RangedSummarizedExperiment** containing our GRanges of nonredundant peaks and the counts in these regions for our BAM files. ```r library(GenomicAlignments) myMycCounts <- summarizeOverlaps(HC_Peaks, reads = bamFL, ignore.strand = TRUE) class(myMycCounts) save(myMycCounts, file = "data/MycCounts.RData") ``` ``` ## [1] "RangedSummarizedExperiment" ## attr(,"package") ## [1] "SummarizedExperiment" ``` --- ## Differential regions using DESeq2 To assess changes in ChIPseq signal across cell lines we will use the **DESeq2** package. The DESeq2 package contains a workflow for assessing local changes in fragment/read abundance between replicated conditions. This workflow includes normalization, variance estimation, outlier removal/replacement as well as significance testing suited to high throughput sequencing data (i.e. integer counts). To make use of DESeq2 workflow we must first create a data.frame of conditions of interest with rownames set as our BAM file names. ```r metaDataFrame <- data.frame(CellLine = c("Ch12", "Ch12", "Mel", "Mel")) rownames(metaDataFrame) <- colnames(myMycCounts) metaDataFrame ``` ``` ## CellLine ## Sorted_Myc_Ch12_1.bam Ch12 ## Sorted_Myc_Ch12_2.bam Ch12 ## Sorted_Myc_Mel_1.bam Mel ## Sorted_Myc_Mel_2.bam Mel ``` --- ## Differential regions using DESeq2 We can use the **DESeqDataSetFromMatrix()** function to create a **DESeq2** object. We must provide our matrix of counts to **countData** parameter, our metadata data.frame to **colData** parameter and we include to an optional parameter of **rowRanges** the nonredundant peak set we can counted on. Finally we provide the name of the column in our metadata data.frame within which we wish to test to the **design** parameter. ```r library(DESeq2) deseqMyc <- DESeqDataSetFromMatrix(countData = assay(myMycCounts), colData = metaDataFrame, design = ~CellLine, rowRanges = HC_Peaks) ``` --- ## Differential regions using DESeq2 We can now run the DESeq2 workflow on our **DESeq2** object using the **DESeq()** function. ```r deseqMyc <- DESeq(deseqMyc) ``` ``` ## estimating size factors ``` ``` ## estimating dispersions ``` ``` ## gene-wise dispersion estimates ``` ``` ## mean-dispersion relationship ``` ``` ## final dispersion estimates ``` ``` ## fitting model and testing ``` --- ## Differential regions using DESeq2 Our **DESeq2** object is updated to include useful statistics such our normalised values and variance of signal within each nonredundant peak call. ```r deseqMyc ``` ``` ## class: DESeqDataSet ## dim: 16930 4 ## metadata(1): version ## assays(4): counts mu H cooks ## rownames: NULL ## rowData names(26): ch12_1 ch12_2 ... deviance maxCooks ## colnames(4): Sorted_Myc_Ch12_1.bam Sorted_Myc_Ch12_2.bam ## Sorted_Myc_Mel_1.bam Sorted_Myc_Mel_2.bam ## colData names(2): CellLine sizeFactor ``` --- ## Differential regions using DESeq2 We can extract our information of differential regions using the **results()** function. We provide to the **results()** function the **DESeq2** object, the comparison of interest to the **contrast** parameter and the type of output to return to the **format** parameter. The comparison to **contrast** parameter is provided as a vector of length 3 including the metadata column of interest and groups to test. We can sort the results by pvalue using the **order()** function to rank by the most significant changes. ```r MelMinusCh12 <- results(deseqMyc, contrast = c("CellLine", "Mel", "Ch12"), format = "GRanges") MelMinusCh12 <- MelMinusCh12[order(MelMinusCh12$pvalue), ] class(MelMinusCh12) ``` ``` ## [1] "GRanges" ## attr(,"package") ## [1] "GenomicRanges" ``` --- ## Differential regions using DESeq2 The GRanges object contains information on the comparison made in DESeq2. Most useful it contains the the difference in IP signal as log2 fold change in **log2FoldChange**, the significance of the change in the **pvalue** column and an adjusted p-value to address multiple correction in **padj** column. ```r MelMinusCh12[1, ] ``` ``` ## GRanges object with 1 range and 6 metadata columns: ## seqnames ranges strand | baseMean log2FoldChange lfcSE ## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric> ## [1] chr4 45953273-45954170 * | 613.942 6.09357 0.297353 ## stat pvalue padj ## <numeric> <numeric> <numeric> ## [1] 20.4927 2.50058e-93 4.23348e-89 ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## Differential regions using DESeq2 We can now filter our nonredundant peaks to those with significantly more signal in Mel or Ch12 cell lines by filtering by log2FoldChange and padj (p-value adjusted for multiple correction) less than 0.05. ```r MelMinusCh12Filt <- MelMinusCh12[!is.na(MelMinusCh12$pvalue) | !is.na(MelMinusCh12$padj)] UpinMel <- MelMinusCh12[MelMinusCh12$padj < 0.05 & MelMinusCh12$log2FoldChange > 0] DowninMel <- MelMinusCh12[MelMinusCh12$padj < 0.05 & MelMinusCh12$log2FoldChange < 0] export.bed(UpinMel, "UpinMel.bed") export.bed(DowninMel, "DowninMel.bed") ``` --- ## Differential regions using DESeq2 ![](imgs/DB.png) --- ## Differential regions using DESeq2 Finally we can make our reviewing of sites in IGV a little easier using the **tracktables** package. The **tracktables** package's **makebedtable()** function accepts a **GRanges** object and writes an HTML report contains links to IGV. An example can be found [here](../../data/MelMinusCh12.html) ```r library(tracktables) myReport <- makebedtable(MelMinusCh12Filt, "MelMinusCh12.html", basedirectory = getwd()) browseURL(myReport) ``` --- ## Time for an exercise! Exercise on ChIPseq data can be found [here](../../exercises/exercises/chipseq_part4_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/chipseq_part4_answers.html) <!-- --- --> <!-- # Defining a consensus, nonredundant set. --> <!-- ```{r,eval=T,echo=T, eval=T, echo=T, warning=FALSE} --> <!-- allPeaksSet_nR <- reduce(allPeaksSet_Overlapping,with.revmap=TRUE) --> <!-- allPeaksSet_nR --> <!-- ``` -->