A common goal in Cut&Run or ATACseq is to characterize genome wide transcription factor binding sites or epigenetic states.
The presence of transcription factor binding sites and epigenetics events is often further analyzed in the context of their putative targets genes to characterize the transcription factor’s and epigenetic event’s function and/or biological role.
Combined with replicates we can then statistically test for and identify changes in epigenetic events between conditions and/or cell lines.
We have been working to process and a characterize developmental changes in the context of the TF Sox9 using data from the Fuchs lab: The pioneer factor SOX9 competes for epigenetic factors to switch stem cell fates
In this session we will look at how we can define a high confidence/reproducible set of Sox9 peaks in the at Day 0 and Week 6 of development as well as identify unique or common peaks between the time points.
First we need to read our peak calls from MACS2 into R. This procedure would also work with other peak callers like SEACR and it is also the same process for ATAC.
The Sox9 peak calls we will review are within the data directory under peaks, so here we list all files matching our expected file pattern using the dir() function.
## [1] "data/peaks/SOX9CNR_D0_rep1_macs_peaks.narrowPeak"
## [2] "data/peaks/SOX9CNR_D0_rep2_macs_peaks.narrowPeak"
## [3] "data/peaks/SOX9CNR_W6_rep1_macs_peaks.narrowPeak"
## [4] "data/peaks/SOX9CNR_W6_rep2_macs_peaks.narrowPeak"
Our MACS peaks are narrowPeak. These are similar to a bed file as they contain information about chromosome and position of the peaks. There is also additional columns containing peak IDs, scores and statistical p/qvalues.
We can import these files (and other bioinformatic formats like bed files), using the rtracklayer package.
## [1] 4
rtracklayer imports peaks into a GRanges object. This is a specialist Bioconductor format designed to store any kind of positional data.
## GRanges object with 30971 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 4815778-4815937 * | SOX9CNR_D0_rep1_macs.. 18
## [2] chr1 4857438-4857727 * | SOX9CNR_D0_rep1_macs.. 77
## [3] chr1 4966150-4966252 * | SOX9CNR_D0_rep1_macs.. 17
## [4] chr1 5015413-5015683 * | SOX9CNR_D0_rep1_macs.. 480
## [5] chr1 5070386-5070533 * | SOX9CNR_D0_rep1_macs.. 120
## ... ... ... ... . ... ...
## [30967] chrY 10320514-10320669 * | SOX9CNR_D0_rep1_macs.. 31
## [30968] chrY 10329515-10329618 * | SOX9CNR_D0_rep1_macs.. 16
## [30969] chrY 10356019-10356123 * | SOX9CNR_D0_rep1_macs.. 36
## [30970] chrY 10366391-10366510 * | SOX9CNR_D0_rep1_macs.. 24
## [30971] chrY 41074909-41075022 * | SOX9CNR_D0_rep1_macs.. 41
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## [1] 3.68186 4.43871 1.80034 69
## [2] 7.07188 11.12770 7.71454 174
## [3] 3.65833 4.38951 1.76503 81
## [4] 18.85640 52.67520 48.01540 112
## [5] 9.57073 15.81240 12.09390 71
## ... ... ... ... ...
## [30967] 4.72891 6.11399 3.16008 118
## [30968] 3.56717 4.20449 1.63028 1
## [30969] 4.99404 6.62755 3.61563 69
## [30970] 4.25411 5.33028 2.49480 89
## [30971] 5.26280 7.19715 4.12086 65
## -------
## seqinfo: 37 sequences from an unspecified genome; no seqlengths
We can convert our list of GRanges objects to a GRangesList using the GRangesList() function.
macsPeaks_GRL <- GRangesList(macsPeaks_list)
names(macsPeaks_GRL) <- c("D0_rep1", "D0_rep2", "W6_rep1", "W6_rep2")
class(macsPeaks_GRL)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"
## [1] "D0_rep1" "D0_rep2" "W6_rep1" "W6_rep2"
The GRangesList object can behave just as our standard lists. Here we use the lengths() function to a get the number of peaks in each replicate.
## D0_rep1 D0_rep2 W6_rep1 W6_rep2
## 30971 7089 72174 46760
A major advantage of GRangesList objects is that we can apply many of the GRanges accessor and operator functions directly to our GRangesList.
This means there is no need to lapply and convert back to GRangesList if we wish to alter our GRanges by a common method.
## IntegerList of length 4
## [["D0_rep1"]] 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["D0_rep2"]] 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["W6_rep1"]] 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
## [["W6_rep2"]] 10 10 10 10 10 10 10 10 10 10 ... 10 10 10 10 10 10 10 10 10 10
We can also extract the peak calls for specific groups easily i.e. D0.
## [1] 30971
## [1] 7089
We can extract peak calls unique to replicate 1 or 2 using the %over% operator. Once isolated it is easy to export these peaks to a bed file which we can visualize in IGV.
D0_rep1_unique.Peaks <- D0_rep1_Peaks[!D0_rep1_Peaks %over% D0_rep2_Peaks]
D0_rep2_unique.Peaks <- D0_rep2_Peaks[!D0_rep2_Peaks %over% D0_rep1_Peaks]
length(D0_rep1_unique.Peaks)
## [1] 25097
## [1] 1210
Similarly we can extract peak calls common to replicate 1 or 2.
The numbers in common however differ. This is because 2 peak calls in one sample can overlap 1 peak call in the other replicate.
D0_rep1_common.Peaks <- D0_rep1_Peaks[D0_rep1_Peaks %over% D0_rep2_Peaks]
D0_rep2_common.Peaks <- D0_rep2_Peaks[D0_rep2_Peaks %over% D0_rep1_Peaks]
length(D0_rep1_common.Peaks)
## [1] 5874
## [1] 5879
Despite overlapping these peaks are not identical. To determine
quantitative differences we need to determine a common consensus peak
across several samples.
To address this problem, a common operation in peak set analysis is to define a non-redundant set of peaks across all samples.
To do this we first pool all our peaks across all replicates, here D0 and W6, into one set of redundant overlapping peaks.
## GRanges object with 156994 ranges and 6 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## D0_rep1 chr1 4815778-4815937 * | SOX9CNR_D0_rep1_macs.. 18
## D0_rep1 chr1 4857438-4857727 * | SOX9CNR_D0_rep1_macs.. 77
## D0_rep1 chr1 4966150-4966252 * | SOX9CNR_D0_rep1_macs.. 17
## D0_rep1 chr1 5015413-5015683 * | SOX9CNR_D0_rep1_macs.. 480
## D0_rep1 chr1 5070386-5070533 * | SOX9CNR_D0_rep1_macs.. 120
## ... ... ... ... . ... ...
## W6_rep2 chrY 6425000-6425144 * | SOX9CNR_W6_rep2_macs.. 40
## W6_rep2 chrY 6858756-6858925 * | SOX9CNR_W6_rep2_macs.. 24
## W6_rep2 chrY 10010276-10010370 * | SOX9CNR_W6_rep2_macs.. 24
## W6_rep2 chrY 10347416-10347605 * | SOX9CNR_W6_rep2_macs.. 24
## W6_rep2 chrY 10404405-10404545 * | SOX9CNR_W6_rep2_macs.. 24
## signalValue pValue qValue peak
## <numeric> <numeric> <numeric> <integer>
## D0_rep1 3.68186 4.43871 1.80034 69
## D0_rep1 7.07188 11.12770 7.71454 174
## D0_rep1 3.65833 4.38951 1.76503 81
## D0_rep1 18.85640 52.67520 48.01540 112
## D0_rep1 9.57073 15.81240 12.09390 71
## ... ... ... ... ...
## W6_rep2 3.88371 7.48547 4.09522 68
## W6_rep2 2.91278 5.35904 2.41824 85
## W6_rep2 2.91278 5.35904 2.41824 47
## W6_rep2 2.91278 5.35904 2.41824 95
## W6_rep2 2.91278 5.35904 2.41824 70
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
We can then use the reduce() function to collapse our peaks into nonredundant, distinct peaks representing peaks present in any sample.
## GRanges object with 105162 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 3037143-3037265 *
## [2] chr1 3063798-3063963 *
## [3] chr1 3271964-3272108 *
## [4] chr1 3313708-3313837 *
## [5] chr1 3362070-3362199 *
## ... ... ... ...
## [105158] chrY 10366391-10366510 *
## [105159] chrY 10404405-10404545 *
## [105160] chrY 10422709-10422823 *
## [105161] chrY 41074909-41075022 *
## [105162] chrUn_GL456360 18880-18987 *
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
With our newly defined nonredundant peak set we can now identify from this set which peaks were present in both our replicates using the %over% operator and a logical expression.
commonPeaks <- allPeaksSet_nR[allPeaksSet_nR %over% D0_rep1_Peaks & allPeaksSet_nR %over%
D0_rep2_Peaks]
commonPeaks
## GRanges object with 5858 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr1 5015332-5015707 *
## [2] chr1 6188409-6188778 *
## [3] chr1 7125893-7126460 *
## [4] chr1 7397680-7398254 *
## [5] chr1 13258061-13258276 *
## ... ... ... ...
## [5854] chrY 6528806-6528939 *
## [5855] chrY 6679044-6679269 *
## [5856] chrY 6759314-6759439 *
## [5857] chrY 9488315-9488474 *
## [5858] chrY 9506858-9507069 *
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
When working with larger numbers of peaks we will often define a logical matrix describing in which samples our nonredundant peaks were present.
First then we use a loop to generate a logical vector for the occurrence of nonredundant peaks in each sample.
overlap <- list()
for (i in 1:length(macsPeaks_GRL)) {
overlap[[i]] <- allPeaksSet_nR %over% macsPeaks_GRL[[i]]
}
overlap[[1]][1:2]
## [1] FALSE FALSE
We can now use to do.call and cbind function to column bind our list of overlaps into our matrix of peak occurrence.
overlapMatrix <- do.call(cbind, overlap)
colnames(overlapMatrix) <- names(macsPeaks_GRL)
overlapMatrix[1:3, ]
## D0_rep1 D0_rep2 W6_rep1 W6_rep2
## [1,] FALSE FALSE FALSE TRUE
## [2,] FALSE FALSE FALSE TRUE
## [3,] FALSE FALSE FALSE TRUE
We can add the matrix back into the metadata columns of our GRanges() of nonredundant peaks using the mcols() accessor.
Now we have our nonredundant peaks and the occurence of these peaks in every sample we can easily identify peaks unique or common to replicates and conditions/cell lines.
## GRanges object with 3 ranges and 4 metadata columns:
## seqnames ranges strand | D0_rep1 D0_rep2 W6_rep1 W6_rep2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 3037143-3037265 * | FALSE FALSE FALSE TRUE
## [2] chr1 3063798-3063963 * | FALSE FALSE FALSE TRUE
## [3] chr1 3271964-3272108 * | FALSE FALSE FALSE TRUE
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
The limma package is commonly used in the analysis of RNAseq and microarray data and contains many useful helpful functions.
One very useful function is the vennDiagram function which allows us to plot overlaps from a logical matrix, just like the one we created.
The limma package’s vennCounts function allows us to retrieve the counts displayed in the Venn diagram as a data.frame.
## D0_rep1 D0_rep2 W6_rep1 W6_rep2 Counts
## 1 0 0 0 0 0
## 2 0 0 0 1 22574
## 3 0 0 1 0 39110
## 4 0 0 1 1 11885
## 5 0 1 0 0 959
## 6 0 1 0 1 13
## 7 0 1 1 0 115
## 8 0 1 1 1 72
## 9 1 0 0 0 8943
## 10 1 0 0 1 304
## 11 1 0 1 0 8229
## 12 1 0 1 1 7100
## 13 1 1 0 0 370
## 14 1 1 0 1 32
## 15 1 1 1 0 1755
## 16 1 1 1 1 3701
## attr(,"class")
## [1] "VennCounts"
With our nonredundant set of peaks and our matrix of peak occurrence, we can define replicated peaks within conditions.
Here we define the peaks which occur in both the D0 replicates.
Since logical matrix is equivalent to a 1 or 0 matrix (1 = TRUE and 0 = FALSE), we can use the rowSums function to extract peaks in at least 2 of the D0 replicates.
D0_HC_Peaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("D0_rep1",
"D0_rep2")])) >= 2]
tail(D0_HC_Peaks)
## GRanges object with 6 ranges and 4 metadata columns:
## seqnames ranges strand | D0_rep1 D0_rep2 W6_rep1 W6_rep2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chrY 6231157-6231391 * | TRUE TRUE FALSE FALSE
## [2] chrY 6528806-6528939 * | TRUE TRUE FALSE FALSE
## [3] chrY 6679044-6679269 * | TRUE TRUE FALSE FALSE
## [4] chrY 6759314-6759439 * | TRUE TRUE FALSE FALSE
## [5] chrY 9488315-9488474 * | TRUE TRUE FALSE FALSE
## [6] chrY 9506858-9507069 * | TRUE TRUE TRUE FALSE
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
Similarly we can define peaks which are replicated in one group, but not the other i.e. D0 but absent in W6 samples.
D0_HC_UniquePeaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[,
c("D0_rep1", "D0_rep2")])) >= 2 & rowSums(as.data.frame(mcols(allPeaksSet_nR)[,
c("W6_rep1", "W6_rep2")])) == 0]
export.bed(D0_HC_UniquePeaks, "D0_HC_UniquePeaks.bed")
D0_HC_UniquePeaks[1:5, ]
## GRanges object with 5 ranges and 4 metadata columns:
## seqnames ranges strand | D0_rep1 D0_rep2 W6_rep1
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical>
## [1] chr1 61539962-61540358 * | TRUE TRUE FALSE
## [2] chr1 61565337-61565727 * | TRUE TRUE FALSE
## [3] chr1 78532594-78532833 * | TRUE TRUE FALSE
## [4] chr1 106862211-106862379 * | TRUE TRUE FALSE
## [5] chr1 120120999-120121173 * | TRUE TRUE FALSE
## W6_rep2
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
Identifying peaks specific to cell lines or conditions does not capture the full range of changes in epigenetic events. Though it is a useful descriptor it is reliant completely on the discrete classification of peak callers.
To identify quantitative differences in epigenetic events we can compare the changes in fragment abundance from IP samples across our nonredundant set of peaks.
We first must establish a set of regions within which to quantify fragments. We have already spent time looking at how to extract non-redundant or high confidence peaks.
An established technique is to produce a set of non-redundant peaks which occur in the majority of at least one experimental condition under evaluation i.e. high confident peaks from each group. If you only have duplicates it must be present in both.
Here we identify peaks which occurred in both replicates in either D0 or W6 developmental time points.
HC_Peaks <- allPeaksSet_nR[rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("D0_rep1",
"D0_rep2")])) >= 2 | rowSums(as.data.frame(mcols(allPeaksSet_nR)[, c("W6_rep1",
"W6_rep2")])) >= 2]
export.bed(HC_Peaks, "HC_Peaks.bed")
HC_Peaks
## GRanges object with 24915 ranges and 4 metadata columns:
## seqnames ranges strand | D0_rep1 D0_rep2 W6_rep1
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical>
## [1] chr1 4807146-4807341 * | FALSE FALSE TRUE
## [2] chr1 4807593-4807733 * | FALSE FALSE TRUE
## [3] chr1 4857438-4857809 * | TRUE FALSE TRUE
## [4] chr1 5007395-5007681 * | FALSE FALSE TRUE
## [5] chr1 5015332-5015707 * | TRUE TRUE TRUE
## ... ... ... ... . ... ... ...
## [24911] chrY 6528806-6528939 * | TRUE TRUE FALSE
## [24912] chrY 6679044-6679269 * | TRUE TRUE FALSE
## [24913] chrY 6759314-6759439 * | TRUE TRUE FALSE
## [24914] chrY 9488315-9488474 * | TRUE TRUE FALSE
## [24915] chrY 9506858-9507069 * | TRUE TRUE TRUE
## W6_rep2
## <logical>
## [1] TRUE
## [2] TRUE
## [3] TRUE
## [4] TRUE
## [5] TRUE
## ... ...
## [24911] FALSE
## [24912] FALSE
## [24913] FALSE
## [24914] FALSE
## [24915] FALSE
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
Now we know the regions of the genome that are relevant i.e our HC peaks, we can then work on counting how many fragments overlap these regions.
To do this we will need to use our BAM files again. As mentioned before these files are big and counting can take from a few minutes to many hours depending on the number samples, peaks and read depth.
The next few slides will show you how to run counting, but we will also provide you with the count matrix after so you do not have to run the counting yourself for this training.
We will count from our aligned BAM files to quantify IP fragments.
We can use the BamFileList() function to specify which BAMs to count and importantly, to control memory we specify the number of reads to be held in memory at one time using the yield() parameter.
The BAM files are on DropBox if you want to try counting yourself, but it will be time consuming.
We can count the number of fragments overlapping our peaks using the summarizeOverlaps function. Since Cut&Run and/or ATAC is strandless, we set the ignore.strand parameter to TRUE.
The returned object is a RangedSummarizedExperiment containing our GRanges of HC peaks and the counts in these regions for our BAM files.
The count object has been saved and you can load it from the data directory.
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
## class: RangedSummarizedExperiment
## dim: 24915 4
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(2): name score
## colnames(4): SOX9CNR_D0_rep1_sorted.bam SOX9CNR_D0_rep2_sorted.bam
## SOX9CNR_W6_rep1_sorted.bam SOX9CNR_W6_rep2_sorted.bam
## colData names(0):
To assess changes in signal bertween different conditions we will use the DESeq2 package.
The DESeq2 package contains a workflow for assessing local changes in fragment/read abundance between replicated conditions. This workflow includes normalization, variance estimation, outlier removal/replacement as well as significance testing suited to high throughput sequencing data (i.e. integer counts).
To make use of DESeq2 workflow we must first create a data.frame of conditions of interest with rownames set as our BAM file names.
metaDataFrame <- data.frame(DevStage = c("D0", "D0", "W6", "W6"))
rownames(metaDataFrame) <- colnames(MyCounts)
metaDataFrame
## DevStage
## SOX9CNR_D0_rep1_sorted.bam D0
## SOX9CNR_D0_rep2_sorted.bam D0
## SOX9CNR_W6_rep1_sorted.bam W6
## SOX9CNR_W6_rep2_sorted.bam W6
We can use the DESeqDataSetFromMatrix() function to create a DESeq2 object.
We must provide our matrix of counts to countData parameter, our metadata data.frame to colData parameter and we include to an optional parameter of rowRanges the nonredundant peak set we can counted on.
Finally we provide the name of the column in our metadata data.frame within which we wish to test to the design parameter.
We can now run the DESeq2 workflow on our DESeq2 object using the DESeq() function.
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
Our DESeq2 object is updated to include useful statistics such our normalized values and variance of signal within each HC peak call.
## class: DESeqDataSet
## dim: 24915 4
## metadata(1): version
## assays(4): counts mu H cooks
## rownames: NULL
## rowData names(26): D0_rep1 D0_rep2 ... deviance maxCooks
## colnames(4): SOX9CNR_D0_rep1_sorted.bam SOX9CNR_D0_rep2_sorted.bam
## SOX9CNR_W6_rep1_sorted.bam SOX9CNR_W6_rep2_sorted.bam
## colData names(2): DevStage sizeFactor
We can extract our information of differential regions using the results() function.
We provide to the results() function the DESeq2 object, the comparison of interest to the contrast parameter and the type of output to return to the format parameter.
The comparison to contrast parameter is provided as a vector of length 3 including the metadata column of interest and groups to test.
We can sort the results by pvalue using the order() function to rank by the most significant changes.
W6MinusD0 <- results(dds, contrast = c("DevStage", "W6", "D0"), format = "GRanges")
W6MinusD0 <- W6MinusD0[order(W6MinusD0$pvalue), ]
W6MinusD0
## GRanges object with 24915 ranges and 6 metadata columns:
## seqnames ranges strand | baseMean log2FoldChange
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr12 24581325-24586154 * | 9192.998 -8.89109
## [2] chr12 24575912-24578078 * | 2788.408 -9.48731
## [3] chr13 23573848-23574171 * | 1399.161 -7.28962
## [4] chr4 135549991-135552699 * | 439.303 -6.26537
## [5] chrUn_GL456383 22710-26551 * | 2422.719 -5.46235
## ... ... ... ... . ... ...
## [24911] chr2 18214116-18214392 * | 61.0846 4.23216e-04
## [24912] chr17 26601189-26601589 * | 59.1127 -5.37722e-04
## [24913] chr5 149298316-149298577 * | 66.4519 2.83050e-04
## [24914] chr7 76790080-76790396 * | 52.8866 -9.19631e-05
## [24915] chr17 11431472-11431585 * | 32.0941 -3.85222e-05
## lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric>
## [1] 0.325300 -27.3320 1.77006e-164 4.41011e-160
## [2] 0.438005 -21.6603 4.86441e-104 6.05984e-100
## [3] 0.389123 -18.7335 2.64187e-78 2.19408e-74
## [4] 0.446038 -14.0467 8.06944e-45 5.02625e-41
## [5] 0.399348 -13.6782 1.37108e-42 6.83207e-39
## ... ... ... ... ...
## [24911] 0.836693 5.05820e-04 0.999596 0.999757
## [24912] 1.194353 -4.50220e-04 0.999641 0.999761
## [24913] 0.808439 3.50120e-04 0.999721 0.999801
## [24914] 0.799271 -1.15059e-04 0.999908 0.999948
## [24915] 1.019378 -3.77899e-05 0.999970 0.999970
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
The GRanges object contains information on the comparison made in DESeq2.
Most useful it contains the the difference in signal as log2 fold change in log2FoldChange, the significance of the change in the pvalue column and an adjusted p-value to address multiple correction in padj column.
## GRanges object with 24915 ranges and 6 metadata columns:
## seqnames ranges strand | baseMean log2FoldChange
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr12 24581325-24586154 * | 9192.998 -8.89109
## [2] chr12 24575912-24578078 * | 2788.408 -9.48731
## [3] chr13 23573848-23574171 * | 1399.161 -7.28962
## [4] chr4 135549991-135552699 * | 439.303 -6.26537
## [5] chrUn_GL456383 22710-26551 * | 2422.719 -5.46235
## ... ... ... ... . ... ...
## [24911] chr2 18214116-18214392 * | 61.0846 4.23216e-04
## [24912] chr17 26601189-26601589 * | 59.1127 -5.37722e-04
## [24913] chr5 149298316-149298577 * | 66.4519 2.83050e-04
## [24914] chr7 76790080-76790396 * | 52.8866 -9.19631e-05
## [24915] chr17 11431472-11431585 * | 32.0941 -3.85222e-05
## lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric>
## [1] 0.325300 -27.3320 1.77006e-164 4.41011e-160
## [2] 0.438005 -21.6603 4.86441e-104 6.05984e-100
## [3] 0.389123 -18.7335 2.64187e-78 2.19408e-74
## [4] 0.446038 -14.0467 8.06944e-45 5.02625e-41
## [5] 0.399348 -13.6782 1.37108e-42 6.83207e-39
## ... ... ... ... ...
## [24911] 0.836693 5.05820e-04 0.999596 0.999757
## [24912] 1.194353 -4.50220e-04 0.999641 0.999761
## [24913] 0.808439 3.50120e-04 0.999721 0.999801
## [24914] 0.799271 -1.15059e-04 0.999908 0.999948
## [24915] 1.019378 -3.77899e-05 0.999970 0.999970
## -------
## seqinfo: 38 sequences from an unspecified genome; no seqlengths
We can export our results to an excel spreadsheet using rio so you can easily share and review results.
We can now filter our HC peaks to those with significantly more signal in D0 or W6 timepoints by filtering by log2FoldChange and padj (p-value adjusted for multiple correction) less than 0.05.
Finally we can make our reviewing of sites in IGV a little easier using the tracktables package.
The tracktables package’s makebedtable() function accepts a GRanges object and writes an HTML report contains links to IGV.
An example can be found here
To assess global levels of changes between samples PCA can be a very useful plot. This becomes very important when you have many groups and many samples.
To do this we want the normalized count values. We can get this from our dds object using the rlog() function.
## using ntop=500 top features by variance
Volcano Plots can also be a nice way to show the amount of changes and reveal any global biases.
library(EnhancedVolcano)
EnhancedVolcano(as.data.frame(W6MinusD0.Filt), lab = paste0(seqnames(W6MinusD0.Filt),
ranges(W6MinusD0.Filt)), x = "log2FoldChange", y = "padj", selectLab = "")
Exercise on functions can be found here
Answers can be found here here