##The Course
##Reminder of file types
In this session we will be dealing with a range of data types. For more information on file types you can revisit our material.
For more information on visualizing genomics data in browsers you can visit our IGV course.
##Reminder of data types in Bioconductor
We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below.
##Materials
All material for this course can be found on github. * Visualizing Genomics Data
Or can be downloaded as a zip archive from here.
* Download zip
##Materials
Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.
##Materials
All data to run code in the presentations and in the practicals is available in the zip archive. This includes coverage as bigWig files, aligned reads as BAM files and genomic intervals stored as BED files.
We also include some RData files containing precompiled results from querying database (in case of external server downtime).
All data can be found under the data directory
data/
##Set the Working directory
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived visualizingGenomicsData folder in the Rstudio menu
Session -> Set Working Directory -> Choose Directory
or in the console.
setwd("/PathToMyDownload/VisualizingGenomicsData/r_course/")
# e.g. setwd("~/Downloads/VisualizingGenomicsData/r_course")
Previously we have reviewed how we can review signal and annotation over individual sites using and genome browser such as IGV and or programmatically using Gviz.
]
]
With many high throughput sequencing experiments we are able to gain a genome wide view of our assays (in RNA-seq we measure every gene’s expression, in ChIPseq we identify all events for a user-defined transcription factor/mark, in ATAC-seq we evaluate all open/accessible regions in the genome.)
Dimension reduction and clustering are very useful tools for simplifying the interpretation of this highly complex and dimensional biological data.
As with RNAseq, epigenomics data can be displayed once reduced in dimension, such as with meta gene and profile plots, as well as with heatmaps.
In this session will review a few approaches to visualize epigenetics data.
There are many alternate packages for visualizing epigenetic data in Bioconductor (soGGi,heatmaps and enrichmentmap packages) and in R (ngsPlot).
Useful software outside R include Homer and Deeptools.
In our ChIPseq session, we identified differential occupancy of Myc transcription factor between Mel and Ch12 cell lines. Here we will assess CTCF signal difference between Mel and Ch12.
First lets read in some CTCF peaks into a GRangesList
peakFiles <- dir("data/CTCFpeaks/",pattern="*.peaks",
full.names = TRUE)
macsPeaks_GR <- list()
for(i in 1:length(peakFiles)){
macsPeaks_DF <- read.delim(peakFiles[i],
comment.char="#")
macsPeaks_GR[[i]] <- GRanges(
seqnames=macsPeaks_DF[,"chr"],
IRanges(macsPeaks_DF[,"start"],
macsPeaks_DF[,"end"]
)
)
}
macsPeaks_GRL <- GRangesList(macsPeaks_GR)
names(macsPeaks_GRL) <- c("Ch12_1","Ch12_2","Mel_1","Mel_2")
Now we can produce our non-redundant set of peaks and score each of the peaks by their occurrence in our Mel and Ch12 samples.
allPeaksSet_nR <- reduce(unlist(macsPeaks_GRL))
overlap <- list()
for(i in 1:length(macsPeaks_GRL)){
overlap[[i]] <- allPeaksSet_nR %over% macsPeaks_GRL[[i]]
}
overlapMatrix <- do.call(cbind,overlap)
colnames(overlapMatrix) <- names(macsPeaks_GRL)
mcols(allPeaksSet_nR) <- overlapMatrix
allPeaksSet_nR[1,]
## GRanges object with 1 range and 4 metadata columns:
## seqnames ranges strand | Ch12_1 Ch12_2 Mel_1 Mel_2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 3012656-3012773 * | TRUE TRUE FALSE FALSE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
With the overlap matrix we can define our set of high-confidence peaks
HC_Peaks <- allPeaksSet_nR[
rowSums(as.data.frame(
mcols(allPeaksSet_nR)[,c("Ch12_1","Ch12_2")])) >= 2 |
rowSums(as.data.frame(
mcols(allPeaksSet_nR)[,c("Mel_1","Mel_2")])) >= 2
]
HC_Peaks[1:2,]
## GRanges object with 2 ranges and 4 metadata columns:
## seqnames ranges strand | Ch12_1 Ch12_2 Mel_1 Mel_2
## <Rle> <IRanges> <Rle> | <logical> <logical> <logical> <logical>
## [1] chr1 3012656-3012773 * | TRUE TRUE FALSE FALSE
## [2] chr1 3448274-3448415 * | FALSE FALSE TRUE TRUE
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We can count the number of reads overlapping our peaks using the summarizeOverlaps function and a BamFileList of our BAMs to count.
library(Rsamtools)
library(GenomicAlignments)
bams <- c("~/Projects/Results/chipseq/testRun/BAMs/Sorted_CTCF_Ch12_1.bam",
"~/Projects/Results/chipseq/testRun/BAMs/Sorted_CTCF_Ch12_2.bam",
"~/Projects/Results/chipseq/testRun/BAMs/Sorted_CTCF_Mel_1.bam",
"~/Projects/Results/chipseq/testRun/BAMs/Sorted_CTCF_Mel_2.bam")
bamFL <- BamFileList(bams,yieldSize = 5000000)
myCTCFCounts <- summarizeOverlaps(HC_Peaks,
reads = bamFL,
ignore.strand = TRUE)
colData(myCTCFCounts)$Group <- c("Ch12","Ch12","Mel","Mel")
Now we can use the DESeq2 package to compare our 2 cell-lines.
We can also use DESeq2 to give us some normalized, transformed data for visualization using rlog.
With our newly created rlog transformed CTCF signal data we can create similar visualization as we have for RNAseq.
First we can create our measures of sample distances.
sampleDists <- as.dist(1-cor(assay(CTCFrlog)))
sampleDistMatrix <- as.matrix(sampleDists)
pheatmap(sampleDistMatrix,
clustering_distance_rows = sampleDists,
clustering_distance_cols = sampleDists)
We can also review our dimension reduction by PCA to assess differences within and between groups with the plotPCA. Here we use by default the top 500 most variable sites.
We can rerun using all sites to see some additional variance in PC2.
The plotPCA can return a data.frame of the PC scores for samples used in plot.
## PC1 PC2 group Group name
## CTCF_Ch12_1 -95.59844 17.12250 CTCF_Ch12 CTCF_Ch12 CTCF_Ch12_1
## CTCF_Ch12_2 -91.08241 -17.63094 CTCF_Ch12 CTCF_Ch12 CTCF_Ch12_2
## CTCF_MEL_1 94.06208 -11.13739 CTCF_Mel CTCF_Mel CTCF_MEL_1
## CTCF_MEL_2 92.61876 11.64583 CTCF_Mel CTCF_Mel CTCF_MEL_2
So we can use this data.frame and any additional information to produce a customized version of this plot. Here we add fragment length information.
myData$FragmentLength <- c(129,136,133,125)
library(ggplot2)
ggplot(myData,aes(x=PC1,y=PC2,
colour=Group,size=FragmentLength))+
geom_point()+
scale_size_continuous(range=c(2,10))
We can extract information on the influence of sites to the separation of samples along PC1 as we have with RNAseq.
pcRes <- prcomp(t(assay(CTCFrlog)))
RankedPC1 <- rownames(pcRes$rotation)[order(pcRes$rotation[,1],
decreasing=T)]
RankedPC1[1:3]
## [1] "ID_chr4_89129278_89129824" "ID_chr4_89145405_89145765"
## [3] "ID_chrX_159987900_159988285"
We can see from the IGV that these top sites are Mel cell-line specific CTCF peaks.
We can also plot our significantly differential sites as defined by our DESeq2 test in a gene by sample heatmaps.
library(pheatmap)
rlogMat <- assay(CTCFrlog)
Diff <- names(CTCF_Ch12MinusMel)[CTCF_Ch12MinusMel$padj < 0.05 &
!is.na(CTCF_Ch12MinusMel$padj) &
abs(CTCF_Ch12MinusMel$log2FoldChange) > 3]
sigMat <- rlogMat[rownames(rlogMat) %in% Diff,]
pheatmap(sigMat,scale="row",show_rownames = FALSE)
The soGGi package contains a set of tools to profile high-throughput signalling data and motif occurrence over a set of genomic locations.
We have used this earlier to produce some plots of our ATACseq signal split by fragment length over TSS regions and our cut-site footprints around CTCF motifs.
]
]
We will use the development version of soGGi package available from Github to take advantage of the latest features.
Before we start plotting over regions we can update our DEseq2 results GRanges with some more information on the differential binding sites. This is some categorical data, based on the direction and significance of change.
UpInMel <- names(CTCF_Ch12MinusMel)[CTCF_Ch12MinusMel$padj < 0.05 &
!is.na(CTCF_Ch12MinusMel$padj) &
CTCF_Ch12MinusMel$log2FoldChange < -3]
DownInMel <- names(CTCF_Ch12MinusMel)[CTCF_Ch12MinusMel$padj < 0.05 &
!is.na(CTCF_Ch12MinusMel$padj) &
abs(CTCF_Ch12MinusMel$log2FoldChange) > 3]
SigVec <- ifelse(names(CTCF_Ch12MinusMel) %in% UpInMel,"UpInMel",
ifelse(names(CTCF_Ch12MinusMel) %in% DownInMel,"DownInMel",
"Other"))
CTCF_Ch12MinusMel$Sig <- SigVec
CTCF_Ch12MinusMel$name <- names(CTCF_Ch12MinusMel)
We also add some information on CTCF site signal strength. We are dividing our peaks into two groups: those that have high or low signal.
grPs <- quantile(CTCF_Ch12MinusMel$baseMean,c(0.33,0.66))
high <- names(CTCF_Ch12MinusMel)[CTCF_Ch12MinusMel$baseMean > grPs[2]]
low <- names(CTCF_Ch12MinusMel)[CTCF_Ch12MinusMel$baseMean < grPs[2]]
Strength <- ifelse(names(CTCF_Ch12MinusMel) %in% high,"high",
ifelse(names(CTCF_Ch12MinusMel) %in% low,"low",
"")
)
CTCF_Ch12MinusMel$Strength <- Strength
Now we have our updated GRanges of DEseq2 results with our custom annotation.
## GRanges object with 2 ranges and 9 metadata columns:
## seqnames ranges strand | baseMean
## <Rle> <IRanges> <Rle> | <numeric>
## ID_chr16_11670061_11670805 chr16 11670061-11670805 * | 3608.19
## ID_chr1_75559493_75559979 chr1 75559493-75559979 * | 1400.68
## log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric>
## ID_chr16_11670061_11670805 3.73522 0.121393 30.7696 6.69175e-208
## ID_chr1_75559493_75559979 4.23851 0.139268 30.4341 1.94252e-203
## padj Sig name
## <numeric> <character> <character>
## ID_chr16_11670061_11670805 4.12473e-203 DownInMel ID_chr16_11670061_11..
## ID_chr1_75559493_75559979 5.98675e-199 DownInMel ID_chr1_75559493_755..
## Strength
## <character>
## ID_chr16_11670061_11670805 high
## ID_chr1_75559493_75559979 high
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We can use the regionPlot() function with our normalised bigWig of CTCF signal by setting format parameter to bigwig. Additionally we set the plot area width with the distanceAround parameter.
We can produce a simple profile plot by using the plotRegion() function with our ChIPprofile object.
As with our other packages, soGGi returns a ggplot object which can be further customized.
We can use the summariseBy argument to group our data for plotting by column information. Here by specify to group data by values in the Strength column.
We can supply additional groups to the gts argument as a list of sets of IDS/names to create groups from. By default the names are matched against the rownames of our ChIPprofile object.
When using the gts argument, we can supply the name of column to check group IDs/names against for grouping our plot to the summariseBy argument.
We can also use lists of GRanges to group our data in soGGi.
First we can create some TSS and TTS regions to investigate.
We can then supply this list of GRanges to the gts parameter as we have our lists of IDs/names. Groups are defined by regions overlapping the specified ranges.
We can control colouring with the colourBy argument. Here we specify Group to colour by our newly defined groups.
We can control faceting with the groupBy argument.
Again we can use ggplot basics to add some of our own colour schemes to our plots.
p <- plotRegion(CTCF_ch12,
summariseBy = "Strength",
colourBy = "Group",
groupBy = "Group")
cols <- c("high" = "darkred", "low" = "darkblue")
p + scale_colour_manual(values = cols)+
theme_minimal()
We can use to soGGi to compare signal across samples.
Here we load in some data now for CTCF signal in our Mel cell-line.
We can combine ChIPprofile objects just as with vectors and lists using the c() function.
## class: ChIPprofile
## dim: 61639 1501
## metadata(2): names AlignedReadsInBam
## assays(2): '' ''
## rownames(61639): giID2175 giID2176 ... giID59822 giID59823
## rowData names(10): baseMean log2FoldChange ... Strength giID
## colnames(1501): Point_Centre-750 Point_Centre-749 ... Point_Centre749
## Point_Centre750
## colData names(0):
We can now plot our CTCF samples together in the same plot. We can control the plot by setting the colourBy argument to Sample.
We could also use the sample to facet our plot. Here we set groupBy to Sample.
And we combine Sample and Grouping to produce some more complex plots. Here we group our data by the Sig column.
We have previously used soGGi to produce our plots of footprints across motifs. Now we will use motif occurrence as the signal to plot instead of regions to plot over.
First we extract the CTCF motif and plot the seqLogo.
library(MotifDb)
library(Biostrings)
library(seqLogo)
library(BSgenome.Mmusculus.UCSC.mm10)
CTCFm <- query(MotifDb, c("CTCF"))
ctcfMotif <- CTCFm[[1]]
seqLogo(ctcfMotif)
We can use the Biostrings matchPWM() function to scan the genome for CTCF motifs, convert sites to a GRanges and this to an rleList using the coverage function.
myRes <- matchPWM(ctcfMotif,
BSgenome.Mmusculus.UCSC.mm10[["chr19"]])
CTCFmotifs <- GRanges("chr19",ranges(myRes))
CTCFrle <- coverage(CTCFmotifs)
CTCFrle[[1]]
## integer-Rle of length 61322860 with 5646 runs
## Lengths: 3055542 19 24798 19 ... 8069 19 13447 19
## Values : 0 1 0 1 ... 0 1 0 1
We can now plot our profile of motif occurrence across CTCF sites using the regionPlot() function and setting format parameter to rlelist.
We can combine our ChIP profiles from CTCF signal in Mel and Motif occurrence and plot side by side.
CTCFSig19 <- CTCF_mel[seqnames(CTCF_mel) %in% "chr19",]
rownames(motifPlot) <- rownames(CTCFSig19) <- NULL
CTCFall <- c(motifPlot,CTCFSig19)
p <- plotRegion(CTCFall)
p
Since these two objects represent different data types, they will be in different scales.
We can set the scales to resize within our facets by setting freeScale parameter to TRUE.
In some instances we will want to plot over regions of unequal lengths, such as genes.
## 66 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
## GRanges object with 24528 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## [1] chr9 21062393-21073096 - | 100009600
## [2] chr7 84935565-84964115 - | 100009609
## [3] chr10 77711457-77712009 + | 100009614
## [4] chr11 45808087-45841171 + | 100009664
## [5] chr4 144157557-144162663 - | 100012
## ... ... ... ... . ...
## [24524] chr3 84496093-85887516 - | 99889
## [24525] chr3 110246109-110250998 - | 99890
## [24526] chr3 151730922-151749960 - | 99899
## [24527] chr3 65528410-65555518 + | 99929
## [24528] chr4 136550540-136602723 - | 99982
## -------
## seqinfo: 66 sequences (1 circular) from mm10 genome
To produce average plots over genes we must rescale genes to be the same size.This is performed by splitting genes into equal number of bins, with bin size variable across genes and plotting as percent of regions.
We can produce these plots in soGGi by setting style to percentOfRegion.
Now we can plot the distribution of Polymerase 2 serine 2 across genes with plotRegion() function.
We can now include the CTCF signal to review distibution of CTCF in genes compared to Pol2s2 signal.
We can now plot the signal of CTCF and Pol2serine2 side by side.
rownames(CTCF_gene) <- rownames(pol2Ser2) <- NULL
plotRegion(c(CTCF_gene,pol2Ser2),
colourBy = "Sample",
freeScale = TRUE,
groupBy="Sample")
We may also want to review our CTCF profile data as heatmap of signal across our sites.
We can produce heatmaps in soGGi using the plotHeatmap function. First we can extract a subset of sites to plot, here our sites which are higher in Mel.
Since ChIPprofile objects act like GRanges we can subset by our metadata columns using mcols() function.
We can pass individual ChIPprofile objects to the plotHeatmap function to produce our heatmaps of CTCF signal in Mel sample. Here we set maxValue to make Heatmaps comparable across samples.
And we can produce the heatmap for our other sample in the same way.
The heatmaps generated in soGGi however aren’t as customizable as we would like.
The profileplyr written by Doug Barrows (Allis lab/BRC) provides a framework for plotting and manipulating genomic signal heatmaps.
The profileplyr package has methods to import/export signal heatmaps from external sources such as DeepTools. Most usefully for us here though it can translate our soGGi ChIPprofile objects to profileplyr objects.
The profileplyr object is a version of RangedSummarizedExperiment object we should already recognize.
## [1] "profileplyr"
## attr(,"package")
## [1] "profileplyr"
## class: profileplyr
## dim: 61639 1501
## metadata(0):
## assays(2): Sorted_CTCF_MEL_1Normalised.bw
## Sorted_CTCF_Ch12_1Normalised.bw
## rownames(61639): giID2175 giID2176 ... giID59822 giID59823
## rowData names(12): baseMean log2FoldChange ... names sgGroup
## colnames(1501): Point_Centre-750 Point_Centre-749 ... Point_Centre749
## Point_Centre750
## colData names(0):
This means we can act on the object like any other RangedSummarisedExperiment object including subsetting by GRanges.
## class: profileplyr
## dim: 81 1501
## metadata(0):
## assays(2): Sorted_CTCF_MEL_1Normalised.bw
## Sorted_CTCF_Ch12_1Normalised.bw
## rownames(81): giID2233 giID2238 ... giID6060 giID6128
## rowData names(12): baseMean log2FoldChange ... names sgGroup
## colnames(1501): Point_Centre-750 Point_Centre-749 ... Point_Centre749
## Point_Centre750
## colData names(0):
Before we do any plotting we will subset our profileplyr object to just CTCF peaks changing between Mel and Ch12 cell-lines.
We can use the mcols of our profileplyr object to subset to genes of interest.
Once we have the profileplyr object filtered to the data of interest, we can create our heatmap using the generateEnrichedHeatmap function.
We can perform clustering of our profileplyr object using the clusterRanges() function.
By default this will perform hierarchical clustering and plot the result summarised within peaks.
## No 'kmeans_k' or 'cutree_rows' arguments specified. profileplyr object will be returned new column with hierarchical order from hclust