All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived GenomicHeatmapsAndProfiles folder in the Rstudio menu.
Session -> Set Working Directory -> Choose Directory
or in the console.
A common task in genomics is to review your data in your favourite genome browser. Here we are using IGV and our material is available here
We need to work with aligned data for visualisation and analysis in our genome of choice. Aligned sequence data is often stored in SAM format.
SAM - Sequence Alignment Map
These format are easy for us to read but inefficent for software to work with. When using software we will most often work with highly compressed and indexed equivalent files.
In the next few slides we are going to run through an alignment of ChIP-seq FastQ to sorted and indexed BAM files and then create a bigWig.
For a full summary of this processing please see our relevant courses
We can also extract genomes from Bioconductor objects inside of R. Here we use the BSgenome.Hsapiens.UCSC.hg19 to extract the sequence for major chromosomes and write out to a FASTA file.
require(BSgenome.Hsapiens.UCSC.hg19)
mainChromosomes <- paste0("chr",c(1:22,"X","Y","M"))
mainChrSeq <- lapply(mainChromosomes,
function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
writeXStringSet(mainChrSeqSet,
"BSgenome.Hsapiens.UCSC.hg19_majorChr.fa")
Once we have our index we can align our FastQ to the genome using the align function to create a BAM file.
Now we have a indexed BAM file we can easily extract the total mapped reads.
## [1] 10302212
class: inverse, center, middle
In IGV we can also review multiple loci in one screen to get an overview of signal over a group of regions.
Many requests for complex heatmaps and summary statistics of data visualised within.
.pull-left[ Profileplyr is available from Bioconductor and developed with Doug Barrows to - Allow import/export between Deeptools and R framework - Provide easy tools to manipulate and summarise heatmap data. ] .pull-right[]
First we need to get hold of some data to plot.
For demonstration we will take a small processed data set of bigWigs from a paper containing ChIP-seq for the transcription factos ZBTB1 and ATf4.
We can retrieve raw and processed data for GSM4332935, GSM4332934, GSM4332940, GSM4332950 and GSM4332944 directly from GEO.
We can use the getGEOSuppFiles function to download any supplementary files to our local working directory.
require(GEOquery)
GSM4332935_BigWig <- getGEOSuppFiles("GSM4332935",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332934_BigWig <- getGEOSuppFiles("GSM4332934",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332944_BigWig <- getGEOSuppFiles("GSM4332944",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332940_BigWig <- getGEOSuppFiles("GSM4332940",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332950_BigWig <- getGEOSuppFiles("GSM4332950",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
The BED file regions are stored in R as a GRanges object.
GRanges objects provide much of the functionality seen in BedTools.
## GRanges object with 593 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr2 70660347-70663275 * | chr2:70660347-70663275
## [2] chr12 6036980-6042542 * | chr12:6036980-6042542
## [3] chr13 113526647-113528481 * | chr13:113526647-113528481
## [4] chr5 881483-884047 * | chr5:881483-884047
## [5] chr17 79397761-79400379 * | chr17:79397761-79400379
## ... ... ... ... . ...
## [589] chr11 9024476-9026040 * | chr11:9024476-9026040
## [590] chr1 58387303-58388627 * | chr1:58387303-58388627
## [591] chr17 32834453-32835985 * | chr17:32834453-32835985
## [592] chrY 13683928-13693098 * | chrY:13683928-13693098
## [593] chr10 90901936-90905010 * | chr10:90901936-90905010
## score
## <numeric>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [589] 0
## [590] 0
## [591] 0
## [592] 0
## [593] 0
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
The function %over% allows us to evaluate which regions overlap/intersect between two region sets.
The DeepTools software provides a set of tools to convert between file types (i.e. BAM to bigWig) and to plot signal from BAM or bigWig over a set of regions as a heatmap.
Deeptools course for more information
computeMatrix reference-point --referencePoint center
-bs binSize -a BP_after_BED -b BP_before_BED
-S BIGWIGSofINTEREST
--regionsFileName BEDFILEOFINTEREST
--outFileName OUTFILE
The computeMatrix function take a set of arguments to generate an intermediate file
Once we have our intermediate file we can use this to create our heatmap using the plotHeatmap command.
The profileplyr package offers a similar set of functionality as the Deeptools software within R. To compute our intermediate matrix we can use the BamBigwig_to_chipProfile function with the same set of inputs as Deeptools.
require(profileplyr)
bigWigs <- dir(full.names=TRUE,pattern = "FLAG")
zbtb1_profile <- BamBigwig_to_chipProfile(bigWigs,
testRanges = "data/beds/ZBTB1Peaks.bed",
style = "point",
format = "bigwig",
distanceAround = 2000)
zbtb1_profile <- as_profileplyr(zbtb1_profile)
zbtb1_profile
## class: profileplyr
## dim: 593 200
## metadata(0):
## assays(4): GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw
## GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw
## GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw
## GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
## rownames(593): giID450 giID451 ... giID562 giID563
## rowData names(5): name score sgGroup giID names
## colnames: NULL
## colData names(0):
However you generate the intermediate file, you can import/export to Deeptools. This allows you take advantage of any functionality in Deeptools and R together.
ChIP-seq will often show the presence of common artefacts such as ultra-high signal regions or Blacklists. Such regions can confound peak calling, fragment length estimation and QC metrics.
For more information on Blacklists see the Deeptools guide or our paper.
The groupBy by function allows us to add additional information to our Heatmaps. Here we provide the GRanges of our Blacklist BED file and specify to include sites which do not overlap using the include_nonoverlapping argument.
zbtb1_profile_Grouped <- groupBy(zbtb1_profile,
group = blacklist,
include_nonoverlapping = TRUE)
generateEnrichedHeatmap(zbtb1_profile_Grouped)
Now we have created our filtered set we could export it back to Deeptools to plot using it’s plotHeatmap function.
We can subset our heatmap to only review the samples we want using indexing in R.
Our heatmaps have 3 dimension - Ranges/Rows, Bins/Columns, Samples. To subset samples then we can index our profileplyr object as in standard R.
We can then join our ZBTB1 and ATF4 bigwigs using the common R combine function c().
ztbtb1Andatf4_profile <- c(zbtb1_profile_subset,atf4_profile_bl)
generateEnrichedHeatmap(ztbtb1Andatf4_profile)
We can directly change the sample names when plotting the heatmap too by using the sample_names argument.
We can use the information on samples then when we plot our heatmap. Here we tell the generateEnrichedHeatmap() function to colour our heatmaps by their Antibody information we just added.
We can now review the clustering by simply passing the new profileplyr object to the generateEnrichedHeatmap function.
We can see that it has created two clusters which appear to contain high or low ATF4 signal.
If we wish to now review the relationship between the clustering and the GRanges group we can use the extra_annotation_columns argument to specify additional information to add to the Heatmap.
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody",
extra_annotation_columns = "cluster")
Now we have our long format data.frame ready we can pass this to ggplot to visualise as violin plots.
require(ggplot2)
ggplot(ztbtb1Andatf4_summarized,
aes(y=Signal,x=Sample,fill=GR_overlap_names))+
facet_wrap(Antibody~.,scales = "free")+
geom_violin()+scale_y_log10()
All the annotation is now contained in the row information for the profileplyr object and can be retieved using the rowRanges() function.
## GRanges object with 1 range and 13 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## giID484 chr1 1508087-1512087 + | chr1:1508899-1511276 0
## sgGroup giID names cluster
## <factor> <character> <character> <factor>
## giID484 ZBTB1Peaks.bed giID484 chr1:1508899-1511276 1
## hierarchical_order overlap_matrix GR_overlap_names name.1
## <integer> <matrix> <factor> <character>
## giID484 252 0 no_overlap <NA>
## score.1 SYMBOL distanceToTSS
## <numeric> <character> <numeric>
## giID484 NA SSU72 162
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
We can use the annotation to label where peaks closest to our genes of interest are in the heatmap
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody",
extra_annotation_columns = "cluster",
genes_to_label = "ASNS",
gene_label_font_size = 18)
Now we have named list of the genes of interest, we can use this group our heatmap using the groupBy function and produce our final heatmap.
Here the group argument accepts a list of gene names instead of a GRanges.
The overlap is quite small so we regroup by a different column and add the gene set as annotation instead by specify “GL_overlap_names” to the extra_annotation_columns parameter
ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile2,group="cluster")
generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
color_by_sample_group = "Antibody",genes_to_label = "ASNS",
extra_annotation_columns = "GL_overlap_names",
gene_label_font_size = 18)
Any suggestions, comments, edits or questions (about content or the slides themselves) please reach out to our GitHub and raise an issue.