+ - 0:00:00
Notes for current slide
Notes for next slide

Genomic Heatmaps and Profiles

Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Genomic_HeatmapsAndProfiles/

1 / 73

Overview

  • Set up
  • Raw data to visualisations
    • Raw data to visualisations in R (Quick)
  • Visualising signal over genomic region sets
2 / 73

Set up


3 / 73

Materials

All prerequisites, links to material and slides for this course can be found on github.

Or can be downloaded as a zip archive from here.

4 / 73

Presentations, source code and practicals

Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

  • presentations/slides/ Presentations as an HTML slide show.
  • presentations/singlepage/ Presentations as an HTML single page.
  • presentations/r_code/ R code in presentations.
5 / 73

Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived GenomicHeatmapsAndProfiles folder in the Rstudio menu.

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("/PathToMyDownload/GenomicHeatmapsAndProfiles/r_course")
## For Fuchs Lab
# e.g. setwd("/Workshop_Data/TestFolder/")
6 / 73

Raw data to visualisations


7 / 73

Genomics data visualisations

A common task in genomics is to review your data in your favourite genome browser. Here we are using IGV and our material is available here

offset
8 / 73

Genomics data visualisations

To visualise our data we first need to get our data in a suitable format for visualisation. For more information on file formats see our material is available here

High throughput Sequencing data is typically delivered as unaligned sequences in FastQ format.

FastQ (FASTA with Qualities)

igv
  • "@" followed by identifier.
  • Sequence information.
  • "+"
  • Quality scores encodes as ASCI.
9 / 73

Genomics data visualisations

We need to work with aligned data for visualisation and analysis in our genome of choice. Aligned sequence data is often stored in SAM format.

SAM - Sequence Alignment Map

igv
  • Contains read and alignment information including genomic location
10 / 73

Genomics data visualisations

Typically for visualisation in genome browsers we require only a subset of this information - Depth of reads/Signal at genomic locations. We can then take advantage of more reduced formats such as bedGraph (an extension of BED).

bed(Browser Extensible Data)Graph

igv
  • BED 3 format
    • Chromosome
    • Start
    • End

4th column - Score

11 / 73

Genomics data visualisations

These format are easy for us to read but inefficent for software to work with. When using software we will most often work with highly compressed and indexed equivalent files.

  • SAM -- BAM (with a .bai index)
  • wig/bedGraph -- bigWig
12 / 73

Raw data to visualisations in R (Quick)


13 / 73

Full workflows available.

In the next few slides we are going to run through an alignment of ChIP-seq FastQ to sorted and indexed BAM files and then create a bigWig.

For a full summary of this processing please see our relevant courses

14 / 73

A reference genome.

To align our raw fastQ data we will need a reference genome in FASTA format to align against.

We can retrieve this from Ensembl FTP download page for many organisms.

igv
15 / 73

A reference genome in R.

We can also extract genomes from Bioconductor objects inside of R. Here we use the BSgenome.Hsapiens.UCSC.hg19 to extract the sequence for major chromosomes and write out to a FASTA file.

require(BSgenome.Hsapiens.UCSC.hg19)
mainChromosomes <- paste0("chr",c(1:22,"X","Y","M"))
mainChrSeq <- lapply(mainChromosomes,
function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
writeXStringSet(mainChrSeqSet,
"BSgenome.Hsapiens.UCSC.hg19_majorChr.fa")
16 / 73

FastQ to BAM and bigWigs.

Once we have the FASTA file available we can prepare the index of FASTA for alignment using the Rsubread package. The indexSplit paramter here will save us memory in alignment but may cost us some speed.

require(Rsubread)
buildindex("BSgenome.Hsapiens.UCSC.hg19",
"BSgenome.Hsapiens.UCSC.hg19_majorChr.fa",
indexSplit = TRUE)
17 / 73

FastQ to BAM and bigWigs.

Once we have our index we can align our FastQ to the genome using the align function to create a BAM file.

align("BSgenome.Hsapiens.UCSC.hg19",
"ENCFF001NQP.fastq.gz",
output_file="ENCFF001NQP.bam",
type="dna")
18 / 73

FastQ to BAM and bigWigs.

We can then sort and index this BAM file to be ready for use in external software by using the R interfact to samtools (Rsamtools)

require(Rsamtools)
sortBam("ENCFF001NQP.bam","Sorted_ENCFF001NQP")
indexBam("Sorted_ENCFF001NQP.bam")
19 / 73

FastQ to BAM and bigWigs.

Now we have a indexed BAM file we can easily extract the total mapped reads.

MappedReads <- idxstatsBam("Sorted_ENCFF001NQP.bam")
TotalMappedReads <- sum(MappedReads$mapped)
## [1] 10302212
20 / 73

Make a bigWig.

We can then make our bigWig in a few lines of R using the rtracklayer package.

require(rtracklayer)
coverageOverGenome <- coverage("Sorted_ENCFF001NQP.bam",
weight = (10^6)/TotalMappedReads)
export.bw(coverageOverGenome,con="Sorted_ENCFF001NQP.bw")

We may also want to extend reads to predicted fragment length before generating coverage.For more details on this see our material here.

21 / 73

Visualising signal over genomic region sets.


22 / 73

Review in IGV

We can easily review our signal data in IGV at a simgle locus to assess the relationship between samples/conditions/antibodies.

offset
23 / 73

Multipanel review in IGV

In IGV we can also review multiple loci in one screen to get an overview of signal over a group of regions.

offset
24 / 73

Visualisation in heatmaps

Often we want to review signal within and between samples/conditions over 1000s of sites. To do this we can take advantage of genomic heatmaps and profiles.

Two major softwares to produce genomic heatmaps

  • Deeptools
  • EnrichedHeatmap
    igv
igv
igv
25 / 73

ProfilePlyr package

Many requests for complex heatmaps and summary statistics of data visualised within.

Profileplyr is available from Bioconductor and developed with Doug Barrows to

  • Allow import/export between Deeptools and R framework
  • Provide easy tools to manipulate and summarise heatmap data.
igv
26 / 73

Getting hold of DeepTools

You will not require Deeptools to follow the course but if you would like to install this can be done easily through Conda or Docker.

For more information on Deeptools you can see our course here.

igv
27 / 73

Getting hold of data.

First we need to get hold of some data to plot.

For demonstration we will take a small processed data set of bigWigs from a paper containing ChIP-seq for the transcription factos ZBTB1 and ATf4.

We can retrieve raw and processed data for GSM4332935, GSM4332934, GSM4332940, GSM4332950 and GSM4332944 directly from GEO.

igv
28 / 73

Getting hold of data.

We can also use some of the tools in Bioconductor to retrieve supplementary files from GEO directly.

To do this we will use the GEOquery package. We can install this following the commands on the Bioconductor page for GEOquery.

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GEOquery")
29 / 73

Getting hold of data.

We can use the getGEOSuppFiles function to download any supplementary files to our local working directory.

require(GEOquery)
GSM4332935_BigWig <- getGEOSuppFiles("GSM4332935",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332934_BigWig <- getGEOSuppFiles("GSM4332934",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332944_BigWig <- getGEOSuppFiles("GSM4332944",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332940_BigWig <- getGEOSuppFiles("GSM4332940",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
GSM4332950_BigWig <- getGEOSuppFiles("GSM4332950",fetch_files = TRUE,
makeDirectory = FALSE,
filter_regex = "*.bw")
30 / 73

Regions of interest

We want to plot our signal over selected sets of regions. We will import our regions from a BED file using the import.bed function in rtracklayer package

require(rtracklayer)
ZBTB1_GR <- import.bed("data/beds/ZBTB1Peaks.bed")
ZBTB1_GR
## GRanges object with 593 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr2 70660347-70663275 * | chr2:70660347-70663275
## [2] chr12 6036980-6042542 * | chr12:6036980-6042542
## [3] chr13 113526647-113528481 * | chr13:113526647-113528481
## [4] chr5 881483-884047 * | chr5:881483-884047
## [5] chr17 79397761-79400379 * | chr17:79397761-79400379
## ... ... ... ... . ...
## [589] chr11 9024476-9026040 * | chr11:9024476-9026040
## [590] chr1 58387303-58388627 * | chr1:58387303-58388627
## [591] chr17 32834453-32835985 * | chr17:32834453-32835985
## [592] chrY 13683928-13693098 * | chrY:13683928-13693098
## [593] chr10 90901936-90905010 * | chr10:90901936-90905010
## score
## <numeric>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [589] 0
## [590] 0
## [591] 0
## [592] 0
## [593] 0
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
31 / 73

GRanges

The BED file regions are stored in R as a GRanges object.

GRanges objects provide much of the functionality seen in BedTools.

ZBTB1_GR
## GRanges object with 593 ranges and 2 metadata columns:
## seqnames ranges strand | name
## <Rle> <IRanges> <Rle> | <character>
## [1] chr2 70660347-70663275 * | chr2:70660347-70663275
## [2] chr12 6036980-6042542 * | chr12:6036980-6042542
## [3] chr13 113526647-113528481 * | chr13:113526647-113528481
## [4] chr5 881483-884047 * | chr5:881483-884047
## [5] chr17 79397761-79400379 * | chr17:79397761-79400379
## ... ... ... ... . ...
## [589] chr11 9024476-9026040 * | chr11:9024476-9026040
## [590] chr1 58387303-58388627 * | chr1:58387303-58388627
## [591] chr17 32834453-32835985 * | chr17:32834453-32835985
## [592] chrY 13683928-13693098 * | chrY:13683928-13693098
## [593] chr10 90901936-90905010 * | chr10:90901936-90905010
## score
## <numeric>
## [1] 0
## [2] 0
## [3] 0
## [4] 0
## [5] 0
## ... ...
## [589] 0
## [590] 0
## [591] 0
## [592] 0
## [593] 0
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
32 / 73

Overlaps in GRanges

One of the most useful function for GRanges objects is the ability to intersect/overlap differing regions.

To demonstate first we will read in a BED file of promoter positions for hg19.

## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
require(rtracklayer)
promoter_positions <- import.bed("data/beds/Promoter_Positions.bed")
promoter_positions
## GRanges object with 23056 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr19 58874015-58876214 - | 1 0
## [2] chr8 18246755-18248954 + | 10 0
## [3] chr20 43280177-43282376 - | 100 0
## [4] chr18 25757246-25759445 - | 1000 0
## [5] chr1 244006687-244008886 - | 10000 0
## ... ... ... ... . ... ...
## [23052] chr9 115095745-115097944 - | 9991 0
## [23053] chr21 35734323-35736522 + | 9992 0
## [23054] chr22 19109768-19111967 - | 9993 0
## [23055] chr6 90537619-90539818 + | 9994 0
## [23056] chr22 50964706-50966905 - | 9997 0
## -------
## seqinfo: 39 sequences from an unspecified genome; no seqlengths
33 / 73

Overlaps in GRanges

The function %over% allows us to evaluate which regions overlap/intersect between two region sets.

  • A[A %over% B] -- Returns all regions in A which overlap regions in B
  • B[B %over% A] -- Returns all regions in B which overlap regions in A
  • A[!A %over% B] -- Returns all regions in A which do not overlap regions in B
34 / 73

Overlaps in GRanges

So in practice we can easily extract our ZBTB1 peaks which do or do not overlap with promoters. We can then use the export.bed function to write the overlapping GRanges to a BED file

ZBTB1_In_Promoters <- ZBTB1_GR[ZBTB1_GR %over% promoter_positions]
ZBTB1_Outside_Promoters <- ZBTB1_GR[!ZBTB1_GR %over% promoter_positions]
export.bed(ZBTB1_In_Promoters,
con="data/beds/ZBTB1_In_Promoters.bed")
35 / 73

Heatmaps in Deeptools

The DeepTools software provides a set of tools to convert between file types (i.e. BAM to bigWig) and to plot signal from BAM or bigWig over a set of regions as a heatmap.

Deeptools course for more information

igv
36 / 73

Heatmaps in Deeptools

Two main workhorse functions allow you create these heatmaps in deeptools

  • computeMatrix - Create an intermediate file of signal for heatmap.
  • plotHeatmap - Generate the heatmaps from intermediate file.
37 / 73

Creating GenomicHeatmaps

computeMatrix reference-point --referencePoint center
-bs binSize -a BP_after_BED -b BP_before_BED
-S BIGWIGSofINTEREST
--regionsFileName BEDFILEOFINTEREST
--outFileName OUTFILE

The computeMatrix function take a set of arguments to generate an intermediate file

  • reference-point --referencePoint = Specify position in regions in BED file to centre heatmap.
  • -bs = Binsize across regions in BED file
  • -b and -a = bp before and after reference point to plot in heatmap.
  • -S = BigWigs to plot signal from
  • --regionsFileName = BED file containing regions to plot over.
  • --outFileName = Name of intermediate file generated.
38 / 73

Creating GenomicHeatmaps

Using our data, the command looks like this.

computeMatrix reference-point --referencePoint center
-bs 50 -a 2000 -b 2000 -p 4
-S GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
--regionsFileName ZBTB1Peaks.bed
--outFileName Deeptools_ZBTB1Peaks.MAT
39 / 73

Creating GenomicHeatmaps

Once we have our intermediate file we can use this to create our heatmap using the plotHeatmap command.

plotHeatmap -m MATRIXFILE
-o HEATMAP
--colorList lowColour,highColour
  • -m = Name of intermediate file generated by Deeptools computeMatrix.
  • -o = Output file name for heatmap
  • --colorList = Bottom and top of colour scale for heatmap.
40 / 73

Creating GenomicHeatmaps

We our data, the plotHeatmap command would be.

plotHeatmap -m Deeptools_ZBTB1Peaks.MAT
-o Deeptools_ZBTB1Peaks.MAT.png
--colorList white,darkred
igv
41 / 73

Creating GenomicHeatmaps

The profileplyr package offers a similar set of functionality as the Deeptools software within R. To compute our intermediate matrix we can use the BamBigwig_to_chipProfile function with the same set of inputs as Deeptools.

require(profileplyr)
bigWigs <- dir(full.names=TRUE,pattern = "FLAG")
zbtb1_profile <- BamBigwig_to_chipProfile(bigWigs,
testRanges = "data/beds/ZBTB1Peaks.bed",
style = "point",
format = "bigwig",
distanceAround = 2000)
zbtb1_profile <- as_profileplyr(zbtb1_profile)
zbtb1_profile
## class: profileplyr
## dim: 593 200
## metadata(0):
## assays(4): GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw
## GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw
## GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw
## GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
## rownames(593): giID450 giID451 ... giID562 giID563
## rowData names(5): name score sgGroup giID names
## colnames: NULL
## colData names(0):
42 / 73

Creating GenomicHeatmaps

We can then use the generateEnrichedHeatmap function to create a heatmap as we did in Deeptools.

generateEnrichedHeatmap(zbtb1_profile)

43 / 73

Import from DeepTools

However you generate the intermediate file, you can import/export to Deeptools. This allows you take advantage of any functionality in Deeptools and R together.

zbtb1_deeptools <- import_deepToolsMat("data/Deeptools_ZBTB1Peaks.MAT")
generateEnrichedHeatmap(zbtb1_deeptools)

44 / 73

Blacklist

ChIP-seq will often show the presence of common artefacts such as ultra-high signal regions or Blacklists. Such regions can confound peak calling, fragment length estimation and QC metrics.

For more information on Blacklists see the Deeptools guide or our paper.

igv
45 / 73

Blacklist

We can retrieve Blacklists from the Encode portal for human and mouse. Here I have pre-downloaded the hg19 Blacklist as a file and put in the beds/ directory.

blacklist <- rtracklayer::import("data/beds/ENCFF001TDO.bed.gz")
blacklist
## GRanges object with 411 ranges and 2 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 564450-570371 * | High_Mappability_island 1000
## [2] chr1 724137-727043 * | Satellite_repeat 1000
## [3] chr1 825007-825115 * | BSR/Beta 1000
## [4] chr1 2583335-2634374 * | Low_mappability_island 1000
## [5] chr1 4363065-4363242 * | (CATTC)n 1000
## ... ... ... ... . ... ...
## [407] chrY 28555027-28555353 * | TAR1 1000
## [408] chrY 28784130-28819695 * | Satellite_repeat 1000
## [409] chrY 58819368-58917648 * | (CATTC)n 1000
## [410] chrY 58971914-58997782 * | (CATTC)n 1000
## [411] chrY 59361268-59362785 * | TAR1 1000
## -------
## seqinfo: 25 sequences from an unspecified genome; no seqlengths
46 / 73

Creating GenomicHeatmaps

The groupBy by function allows us to add additional information to our Heatmaps. Here we provide the GRanges of our Blacklist BED file and specify to include sites which do not overlap using the include_nonoverlapping argument.

zbtb1_profile_Grouped <- groupBy(zbtb1_profile,
group = blacklist,
include_nonoverlapping = TRUE)
generateEnrichedHeatmap(zbtb1_profile_Grouped)

47 / 73

Creating GenomicHeatmaps

We can take advantage of the same %over% function we saw earlier with GRanges to filter to just ranges in heatmap which do not overlap our blacklist regions.

zbtb1_profile_bl <- zbtb1_profile[!zbtb1_profile %over% blacklist]
generateEnrichedHeatmap(zbtb1_profile_bl)

48 / 73

Creating GenomicHeatmaps

Now we have created our filtered set we could export it back to Deeptools to plot using it's plotHeatmap function.

export_deepToolsMat(zbtb1_profile_bl,
con="zbtb1_Profileplyr.MAT",
overwrite = TRUE)
49 / 73

Creating GenomicHeatmaps

plotHeatmap -m zbtb1_Profileplyr.MAT.gz
-o zbtb1_Profileplyr.MAT.png
--colorList white,darkred
igv
50 / 73

Subsetting

We can subset our heatmap to only review the samples we want using indexing in R.

Our heatmaps have 3 dimension - Ranges/Rows, Bins/Columns, Samples. To subset samples then we can index our profileplyr object as in standard R.

zbtb1_profile_subset <- zbtb1_profile_bl[,,3:4]
generateEnrichedHeatmap(zbtb1_profile_subset)

51 / 73

Joining heatmaps.

We can also combine heatmaps when they are processed over the same BED file. Here we plot the ATF4 bigwig over the ZBTB1 peaks as we did earlier for ZBTB1 bigwigs.

bigWigs <- dir(full.names=TRUE,pattern = "ATF4")
atf4_profile <- BamBigwig_to_chipProfile(bigWigs,
testRanges = "data/beds/ZBTB1Peaks.bed",
style = "point",
format = "bigwig",distanceAround = 2000)
## Loading bigwig files.
## Making ChIPprofile object from signal files.
## Importing rlelist..Done
## Filtering regions which extend outside of genome boundaries.....Done
## Filtered 0 of 593 regions
## Splitting regions by Watson and Crick strand....Done
## ..Done
## Found 593 Watson strand regions
## Found 0 Crick strand regions
## Extending regions.....done
## Calculating coverage across regions
## Calculating per contig.
## contig: 1
## contig: 2
## contig: 3
## contig: 4
## contig: 5
## contig: 6
## contig: 7
## contig: 8
## contig: 9
## contig: 10
## contig: 11
## contig: 12
## contig: 13
## contig: 14
## contig: 15
## contig: 16
## contig: 17
## contig: 18
## contig: 19
## contig: 20
## contig: 21
## contig: 22
## contig: 23
## contig: 24
## Creating ChIPprofile.
atf4_profile <- as_profileplyr(atf4_profile)
atf4_profile_bl <- atf4_profile[!atf4_profile %over% blacklist]
52 / 73

Joining samples

We can then join our ZBTB1 and ATF4 bigwigs using the common R combine function c().

ztbtb1Andatf4_profile <- c(zbtb1_profile_subset,atf4_profile_bl)
generateEnrichedHeatmap(ztbtb1Andatf4_profile)

53 / 73

Updating sample names

Sample names in these heatmaps are by default based on the name of the input bigwigs. We can update the samplenames using the rownames() and sampleData() functions with our profileplyr object.

rownames(sampleData(ztbtb1Andatf4_profile)) <- c("ZBTB1_1","ZBTB1_2","ATF4_1")
generateEnrichedHeatmap(ztbtb1Andatf4_profile)

54 / 73

Updating sample names

We can directly change the sample names when plotting the heatmap too by using the sample_names argument.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
sample_names=c("zbtb_A","zbtb_B","atf4_A"))

55 / 73

Updating sample information

We can also use the sampleData() function to add any additional information to our samples. Here we add information on Antibody.

We can add any arbitary column name we need using the sampleData() function.

sampleData(ztbtb1Andatf4_profile)$Antibody <- c("ZBTB1","ZBTB1","ATF4")
sampleData(ztbtb1Andatf4_profile)$Grade <- c("Favourite","Good","Good")
56 / 73

Updating sample information

We can use the information on samples then when we plot our heatmap. Here we tell the generateEnrichedHeatmap() function to colour our heatmaps by their Antibody information we just added.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody")

57 / 73

Clustering ranges by summarised signal

Often we want to group ranges across samples by the signal within them. The profileplyr package allows the user to summarise signal within a region and cluster rows according to user defined metrics.

Here we summarise each range by taking the sum of signal within ranges specified in the fun paramter and create two clusters from the data specifed by the cutree_rows paramter.

ztbtb1Andatf4_profile <- clusterRanges(ztbtb1Andatf4_profile,
fun = rowSums,
cutree_rows = 2)
## Hierarchical clustering used. It is advised to avoid this option with large matrices as the clustering can take a long time. Kmeans is more suitable for large matrices.
## A column has been added to the range metadata with the column name 'cluster', and the 'rowGroupsInUse' has been set to this column.
58 / 73

Clustering ranges by summarised signal

We can now review the clustering by simply passing the new profileplyr object to the generateEnrichedHeatmap function.

We can see that it has created two clusters which appear to contain high or low ATF4 signal.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody")

59 / 73

Adding Groups with names.

If we have an additional set of regions which we wish to assess the overall ATF4 signal within, we can use the groupBy fnction again to add this information. Here we additionally specify group names in the GRanges_names argument.

ATF4ANDZBTB1_Peaks <- import.bed("data/beds/ATF4ANDZBTB1_Within2000kb.bed")
ztbtb1Andatf4_profile <- groupBy(ztbtb1Andatf4_profile,
group = ATF4ANDZBTB1_Peaks,
GRanges_names = "ProximalPeaks",
include_nonoverlapping = TRUE)
## A column has been added to the range metadata with the column name 'GR_overlap_names' that specifies the GRanges each range overlaps with, but the inherited groups are not included.
60 / 73

Joining samples

If we wish to now review the relationship between the clustering and the GRanges group we can use the extra_annotation_columns argument to specify additional information to add to the Heatmap.

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody",
extra_annotation_columns = "cluster")

61 / 73

Summarising

Now we have our heatmap organised as we require, we may wish to assess the differences in signal within the clusters over the heatmap.

The summarize function allows us to again summarise the ranges by a user-defined metric and export the results as a data.frame suitbale for plotting.

ztbtb1Andatf4_summarized <- summarize(ztbtb1Andatf4_profile,
rowSums,
output="long",
sampleData_columns_for_longPlot = "Antibody")
ztbtb1Andatf4_summarized[1:4,]
## GR_overlap_names combined_ranges Sample Signal Antibody
## 1 ProximalPeaks chr1_21042561_21046561 ZBTB1_1 191.10911 ZBTB1
## 2 ProximalPeaks chr1_181127059_181131059 ZBTB1_1 311.86117 ZBTB1
## 3 ProximalPeaks chr1_9469096_9473096 ZBTB1_1 95.21104 ZBTB1
## 4 ProximalPeaks chr1_44494345_44498345 ZBTB1_1 136.83766 ZBTB1
62 / 73

Plotting

Now we have our long format data.frame ready we can pass this to ggplot to visualise as violin plots.

require(ggplot2)
ggplot(ztbtb1Andatf4_summarized,
aes(y=Signal,x=Sample,fill=GR_overlap_names))+
facet_wrap(Antibody~.,scales = "free")+
geom_violin()+scale_y_log10()

63 / 73

Annotating ranges

So far we have been working with genomic locations without any annotation to genes they may regulate.

We can easily annotate our ranges in R using the GREAT software in profileplyr by using the annotateRanges_great function.

ztbtb1Andatf4_profile <- annotateRanges_great(ztbtb1Andatf4_profile,
species = "hg19")

64 / 73

Annotating to genes

All the annotation is now contained in the row information for the profileplyr object and can be retieved using the rowRanges() function.

annotatedRanges <- rowRanges(ztbtb1Andatf4_profile)
annotatedRanges[1,]
## GRanges object with 1 range and 13 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## giID484 chr1 1508087-1512087 + | chr1:1508899-1511276 0
## sgGroup giID names cluster
## <factor> <character> <character> <factor>
## giID484 ZBTB1Peaks.bed giID484 chr1:1508899-1511276 1
## hierarchical_order overlap_matrix GR_overlap_names name.1
## <integer> <matrix> <factor> <character>
## giID484 252 0 no_overlap <NA>
## score.1 SYMBOL distanceToTSS
## <numeric> <character> <numeric>
## giID484 NA SSU72 162
## -------
## seqinfo: 24 sequences from an unspecified genome; no seqlengths
65 / 73

Annotating to genes

If we want to we can write this information to a table to review in outher software or a BED file to review in IGV.

write.table(annotatedRanges,"annotatedRanges.csv",sep=",",row.names=FALSE)
export.bed(annotatedRanges,"annotatedRanges.bed")
66 / 73

Annotating to genes

We can use the annotation to label where peaks closest to our genes of interest are in the heatmap

generateEnrichedHeatmap(ztbtb1Andatf4_profile,
color_by_sample_group = "Antibody",
extra_annotation_columns = "cluster",
genes_to_label = "ASNS",
gene_label_font_size = 18)

67 / 73

Grouping by genes

We can also use the annotation for grouping peaks by their nearest genes. Here we have downloaded a GMT file from GSEA's MsigDB.

We can use the GSEAbase package to read in the data from the GMT file and the get the genes in amin oacid metabolism.

require(GSEABase)
aa_meta <- getGmt("data/GeneSets/aminoacid_metabolism.gmt")
aa_meta_Genes <- geneIds(aa_meta)
68 / 73

Grouping by genes

Now we have named list of the genes of interest, we can use this group our heatmap using the groupBy function and produce our final heatmap.

Here the group argument accepts a list of gene names instead of a GRanges.

ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile,
group = aa_meta_Genes,
include_nonoverlapping = TRUE)
69 / 73

Grouping by genes

Finally we can plot grouped by our gene-set of interest.

generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
color_by_sample_group = "Antibody",
genes_to_label = "ASNS",
gene_label_font_size = 18)

70 / 73

Grouping by genes

The overlap is quite small so we regroup by a different column and add the gene set as annotation instead by specify "GL_overlap_names" to the extra_annotation_columns parameter

ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile2,group="cluster")
generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
color_by_sample_group = "Antibody",genes_to_label = "ASNS",
extra_annotation_columns = "GL_overlap_names",
gene_label_font_size = 18)

71 / 73

A more comprehensive workflow

Doug Barrows has compiled a fantastic workflow on his site which talks through more details.

Please work through this workflow here, and ask any questions as needed.

72 / 73

Contact

Any suggestions, comments, edits or questions (about content or the slides themselves) please reach out to our GitHub and raise an issue.

73 / 73

Overview

  • Set up
  • Raw data to visualisations
    • Raw data to visualisations in R (Quick)
  • Visualising signal over genomic region sets
2 / 73
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow