Genomic Heatmaps and Profiles

class: center, middle, inverse, title-slide

# Genomic Heatmaps and Profiles
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Genomic_HeatmapsAndProfiles/" class="uri">http://rockefelleruniversity.github.io/Genomic_HeatmapsAndProfiles/</a>

---

## Overview

- Set up
- Raw data to visualisations
  - Raw data to visualisations in R (Quick)
- Visualising signal over genomic region sets

---
class: inverse, center, middle

# Set up

---
## Materials

All prerequisites, links to material and slides for this course can be found on github.

* [GenomicHeatmapsAndProfiles](https://rockefelleruniversity.github.io/Genomic_HeatmapsAndProfiles/)

Or can be downloaded as a zip archive from here.

* [Download zip](https://github.com/rockefelleruniversity/Genomic_HeatmapsAndProfiles/zipball/master)

---
## Presentations, source code and practicals

Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

* **presentations/slides/**
Presentations as an HTML slide show.
* **presentations/singlepage/** 
Presentations as an HTML single page.
* **presentations/r_code/**
R code in presentations.

---
## Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived GenomicHeatmapsAndProfiles folder in the Rstudio menu.

**Session -> Set Working Directory -> Choose Directory**

or in the console.

```r
setwd("/PathToMyDownload/GenomicHeatmapsAndProfiles/r_course")
## For Fuchs Lab
# e.g. setwd("/Workshop_Data/TestFolder/")
```

---
class: inverse, center, middle

# Raw data to visualisations

---
## Genomics data visualisations

A common task in genomics is to review your data in your favourite genome browser.
Here we are using IGV and [our material is available here](https://rockefelleruniversity.github.io/IGV_course/)

---
## Genomics data visualisations

To visualise our data we first need to get our data in a suitable format for visualisation. For more information on file formats see [our material is available here](https://rockefelleruniversity.github.io/Genomic_Data/)

High throughput Sequencing data is typically delivered as unaligned sequences in [FastQ format](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#12).

**FastQ (FASTA with Qualities)**
<div align="center">
<img src="imgs/fq1.png" alt="igv" height="200" width="800">
</div>

- "@" followed by identifier.
- Sequence information.
- "+" 
- Quality scores encodes as ASCI.

---
## Genomics data visualisations

We need to work with aligned data for visualisation and analysis in our genome of choice. Aligned sequence data is often stored in [SAM format](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#15).

**SAM - Sequence Alignment Map**

- Contains read and alignment information including genomic location

---
## Genomics data visualisations

Typically for visualisation in genome browsers we require only a subset of this information - Depth of reads/Signal at genomic locations. 
We can then take advantage of more reduced formats such as [bedGraph](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#27) (an extension of [BED](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#27)).

**bed(Browser Extensible Data)Graph**

.pull-left[
 <div align="center">
<img src="imgs/bedgraph3.png" alt="igv" height="250" width="400">
</div>
  ]
.pull-right[
- BED 3 format
    - Chromosome
    - Start 
    - End
    
4th column - Score
  ]

---
## Genomics data visualisations

These format are easy for us to read but inefficent for software to work with. When using software we will most often work with [highly compressed and indexed equivalent files.](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#39)

* SAM -- BAM (with a .bai index)
* wig/bedGraph -- bigWig

---
class: inverse, center, middle

# Raw data to visualisations in R (Quick)

---
## Full workflows available.

In the next few slides we are going to run through an alignment of ChIP-seq FastQ to sorted and indexed BAM files and then create a bigWig.

For a full summary of this processing please see our relevant courses

* [RNA-seq](https://rockefelleruniversity.github.io/RU_RNAseq/)
* [ChIP-seq](https://rockefelleruniversity.github.io/RU_ChIPseq/)
* [ATAC-seq](https://rockefelleruniversity.github.io/RU_ATACseq/)

---
## A reference genome.

To align our raw fastQ data we will need a reference genome in [FASTA format](https://rockefelleruniversity.github.io/Genomic_Data//r_course/presentations/slides/GenomicsData.html#9) to align against.

We can retrieve this from [Ensembl FTP](https://www.ensembl.org/info/data/ftp/index.html) download page for many organisms.

---
## A reference genome in R.

We can also [extract genomes from Bioconductor objects](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/SequencesInBioconductor.html#3) inside of R. Here we use the **BSgenome.Hsapiens.UCSC.hg19** to [extract the sequence for major chromosomes and write out to a FASTA file](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/SequencesInBioconductor.html#4).

```r
require(BSgenome.Hsapiens.UCSC.hg19)
mainChromosomes <- paste0("chr",c(1:22,"X","Y","M"))
mainChrSeq <- lapply(mainChromosomes,
                     function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
writeXStringSet(mainChrSeqSet,
                "BSgenome.Hsapiens.UCSC.hg19_majorChr.fa")
```

---
## FastQ to BAM and bigWigs.

Once we have the FASTA file available we can [prepare the index of FASTA for alignment using the Rsubread package](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#39).
The **indexSplit** paramter here will save us memory in alignment but may cost us some speed.

```r
require(Rsubread)
buildindex("BSgenome.Hsapiens.UCSC.hg19",
           "BSgenome.Hsapiens.UCSC.hg19_majorChr.fa",
           indexSplit = TRUE)
```

---
## FastQ to BAM and bigWigs.

Once we have our index we can [align our FastQ to the genome using the **align** function](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#40) to create a BAM file.

```r
align("BSgenome.Hsapiens.UCSC.hg19",
      "ENCFF001NQP.fastq.gz",
      output_file="ENCFF001NQP.bam",
      type="dna")
```
---
## FastQ to BAM and bigWigs.

We can then [sort and index this BAM file to be ready for use in external software by using the R interfact to samtools (Rsamtools)](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#19)

```r
require(Rsamtools)
sortBam("ENCFF001NQP.bam","Sorted_ENCFF001NQP")
indexBam("Sorted_ENCFF001NQP.bam")
```

---
## FastQ to BAM and bigWigs.

Now we have a indexed BAM file we can easily extract the total mapped reads.

```r
MappedReads <- idxstatsBam("Sorted_ENCFF001NQP.bam")
TotalMappedReads <- sum(MappedReads$mapped)
```

```
## [1] 10302212
```

---
## Make a bigWig.

We can then make our bigWig in a few lines of R [using the **rtracklayer** package.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/Summarising_Scores_In_Bioconductor.html#24)

```r
require(rtracklayer)
coverageOverGenome <- coverage("Sorted_ENCFF001NQP.bam",
                               weight = (10^6)/TotalMappedReads)
export.bw(coverageOverGenome,con="Sorted_ENCFF001NQP.bw")
```

We may also want to extend reads to *predicted* fragment length before generating coverage.[For more details on this see our material here](https://rockefelleruniversity.github.io/RU_ATACseq/atacseq_course/Presentations/Slides/RU_ATAC.html#37).
---
class: inverse, center, middle

# Visualising signal over genomic region sets.

---
## Review in IGV

We can easily review our signal data in IGV at a simgle locus to assess the relationship between samples/conditions/antibodies.

---
## Multipanel review in IGV

In IGV [we can also review multiple loci](https://rockefelleruniversity.github.io/IGV_course/r_course/presentations/slides/IGV.html#24) in one screen to get an overview of signal over a group of regions.

---
## Visualisation in heatmaps

Often we want to review signal within and between samples/conditions over 1000s of sites. To do this we can take advantage of genomic heatmaps and profiles.

.pull-left[
Two major softwares to produce genomic heatmaps
* Deeptools
* EnrichedHeatmap
<div align="center">
<img src="imgs/exampleDeeptools.png" alt="igv" height="300" width="300">
</div>
  ]
.pull-right[
 <div align="center">
<img src="imgs/deeptools.png" alt="igv" height="200" width="200">
</div>
 <div align="center">
<img src="imgs/enrichedheatmap.png" alt="igv" height="200" width="200">
</div>
  ]

---
## ProfilePlyr package

Many requests for complex heatmaps and summary statistics of data visualised within.

.pull-left[
Profileplyr is available from Bioconductor and developed with Doug Barrows to 
- Allow import/export between Deeptools and R framework
- Provide easy tools to manipulate and summarise heatmap data.
  ]
.pull-right[
 <div align="center">
<img src="imgs/ppplyer.png" alt="igv" height="400" width="400">
</div>
]

---
## Getting hold of DeepTools

You will not require Deeptools to follow the course but if you would like to install this can be done easily through [Conda](https://anaconda.org/bioconda/deeptools) or [Docker](https://deeptools.readthedocs.io/en/develop/content/installation.html#installation-with-docker).

For more information on [Deeptools you can see our course here.](https://rockefelleruniversity.github.io/RU_VisualizingGenomicsData/viz_course/Presentations/Slides/deeptools_lecture_20190208.html#1)
<div align="center">
<img src="imgs/bioconda_deeeptools.png" alt="igv" height="400" width="600">
</div>

---
## Getting hold of data.

First we need to get hold of some data to plot.

For demonstration we will take a small processed data set of bigWigs from a paper containing ChIP-seq for the transcription factos ZBTB1 and ATf4.

We can retrieve raw and processed data for GSM4332935, GSM4332934, GSM4332940, GSM4332950 and GSM4332944 directly from [GEO](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE145783).

---
## Getting hold of data.

We can also use some of the tools in Bioconductor to retrieve supplementary files from GEO directly.

To do this we will use the GEOquery package. We can install this following the commands on the Bioconductor page for [GEOquery](https://www.bioconductor.org/packages/release/bioc/html/GEOquery.html).

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("GEOquery")
```

---
## Getting hold of data.

We can use the **getGEOSuppFiles** function to download any supplementary files to our local working directory.

```r
require(GEOquery)

GSM4332935_BigWig <- getGEOSuppFiles("GSM4332935",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332934_BigWig <- getGEOSuppFiles("GSM4332934",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332944_BigWig <- getGEOSuppFiles("GSM4332944",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332940_BigWig <- getGEOSuppFiles("GSM4332940",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
GSM4332950_BigWig <- getGEOSuppFiles("GSM4332950",fetch_files = TRUE,
                                     makeDirectory = FALSE,
                                     filter_regex = "*.bw")
```

---
## Regions of interest

We want to plot our signal over selected sets of regions. We will import our regions from a BED file using the import.bed function in **rtracklayer** package

```r
require(rtracklayer)
ZBTB1_GR <- import.bed("data/beds/ZBTB1Peaks.bed")
ZBTB1_GR
```

```
## GRanges object with 593 ranges and 2 metadata columns:
##         seqnames              ranges strand |                      name
##            <Rle>           <IRanges>  <Rle> |               <character>
##     [1]     chr2   70660347-70663275      * |    chr2:70660347-70663275
##     [2]    chr12     6036980-6042542      * |     chr12:6036980-6042542
##     [3]    chr13 113526647-113528481      * | chr13:113526647-113528481
##     [4]     chr5       881483-884047      * |        chr5:881483-884047
##     [5]    chr17   79397761-79400379      * |   chr17:79397761-79400379
##     ...      ...                 ...    ... .                       ...
##   [589]    chr11     9024476-9026040      * |     chr11:9024476-9026040
##   [590]     chr1   58387303-58388627      * |    chr1:58387303-58388627
##   [591]    chr17   32834453-32835985      * |   chr17:32834453-32835985
##   [592]     chrY   13683928-13693098      * |    chrY:13683928-13693098
##   [593]    chr10   90901936-90905010      * |   chr10:90901936-90905010
##             score
##         <numeric>
##     [1]         0
##     [2]         0
##     [3]         0
##     [4]         0
##     [5]         0
##     ...       ...
##   [589]         0
##   [590]         0
##   [591]         0
##   [592]         0
##   [593]         0
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
```

---
## GRanges

The [BED file regions are stored in R as a **GRanges** object.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicIntervals_In_Bioconductor.html#1)

**GRanges** objects provide much of the functionality seen in **BedTools**.

```r
ZBTB1_GR
```

---
## Overlaps in GRanges

One of the most useful function for GRanges objects is the ability to intersect/overlap differing regions.

To demonstate first we will read in a BED file of promoter positions for hg19.

```
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
```

```r
require(rtracklayer)
promoter_positions <- import.bed("data/beds/Promoter_Positions.bed")
promoter_positions
```

```
## GRanges object with 23056 ranges and 2 metadata columns:
##           seqnames              ranges strand |        name     score
##              <Rle>           <IRanges>  <Rle> | <character> <numeric>
##       [1]    chr19   58874015-58876214      - |           1         0
##       [2]     chr8   18246755-18248954      + |          10         0
##       [3]    chr20   43280177-43282376      - |         100         0
##       [4]    chr18   25757246-25759445      - |        1000         0
##       [5]     chr1 244006687-244008886      - |       10000         0
##       ...      ...                 ...    ... .         ...       ...
##   [23052]     chr9 115095745-115097944      - |        9991         0
##   [23053]    chr21   35734323-35736522      + |        9992         0
##   [23054]    chr22   19109768-19111967      - |        9993         0
##   [23055]     chr6   90537619-90539818      + |        9994         0
##   [23056]    chr22   50964706-50966905      - |        9997         0
##   -------
##   seqinfo: 39 sequences from an unspecified genome; no seqlengths
```

---
## Overlaps in GRanges

The function **%over%** allows us to evaluate which regions overlap/intersect between two region sets.

* A[A %over% B] -- Returns all regions in A which overlap regions in B
* B[B %over% A] -- Returns all regions in B which  overlap regions in A
* A[!A %over% B] -- Returns all regions in A which do not overlap regions in B

---
## Overlaps in GRanges

So in practice we can easily extract our ZBTB1 peaks which do or do not overlap with promoters.
We can then use the **export.bed** function to write the overlapping GRanges to a BED file

```r
ZBTB1_In_Promoters <- ZBTB1_GR[ZBTB1_GR %over% promoter_positions]
ZBTB1_Outside_Promoters <- ZBTB1_GR[!ZBTB1_GR %over% promoter_positions]
export.bed(ZBTB1_In_Promoters,
           con="data/beds/ZBTB1_In_Promoters.bed")
```

---
## Heatmaps in Deeptools
The DeepTools software provides a set of tools to convert between file types (i.e. BAM to bigWig) and to plot signal from BAM or bigWig over a set of regions as a heatmap.

[Deeptools course for more information](https://rockefelleruniversity.github.io/RU_VisualizingGenomicsData/viz_course/Presentations/Slides/deeptools_lecture_20190208.html#2)

---
## Heatmaps in Deeptools

Two main workhorse functions allow you create these heatmaps in deeptools

* computeMatrix - Create an intermediate file of signal for heatmap.
* plotHeatmap - Generate the heatmaps from intermediate file.

---
## Creating GenomicHeatmaps

```bash
computeMatrix reference-point --referencePoint center 
-bs binSize -a BP_after_BED -b BP_before_BED 
-S  BIGWIGSofINTEREST 
--regionsFileName BEDFILEOFINTEREST 
--outFileName OUTFILE

```

The computeMatrix function take a set of arguments to generate an intermediate file

* reference-point --referencePoint  = Specify position in regions in BED file to centre heatmap.
* -bs  = Binsize across regions in BED file
* -b and -a = bp before and after reference point to plot in heatmap.
* -S = BigWigs to plot signal from
* --regionsFileName = BED file containing regions to plot over.
* --outFileName = Name of intermediate file generated.

---
## Creating GenomicHeatmaps

Using our data, the command looks like this.

```bash
computeMatrix reference-point --referencePoint center 
-bs 50 -a 2000 -b 2000 -p 4 
-S  GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw 
--regionsFileName ZBTB1Peaks.bed 
--outFileName Deeptools_ZBTB1Peaks.MAT 
```

---
## Creating GenomicHeatmaps

Once we have our intermediate file we can use this to create our heatmap using the **plotHeatmap** command.

```bash
plotHeatmap -m MATRIXFILE 
-o HEATMAP
--colorList lowColour,highColour 
```

* -m  = Name of intermediate file generated by Deeptools computeMatrix.
* -o  = Output file name for heatmap
* --colorList = Bottom and top of colour scale for heatmap.

---
## Creating GenomicHeatmaps

We our data, the plotHeatmap command would be.

```bash
plotHeatmap -m Deeptools_ZBTB1Peaks.MAT 
-o Deeptools_ZBTB1Peaks.MAT.png 
--colorList white,darkred 
```
<div align="center">
<img src="imgs/Deeptools_ZBTB1Peaks.MAT.png" alt="igv" height="400" width="400">
</div>

---
## Creating GenomicHeatmaps

The profileplyr package offers a similar set of functionality as the Deeptools software within R. 
To compute our intermediate matrix we can use the **BamBigwig_to_chipProfile** function with the same set of inputs as Deeptools.

```r
require(profileplyr)
bigWigs <- dir(full.names=TRUE,pattern = "FLAG")
zbtb1_profile <- BamBigwig_to_chipProfile(bigWigs,
                                   testRanges = "data/beds/ZBTB1Peaks.bed",
                                   style = "point",
                                   format = "bigwig",
                                   distanceAround = 2000)

zbtb1_profile <- as_profileplyr(zbtb1_profile)
zbtb1_profile
```

```
## class: profileplyr 
## dim: 593 200 
## metadata(0):
## assays(4): GSM4332935_Sorted_GFP_MinusN_FLAGNormalised.bw
##   GSM4332940_Sorted_GFP_PlusN_FLAGNormalised.bw
##   GSM4332945_Sorted_ZBTB_MinusN_FLAGNormalised.bw
##   GSM4332950_Sorted_ZBTB_PlusN_FLAGNormalised.bw
## rownames(593): giID450 giID451 ... giID562 giID563
## rowData names(5): name score sgGroup giID names
## colnames: NULL
## colData names(0):
```

---
## Creating GenomicHeatmaps

We can then use the **generateEnrichedHeatmap** function to create a heatmap as we did in Deeptools.

```r
generateEnrichedHeatmap(zbtb1_profile)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/zbtb1_1plot-1.png)

---
## Import from DeepTools

However you generate the intermediate file, you can import/export to Deeptools. This allows you take advantage of any functionality in Deeptools and R together.

```r
zbtb1_deeptools <- import_deepToolsMat("data/Deeptools_ZBTB1Peaks.MAT")
generateEnrichedHeatmap(zbtb1_deeptools)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/zbtb1_1Deeptools-1.png)
---
## Blacklist

ChIP-seq will often show the presence of common artefacts such as ultra-high signal regions or Blacklists. Such regions can confound peak calling, fragment length estimation and QC metrics.

For more information on Blacklists [see the Deeptools guide](https://deeptools.readthedocs.io/en/develop/content/feature/blacklist.html) or [our paper](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3989762/).

---
## Blacklist

We can retrieve Blacklists from the [Encode portal for human and mouse](https://www.encodeproject.org/annotations/ENCSR636HFF/). Here I have pre-downloaded the hg19 Blacklist as a file and put in the beds/ directory.

```r
blacklist <- rtracklayer::import("data/beds/ENCFF001TDO.bed.gz")
blacklist
```

```
## GRanges object with 411 ranges and 2 metadata columns:
##         seqnames            ranges strand |                    name     score
##            <Rle>         <IRanges>  <Rle> |             <character> <numeric>
##     [1]     chr1     564450-570371      * | High_Mappability_island      1000
##     [2]     chr1     724137-727043      * |        Satellite_repeat      1000
##     [3]     chr1     825007-825115      * |                BSR/Beta      1000
##     [4]     chr1   2583335-2634374      * |  Low_mappability_island      1000
##     [5]     chr1   4363065-4363242      * |                (CATTC)n      1000
##     ...      ...               ...    ... .                     ...       ...
##   [407]     chrY 28555027-28555353      * |                    TAR1      1000
##   [408]     chrY 28784130-28819695      * |        Satellite_repeat      1000
##   [409]     chrY 58819368-58917648      * |                (CATTC)n      1000
##   [410]     chrY 58971914-58997782      * |                (CATTC)n      1000
##   [411]     chrY 59361268-59362785      * |                    TAR1      1000
##   -------
##   seqinfo: 25 sequences from an unspecified genome; no seqlengths
```

---
## Creating GenomicHeatmaps

The **groupBy** by function allows us to add additional information to our Heatmaps. Here we provide the GRanges of our Blacklist BED file and specify to include sites which do not overlap using the **include_nonoverlapping** argument.

```r
zbtb1_profile_Grouped <- groupBy(zbtb1_profile,
                                 group = blacklist,
                                 include_nonoverlapping = TRUE)
generateEnrichedHeatmap(zbtb1_profile_Grouped)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/grouped-1.png)

---
## Creating GenomicHeatmaps

We can take advantage of the same **%over%** function we saw earlier with GRanges to filter to just ranges in heatmap which do not overlap our blacklist regions.

```r
zbtb1_profile_bl <- zbtb1_profile[!zbtb1_profile %over% blacklist]
generateEnrichedHeatmap(zbtb1_profile_bl)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/bled-1.png)

---
## Creating GenomicHeatmaps

Now we have created our filtered set we could export it back to Deeptools to plot using it's plotHeatmap function.

```r
export_deepToolsMat(zbtb1_profile_bl,
                    con="zbtb1_Profileplyr.MAT",
                    overwrite = TRUE)
```

---
## Creating GenomicHeatmaps

```bash

plotHeatmap -m zbtb1_Profileplyr.MAT.gz 
            -o zbtb1_Profileplyr.MAT.png 
            --colorList white,darkred
```

---
## Subsetting

We can subset our heatmap to only review the samples we want using indexing in R.

Our heatmaps have 3 dimension - Ranges/Rows, Bins/Columns, Samples. To subset samples then we can index our profileplyr object as in standard R.

```r
zbtb1_profile_subset <- zbtb1_profile_bl[,,3:4]
generateEnrichedHeatmap(zbtb1_profile_subset)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/zbtb_Sub-1.png)

---
## Joining heatmaps.

We can also combine heatmaps when they are processed over the same BED file. Here we plot the ATF4 bigwig over the ZBTB1 peaks as we did earlier for ZBTB1 bigwigs.

```r
bigWigs <- dir(full.names=TRUE,pattern = "ATF4")
atf4_profile <- BamBigwig_to_chipProfile(bigWigs,
                                   testRanges = "data/beds/ZBTB1Peaks.bed",
                                   style = "point",
                                   format = "bigwig",distanceAround = 2000)
```

```
## Loading bigwig files.
```

```
## Making ChIPprofile object from signal files.
```

```
## Importing rlelist..Done
## Filtering regions which extend outside of genome boundaries.....Done
## Filtered 0 of 593 regions
## Splitting regions by Watson and Crick strand....Done
## ..Done
## Found 593 Watson strand regions
## Found 0 Crick strand regions
## Extending regions.....done
## Calculating coverage across regions
## Calculating per contig. 
## contig: 1
## contig: 2
## contig: 3
## contig: 4
## contig: 5
## contig: 6
## contig: 7
## contig: 8
## contig: 9
## contig: 10
## contig: 11
## contig: 12
## contig: 13
## contig: 14
## contig: 15
## contig: 16
## contig: 17
## contig: 18
## contig: 19
## contig: 20
## contig: 21
## contig: 22
## contig: 23
## contig: 24
## Creating ChIPprofile.
```

```r
atf4_profile <- as_profileplyr(atf4_profile)
atf4_profile_bl <- atf4_profile[!atf4_profile %over% blacklist]
```

---
## Joining samples

We can then join our ZBTB1 and ATF4 bigwigs using the common R combine function **c()**.

```r
ztbtb1Andatf4_profile <- c(zbtb1_profile_subset,atf4_profile_bl)
generateEnrichedHeatmap(ztbtb1Andatf4_profile)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/JoinAndPlot-1.png)

---
## Updating sample names

Sample names in these heatmaps are by default based on the name of the input bigwigs. We can update the samplenames using the **rownames()** and **sampleData()** functions with our profileplyr object.

```r
rownames(sampleData(ztbtb1Andatf4_profile)) <- c("ZBTB1_1","ZBTB1_2","ATF4_1")
generateEnrichedHeatmap(ztbtb1Andatf4_profile)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/sampleUpdate-1.png)
---
## Updating sample names

We can directly change the sample names when plotting the heatmap too by using the **sample_names** argument.

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        sample_names=c("zbtb_A","zbtb_B","atf4_A"))
```

![](GenomicHeatmapsAndProfiles_files/figure-html/sampleUpdate22-1.png)

---
## Updating sample information

We can also use the **sampleData()** function to add any additional information to our samples. Here we add information on Antibody.

We can add any arbitary column name we need using the **sampleData()** function.

```r
sampleData(ztbtb1Andatf4_profile)$Antibody <- c("ZBTB1","ZBTB1","ATF4")
sampleData(ztbtb1Andatf4_profile)$Grade <- c("Favourite","Good","Good")
```

---
## Updating sample information

We can use the information on samples then when we plot our heatmap. Here we tell the **generateEnrichedHeatmap()** function to colour our heatmaps by their Antibody information we just added.

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody")
```

![](GenomicHeatmapsAndProfiles_files/figure-html/sampleUpdatePlot-1.png)

---
## Clustering ranges by summarised signal

Often we want to group ranges across samples by the signal within them. The profileplyr package allows the user to summarise signal within a region and cluster rows according to user defined metrics.

Here we summarise each range by taking the sum of signal within ranges specified in the **fun** paramter and create two clusters from the data specifed by the **cutree_rows** paramter.

```r
ztbtb1Andatf4_profile <- clusterRanges(ztbtb1Andatf4_profile,
                        fun = rowSums,
                        cutree_rows = 2)
```

```
## Hierarchical clustering used. It is advised to avoid this option with large matrices as the clustering can take a long time. Kmeans is more suitable for large matrices.
```

```
## A column has been added to the range metadata with the column name 'cluster', and the 'rowGroupsInUse' has been set to this column.
```

---
## Clustering ranges by summarised signal

We can now review the clustering by simply passing the new profileplyr object to the **generateEnrichedHeatmap** function.

We can see that it has created two clusters which appear to contain high or low ATF4 signal.

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody")
```

![](GenomicHeatmapsAndProfiles_files/figure-html/clusterAndPlot2-1.png)

---
## Adding Groups with names.

If we have an additional set of regions which we wish to assess the overall ATF4 signal within, we can use the **groupBy** fnction again to add this information. Here we additionally specify group names in the **GRanges_names** argument.

```r
ATF4ANDZBTB1_Peaks <- import.bed("data/beds/ATF4ANDZBTB1_Within2000kb.bed")
ztbtb1Andatf4_profile <- groupBy(ztbtb1Andatf4_profile,
                                 group = ATF4ANDZBTB1_Peaks,
                                 GRanges_names = "ProximalPeaks",
                                 include_nonoverlapping = TRUE)
```

```
## A column has been added to the range metadata with the column name 'GR_overlap_names' that specifies the GRanges each range overlaps with, but the inherited groups are not included.
```

---
## Joining samples

If we wish to now review the relationship between the clustering and the GRanges group we can use the **extra_annotation_columns** argument to specify additional information to add to the Heatmap.

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody",
                        extra_annotation_columns = "cluster")
```

![](GenomicHeatmapsAndProfiles_files/figure-html/groupAndPlot2-1.png)

---
## Summarising

Now we have our heatmap organised as we require, we may wish to assess the differences in signal within the clusters over the heatmap.

The **summarize** function allows us to again summarise the ranges by a user-defined metric and export the results as a data.frame suitbale for plotting.

```r
ztbtb1Andatf4_summarized <- summarize(ztbtb1Andatf4_profile,
                        rowSums,
                        output="long",
                       sampleData_columns_for_longPlot = "Antibody")
ztbtb1Andatf4_summarized[1:4,]
```

```
##   GR_overlap_names          combined_ranges  Sample    Signal Antibody
## 1    ProximalPeaks   chr1_21042561_21046561 ZBTB1_1 191.10911    ZBTB1
## 2    ProximalPeaks chr1_181127059_181131059 ZBTB1_1 311.86117    ZBTB1
## 3    ProximalPeaks     chr1_9469096_9473096 ZBTB1_1  95.21104    ZBTB1
## 4    ProximalPeaks   chr1_44494345_44498345 ZBTB1_1 136.83766    ZBTB1
```

---
## Plotting

Now we have our long format data.frame ready we can pass this to ggplot to visualise as violin plots.

```r
require(ggplot2)
ggplot(ztbtb1Andatf4_summarized,
       aes(y=Signal,x=Sample,fill=GR_overlap_names))+
  facet_wrap(Antibody~.,scales = "free")+
  geom_violin()+scale_y_log10()
```

![](GenomicHeatmapsAndProfiles_files/figure-html/summarisePlot-1.png)

---
## Annotating ranges

So far we have been working with genomic locations without any annotation to genes they may regulate.

We can easily annotate our ranges in R using the GREAT software in profileplyr by using the annotateRanges_great function.

```r
ztbtb1Andatf4_profile <- annotateRanges_great(ztbtb1Andatf4_profile,
                                              species = "hg19")
```

![](GenomicHeatmapsAndProfiles_files/figure-html/annotate-1.png)

---
## Annotating to genes

All the annotation is now contained in the row information for the profileplyr object and can be retieved using the **rowRanges()** function.

```r
annotatedRanges <- rowRanges(ztbtb1Andatf4_profile)
annotatedRanges[1,]
```

```
## GRanges object with 1 range and 13 metadata columns:
##           seqnames          ranges strand |                 name     score
##              <Rle>       <IRanges>  <Rle> |          <character> <numeric>
##   giID484     chr1 1508087-1512087      + | chr1:1508899-1511276         0
##                  sgGroup        giID                names  cluster
##                 <factor> <character>          <character> <factor>
##   giID484 ZBTB1Peaks.bed     giID484 chr1:1508899-1511276        1
##           hierarchical_order overlap_matrix GR_overlap_names      name.1
##                    <integer>       <matrix>         <factor> <character>
##   giID484                252              0       no_overlap        <NA>
##             score.1      SYMBOL distanceToTSS
##           <numeric> <character>     <numeric>
##   giID484        NA       SSU72           162
##   -------
##   seqinfo: 24 sequences from an unspecified genome; no seqlengths
```

---
## Annotating to genes

If we want to we can write this information to a table to review in outher software or a BED file to review in IGV.

```r
write.table(annotatedRanges,"annotatedRanges.csv",sep=",",row.names=FALSE)
export.bed(annotatedRanges,"annotatedRanges.bed")
```

---
## Annotating to genes

We can use the annotation to label where peaks closest to our genes of interest are in the heatmap

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile,
                        color_by_sample_group = "Antibody",
                        extra_annotation_columns = "cluster",
                        genes_to_label = "ASNS",
                        gene_label_font_size = 18)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/annotated3-1.png)

---
## Grouping by genes

We can also use the annotation for grouping peaks by their nearest genes. Here we have downloaded a GMT file from [GSEA's MsigDB](https://www.gsea-msigdb.org/gsea/msigdb/index.jsp).

We can use the GSEAbase package to read in the data from the GMT file and the get the genes in amin oacid metabolism.

```r
require(GSEABase)
aa_meta <- getGmt("data/GeneSets/aminoacid_metabolism.gmt")
aa_meta_Genes <- geneIds(aa_meta)
```

---
## Grouping by genes

Now we have named list of the genes of interest, we can use this group our heatmap using the **groupBy** function and produce our final heatmap.

Here the **group** argument accepts a list of gene names instead of a GRanges.

```r
ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile,
                                 group = aa_meta_Genes,
                                 include_nonoverlapping = TRUE)
```

---
## Grouping by genes

Finally we can plot grouped by our gene-set of interest.

```r
generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
                        color_by_sample_group = "Antibody",
                        genes_to_label = "ASNS",
                        gene_label_font_size = 18)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/annotated8-1.png)

---
## Grouping by genes

The overlap is quite small so we regroup by a different column and add the gene set as annotation instead by specify "GL_overlap_names" to the **extra_annotation_columns** parameter

```r
ztbtb1Andatf4_profile2 <- groupBy(ztbtb1Andatf4_profile2,group="cluster")
generateEnrichedHeatmap(ztbtb1Andatf4_profile2,
                        color_by_sample_group = "Antibody",genes_to_label = "ASNS",
                        extra_annotation_columns = "GL_overlap_names",
                        gene_label_font_size = 18)
```

![](GenomicHeatmapsAndProfiles_files/figure-html/annotated9-1.png)

---
## A more comprehensive workflow

Doug Barrows has compiled a fantastic workflow on his site which talks through more details.

Please work through [this workflow here](https://dougbarrows.github.io/Workflow_talk.html), and ask any questions as needed.

---
## Contact

Any suggestions, comments, edits or questions (about content or the slides themselves) please reach out to our [GitHub](https://github.com/RockefellerUniversity/Genomic_HeatmapsAndProfiles/issues) and raise an issue.