CUT&RUN In Bioconductor (part5)

class: center, middle, inverse, title-slide

.title[
# CUT&RUN In Bioconductor (part5)<br />
<html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html>
]
.author[
### Rockefeller University, Bioinformatics Resource Centre
]
.date[
### <a href="http://rockefelleruniversity.github.io/RU_CUT&RUN/" class="uri">http://rockefelleruniversity.github.io/RU_CUT&RUN/</a>
]

---

## Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived RU_Course_help folder using the RStudio menu.

**Session -> Set Working Directory -> Choose Directory**

Alternatively, set it from the console.

``` r
setwd("~/Downloads/ATAC.Cut-Run.ChIP-master/r_course")
```

---
## What we will cover

We have now called peaks, built a consensus peak set, counted reads over it, and identified differential peaks between W6 and W0 in our CUT&RUN data.

In this section we will:

* Overlap these peak sets to genomic features
  * Annotate peaks to genes
  * Perform functional enrichment for pathways and biological gene sets with the annotated peaks

---

class: inverse, center, middle

# Peak Annotation

---

## Annotation of peaks to genes

So far we have been working with CUT&RUN peaks corresponding to transcription factor binding or ATACseq peaks corresponding to open chromatin regions. Transcription factors, as implied in the name, can affect the expression of their target genes and open regions generally correlate with gene expression.

We often annotate peaks to genes to identify the likely targets of a transcription factor, or the genes regulated by a regulatory element uncovered by ATACseq. This is typically done using a simple set of rules.

A peak is typically annotated to a gene if
* It overlaps the gene.
* The gene is the closest to the peak (and within a maximum distance).
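
The two rules above can be sketched with GenomicRanges on toy data. This is only an illustration: the coordinates, the gene names and the 10kb cutoff are assumptions, and distance is measured here to the gene body, whereas ChIPseeker reports distance to the TSS.

``` r
library(GenomicRanges)

# Hypothetical toy coordinates, for illustration only
toyGenes <- GRanges("chr1",
                    IRanges(start = c(1000, 20000), end = c(5000, 25000)),
                    symbol = c("geneA", "geneB"))
toyPeaks <- GRanges("chr1",
                    IRanges(start = c(4500, 6000, 100000),
                            end = c(4700, 6200, 100200)))

# Nearest gene for every peak; the distance is 0 when peak and gene overlap
nearestHits <- distanceToNearest(toyPeaks, toyGenes)

# Assumed cutoff: ignore genes further than 10kb from the peak
maxDistance <- 10000
withinCutoff <- mcols(nearestHits)$distance <= maxDistance

# Peaks with no gene inside the cutoff stay NA
assignedGene <- rep(NA_character_, length(toyPeaks))
assignedGene[queryHits(nearestHits)[withinCutoff]] <-
    toyGenes$symbol[subjectHits(nearestHits)[withinCutoff]]
assignedGene
```

The first peak overlaps geneA, the second sits just downstream of it, and the third is too far from any gene to be assigned.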

---

## Peak annotation

A useful package for annotation of peaks to genes is **ChIPseeker**.

By using pre-defined annotation in the form of a **TxDb** object for mouse (mm10 genome), ChIPseeker will provide us with an overview of where peaks land relative to gene features and their distance to the nearest TSS.

First load the libraries we require for the next part and read in our SOX9 CUT&RUN peaks.

``` r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(ChIPseeker)
library(rtracklayer)

cnrPeaks_GR <- rtracklayer::import("data/peaks/SOX9CNR_W6_rep1_macs_peaks.narrowPeak")
```

---

## Peak annotation

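The annotatePeak function accepts a GRanges object of the regions to annotate, a TxDb object for gene locations and the name of a database object to retrieve gene names from. Here we also set the *tssRegion* argument so that promoters are defined as 1kb either side of the TSS.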

``` r
peakAnno <- annotatePeak(cnrPeaks_GR, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
    annoDb = "org.Mm.eg.db")
```

```
## >> preparing features information...		 2025-06-11 07:12:13 
## >> identifying nearest features...		 2025-06-11 07:12:13 
## >> calculating distance from peak to TSS...	 2025-06-11 07:12:14 
## >> assigning genomic annotation...		 2025-06-11 07:12:14 
## >> adding gene annotation...			 2025-06-11 07:12:24
```

```
## >> assigning chromosome lengths			 2025-06-11 07:12:24 
## >> done...					 2025-06-11 07:12:24
```

``` r
class(peakAnno)
```

```
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"
```
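
To make the idea of a window around the TSS concrete, here is a minimal sketch using the `promoters()` function from GenomicRanges on two hypothetical genes. It illustrates the concept only and is not necessarily how ChIPseeker builds its promoter regions internally.

``` r
library(GenomicRanges)

# Two hypothetical genes with the same coordinates on opposite strands
toyGenes <- GRanges("chr1",
                    IRanges(start = c(10000, 10000), end = c(20000, 20000)),
                    strand = c("+", "-"))

# 1kb either side of each TSS; the TSS is the start of a "+" strand gene
# and the end of a "-" strand gene
toyPromoters <- promoters(toyGenes, upstream = 1000, downstream = 1000)
toyPromoters
```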
---
## Peak annotation

The result is a csAnno object containing annotation for peaks and overall annotation statistics.

``` r
peakAnno
```

```
## Annotated peaks generated by ChIPseeker
## 72118/72174  peaks were annotated
## Genomic Annotation Summary:
##              Feature  Frequency
## 9           Promoter 20.3180898
## 4             5' UTR  0.1677806
## 3             3' UTR  1.4420810
## 1           1st Exon  1.2687540
## 7         Other Exon  2.5555340
## 2         1st Intron 14.3431598
## 8       Other Intron 24.3115450
## 6 Downstream (<=300)  0.1067695
## 5  Distal Intergenic 35.4862864
```

---
## Peak annotation

The csAnno object contains the information on annotation of individual peaks to genes.

To extract this from the csAnno object the ChIPseeker functions *as.GRanges* or *as.data.frame* can be used to produce the respective object with peaks and their associated genes.

``` r
annotatedPeaksGR <- as.GRanges(peakAnno)
annotatedPeaksDF <- as.data.frame(peakAnno)
```

---
## Peak annotation

``` r
annotatedPeaksGR[1:2, ]
```

```
## GRanges object with 2 ranges and 18 metadata columns:
##       seqnames          ranges strand |                   name     score
##          <Rle>       <IRanges>  <Rle> |            <character> <numeric>
##   [1]     chr1 4466289-4466395      * | SOX9CNR_W6_rep1_macs..        16
##   [2]     chr1 4785750-4785948      * | SOX9CNR_W6_rep1_macs..        53
##       signalValue    pValue    qValue      peak        annotation   geneChr
##         <numeric> <numeric> <numeric> <integer>       <character> <integer>
##   [1]     3.22302   3.93505   1.66890        13 Distal Intergenic         1
##   [2]     5.79731   8.11406   5.39043       138          Promoter         1
##       geneStart   geneEnd geneLength geneStrand      geneId
##       <integer> <integer>  <integer>  <integer> <character>
##   [1]   4492465   4493735       1271          2       20671
##   [2]   4781221   4785739       4519          2       27395
##               transcriptId distanceToTSS            ENSEMBL      SYMBOL
##                <character>     <numeric>        <character> <character>
##   [1] ENSMUST00000191939.1         27340 ENSMUSG00000025902       Sox17
##   [2] ENSMUST00000146665.2           -11 ENSMUSG00000033845      Mrpl15
##                     GENENAME
##                  <character>
##   [1] SRY (sex determining..
##   [2] mitochondrial riboso..
##   -------
##   seqinfo: 35 sequences (1 circular) from mm10 genome
```

---
## Peak annotation

The genomic annotation for each peak is in the *annotation* column and the closest gene is shown in the *geneId*, *ENSEMBL*, and *SYMBOL* columns (geneId is the Entrez ID).

``` r
annotatedPeaksGR$annotation[1:5]
```

```
## [1] "Distal Intergenic"                                 
## [2] "Promoter"                                          
## [3] "Promoter"                                          
## [4] "Promoter"                                          
## [5] "Intron (ENSMUST00000134384.7/18777, intron 9 of 9)"
```

``` r
annotatedPeaksGR$SYMBOL[1:5]
```

```
## [1] "Sox17"  "Mrpl15" "Lypla1" "Lypla1" "Lypla1"
```
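
Intron and exon annotations carry transcript details in brackets, which makes them awkward to tabulate. One possible way to collapse them into broad feature classes, sketched here in base R on the five values shown above, is to strip everything from the first bracket onwards. Note that this would also shorten a label such as "Downstream (<=300)" to "Downstream".

``` r
# The first five annotation values shown above
annotation <- c("Distal Intergenic", "Promoter", "Promoter", "Promoter",
                "Intron (ENSMUST00000134384.7/18777, intron 9 of 9)")

# Drop everything from the first " (" onwards to leave the broad feature class
broadAnnotation <- sub(" \\(.*$", "", annotation)

table(broadAnnotation)
```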

---
## Visualising peak annotation

Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker's plotting functions to display the distribution of peaks across gene features. Here we use the **plotAnnoBar** function to plot this as a bar chart, but **plotAnnoPie** would produce a similar plot as a pie chart.

``` r
plotAnnoBar(peakAnno)
```

![](Session5_Annotation_files/figure-html/unnamed-chunk-11-1.png)

---
## Visualising peak annotation

Similarly we can plot the distribution of peaks around TSS sites.

``` r
plotDistToTSS(peakAnno)
```

![](Session5_Annotation_files/figure-html/unnamed-chunk-12-1.png)

---
class: inverse, center, middle

# Gene Set Enrichment

---

## Gene Set testing for peaks

Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared biological function, common regulation in an RNAseq experiment, etc.).

A frequent step in CUT&RUN or ATACseq analysis is to test whether common gene sets are enriched for transcription factor binding, epigenetic marks, or open chromatin regions.

Sources of well-curated gene sets include the [GO consortium](http://geneontology.org/) (a gene's function, biological process and cellular localisation), [REACTOME](http://www.reactome.org/) (biological pathways) and [MSigDB](http://software.broadinstitute.org/gsea/msigdb/) (computationally and experimentally derived gene sets).

---
## Gene Set testing for peaks

Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.

How we perform this analysis will depend on the type of peaks we are interested in. There is a wide range of peak types in our data set: some in promoters, where annotation is straightforward, and many elsewhere, where annotation is trickier.

---
## Gene Set testing for peaks

We will go through two strategies:

1. using the closest gene with **ChIPseeker** followed by gene set enrichment with **clusterProfiler**. This is typically done for peaks in promoters
 
<div align="center">
<img src="imgs/gene_example_prom.png" alt="offset" height="250" width="800">
</div>

---
## Gene Set testing for peaks

We will go through two strategies:

1. using the closest gene with **ChIPseeker** followed by gene set enrichment with **clusterProfiler**. This is typically done for peaks in promoters
 
2. allowing for annotation of one peak with multiple genes using a toolset called **GREAT**. This is usually done to annotate enhancer or distal peaks.
 
<div align="center">
<img src="imgs/gene_example_enh.png" alt="offset" height="250" width="800">
</div>

---
## Choosing a peak set

Our peak set is large (~72k peaks), which will result in many genes being annotated to peaks.

Even when each peak is annotated only to its nearest gene with ChIPseeker, almost 17k genes are represented, making it unlikely we will find enrichment for any specific gene sets.

We should choose a more specific set of peaks to test.

``` r
length(unique(annotatedPeaksGR$geneId))
```

```
## [1] 16801
```
---

## Using differential peaks for test

The specific set of peaks we choose will depend on our question. In our case, we just performed differential analysis on our consensus peak set, so we can use the peaks that go up in W6 vs W0.

Here we read in the differential results and look at them.

``` r
W6MinusD0 <- rio::import("data/W6MinusD0.xlsx")

W6MinusD0[1:5, ]
```

```
##         seqnames     start       end width strand  baseMean log2FoldChange
## 1          chr12  24581325  24586154  4830      * 9192.9984      -8.891086
## 2          chr12  24575912  24578078  2167      * 2788.4081      -9.487306
## 3          chr13  23573848  23574171   324      * 1399.1613      -7.289620
## 4           chr4 135549991 135552699  2709      *  439.3033      -6.265374
## 5 chrUn_GL456383     22710     26551  3842      * 2422.7188      -5.462346
##       lfcSE      stat        pvalue          padj
## 1 0.3253001 -27.33195 1.770064e-164 4.410114e-160
## 2 0.4380051 -21.66026 4.864412e-104 6.059841e-100
## 3 0.3891231 -18.73346  2.641873e-78  2.194075e-74
## 4 0.4460385 -14.04671  8.069439e-45  5.026252e-41
## 5 0.3993478 -13.67817  1.371076e-42  6.832071e-39
```

---

## Using differential peaks for test

The gene annotation packages (e.g. ChIPseeker, GREAT) require a GRanges object. We can convert this table to a GRanges and keep key differential statistics as metadata in the GRanges object.

``` r
W6MinusD0_gr <- GRanges(seqnames = W6MinusD0$seqnames, IRanges(start = W6MinusD0$start,
    end = W6MinusD0$end), log2FoldChange = W6MinusD0$log2FoldChange, padj = W6MinusD0$padj)

W6MinusD0_gr[1:3, ]
```

```
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand | log2FoldChange         padj
##          <Rle>         <IRanges>  <Rle> |      <numeric>    <numeric>
##   [1]    chr12 24581325-24586154      * |       -8.89109 4.41011e-160
##   [2]    chr12 24575912-24578078      * |       -9.48731 6.05984e-100
##   [3]    chr13 23573848-23574171      * |       -7.28962  2.19408e-74
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths
```
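
An alternative to building the object column by column is the `makeGRangesFromDataFrame()` function from GenomicRanges, which detects the coordinate columns by name. The sketch below uses a small hypothetical table with a similar layout to the differential results. With `keep.extra.columns = TRUE` every remaining column is kept as metadata, so for a real results table you may want to select columns first.

``` r
library(GenomicRanges)

# Hypothetical table with a similar column layout to the differential results
toyResults <- data.frame(seqnames = c("chr12", "chr13"),
                         start = c(24581325, 23573848),
                         end = c(24586154, 23574171),
                         strand = "*",
                         log2FoldChange = c(-8.89, -7.29),
                         padj = c(4.4e-160, 2.2e-74))

# Coordinate columns are detected by name; remaining columns become metadata
toyResults_gr <- makeGRangesFromDataFrame(toyResults, keep.extra.columns = TRUE)
toyResults_gr
```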

---

## Using differential peaks for test

We are going to look at the peaks that are increased in W6, so the GRanges is subset to the peaks that have a fold change greater than 2 (a log2 fold change greater than 1) and an adjusted p-value less than 0.05.

We also remove peaks that aren't on the main chromosomes (usually unplaced contigs).

``` r
W6MinusD0_gr_main <- W6MinusD0_gr[as.vector(seqnames(W6MinusD0_gr)) %in% paste0("chr",
    c(1:19, "X", "Y", "M"))]
W6MinusD0_gr_up <- W6MinusD0_gr_main[W6MinusD0_gr_main$log2FoldChange > 1 & W6MinusD0_gr_main$padj <
    0.05]

W6MinusD0_gr_up
```

```
## GRanges object with 4627 ranges and 2 metadata columns:
##          seqnames              ranges strand | log2FoldChange        padj
##             <Rle>           <IRanges>  <Rle> |      <numeric>   <numeric>
##      [1]     chr3   69954132-69954644      * |        5.33160 9.93884e-36
##      [2]    chr12     9855553-9856569      * |        4.99663 2.81330e-33
##      [3]     chr7 114358259-114358706      * |        5.45989 9.14611e-31
##      [4]     chr4   97465816-97466559      * |        4.43453 7.28496e-29
##      [5]    chr13   38793431-38794003      * |        5.40520 1.40299e-28
##      ...      ...                 ...    ... .            ...         ...
##   [4623]    chr15   91037086-91037371      * |        1.59945   0.0498130
##   [4624]    chr14   11671472-11671821      * |        2.06389   0.0498410
##   [4625]    chr16   78456277-78456679      * |        1.55887   0.0498410
##   [4626]     chr6   90471253-90471635      * |        1.46471   0.0498410
##   [4627]    chr12   36438399-36438569      * |        1.70550   0.0498587
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths
```
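
Some differential analysis tools report NA as the adjusted p-value for some regions, and NA values in a logical filter are a common source of errors. Whether that applies to your own results table is an assumption to check. The sketch below uses a small hypothetical table in base R to show the log2 threshold and one NA-safe way to apply the same filter with `which()`.

``` r
# Hypothetical results table, for illustration only
toyResults <- data.frame(log2FoldChange = c(2.5, 0.4, -3.0, 1.2, 1.8),
                         padj = c(0.001, 0.01, 1e-10, NA, 0.2))

# A log2 fold change of 1 corresponds to a fold change of 2^1
2^1

# which() keeps rows where the test is TRUE and drops rows where it is NA
upRows <- which(toyResults$log2FoldChange > 1 & toyResults$padj < 0.05)
toyResults[upRows, ]
```

Only the first row passes both thresholds. The fourth row has a large enough fold change but an NA adjusted p-value, so it is dropped rather than propagated as NA.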

---
class: inverse, center, middle

# Functional Enrichment with nearest gene

---