ChIPseq In Bioconductor (part3)

class: center, middle, inverse, title-slide

.title[
# ChIPseq In Bioconductor (part3)<br />
<html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html>
]
.author[
### Rockefeller University, Bioinformatics Resource Center
]
.date[
### <a href="http://rockefelleruniversity.github.io/RU_ChIPseq/" class="uri">http://rockefelleruniversity.github.io/RU_ChIPseq/</a>
]

---

## Data

In todays session we will continue to review the Myc ChIPseq we were working on in our last sessions.

This include Myc ChIPseq for MEL and Ch12 celllines.

Information and files for the [Myc ChIPseq in MEL cell line can be found here](https://www.encodeproject.org/experiments/ENCSR000EUA/)

Information and files for the [Myc ChIPseq in Ch12 cell line can be found here](https://www.encodeproject.org/experiments/ENCSR000ERN/)

---

## Data

In the data directory we have provided peak calls from MACS2 following the processing steps outlined in our last session.

Peak calls for Myc in MEL and Ch12 cellines can be found in

**data/peaks/**

* **data/peaks/Mel_1_peaks.xls**
* **data/peaks/Mel_2_peaks.xls**
* **data/peaks/Ch12_1_peaks.xls**
* **data/peaks/Ch12_1_peaks.xls**

---

## ChIP peaks in R

In our last session we reviewed how we can [identify putative transciption factor binding sites using peak calling programs such as MACS2.](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor2.html#38)

``` r
library(GenomicRanges)
macsPeaks <- "data/peaks/Mel_1_peaks.xls"
macsPeaks_DF <- read.delim(macsPeaks,comment.char="#")
macsPeaks_GR <- GRanges(seqnames=macsPeaks_DF[,"chr"],
                        IRanges(macsPeaks_DF[,"start"],macsPeaks_DF[,"end"]))
mcols(macsPeaks_GR) <- macsPeaks_DF[,c("abs_summit", "fold_enrichment")]
macsPeaks_GR[1:5,]
```

```
## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames          ranges strand | abs_summit fold_enrichment
##          <Rle>       <IRanges>  <Rle> |  <integer>       <numeric>
##   [1]     chr1 4785370-4785641      * |    4785565         5.23296
##   [2]     chr1 5082993-5083216      * |    5083122         4.26417
##   [3]     chr1 7397557-7398093      * |    7397835         9.84580
##   [4]     chr1 9545120-9545629      * |    9545400        11.85079
##   [5]     chr1 9700895-9701037      * |    9700978         5.14291
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths
```

---
## Annotation of peaks to genes

Since transcription factors, as suggested in name, may regulate the transcription of their target genes, we used the **ChIPseeker package** to associate our peaks, representing potential transcription factor binding events, to their overlapping or closest mm10 genes.

``` r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(ChIPseeker)
peakAnno <- annotatePeak(macsPeaks_GR, tssRegion=c(-1000, 1000), 
                         TxDb=TxDb.Mmusculus.UCSC.mm10.knownGene, 
                         annoDb="org.Mm.eg.db")
```

```
## >> preparing features information...		 2025-06-05 20:01:33 
## >> identifying nearest features...		 2025-06-05 20:01:33 
## >> calculating distance from peak to TSS...	 2025-06-05 20:01:41 
## >> assigning genomic annotation...		 2025-06-05 20:01:41 
## >> adding gene annotation...			 2025-06-05 20:01:43 
## >> assigning chromosome lengths			 2025-06-05 20:01:43 
## >> done...					 2025-06-05 20:01:43
```

---

## Annotation of peaks to genes

This allowed us to produce a GRanges or data.frame of our peaks and their predicted target genes.

``` r
annotatedPeaksGR <- as.GRanges(peakAnno)
annotatedPeaksDF <- as.data.frame(peakAnno)
annotatedPeaksDF[1:2, ]
```

```
##   seqnames   start     end width strand abs_summit fold_enrichment annotation geneChr
## 1     chr1 4785370 4785641   272      *    4785565         5.23296   Promoter       1
## 2     chr1 5082993 5083216   224      *    5083122         4.26417   Promoter       1
##   geneStart geneEnd geneLength geneStrand geneId          transcriptId distanceToTSS
## 1   4783572 4785692       2121          2  27395  ENSMUST00000132625.1            51
## 2   5083080 5162529      79450          1 108664 ENSMUST00000044369.12             0
##              ENSEMBL  SYMBOL                                        GENENAME
## 1 ENSMUSG00000033845  Mrpl15             mitochondrial ribosomal protein L15
## 2 ENSMUSG00000033793 Atp6v1h ATPase, H+ transporting, lysosomal V1 subunit H
```

---
class: inverse, center, middle

# Gene Set Enrichment

---

## Gene Set testing

Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc).

A frequent step in ChIPseq analysis is to test whether common gene sets are enriched for transcription factor binding or epigenetic marks.

Sources of well curated gene sets include [GO consortium](http://geneontology.org/) (gene's function, biological process and cellular localisation), [REACTOME](http://www.reactome.org/) (Biological Pathways) and [MsigDB](http://software.broadinstitute.org/gsea/msigdb/) (Computationally and Experimentally derived).

---
## Gene Set testing for ChIPseq

Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. In this example we will consider genes with peaks within 1000bp of a gene's TSS.

We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.

---
## Gene ontology and gene set testing

To perform gene set testing here, we will use the **clusterProfiler package**.

clusterProfiler provides multiple enrichment functions that allow for comparison of your gene list to known (e.g. GO, KEGG) or custom gene sets. Detailed information about all of the functionality within this package is available [here](http://yulab-smu.top/clusterProfiler-book/).

In this example we use all TSS sites we found to be overlapped by Myc peaks.

The peaks landing in TSS regions will be marked as "Promoter" in the **annotation** column of our annotated GRanges object.

``` r
annotatedPeaksGR[1, ]
```

```
## GRanges object with 1 range and 14 metadata columns:
##       seqnames          ranges strand | abs_summit fold_enrichment  annotation   geneChr
##          <Rle>       <IRanges>  <Rle> |  <integer>       <numeric> <character> <integer>
##   [1]     chr1 4785370-4785641      * |    4785565         5.23296    Promoter         1
##       geneStart   geneEnd geneLength geneStrand      geneId         transcriptId
##       <integer> <integer>  <integer>  <integer> <character>          <character>
##   [1]   4783572   4785692       2121          2       27395 ENSMUST00000132625.1
##       distanceToTSS            ENSEMBL      SYMBOL               GENENAME
##           <numeric>        <character> <character>            <character>
##   [1]            51 ENSMUSG00000033845      Mrpl15 mitochondrial riboso..
##   -------
##   seqinfo: 21 sequences from mm10 genome
```

---
## Gene ontology and gene set testing

We can extract the unique names of genes with peaks in their TSS by subsetting the annotated GRanges and retrieving gene names from the **geneId** column.

``` r
annotatedPeaksGR_TSS <- annotatedPeaksGR[annotatedPeaksGR$annotation == "Promoter", ]
genesWithPeakInTSS <- unique(annotatedPeaksGR_TSS$geneId)
genesWithPeakInTSS[1:2]
```

```
## [1] "27395"  "108664"
```

---
## Gene ontology and functional testing

Next we can extract all genes which are included in the TxDb object to use as our universe of genes for pathway enrichment.

``` r
allGeneGR <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
allGeneGR[1:2, ]
```

```
## GRanges object with 2 ranges and 1 metadata column:
##             seqnames            ranges strand |     gene_id
##                <Rle>         <IRanges>  <Rle> | <character>
##   100009600     chr9 21062393-21073096      - |   100009600
##   100009609     chr7 84935565-84964115      - |   100009609
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
```

``` r
allGeneIDs <- allGeneGR$gene_id
```

---
## Gene ontology and functional testing

Once we have our gene list and the universe of genes in the same format, we can use them in the **enrichGO** function to perform gene ontology analysis

For the **ont** argument, we can choose between the "BP", "MF", and "CC" subontologies, or "ALL" for all three.

``` r
library(clusterProfiler)
library(org.Mm.eg.db)
GO_result <- enrichGO(gene = genesWithPeakInTSS, universe = allGeneIDs, OrgDb = org.Mm.eg.db,
    ont = "BP")
```

---
## Gene ontology and functional testing

We now have an **enrichResult** instance. From this object, we can extract a data frame of the most highly enriched gene ontology categories.

``` r
GO_result_df <- data.frame(GO_result)
GO_result_df[1:5, ]
```

```
##                    ID                          Description GeneRatio   BgRatio RichFactor
## GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis  313/4989 433/22195  0.7228637
## GO:0034470 GO:0034470                     ncRNA processing  278/4989 408/22195  0.6813725
## GO:0042254 GO:0042254                  ribosome biogenesis  235/4989 315/22195  0.7460317
## GO:0016072 GO:0016072               rRNA metabolic process  180/4989 252/22195  0.7142857
## GO:0006364 GO:0006364                      rRNA processing  160/4989 214/22195  0.7476636
##            FoldEnrichment   zScore        pvalue      p.adjust        qvalue
## GO:0022613       3.215867 25.07395 6.639693e-110 4.152464e-106 3.322642e-106
## GO:0034470       3.031282 22.29904  9.993609e-88  3.125001e-84  2.500506e-84
## GO:0042254       3.318937 22.32056  3.570778e-87  7.443883e-84  5.956309e-84
## GO:0016072       3.177705 18.72130  5.809345e-62  9.082911e-59  7.267797e-59
## GO:0006364       3.326196 18.41256  8.758221e-60  1.095478e-56  8.765596e-57
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          geneID
## GO:0022613 59014/19989/27993/226982/73674/55989/57750/22596/69163/17975/67949/66942/20643/13211/15278/208768/67459/67223/215193/353258/20042/67239/56200/67332/108707/110816/18114/216156/20054/20222/28109/103136/70683/69878/52705/432508/56351/18813/64934/67465/52513/66249/57316/103573/18148/52530/216766/75273/19069/12261/18432/67493/104662/192159/276919/216987/217011/78394/56321/217038/110809/12812/67684/107732/19942/72047/68979/67671/72344/217431/20115/71952/15519/67236/217869/69237/217995/30946/208366/70078/70930/28126/212880/66475/105372/108143/66254/72198/218629/20088/52535/27374/75751/27275/72662/14000/223499/78581/239528/67959/109075/12181/55944/66538/20826/74778/72193/21357/66409/106298/224092/57808/68193/67014/106338/74112/207932/224481/67045/213773/66978/20630/19896/72462/54127/27756/53817/224823/66229/53414/18519/72515/67242/20641/108013/69241/225348/67936/67222/20044/68731/20227/67674/107071/28000/59028/19211/54364/107094/66583/18150/18572/13669/227522/20935/27176/227656/118451/227674/227715/66489/75422/13046/75786/98956/66181/271842/214616/320632/20638/67134/66580/67204/16418/68295/67996/228889/66481/68879/69126/57815/69639/229317/56758/97112/20091/74200/56417/100043813/70767/70465/56195/99480/68147/229937/70285/100019/230082/20104/12578/72181/20116/69072/67205/75062/194268/230737/214150/213541/68040/27981/14105/67025/69902/50912/74035/66448/57741/20817/433864/75416/17089/65961/231452/100503670/100608/117109/11837/71990/19988/208144/19384/100609/74097/66596/192196/28035/68011/57905/56505/72171/213895/110109/60321/20901/68988/76846/22185/20103/68077/107686/13871/69731/20085/27998/14113/20055/27207/22121/101612/67973/83485/20068/67148/101867/68052/267019/56844/67920/67276/66282/234344/234374/50783/78651/24015/234549/214627/214572/66164/67619/72544/66523/52202/235036/215194/71989/382137/77591/11920/102462/67941/225215/245000/22258/72831/16785/67115/66446/13205/236732/72584/68995/72554/72459/19655/110954/245474/26905/76130/237107/69499/237221
## GO:0034470                                                                                                                                                                                                                               59014/19989/27993/73674/57750/69163/17975/22099/67949/66942/15278/208768/67459/67223/77065/215193/73681/67239/56200/108707/110816/18114/216156/20054/103136/28088/70683/52705/17299/103284/18813/64934/52513/57316/74467/71701/18148/66628/52530/194908/68626/22059/54351/75273/67493/104662/216987/78394/217038/70422/20848/19942/52686/13207/68979/76265/72344/68730/80913/217431/20115/66132/211064/328162/74691/11651/69237/217995/30946/68916/19766/208366/69955/70078/70930/28114/21752/105372/66254/72198/67053/68045/20088/67011/56335/18130/72662/170762/14000/67724/223499/78581/68260/239528/109075/12181/20826/74778/72026/21357/66409/74108/15547/57808/106248/52575/106338/207932/224481/67045/213773/19896/20383/57773/72462/54127/224823/66229/53414/18519/224907/72515/72167/240174/58523/225348/67936/68857/67222/68731/20227/67674/107071/109077/70044/59028/54364/107094/66583/18150/18572/72133/227522/13434/74455/27176/227656/68205/227695/227715/66489/20823/75422/228019/98956/66181/271842/69185/66580/16418/228889/66481/69639/56417/100043813/229543/20226/229780/28036/68147/108943/229937/70285/230082/69934/74753/230233/20104/12578/100087/72181/68276/20116/69072/230709/66966/67205/230734/76793/214150/80912/213541/67025/69902/50912/74035/71957/78697/75416/78890/17089/211006/65961/231413/100503670/56361/117109/71990/70650/208144/100929/100609/83701/74097/53379/69786/72171/70047/14911/381802/213895/110109/60321/66078/68077/13871/20085/27998/21803/14113/20055/233189/67973/20068/101861/101867/233802/67920/67276/234344/234374/70359/78651/212528/214627/66369/67619/234734/72544/30947/66965/53424/212728/52202/235036/215194/71989/67049/382137/77591/102462/69882/69606/68291/67789/83669/72341/67115/66446/19652/54632/13205/69942/72554/245474/76130/69499/15108
## GO:0042254                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     59014/19989/27993/226982/73674/55989/57750/22596/69163/17975/67949/66942/15278/208768/67459/67223/215193/353258/20042/67239/56200/108707/110816/18114/216156/20054/103136/70683/52705/18813/64934/52513/66249/57316/103573/18148/52530/75273/19069/12261/18432/67493/104662/216987/217011/78394/56321/217038/107732/19942/68979/67671/72344/217431/20115/71952/67236/217869/69237/217995/30946/208366/70078/70930/28126/66475/105372/66254/72198/218629/20088/52535/75751/72662/14000/223499/78581/109075/12181/66538/20826/74778/66409/106298/224092/57808/68193/67014/106338/74112/207932/224481/67045/213773/19896/72462/54127/224823/66229/53414/18519/72515/225348/67936/67222/20044/68731/20227/67674/107071/59028/19211/54364/107094/66583/18150/18572/227522/20935/27176/227656/118451/227674/227715/66489/75422/98956/66181/271842/214616/67134/66580/16418/228889/66481/69126/57815/69639/229317/97112/20091/100043813/99480/68147/229937/70285/100019/230082/20104/12578/72181/20116/69072/67205/194268/230737/213541/68040/67025/69902/50912/74035/66448/57741/433864/75416/17089/65961/231452/100503670/100608/117109/11837/71990/19988/208144/19384/100609/74097/66596/72171/213895/110109/60321/76846/20103/68077/13871/20085/27998/14113/20055/27207/101612/67973/83485/20068/101867/68052/267019/67920/67276/66282/234344/234374/78651/24015/234549/214627/66164/67619/72544/66523/52202/235036/215194/71989/382137/77591/102462/67941/225215/72831/16785/67115/66446/13205/236732/72584/72554/110954/245474/76130/237107/69499
## GO:0016072                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     59014/19989/27993/73674/57750/69163/17975/67949/66942/15278/208768/67459/67223/215193/67239/56200/20587/108707/110816/18114/216156/66420/20054/103136/70683/52705/216443/18813/64934/52513/57316/18148/52530/22059/75273/67493/104662/216987/78394/217038/19942/21429/68979/72344/217431/20115/28071/69237/217995/30946/208366/70078/75678/70930/105372/66254/72198/105083/20088/319955/72662/14000/223499/78581/109075/12181/20826/74778/66409/57808/106338/207932/224481/67045/213773/19896/72462/54127/224823/66229/53414/18519/72515/13872/225348/67936/67222/21453/68731/20227/67674/107071/59028/54364/107094/66583/18150/18572/227522/27176/227656/227715/66489/75422/98956/66181/271842/20017/66580/16418/21969/228889/66481/69639/100043813/68147/229937/70285/230082/319885/64424/20104/12578/72181/20116/69072/67205/213541/67025/69902/56717/50912/74035/75416/17089/65961/100503670/117109/71990/208144/100609/74097/20018/20019/72171/213895/110109/60321/68077/13871/20085/27998/14113/20055/67973/20068/101867/67920/67276/234344/234374/78651/68544/214627/67619/72544/52202/235036/215194/20586/71989/382137/77591/102462/67115/66446/72554/245474/76130/69499
## GO:0006364                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 59014/19989/27993/73674/57750/69163/17975/67949/66942/15278/208768/67459/67223/215193/67239/56200/108707/110816/18114/216156/20054/103136/70683/52705/18813/64934/52513/57316/18148/52530/75273/67493/104662/216987/78394/217038/19942/68979/72344/217431/20115/69237/217995/30946/208366/70078/70930/105372/66254/72198/20088/72662/14000/223499/78581/109075/12181/20826/74778/66409/57808/106338/207932/224481/67045/213773/19896/72462/54127/224823/66229/53414/18519/72515/225348/67936/67222/68731/20227/67674/107071/59028/54364/107094/66583/18150/18572/227522/27176/227656/227715/66489/75422/98956/66181/271842/66580/16418/228889/66481/69639/100043813/68147/229937/70285/230082/20104/12578/72181/20116/69072/67205/213541/67025/69902/50912/74035/75416/17089/65961/100503670/117109/71990/208144/100609/74097/72171/213895/110109/60321/68077/13871/20085/27998/14113/20055/67973/20068/101867/67920/67276/234344/234374/78651/214627/67619/72544/52202/235036/215194/71989/382137/77591/102462/67115/66446/72554/245474/76130/69499
##            Count
## GO:0022613   313
## GO:0034470   278
## GO:0042254   235
## GO:0016072   180
## GO:0006364   160
```

---
## Gene ontology and functional testing

Network plots can be generated from any enrichResult object using the **enrichplot** package.

We measure similarities between the various significant gene sets and group them accordingly. The **showCategory** argument specifies how many top gene ontology hits to show.

``` r
library(enrichplot)
GO_result_plot <- pairwise_termsim(GO_result)
emapplot(GO_result_plot, showCategory = 20)
```

![](ChIPseq_In_Bioconductor3_files/figure-html/unnamed-chunk-12-1.png)

---
## Gene ontology and functional testing

In addition to gene ontology, we can test our gene list against custom gene sets that we import as gmt files using the clusterProfiler **enricher** function. Similar to the enrichGO function, this will produce an enrichResult object that can be used for visualization.

Here we will get gene sets from MSigDB using the **msigdbr** package. We can use this package to pull out specific gene sets, which can be seen [at this link](http://software.broadinstitute.org/gsea/msigdb/). We can also run the **msigdbr_collections** function to see the categories and subcategory codes that will be used for accessing the gene sets.

``` r
library(msigdbr)
msigdbr_collections()
```

```
##    gs_collection gs_subcollection                   gs_collection_name num_genesets
## 1             C1                                            Positional          302
## 2             C2              CGP   Chemical and Genetic Perturbations         3494
## 3             C2               CP                   Canonical Pathways           19
## 4             C2      CP:BIOCARTA                    BioCarta Pathways          292
## 5             C2   CP:KEGG_LEGACY                 KEGG Legacy Pathways          186
## 6             C2  CP:KEGG_MEDICUS                KEGG Medicus Pathways          658
## 7             C2           CP:PID                         PID Pathways          196
## 8             C2      CP:REACTOME                    Reactome Pathways         1736
## 9             C2  CP:WIKIPATHWAYS                         WikiPathways          830
## 10            C3        MIR:MIRDB                                miRDB         2377
## 11            C3   MIR:MIR_LEGACY                           MIR_Legacy          221
## 12            C3         TFT:GTRD                                 GTRD          505
## 13            C3   TFT:TFT_LEGACY                           TFT_Legacy          610
## 14            C4              3CA Curated Cancer Cell Atlas gene sets           148
## 15            C4              CGN            Cancer Gene Neighborhoods          427
## 16            C4               CM                       Cancer Modules          431
## 17            C5            GO:BP                GO Biological Process         7608
## 18            C5            GO:CC                GO Cellular Component         1026
## 19            C5            GO:MF                GO Molecular Function         1820
## 20            C5              HPO             Human Phenotype Ontology         5653
## 21            C6                                   Oncogenic Signature          189
## 22            C7      IMMUNESIGDB                          ImmuneSigDB         4872
## 23            C7              VAX                HIPC Vaccine Response          347
## 24            C8                                   Cell Type Signature          840
## 25             H                                              Hallmark           50
```

---

## Gene ontology and functional testing

From the data frame on the previous slide we can identify the category/subcategory we want, and use those in the **msigdbr** function. Here we will use "H" to access the Hallmark gene sets, and in the end we need to get a data frame where the first column contains the name of the gene sets and the second column contains the gene ID.

``` r
library(msigdbr)
msig_t2g <- msigdbr(species = "Mus musculus", category = "H", subcategory = NULL)
msig_t2g <- msig_t2g[, colnames(msig_t2g) %in% c("gs_name", "entrez_gene")]
msig_t2g[1:3, ]
```

```
##                 gs_name entrez_gene
## 1 HALLMARK_ADIPOGENESIS       11303
## 2 HALLMARK_ADIPOGENESIS       74610
## 3 HALLMARK_ADIPOGENESIS       52538
```

---
## Gene ontology and functional testing

We then run the gene set enrichment, using the term to gene mapping we created as the **TERM2GENE** argument in the enricher function.

``` r
hallmark <- enricher(gene = genesWithPeakInTSS, universe = allGeneIDs, TERM2GENE = msig_t2g)
hallmark_df <- data.frame(hallmark)
hallmark_df[1:3, ]
```

```
##                                              ID             Description GeneRatio  BgRatio
## HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1  154/1421 200/4387
## HALLMARK_E2F_TARGETS       HALLMARK_E2F_TARGETS    HALLMARK_E2F_TARGETS  126/1421 200/4387
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2   50/1421  58/4387
##                         RichFactor FoldEnrichment    zScore       pvalue     p.adjust
## HALLMARK_MYC_TARGETS_V1   0.770000       2.377192 13.797609 4.191095e-40 2.095547e-38
## HALLMARK_E2F_TARGETS      0.630000       1.944975  9.467379 9.706769e-20 2.426692e-18
## HALLMARK_MYC_TARGETS_V2   0.862069       2.661433  8.815539 1.810363e-17 3.017272e-16
##                               qvalue
## HALLMARK_MYC_TARGETS_V1 1.632321e-38
## HALLMARK_E2F_TARGETS    1.890266e-18
## HALLMARK_MYC_TARGETS_V2 2.350296e-16
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       geneID
## HALLMARK_MYC_TARGETS_V1 15510/18393/66942/17219/51810/107508/15182/56200/67332/66094/103136/18674/53605/12461/73192/12567/56351/18813/67465/18972/12464/103573/18148/14694/12330/52530/22333/27041/13681/12261/22627/110809/18102/18673/16211/26446/22123/16647/20382/18263/22630/26443/13665/13877/218236/105148/28126/30877/11792/15381/12465/18458/55944/14375/15382/66409/14852/22195/19385/20462/28185/21454/19826/67097/20383/109801/108121/27756/66973/15516/225027/20641/225363/27366/70769/11757/56086/20823/229279/100042807/110074/99138/67134/67204/26444/12428/15191/12462/68436/51788/13684/19359/20104/230484/107995/51797/230721/26445/74326/20810/230908/19934/69719/14208/13204/11991/11837/19988/19384/22384/17220/27979/381760/53379/56150/66870/12468/68011/23983/12785/12034/68988/20103/21849/107686/53607/14113/57296/50995/19899/16828/68981/27050/20133/13690/233870/19935/21781/17218/24015/14719/17463/101943/85305/26441/12858/56403/23918/67115/66446/18655/108062/53381/66235
## HALLMARK_E2F_TARGETS                                                                                                                                                                                   18393/17219/67196/98386/66403/12534/21917/53605/103468/12567/18813/18972/21379/15574/103573/15366/19360/20425/20877/78304/22059/114714/110809/18102/67684/21973/103733/16647/15374/20382/21877/11799/67241/18949/13665/218210/66197/69716/19355/70472/14375/26934/105988/12419/14852/19385/20462/22042/106618/15361/22154/18969/70769/20833/12236/110074/20638/67134/12531/69270/17865/110750/69639/70099/140917/51788/27354/12578/230484/50927/73804/107995/51797/11637/19891/100336/16765/21335/17089/19687/11991/50883/22256/12704/19384/19718/74097/17220/18861/12190/100710/71777/70699/14056/56150/66979/17768/68298/19362/77891/16881/12447/18971/108961/13626/233726/22390/18817/66422/109145/17218/68278/69724/17535/13433/16201/67629/12649/66131/56403/20842/12530/17350/19139/24061/53381
## HALLMARK_MYC_TARGETS_V2                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  27993/15510/66942/15278/338359/20509/70683/12567/18813/64934/21379/18148/18432/69071/18673/28126/30877/75751/53414/72515/21453/13340/107071/59028/107094/226169/70769/20322/13537/67134/68493/53608/69902/20810/100608/56361/22256/15277/110109/27998/101612/71974/67973/18817/66422/17218/66590/66164/235036/76130
##                         Count
## HALLMARK_MYC_TARGETS_V1   154
## HALLMARK_E2F_TARGETS      126
## HALLMARK_MYC_TARGETS_V2    50
```

---
## Gene ontology and functional testing

We learned about the **goseq** package during the RNAseq course, which is another functional annotation package similar to clusterProfiler,

Here we perform the same enrichment test for the MSigDB Hallmark gene sets.

For goseq, we need a named vector of all genes (the universe) with 1s or 0s representing whether a gene had peak in TSS or not. 
We can turn a logical vector into 1 for TRUE and 0 for FALSE simply using the **as.integer** function.

``` r
allGenesForGOseq <- as.integer(allGeneIDs %in% genesWithPeakInTSS)
names(allGenesForGOseq) <- allGeneIDs
allGenesForGOseq[1:3]
```

```
## 100009600 100009609 100009614 
##         1         0         0
```

---
## Gene ontology and functional testing

First we must construct a **nullp** data.frame for use within **goseq** using the **nullp** function and supplying our named vector, genome to be used and gene identifier used.

The **nullp** function attempts to correct for gene length biases we may see in gene set testing. i.e. a longer gene may have more chance to have a peak within it.

``` r
library(goseq)
pwf = nullp(allGenesForGOseq, "mm10", "knownGene", plot.fit = FALSE)
```

```
## Can't find mm10/knownGene length data in genLenDataBase...
```

```
## Found the annotation package, TxDb.Mmusculus.UCSC.mm10.knownGene
```

```
## Trying to get the gene lengths from it.
```

---
## Gene ontology and functional testing

We can use the same term to gene mapping we used for clusterProfiler (though it must be converted from a tibble to data frame for goseq) to run the gene set enrichment test.

``` r
Myc_hallMarks <- goseq(pwf, "mm10", "knownGene", gene2cat = data.frame(msig_t2g))

Myc_hallMarks[1:3, ]
```

```
##                   category over_represented_pvalue under_represented_pvalue numDEInCat
## 32 HALLMARK_MYC_TARGETS_V1            5.905066e-41                        1        154
## 13    HALLMARK_E2F_TARGETS            3.821953e-20                        1        126
## 33 HALLMARK_MYC_TARGETS_V2            1.123201e-17                        1         50
##    numInCat
## 32      200
## 13      200
## 33       58
```

---

## GREAT for GO and functional testing

In addition to a standard enrichment tests, methods have been implemented specifically for ChIPseq. Many of these tools aim to incorporate peaks distal to genes into their enrichment testing such as the popular [GREAT](http://bejerano.stanford.edu/great/public/html/splash.php) toolset.

Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some gene sets as a whole having a higher chance of having its members selected.

[GREAT](http://bejerano.stanford.edu/great/public/html/splash.php) defines regulatory regions for each individual gene and compares the proportion of peaks mapping to a gene set's regulatory regions to the proportion of the genome occupied by gene set's regulatory regions.

i.e. If a gene set's regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance.

---
rGREAT - R interface to GREAT server
=========================================================

We can use the GREAT Bioconductor interface available in the rGREAT package.

``` r
library(rGREAT)
```
---
rGREAT for GO and functional testing
=========================================================

To submit jobs we can use our GRanges of Myc peaks and specify a genome with the **submitGreatJob** function.

This function returns a GreatJob object containing a reference to our results on the GREAT server. To review the categories of results available we can use the availableCategories function on our GreatJob object.

``` r
great_Job <- submitGreatJob(macsPeaks_GR, species = "mm10", version = "3.0.0", request_interval = 1)
availableCategories(great_Job)
```

```
## [1] "GO"                               "Phenotype Data and Human Disease"
## [3] "Pathway Data"                     "Gene Expression"                 
## [5] "Regulatory Motifs"                "Gene Families"
```
---
rGREAT for GO and functional testing
=========================================================

The results table can be retrieved using the getEnrichmentTables function and specifying which tables we wish to review.

Here we retrieve the results tables for the "Regulatory Motifs" gene sets which contains 2 separate database results.

``` r
great_ResultTable = getEnrichmentTables(great_Job, category = "Regulatory Motifs")
names(great_ResultTable)
```

```
## [1] "MSigDB Predicted Promoter Motifs" "MSigDB miRNA Motifs"
```
---
rGREAT for GO and functional testin
=========================================================

Now we can review the enrichment of our genes with Myc peaks in their TSS for the "MSigDB Predicted Promoter Motifs" gene sets.

``` r
msigProMotifs <- great_ResultTable[["MSigDB Predicted Promoter Motifs"]]
msigProMotifs[1:4, ]
```

```
##                     ID
## 1   SCGGAAGY_V$ELK1_02
## 2     GGGCGGR_V$SP1_Q6
## 3      CACGTG_V$MYC_Q2
## 4 RCGCANGCGY_V$NRF1_Q6
##                                                                              name
## 1                Motif SCGGAAGY matches ELK1: ELK1, member of ETS oncogene family
## 2                             Motif GGGCGGR matches SP1: Sp1 transcription factor
## 3 Motif CACGTG matches MYC: v-myc myelocytomatosis viral oncogene homolog (avian)
## 4                     Motif RCGCANGCGY matches NRF1: nuclear respiratory factor 1
##   Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits Binom_Fold_Enrichment
## 1            0.05912876       814.6169                       1586              1.946928
## 2            0.20132920      2773.7130                       3781              1.363155
## 3            0.07555623      1040.9380                       1638              1.573580
## 4            0.05163166       711.3293                       1195              1.679953
##   Binom_Region_Set_Coverage Binom_Raw_PValue Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected
## 1                0.11511940    1.738355e-136 1.069088e-133              1069       502.9698
## 2                0.27444290     2.738851e-94  8.421967e-92              2689      1265.1880
## 3                0.11889380     1.110696e-71  2.276927e-69               941       442.7452
## 4                0.08673877     2.182812e-65  3.356073e-63               822       386.7551
##   Hyper_Observed_Gene_Hits Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage
## 1                      806              1.602482              0.07821446
## 2                     1812              1.432198              0.17583700
## 3                      695              1.569752              0.06744299
## 4                      606              1.566883              0.05880640
##   Hyper_Term_Gene_Coverage Hyper_Raw_PValue Hyper_Adjp_BH
## 1                0.7539757     1.739538e-83  5.349079e-81
## 2                0.6738565    5.523247e-114 3.396797e-111
## 3                0.7385760     1.936613e-65  3.970057e-63
## 4                0.7372263     1.614352e-56  2.482066e-54
```

---
class: inverse, center, middle

# Motifs

---

## Motifs

A common practice in transcription factor ChIPseq is to investigate the motifs enriched under peaks.

De novo motif enrichment can be performed in R/Bioconductor but this can be very time consuming. Here we will use the MEME-ChIP suite available online to identify de novo motifs.

MEME-ChIP requires a FASTA file of sequences under peaks as input so we extract this using the **BSgenome** package.

---

## Extracting sequences under regions

First we need to load the BSgenome object for the genome we are working on, UCSC's mm10 build for the mouse genome, **BSgenome.Mmusculus.UCSC.mm10**.

``` r
library(BSgenome)
library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
```

```
## | BSgenome object for Mouse
## | - organism: Mus musculus
## | - provider: UCSC
## | - genome: mm10
## | - release date: Sep 2017
## | - 239 sequence(s):
## |     chr1                 chr2                 chr3                 chr4                
## |     chr5                 chr6                 chr7                 chr8                
## |     chr9                 chr10                chr11                chr12               
## |     chr13                chr14                chr15                chr16               
## |     chr17                chr18                chr19                chrX                
## |     ...                  ...                  ...                  ...                 
## |     chrX_KQ030495_fix    chrX_KQ030496_fix    chrX_KQ030497_fix    chrX_KZ289092_fix   
## |     chrX_KZ289093_fix    chrX_KZ289094_fix    chrX_KZ289095_fix    chrY_JH792832_fix   
## |     chrY_JH792833_fix    chrY_JH792834_fix    chr1_KK082441_alt    chr11_KZ289073_alt  
## |     chr11_KZ289074_alt   chr11_KZ289075_alt   chr11_KZ289077_alt   chr11_KZ289078_alt  
## |     chr11_KZ289079_alt   chr11_KZ289080_alt   chr11_KZ289081_alt                       
## | 
## | Tips: call 'seqnames()' on the object to get all the sequence names, call 'seqinfo()' to
## | get the full sequence info, use the '$' or '[[' operator to access a given sequence, see
## | '?BSgenome' for more information.
```
---

## Extracting sequences under regions

The motif for the ChIP-ed transcription factor should near the summit of a peak. MEME-ChIP will trim our peaks to a common length internally if sequences are of different length.

It is best therefore to provide a peak set resized to a common length.

``` r
macsSummits_GR <- GRanges(seqnames(macsPeaks_GR), IRanges(macsPeaks_GR$abs_summit, macsPeaks_GR$abs_summit),
    score = macsPeaks_GR$fold_enrichment)
macsSummits_GR <- resize(macsSummits_GR, 100, fix = "center")
```

---
## Extracting sequences under regions

We now have a GRanges, centred on the summit, highest point of signal for every peak.

``` r
macsSummits_GR
```

```
## GRanges object with 13777 ranges and 1 metadata column:
##           seqnames              ranges strand |     score
##              <Rle>           <IRanges>  <Rle> | <numeric>
##       [1]     chr1     4785515-4785614      * |   5.23296
##       [2]     chr1     5083072-5083171      * |   4.26417
##       [3]     chr1     7397785-7397884      * |   9.84580
##       [4]     chr1     9545350-9545449      * |  11.85079
##       [5]     chr1     9700928-9701027      * |   5.14291
##       ...      ...                 ...    ... .       ...
##   [13773]     chrX 168674050-168674149      * |   5.21700
##   [13774]     chrX 169047705-169047804      * |   5.33257
##   [13775]     chrX 169320320-169320419      * |   4.67030
##   [13776]     chrY     1245745-1245844      * |   4.04550
##   [13777]     chrY     1286577-1286676      * |   5.00143
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths
```

---
## Extracting sequences under regions

Once we have recentered our peaks we can use the **getSeq** function with our GRanges of resized common peaks and the BSgenome object for mm10.

The **getSeq** function returns a *DNAStringSet* object containing sequences under peaks.

``` r
peaksSequences <- getSeq(BSgenome.Mmusculus.UCSC.mm10, macsSummits_GR)
names(peaksSequences) <- paste0(seqnames(macsSummits_GR), ":", start(macsSummits_GR), "-", end(macsSummits_GR))

peaksSequences[1:2, ]
```

```
## DNAStringSet object of length 2:
##     width seq                                                           names               
## [1]   100 CGTCCATCGCCAGAGTGACGCGGCCCCTG...TTAGGACTGGGCTTCAGGTTGGCCAGGCT chr1:4785515-4785614
## [2]   100 CCGCCCTCAGCTGCGGTCACGTGCGGCTG...TTCCTCTGAGTGAGGCTCGCAACGTCTCC chr1:5083072-5083171
```
---
## Writing to FASTA file

The *writeXStringSet* function allows the user to write DNA/RNA/AA(aminoacid)StringSet objects out to file.

By default the *writeXStringSet* function writes the sequence information in FASTA format (as required for MEME-ChIP).

``` r
writeXStringSet(peaksSequences, file = "mycMel_rep1.fa")
```

---
## MEME-ChIP

Now the file "mycMel_rep1.fa" contains sequences around the geometric center of peaks suitable for Motif analysis in MEME-ChIP.

In your own work you will typically run this from your own laptop with MEME installed locally but today we will upload our generated FASTA file to their [web portal](http://meme-suite.org/tools/meme-chip).

Follow instructions [here](http://meme-suite.org/doc/download.html) to install MEME localy.

Results files from MEME-ChIP can be found [here](http://rockefelleruniversity.github.io/myc_Meme_Example/meme-chip.html)

---
## Parsing back FIMO results

We can retrieve the locations of Myc motifs identified in MEME-ChIP from the FIMO output.

FIMO reports Myc motif locations as a GFF3 file which we should be able to vizualise in IGV. 
Sadly, this GFF file's naming conventions cause only a fraction of motifs to be reported.

---
## FIMO to R

Fortunately we can parse our motif's GFF file into R and address this using the **import** function in  the **rtracklayer** package.

``` r
library(rtracklayer)
motifGFF <- import("~/Downloads/fimo.gff")
```

---
## FIMO to valid GFF3

We can give the sequences some more sensible names and export the GFF to file to visualise in IGV.

``` r
motifGFF$Name <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
motifGFF$ID <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
export.gff3(motifGFF, con = "~/Downloads/fimoUpdated.gff")
```

---
## Scanning for known motifs

We saw previously we can scan sequences using some of the Biostrings functionality **matchPattern**.

Often with ChIPseq we may know the motif we are looking for or we can use a set of known motifs from a database such as a [JASPAR](http://jaspar.genereg.net).

We can access to JASPAR using the JASPAR2020 bioconductor library.

``` r
library(JASPAR2020)
JASPAR2020
```

```
## An object of class "JASPAR2020"
## Slot "db":
## [1] "/Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library/JASPAR2020/extdata/JASPAR2020.sqlite"
```

---
## Get motifs from JASPAR with TFBStools

We can access the model for the our motif of interest using the **TFBSTools** package and its **getMatrixByName** function.

``` r
library(TFBSTools)
pfm <- getMatrixByName(JASPAR2020, name = "MYC")
pfm
```

```
## An object of class PFMatrix
## ID: MA0147.3
## Name: MYC
## Matrix Class: Basic helix-loop-helix factors (bHLH)
## strand: +
## Tags: 
## $centrality_logp
## [1] "-296.164"
## 
## $family
## [1] "bHLH-ZIP factors"
## 
## $medline
## [1] "18555785"
## 
## $remap_tf_name
## [1] "MYC"
## 
## $source
## [1] "29126285"
## 
## $tax_group
## [1] "vertebrates"
## 
## $tfbs_shape_id
## [1] "140"
## 
## $type
## [1] "ChIP-seq"
## 
## $unibind
## [1] "1"
## 
## $collection
## [1] "CORE"
## 
## $species
##           9606 
## "Homo sapiens" 
## 
## $acc
## [1] "P01106"
## 
## Background: 
##    A    C    G    T 
## 0.25 0.25 0.25 0.25 
## Matrix: 
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## A 1038 1019  438   44 4600   54  169   59   23   419   961   755
## C 1345 1287 3746 4840   59 4525  157  162   56  2801  1524  1785
## G 1649 1757  472   31  170   70 4578   63 4817  1198  1036  1348
## T  910  879  286   27  113  293   38 4658   46   524  1421  1054
```

---
## Motif scanning with motifmatchr

With this PWM we can use the **motifmatchr** package to scan our summits for the Myc motif and return the positions of the motifs.

We will need to provide our PWM, GRanges to scan within and BSGenome object to extract sequence from.

We also set the **out** paramter to positions for this instance.

``` r
library(motifmatchr)
MycMotifs <- matchMotifs(pfm, macsSummits_GR, BSgenome.Mmusculus.UCSC.mm10, out = "positions")
MycMotifs
```

```
## GRangesList object of length 1:
## [[1]]
## GRanges object with 1600 ranges and 1 metadata column:
##          seqnames              ranges strand |     score
##             <Rle>           <IRanges>  <Rle> | <numeric>
##      [1]     chr1     9545433-9545444      - |   14.0312
##      [2]     chr1   13151280-13151291      + |   15.7259
##      [3]     chr1   13293444-13293455      + |   14.5201
##      [4]     chr1   16665901-16665912      - |   14.6571
##      [5]     chr1   43561269-43561280      + |   14.2468
##      ...      ...                 ...    ... .       ...
##   [1596]     chrX 155338504-155338515      + |   14.8300
##   [1597]     chrX 155338504-155338515      - |   14.8266
##   [1598]     chrX 160610006-160610017      + |   12.6818
##   [1599]     chrX 166454533-166454544      + |   13.6368
##   [1600]     chrX 166454533-166454544      - |   13.6368
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths
```

---
## Exporting motif matches

We can export the Myc motif positions within peaks for use later in IGV or for metaplot vizualisation.

``` r
export.bed(MycMotifs[[1]], con = "MycMotifs.bed")
```

---
## Time for an exercise!

Exercise on ChIPseq data can be found [here](../../exercises/exercises/chipseq_part3_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/chipseq_part3_answers.html)