CUT&RUN/ATAC (part 5) - Peak annotation and functional enrichment


Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived RU_Course_help folder in the Rstudio menu.

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("~/Downloads/ATAC.Cut-Run.ChIP-master/r_course")

What we will cover

We have now called peaks and then built a consensus peak set that we have counted over and determined differential peaks between W6 and W0 for our CUT&RUN data.

In this section we will:

  • Overlap these peak sets to genomic features
  • Annotate peaks to genes
  • Perform functional enrichment for pathways and biological gene sets with the annotated peaks

Peak Annotation


Annotation of peaks to genes

So far we have been working with CUT&RUN peaks corresponding to transcription factor binding or ATACseq peaks corresponding to open chromatin regions. Transcription factors, as implied in the name, can affect the expression of their target genes and open regions generally correlate with gene expression.

We will often annotate peaks to genes to try and identify the target of a transcription factor or a gene regulated by a regulatory element uncovered by ATACseq. This is typically done using a simple set of rules:

Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance).

Peak annotation

A useful package for annotation of peaks to genes is ChIPseeker.

By using pre-defined annotation in the form of a TXDB object for mouse (mm10 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites.

First load the libraries we require for the next part and read in our SOX9 CUT&RUN peaks.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(ChIPseeker)
library(rtracklayer)

cnrPeaks_GR <- rtracklayer::import("data/peaks/SOX9CNR_W6_rep1_macs_peaks.narrowPeak")

Peak annotation

The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from.

peakAnno <- annotatePeak(cnrPeaks_GR, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
    annoDb = "org.Mm.eg.db")
## >> preparing features information...      2025-06-11 07:14:29 
## >> identifying nearest features...        2025-06-11 07:14:29 
## >> calculating distance from peak to TSS...   2025-06-11 07:14:30 
## >> assigning genomic annotation...        2025-06-11 07:14:30 
## >> adding gene annotation...          2025-06-11 07:14:31
## >> assigning chromosome lengths           2025-06-11 07:14:32 
## >> done...                    2025-06-11 07:14:32
class(peakAnno)
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"

Peak annotation

The result is a csAnno object containing annotation for peaks and overall annotation statistics.

peakAnno
## Annotated peaks generated by ChIPseeker
## 72118/72174  peaks were annotated
## Genomic Annotation Summary:
##              Feature  Frequency
## 9           Promoter 20.3180898
## 4             5' UTR  0.1677806
## 3             3' UTR  1.4420810
## 1           1st Exon  1.2687540
## 7         Other Exon  2.5555340
## 2         1st Intron 14.3431598
## 8       Other Intron 24.3115450
## 6 Downstream (<=300)  0.1067695
## 5  Distal Intergenic 35.4862864

Peak annotation

The csAnno object contains the information on annotation of individual peaks to genes.

To extract this from the csAnno object the ChIPseeker functions as.GRanges or as.data.frame can be used to produce the respective object with peaks and their associated genes.

annotatedPeaksGR <- as.GRanges(peakAnno)
annotatedPeaksDF <- as.data.frame(peakAnno)

Peak annotation

annotatedPeaksGR[1:2, ]
## GRanges object with 2 ranges and 18 metadata columns:
##       seqnames          ranges strand |                   name     score
##          <Rle>       <IRanges>  <Rle> |            <character> <numeric>
##   [1]     chr1 4466289-4466395      * | SOX9CNR_W6_rep1_macs..        16
##   [2]     chr1 4785750-4785948      * | SOX9CNR_W6_rep1_macs..        53
##       signalValue    pValue    qValue      peak        annotation   geneChr
##         <numeric> <numeric> <numeric> <integer>       <character> <integer>
##   [1]     3.22302   3.93505   1.66890        13 Distal Intergenic         1
##   [2]     5.79731   8.11406   5.39043       138          Promoter         1
##       geneStart   geneEnd geneLength geneStrand      geneId
##       <integer> <integer>  <integer>  <integer> <character>
##   [1]   4492465   4493735       1271          2       20671
##   [2]   4781221   4785739       4519          2       27395
##               transcriptId distanceToTSS            ENSEMBL      SYMBOL
##                <character>     <numeric>        <character> <character>
##   [1] ENSMUST00000191939.1         27340 ENSMUSG00000025902       Sox17
##   [2] ENSMUST00000146665.2           -11 ENSMUSG00000033845      Mrpl15
##                     GENENAME
##                  <character>
##   [1] SRY (sex determining..
##   [2] mitochondrial riboso..
##   -------
##   seqinfo: 35 sequences (1 circular) from mm10 genome

Peak annotation

The genomic annotation for each peak is whin the annotation column and the closest gene is in shown in the geneId, ENSEMBL, and SYMBOL columns (geneId is the Entrez ID).

annotatedPeaksGR$annotation[1:5]
## [1] "Distal Intergenic"                                 
## [2] "Promoter"                                          
## [3] "Promoter"                                          
## [4] "Promoter"                                          
## [5] "Intron (ENSMUST00000134384.7/18777, intron 9 of 9)"
annotatedPeaksGR$SYMBOL[1:5]
## [1] "Sox17"  "Mrpl15" "Lypla1" "Lypla1" "Lypla1"

Vizualising peak annotation

Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker’s plotting functions to display distribution of peaks in gene features. Here we use the plotAnnoBar function to plot this as a bar chart but plotAnnoPie would produce a similar plot as a pie chart.

plotAnnoBar(peakAnno)

Vizualising peak annotation

Similarly we can plot the distribution of peaks around TSS sites.

plotDistToTSS(peakAnno)

Gene Set Enrichment


Gene Set testing for peaks

Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc).

A frequent step in CUT&RUN or ATACseq analysis is to test whether common gene sets are enriched for transcription factor binding, epigenetic marks, or open chromatin regions.

Sources of well curated gene sets include GO consortium (gene’s function, biological process and cellular localisation), REACTOME (Biological Pathways) and MsigDB (Computationally and Experimentally derived).

Gene Set testing for peaks

Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.

How we perform this analysis will depend on the type of peaks we are interested in. There are a wide range of types of peaks in our data set, some in promoters where annotation is straightforward, and many elswhere where annotation is trickier.

Gene Set testing for peaks

We will go through two strategies:

  • using the closest gene with ChIPseeker followed by gene set enrichment with clusterProfiler. This is typically done for peaks in promoters

offset

Gene Set testing for peaks

We will go through two strategies:

  1. using the closest gene with ChIPseeker followed by gene set enrichment with clusterProfiler. This is typically done for peaks in promoters

  2. allowing for annotation of one peak with multiple genes using toolset called GREAT. This is usually done to annotate enhancer or distal peaks.

offset

Choosing a peak set

Our peak set is large (~72k peaks), which will result in many genes being annotated to peaks.

Even from the ChIPseeker annotation of the nearest gene, there are almost 17k genes, making it unlikely we will find any real specific enrichment of gene sets.

We should choose a more specific set of peaks to test.

length(unique(annotatedPeaksGR$geneId))
## [1] 16801

Using differential peaks for test

The specific set of peaks we choose will depend on our question. In our case, we just performed differential analysis on our consensus peak set, so we can use the peaks that go up in W6 vs W0.

Here we read in the differential results and look at them.

W6MinusD0 <- rio::import("data/W6MinusD0.xlsx")

W6MinusD0[1:5, ]
##         seqnames     start       end width strand  baseMean log2FoldChange
## 1          chr12  24581325  24586154  4830      * 9192.9984      -8.891086
## 2          chr12  24575912  24578078  2167      * 2788.4081      -9.487306
## 3          chr13  23573848  23574171   324      * 1399.1613      -7.289620
## 4           chr4 135549991 135552699  2709      *  439.3033      -6.265374
## 5 chrUn_GL456383     22710     26551  3842      * 2422.7188      -5.462346
##       lfcSE      stat        pvalue          padj
## 1 0.3253001 -27.33195 1.770064e-164 4.410114e-160
## 2 0.4380051 -21.66026 4.864412e-104 6.059841e-100
## 3 0.3891231 -18.73346  2.641873e-78  2.194075e-74
## 4 0.4460385 -14.04671  8.069439e-45  5.026252e-41
## 5 0.3993478 -13.67817  1.371076e-42  6.832071e-39

Using differential peaks for test

The gene annotation packages (e.g. ChIPseeker, GREAT) require a GRanges object. We can convert this table to a GRanges and keep key differential statistics as metadata in the GRanges object.

W6MinusD0_gr <- GRanges(seqnames = W6MinusD0$seqnames, IRanges(start = W6MinusD0$start,
    end = W6MinusD0$end), log2FoldChange = W6MinusD0$log2FoldChange, padj = W6MinusD0$padj)

W6MinusD0_gr[1:3, ]
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand | log2FoldChange         padj
##          <Rle>         <IRanges>  <Rle> |      <numeric>    <numeric>
##   [1]    chr12 24581325-24586154      * |       -8.89109 4.41011e-160
##   [2]    chr12 24575912-24578078      * |       -9.48731 6.05984e-100
##   [3]    chr13 23573848-23574171      * |       -7.28962  2.19408e-74
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths

Using differential peaks for test

We are going to look at the peaks that are increased in W6, so the GRanges is subset to the peaks that have a fold change greater than 2 and an adjusted p-value less than 0.05.

We also remove peaks that aren’t on the main chromosomes (usually unplaced configs).

W6MinusD0_gr_main <- W6MinusD0_gr[as.vector(seqnames(W6MinusD0_gr)) %in% paste0("chr",
    c(1:19, "X", "Y", "M"))]
W6MinusD0_gr_up <- W6MinusD0_gr_main[W6MinusD0_gr_main$log2FoldChange > 1 & W6MinusD0_gr_main$padj <
    0.05]

W6MinusD0_gr_up
## GRanges object with 4627 ranges and 2 metadata columns:
##          seqnames              ranges strand | log2FoldChange        padj
##             <Rle>           <IRanges>  <Rle> |      <numeric>   <numeric>
##      [1]     chr3   69954132-69954644      * |        5.33160 9.93884e-36
##      [2]    chr12     9855553-9856569      * |        4.99663 2.81330e-33
##      [3]     chr7 114358259-114358706      * |        5.45989 9.14611e-31
##      [4]     chr4   97465816-97466559      * |        4.43453 7.28496e-29
##      [5]    chr13   38793431-38794003      * |        5.40520 1.40299e-28
##      ...      ...                 ...    ... .            ...         ...
##   [4623]    chr15   91037086-91037371      * |        1.59945   0.0498130
##   [4624]    chr14   11671472-11671821      * |        2.06389   0.0498410
##   [4625]    chr16   78456277-78456679      * |        1.55887   0.0498410
##   [4626]     chr6   90471253-90471635      * |        1.46471   0.0498410
##   [4627]    chr12   36438399-36438569      * |        1.70550   0.0498587
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths

Functional Enrichment with nearest gene


Enrichment with nearest gene

To perform gene set using the closest gene to each peak, we will use the clusterProfiler package.

clusterProfiler provides multiple enrichment functions that allow for comparison of your gene list to known (e.g. GO, KEGG) or custom gene sets. Detailed information about all of the functionality within this package is available here.

Enrichment with nearest gene

The functions we will use in clusterProfiler take a vector of genes, which we can obtain from the ChIPseeker result.

It then calculates a p-value based on a hypergeometric distribution to determine if a gene set is over represented in our vector of genes.

offset

Enrichment with nearest gene

To get a list of potential target genes, we again use the ChIPseeker package to associate our peaks, representing potential transcription factor binding events, to their overlapping or closest mm10 genes.

peakAnno_up <- annotatePeak(W6MinusD0_gr_up, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
    annoDb = "org.Mm.eg.db")

Enrichment with nearest gene

peakAnno_up
## Annotated peaks generated by ChIPseeker
## 4627/4627  peaks were annotated
## Genomic Annotation Summary:
##              Feature   Frequency
## 9           Promoter  6.02982494
## 4             5' UTR  0.12967365
## 3             3' UTR  1.18867517
## 1           1st Exon  1.70736979
## 7         Other Exon  3.02571861
## 2         1st Intron 15.99308407
## 8       Other Intron 29.21979684
## 6 Downstream (<=300)  0.06483683
## 5  Distal Intergenic 42.64102010

Enrichment with nearest gene

We again convert this csAnno object to a GRanges to see the individual peak annotations.

annotatedPeaksGR_up <- as.GRanges(peakAnno_up)
annotatedPeaksGR_up[1:3]
## GRanges object with 3 ranges and 14 metadata columns:
##       seqnames              ranges strand | log2FoldChange        padj
##          <Rle>           <IRanges>  <Rle> |      <numeric>   <numeric>
##   [1]     chr3   69954132-69954644      * |        5.33160 9.93884e-36
##   [2]    chr12     9855553-9856569      * |        4.99663 2.81330e-33
##   [3]     chr7 114358259-114358706      * |        5.45989 9.14611e-31
##                   annotation   geneChr geneStart   geneEnd geneLength
##                  <character> <integer> <integer> <integer>  <integer>
##   [1]      Distal Intergenic         3  70007613  70028708      21096
##   [2]      Distal Intergenic        12   9574441   9581500       7060
##   [3] Intron (ENSMUST00000..         7 114354640 114415021      60382
##       geneStrand      geneId         transcriptId distanceToTSS
##        <integer> <character>          <character>     <numeric>
##   [1]          1      229389 ENSMUST00000053013.5        -52969
##   [2]          1       23967 ENSMUST00000057021.8        281112
##   [3]          2       71045 ENSMUST00000124673.1         56315
##                  ENSEMBL        SYMBOL               GENENAME
##              <character>   <character>            <character>
##   [1] ENSMUSG00000027788         Otol1               otolin 1
##   [2] ENSMUSG00000048387          Osr1 odd-skipped related ..
##   [3] ENSMUSG00000087475 4933406I18Rik RIKEN cDNA 4933406I1..
##   -------
##   seqinfo: 34 sequences (1 circular) from mm10 genome

Analysis of promoter peaks

The closest gene annotation approach works well when we subset to just the peaks localized to promoters as this leads to confident peak to gene annotations.

For this Sox9 CUT&RUN dataset there are not many promoter-bound peaks, resulting in a small gene list. This might be okay if certain gene sets are enriched in these genes.

annotatedPeaksGR_up_prom <- annotatedPeaksGR_up[grepl("Promoter", annotatedPeaksGR_up$annotation)]
up_promGenes_uniq <- unique(annotatedPeaksGR_up_prom$geneId)

length(up_promGenes_uniq)
## [1] 268

Identify universe of genes

Pathway enrichment requires a ‘universe’ of genes. We can extract all genes which are included in the TxDb object to use as our universe.

allGeneGR <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
allGeneGR[1:2, ]
## GRanges object with 2 ranges and 1 metadata column:
##             seqnames            ranges strand |     gene_id
##                <Rle>         <IRanges>  <Rle> | <character>
##   100009600     chr9 21062393-21073096      - |   100009600
##   100009609     chr7 84935565-84964115      - |   100009609
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
allGeneIDs <- allGeneGR$gene_id

Analysis of promoter peaks

Once we have our gene list and the universe of genes in the same format, we can use them in the enrichGO function to perform gene ontology analysis

For the ont argument, we can choose between the “BP”, “MF”, and “CC” subontologies, or “ALL” for all three.

library(clusterProfiler)
library(org.Mm.eg.db)

GO_result_prom <- enrichGO(gene = up_promGenes_uniq, universe = allGeneIDs, OrgDb = org.Mm.eg.db,
    ont = "BP")

Analysis of promoter peaks

In our case, this results in no enriched terms. This suggests the Sox9 is not likely not enacting its main regulatory effect by binding to promoters.

This is not surprising given how few peaks overlapped with promoters

GO_result_prom
## #
## # over-representation test
## #
## #...@organism     Mus musculus 
## #...@ontology     BP 
## #...@keytype      ENTREZID 
## #...@gene     chr [1:268] "56443" "68481" "53333" "102465715" "77116" "238871" "13806" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...0 enriched terms found
## #...Citation
##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
##  The Innovation. 2021, 2(3):100141

Nearest gene approach with all genes

We can try to use the whole set of Sox9 peaks that go up in the week 6 time point.

up_genes_uniq <- unique(annotatedPeaksGR_up$geneId)

length(up_genes_uniq)
## [1] 2821

Nearest gene approach with all genes

We again use the enrichGO function, but this time with all of the genes from the differential peaks.

GO_result <- enrichGO(gene = up_genes_uniq, universe = allGeneIDs, OrgDb = org.Mm.eg.db,
    ont = "BP")
GO_result
## #
## # over-representation test
## #
## #...@organism     Mus musculus 
## #...@ontology     BP 
## #...@keytype      ENTREZID 
## #...@gene     chr [1:2821] "229389" "23967" "71045" "319865" "66143" "18478" "52440" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...976 enriched terms found
## 'data.frame':    976 obs. of  12 variables:
##  $ ID            : chr  "GO:0051216" "GO:0061448" "GO:0198738" "GO:0016055" ...
##  $ Description   : chr  "cartilage development" "connective tissue development" "cell-cell signaling by wnt" "Wnt signaling pathway" ...
##  $ GeneRatio     : chr  "66/2616" "83/2616" "108/2616" "107/2616" ...
##  $ BgRatio       : chr  "224/22195" "318/22195" "463/22195" "461/22195" ...
##  $ RichFactor    : num  0.295 0.261 0.233 0.232 0.257 ...
##  $ FoldEnrichment: num  2.5 2.21 1.98 1.97 2.18 ...
##  $ zScore        : num  8.25 7.97 7.78 7.69 7.81 ...
##  $ pvalue        : num  7.32e-13 1.33e-12 1.62e-12 2.85e-12 3.24e-12 ...
##  $ p.adjust      : num  3.33e-09 3.33e-09 3.33e-09 3.97e-09 3.97e-09 ...
##  $ qvalue        : num  2.44e-09 2.44e-09 2.44e-09 2.91e-09 2.91e-09 ...
##  $ geneID        : chr  "23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661"| __truncated__ "23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/2"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ ...
##  $ Count         : int  66 83 108 107 83 110 51 95 109 51 ...
## #...Citation
##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
##  The Innovation. 2021, 2(3):100141

Nearest gene approach with all genes

We now have an enrichResult instance. From this object, we can extract a data frame of the most highly enriched gene ontology categories.

GO_result_df <- data.frame(GO_result, row.names = NULL)
GO_result_df[1:3, ]
##           ID                   Description GeneRatio   BgRatio RichFactor
## 1 GO:0051216         cartilage development   66/2616 224/22195  0.2946429
## 2 GO:0061448 connective tissue development   83/2616 318/22195  0.2610063
## 3 GO:0198738    cell-cell signaling by wnt  108/2616 463/22195  0.2332613
##   FoldEnrichment   zScore       pvalue     p.adjust       qvalue
## 1       2.499846 8.246835 7.319794e-13 3.329261e-09 2.437582e-09
## 2       2.214463 7.973432 1.327538e-12 3.329261e-09 2.437582e-09
## 3       1.979066 7.782046 1.618241e-12 3.329261e-09 2.437582e-09
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                       geneID
## 1                                                                                                                                                                                                                                                                   23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/12160/58250/83925/20319/12393/15251/12399/12387/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/269941/14173/17132/18740/65960/20855/106522/15530/102607/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/208647/17129/19645
## 2                                                                                                                                                         23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/16795/12160/16000/58250/83925/20319/231991/12393/15251/26383/12399/12387/20660/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/18128/269941/14173/17132/18740/65960/20855/106522/15530/102607/666173/234199/13405/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/232685/208647/17129/12977/19645
## 3 235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/54198/666048/107515/58198/16543/110957/26563/74256/69151/14042/18605/72043/232334/12389/53883/22414/66725/207181/12505/17342/20319/12876/67916/68010/50523/59036/56637/12387/98170/93687/218581/723818/12842/233328/219134/11426/17127/12388/66042/56458/329252/16841/621976/69601/20377/14371/19664/16842/20583/18128/19043/26941/13617/21929/14362/16687/14173/11865/109901/494504/21887/18163/107351/18390/57265/21888/64297/14365/50781/18121/68775/14370/14634/75607/21951/19211/17191/24136/13618/19042/20729/12833/75723/74025/77938/387223/84035/103583/665113/66277
##   Count
## 1    66
## 2    83
## 3   108

Visualize enrichment result

Network plots can be generated from any enrichResult object using the enrichplot package.

We measure similarities between the various significant gene sets and group them accordingly. The showCategory argument specifies how many top gene ontology hits to show.

library(enrichplot)
GO_result_plot <- pairwise_termsim(GO_result)
emapplot(GO_result_plot, showCategory = 20, cex.params = list(category_label = 0.6))

Using MSigDB gene sets

In addition to gene ontology, we can test our gene list against custom gene sets that we import using the clusterProfiler enricher function. Similar to the enrichGO function, this will produce an enrichResult object that can be used for visualization.

Here we will get gene sets from MSigDB using the msigdbr package. We can use this package to pull out specific gene sets, which can be seen at this link. We can also run the msigdbr_collections function to see the categories and subcategory codes that will be used for accessing the gene sets.

library(msigdbr)
msigdbr_collections()
## # A tibble: 25 x 4
##    gs_collection gs_subcollection  gs_collection_name               num_genesets
##    <chr>         <chr>             <chr>                                   <int>
##  1 C1            ""                Positional                                302
##  2 C2            "CGP"             Chemical and Genetic Perturbati~         3494
##  3 C2            "CP"              Canonical Pathways                         19
##  4 C2            "CP:BIOCARTA"     BioCarta Pathways                         292
##  5 C2            "CP:KEGG_LEGACY"  KEGG Legacy Pathways                      186
##  6 C2            "CP:KEGG_MEDICUS" KEGG Medicus Pathways                     658
##  7 C2            "CP:PID"          PID Pathways                              196
##  8 C2            "CP:REACTOME"     Reactome Pathways                        1736
##  9 C2            "CP:WIKIPATHWAYS" WikiPathways                              830
## 10 C3            "MIR:MIRDB"       miRDB                                    2377
## # i 15 more rows

Using MSigDB gene sets

From the data frame on the previous slide we can identify the category/subcategory we want, and use those in the msigdbr function. Here we will use “C2” as the category, and “REACTOME” as the subcategory to access the Reactome gene sets, and in the end we need to get a data frame where the first column contains the name of the gene sets and the second column contains the gene ID.

library(msigdbr)
msig_t2g_reac <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "REACTOME")
msig_t2g_reac <- msig_t2g_reac[, colnames(msig_t2g_reac) %in% c("gs_name", "entrez_gene")]
msig_t2g_reac[1:3, ]
## # A tibble: 3 x 2
##   gs_name                         entrez_gene
##   <chr>                           <chr>      
## 1 REACTOME_2_LTR_CIRCLE_FORMATION 23825      
## 2 REACTOME_2_LTR_CIRCLE_FORMATION 15361      
## 3 REACTOME_2_LTR_CIRCLE_FORMATION 319583

Using MSigDB gene sets

We then run the gene set enrichment, using the term to gene mapping we created as the TERM2GENE argument in the enricher function.

reactome <- enricher(gene = up_genes_uniq, universe = allGeneIDs, TERM2GENE = msig_t2g_reac)
reactome_df <- data.frame(reactome, row.names = NULL)
# clean up table to print on slide
reactome_df <- reactome_df %>%
    dplyr::select(-Description, -geneID) %>%
    dplyr::mutate(ID = substr(ID, 1, 50))
reactome_df[1:5, ]
##                                                   ID GeneRatio   BgRatio
## 1                        REACTOME_CDC42_GTPASE_CYCLE   42/1351 153/10508
## 2               REACTOME_NEPHRIN_FAMILY_INTERACTIONS   12/1351  23/10508
## 3 REACTOME_DISEASES_OF_SIGNAL_TRANSDUCTION_BY_GROWTH   91/1351 464/10508
## 4                          REACTOME_RHO_GTPASE_CYCLE   87/1351 446/10508
## 5                   REACTOME_CELL_CELL_COMMUNICATION   38/1351 153/10508
##   RichFactor FoldEnrichment   zScore       pvalue    p.adjust      qvalue Count
## 1  0.2745098       2.135121 5.432546 9.312843e-07 0.001170624 0.001082250    42
## 2  0.5217391       4.058057 5.639176 6.697622e-06 0.004209456 0.003891671    12
## 3  0.1961207       1.525415 4.446301 1.745630e-05 0.007314192 0.006762021    91
## 4  0.1950673       1.517222 4.287386 3.291241e-05 0.009331757 0.008627274    87
## 5  0.2483660       1.931777 4.459363 3.711916e-05 0.009331757 0.008627274    38

Limitations of nearest gene

In our case, annotating the peaks with the nearest gene is likely not the best strategy given the high number of non-promoter and distal intergenic peaks.

For example, it’s known that regulatory elements like enhancers can regulate genes that are far away, and can even skip over nearby genes to regulate distant ones.

By only using the closest gene, we might be missing key peak to gene relationships. For enhancer-focused peak sets we have other methods that are better suited.

Functional Enrichment of distal regions with GREAT


GREAT for GO and functional testing

For methods that rely on genomic regions (i.e. peaks) like CUT&RUN and ATACseq, tools have been developed to better incorporate peaks distal to genes into their enrichment testing, such as the popular GREAT toolset.

Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some gene sets as a whole having a higher chance of having its members selected.

It’s known that genes can be regulated by elements that are far away (e.g. 1000kb), so you might miss key regulatory relationships by limiting to the closest gene.

Of course this method will likely introduce more false positives, so this might be worth it if you know you are looking at enhancers (like Sox9), but not if the TF is binding to promoters.

GREAT for GO and functional testing

Instead of assigning a peak the closest genes (like ChIPseeker), GREAT provides flexible options for controlling peak annotation.

The default setting is called ‘basalPlusExt’, which gives each gene a defined basal regulatory regions (e.g. 5kb upstream and 1kb downstream) and then expends that region in either direction based on a maximum set distance (e.g. 1000kb).

A peak is then annotated by any gene regulatory region the peak overlaps. This might be multiple genes.

offset

GREAT for GO and functional testing

After defining regulatory regions for each individual gene, GREAT then compares the proportion of peaks mapping to an entire gene set’s regulatory regions to the proportion of the genome occupied by gene set’s regulatory regions.

i.e. If a gene set’s regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance.

This will help reduce bias towards gene sets that have many genes with large regulatory domains (i.e. more distal, few surrounding genes)

(McClean, 2010) ]

]

rGREAT

We can use the rGREAT package to utilize the GREAT algorithm.

There are two ways to use rGREAT, by remotely accessing the GREAT server with the submitGreatJob function, or running the algorithms locally with the great function.

The local version provides a bit more flexibility, so we will use this.

library(rGREAT)

rGREAT for GO and functional testing

To use the great function we need at least three arguments:

  • GRanges of Sox9 peaks
  • the gene sets to test against (look at the help to see options - ?great)
  • a genome to define the TSS loci, usually easiest to specify a TxDb

This function contains many other arguments to fine tune the gene annotation settings.

great_gobp <- great(W6MinusD0_gr_up, gene_sets = "GO:BP", tss_source = "TxDb.Mmusculus.UCSC.mm10.knownGene")

rGREAT for GO and functional testing

This function returns a GreatObject containing the results of the test and various informaiton about the inputs. These can be accessed using the ‘@’ accessor.

If we print the object, we get a summary.

great_gobp
## 4627 regions are associated to 4245 genes' extended TSSs.
##   TSS source: TxDb.Mmusculus.UCSC.mm10.knownGene
##   Genome: mm10
##   OrgDb: org.Mm.eg.db
##   Gene sets: GO:BP
##   Background: whole genome excluding gaps
## Mode: Basal plus extension
##   Proximal: 5000 bp upstream, 1000 bp downstream,
##   plus Distal: up to 1000000 bp

rGREAT for GO and functional testing

The results table can be retrieved using the getEnrichmentTables function.

great_gobp_tab <- getEnrichmentTables(great_gobp)

great_gobp_tab[1:5, ]
##           id                                              description
## 1 GO:0002762 negative regulation of myeloid leukocyte differentiation
## 2 GO:0045638      negative regulation of myeloid cell differentiation
## 3 GO:0042303                                            molting cycle
## 4 GO:0042633                                               hair cycle
## 5 GO:0045639      positive regulation of myeloid cell differentiation
##   genome_fraction observed_region_hits fold_enrichment p_value p_adjust
## 1     0.007923178                   99        2.700451       0        0
## 2     0.011056773                  136        2.658343       0        0
## 3     0.015534092                  190        2.643432       0        0
## 4     0.015534092                  190        2.643432       0        0
## 5     0.010856163                  132        2.627835       0        0
##   mean_tss_dist observed_gene_hits gene_set_size fold_enrichment_hyper
## 1        259213                 23            64              1.742524
## 2        261644                 33            97              1.649579
## 3        196828                 59           137              2.088154
## 4        196828                 59           137              2.088154
## 5        238689                 37           114              1.573721
##   p_value_hyper p_adjust_hyper
## 1  3.301098e-03   1.634203e-02
## 2  1.450937e-03   8.381898e-03
## 3  2.201773e-09   4.700020e-08
## 4  2.201773e-09   4.700020e-08
## 5  2.034818e-03   1.097948e-02

Reactome gene sets with rGREAT

The great function can be used to test custom gene sets. The gene sets need to be a list where each entry is named and contains a character vector of gene IDs.

We can convert our Reactome gene set table we previously generated to a list.

# convert to a list of gene sets
reac_gene_sets <- split(msig_t2g_reac$entrez_gene, msig_t2g_reac$gs_name)
reac_gene_sets <- lapply(reac_gene_sets, as.character)

reac_gene_sets[1:2]
## $REACTOME_2_LTR_CIRCLE_FORMATION
## [1] "23825"  "15361"  "319583" "101739" "108138" "22596"  "14375" 
## 
## $REACTOME_ABACAVIR_ADME
## [1] "18671" "26357" "75894" "11522" "76952" "18534" "20517" "20518" "20519"

Reactome gene sets with rGREAT

This list is then used for the ‘gene_sets’ argument for the great function.

great_reac <- great(W6MinusD0_gr_up, gene_sets = reac_gene_sets, tss_source = "TxDb.Mmusculus.UCSC.mm10.knownGene")

great_reac_tab <- getEnrichmentTable(great_reac)
great_reac_tab[1:2, ]
##                                             id genome_fraction
## 1 REACTOME_TRANSCRIPTIONAL_REGULATION_BY_RUNX3     0.008752716
## 2                    REACTOME_SIGNALING_BY_ALK     0.004829087
##   observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist
## 1                  138        3.407507       0        0        232957
## 2                   72        3.222315       0        0        252943
##   observed_gene_hits gene_set_size fold_enrichment_hyper p_value_hyper
## 1                 34            96              1.717270  0.0005505350
## 2                 15            30              2.424382  0.0003260066
##   p_adjust_hyper
## 1     0.01333749
## 2     0.01020957

Reactome gene sets with rGREAT

Our top hit, RUNX related regulation, lines up well with the publication. Below is a figure showing RNAseq results, demonstrating that at week 6 there is a string increase in RUNX genes.

igv

Retrieve rGREAT gene annotations

It might be useful to know the genes associated with each peak. For example to overlap with a DE gene list from an RNAseq experiment.

This can be retrieved with the getRegionGeneAssociations function.

great_genes <- getRegionGeneAssociations(great_reac)

great_genes[1:5]
## GRanges object with 5 ranges and 4 metadata columns:
##       seqnames              ranges strand | log2FoldChange        padj
##          <Rle>           <IRanges>  <Rle> |      <numeric>   <numeric>
##   [1]     chr3   69954132-69954644      * |        5.33160 9.93884e-36
##   [2]    chr12     9855553-9856569      * |        4.99663 2.81330e-33
##   [3]     chr7 114358259-114358706      * |        5.45989 9.14611e-31
##   [4]     chr4   97465816-97466559      * |        4.43453 7.28496e-29
##   [5]    chr13   38793431-38794003      * |        5.40520 1.40299e-28
##       annotated_genes     dist_to_TSS
##       <CharacterList>   <IntegerList>
##   [1]    Sptssb,Otol1   -94192,-52969
##   [2]     Osr1,Nt5c1b  285437,-513404
##   [3]     Psma1,Pde3b   -82141,-56575
##   [4]    Gm12695,Nfia -680630,-306175
##   [5]  Eef1e1,Slc35b3  -134373,166872
##   -------
##   seqinfo: 34 sequences from an unspecified genome; no seqlengths

Time for an exercise!

Exercise on CUT&RUN data can be found here

Answers to exercise

Answers can be found here

R code for solutions can be found here