class: center, middle, inverse, title-slide .title[ # CUT&RUN In Bioconductor (part5)
] .author[ ### Rockefeller University, Bioinformatics Resource Centre ] .date[ ###
http://rockefelleruniversity.github.io/RU_CUT&RUN/
] --- ## Set the Working directory Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived. You may navigate to the unarchived RU_Course_help folder in the Rstudio menu. **Session -> Set Working Directory -> Choose Directory** or in the console. ``` r setwd("~/Downloads/ATAC.Cut-Run.ChIP-master/r_course") ``` --- ## What we will cover We have now called peaks and then built a consensus peak set that we have counted over and determined differential peaks between W6 and W0 for our CUT&RUN data. In this section we will: * Overlap these peak sets to genomic features * Annotate peaks to genes * Perform functional enrichment for pathways and biological gene sets with the annotated peaks --- class: inverse, center, middle # Peak Annotation <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Annotation of peaks to genes So far we have been working with CUT&RUN peaks corresponding to transcription factor binding or ATACseq peaks corresponding to open chromatin regions. Transcription factors, as implied in the name, can affect the expression of their target genes and open regions generally correlate with gene expression. We will often annotate peaks to genes to try and identify the target of a transcription factor or a gene regulated by a regulatory element uncovered by ATACseq. This is typically done using a simple set of rules: Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance). --- ## Peak annotation A useful package for annotation of peaks to genes is **ChIPseeker**. By using pre-defined annotation in the form of a **TXDB** object for mouse (mm10 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites. First load the libraries we require for the next part and read in our SOX9 CUT&RUN peaks. ``` r library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(org.Mm.eg.db) library(ChIPseeker) library(rtracklayer) cnrPeaks_GR <- rtracklayer::import("data/peaks/SOX9CNR_W6_rep1_macs_peaks.narrowPeak") ``` --- ## Peak annotation The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from. ``` r peakAnno <- annotatePeak(cnrPeaks_GR, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene, annoDb = "org.Mm.eg.db") ``` ``` ## >> preparing features information... 2025-06-11 07:12:13 ## >> identifying nearest features... 2025-06-11 07:12:13 ## >> calculating distance from peak to TSS... 2025-06-11 07:12:14 ## >> assigning genomic annotation... 2025-06-11 07:12:14 ## >> adding gene annotation... 2025-06-11 07:12:24 ``` ``` ## >> assigning chromosome lengths 2025-06-11 07:12:24 ## >> done... 2025-06-11 07:12:24 ``` ``` r class(peakAnno) ``` ``` ## [1] "csAnno" ## attr(,"package") ## [1] "ChIPseeker" ``` --- ## Peak annotation The result is a csAnno object containing annotation for peaks and overall annotation statistics. ``` r peakAnno ``` ``` ## Annotated peaks generated by ChIPseeker ## 72118/72174 peaks were annotated ## Genomic Annotation Summary: ## Feature Frequency ## 9 Promoter 20.3180898 ## 4 5' UTR 0.1677806 ## 3 3' UTR 1.4420810 ## 1 1st Exon 1.2687540 ## 7 Other Exon 2.5555340 ## 2 1st Intron 14.3431598 ## 8 Other Intron 24.3115450 ## 6 Downstream (<=300) 0.1067695 ## 5 Distal Intergenic 35.4862864 ``` --- ## Peak annotation The csAnno object contains the information on annotation of individual peaks to genes. To extract this from the csAnno object the ChIPseeker functions *as.GRanges* or *as.data.frame* can be used to produce the respective object with peaks and their associated genes. ``` r annotatedPeaksGR <- as.GRanges(peakAnno) annotatedPeaksDF <- as.data.frame(peakAnno) ``` --- ## Peak annotation ``` r annotatedPeaksGR[1:2, ] ``` ``` ## GRanges object with 2 ranges and 18 metadata columns: ## seqnames ranges strand | name score ## <Rle> <IRanges> <Rle> | <character> <numeric> ## [1] chr1 4466289-4466395 * | SOX9CNR_W6_rep1_macs.. 16 ## [2] chr1 4785750-4785948 * | SOX9CNR_W6_rep1_macs.. 53 ## signalValue pValue qValue peak annotation geneChr ## <numeric> <numeric> <numeric> <integer> <character> <integer> ## [1] 3.22302 3.93505 1.66890 13 Distal Intergenic 1 ## [2] 5.79731 8.11406 5.39043 138 Promoter 1 ## geneStart geneEnd geneLength geneStrand geneId ## <integer> <integer> <integer> <integer> <character> ## [1] 4492465 4493735 1271 2 20671 ## [2] 4781221 4785739 4519 2 27395 ## transcriptId distanceToTSS ENSEMBL SYMBOL ## <character> <numeric> <character> <character> ## [1] ENSMUST00000191939.1 27340 ENSMUSG00000025902 Sox17 ## [2] ENSMUST00000146665.2 -11 ENSMUSG00000033845 Mrpl15 ## GENENAME ## <character> ## [1] SRY (sex determining.. ## [2] mitochondrial riboso.. ## ------- ## seqinfo: 35 sequences (1 circular) from mm10 genome ``` ## Peak annotation The genomic annotation for each peak is whin the *annotation* column and the closest gene is in shown in the *geneId*, *ENSEMBL*, and *SYMBOL* columns (geneId is the Entrez ID). ``` r annotatedPeaksGR$annotation[1:5] ``` ``` ## [1] "Distal Intergenic" ## [2] "Promoter" ## [3] "Promoter" ## [4] "Promoter" ## [5] "Intron (ENSMUST00000134384.7/18777, intron 9 of 9)" ``` ``` r annotatedPeaksGR$SYMBOL[1:5] ``` ``` ## [1] "Sox17" "Mrpl15" "Lypla1" "Lypla1" "Lypla1" ``` --- ## Vizualising peak annotation Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker's plotting functions to display distribution of peaks in gene features. Here we use the **plotAnnoBar** function to plot this as a bar chart but **plotAnnoPie** would produce a similar plot as a pie chart. ``` r plotAnnoBar(peakAnno) ``` <!-- --> --- ## Vizualising peak annotation Similarly we can plot the distribution of peaks around TSS sites. ``` r plotDistToTSS(peakAnno) ``` <!-- --> --- class: inverse, center, middle # Gene Set Enrichment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Gene Set testing for peaks Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc). A frequent step in CUT&RUN or ATACseq analysis is to test whether common gene sets are enriched for transcription factor binding, epigenetic marks, or open chromatin regions. Sources of well curated gene sets include [GO consortium](http://geneontology.org/) (gene's function, biological process and cellular localisation), [REACTOME](http://www.reactome.org/) (Biological Pathways) and [MsigDB](http://software.broadinstitute.org/gsea/msigdb/) (Computationally and Experimentally derived). --- ## Gene Set testing for peaks Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them. How we perform this analysis will depend on the type of peaks we are interested in. There are a wide range of types of peaks in our data set, some in promoters where annotation is straightforward, and many elswhere where annotation is trickier. --- ## Gene Set testing for peaks We will go through two strategies: * using the closest gene with **ChIPseeker** followed by gene set enrichment with **clusterProfiler**. This is typically done for peaks in promoters <div align="center"> <img src="imgs/gene_example_prom.png" alt="offset" height="250" width="800"> </div> --- ## Gene Set testing for peaks We will go through two strategies: 1. using the closest gene with **ChIPseeker** followed by gene set enrichment with **clusterProfiler**. This is typically done for peaks in promoters 2. allowing for annotation of one peak with multiple genes using toolset called **GREAT**. This is usually done to annotate enhancer or distal peaks. <div align="center"> <img src="imgs/gene_example_enh.png" alt="offset" height="250" width="800"> </div> --- ## Choosing a peak set Our peak set is large (~72k peaks), which will result in many genes being annotated to peaks. Even from the ChIPseeker annotation of the nearest gene, there are almost 17k genes, making it unlikely we will find any real specific enrichment of gene sets. We should choose a more specific set of peaks to test. ``` r length(unique(annotatedPeaksGR$geneId)) ``` ``` ## [1] 16801 ``` --- ## Using differential peaks for test The specific set of peaks we choose will depend on our question. In our case, we just performed differential analysis on our consensus peak set, so we can use the peaks that go up in W6 vs W0. Here we read in the differential results and look at them. ``` r W6MinusD0 <- rio::import("data/W6MinusD0.xlsx") W6MinusD0[1:5, ] ``` ``` ## seqnames start end width strand baseMean log2FoldChange ## 1 chr12 24581325 24586154 4830 * 9192.9984 -8.891086 ## 2 chr12 24575912 24578078 2167 * 2788.4081 -9.487306 ## 3 chr13 23573848 23574171 324 * 1399.1613 -7.289620 ## 4 chr4 135549991 135552699 2709 * 439.3033 -6.265374 ## 5 chrUn_GL456383 22710 26551 3842 * 2422.7188 -5.462346 ## lfcSE stat pvalue padj ## 1 0.3253001 -27.33195 1.770064e-164 4.410114e-160 ## 2 0.4380051 -21.66026 4.864412e-104 6.059841e-100 ## 3 0.3891231 -18.73346 2.641873e-78 2.194075e-74 ## 4 0.4460385 -14.04671 8.069439e-45 5.026252e-41 ## 5 0.3993478 -13.67817 1.371076e-42 6.832071e-39 ``` --- ## Using differential peaks for test The gene annotation packages (e.g. ChIPseeker, GREAT) require a GRanges object. We can convert this table to a GRanges and keep key differential statistics as metadata in the GRanges object. ``` r W6MinusD0_gr <- GRanges(seqnames = W6MinusD0$seqnames, IRanges(start = W6MinusD0$start, end = W6MinusD0$end), log2FoldChange = W6MinusD0$log2FoldChange, padj = W6MinusD0$padj) W6MinusD0_gr[1:3, ] ``` ``` ## GRanges object with 3 ranges and 2 metadata columns: ## seqnames ranges strand | log2FoldChange padj ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr12 24581325-24586154 * | -8.89109 4.41011e-160 ## [2] chr12 24575912-24578078 * | -9.48731 6.05984e-100 ## [3] chr13 23573848-23574171 * | -7.28962 2.19408e-74 ## ------- ## seqinfo: 34 sequences from an unspecified genome; no seqlengths ``` --- ## Using differential peaks for test We are going to look at the peaks that are increased in W6, so the GRanges is subset to the peaks that have a fold change greater than 2 and an adjusted p-value less than 0.05. We also remove peaks that aren't on the main chromosomes (usually unplaced configs). ``` r W6MinusD0_gr_main <- W6MinusD0_gr[as.vector(seqnames(W6MinusD0_gr)) %in% paste0("chr", c(1:19, "X", "Y", "M"))] W6MinusD0_gr_up <- W6MinusD0_gr_main[W6MinusD0_gr_main$log2FoldChange > 1 & W6MinusD0_gr_main$padj < 0.05] W6MinusD0_gr_up ``` ``` ## GRanges object with 4627 ranges and 2 metadata columns: ## seqnames ranges strand | log2FoldChange padj ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36 ## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33 ## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31 ## [4] chr4 97465816-97466559 * | 4.43453 7.28496e-29 ## [5] chr13 38793431-38794003 * | 5.40520 1.40299e-28 ## ... ... ... ... . ... ... ## [4623] chr15 91037086-91037371 * | 1.59945 0.0498130 ## [4624] chr14 11671472-11671821 * | 2.06389 0.0498410 ## [4625] chr16 78456277-78456679 * | 1.55887 0.0498410 ## [4626] chr6 90471253-90471635 * | 1.46471 0.0498410 ## [4627] chr12 36438399-36438569 * | 1.70550 0.0498587 ## ------- ## seqinfo: 34 sequences from an unspecified genome; no seqlengths ``` --- class: inverse, center, middle # Functional Enrichment with nearest gene <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Enrichment with nearest gene To perform gene set using the closest gene to each peak, we will use the **clusterProfiler package**. clusterProfiler provides multiple enrichment functions that allow for comparison of your gene list to known (e.g. GO, KEGG) or custom gene sets. Detailed information about all of the functionality within this package is available [here](http://yulab-smu.top/clusterProfiler-book/). --- ## Enrichment with nearest gene The functions we will use in clusterProfiler take a vector of genes, which we can obtain from the ChIPseeker result. It then calculates a p-value based on a hypergeometric distribution to determine if a gene set is over represented in our vector of genes. <img src="imgs/hypergeo.png" alt="offset" height="300" width="800"> --- ## Enrichment with nearest gene To get a list of potential target genes, we again use the **ChIPseeker package** to associate our peaks, representing potential transcription factor binding events, to their overlapping or closest mm10 genes. ``` r peakAnno_up <- annotatePeak(W6MinusD0_gr_up, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene, annoDb = "org.Mm.eg.db") ``` --- ## Enrichment with nearest gene ``` r peakAnno_up ``` ``` ## Annotated peaks generated by ChIPseeker ## 4627/4627 peaks were annotated ## Genomic Annotation Summary: ## Feature Frequency ## 9 Promoter 6.02982494 ## 4 5' UTR 0.12967365 ## 3 3' UTR 1.18867517 ## 1 1st Exon 1.70736979 ## 7 Other Exon 3.02571861 ## 2 1st Intron 15.99308407 ## 8 Other Intron 29.21979684 ## 6 Downstream (<=300) 0.06483683 ## 5 Distal Intergenic 42.64102010 ``` --- ## Enrichment with nearest gene We again convert this csAnno object to a GRanges to see the individual peak annotations. ``` r annotatedPeaksGR_up <- as.GRanges(peakAnno_up) annotatedPeaksGR_up[1:3] ``` ``` ## GRanges object with 3 ranges and 14 metadata columns: ## seqnames ranges strand | log2FoldChange padj ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36 ## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33 ## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31 ## annotation geneChr geneStart geneEnd geneLength ## <character> <integer> <integer> <integer> <integer> ## [1] Distal Intergenic 3 70007613 70028708 21096 ## [2] Distal Intergenic 12 9574441 9581500 7060 ## [3] Intron (ENSMUST00000.. 7 114354640 114415021 60382 ## geneStrand geneId transcriptId distanceToTSS ## <integer> <character> <character> <numeric> ## [1] 1 229389 ENSMUST00000053013.5 -52969 ## [2] 1 23967 ENSMUST00000057021.8 281112 ## [3] 2 71045 ENSMUST00000124673.1 56315 ## ENSEMBL SYMBOL GENENAME ## <character> <character> <character> ## [1] ENSMUSG00000027788 Otol1 otolin 1 ## [2] ENSMUSG00000048387 Osr1 odd-skipped related .. ## [3] ENSMUSG00000087475 4933406I18Rik RIKEN cDNA 4933406I1.. ## ------- ## seqinfo: 34 sequences (1 circular) from mm10 genome ``` --- ## Analysis of promoter peaks The closest gene annotation approach works well when we subset to just the peaks localized to promoters as this leads to confident peak to gene annotations. For this Sox9 CUT&RUN dataset there are not many promoter-bound peaks, resulting in a small gene list. This might be okay if certain gene sets are enriched in these genes. ``` r annotatedPeaksGR_up_prom <- annotatedPeaksGR_up[grepl("Promoter", annotatedPeaksGR_up$annotation)] up_promGenes_uniq <- unique(annotatedPeaksGR_up_prom$geneId) length(up_promGenes_uniq) ``` ``` ## [1] 268 ``` --- ## Identify universe of genes Pathway enrichment requires a 'universe' of genes. We can extract all genes which are included in the TxDb object to use as our universe. ``` r allGeneGR <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene) allGeneGR[1:2, ] ``` ``` ## GRanges object with 2 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 100009600 chr9 21062393-21073096 - | 100009600 ## 100009609 chr7 84935565-84964115 - | 100009609 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` ``` r allGeneIDs <- allGeneGR$gene_id ``` --- ## Analysis of promoter peaks Once we have our gene list and the universe of genes in the same format, we can use them in the `enrichGO` function to perform gene ontology analysis For the **ont** argument, we can choose between the "BP", "MF", and "CC" subontologies, or "ALL" for all three. ``` r library(clusterProfiler) library(org.Mm.eg.db) GO_result_prom <- enrichGO(gene = up_promGenes_uniq, universe = allGeneIDs, OrgDb = org.Mm.eg.db, ont = "BP") ``` --- ## Analysis of promoter peaks In our case, this results in no enriched terms. This suggests the Sox9 is not likely not enacting its main regulatory effect by binding to promoters. This is not surprising given how few peaks overlapped with promoters ``` r GO_result_prom ``` ``` ## # ## # over-representation test ## # ## #...@organism Mus musculus ## #...@ontology BP ## #...@keytype ENTREZID ## #...@gene chr [1:268] "56443" "68481" "53333" "102465715" "77116" "238871" "13806" ... ## #...pvalues adjusted by 'BH' with cutoff <0.05 ## #...0 enriched terms found ## #...Citation ## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. ## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. ## The Innovation. 2021, 2(3):100141 ``` --- ## Nearest gene approach with all genes We can try to use the whole set of Sox9 peaks that go up in the week 6 time point. ``` r up_genes_uniq <- unique(annotatedPeaksGR_up$geneId) length(up_genes_uniq) ``` ``` ## [1] 2821 ``` --- ## Nearest gene approach with all genes We again use the `enrichGO` function, but this time with all of the genes from the differential peaks. ``` r GO_result <- enrichGO(gene = up_genes_uniq, universe = allGeneIDs, OrgDb = org.Mm.eg.db, ont = "BP") GO_result ``` ``` ## # ## # over-representation test ## # ## #...@organism Mus musculus ## #...@ontology BP ## #...@keytype ENTREZID ## #...@gene chr [1:2821] "229389" "23967" "71045" "319865" "66143" "18478" "52440" ... ## #...pvalues adjusted by 'BH' with cutoff <0.05 ## #...976 enriched terms found ## 'data.frame': 976 obs. of 12 variables: ## $ ID : chr "GO:0051216" "GO:0061448" "GO:0198738" "GO:0016055" ... ## $ Description : chr "cartilage development" "connective tissue development" "cell-cell signaling by wnt" "Wnt signaling pathway" ... ## $ GeneRatio : chr "66/2616" "83/2616" "108/2616" "107/2616" ... ## $ BgRatio : chr "224/22195" "318/22195" "463/22195" "461/22195" ... ## $ RichFactor : num 0.295 0.261 0.233 0.232 0.257 ... ## $ FoldEnrichment: num 2.5 2.21 1.98 1.97 2.18 ... ## $ zScore : num 8.25 7.97 7.78 7.69 7.81 ... ## $ pvalue : num 7.32e-13 1.33e-12 1.62e-12 2.85e-12 3.24e-12 ... ## $ p.adjust : num 3.33e-09 3.33e-09 3.33e-09 3.97e-09 3.97e-09 ... ## $ qvalue : num 2.44e-09 2.44e-09 2.44e-09 2.91e-09 2.91e-09 ... ## $ geneID : chr "23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661"| __truncated__ "23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/2"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ ... ## $ Count : int 66 83 108 107 83 110 51 95 109 51 ... ## #...Citation ## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. ## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. ## The Innovation. 2021, 2(3):100141 ``` --- ## Nearest gene approach with all genes We now have an **enrichResult** instance. From this object, we can extract a data frame of the most highly enriched gene ontology categories. ``` r GO_result_df <- data.frame(GO_result, row.names = NULL) GO_result_df[1:3, ] ``` ``` ## ID Description GeneRatio BgRatio RichFactor ## 1 GO:0051216 cartilage development 66/2616 224/22195 0.2946429 ## 2 GO:0061448 connective tissue development 83/2616 318/22195 0.2610063 ## 3 GO:0198738 cell-cell signaling by wnt 108/2616 463/22195 0.2332613 ## FoldEnrichment zScore pvalue p.adjust qvalue ## 1 2.499846 8.246835 7.319794e-13 3.329261e-09 2.437582e-09 ## 2 2.214463 7.973432 1.327538e-12 3.329261e-09 2.437582e-09 ## 3 1.979066 7.782046 1.618241e-12 3.329261e-09 2.437582e-09 ## geneID ## 1 23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/12160/58250/83925/20319/12393/15251/12399/12387/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/269941/14173/17132/18740/65960/20855/106522/15530/102607/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/208647/17129/19645 ## 2 23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/16795/12160/16000/58250/83925/20319/231991/12393/15251/26383/12399/12387/20660/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/18128/269941/14173/17132/18740/65960/20855/106522/15530/102607/666173/234199/13405/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/232685/208647/17129/12977/19645 ## 3 235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/54198/666048/107515/58198/16543/110957/26563/74256/69151/14042/18605/72043/232334/12389/53883/22414/66725/207181/12505/17342/20319/12876/67916/68010/50523/59036/56637/12387/98170/93687/218581/723818/12842/233328/219134/11426/17127/12388/66042/56458/329252/16841/621976/69601/20377/14371/19664/16842/20583/18128/19043/26941/13617/21929/14362/16687/14173/11865/109901/494504/21887/18163/107351/18390/57265/21888/64297/14365/50781/18121/68775/14370/14634/75607/21951/19211/17191/24136/13618/19042/20729/12833/75723/74025/77938/387223/84035/103583/665113/66277 ## Count ## 1 66 ## 2 83 ## 3 108 ``` --- ## Visualize enrichment result Network plots can be generated from any enrichResult object using the **enrichplot** package. We measure similarities between the various significant gene sets and group them accordingly. The **showCategory** argument specifies how many top gene ontology hits to show. ``` r library(enrichplot) GO_result_plot <- pairwise_termsim(GO_result) emapplot(GO_result_plot, showCategory = 20, cex.params = list(category_label = 0.6)) ``` <!-- --> --- ## Using MSigDB gene sets In addition to gene ontology, we can test our gene list against custom gene sets that we import using the clusterProfiler `enricher` function. Similar to the enrichGO function, this will produce an enrichResult object that can be used for visualization. Here we will get gene sets from MSigDB using the **msigdbr** package. We can use this package to pull out specific gene sets, which can be seen [at this link](http://software.broadinstitute.org/gsea/msigdb/). We can also run the `msigdbr_collections` function to see the categories and subcategory codes that will be used for accessing the gene sets. ``` r library(msigdbr) msigdbr_collections() ``` ``` ## # A tibble: 25 x 4 ## gs_collection gs_subcollection gs_collection_name num_genesets ## <chr> <chr> <chr> <int> ## 1 C1 "" Positional 302 ## 2 C2 "CGP" Chemical and Genetic Perturbati~ 3494 ## 3 C2 "CP" Canonical Pathways 19 ## 4 C2 "CP:BIOCARTA" BioCarta Pathways 292 ## 5 C2 "CP:KEGG_LEGACY" KEGG Legacy Pathways 186 ## 6 C2 "CP:KEGG_MEDICUS" KEGG Medicus Pathways 658 ## 7 C2 "CP:PID" PID Pathways 196 ## 8 C2 "CP:REACTOME" Reactome Pathways 1736 ## 9 C2 "CP:WIKIPATHWAYS" WikiPathways 830 ## 10 C3 "MIR:MIRDB" miRDB 2377 ## # i 15 more rows ``` --- ## Using MSigDB gene sets From the data frame on the previous slide we can identify the category/subcategory we want, and use those in the `msigdbr` function. Here we will use "C2" as the category, and "REACTOME" as the subcategory to access the [Reactome](https://reactome.org/) gene sets, and in the end we need to get a data frame where the first column contains the name of the gene sets and the second column contains the gene ID. ``` r library(msigdbr) msig_t2g_reac <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "REACTOME") msig_t2g_reac <- msig_t2g_reac[, colnames(msig_t2g_reac) %in% c("gs_name", "entrez_gene")] msig_t2g_reac[1:3, ] ``` ``` ## # A tibble: 3 x 2 ## gs_name entrez_gene ## <chr> <chr> ## 1 REACTOME_2_LTR_CIRCLE_FORMATION 23825 ## 2 REACTOME_2_LTR_CIRCLE_FORMATION 15361 ## 3 REACTOME_2_LTR_CIRCLE_FORMATION 319583 ``` --- ## Using MSigDB gene sets We then run the gene set enrichment, using the term to gene mapping we created as the **TERM2GENE** argument in the enricher function. ``` r reactome <- enricher(gene = up_genes_uniq, universe = allGeneIDs, TERM2GENE = msig_t2g_reac) reactome_df <- data.frame(reactome, row.names = NULL) # clean up table to print on slide reactome_df <- reactome_df %>% dplyr::select(-Description, -geneID) %>% dplyr::mutate(ID = substr(ID, 1, 50)) reactome_df[1:5, ] ``` ``` ## ID GeneRatio BgRatio ## 1 REACTOME_CDC42_GTPASE_CYCLE 42/1351 153/10508 ## 2 REACTOME_NEPHRIN_FAMILY_INTERACTIONS 12/1351 23/10508 ## 3 REACTOME_DISEASES_OF_SIGNAL_TRANSDUCTION_BY_GROWTH 91/1351 464/10508 ## 4 REACTOME_RHO_GTPASE_CYCLE 87/1351 446/10508 ## 5 REACTOME_CELL_CELL_COMMUNICATION 38/1351 153/10508 ## RichFactor FoldEnrichment zScore pvalue p.adjust qvalue Count ## 1 0.2745098 2.135121 5.432546 9.312843e-07 0.001170624 0.001082250 42 ## 2 0.5217391 4.058057 5.639176 6.697622e-06 0.004209456 0.003891671 12 ## 3 0.1961207 1.525415 4.446301 1.745630e-05 0.007314192 0.006762021 91 ## 4 0.1950673 1.517222 4.287386 3.291241e-05 0.009331757 0.008627274 87 ## 5 0.2483660 1.931777 4.459363 3.711916e-05 0.009331757 0.008627274 38 ``` --- ## Limitations of nearest gene In our case, annotating the peaks with the nearest gene is likely not the best strategy given the high number of non-promoter and distal intergenic peaks. For example, it's known that regulatory elements like enhancers can regulate genes that are far away, and can even skip over nearby genes to regulate distant ones. By only using the closest gene, we might be missing key peak to gene relationships. For enhancer-focused peak sets we have other methods that are better suited. --- class: inverse, center, middle # Functional Enrichment of distal regions with GREAT <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## GREAT for GO and functional testing For methods that rely on genomic regions (i.e. peaks) like CUT&RUN and ATACseq, tools have been developed to better incorporate peaks distal to genes into their enrichment testing, such as the popular [GREAT](http://bejerano.stanford.edu/great/public/html/splash.php) toolset. Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some gene sets as a whole having a higher chance of having its members selected. It's known that genes can be regulated by elements that are far away (e.g. 1000kb), so you might miss key regulatory relationships by limiting to the closest gene. Of course this method will likely introduce more false positives, so this might be worth it if you know you are looking at enhancers (like Sox9), but not if the TF is binding to promoters. --- ## GREAT for GO and functional testing Instead of assigning a peak the closest genes (like ChIPseeker), GREAT provides flexible options for controlling peak annotation. The default setting is called 'basalPlusExt', which gives each gene a defined basal regulatory regions (e.g. 5kb upstream and 1kb downstream) and then expends that region in either direction based on a maximum set distance (e.g. 1000kb). A peak is then annotated by any gene regulatory region the peak overlaps. This might be multiple genes. <img src="imgs/great_basal_extension.png" alt="offset" height="250" width="800"> --- ## GREAT for GO and functional testing .pull-left[ After defining regulatory regions for each individual gene, GREAT then compares the proportion of peaks mapping to an entire gene set's regulatory regions to the proportion of the genome occupied by gene set's regulatory regions. i.e. If a gene set's regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance. This will help reduce bias towards gene sets that have many genes with large regulatory domains (i.e. more distal, few surrounding genes) [(McClean, 2010)](https://pmc.ncbi.nlm.nih.gov/articles/PMC4840234/) ] .pull-right[ <img src="imgs/great_explanation.png"height="500" width="325"> ] --- ## rGREAT We can use the rGREAT package to utilize the GREAT algorithm. There are two ways to use rGREAT, by remotely accessing the GREAT server with the `submitGreatJob` function, or running the algorithms locally with the `great` function. The local version provides a bit more flexibility, so we will use this. ``` r library(rGREAT) ``` --- ## rGREAT for GO and functional testing To use the `great` function we need at least three arguments: * GRanges of Sox9 peaks * the gene sets to test against (look at the help to see options - ?great) * a genome to define the TSS loci, usually easiest to specify a TxDb This function contains many other arguments to fine tune the gene annotation settings. ``` r great_gobp <- great(W6MinusD0_gr_up, gene_sets = "GO:BP", tss_source = "TxDb.Mmusculus.UCSC.mm10.knownGene") ``` --- ## rGREAT for GO and functional testing This function returns a GreatObject containing the results of the test and various informaiton about the inputs. These can be accessed using the '@' accessor. If we print the object, we get a summary. ``` r great_gobp ``` ``` ## 4627 regions are associated to 4245 genes' extended TSSs. ## TSS source: TxDb.Mmusculus.UCSC.mm10.knownGene ## Genome: mm10 ## OrgDb: org.Mm.eg.db ## Gene sets: GO:BP ## Background: whole genome excluding gaps ## Mode: Basal plus extension ## Proximal: 5000 bp upstream, 1000 bp downstream, ## plus Distal: up to 1000000 bp ``` --- ## rGREAT for GO and functional testing The results table can be retrieved using the `getEnrichmentTables` function. ``` r great_gobp_tab <- getEnrichmentTables(great_gobp) great_gobp_tab[1:5, ] ``` ``` ## id description ## 1 GO:0002762 negative regulation of myeloid leukocyte differentiation ## 2 GO:0045638 negative regulation of myeloid cell differentiation ## 3 GO:0042303 molting cycle ## 4 GO:0042633 hair cycle ## 5 GO:0045639 positive regulation of myeloid cell differentiation ## genome_fraction observed_region_hits fold_enrichment p_value p_adjust ## 1 0.007923178 99 2.700451 0 0 ## 2 0.011056773 136 2.658343 0 0 ## 3 0.015534092 190 2.643432 0 0 ## 4 0.015534092 190 2.643432 0 0 ## 5 0.010856163 132 2.627835 0 0 ## mean_tss_dist observed_gene_hits gene_set_size fold_enrichment_hyper ## 1 259213 23 64 1.742524 ## 2 261644 33 97 1.649579 ## 3 196828 59 137 2.088154 ## 4 196828 59 137 2.088154 ## 5 238689 37 114 1.573721 ## p_value_hyper p_adjust_hyper ## 1 3.301098e-03 1.634203e-02 ## 2 1.450937e-03 8.381898e-03 ## 3 2.201773e-09 4.700020e-08 ## 4 2.201773e-09 4.700020e-08 ## 5 2.034818e-03 1.097948e-02 ``` --- ## Reactome gene sets with rGREAT The `great` function can be used to test custom gene sets. The gene sets need to be a list where each entry is named and contains a character vector of gene IDs. We can convert our Reactome gene set table we previously generated to a list. ``` r # convert to a list of gene sets reac_gene_sets <- split(msig_t2g_reac$entrez_gene, msig_t2g_reac$gs_name) reac_gene_sets <- lapply(reac_gene_sets, as.character) reac_gene_sets[1:2] ``` ``` ## $REACTOME_2_LTR_CIRCLE_FORMATION ## [1] "23825" "15361" "319583" "101739" "108138" "22596" "14375" ## ## $REACTOME_ABACAVIR_ADME ## [1] "18671" "26357" "75894" "11522" "76952" "18534" "20517" "20518" "20519" ``` --- ## Reactome gene sets with rGREAT This list is then used for the 'gene_sets' argument for the `great` function. ``` r great_reac <- great(W6MinusD0_gr_up, gene_sets = reac_gene_sets, tss_source = "TxDb.Mmusculus.UCSC.mm10.knownGene") great_reac_tab <- getEnrichmentTable(great_reac) great_reac_tab[1:2, ] ``` ``` ## id genome_fraction ## 1 REACTOME_TRANSCRIPTIONAL_REGULATION_BY_RUNX3 0.008752716 ## 2 REACTOME_SIGNALING_BY_ALK 0.004829087 ## observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist ## 1 138 3.407507 0 0 232957 ## 2 72 3.222315 0 0 252943 ## observed_gene_hits gene_set_size fold_enrichment_hyper p_value_hyper ## 1 34 96 1.717270 0.0005505350 ## 2 15 30 2.424382 0.0003260066 ## p_adjust_hyper ## 1 0.01333749 ## 2 0.01020957 ``` --- ## Reactome gene sets with rGREAT Our top hit, RUNX related regulation, lines up well with the publication. Below is a figure showing RNAseq results, demonstrating that at week 6 there is a string increase in RUNX genes. <img src="imgs/rnaseq_heatmap.png" alt="igv" height="500" width="300"> --- ## Retrieve rGREAT gene annotations It might be useful to know the genes associated with each peak. For example to overlap with a DE gene list from an RNAseq experiment. This can be retrieved with the `getRegionGeneAssociations` function. ``` r great_genes <- getRegionGeneAssociations(great_reac) great_genes[1:5] ``` ``` ## GRanges object with 5 ranges and 4 metadata columns: ## seqnames ranges strand | log2FoldChange padj ## <Rle> <IRanges> <Rle> | <numeric> <numeric> ## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36 ## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33 ## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31 ## [4] chr4 97465816-97466559 * | 4.43453 7.28496e-29 ## [5] chr13 38793431-38794003 * | 5.40520 1.40299e-28 ## annotated_genes dist_to_TSS ## <CharacterList> <IntegerList> ## [1] Sptssb,Otol1 -94192,-52969 ## [2] Osr1,Nt5c1b 285437,-513404 ## [3] Psma1,Pde3b -82141,-56575 ## [4] Gm12695,Nfia -680630,-306175 ## [5] Eef1e1,Slc35b3 -134373,166872 ## ------- ## seqinfo: 34 sequences from an unspecified genome; no seqlengths ``` --- ## Time for an exercise! Exercise on CUT&RUN data can be found [here](../../exercises/exercises/ATACCutRunChIP_Annotation_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/ATACCutRunChIP_Annotation_answers.html) R code for solutions can be found [here](../../exercises/answers/ATACCutRunChIP_Annotation_answers.R)