Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived RU_Course_help folder in the Rstudio menu.
Session -> Set Working Directory -> Choose Directory
or in the console.
We have now called peaks and then built a consensus peak set that we have counted over and determined differential peaks between W6 and W0 for our CUT&RUN data.
In this section we will:
So far we have been working with CUT&RUN peaks corresponding to transcription factor binding or ATACseq peaks corresponding to open chromatin regions. Transcription factors, as implied in the name, can affect the expression of their target genes and open regions generally correlate with gene expression.
We will often annotate peaks to genes to try and identify the target of a transcription factor or a gene regulated by a regulatory element uncovered by ATACseq. This is typically done using a simple set of rules:
Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance).
A useful package for annotation of peaks to genes is ChIPseeker.
By using pre-defined annotation in the form of a TXDB object for mouse (mm10 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites.
First load the libraries we require for the next part and read in our SOX9 CUT&RUN peaks.
The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from.
peakAnno <- annotatePeak(cnrPeaks_GR, tssRegion = c(-1000, 1000), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
annoDb = "org.Mm.eg.db")
## >> preparing features information... 2025-06-11 07:14:29
## >> identifying nearest features... 2025-06-11 07:14:29
## >> calculating distance from peak to TSS... 2025-06-11 07:14:30
## >> assigning genomic annotation... 2025-06-11 07:14:30
## >> adding gene annotation... 2025-06-11 07:14:31
## >> assigning chromosome lengths 2025-06-11 07:14:32
## >> done... 2025-06-11 07:14:32
## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"
The result is a csAnno object containing annotation for peaks and overall annotation statistics.
## Annotated peaks generated by ChIPseeker
## 72118/72174 peaks were annotated
## Genomic Annotation Summary:
## Feature Frequency
## 9 Promoter 20.3180898
## 4 5' UTR 0.1677806
## 3 3' UTR 1.4420810
## 1 1st Exon 1.2687540
## 7 Other Exon 2.5555340
## 2 1st Intron 14.3431598
## 8 Other Intron 24.3115450
## 6 Downstream (<=300) 0.1067695
## 5 Distal Intergenic 35.4862864
The csAnno object contains the information on annotation of individual peaks to genes.
To extract this from the csAnno object the ChIPseeker functions as.GRanges or as.data.frame can be used to produce the respective object with peaks and their associated genes.
## GRanges object with 2 ranges and 18 metadata columns:
## seqnames ranges strand | name score
## <Rle> <IRanges> <Rle> | <character> <numeric>
## [1] chr1 4466289-4466395 * | SOX9CNR_W6_rep1_macs.. 16
## [2] chr1 4785750-4785948 * | SOX9CNR_W6_rep1_macs.. 53
## signalValue pValue qValue peak annotation geneChr
## <numeric> <numeric> <numeric> <integer> <character> <integer>
## [1] 3.22302 3.93505 1.66890 13 Distal Intergenic 1
## [2] 5.79731 8.11406 5.39043 138 Promoter 1
## geneStart geneEnd geneLength geneStrand geneId
## <integer> <integer> <integer> <integer> <character>
## [1] 4492465 4493735 1271 2 20671
## [2] 4781221 4785739 4519 2 27395
## transcriptId distanceToTSS ENSEMBL SYMBOL
## <character> <numeric> <character> <character>
## [1] ENSMUST00000191939.1 27340 ENSMUSG00000025902 Sox17
## [2] ENSMUST00000146665.2 -11 ENSMUSG00000033845 Mrpl15
## GENENAME
## <character>
## [1] SRY (sex determining..
## [2] mitochondrial riboso..
## -------
## seqinfo: 35 sequences (1 circular) from mm10 genome
The genomic annotation for each peak is whin the annotation column and the closest gene is in shown in the geneId, ENSEMBL, and SYMBOL columns (geneId is the Entrez ID).
## [1] "Distal Intergenic"
## [2] "Promoter"
## [3] "Promoter"
## [4] "Promoter"
## [5] "Intron (ENSMUST00000134384.7/18777, intron 9 of 9)"
## [1] "Sox17" "Mrpl15" "Lypla1" "Lypla1" "Lypla1"
Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker’s plotting functions to display distribution of peaks in gene features. Here we use the plotAnnoBar function to plot this as a bar chart but plotAnnoPie would produce a similar plot as a pie chart.
Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc).
A frequent step in CUT&RUN or ATACseq analysis is to test whether common gene sets are enriched for transcription factor binding, epigenetic marks, or open chromatin regions.
Sources of well curated gene sets include GO consortium (gene’s function, biological process and cellular localisation), REACTOME (Biological Pathways) and MsigDB (Computationally and Experimentally derived).
Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.
How we perform this analysis will depend on the type of peaks we are interested in. There are a wide range of types of peaks in our data set, some in promoters where annotation is straightforward, and many elswhere where annotation is trickier.
We will go through two strategies:
We will go through two strategies:
using the closest gene with ChIPseeker followed by gene set enrichment with clusterProfiler. This is typically done for peaks in promoters
allowing for annotation of one peak with multiple genes using toolset called GREAT. This is usually done to annotate enhancer or distal peaks.
Our peak set is large (~72k peaks), which will result in many genes being annotated to peaks.
Even from the ChIPseeker annotation of the nearest gene, there are almost 17k genes, making it unlikely we will find any real specific enrichment of gene sets.
We should choose a more specific set of peaks to test.
## [1] 16801
The specific set of peaks we choose will depend on our question. In our case, we just performed differential analysis on our consensus peak set, so we can use the peaks that go up in W6 vs W0.
Here we read in the differential results and look at them.
## seqnames start end width strand baseMean log2FoldChange
## 1 chr12 24581325 24586154 4830 * 9192.9984 -8.891086
## 2 chr12 24575912 24578078 2167 * 2788.4081 -9.487306
## 3 chr13 23573848 23574171 324 * 1399.1613 -7.289620
## 4 chr4 135549991 135552699 2709 * 439.3033 -6.265374
## 5 chrUn_GL456383 22710 26551 3842 * 2422.7188 -5.462346
## lfcSE stat pvalue padj
## 1 0.3253001 -27.33195 1.770064e-164 4.410114e-160
## 2 0.4380051 -21.66026 4.864412e-104 6.059841e-100
## 3 0.3891231 -18.73346 2.641873e-78 2.194075e-74
## 4 0.4460385 -14.04671 8.069439e-45 5.026252e-41
## 5 0.3993478 -13.67817 1.371076e-42 6.832071e-39
The gene annotation packages (e.g. ChIPseeker, GREAT) require a GRanges object. We can convert this table to a GRanges and keep key differential statistics as metadata in the GRanges object.
W6MinusD0_gr <- GRanges(seqnames = W6MinusD0$seqnames, IRanges(start = W6MinusD0$start,
end = W6MinusD0$end), log2FoldChange = W6MinusD0$log2FoldChange, padj = W6MinusD0$padj)
W6MinusD0_gr[1:3, ]
## GRanges object with 3 ranges and 2 metadata columns:
## seqnames ranges strand | log2FoldChange padj
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr12 24581325-24586154 * | -8.89109 4.41011e-160
## [2] chr12 24575912-24578078 * | -9.48731 6.05984e-100
## [3] chr13 23573848-23574171 * | -7.28962 2.19408e-74
## -------
## seqinfo: 34 sequences from an unspecified genome; no seqlengths
We are going to look at the peaks that are increased in W6, so the GRanges is subset to the peaks that have a fold change greater than 2 and an adjusted p-value less than 0.05.
We also remove peaks that aren’t on the main chromosomes (usually unplaced configs).
W6MinusD0_gr_main <- W6MinusD0_gr[as.vector(seqnames(W6MinusD0_gr)) %in% paste0("chr",
c(1:19, "X", "Y", "M"))]
W6MinusD0_gr_up <- W6MinusD0_gr_main[W6MinusD0_gr_main$log2FoldChange > 1 & W6MinusD0_gr_main$padj <
0.05]
W6MinusD0_gr_up
## GRanges object with 4627 ranges and 2 metadata columns:
## seqnames ranges strand | log2FoldChange padj
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36
## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33
## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31
## [4] chr4 97465816-97466559 * | 4.43453 7.28496e-29
## [5] chr13 38793431-38794003 * | 5.40520 1.40299e-28
## ... ... ... ... . ... ...
## [4623] chr15 91037086-91037371 * | 1.59945 0.0498130
## [4624] chr14 11671472-11671821 * | 2.06389 0.0498410
## [4625] chr16 78456277-78456679 * | 1.55887 0.0498410
## [4626] chr6 90471253-90471635 * | 1.46471 0.0498410
## [4627] chr12 36438399-36438569 * | 1.70550 0.0498587
## -------
## seqinfo: 34 sequences from an unspecified genome; no seqlengths
To perform gene set using the closest gene to each peak, we will use the clusterProfiler package.
clusterProfiler provides multiple enrichment functions that allow for comparison of your gene list to known (e.g. GO, KEGG) or custom gene sets. Detailed information about all of the functionality within this package is available here.
The functions we will use in clusterProfiler take a vector of genes, which we can obtain from the ChIPseeker result.
It then calculates a p-value based on a hypergeometric distribution to determine if a gene set is over represented in our vector of genes.
To get a list of potential target genes, we again use the ChIPseeker package to associate our peaks, representing potential transcription factor binding events, to their overlapping or closest mm10 genes.
## Annotated peaks generated by ChIPseeker
## 4627/4627 peaks were annotated
## Genomic Annotation Summary:
## Feature Frequency
## 9 Promoter 6.02982494
## 4 5' UTR 0.12967365
## 3 3' UTR 1.18867517
## 1 1st Exon 1.70736979
## 7 Other Exon 3.02571861
## 2 1st Intron 15.99308407
## 8 Other Intron 29.21979684
## 6 Downstream (<=300) 0.06483683
## 5 Distal Intergenic 42.64102010
We again convert this csAnno object to a GRanges to see the individual peak annotations.
## GRanges object with 3 ranges and 14 metadata columns:
## seqnames ranges strand | log2FoldChange padj
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36
## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33
## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31
## annotation geneChr geneStart geneEnd geneLength
## <character> <integer> <integer> <integer> <integer>
## [1] Distal Intergenic 3 70007613 70028708 21096
## [2] Distal Intergenic 12 9574441 9581500 7060
## [3] Intron (ENSMUST00000.. 7 114354640 114415021 60382
## geneStrand geneId transcriptId distanceToTSS
## <integer> <character> <character> <numeric>
## [1] 1 229389 ENSMUST00000053013.5 -52969
## [2] 1 23967 ENSMUST00000057021.8 281112
## [3] 2 71045 ENSMUST00000124673.1 56315
## ENSEMBL SYMBOL GENENAME
## <character> <character> <character>
## [1] ENSMUSG00000027788 Otol1 otolin 1
## [2] ENSMUSG00000048387 Osr1 odd-skipped related ..
## [3] ENSMUSG00000087475 4933406I18Rik RIKEN cDNA 4933406I1..
## -------
## seqinfo: 34 sequences (1 circular) from mm10 genome
The closest gene annotation approach works well when we subset to just the peaks localized to promoters as this leads to confident peak to gene annotations.
For this Sox9 CUT&RUN dataset there are not many promoter-bound peaks, resulting in a small gene list. This might be okay if certain gene sets are enriched in these genes.
annotatedPeaksGR_up_prom <- annotatedPeaksGR_up[grepl("Promoter", annotatedPeaksGR_up$annotation)]
up_promGenes_uniq <- unique(annotatedPeaksGR_up_prom$geneId)
length(up_promGenes_uniq)
## [1] 268
Pathway enrichment requires a ‘universe’ of genes. We can extract all genes which are included in the TxDb object to use as our universe.
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 100009600 chr9 21062393-21073096 - | 100009600
## 100009609 chr7 84935565-84964115 - | 100009609
## -------
## seqinfo: 66 sequences (1 circular) from mm10 genome
Once we have our gene list and the universe of genes in the same
format, we can use them in the enrichGO
function to perform
gene ontology analysis
For the ont argument, we can choose between the “BP”, “MF”, and “CC” subontologies, or “ALL” for all three.
In our case, this results in no enriched terms. This suggests the Sox9 is not likely not enacting its main regulatory effect by binding to promoters.
This is not surprising given how few peaks overlapped with promoters
## #
## # over-representation test
## #
## #...@organism Mus musculus
## #...@ontology BP
## #...@keytype ENTREZID
## #...@gene chr [1:268] "56443" "68481" "53333" "102465715" "77116" "238871" "13806" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...0 enriched terms found
## #...Citation
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
## The Innovation. 2021, 2(3):100141
We can try to use the whole set of Sox9 peaks that go up in the week 6 time point.
## [1] 2821
We again use the enrichGO
function, but this time with
all of the genes from the differential peaks.
GO_result <- enrichGO(gene = up_genes_uniq, universe = allGeneIDs, OrgDb = org.Mm.eg.db,
ont = "BP")
GO_result
## #
## # over-representation test
## #
## #...@organism Mus musculus
## #...@ontology BP
## #...@keytype ENTREZID
## #...@gene chr [1:2821] "229389" "23967" "71045" "319865" "66143" "18478" "52440" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...976 enriched terms found
## 'data.frame': 976 obs. of 12 variables:
## $ ID : chr "GO:0051216" "GO:0061448" "GO:0198738" "GO:0016055" ...
## $ Description : chr "cartilage development" "connective tissue development" "cell-cell signaling by wnt" "Wnt signaling pathway" ...
## $ GeneRatio : chr "66/2616" "83/2616" "108/2616" "107/2616" ...
## $ BgRatio : chr "224/22195" "318/22195" "463/22195" "461/22195" ...
## $ RichFactor : num 0.295 0.261 0.233 0.232 0.257 ...
## $ FoldEnrichment: num 2.5 2.21 1.98 1.97 2.18 ...
## $ zScore : num 8.25 7.97 7.78 7.69 7.81 ...
## $ pvalue : num 7.32e-13 1.33e-12 1.62e-12 2.85e-12 3.24e-12 ...
## $ p.adjust : num 3.33e-09 3.33e-09 3.33e-09 3.97e-09 3.97e-09 ...
## $ qvalue : num 2.44e-09 2.44e-09 2.44e-09 2.91e-09 2.91e-09 ...
## $ geneID : chr "23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661"| __truncated__ "23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/2"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ "235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/5"| __truncated__ ...
## $ Count : int 66 83 108 107 83 110 51 95 109 51 ...
## #...Citation
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
## The Innovation. 2021, 2(3):100141
We now have an enrichResult instance. From this object, we can extract a data frame of the most highly enriched gene ontology categories.
## ID Description GeneRatio BgRatio RichFactor
## 1 GO:0051216 cartilage development 66/2616 224/22195 0.2946429
## 2 GO:0061448 connective tissue development 83/2616 318/22195 0.2610063
## 3 GO:0198738 cell-cell signaling by wnt 108/2616 463/22195 0.2332613
## FoldEnrichment zScore pvalue p.adjust qvalue
## 1 2.499846 8.246835 7.319794e-13 3.329261e-09 2.437582e-09
## 2 2.214463 7.973432 1.327538e-12 3.329261e-09 2.437582e-09
## 3 1.979066 7.782046 1.618241e-12 3.329261e-09 2.437582e-09
## geneID
## 1 23967/12394/20679/18028/12156/20678/19227/12839/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/12160/58250/83925/20319/12393/15251/12399/12387/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/269941/14173/17132/18740/65960/20855/106522/15530/102607/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/208647/17129/19645
## 2 23967/81004/12394/20679/19017/18028/12156/20678/19227/93691/12633/17978/12839/16973/16974/234356/67216/108153/210719/11977/12023/14042/320910/70661/18605/72043/16795/12160/16000/58250/83925/20319/231991/12393/15251/26383/12399/12387/20660/21834/73121/74155/12842/76566/12162/18750/17127/17260/18159/218772/666279/21808/214855/20583/18128/269941/14173/17132/18740/65960/20855/106522/15530/102607/666173/234199/13405/14182/14284/18121/192119/16007/14634/240396/14462/19226/12833/18019/232685/208647/17129/12977/19645
## 3 235542/81004/50873/21885/20668/14369/12156/218952/72293/13649/20356/67974/16973/16974/21416/11820/21844/18033/54198/666048/107515/58198/16543/110957/26563/74256/69151/14042/18605/72043/232334/12389/53883/22414/66725/207181/12505/17342/20319/12876/67916/68010/50523/59036/56637/12387/98170/93687/218581/723818/12842/233328/219134/11426/17127/12388/66042/56458/329252/16841/621976/69601/20377/14371/19664/16842/20583/18128/19043/26941/13617/21929/14362/16687/14173/11865/109901/494504/21887/18163/107351/18390/57265/21888/64297/14365/50781/18121/68775/14370/14634/75607/21951/19211/17191/24136/13618/19042/20729/12833/75723/74025/77938/387223/84035/103583/665113/66277
## Count
## 1 66
## 2 83
## 3 108
Network plots can be generated from any enrichResult object using the enrichplot package.
We measure similarities between the various significant gene sets and group them accordingly. The showCategory argument specifies how many top gene ontology hits to show.
library(enrichplot)
GO_result_plot <- pairwise_termsim(GO_result)
emapplot(GO_result_plot, showCategory = 20, cex.params = list(category_label = 0.6))
In addition to gene ontology, we can test our gene list against
custom gene sets that we import using the clusterProfiler
enricher
function. Similar to the enrichGO function, this
will produce an enrichResult object that can be used for
visualization.
Here we will get gene sets from MSigDB using the
msigdbr package. We can use this package to pull out
specific gene sets, which can be seen at this link.
We can also run the msigdbr_collections
function to see the
categories and subcategory codes that will be used for accessing the
gene sets.
## # A tibble: 25 x 4
## gs_collection gs_subcollection gs_collection_name num_genesets
## <chr> <chr> <chr> <int>
## 1 C1 "" Positional 302
## 2 C2 "CGP" Chemical and Genetic Perturbati~ 3494
## 3 C2 "CP" Canonical Pathways 19
## 4 C2 "CP:BIOCARTA" BioCarta Pathways 292
## 5 C2 "CP:KEGG_LEGACY" KEGG Legacy Pathways 186
## 6 C2 "CP:KEGG_MEDICUS" KEGG Medicus Pathways 658
## 7 C2 "CP:PID" PID Pathways 196
## 8 C2 "CP:REACTOME" Reactome Pathways 1736
## 9 C2 "CP:WIKIPATHWAYS" WikiPathways 830
## 10 C3 "MIR:MIRDB" miRDB 2377
## # i 15 more rows
From the data frame on the previous slide we can identify the
category/subcategory we want, and use those in the msigdbr
function. Here we will use “C2” as the category, and “REACTOME” as the
subcategory to access the Reactome
gene sets, and in the end we need to get a data frame where the first
column contains the name of the gene sets and the second column contains
the gene ID.
library(msigdbr)
msig_t2g_reac <- msigdbr(species = "Mus musculus", category = "C2", subcategory = "REACTOME")
msig_t2g_reac <- msig_t2g_reac[, colnames(msig_t2g_reac) %in% c("gs_name", "entrez_gene")]
msig_t2g_reac[1:3, ]
## # A tibble: 3 x 2
## gs_name entrez_gene
## <chr> <chr>
## 1 REACTOME_2_LTR_CIRCLE_FORMATION 23825
## 2 REACTOME_2_LTR_CIRCLE_FORMATION 15361
## 3 REACTOME_2_LTR_CIRCLE_FORMATION 319583
We then run the gene set enrichment, using the term to gene mapping we created as the TERM2GENE argument in the enricher function.
reactome <- enricher(gene = up_genes_uniq, universe = allGeneIDs, TERM2GENE = msig_t2g_reac)
reactome_df <- data.frame(reactome, row.names = NULL)
# clean up table to print on slide
reactome_df <- reactome_df %>%
dplyr::select(-Description, -geneID) %>%
dplyr::mutate(ID = substr(ID, 1, 50))
reactome_df[1:5, ]
## ID GeneRatio BgRatio
## 1 REACTOME_CDC42_GTPASE_CYCLE 42/1351 153/10508
## 2 REACTOME_NEPHRIN_FAMILY_INTERACTIONS 12/1351 23/10508
## 3 REACTOME_DISEASES_OF_SIGNAL_TRANSDUCTION_BY_GROWTH 91/1351 464/10508
## 4 REACTOME_RHO_GTPASE_CYCLE 87/1351 446/10508
## 5 REACTOME_CELL_CELL_COMMUNICATION 38/1351 153/10508
## RichFactor FoldEnrichment zScore pvalue p.adjust qvalue Count
## 1 0.2745098 2.135121 5.432546 9.312843e-07 0.001170624 0.001082250 42
## 2 0.5217391 4.058057 5.639176 6.697622e-06 0.004209456 0.003891671 12
## 3 0.1961207 1.525415 4.446301 1.745630e-05 0.007314192 0.006762021 91
## 4 0.1950673 1.517222 4.287386 3.291241e-05 0.009331757 0.008627274 87
## 5 0.2483660 1.931777 4.459363 3.711916e-05 0.009331757 0.008627274 38
In our case, annotating the peaks with the nearest gene is likely not the best strategy given the high number of non-promoter and distal intergenic peaks.
For example, it’s known that regulatory elements like enhancers can regulate genes that are far away, and can even skip over nearby genes to regulate distant ones.
By only using the closest gene, we might be missing key peak to gene relationships. For enhancer-focused peak sets we have other methods that are better suited.
For methods that rely on genomic regions (i.e. peaks) like CUT&RUN and ATACseq, tools have been developed to better incorporate peaks distal to genes into their enrichment testing, such as the popular GREAT toolset.
Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some gene sets as a whole having a higher chance of having its members selected.
It’s known that genes can be regulated by elements that are far away (e.g. 1000kb), so you might miss key regulatory relationships by limiting to the closest gene.
Of course this method will likely introduce more false positives, so this might be worth it if you know you are looking at enhancers (like Sox9), but not if the TF is binding to promoters.
Instead of assigning a peak the closest genes (like ChIPseeker), GREAT provides flexible options for controlling peak annotation.
The default setting is called ‘basalPlusExt’, which gives each gene a defined basal regulatory regions (e.g. 5kb upstream and 1kb downstream) and then expends that region in either direction based on a maximum set distance (e.g. 1000kb).
A peak is then annotated by any gene regulatory region the peak overlaps. This might be multiple genes.
After defining regulatory regions for each individual gene, GREAT then compares the proportion of peaks mapping to an entire gene set’s regulatory regions to the proportion of the genome occupied by gene set’s regulatory regions.
i.e. If a gene set’s regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance.
This will help reduce bias towards gene sets that have many genes with large regulatory domains (i.e. more distal, few surrounding genes)
]
We can use the rGREAT package to utilize the GREAT algorithm.
There are two ways to use rGREAT, by remotely accessing the GREAT
server with the submitGreatJob
function, or running the
algorithms locally with the great
function.
The local version provides a bit more flexibility, so we will use this.
To use the great
function we need at least three
arguments:
This function contains many other arguments to fine tune the gene annotation settings.
This function returns a GreatObject containing the results of the test and various informaiton about the inputs. These can be accessed using the ‘@’ accessor.
If we print the object, we get a summary.
## 4627 regions are associated to 4245 genes' extended TSSs.
## TSS source: TxDb.Mmusculus.UCSC.mm10.knownGene
## Genome: mm10
## OrgDb: org.Mm.eg.db
## Gene sets: GO:BP
## Background: whole genome excluding gaps
## Mode: Basal plus extension
## Proximal: 5000 bp upstream, 1000 bp downstream,
## plus Distal: up to 1000000 bp
The results table can be retrieved using the
getEnrichmentTables
function.
## id description
## 1 GO:0002762 negative regulation of myeloid leukocyte differentiation
## 2 GO:0045638 negative regulation of myeloid cell differentiation
## 3 GO:0042303 molting cycle
## 4 GO:0042633 hair cycle
## 5 GO:0045639 positive regulation of myeloid cell differentiation
## genome_fraction observed_region_hits fold_enrichment p_value p_adjust
## 1 0.007923178 99 2.700451 0 0
## 2 0.011056773 136 2.658343 0 0
## 3 0.015534092 190 2.643432 0 0
## 4 0.015534092 190 2.643432 0 0
## 5 0.010856163 132 2.627835 0 0
## mean_tss_dist observed_gene_hits gene_set_size fold_enrichment_hyper
## 1 259213 23 64 1.742524
## 2 261644 33 97 1.649579
## 3 196828 59 137 2.088154
## 4 196828 59 137 2.088154
## 5 238689 37 114 1.573721
## p_value_hyper p_adjust_hyper
## 1 3.301098e-03 1.634203e-02
## 2 1.450937e-03 8.381898e-03
## 3 2.201773e-09 4.700020e-08
## 4 2.201773e-09 4.700020e-08
## 5 2.034818e-03 1.097948e-02
The great
function can be used to test custom gene sets.
The gene sets need to be a list where each entry is named and contains a
character vector of gene IDs.
We can convert our Reactome gene set table we previously generated to a list.
# convert to a list of gene sets
reac_gene_sets <- split(msig_t2g_reac$entrez_gene, msig_t2g_reac$gs_name)
reac_gene_sets <- lapply(reac_gene_sets, as.character)
reac_gene_sets[1:2]
## $REACTOME_2_LTR_CIRCLE_FORMATION
## [1] "23825" "15361" "319583" "101739" "108138" "22596" "14375"
##
## $REACTOME_ABACAVIR_ADME
## [1] "18671" "26357" "75894" "11522" "76952" "18534" "20517" "20518" "20519"
This list is then used for the ‘gene_sets’ argument for the
great
function.
great_reac <- great(W6MinusD0_gr_up, gene_sets = reac_gene_sets, tss_source = "TxDb.Mmusculus.UCSC.mm10.knownGene")
great_reac_tab <- getEnrichmentTable(great_reac)
great_reac_tab[1:2, ]
## id genome_fraction
## 1 REACTOME_TRANSCRIPTIONAL_REGULATION_BY_RUNX3 0.008752716
## 2 REACTOME_SIGNALING_BY_ALK 0.004829087
## observed_region_hits fold_enrichment p_value p_adjust mean_tss_dist
## 1 138 3.407507 0 0 232957
## 2 72 3.222315 0 0 252943
## observed_gene_hits gene_set_size fold_enrichment_hyper p_value_hyper
## 1 34 96 1.717270 0.0005505350
## 2 15 30 2.424382 0.0003260066
## p_adjust_hyper
## 1 0.01333749
## 2 0.01020957
Our top hit, RUNX related regulation, lines up well with the publication. Below is a figure showing RNAseq results, demonstrating that at week 6 there is a string increase in RUNX genes.
It might be useful to know the genes associated with each peak. For example to overlap with a DE gene list from an RNAseq experiment.
This can be retrieved with the getRegionGeneAssociations
function.
## GRanges object with 5 ranges and 4 metadata columns:
## seqnames ranges strand | log2FoldChange padj
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr3 69954132-69954644 * | 5.33160 9.93884e-36
## [2] chr12 9855553-9856569 * | 4.99663 2.81330e-33
## [3] chr7 114358259-114358706 * | 5.45989 9.14611e-31
## [4] chr4 97465816-97466559 * | 4.43453 7.28496e-29
## [5] chr13 38793431-38794003 * | 5.40520 1.40299e-28
## annotated_genes dist_to_TSS
## <CharacterList> <IntegerList>
## [1] Sptssb,Otol1 -94192,-52969
## [2] Osr1,Nt5c1b 285437,-513404
## [3] Psma1,Pde3b -82141,-56575
## [4] Gm12695,Nfia -680630,-306175
## [5] Eef1e1,Slc35b3 -134373,166872
## -------
## seqinfo: 34 sequences from an unspecified genome; no seqlengths
Exercise on CUT&RUN data can be found here