In todays session we will continue to review the Myc ChIPseq we were working on in our last sessions.
This include Myc ChIPseq for MEL and Ch12 celllines.
Information and files for the Myc ChIPseq in MEL cell line can be found here
Information and files for the Myc ChIPseq in Ch12 cell line can be found here
In the data directory we have provided peak calls from MACS2 following the processing steps outlined in our last session.
Peak calls for Myc in MEL and Ch12 cellines can be found in
data/peaks/
In our last session we reviewed how we can identify putative transciption factor binding sites using peak calling programs such as MACS2.
library(GenomicRanges)
<- "data/peaks/Mel_1_peaks.xls"
macsPeaks <- read.delim(macsPeaks,comment.char="#")
macsPeaks_DF <- GRanges(seqnames=macsPeaks_DF[,"chr"],
macsPeaks_GR IRanges(macsPeaks_DF[,"start"],macsPeaks_DF[,"end"]))
mcols(macsPeaks_GR) <- macsPeaks_DF[,c("abs_summit", "fold_enrichment")]
1:5,] macsPeaks_GR[
## GRanges object with 5 ranges and 2 metadata columns:
## seqnames ranges strand | abs_summit fold_enrichment
## <Rle> <IRanges> <Rle> | <integer> <numeric>
## [1] chr1 4785370-4785641 * | 4785565 5.23296
## [2] chr1 5082993-5083216 * | 5083122 4.26417
## [3] chr1 7397557-7398093 * | 7397835 9.84580
## [4] chr1 9545120-9545629 * | 9545400 11.85079
## [5] chr1 9700895-9701037 * | 9700978 5.14291
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Since transcription factors, as suggested in name, may regulate the transcription of their target genes, we used the ChIPseeker package to associate our peaks, representing potential transcription factor binding events, to their overlapping or closest mm10 genes.
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(ChIPseeker)
<- annotatePeak(macsPeaks_GR, tssRegion=c(-1000, 1000),
peakAnno TxDb=TxDb.Mmusculus.UCSC.mm10.knownGene,
annoDb="org.Mm.eg.db")
## >> preparing features information... 2021-08-02 03:24:04 PM
## >> identifying nearest features... 2021-08-02 03:24:04 PM
## >> calculating distance from peak to TSS... 2021-08-02 03:24:05 PM
## >> assigning genomic annotation... 2021-08-02 03:24:05 PM
## >> adding gene annotation... 2021-08-02 03:24:08 PM
## >> assigning chromosome lengths 2021-08-02 03:24:08 PM
## >> done... 2021-08-02 03:24:08 PM
This allowed us to produce a GRanges or data.frame of our peaks and their predicted target genes.
<- as.GRanges(peakAnno)
annotatedPeaksGR <- as.data.frame(peakAnno)
annotatedPeaksDF 1:2, ] annotatedPeaksDF[
## seqnames start end width strand abs_summit fold_enrichment annotation
## 1 chr1 4785370 4785641 272 * 4785565 5.23296 Promoter
## 2 chr1 5082993 5083216 224 * 5083122 4.26417 Promoter
## geneChr geneStart geneEnd geneLength geneStrand geneId transcriptId
## 1 1 4783572 4785692 2121 2 27395 ENSMUST00000132625.1
## 2 1 5083080 5162529 79450 1 108664 ENSMUST00000044369.12
## distanceToTSS ENSEMBL SYMBOL
## 1 51 ENSMUSG00000033845 Mrpl15
## 2 0 ENSMUSG00000033793 Atp6v1h
## GENENAME
## 1 mitochondrial ribosomal protein L15
## 2 ATPase, H+ transporting, lysosomal V1 subunit H
Transcription factors or epigenetic marks may act on specific sets of genes grouped by a common biological feature (shared Biological function, common regulation in RNAseq experiment etc).
A frequent step in ChIPseq analysis is to test whether common gene sets are enriched for transcription factor binding or epigenetic marks.
Sources of well curated gene sets include GO consortium (gene’s function, biological process and cellular localisation), REACTOME (Biological Pathways) and MsigDB (Computationally and Experimentally derived).
Gene set enrichment testing may be performed on the sets of genes with peaks associated to them. In this example we will consider genes with peaks within 1000bp of a gene’s TSS.
We will not access these database libraries directly in testing but will use other R/Bioconductor libraries which make extensive use of them.
To perform gene set testing here, we will use the clusterProfiler package.
clusterProfiler provides multiple enrichment functions that allow for comparison of your gene list to known (e.g. GO, KEGG) or custom gene sets. Detailed information about all of the functionality within this package is available here.
In this example we use all TSS sites we found to be overlapped by Myc peaks.
The peaks landing in TSS regions will be marked as “Promoter” in the annotation column of our annotated GRanges object.
1, ] annotatedPeaksGR[
## GRanges object with 1 range and 14 metadata columns:
## seqnames ranges strand | abs_summit fold_enrichment annotation
## <Rle> <IRanges> <Rle> | <integer> <numeric> <character>
## [1] chr1 4785370-4785641 * | 4785565 5.23296 Promoter
## geneChr geneStart geneEnd geneLength geneStrand geneId
## <integer> <integer> <integer> <integer> <integer> <character>
## [1] 1 4783572 4785692 2121 2 27395
## transcriptId distanceToTSS ENSEMBL SYMBOL
## <character> <numeric> <character> <character>
## [1] ENSMUST00000132625.1 51 ENSMUSG00000033845 Mrpl15
## GENENAME
## <character>
## [1] mitochondrial riboso..
## -------
## seqinfo: 21 sequences from mm10 genome
We can extract the unique names of genes with peaks in their TSS by subsetting the annotated GRanges and retrieving gene names from the geneId column.
<- annotatedPeaksGR[annotatedPeaksGR$annotation == "Promoter",
annotatedPeaksGR_TSS
]<- unique(annotatedPeaksGR_TSS$geneId)
genesWithPeakInTSS 1:2] genesWithPeakInTSS[
## [1] "27395" "108664"
Next we can extract all genes which are included in the TxDb object to use as our universe of genes for pathway enrichment.
<- genes(TxDb.Mmusculus.UCSC.mm10.knownGene)
allGeneGR 1:2, ] allGeneGR[
## GRanges object with 2 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 100009600 chr9 21062393-21073096 - | 100009600
## 100009609 chr7 84935565-84964115 - | 100009609
## -------
## seqinfo: 66 sequences (1 circular) from mm10 genome
<- allGeneGR$gene_id allGeneIDs
Once we have our gene list and the universe of genes in the same format, we can use them in the enrichGO function to perform gene ontology analysis
For the ont argument, we can choose between the “BP”, “MF”, and “CC” subontologies, or “ALL” for all three.
library(clusterProfiler)
library(org.Mm.eg.db)
<- enrichGO(gene = genesWithPeakInTSS, universe = allGeneIDs, OrgDb = org.Mm.eg.db,
GO_result ont = "BP")
We now have an enrichResult instance. From this object, we can extract a data frame of the most highly enriched gene ontology categories.
<- data.frame(GO_result)
GO_result_df 1:5, ] GO_result_df[
## ID Description GeneRatio BgRatio
## GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis 288/4898 405/20741
## GO:0034660 GO:0034660 ncRNA metabolic process 293/4898 444/20741
## GO:0034470 GO:0034470 ncRNA processing 248/4898 354/20741
## GO:0042254 GO:0042254 ribosome biogenesis 212/4898 291/20741
## GO:0006364 GO:0006364 rRNA processing 153/4898 206/20741
## pvalue p.adjust qvalue
## GO:0022613 1.050936e-92 6.443286e-89 5.183878e-89
## GO:0034660 7.292474e-82 2.235508e-78 1.798554e-78
## GO:0034470 1.300402e-77 2.657589e-74 2.138135e-74
## GO:0042254 2.150571e-71 3.296288e-68 2.651994e-68
## GO:0006364 1.328419e-53 1.628908e-50 1.310520e-50
## geneID
## GO:0022613 59014/19989/27993/73674/55989/57750/22596/17975/67949/66942/20643/13211/15278/67459/67223/215193/353258/67239/56200/67332/108707/110816/18114/216156/20054/20222/103136/70683/69878/52705/432508/56351/18813/64934/67465/52513/57316/103573/18148/52530/216766/216767/19069/12261/18432/67493/104662/192159/276919/216987/217011/78394/56321/217038/110809/12812/67684/107732/19942/68979/67671/72344/217431/20115/71952/15519/217869/69237/217995/30946/208366/70930/28126/94066/105372/108143/66254/72198/218629/20088/52535/27374/75751/27275/72662/14000/223499/78581/239528/67959/109075/12181/55944/27367/20826/74778/72193/75705/21357/66409/106298/224092/57808/68193/67014/106338/207932/224481/67045/213773/66978/67097/20630/19896/72462/54127/27756/53817/15516/224823/66229/53414/18519/72515/12607/67242/20641/108013/69241/225348/67936/67222/20044/68731/20227/67674/107071/28000/59028/19211/54364/107094/66583/18150/18572/13669/227522/20935/27176/227656/118451/64930/227674/227715/66489/75422/13046/75786/98956/66181/271842/320632/20638/67134/66580/67204/16418/68295/67996/228889/66481/68879/69126/69639/229317/56758/97112/74200/56417/70767/56195/68147/229937/70285/100019/230082/20104/12578/72181/20116/69072/67205/75062/230737/214150/14105/67025/69902/50912/74035/66448/57741/20817/433864/75416/17089/65961/231452/94061/100503670/100608/117109/11837/71990/19988/208144/19384/22384/100609/74097/66596/192196/28035/68011/57905/56505/72171/213895/110109/60321/20901/68988/20103/68077/107686/13871/69731/20085/27998/14113/20055/22121/67973/20068/67148/101867/67920/67276/234344/234374/50783/19656/78651/24015/234549/214627/214572/66164/67619/72544/235036/215194/71989/382137/77591/11920/102462/67941/225215/245000/22258/72831/16785/67115/66446/13205/72584/68995/72554/19655/110954/245474/26905/76130/237107/69499/22184/237221
## GO:0034660 59014/17864/19989/27993/73674/212679/57750/23874/69163/17975/67949/66942/226414/15278/67459/381314/107508/67223/77065/215193/73681/67239/56200/108707/110816/18114/216156/20054/28088/70683/52705/17299/216443/18813/64934/52513/57316/74467/71701/18148/104458/66628/52530/194908/68626/54351/67493/104662/216987/78394/217038/70422/20848/69684/19942/52686/68979/76265/72344/68730/217431/20115/328092/66132/211064/328162/74691/69237/217995/30946/68916/208366/69955/70930/105148/28114/100715/105372/66254/72198/105083/67053/68045/20088/67011/105638/56335/18130/72662/110960/14000/67724/223499/78581/68260/239528/109075/12181/20826/74778/72026/21357/66409/74108/70120/57808/106248/52575/106338/207932/224481/67045/213773/19896/57773/72462/54127/22321/68915/224823/66229/53414/18519/224907/72515/72167/240174/76781/58523/225348/67936/15115/107045/68857/67222/20044/70223/68731/20227/19697/67674/107071/109077/70044/59028/54364/107094/66583/18150/18572/72133/227522/13434/74455/27176/227656/68205/227715/66489/20823/75422/98956/66181/271842/69185/66580/66044/16418/228889/66481/67005/69639/229487/56417/229543/71807/70560/20226/229780/28036/68147/108943/229937/70285/100019/109093/230082/69934/74753/230233/20104/12578/100087/72181/68276/20116/69072/66966/67205/214150/107271/67025/69902/50912/74035/71957/78697/75416/78890/17089/211006/65961/231413/94061/100503670/56361/117109/71990/70650/208144/100929/100609/83701/74097/231803/56463/53379/353172/69786/72171/70047/14911/381802/213895/110109/60321/66078/68077/13871/20085/27998/14113/20055/233189/67973/20068/244141/101861/101867/233802/71941/67920/67276/234344/234374/70359/78651/212528/66590/68544/214627/66369/67619/234734/72544/30947/85305/66965/212728/215194/71989/67049/382137/77591/102462/69882/69606/68291/75686/97541/67789/72341/67115/66446/102436/54632/13205/69942/72554/245474/76130/69499/15108
## GO:0034470 59014/19989/27993/73674/57750/69163/17975/67949/66942/15278/67459/67223/77065/215193/73681/67239/56200/108707/110816/18114/216156/20054/28088/70683/52705/17299/18813/64934/52513/57316/74467/18148/66628/52530/68626/54351/67493/104662/216987/78394/217038/70422/20848/19942/52686/68979/76265/72344/68730/217431/20115/66132/211064/328162/69237/217995/30946/68916/208366/69955/70930/28114/105372/66254/72198/67053/68045/20088/67011/105638/56335/18130/72662/14000/67724/223499/78581/68260/239528/109075/12181/20826/74778/72026/21357/66409/74108/57808/106248/52575/106338/207932/224481/67045/213773/19896/57773/72462/54127/224823/66229/53414/18519/224907/72515/72167/240174/58523/225348/67936/68857/67222/20044/68731/20227/67674/107071/109077/70044/59028/54364/107094/66583/18150/18572/72133/227522/13434/74455/27176/227656/68205/227715/66489/20823/75422/98956/66181/271842/69185/66580/16418/228889/66481/67005/69639/56417/229543/229780/28036/68147/108943/229937/70285/100019/230082/69934/74753/230233/20104/12578/100087/72181/68276/20116/69072/66966/67205/214150/67025/69902/50912/74035/71957/78697/75416/78890/17089/65961/231413/94061/100503670/56361/117109/71990/70650/208144/100929/100609/83701/74097/53379/69786/72171/70047/14911/381802/213895/110109/60321/66078/68077/13871/20085/27998/14113/20055/233189/67973/20068/101861/101867/233802/67920/67276/234344/234374/70359/78651/212528/214627/66369/67619/234734/72544/30947/66965/212728/215194/71989/67049/382137/77591/102462/69882/69606/68291/72341/67115/66446/54632/13205/69942/72554/245474/76130/69499/15108
## GO:0042254 59014/19989/27993/73674/55989/57750/22596/17975/67949/66942/15278/67459/67223/215193/353258/67239/56200/108707/110816/18114/216156/20054/103136/70683/52705/18813/64934/52513/57316/103573/18148/52530/216767/19069/12261/18432/67493/104662/216987/217011/78394/56321/217038/107732/19942/68979/67671/72344/217431/20115/71952/69237/217995/30946/208366/70930/28126/94066/105372/66254/72198/218629/20088/52535/75751/72662/14000/223499/78581/109075/12181/27367/20826/74778/66409/106298/224092/57808/68193/67014/106338/207932/224481/67045/213773/67097/19896/72462/54127/224823/66229/53414/18519/72515/12607/225348/67936/67222/20044/68731/20227/67674/107071/59028/19211/54364/107094/66583/18150/18572/227522/20935/27176/227656/118451/64930/227674/227715/66489/75422/98956/66181/271842/67134/66580/16418/228889/66481/69126/69639/229317/97112/68147/229937/70285/100019/230082/20104/12578/72181/20116/69072/67205/230737/67025/69902/50912/74035/66448/57741/433864/75416/17089/65961/231452/94061/100503670/100608/117109/11837/71990/19988/208144/19384/100609/74097/66596/72171/213895/110109/60321/20103/68077/13871/20085/27998/14113/20055/67973/20068/101867/67920/67276/234344/234374/78651/24015/234549/214627/66164/67619/72544/235036/215194/71989/382137/77591/102462/67941/225215/72831/16785/67115/66446/13205/72584/72554/110954/245474/76130/237107/69499
## GO:0006364 59014/19989/27993/73674/57750/17975/67949/66942/15278/67459/67223/215193/67239/56200/108707/110816/18114/216156/20054/70683/52705/18813/64934/52513/57316/18148/52530/67493/104662/216987/78394/217038/19942/68979/72344/217431/20115/69237/217995/30946/208366/70930/105372/66254/72198/20088/72662/223499/78581/109075/12181/20826/74778/66409/57808/106338/207932/224481/67045/213773/19896/72462/54127/224823/66229/53414/18519/72515/225348/67936/67222/20044/68731/20227/67674/107071/59028/54364/107094/66583/18150/18572/227522/27176/227656/227715/66489/75422/98956/66181/271842/66580/16418/228889/66481/69639/68147/229937/70285/100019/230082/20104/12578/72181/20116/69072/67205/67025/69902/50912/74035/75416/17089/65961/94061/100503670/117109/71990/208144/100609/74097/72171/213895/110109/60321/68077/13871/20085/27998/14113/20055/67973/20068/101867/67920/67276/234344/234374/78651/214627/67619/72544/215194/71989/382137/77591/102462/67115/66446/72554/245474/76130/69499
## Count
## GO:0022613 288
## GO:0034660 293
## GO:0034470 248
## GO:0042254 212
## GO:0006364 153
Network plots can be generated from any enrichResult object using the enrichplot package.
We measure similarities between the various significant gene sets and group them accordingly. The showCategory argument specifies how many top gene ontology hits to show.
library(enrichplot)
<- pairwise_termsim(GO_result)
GO_result_plot emapplot(GO_result_plot, showCategory = 20)
In addition to gene ontology, we can test our gene list against custom gene sets that we import as gmt files using the clusterProfiler enricher function. Similar to the enrichGO function, this will produce an enrichResult object that can be used for visualization.
Here we will get gene sets from MSigDB using the msigdbr package. We can use this package to pull out specific gene sets, which can be seen at this link. We can also run the msigdbr_collections function to see the categories and subcategory codes that will be used for accessing the gene sets.
library(msigdbr)
msigdbr_collections()
## # A tibble: 23 × 3
## gs_cat gs_subcat num_genesets
## <chr> <chr> <int>
## 1 C1 "" 278
## 2 C2 "CGP" 3368
## 3 C2 "CP" 29
## 4 C2 "CP:BIOCARTA" 292
## 5 C2 "CP:KEGG" 186
## 6 C2 "CP:PID" 196
## 7 C2 "CP:REACTOME" 1604
## 8 C2 "CP:WIKIPATHWAYS" 615
## 9 C3 "MIR:MIR_Legacy" 221
## 10 C3 "MIR:MIRDB" 2377
## # … with 13 more rows
From the data frame on the previous slide we can identify the category/subcategory we want, and use those in the msigdbr function. Here we will use “H” to access the Hallmark gene sets, and in the end we need to get a data frame where the first column contains the name of the gene sets and the second column contains the gene ID.
library(msigdbr)
<- msigdbr(species = "Mus musculus", category = "H", subcategory = NULL)
msig_t2g <- msig_t2g[, colnames(msig_t2g) %in% c("gs_name", "entrez_gene")]
msig_t2g 1:3, ] msig_t2g[
## # A tibble: 3 × 2
## gs_name entrez_gene
## <chr> <int>
## 1 HALLMARK_ADIPOGENESIS 11303
## 2 HALLMARK_ADIPOGENESIS 74610
## 3 HALLMARK_ADIPOGENESIS 52538
We then run the gene set enrichment, using the term to gene mapping we created as the TERM2GENE argument in the enricher function.
<- enricher(gene = genesWithPeakInTSS, universe = allGeneIDs, TERM2GENE = msig_t2g)
hallmark <- data.frame(hallmark)
hallmark_df 1:3, ] hallmark_df[
## ID Description
## HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1 HALLMARK_MYC_TARGETS_V1
## HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS HALLMARK_E2F_TARGETS
## HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2 HALLMARK_MYC_TARGETS_V2
## GeneRatio BgRatio pvalue p.adjust
## HALLMARK_MYC_TARGETS_V1 154/1425 200/4396 4.710378e-40 2.355189e-38
## HALLMARK_E2F_TARGETS 126/1425 200/4396 1.049344e-19 2.623360e-18
## HALLMARK_MYC_TARGETS_V2 50/1425 58/4396 1.878537e-17 3.130896e-16
## qvalue
## HALLMARK_MYC_TARGETS_V1 1.784985e-38
## HALLMARK_E2F_TARGETS 1.988231e-18
## HALLMARK_MYC_TARGETS_V2 2.372889e-16
## geneID
## HALLMARK_MYC_TARGETS_V1 15510/18393/66942/17219/51810/107508/15182/56200/67332/66094/103136/18674/53605/12461/73192/12567/56351/18813/67465/18972/12464/103573/18148/14694/12330/52530/22333/27041/13681/12261/22627/110809/18102/18673/16211/26446/22123/16647/20382/18263/22630/26443/13665/13877/218236/105148/28126/30877/11792/15381/12465/18458/55944/14375/15382/66409/14852/22195/19385/20462/28185/21454/19826/67097/20383/109801/108121/27756/66973/15516/225027/20641/225363/27366/70769/11757/56086/20823/229279/100042807/110074/99138/67134/67204/26444/12428/15191/12462/68436/51788/13684/19359/20104/230484/107995/51797/230721/26445/74326/20810/230908/19934/69719/14208/13204/11991/11837/19988/19384/22384/17220/27979/381760/53379/56150/66870/12468/68011/23983/12785/12034/68988/20103/21849/107686/53607/14113/57296/50995/19899/16828/68981/27050/20133/13690/233870/19935/21781/17218/24015/14719/17463/101943/85305/26441/12858/56403/23918/67115/66446/18655/108062/53381/66235
## HALLMARK_E2F_TARGETS 18393/17219/67196/98386/66403/12534/21917/53605/103468/12567/18813/18972/21379/15574/103573/15366/19360/20425/20877/78304/22059/114714/110809/18102/67684/21973/103733/16647/15374/20382/21877/11799/67241/18949/13665/218210/66197/69716/19355/70472/14375/26934/105988/12419/14852/19385/20462/22042/106618/15361/22154/18969/70769/20833/12236/110074/20638/67134/12531/69270/17865/110750/69639/70099/140917/51788/27354/12578/230484/50927/73804/107995/51797/11637/19891/100336/16765/21335/17089/19687/11991/50883/22256/12704/19384/19718/74097/17220/18861/12190/100710/71777/70699/14056/56150/66979/17768/68298/19362/77891/16881/12447/18971/108961/13626/233726/22390/18817/66422/109145/17218/68278/69724/17535/13433/16201/67629/12649/66131/56403/20842/12530/17350/19139/24061/53381
## HALLMARK_MYC_TARGETS_V2 27993/15510/66942/15278/338359/20509/70683/12567/18813/64934/21379/18148/18432/69071/18673/28126/30877/75751/53414/72515/21453/13340/107071/59028/107094/226169/70769/20322/13537/67134/68493/53608/69902/20810/100608/56361/22256/15277/110109/27998/101612/71974/67973/18817/66422/17218/66590/66164/235036/76130
## Count
## HALLMARK_MYC_TARGETS_V1 154
## HALLMARK_E2F_TARGETS 126
## HALLMARK_MYC_TARGETS_V2 50
We learned about the goseq package during the RNAseq course, which is another functional annotation package similar to clusterProfiler,
Here we perform the same enrichment test for the MSigDB Hallmark gene sets.
For goseq, we need a named vector of all genes (the universe) with 1s or 0s representing whether a gene had peak in TSS or not. We can turn a logical vector into 1 for TRUE and 0 for FALSE simply using the as.integer function.
<- as.integer(allGeneIDs %in% genesWithPeakInTSS)
allGenesForGOseq names(allGenesForGOseq) <- allGeneIDs
1:3] allGenesForGOseq[
## 100009600 100009609 100009614
## 1 0 0
First we must construct a nullp data.frame for use within goseq using the nullp function and supplying our named vector, genome to be used and gene identifier used.
The nullp function attempts to correct for gene length biases we may see in gene set testing. i.e. a longer gene may have more chance to have a peak within it.
library(goseq)
= nullp(allGenesForGOseq, "mm10", "knownGene", plot.fit = FALSE) pwf
## Can't find mm10/knownGene length data in genLenDataBase...
## Found the annotation package, TxDb.Mmusculus.UCSC.mm10.knownGene
## Trying to get the gene lengths from it.
We can use the same term to gene mapping we used for clusterProfiler (though it must be converted from a tibble to data frame for goseq) to run the gene set enrichment test.
<- goseq(pwf, "mm10", "knownGene", gene2cat = data.frame(msig_t2g))
Myc_hallMarks
1:3, ] Myc_hallMarks[
## category over_represented_pvalue under_represented_pvalue
## 32 HALLMARK_MYC_TARGETS_V1 6.653087e-41 1
## 13 HALLMARK_E2F_TARGETS 4.136764e-20 1
## 33 HALLMARK_MYC_TARGETS_V2 1.166240e-17 1
## numDEInCat numInCat
## 32 154 200
## 13 126 200
## 33 50 58
In addition to a standard enrichment tests, methods have been implemented specifically for ChIPseq. Many of these tools aim to incorporate peaks distal to genes into their enrichment testing such as the popular GREAT toolset.
Incorporating distal peaks by rules such as nearest gene results in some genes having a higher chance of being selected and hence some gene sets as a whole having a higher chance of having its members selected.
GREAT defines regulatory regions for each individual gene and compares the proportion of peaks mapping to a gene set’s regulatory regions to the proportion of the genome occupied by gene set’s regulatory regions.
i.e. If a gene set’s regulatory regions account for 1% of the genome then one might expect 1% of peaks to overlap these regions by chance.
We can use the GREAT Bioconductor interface available in the rGREAT package.
library(rGREAT)
To submit jobs we can use our GRanges of Myc peaks and specify a genome with the submitGreatJob function.
This function returns a GreatJob object containing a reference to our results on the GREAT server. To review the categories of results available we can use the availableCategories function on our GreatJob object.
<- submitGreatJob(macsPeaks_GR, species = "mm10", version = "3.0.0", request_interval = 1)
great_Job availableCategories(great_Job)
## [1] "GO" "Phenotype Data and Human Disease"
## [3] "Pathway Data" "Gene Expression"
## [5] "Regulatory Motifs" "Gene Families"
The results table can be retrieved using the getEnrichmentTables function and specifying which tables we wish to review.
Here we retrieve the results tables for the “Regulatory Motifs” gene sets which contains 2 separate database results.
= getEnrichmentTables(great_Job, category = "Regulatory Motifs")
great_ResultTable names(great_ResultTable)
## [1] "MSigDB Predicted Promoter Motifs" "MSigDB miRNA Motifs"
Now we can review the enrichment of our genes with Myc peaks in their TSS for the “MSigDB Predicted Promoter Motifs” gene sets.
<- great_ResultTable[["MSigDB Predicted Promoter Motifs"]]
msigProMotifs 1:4, ] msigProMotifs[
## ID
## 1 SCGGAAGY_V$ELK1_02
## 2 GGGCGGR_V$SP1_Q6
## 3 CACGTG_V$MYC_Q2
## 4 RCGCANGCGY_V$NRF1_Q6
## name
## 1 Motif SCGGAAGY matches ELK1: ELK1, member of ETS oncogene family
## 2 Motif GGGCGGR matches SP1: Sp1 transcription factor
## 3 Motif CACGTG matches MYC: v-myc myelocytomatosis viral oncogene homolog (avian)
## 4 Motif RCGCANGCGY matches NRF1: nuclear respiratory factor 1
## Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
## 1 0.05912876 814.6169 1586
## 2 0.20132920 2773.7130 3782
## 3 0.07555623 1040.9380 1638
## 4 0.05163166 711.3293 1195
## Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1 1.946928 0.11511940 1.738355e-136
## 2 1.363515 0.27451550 1.823467e-94
## 3 1.573580 0.11889380 1.110696e-71
## 4 1.679953 0.08673877 2.182812e-65
## Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
## 1 1.069088e-133 1069 502.9698 806
## 2 5.607161e-92 2689 1265.1880 1812
## 3 2.276927e-69 941 442.7452 695
## 4 3.356073e-63 822 386.7551 606
## Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1 1.602482 0.07821446 0.7539757
## 2 1.432198 0.17583700 0.6738565
## 3 1.569752 0.06744299 0.7385760
## 4 1.566883 0.05880640 0.7372263
## Hyper_Raw_PValue Hyper_Adjp_BH
## 1 1.739538e-83 5.349079e-81
## 2 5.523247e-114 3.396797e-111
## 3 1.936613e-65 3.970057e-63
## 4 1.614352e-56 2.482066e-54
A common practice in transcription factor ChIPseq is to investigate the motifs enriched under peaks.
De novo motif enrichment can be performed in R/Bioconductor but this can be very time consuming. Here we will use the MEME-ChIP suite available online to identify de novo motifs.
MEME-ChIP requires a FASTA file of sequences under peaks as input so we extract this using the BSgenome package.
First we need to load the BSgenome object for the genome we are working on, UCSC’s mm10 build for the mouse genome, BSgenome.Mmusculus.UCSC.mm10.
library(BSgenome)
library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
The motif for the ChIP-ed transcription factor should near the summit of a peak. MEME-ChIP will trim our peaks to a common length internally if sequences are of different length.
It is best therefore to provide a peak set resized to a common length.
<- GRanges(seqnames(macsPeaks_GR), IRanges(macsPeaks_GR$abs_summit,
macsSummits_GR $abs_summit), score = macsPeaks_GR$fold_enrichment)
macsPeaks_GR<- resize(macsSummits_GR, 100, fix = "center") macsSummits_GR
We now have a GRanges, centred on the summit, highest point of signal for every peak.
macsSummits_GR
## GRanges object with 13777 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 4785515-4785614 * | 5.23296
## [2] chr1 5083072-5083171 * | 4.26417
## [3] chr1 7397785-7397884 * | 9.84580
## [4] chr1 9545350-9545449 * | 11.85079
## [5] chr1 9700928-9701027 * | 5.14291
## ... ... ... ... . ...
## [13773] chrX 168674050-168674149 * | 5.21700
## [13774] chrX 169047705-169047804 * | 5.33257
## [13775] chrX 169320320-169320419 * | 4.67030
## [13776] chrY 1245745-1245844 * | 4.04550
## [13777] chrY 1286577-1286676 * | 5.00143
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
Once we have recentered our peaks we can use the getSeq function with our GRanges of resized common peaks and the BSgenome object for mm10.
The getSeq function returns a DNAStringSet object containing sequences under peaks.
<- getSeq(BSgenome.Mmusculus.UCSC.mm10, macsSummits_GR)
peaksSequences names(peaksSequences) <- paste0(seqnames(macsSummits_GR), ":", start(macsSummits_GR),
"-", end(macsSummits_GR))
1:2, ] peaksSequences[
## DNAStringSet object of length 2:
## width seq names
## [1] 100 CGTCCATCGCCAGAGTGACGCGG...CTGGGCTTCAGGTTGGCCAGGCT chr1:4785515-4785614
## [2] 100 CCGCCCTCAGCTGCGGTCACGTG...TGAGTGAGGCTCGCAACGTCTCC chr1:5083072-5083171
The writeXStringSet function allows the user to write DNA/RNA/AA(aminoacid)StringSet objects out to file.
By default the writeXStringSet function writes the sequence information in FASTA format (as required for MEME-ChIP).
writeXStringSet(peaksSequences, file = "mycMel_rep1.fa")
Now the file “mycMel_rep1.fa” contains sequences around the geometric center of peaks suitable for Motif analysis in MEME-ChIP.
In your own work you will typically run this from your own laptop with MEME installed locally but today we will upload our generated FASTA file to their web portal.
Follow instructions here to install MEME localy.
Results files from MEME-ChIP can be found here
We can retrieve the locations of Myc motifs identified in MEME-ChIP from the FIMO output.
FIMO reports Myc motif locations as a GFF3 file which we should be able to vizualise in IGV. Sadly, this GFF file’s naming conventions cause only a fraction of motifs to be reported.
Fortunately we can parse our motif’s GFF file into R and address this using the import function in the rtracklayer package.
library(rtracklayer)
<- import("~/Downloads/fimo.gff") motifGFF
We can give the sequences some more sensible names and export the GFF to file to visualise in IGV.
$Name <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
motifGFF$ID <- paste0(seqnames(motifGFF), ":", start(motifGFF), "-", end(motifGFF))
motifGFFexport.gff3(motifGFF, con = "~/Downloads/fimoUpdated.gff")
We saw previously we can scan sequences using some of the Biostrings functionality matchPattern.
Often with ChIPseq we may know the motif we are looking for or we can use a set of known motifs from a database such as a JASPAR.
We can access to JASPAR using the JASPAR2020 bioconductor library.
library(JASPAR2020)
JASPAR2020
## An object of class "JASPAR2020"
## Slot "db":
## [1] "/usr/local/lib/R/host-site-library/JASPAR2020/extdata/JASPAR2020.sqlite"
We can access the model for the our motif of interest using the TFBSTools package and its getMatrixByName function.
library(TFBSTools)
<- getMatrixByName(JASPAR2020, name = "MYC")
pfm pfm
## An object of class PFMatrix
## ID: MA0147.3
## Name: MYC
## Matrix Class: Basic helix-loop-helix factors (bHLH)
## strand: +
## Tags:
## $centrality_logp
## [1] "-296.164"
##
## $family
## [1] "bHLH-ZIP factors"
##
## $medline
## [1] "18555785"
##
## $remap_tf_name
## [1] "MYC"
##
## $source
## [1] "29126285"
##
## $tax_group
## [1] "vertebrates"
##
## $tfbs_shape_id
## [1] "140"
##
## $type
## [1] "ChIP-seq"
##
## $unibind
## [1] "1"
##
## $collection
## [1] "CORE"
##
## $species
## 9606
## "Homo sapiens"
##
## $acc
## [1] "P01106"
##
## Background:
## A C G T
## 0.25 0.25 0.25 0.25
## Matrix:
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
## A 1038 1019 438 44 4600 54 169 59 23 419 961 755
## C 1345 1287 3746 4840 59 4525 157 162 56 2801 1524 1785
## G 1649 1757 472 31 170 70 4578 63 4817 1198 1036 1348
## T 910 879 286 27 113 293 38 4658 46 524 1421 1054
With this PWM we can use the motifmatchr package to scan our summits for the Myc motif and return the positions of the motifs.
We will need to provide our PWM, GRanges to scan within and BSGenome object to extract sequence from.
We also set the out paramter to positions for this instance.
library(motifmatchr)
<- matchMotifs(pfm, macsSummits_GR, BSgenome.Mmusculus.UCSC.mm10, out = "positions")
MycMotifs MycMotifs
## GRangesList object of length 1:
## [[1]]
## GRanges object with 1600 ranges and 1 metadata column:
## seqnames ranges strand | score
## <Rle> <IRanges> <Rle> | <numeric>
## [1] chr1 9545433-9545444 - | 14.0312
## [2] chr1 13151280-13151291 + | 15.7259
## [3] chr1 13293444-13293455 + | 14.5201
## [4] chr1 16665901-16665912 - | 14.6571
## [5] chr1 43561269-43561280 + | 14.2468
## ... ... ... ... . ...
## [1596] chrX 155338504-155338515 + | 14.8300
## [1597] chrX 155338504-155338515 - | 14.8266
## [1598] chrX 160610006-160610017 + | 12.6818
## [1599] chrX 166454533-166454544 + | 13.6368
## [1600] chrX 166454533-166454544 - | 13.6368
## -------
## seqinfo: 21 sequences from an unspecified genome; no seqlengths
We can export the Myc motif positions within peaks for use later in IGV or for metaplot vizualisation.
export.bed(MycMotifs[[1]], con = "MycMotifs.bed")
Exercise on ChIPseq data can be found here
Answers can be found here