Session 1: Alignment and counting
Session 2: Differential gene expression analysis
Session 3: Visualizing results through PCA and clustering
Session 4: Gene Set Analysis
Session 5: Differential Transcript Utilization analysis
Over these sessions we will review data from Christina Leslie’s lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data here and activated T-Reg data here.
Several intermediate files have already been made for you to use. You can download the course content from GitHub here. Once downloaded you will need to unzip the folder and then navigate into the r_course directory of the download. This will mean the paths in the code are correct.
I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps() and then analyzed this data for expression changes using DESeq2. You can find an excel file of differential expression analysis for this comparison in the data directory.
In our previous sessions we looked at how we can identify experimentally interesting genes. Either through finding genes with significant differential expression or performing clustering.
In this session we will explore a few ways we can evaluate any enrichment for functionally related genes. This helps us address common questions like:
Do “cell cycle” genes change more between conditions then other genes?
Are genes related to “immune response” enriched in any clusters?
There are many options available to us for both functions/pathways/categories (Gene set) we will test and the methods we can use to test these gene sets for enrichment in our analysis (Gene set Enrichment Analysis).
Sources of well curated gene sets include GO consortium (gene’s function, biological process and cellular localization), REACTOME (Biological Pathways) and MsigDB (Computationally and Experimentally derived).
You can also use a user-defined gene set i.e. genes of interest from a specific paper.
The Gene Ontology consortium aims to provide a comprehensive resource of all the currently available knowledge regarding the functions of genes and gene products.
Functional categories of genes are broadly split into three main groups:
The three sub categories of gene ontology are arranged in nested, structured graph with gene sets at the top of graph representing more general terms and those at the bottom more specific terms.
The Reactome and KEGG (Kyoto Encyclopedia of genes and genomes) contains information on genes’ membership and roles in molecular pathways.
These databases focus largely on metabolic and disease pathways, and allow us to investigate our genes in the context of not only functional roles but relative positions within pathways.
The molecular signatures database (MsigDB) is available from the Broad institute and provides a set of curated gene sets derived from sources such as GO, pathway databases, motif scans and even other experimental sets.
MsigDB databases are widely used in Gene Set Enrichment Analysis and are available as plain text, in formats used with the popular Java based gene set enrichment software GSEA.
In R we can access information on these gene sets through database libraries (such as the Org.db.eg we have reviewed) such as GO.db, reactome.db or by making use of libraries which allows us to our gene sets from parse plain text formats, GSEABase.
In our Bioconductor sessions we reviewed how we to get information on genes using the Org.db packages. We can access these dbs in similar ways.
The GSEA software and MSigDB gene set collections can be found at the Broad’s site.
We have access to Human gene sets containing information on:
The MSigDB collection has been wrapped up in the msigdbr CRAN package. This package also contains computationally predicted homologs for a number of common species, so you can looks at these MSigDb groups in other organisms.
The msigdbr() function is used to specify which organism and which categories/collection you want. In return you get a type of data frame called a tibble. There is a lot of information in this data frame.
## # A tibble: 6 x 25
## gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
## <chr> <int> <chr> <chr> <chr> <chr>
## 1 Abca1 11303 ENSMUSG0000~ ABCA1 19 ENSG00000165029
## 2 Abcb8 74610 ENSMUSG0000~ ABCB8 11194 ENSG00000197150
## 3 Acaa2 52538 ENSMUSG0000~ ACAA2 10449 ENSG00000167315
## 4 Acadl 11363 ENSMUSG0000~ ACADL 33 ENSG00000115361
## 5 Acadm 11364 ENSMUSG0000~ ACADM 34 ENSG00000117054
## 6 Acads 11409 ENSMUSG0000~ ACADS 35 ENSG00000122971
## # i 19 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## # gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## # gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## # gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
## # ortholog_taxon_id <int>, ortholog_sources <chr>,
## # num_ortholog_sources <dbl>, entrez_gene <int>, gs_cat <chr>,
## # gs_subcat <chr>
If you are interested in GO terms we can grab them using the same approach. Here we get the GO terms but for Pig.
## # A tibble: 6 x 25
## gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
## <chr> <int> <chr> <chr> <chr> <chr>
## 1 ASS1 414411 ENSSSCG0000~ ASS1 445 ENSG00000130707
## 2 CPS1 100157716 ENSSSCG0000~ CPS1 1373 ENSG00000021826
## 3 GRIA1 100623647 ENSSSCG0000~ GRIA1 2890 ENSG00000155511
## 4 KCNC2 100153996 ENSSSCG0000~ KCNC2 3747 ENSG00000166006
## 5 SLC1A1 397003 ENSSSCG0000~ SLC1A1 6505 ENSG00000106688
## 6 APLF 100516073 ENSSSCG0000~ APLF 200558 ENSG00000169621
## # i 19 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## # gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## # gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## # gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>,
## # ortholog_taxon_id <int>, ortholog_sources <chr>,
## # num_ortholog_sources <dbl>, entrez_gene <int>, gs_cat <chr>,
## # gs_subcat <chr>
We will review two different methods to identify functional groups associated with our condition of interest.
First lets read in our differential expression results from DESeq2 analysis of activated vs resting T-cells.
Activated_minus_Resting <- read.delim(file = "data/Group_Activated_minus_Resting.csv",
sep = ",")
Activated_minus_Resting[1:3, ]
## ENTREZID SYMBOL baseMean log2FoldChange lfcSE stat pvalue padj
## 1 100009600 Zglp1 3.685411 0.2755725 1.855514 0.1485154 0.881936 NA
## 2 100009609 Vmn2r65 0.000000 NA NA NA NA NA
## 3 100009614 Gm10024 0.000000 NA NA NA NA NA
## newPadj
## 1 1
## 2 NA
## 3 NA
All GSA methods will require us to filter to genes to those tested for differential expression.
We will therefore filter to all genes which pass our independent filtering step from DESeq2. These will be our genes with no NA in the DESeq2 padj column.
Activated_minus_Resting <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj),
]
Activated_minus_Resting[1:3, ]
## ENTREZID SYMBOL baseMean log2FoldChange lfcSE stat pvalue
## 6 100017 Ldlrap1 2752.367 -0.04815446 0.08880811 -0.5422305 5.876598e-01
## 7 100019 Mdn1 2281.866 -1.09612289 0.09214164 -11.8960650 1.240603e-32
## 8 100033459 Ifi208 3282.679 -0.02774770 0.09029354 -0.3073055 7.586109e-01
## padj newPadj
## 6 7.416249e-01 1.000000e+00
## 7 7.038705e-31 2.424386e-28
## 8 8.601547e-01 1.000000e+00
When we run a functional enrichment analysis (a.ka. over representation analysis) we are testing if we observe a greater proportion of genes of interest in a gene set versus the proportion in the background genes. We do this with a Fisher exact test.
Example:
We have 100 genes of interest. 1/4 are in a gene set we care about. Is it enriched?
Genes.of.Interest | Gene.Universe | |
---|---|---|
In Gene Set | 25 | 50 |
Out of Gene Set | 75 | 1950 |
The clusterProfiler package provides multiple enrichment functions that work with curated gene sets (e.g. GO, KEGG) or custom gene sets. The most common way to use clusterProfiler is to run our over enrichment test using the one sided Fisher exact test.
clusterProfiler has both GSA and GSEA approaches, so is a nice one stop shop for everything. Plus it has some really nice and easy visualization options. Detailed information about all of the functionality within this package is available here.
First we need our gene information. For a simple functional enrichment of GO terms there is the enrichGO() function. You just provide this a vector of gene IDs you want to check, and the Org.db of the relevant organism. There are other specialized gene set functions i.e. enrichKEGG() for KEGG gene sets, or a generalized function enrichr() for any user provided gene sets.
## [1] 100019 100036521 100038516 100038882 100039796 100040736
The result is special object which contains the outcome of the functional enrichment. When you look at what is inside you get a preview of what test was done and what the results look like.
## #
## # over-representation test
## #
## #...@organism Mus musculus
## #...@ontology MF
## #...@keytype ENTREZID
## #...@gene chr [1:4435] "100019" "100036521" "100038516" "100038882" "100039796" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05
## #...499 enriched terms found
## 'data.frame': 499 obs. of 12 variables:
## $ ID : chr "GO:0003779" "GO:0016887" "GO:0019207" "GO:0030695" ...
## $ Description : chr "actin binding" "ATP hydrolysis activity" "kinase regulator activity" "GTPase regulator activity" ...
## $ GeneRatio : chr "152/4247" "118/4247" "107/4247" "143/4247" ...
## $ BgRatio : chr "448/28396" "314/28396" "277/28396" "430/28396" ...
## $ RichFactor : num 0.339 0.376 0.386 0.333 0.333 ...
## $ FoldEnrichment: num 2.27 2.51 2.58 2.22 2.22 ...
## $ zScore : num 11.3 11.3 11.1 10.7 10.7 ...
## $ pvalue : num 5.18e-24 4.16e-23 3.66e-22 9.82e-22 9.82e-22 ...
## $ p.adjust : num 6.48e-21 2.60e-20 1.52e-19 2.45e-19 2.45e-19 ...
## $ qvalue : num 3.41e-21 1.37e-20 8.02e-20 1.29e-19 1.29e-19 ...
## $ geneID : chr "104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/"| __truncated__ "107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/1198"| __truncated__ "108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12"| __truncated__ "100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/136"| __truncated__ ...
## $ Count : int 152 118 107 143 143 96 125 120 143 111 ...
## #...Citation
## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
## The Innovation. 2021, 2(3):100141
You can also convert this to a dataframe and export.
## ID Description GeneRatio
## GO:0003779 GO:0003779 actin binding 152/4247
## GO:0016887 GO:0016887 ATP hydrolysis activity 118/4247
## GO:0019207 GO:0019207 kinase regulator activity 107/4247
## GO:0030695 GO:0030695 GTPase regulator activity 143/4247
## GO:0060589 GO:0060589 nucleoside-triphosphatase regulator activity 143/4247
## GO:0140097 GO:0140097 catalytic activity, acting on DNA 96/4247
## BgRatio RichFactor FoldEnrichment zScore pvalue
## GO:0003779 448/28396 0.3392857 2.268509 11.34931 5.184531e-24
## GO:0016887 314/28396 0.3757962 2.512623 11.30300 4.160383e-23
## GO:0019207 277/28396 0.3862816 2.582729 11.10092 3.658863e-22
## GO:0030695 430/28396 0.3325581 2.223527 10.72125 9.819222e-22
## GO:0060589 430/28396 0.3325581 2.223527 10.72125 9.819222e-22
## GO:0140097 238/28396 0.4033613 2.696927 11.02461 1.331557e-21
## p.adjust qvalue
## GO:0003779 6.475479e-21 3.410876e-21
## GO:0016887 2.598159e-20 1.368547e-20
## GO:0019207 1.523307e-19 8.023823e-20
## GO:0030695 2.452842e-19 1.292003e-19
## GO:0060589 2.452842e-19 1.292003e-19
## GO:0140097 2.771859e-19 1.460041e-19
## geneID
## GO:0003779 104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/12721/12774/12798/13138/13169/13367/13419/13726/13821/13822/14026/14083/14248/14457/14468/14469/15163/16332/16412/16796/16985/17118/171580/17357/17698/17886/17896/17909/17913/17916/17918/17920/17996/18301/18643/18754/18810/18826/19200/19201/192176/19241/194401/20198/20259/20430/20649/20650/20739/20740/211401/212073/213019/21346/215114/215280/217692/21894/21961/22138/22288/22323/22351/22376/22379/22388/22629/227753/230917/231003/232533/235431/237436/237860/23789/23790/242687/242894/246177/263406/26936/270058/270163/29816/29856/29875/319565/319876/320878/326618/338367/404710/50875/52690/54132/544963/55991/56226/56376/56378/56419/56431/57748/57778/58800/63873/63986/66713/67399/67425/67771/67803/68089/68142/68743/70292/70394/71602/71653/71994/72042/74117/74192/75646/76448/76467/76580/76709/77767/77963/78885/78926/80743/80889
## GO:0016887 107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/11982/12461/12462/12464/12495/12648/13205/13404/13424/13682/14211/14828/15481/15511/15526/15925/16554/16558/16561/16573/16574/16581/170759/17215/17250/17350/17685/18671/19179/19182/19299/19348/19360/19361/193740/19718/20174/20480/20586/209737/21354/21355/216848/217716/219189/22027/22427/224824/226026/226153/227099/227612/229003/230073/231912/232975/237877/237886/23834/23924/239273/24015/240641/244059/26357/269523/27406/27407/27416/276950/30935/320790/338467/381290/50770/53313/54667/56495/56505/60530/66043/67126/67241/67398/68142/69716/70099/70218/70472/71389/71586/71602/71819/72151/73804/74142/74355/76295/76408/80861/98660
## GO:0019207 108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12575/12578/12579/12580/12581/12700/12704/13001/13205/13867/14168/14727/14728/14870/15944/16002/16145/16195/17005/17096/17420/17751/17973/18148/18439/18519/18708/18709/18710/18712/19084/19087/192652/19354/19744/20304/208228/208659/20869/211770/214944/216233/216505/216965/217410/217588/21802/21803/21813/219181/22375/231659/232157/232341/23991/23994/240354/24056/240665/26396/26457/268697/27214/29875/319955/320207/380694/384783/52552/54124/54396/54401/54673/56036/56398/56706/66214/66787/67287/67605/70769/72949/74018/74370/74734/77799/78757/97541/97998
## GO:0030695 100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910
## GO:0060589 100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910
## GO:0140097 104806/107182/109151/114714/11792/12648/13205/13404/13419/13421/13433/13435/14156/15201/16881/16952/17215/17216/17217/17218/17219/17350/17685/18538/18807/18968/18971/18973/18975/19205/192119/19360/19361/19366/19646/19714/19718/20174/20586/208084/214627/216848/21969/21973/21974/21975/21976/22427/225182/22594/226153/234258/234396/237877/237911/244059/245474/26383/26447/268281/26909/269400/27015/27041/319955/320209/320790/327762/330554/338359/50505/50776/56196/56210/56505/57444/623474/66408/67155/67967/68048/68142/70603/71175/71389/72103/72107/72151/72831/72960/74549/75560/76251/78286/93696/93760
## Count
## GO:0003779 152
## GO:0016887 118
## GO:0019207 107
## GO:0030695 143
## GO:0060589 143
## GO:0140097 96
There are a couple of easy functions to visualize the top hits in your result. For example this dotplot which shows amount of overlap with geneset. These are also all built in ggplot2, so it is easy to modify parameters. For more info on ggplot2 check our course here.
library(ggplot2)
dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7))
Other useful visualizations include enrichment maps. These network plots show how the significant groups in the gene sets relate to each other.
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 15, cex_label_category = 0.6) + theme(text = element_text(size = 7))
There are several in-built functions that allow you to test against curated gene sets like GO terms (enrichGO) or KEGG (enrichKEGG).
If we want to use our own gene sets or MSigDB we can provide them to the generic function enricher().
First lets grab the curated gene sets (C2) from MSigDB using msigdbr. We only need the Gene Set Name and the Gene ID.
## # A tibble: 6 x 2
## gs_name entrez_gene
## <chr> <int>
## 1 HALLMARK_ADIPOGENESIS 11303
## 2 HALLMARK_ADIPOGENESIS 74610
## 3 HALLMARK_ADIPOGENESIS 52538
## 4 HALLMARK_ADIPOGENESIS 11363
## 5 HALLMARK_ADIPOGENESIS 11364
## 6 HALLMARK_ADIPOGENESIS 11409
We can then supply this data frame to enricher() as the TERM2GENE argument.
enricher() also helps control your universe. The term universe refers to the background genes i.e. all the genes in your experiment.
By default most of these tools will consider the background to be all genes in you gene sets. But a lot of these may not be in your experiment because they have been filtered or were not part of your annotation.
To avoid inflating the significance it is good to make sure the universe is just the genes in your experiment.
Lets update our universe to be the genes from our differential results. Remember this has been subset based on an NA value in the padj from DESeq2.
It is always important to think not just about your genes of interest, but also what the background is.
sig_gene_enr <- enricher(sig_genes, TERM2GENE = mm_H, universe = as.character(Activated_minus_Resting$ENTREZID))
dotplot(sig_gene_enr)
clusterProfiler has a bunch of additional options for controlling your enrichment test.
We have focused on clusterProfiler as our package of choice for running these enrihcment analyses. For most use cases it is sufficient, but there are some other tools with specialist functions:
Another popular method for testing gene set enrichment with differential expression analysis results is the Broad’s GSEA method.
GSEA tests whether our gene set is correlated with the ranking of genes by our differential expression analysis metric.
GSEA walks down the ranked gene list, increasing a score when a gene is in the gene set, decreasing it when it’s not. The resulting distribution is then tested versus a null distribution using a modified KS-test. The results is a significance score, an enrichment score (based on the size of the peak), and leading edge genes (based on all genes up from the peak).
We can run GSEA easily with clusterProfiler. First we will need to produce a ranked and named vector of gene scores. Here will rank by stat column to give sensible measure of differential expression. We could also use log2FoldChange column if we have modified log2 fold changes using lfsShrink() function.
forRNK <- Activated_minus_Resting$stat
names(forRNK) <- Activated_minus_Resting$ENTREZID
forRNK <- forRNK[order(forRNK, decreasing = T)]
forRNK[1:6]
## 14939 110454 12772 16407 20198 54167
## 59.24250 56.23897 47.31871 44.47817 43.67879 43.32388
clusterProfiler though which can just use a simple data frame. For this we will look at the MSigDb enrichment of C7 (immunological signature). clusterProfiler needs a 2 column data frame with the geneset names and gene IDs, just as we used for our over representation analysis we did earlier.
## The 'msigdbdf' package must be installed to access the full dataset.
## Please run the following command to install the 'msigdbdf' package:
## install.packages('msigdbdf', repos = 'https://igordot.r-universe.dev')
## # A tibble: 6 x 2
## gs_name entrez_gene
## <chr> <int>
## 1 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 11768
## 2 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 67841
## 3 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 240613
## 4 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 67704
## 5 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 12772
## 6 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 26371
To run a GSEA there is a GSEA() function. We just provide our ranked list from differential gene expression analysis and the MSigDb Term2Gene dataframe.
We can easily visualize these GSEA results using the same methodology as GSAs to get dotplots and emaps.
clusterProfiler::dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7))
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 10, cex_label_category = 0.6) + theme(text = element_text(size = 7))
An additional visualization for GSEA is the Running Score plot. These are plotted for individual gene sets. here we are looking at the most signifcant group:
gseaplot(sig_gene_enr, geneSetID = 1, by = "runningScore", title = "GSE15330_HSC_VS_LYMPHOID_PRIMED_MULTIPOTENT_PROGENITOR_DN")
fgsea is an alternative tool to run GSEA. If you are running a lot of these tests or want a greater level of customization of the testing parameters it could be a better option.
It doesn’t have the same amount of plotting options, but it it very very fast and also gives lower level control.
We have presented two options for testing for gene set enrichment.
Fisher: * Great for curated gene sets like clusters or filtered DGE results
GSEA: * Great for subtle coordinated changes and avoids using arbitrary cutoffs
If you’re unsure, try both and review the results.