class: middle, inverse, title-slide .title[ # Analysis of RNAseq data in R and Bioconductor (part 4) ] .subtitle[ ##
Bioinformatics Resource Center - Rockefeller University ] .author[ ###
http://rockefelleruniversity.github.io/RU_RNAseq/
] .author[ ###
brc@rockefeller.edu
] --- ## What we will cover * Session 1: Alignment and counting * Session 2: Differential gene expression analysis * Session 3: Visualizing results through PCA and clustering * _Session 4: Gene Set Analysis_ * Session 5: Differential Transcript Utilization analysis --- ## The data Over these sessions we will review data from Christina Leslie's lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR486LMB/) and activated T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR726DNP/). --- ## The data Several intermediate files have already been made for you to use. You can download the course content from GitHub [here](https://github.com/RockefellerUniversity/RU_RNAseq/archive/master.zip). Once downloaded you will need to unzip the folder and then navigate into the *r_course* directory of the download. This will mean the paths in the code are correct. ``` r setwd("Path/to/Download/RU_RNAseq-master/r_course") ``` --- ## The data I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps() and then analyzed this data for expression changes using DESeq2. You can find an excel file of differential expression analysis for this comparison in the **data** directory. - CSV file of differential expression results for activated versus resting T-cells - **data/Group_Activated_minus_Resting.csv**. --- ## What we will cover In our previous sessions we looked at how we can identify experimentally interesting genes. Either through finding genes with significant differential expression or performing clustering. In this session we will explore a few ways we can evaluate any enrichment for functionally related genes. This helps us address common questions like: **Do "cell cycle" genes change more between conditions then other genes?** **Are genes related to "immune response" enriched in any clusters?** --- class: inverse, center, middle # Gene Sets <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## What we test and how we test? There are many options available to us for both functions/pathways/categories **(Gene set)** we will test and the methods we can use to test these gene sets for enrichment in our analysis **(Gene set Enrichment Analysis)**. - **Gene set** - A named collection of genes. - **Gene Set <u>e</u>nrichment Analysis - GSA** - A broad term for correlating a set of genes with a condition/phenotype. - **Gene Set <u>E</u>nrichment Analysis - GSEA** - Broad Institute's term for correlating a set of genes with a condition/phenotype. --- ## Gene sets Sources of well curated gene sets include [GO consortium](http://geneontology.org/) (gene's function, biological process and cellular localization), [REACTOME](http://www.reactome.org/) (Biological Pathways) and [MsigDB](http://software.broadinstitute.org/gsea/msigdb/) (Computationally and Experimentally derived). You can also use a user-defined gene set i.e. genes of interest from a specific paper. --- ## Gene Ontology The [Gene Ontology consortium](http://geneontology.org/) aims to provide a comprehensive resource of all the currently available knowledge regarding the functions of genes and gene products. Functional categories of genes are broadly split into three main groups: * **Molecular functions.** - Activity of a gene's protein product. * **Biological processes.** - Role of gene's protein product. * **Cellular components.** - Where in cell molecular function of protein product is performed. --- ## Gene Ontology The three sub categories of gene ontology are arranged in nested, structured graph with gene sets at the top of graph representing more general terms and those at the bottom more specific terms.  --- ## Reactome and KEGG The [Reactome](http://www.reactome.org/) and [KEGG (Kyoto Encyclopedia of genes and genomes)](http://www.genome.jp/kegg/kegg2.html) contains information on genes' membership and roles in molecular pathways. These databases focus largely on metabolic and disease pathways, and allow us to investigate our genes in the context of not only functional roles but relative positions within pathways. <div align="center"> <img src="imgs/map01100.png" alt="offset" height="300" width="600"> </div> --- ## MsigDB .pull-left[ The [molecular signatures database (MsigDB)](http://software.broadinstitute.org/gsea/msigdb/) is available from the Broad institute and provides a set of curated gene sets derived from sources such as GO, pathway databases, motif scans and even other experimental sets. MsigDB databases are widely used in Gene Set Enrichment Analysis and are available as plain text, in formats used with the popular Java based gene set enrichment software GSEA. ] .pull-right[ <div align="center"> <img src="imgs/msigdb.png" alt="offset" height="500" width="350"> </div> ] --- ## Gene sets in Bioconductor In R we can access information on these gene sets through database libraries (such as the Org.db.eg we have reviewed) such as **GO.db**, **reactome.db** or by making use of libraries which allows us to our gene sets from parse plain text formats, **GSEABase**. In our [Bioconductor sessions](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicFeatures_In_Bioconductor.html#Gene_Annotation) we reviewed how we to get information on genes using the **Org.db** packages. We can access these dbs in similar ways. ``` r library(GO.db) library(reactome.db) library(GSEABase) ``` --- class: inverse, center, middle # MSigDB and Gene Set Collections <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## GSEA and MSigDB The GSEA software and MSigDB gene set collections can be found at the [Broad's site](http://software.broadinstitute.org/gsea/index.jsp). We have access to Human gene sets containing information on: * H - hallmark gene sets * C1 - positional gene sets * C2 - curated gene sets * C3 - motif gene sets * C4 - computational gene sets * C5 - GO gene sets * C6 - oncogenic gene sets * C7 - immunologic gene sets --- ## Msigdbr The MSigDB collection has been wrapped up in the [*msigdbr* CRAN package](https://igordot.github.io/msigdbr/). This package also contains computationally predicted homologs for a number of common species, so you can looks at these MSigDb groups in other organisms. The *msigdbr()* function is used to specify which organism and which categories/collection you want. In return you get a type of data frame called a tibble. There is a lot of information in this data frame. ``` r library(msigdbr) mm_H <- msigdbr(species = "Mus musculus", category = "H") head(mm_H) ``` ``` ## # A tibble: 6 x 25 ## gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene ## <chr> <int> <chr> <chr> <chr> <chr> ## 1 Abca1 11303 ENSMUSG0000~ ABCA1 19 ENSG00000165029 ## 2 Abcb8 74610 ENSMUSG0000~ ABCB8 11194 ENSG00000197150 ## 3 Acaa2 52538 ENSMUSG0000~ ACAA2 10449 ENSG00000167315 ## 4 Acadl 11363 ENSMUSG0000~ ACADL 33 ENSG00000115361 ## 5 Acadm 11364 ENSMUSG0000~ ACADM 34 ENSG00000117054 ## 6 Acads 11409 ENSMUSG0000~ ACADS 35 ENSG00000122971 ## # i 19 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>, ## # gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>, ## # gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>, ## # gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>, ## # ortholog_taxon_id <int>, ortholog_sources <chr>, ## # num_ortholog_sources <dbl>, entrez_gene <int>, gs_cat <chr>, ## # gs_subcat <chr> ``` --- ## Msigdbr and GO If you are interested in GO terms we can grab them using the same approach. Here we get the GO terms but for Pig. ``` r ss_C5 <- msigdbr(species = "pig", category = "C5") head(ss_C5) ``` ``` ## # A tibble: 6 x 25 ## gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene ## <chr> <int> <chr> <chr> <chr> <chr> ## 1 ASS1 414411 ENSSSCG0000~ ASS1 445 ENSG00000130707 ## 2 CPS1 100157716 ENSSSCG0000~ CPS1 1373 ENSG00000021826 ## 3 GRIA1 100623647 ENSSSCG0000~ GRIA1 2890 ENSG00000155511 ## 4 KCNC2 100153996 ENSSSCG0000~ KCNC2 3747 ENSG00000166006 ## 5 SLC1A1 397003 ENSSSCG0000~ SLC1A1 6505 ENSG00000106688 ## 6 APLF 100516073 ENSSSCG0000~ APLF 200558 ENSG00000169621 ## # i 19 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>, ## # gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>, ## # gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>, ## # gs_geoid <chr>, gs_url <chr>, db_version <chr>, db_target_species <chr>, ## # ortholog_taxon_id <int>, ortholog_sources <chr>, ## # num_ortholog_sources <dbl>, entrez_gene <int>, gs_cat <chr>, ## # gs_subcat <chr> ``` --- class: inverse, center, middle # Testing gene set enrichment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Testing gene sets We will review two different methods to identify functional groups associated with our condition of interest. * The first method will test for any association of our gene set with our a group of interesting genes (differentially expressed genes/gene cluster) - Functional enrichment. * The second method will test for any association of our gene set with high/low ranked genes (ranked by measure of differential expression) - GSEA. --- ## DE results from DESeq2 First lets read in our differential expression results from DESeq2 analysis of activated vs resting T-cells. ``` r Activated_minus_Resting <- read.delim(file = "data/Group_Activated_minus_Resting.csv", sep = ",") Activated_minus_Resting[1:3, ] ``` ``` ## ENTREZID SYMBOL baseMean log2FoldChange lfcSE stat pvalue padj ## 1 100009600 Zglp1 3.685411 0.2755725 1.855514 0.1485154 0.881936 NA ## 2 100009609 Vmn2r65 0.000000 NA NA NA NA NA ## 3 100009614 Gm10024 0.000000 NA NA NA NA NA ## newPadj ## 1 1 ## 2 NA ## 3 NA ``` --- ## Background All GSA methods will require us to filter to genes to those tested for differential expression. We will therefore filter to all genes which pass our independent filtering step from DESeq2. These will be our genes with no NA in the DESeq2 padj column. ``` r Activated_minus_Resting <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj), ] Activated_minus_Resting[1:3, ] ``` ``` ## ENTREZID SYMBOL baseMean log2FoldChange lfcSE stat pvalue ## 6 100017 Ldlrap1 2752.367 -0.04815446 0.08880811 -0.5422305 5.876598e-01 ## 7 100019 Mdn1 2281.866 -1.09612289 0.09214164 -11.8960650 1.240603e-32 ## 8 100033459 Ifi208 3282.679 -0.02774770 0.09029354 -0.3073055 7.586109e-01 ## padj newPadj ## 6 7.416249e-01 1.000000e+00 ## 7 7.038705e-31 2.424386e-28 ## 8 8.601547e-01 1.000000e+00 ``` --- class: inverse, center, middle # Functional enrichment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Functional enrichment When we run a functional enrichment analysis (a.ka. over representation analysis) we are testing if we observe a greater proportion of genes of interest in a gene set versus the proportion in the background genes. We do this with a Fisher exact test. Example: We have 100 genes of interest. 1/4 are in a gene set we care about. Is it enriched? Table: Contingency Table for Fisher Test | | Genes.of.Interest| Gene.Universe| |:---------------|-----------------:|-------------:| |In Gene Set | 25| 50| |Out of Gene Set | 75| 1950| --- ## ClusterProfiler The **clusterProfiler** package provides multiple enrichment functions that work with curated gene sets (e.g. GO, KEGG) or custom gene sets. The most common way to use clusterProfiler is to run our over enrichment test using the one sided Fisher exact test. clusterProfiler has both GSA and GSEA approaches, so is a nice one stop shop for everything. Plus it has some really nice and easy visualization options. Detailed information about all of the functionality within this package is available [here](http://yulab-smu.top/clusterProfiler-book/). ``` r library(clusterProfiler) ``` --- ## Running ClusterProfiler First we need our gene information. For a simple functional enrichment of GO terms there is the *enrichGO()* function. You just provide this a vector of gene IDs you want to check, and the Org.db of the relevant organism. There are other specialized gene set functions i.e. *enrichKEGG()* for KEGG gene sets, or a generalized function *enrichr()* for any user provided gene sets. ``` r sig_genes <- Activated_minus_Resting[Activated_minus_Resting$padj < 0.05, 1] head(sig_genes) ``` ``` ## [1] 100019 100036521 100038516 100038882 100039796 100040736 ``` ``` r sig_gene_enr <- enrichGO(sig_genes, OrgDb = org.Mm.eg.db) ``` --- ## enrichResult object The result is special object which contains the outcome of the functional enrichment. When you look at what is inside you get a preview of what test was done and what the results look like. ``` r sig_gene_enr ``` ``` ## # ## # over-representation test ## # ## #...@organism Mus musculus ## #...@ontology MF ## #...@keytype ENTREZID ## #...@gene chr [1:4435] "100019" "100036521" "100038516" "100038882" "100039796" ... ## #...pvalues adjusted by 'BH' with cutoff <0.05 ## #...499 enriched terms found ## 'data.frame': 499 obs. of 12 variables: ## $ ID : chr "GO:0003779" "GO:0016887" "GO:0019207" "GO:0030695" ... ## $ Description : chr "actin binding" "ATP hydrolysis activity" "kinase regulator activity" "GTPase regulator activity" ... ## $ GeneRatio : chr "152/4247" "118/4247" "107/4247" "143/4247" ... ## $ BgRatio : chr "448/28396" "314/28396" "277/28396" "430/28396" ... ## $ RichFactor : num 0.339 0.376 0.386 0.333 0.333 ... ## $ FoldEnrichment: num 2.27 2.51 2.58 2.22 2.22 ... ## $ zScore : num 11.3 11.3 11.1 10.7 10.7 ... ## $ pvalue : num 5.18e-24 4.16e-23 3.66e-22 9.82e-22 9.82e-22 ... ## $ p.adjust : num 6.48e-21 2.60e-20 1.52e-19 2.45e-19 2.45e-19 ... ## $ qvalue : num 3.41e-21 1.37e-20 8.02e-20 1.29e-19 1.29e-19 ... ## $ geneID : chr "104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/"| __truncated__ "107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/1198"| __truncated__ "108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12"| __truncated__ "100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/136"| __truncated__ ... ## $ Count : int 152 118 107 143 143 96 125 120 143 111 ... ## #...Citation ## T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu. ## clusterProfiler 4.0: A universal enrichment tool for interpreting omics data. ## The Innovation. 2021, 2(3):100141 ``` --- ## enrichResult object You can also convert this to a dataframe and export. ``` r gsa_out <- as.data.frame(sig_gene_enr) rio::export(gsa_out, "gsa_result.xlsx") head(gsa_out) ``` ``` ## ID Description GeneRatio ## GO:0003779 GO:0003779 actin binding 152/4247 ## GO:0016887 GO:0016887 ATP hydrolysis activity 118/4247 ## GO:0019207 GO:0019207 kinase regulator activity 107/4247 ## GO:0030695 GO:0030695 GTPase regulator activity 143/4247 ## GO:0060589 GO:0060589 nucleoside-triphosphatase regulator activity 143/4247 ## GO:0140097 GO:0140097 catalytic activity, acting on DNA 96/4247 ## BgRatio RichFactor FoldEnrichment zScore pvalue ## GO:0003779 448/28396 0.3392857 2.268509 11.34931 5.184531e-24 ## GO:0016887 314/28396 0.3757962 2.512623 11.30300 4.160383e-23 ## GO:0019207 277/28396 0.3862816 2.582729 11.10092 3.658863e-22 ## GO:0030695 430/28396 0.3325581 2.223527 10.72125 9.819222e-22 ## GO:0060589 430/28396 0.3325581 2.223527 10.72125 9.819222e-22 ## GO:0140097 238/28396 0.4033613 2.696927 11.02461 1.331557e-21 ## p.adjust qvalue ## GO:0003779 6.475479e-21 3.410876e-21 ## GO:0016887 2.598159e-20 1.368547e-20 ## GO:0019207 1.523307e-19 8.023823e-20 ## GO:0030695 2.452842e-19 1.292003e-19 ## GO:0060589 2.452842e-19 1.292003e-19 ## GO:0140097 2.771859e-19 1.460041e-19 ## geneID ## GO:0003779 104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/12721/12774/12798/13138/13169/13367/13419/13726/13821/13822/14026/14083/14248/14457/14468/14469/15163/16332/16412/16796/16985/17118/171580/17357/17698/17886/17896/17909/17913/17916/17918/17920/17996/18301/18643/18754/18810/18826/19200/19201/192176/19241/194401/20198/20259/20430/20649/20650/20739/20740/211401/212073/213019/21346/215114/215280/217692/21894/21961/22138/22288/22323/22351/22376/22379/22388/22629/227753/230917/231003/232533/235431/237436/237860/23789/23790/242687/242894/246177/263406/26936/270058/270163/29816/29856/29875/319565/319876/320878/326618/338367/404710/50875/52690/54132/544963/55991/56226/56376/56378/56419/56431/57748/57778/58800/63873/63986/66713/67399/67425/67771/67803/68089/68142/68743/70292/70394/71602/71653/71994/72042/74117/74192/75646/76448/76467/76580/76709/77767/77963/78885/78926/80743/80889 ## GO:0016887 107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/11982/12461/12462/12464/12495/12648/13205/13404/13424/13682/14211/14828/15481/15511/15526/15925/16554/16558/16561/16573/16574/16581/170759/17215/17250/17350/17685/18671/19179/19182/19299/19348/19360/19361/193740/19718/20174/20480/20586/209737/21354/21355/216848/217716/219189/22027/22427/224824/226026/226153/227099/227612/229003/230073/231912/232975/237877/237886/23834/23924/239273/24015/240641/244059/26357/269523/27406/27407/27416/276950/30935/320790/338467/381290/50770/53313/54667/56495/56505/60530/66043/67126/67241/67398/68142/69716/70099/70218/70472/71389/71586/71602/71819/72151/73804/74142/74355/76295/76408/80861/98660 ## GO:0019207 108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12575/12578/12579/12580/12581/12700/12704/13001/13205/13867/14168/14727/14728/14870/15944/16002/16145/16195/17005/17096/17420/17751/17973/18148/18439/18519/18708/18709/18710/18712/19084/19087/192652/19354/19744/20304/208228/208659/20869/211770/214944/216233/216505/216965/217410/217588/21802/21803/21813/219181/22375/231659/232157/232341/23991/23994/240354/24056/240665/26396/26457/268697/27214/29875/319955/320207/380694/384783/52552/54124/54396/54401/54673/56036/56398/56706/66214/66787/67287/67605/70769/72949/74018/74370/74734/77799/78757/97541/97998 ## GO:0030695 100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910 ## GO:0060589 100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910 ## GO:0140097 104806/107182/109151/114714/11792/12648/13205/13404/13419/13421/13433/13435/14156/15201/16881/16952/17215/17216/17217/17218/17219/17350/17685/18538/18807/18968/18971/18973/18975/19205/192119/19360/19361/19366/19646/19714/19718/20174/20586/208084/214627/216848/21969/21973/21974/21975/21976/22427/225182/22594/226153/234258/234396/237877/237911/244059/245474/26383/26447/268281/26909/269400/27015/27041/319955/320209/320790/327762/330554/338359/50505/50776/56196/56210/56505/57444/623474/66408/67155/67967/68048/68142/70603/71175/71389/72103/72107/72151/72831/72960/74549/75560/76251/78286/93696/93760 ## Count ## GO:0003779 152 ## GO:0016887 118 ## GO:0019207 107 ## GO:0030695 143 ## GO:0060589 143 ## GO:0140097 96 ``` --- ## Visualizing the result There are a couple of easy functions to visualize the top hits in your result. For example this dotplot which shows amount of overlap with geneset. These are also all built in ggplot2, so it is easy to modify parameters. For more info on ggplot2 check our course [here](https://rockefelleruniversity.github.io/Plotting_In_R/presentations/singlepage/ggplot2.html). ``` r library(ggplot2) dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7)) ``` <!-- --> --- ## Visualizing the result Other useful visualizations include enrichment maps. These network plots show how the significant groups in the gene sets relate to each other. ``` r library(enrichplot) sig_gene_enr <- pairwise_termsim(sig_gene_enr) emapplot(sig_gene_enr, showCategory = 15, cex_label_category = 0.6) + theme(text = element_text(size = 7)) ``` <!-- --> --- ## Beyond GO There are several in-built functions that allow you to test against curated gene sets like GO terms (enrichGO) or KEGG (enrichKEGG). If we want to use our own gene sets or MSigDB we can provide them to the generic function *enricher()*. First lets grab the curated gene sets (C2) from MSigDB using *msigdbr*. We only need the Gene Set Name and the Gene ID. ``` r mm_H <- msigdbr(species = "Mus musculus", category = "H")[, c("gs_name", "entrez_gene")] head(mm_H) ``` ``` ## # A tibble: 6 x 2 ## gs_name entrez_gene ## <chr> <int> ## 1 HALLMARK_ADIPOGENESIS 11303 ## 2 HALLMARK_ADIPOGENESIS 74610 ## 3 HALLMARK_ADIPOGENESIS 52538 ## 4 HALLMARK_ADIPOGENESIS 11363 ## 5 HALLMARK_ADIPOGENESIS 11364 ## 6 HALLMARK_ADIPOGENESIS 11409 ``` --- ## Beyond GO We can then supply this data frame to *enricher()* as the TERM2GENE argument. ``` r sig_gene_enr <- enricher(sig_genes, TERM2GENE = mm_H) dotplot(sig_gene_enr) ``` <!-- --> --- ## What universe? *enricher()* also helps control your universe. The term universe refers to the background genes i.e. all the genes in your experiment. By default most of these tools will consider the background to be all genes in you gene sets. But a lot of these may not be in your experiment because they have been filtered or were not part of your annotation. To avoid inflating the significance it is good to make sure the universe is just the genes in your experiment. --- ## What universe? Lets update our universe to be the genes from our differential results. Remember this has been subset based on an NA value in the padj from DESeq2. It is always important to think not just about your genes of interest, but also what the background is. ``` r sig_gene_enr <- enricher(sig_genes, TERM2GENE = mm_H, universe = as.character(Activated_minus_Resting$ENTREZID)) dotplot(sig_gene_enr) ``` <!-- --> --- ## Going further clusterProfiler has a bunch of additional options for controlling your enrichment test. * pvalueCutoff - change the cutoff for what is significant. * pAdjustMethod - change multiple testing methods. * qvalueCutoff - change the cutoff for what is significant after multiple testing. * minGSSize/maxGSSize - exclude genes sets that are too big/large. --- ## Extra Tools We have focused on clusterProfiler as our package of choice for running these enrihcment analyses. For most use cases it is sufficient, but there are some other tools with specialist functions: * [go.seq](https://bioconductor.org/packages/release/bioc/html/goseq.html) - This incorporates gene length into the null model to help mitigate any bias * [topGO](https://bioconductor.org/packages/release/bioc/html/topGO.html) - If you are interested in GO terms and are only getting big parent terms i.e. "Cellular Process", topGO has alternative tests which weight the results to smaller child GO terms. --- class: inverse, center, middle # GSEA <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## GSEA Another popular method for testing gene set enrichment with differential expression analysis results is the Broad's GSEA method. GSEA tests whether our gene set is correlated with the ranking of genes by our differential expression analysis metric. <div align="center"> <img src="imgs/GSEA2.png" alt="gsea" height="400" width="300"> </div> --- ## How it works GSEA walks down the ranked gene list, increasing a score when a gene is in the gene set, decreasing it when it’s not. The resulting distribution is then tested versus a null distribution using a modified KS-test. The results is a significance score, an enrichment score (based on the size of the peak), and leading edge genes (based on all genes up from the peak). <div align="center"> <img src="imgs/GSEA3.png" alt="gsea" height="400" width="300"> </div> --- ## clusterProfiler inputs We can run GSEA easily with clusterProfiler. First we will need to produce a ranked and named vector of gene scores. Here will rank by **stat** column to give sensible measure of differential expression. We could also use **log2FoldChange** column if we have modified log2 fold changes using **lfsShrink()** function. ``` r forRNK <- Activated_minus_Resting$stat names(forRNK) <- Activated_minus_Resting$ENTREZID forRNK <- forRNK[order(forRNK, decreasing = T)] forRNK[1:6] ``` ``` ## 14939 110454 12772 16407 20198 54167 ## 59.24250 56.23897 47.31871 44.47817 43.67879 43.32388 ``` --- ## clusterProfiler inputs **clusterProfiler** though which can just use a simple data frame. For this we will look at the MSigDb enrichment of C7 (immunological signature). clusterProfiler needs a 2 column data frame with the geneset names and gene IDs, just as we used for our over representation analysis we did earlier. ``` r mm_c7 <- msigdbr(species = "Mus musculus", category = "C7")[, c("gs_name", "entrez_gene")] ``` ``` ## The 'msigdbdf' package must be installed to access the full dataset. ``` ``` ## Please run the following command to install the 'msigdbdf' package: ## install.packages('msigdbdf', repos = 'https://igordot.r-universe.dev') ``` ``` r head(mm_c7) ``` ``` ## # A tibble: 6 x 2 ## gs_name entrez_gene ## <chr> <int> ## 1 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 11768 ## 2 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 67841 ## 3 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 240613 ## 4 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 67704 ## 5 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 12772 ## 6 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_6HR_UP 26371 ``` --- ## Running clusterProfiler To run a GSEA there is a *GSEA()* function. We just provide our ranked list from differential gene expression analysis and the MSigDb Term2Gene dataframe. ``` r sig_gene_enr <- GSEA(forRNK, TERM2GENE = mm_c7) ``` --- ## Visualizing the result We can easily visualize these GSEA results using the same methodology as GSAs to get dotplots and emaps. ``` r clusterProfiler::dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7)) ``` <!-- --> --- ## Visualizing the result ``` r library(enrichplot) sig_gene_enr <- pairwise_termsim(sig_gene_enr) emapplot(sig_gene_enr, showCategory = 10, cex_label_category = 0.6) + theme(text = element_text(size = 7)) ``` <!-- --> --- ## Visualizing the result An additional visualization for GSEA is the Running Score plot. These are plotted for individual gene sets. here we are looking at the most signifcant group: ``` r gseaplot(sig_gene_enr, geneSetID = 1, by = "runningScore", title = "GSE15330_HSC_VS_LYMPHOID_PRIMED_MULTIPOTENT_PROGENITOR_DN") ``` <!-- --> --- ## fgsea [**fgsea**](https://bioconductor.org/packages/release/bioc/html/fgsea.html) is an alternative tool to run GSEA. If you are running a lot of these tests or want a greater level of customization of the testing parameters it could be a better option. It doesn't have the same amount of plotting options, but it it very very fast and also gives lower level control. --- ## GSEA vs Fisher test We have presented two options for testing for gene set enrichment. Fisher: * Great for curated gene sets like clusters or filtered DGE results GSEA: * Great for subtle coordinated changes and avoids using arbitrary cutoffs If you're unsure, try both and review the results. --- ## Time for an exercise [Link_to_exercises](../../exercises/exercises/RNAseq_part4_exercise.html) [Link_to_answers](../../exercises/answers/RNAseq_part4_answers.html)