Analysis of RNAseq data in R and Bioconductor (part 4)

class: middle, inverse, title-slide

.title[
# Analysis of RNAseq data in R and Bioconductor (part 4)
]
.subtitle[
## <html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html><br />
Bioinformatics Resource Center - Rockefeller University
]
.author[
### <a href="http://rockefelleruniversity.github.io/RU_RNAseq/" class="uri">http://rockefelleruniversity.github.io/RU_RNAseq/</a>
]
.author[
### <a href="mailto:brc@rockefeller.edu" class="email">brc@rockefeller.edu</a>
]

---

## What we will cover

* Session 1: Alignment and counting

* Session 2: Differential gene expression analysis

* Session 3: Visualizing results through PCA and clustering

* _Session 4: Gene Set Analysis_

* Session 5: Differential Transcript Utilization analysis

---
## The data

Over these sessions we will review data from Christina Leslie's lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR486LMB/) and activated T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR726DNP/).

---
## The data

Several intermediate files have already been made for you to use. You can download the course content from GitHub [here](https://github.com/RockefellerUniversity/RU_RNAseq/archive/master.zip). Once downloaded you will need to unzip the folder and then navigate into the *r_course* directory of the download. This will mean the paths in the code are correct.

``` r
setwd("Path/to/Download/RU_RNAseq-master/r_course")
```

---
## The data

I have aligned all FQ to BAM and counted in genes and exons using Rsubread and summarizeOverlaps() and then analyzed this data for expression changes using DESeq2. You can find an excel file of differential expression analysis for this comparison in the **data** directory.

- CSV file of differential expression results for activated versus resting T-cells - **data/Group_Activated_minus_Resting.csv**.

---
## What we will cover

In our previous sessions we looked at how we can identify experimentally interesting genes. Either through finding genes with significant differential expression or performing clustering.

In this session we will explore a few ways we can evaluate any enrichment for functionally related genes. This helps us address common questions like:

**Do "cell cycle" genes change more between conditions then other genes?**

**Are genes related to "immune response" enriched in any clusters?**

---
class: inverse, center, middle

# Gene Sets

---

## What we test and how we test?

There are many options available to us for both functions/pathways/categories **(Gene set)** we will test and the methods we can use to test these gene sets for enrichment in our analysis **(Gene set Enrichment Analysis)**.

- **Gene set** - A named collection of genes.
- **Gene Set <u>e</u>nrichment Analysis - GSA** - A broad term for correlating a set of genes with a condition/phenotype.
- **Gene Set <u>E</u>nrichment Analysis - GSEA** - Broad Institute's term for correlating a set of genes with a condition/phenotype.

---
## Gene sets

Sources of well curated gene sets include [GO consortium](http://geneontology.org/) (gene's function, biological process and cellular localization), [REACTOME](http://www.reactome.org/) (Biological Pathways) and [MsigDB](http://software.broadinstitute.org/gsea/msigdb/) (Computationally and Experimentally derived).

You can also use a user-defined gene set i.e. genes of interest from a specific paper.

---
## Gene Ontology

The [Gene Ontology consortium](http://geneontology.org/) aims to provide a comprehensive resource of all the currently available knowledge regarding the functions of genes and gene products.

Functional categories of genes are broadly split into three main groups:

* **Molecular functions.** - Activity of a gene's protein product.
* **Biological processes.** - Role of gene's protein product.
* **Cellular components.** - Where in cell molecular function of protein product is performed.

---
## Gene Ontology

The three sub categories of gene ontology are arranged in nested, structured graph with gene sets at the top of graph representing more general terms and those at the bottom more specific terms.

![](imgs/goGraph.png)

---
## Reactome and KEGG

The [Reactome](http://www.reactome.org/) and [KEGG (Kyoto Encyclopedia of genes and genomes)](http://www.genome.jp/kegg/kegg2.html) contains information on genes' membership and roles in  molecular pathways.

These databases focus largely on metabolic and disease pathways, and allow us to investigate our genes in the context of not only functional roles but relative positions within pathways.

---
## MsigDB
.pull-left[
The [molecular signatures database (MsigDB)](http://software.broadinstitute.org/gsea/msigdb/) is available from the Broad institute and provides a set of curated gene sets derived from sources such as GO, pathway databases, motif scans and even other experimental sets.

MsigDB databases are widely used in Gene Set Enrichment Analysis and are available as plain text, in formats used with the popular Java based gene set enrichment software GSEA.
  ]

.pull-right[
<div align="center">
<img src="imgs/msigdb.png" alt="offset" height="500" width="350">
</div>
  ]

---
## Gene sets in Bioconductor

In R we can access information on these gene sets through database libraries (such as the Org.db.eg we have reviewed) such as **GO.db**, **reactome.db** or by making use of libraries which allows us to our gene sets from parse plain text formats, **GSEABase**.

In our [Bioconductor sessions](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicFeatures_In_Bioconductor.html#Gene_Annotation) we reviewed how we to get information on genes using the **Org.db** packages. We can access these dbs in similar ways.

``` r
library(GO.db)
library(reactome.db)
library(GSEABase)
```

---
class: inverse, center, middle

# MSigDB and Gene Set Collections

---

## GSEA and MSigDB

The GSEA software and MSigDB gene set collections can be found at the [Broad's site](http://software.broadinstitute.org/gsea/index.jsp).

We have access to Human gene sets containing information on:

* H	- hallmark gene sets
* C1	- positional gene sets
* C2	- curated gene sets
* C3	- motif gene sets
* C4	- computational gene sets
* C5	- GO gene sets
* C6	- oncogenic gene sets 
* C7	- immunologic gene sets

---
## Msigdbr

The MSigDB collection has been wrapped up in the [*msigdbr* CRAN package](https://igordot.github.io/msigdbr/). This package also contains computationally predicted homologs for a number of common species, so you can looks at these MSigDb groups in other organisms.

The *msigdbr()* function is used to specify which organism and which categories/collection you want. In return you get a type of data frame called a tibble. There is a lot of information in this data frame.

``` r
library(msigdbr)

mm_H <- msigdbr(species = "Mus musculus", category = "H")
head(mm_H)
```

```
## # A tibble: 6 x 26
##   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
##   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
## 1 Abca1       11303     ENSMUSG0000~ ABCA1          19           ENSG00000165029
## 2 Abcb8       74610     ENSMUSG0000~ ABCB8          11194        ENSG00000197150
## 3 Acaa2       52538     ENSMUSG0000~ ACAA2          10449        ENSG00000167315
## 4 Acadl       11363     ENSMUSG0000~ ACADL          33           ENSG00000115361
## 5 Acadm       11364     ENSMUSG0000~ ACADM          34           ENSG00000117054
## 6 Acads       11409     ENSMUSG0000~ ACADS          35           ENSG00000122971
## # i 20 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
## #   db_target_species <chr>, ortholog_taxon_id <int>, ortholog_sources <chr>,
## #   num_ortholog_sources <dbl>, entrez_gene <chr>, gs_cat <chr>,
## #   gs_subcat <chr>
```

---
## Msigdbr and GO

If you are interested in GO terms we can grab them using the same approach. Here we get the GO terms but for Pig.

``` r
ss_C5 <- msigdbr(species = "pig", category = "C5")
head(ss_C5)
```

```
## # A tibble: 6 x 26
##   gene_symbol ncbi_gene ensembl_gene db_gene_symbol db_ncbi_gene db_ensembl_gene
##   <chr>       <chr>     <chr>        <chr>          <chr>        <chr>          
## 1 AASDHPPT    100525064 ENSSSCG0000~ AASDHPPT       60496        ENSG00000149313
## 2 ALDH1L1     100622210 ENSSSCG0000~ ALDH1L1        10840        ENSG00000144908
## 3 ALDH1L2     100151976 ENSSSCG0000~ ALDH1L2        160428       ENSG00000136010
## 4 MTHFD1      414382    ENSSSCG0000~ MTHFD1         4522         ENSG00000100714
## 5 MTHFD1L     100154722 ENSSSCG0000~ MTHFD1L        25902        ENSG00000120254
## 6 MTHFD2L     100525706 ENSSSCG0000~ MTHFD2L        441024       ENSG00000163738
## # i 20 more variables: source_gene <chr>, gs_id <chr>, gs_name <chr>,
## #   gs_collection <chr>, gs_subcollection <chr>, gs_collection_name <chr>,
## #   gs_description <chr>, gs_source_species <chr>, gs_pmid <chr>,
## #   gs_geoid <chr>, gs_exact_source <chr>, gs_url <chr>, db_version <chr>,
## #   db_target_species <chr>, ortholog_taxon_id <int>, ortholog_sources <chr>,
## #   num_ortholog_sources <dbl>, entrez_gene <chr>, gs_cat <chr>,
## #   gs_subcat <chr>
```

---
class: inverse, center, middle

# Testing gene set enrichment

---

## Testing gene sets

We will review two different methods to identify functional groups associated with our condition of interest.

* The first method will test for any association of our gene set with our a group of interesting genes (differentially expressed genes/gene cluster) - Functional enrichment.
* The second method will test for any association of our gene set with high/low ranked genes (ranked by measure of differential expression) - GSEA.

---
## DE results from DESeq2

First lets read in our differential expression results from DESeq2 analysis of activated vs resting T-cells.

``` r
Activated_minus_Resting <- read.delim(file = "data/Group_Activated_minus_Resting.csv",
    sep = ",")
Activated_minus_Resting[1:3, ]
```

```
##    ENTREZID  SYMBOL baseMean log2FoldChange    lfcSE      stat   pvalue padj
## 1 100009600   Zglp1 3.685411      0.2755725 1.855514 0.1485154 0.881936   NA
## 2 100009609 Vmn2r65 0.000000             NA       NA        NA       NA   NA
## 3 100009614 Gm10024 0.000000             NA       NA        NA       NA   NA
##   newPadj
## 1       1
## 2      NA
## 3      NA
```

---

## Background

All GSA methods will require us to filter to genes to those tested for differential expression.

We will therefore filter to all genes which pass our independent filtering step from DESeq2. These will be our genes with no NA in the DESeq2 padj column.

``` r
Activated_minus_Resting <- Activated_minus_Resting[!is.na(Activated_minus_Resting$padj),
    ]
Activated_minus_Resting[1:3, ]
```

```
##    ENTREZID  SYMBOL baseMean log2FoldChange      lfcSE        stat       pvalue
## 6    100017 Ldlrap1 2752.367    -0.04815446 0.08880811  -0.5422305 5.876598e-01
## 7    100019    Mdn1 2281.866    -1.09612289 0.09214164 -11.8960650 1.240603e-32
## 8 100033459  Ifi208 3282.679    -0.02774770 0.09029354  -0.3073055 7.586109e-01
##           padj      newPadj
## 6 7.416249e-01 1.000000e+00
## 7 7.038705e-31 2.424386e-28
## 8 8.601547e-01 1.000000e+00
```

---
class: inverse, center, middle

# Functional enrichment

---

## Functional enrichment

When we run a functional enrichment analysis (a.ka. over representation analysis) we are testing if we observe a greater proportion of genes of interest in a gene set versus the proportion in the background genes. We do this with a Fisher exact test.

Example:

We have 100 genes of interest. 1/4 are in a gene set we care about. Is it enriched?

Table: Contingency Table for Fisher Test

|                | Genes.of.Interest| Gene.Universe|
|:---------------|-----------------:|-------------:|
|In Gene Set     |                25|            50|
|Out of Gene Set |                75|          1950|

---
## ClusterProfiler

The **clusterProfiler** package provides multiple enrichment functions that work with curated gene sets (e.g. GO, KEGG) or custom gene sets. The most common way to use clusterProfiler is to run our over enrichment test using the one sided Fisher exact test.

clusterProfiler has both GSA and GSEA approaches, so is a nice one stop shop for everything. Plus it has some really nice and easy visualization options. Detailed information about all of the functionality within this package is available [here](http://yulab-smu.top/clusterProfiler-book/).

``` r
library(clusterProfiler)
```

---
## Running ClusterProfiler

First we need our gene information. For a simple functional enrichment of GO terms there is the *enrichGO()* function. You just provide this a vector of gene IDs you want to check, and the Org.db of the relevant organism. There are other specialized gene set functions i.e. *enrichKEGG()* for KEGG gene sets, or a generalized function *enrichr()* for any user provided gene sets.

``` r
sig_genes <- Activated_minus_Resting[Activated_minus_Resting$padj < 0.05, 1]
head(sig_genes)
```

```
## [1]    100019 100036521 100038516 100038882 100039796 100040736
```

``` r
sig_gene_enr <- enrichGO(sig_genes, OrgDb = org.Mm.eg.db)
```

---
## enrichResult object

The result is special object which contains the outcome of the functional enrichment. When you look at what is inside you get a preview of what test was done and what the results look like.

``` r
sig_gene_enr
```

```
## #
## # over-representation test
## #
## #...@organism 	 Mus musculus 
## #...@ontology 	 MF 
## #...@keytype 	 ENTREZID 
## #...@gene 	 chr [1:4435] "100019" "100036521" "100038516" "100038882" "100039796" ...
## #...pvalues adjusted by 'BH' with cutoff <0.05 
## #...499 enriched terms found
## 'data.frame':	499 obs. of  12 variables:
##  $ ID            : chr  "GO:0003779" "GO:0016887" "GO:0019207" "GO:0030695" ...
##  $ Description   : chr  "actin binding" "ATP hydrolysis activity" "kinase regulator activity" "GTPase regulator activity" ...
##  $ GeneRatio     : chr  "152/4247" "118/4247" "107/4247" "143/4247" ...
##  $ BgRatio       : chr  "448/28396" "314/28396" "277/28396" "430/28396" ...
##  $ RichFactor    : num  0.339 0.376 0.386 0.333 0.333 ...
##  $ FoldEnrichment: num  2.27 2.51 2.58 2.22 2.22 ...
##  $ zScore        : num  11.3 11.3 11.1 10.7 10.7 ...
##  $ pvalue        : num  5.18e-24 4.16e-23 3.66e-22 9.82e-22 9.82e-22 ...
##  $ p.adjust      : num  6.48e-21 2.60e-20 1.52e-19 2.45e-19 2.45e-19 ...
##  $ qvalue        : num  3.41e-21 1.37e-20 8.02e-20 1.29e-19 1.29e-19 ...
##  $ geneID        : chr  "104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/"| __truncated__ "107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/1198"| __truncated__ "108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12"| __truncated__ "100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/136"| __truncated__ ...
##  $ Count         : int  152 118 107 143 143 96 125 120 143 111 ...
## #...Citation
##  T Wu, E Hu, S Xu, M Chen, P Guo, Z Dai, T Feng, L Zhou, W Tang, L Zhan, X Fu, S Liu, X Bo, and G Yu.
##  clusterProfiler 4.0: A universal enrichment tool for interpreting omics data.
##  The Innovation. 2021, 2(3):100141
```

---
## enrichResult object

You can also convert this to a dataframe and export.

``` r
gsa_out <- as.data.frame(sig_gene_enr)
rio::export(gsa_out, "gsa_result.xlsx")

head(gsa_out)
```

```
##                    ID                                  Description GeneRatio
## GO:0003779 GO:0003779                                actin binding  152/4247
## GO:0016887 GO:0016887                      ATP hydrolysis activity  118/4247
## GO:0019207 GO:0019207                    kinase regulator activity  107/4247
## GO:0030695 GO:0030695                    GTPase regulator activity  143/4247
## GO:0060589 GO:0060589 nucleoside-triphosphatase regulator activity  143/4247
## GO:0140097 GO:0140097            catalytic activity, acting on DNA   96/4247
##              BgRatio RichFactor FoldEnrichment   zScore       pvalue
## GO:0003779 448/28396  0.3392857       2.268509 11.34931 5.184531e-24
## GO:0016887 314/28396  0.3757962       2.512623 11.30300 4.160383e-23
## GO:0019207 277/28396  0.3862816       2.582729 11.10092 3.658863e-22
## GO:0030695 430/28396  0.3325581       2.223527 10.72125 9.819222e-22
## GO:0060589 430/28396  0.3325581       2.223527 10.72125 9.819222e-22
## GO:0140097 238/28396  0.4033613       2.696927 11.02461 1.331557e-21
##                p.adjust       qvalue
## GO:0003779 6.475479e-21 3.410876e-21
## GO:0016887 2.598159e-20 1.368547e-20
## GO:0019207 1.523307e-19 8.023823e-20
## GO:0030695 2.452842e-19 1.292003e-19
## GO:0060589 2.452842e-19 1.292003e-19
## GO:0140097 2.771859e-19 1.460041e-19
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     geneID
## GO:0003779 104027/109711/110253/11426/11518/11629/11727/11838/11867/12331/12332/12340/12343/12345/12385/12488/12631/12632/12721/12774/12798/13138/13169/13367/13419/13726/13821/13822/14026/14083/14248/14457/14468/14469/15163/16332/16412/16796/16985/17118/171580/17357/17698/17886/17896/17909/17913/17916/17918/17920/17996/18301/18643/18754/18810/18826/19200/19201/192176/19241/194401/20198/20259/20430/20649/20650/20739/20740/211401/212073/213019/21346/215114/215280/217692/21894/21961/22138/22288/22323/22351/22376/22379/22388/22629/227753/230917/231003/232533/235431/237436/237860/23789/23790/242687/242894/246177/263406/26936/270058/270163/29816/29856/29875/319565/319876/320878/326618/338367/404710/50875/52690/54132/544963/55991/56226/56376/56378/56419/56431/57748/57778/58800/63873/63986/66713/67399/67425/67771/67803/68089/68142/68743/70292/70394/71602/71653/71994/72042/74117/74192/75646/76448/76467/76580/76709/77767/77963/78885/78926/80743/80889
## GO:0016887                                                                                                                                                                                                               107239/108888/110033/110350/110920/11303/11426/11666/11928/11938/11946/11949/11950/11957/11964/11980/11981/11982/12461/12462/12464/12495/12648/13205/13404/13424/13682/14211/14828/15481/15511/15526/15925/16554/16558/16561/16573/16574/16581/170759/17215/17250/17350/17685/18671/19179/19182/19299/19348/19360/19361/193740/19718/20174/20480/20586/209737/21354/21355/216848/217716/219189/22027/22427/224824/226026/226153/227099/227612/229003/230073/231912/232975/237877/237886/23834/23924/239273/24015/240641/244059/26357/269523/27406/27407/27416/276950/30935/320790/338467/381290/50770/53313/54667/56495/56505/60530/66043/67126/67241/67398/68142/69716/70099/70218/70472/71389/71586/71602/71819/72151/73804/74142/74355/76295/76408/80861/98660
## GO:0019207                                                                                                                                                                                                                                                                                         108099/11651/11839/12042/12283/12313/12315/12367/12428/12442/12444/12445/12447/12448/12449/12450/12452/12539/12575/12578/12579/12580/12581/12700/12704/13001/13205/13867/14168/14727/14728/14870/15944/16002/16145/16195/17005/17096/17420/17751/17973/18148/18439/18519/18708/18709/18710/18712/19084/19087/192652/19354/19744/20304/208228/208659/20869/211770/214944/216233/216505/216965/217410/217588/21802/21803/21813/219181/22375/231659/232157/232341/23991/23994/240354/24056/240665/26396/26457/268697/27214/29875/319955/320207/380694/384783/52552/54124/54396/54401/54673/56036/56398/56706/66214/66787/67287/67605/70769/72949/74018/74370/74734/77799/78757/97541/97998
## GO:0030695                     100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910
## GO:0060589                     100088/101497/102098/105689/106952/107566/108911/108995/109689/109905/109934/114713/11855/11857/12672/13196/13605/140500/14270/14469/14569/14682/14784/15357/15463/16413/16476/16800/16801/16897/18220/18582/19159/192662/19347/19395/19418/19419/19732/19739/210789/212285/213990/214137/216363/217692/217835/217869/217944/218397/21844/218581/219140/22324/223254/223435/225358/226970/227801/228355/228359/228482/229877/229898/230784/231207/231821/232089/232201/232906/233204/234094/234353/238130/24056/241308/243780/244668/244867/259302/263406/26382/26934/270163/27054/277360/29875/319934/320435/320484/320634/329260/329727/380608/380711/381085/381605/404710/50778/50780/51791/54189/544963/54645/56349/56508/56715/56784/57257/66096/66687/66691/67425/67865/68576/69440/69632/69780/70785/71085/71435/71709/71810/72121/72662/73341/73910/74018/74030/74334/75415/75547/76117/76123/76131/78255/78514/78808/79264/83945/94176/94254/98910
## GO:0140097                                                                                                                                                                                                                                                                                                                                                        104806/107182/109151/114714/11792/12648/13205/13404/13419/13421/13433/13435/14156/15201/16881/16952/17215/17216/17217/17218/17219/17350/17685/18538/18807/18968/18971/18973/18975/19205/192119/19360/19361/19366/19646/19714/19718/20174/20586/208084/214627/216848/21969/21973/21974/21975/21976/22427/225182/22594/226153/234258/234396/237877/237911/244059/245474/26383/26447/268281/26909/269400/27015/27041/319955/320209/320790/327762/330554/338359/50505/50776/56196/56210/56505/57444/623474/66408/67155/67967/68048/68142/70603/71175/71389/72103/72107/72151/72831/72960/74549/75560/76251/78286/93696/93760
##            Count
## GO:0003779   152
## GO:0016887   118
## GO:0019207   107
## GO:0030695   143
## GO:0060589   143
## GO:0140097    96
```

---
## Visualizing the result

There are a couple of easy functions to visualize the top hits in your result. For example this dotplot which shows amount of overlap with geneset. These are also all built in ggplot2, so it is easy to modify parameters. For more info on ggplot2 check our course [here](https://rockefelleruniversity.github.io/Plotting_In_R/presentations/singlepage/ggplot2.html).

``` r
library(ggplot2)
dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7))
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-15-1.png)

---
## Visualizing the result

Other useful visualizations include enrichment maps. These network plots show how the significant groups in the gene sets relate to each other.

``` r
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 15, cex_label_category = 0.6) + theme(text = element_text(size = 7))
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-16-1.png)

---
## Beyond GO

There are several in-built functions that allow you to test against curated gene sets like GO terms (enrichGO) or KEGG (enrichKEGG).

If we want to use our own gene sets or MSigDB we can provide them to the generic function *enricher()*.

First lets grab the curated gene sets (C2) from MSigDB using *msigdbr*. We only need the Gene Set Name and the Gene ID.

``` r
mm_H <- msigdbr(species = "Mus musculus", category = "H")[, c("gs_name", "entrez_gene")]
head(mm_H)
```

```
## # A tibble: 6 x 2
##   gs_name               entrez_gene
##   <chr>                 <chr>      
## 1 HALLMARK_ADIPOGENESIS 11303      
## 2 HALLMARK_ADIPOGENESIS 74610      
## 3 HALLMARK_ADIPOGENESIS 52538      
## 4 HALLMARK_ADIPOGENESIS 11363      
## 5 HALLMARK_ADIPOGENESIS 11364      
## 6 HALLMARK_ADIPOGENESIS 11409
```

---
## Beyond GO

We can then supply this data frame to *enricher()* as the TERM2GENE argument.

``` r
sig_gene_enr <- enricher(sig_genes, TERM2GENE = mm_H)
dotplot(sig_gene_enr)
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-18-1.png)

---
## What universe?

*enricher()* also helps control your universe. The term universe refers to the background genes i.e. all the genes in your experiment.

By default most of these tools will consider the background to be all genes in you gene sets. But a lot of these may not be in your experiment because they have been filtered or were not part of your annotation.

To avoid inflating the significance it is good to make sure the universe is just the genes in your experiment.

---
## What universe?

Lets update our universe to be the genes from our differential results. Remember this has been subset based on an NA value in the padj from DESeq2.

It is always important to think not just about your genes of interest, but also what the background is.

``` r
sig_gene_enr <- enricher(sig_genes, TERM2GENE = mm_H, universe = as.character(Activated_minus_Resting$ENTREZID))
dotplot(sig_gene_enr)
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-19-1.png)

---
## Going further

clusterProfiler has a bunch of additional options for controlling your enrichment test.

* pvalueCutoff - change the cutoff for what is significant. 
* pAdjustMethod - change multiple testing methods. 
* qvalueCutoff - change the cutoff for what is significant after multiple testing. 
* minGSSize/maxGSSize - exclude genes sets that are too big/large.

---
## Extra Tools

We have focused on clusterProfiler as our package of choice for running these enrihcment analyses. For most use cases it is sufficient, but there are some other tools with specialist functions:

* [go.seq](https://bioconductor.org/packages/release/bioc/html/goseq.html) - This incorporates gene length into the null model to help mitigate any bias
* [topGO](https://bioconductor.org/packages/release/bioc/html/topGO.html) - If you are interested in GO terms and are only getting big parent terms i.e. "Cellular Process", topGO has alternative tests which weight the results to smaller child GO terms.

---
class: inverse, center, middle

# GSEA

---

## GSEA

Another popular method for testing gene set enrichment with differential expression analysis results is the Broad's GSEA method.

GSEA tests whether our gene set is correlated with the ranking of genes by our differential expression analysis metric.

---
## How it works

GSEA walks down the ranked gene list, increasing a score when a gene is in the gene set, decreasing it when it’s not. The resulting distribution is then tested versus a null distribution using a modified KS-test. The results is a significance score, an enrichment score (based on the size of the peak), and leading edge genes (based on all genes up from the peak).

---
## clusterProfiler inputs

We can run GSEA easily with clusterProfiler. First we will need to produce a ranked and named vector of gene scores. Here will rank by **stat** column to give sensible measure of differential expression. We could also use **log2FoldChange** column if we have modified log2 fold changes using **lfsShrink()** function.

``` r
forRNK <- Activated_minus_Resting$stat
names(forRNK) <- Activated_minus_Resting$ENTREZID

forRNK <- forRNK[order(forRNK, decreasing = T)]

forRNK[1:6]
```

```
##    14939   110454    12772    16407    20198    54167 
## 59.24250 56.23897 47.31871 44.47817 43.67879 43.32388
```

---
## clusterProfiler inputs

**clusterProfiler** though which can just use a simple data frame. For this we will look at the MSigDb enrichment of C7 (immunological signature). clusterProfiler needs a 2 column data frame with the geneset names and gene IDs, just as we used for our over representation analysis we did earlier.

``` r
mm_c7 <- msigdbr(species = "Mus musculus", category = "C7")[, c("gs_name", "entrez_gene")]
head(mm_c7)
```

```
## # A tibble: 6 x 2
##   gs_name                                                            entrez_gene
##   <chr>                                                              <chr>      
## 1 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 497652     
## 2 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 104112     
## 3 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 11429      
## 4 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 11461      
## 5 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 11518      
## 6 ANDERSON_BLOOD_CN54GP140_ADJUVANTED_WITH_GLA_AF_AGE_18_45YO_1DY_DN 93736
```

---
## Running clusterProfiler
To run a GSEA there is a *GSEA()* function. We just provide our ranked list from differential gene expression analysis and the MSigDb Term2Gene dataframe.

``` r
sig_gene_enr <- GSEA(forRNK, TERM2GENE = mm_c7)
```

---
## Visualizing the result

We can easily visualize these GSEA results using the same methodology as GSAs to get dotplots and emaps.

``` r
clusterProfiler::dotplot(sig_gene_enr, showCategory = 6) + theme(axis.text.y = element_text(size = 7))
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-23-1.png)

---
## Visualizing the result

``` r
library(enrichplot)
sig_gene_enr <- pairwise_termsim(sig_gene_enr)
emapplot(sig_gene_enr, showCategory = 10, cex_label_category = 0.6) + theme(text = element_text(size = 7))
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-24-1.png)

---
## Visualizing the result

An additional visualization for GSEA is the Running Score plot. These are plotted for individual gene sets. here we are looking at the most signifcant group:

``` r
gseaplot(sig_gene_enr, geneSetID = 1, by = "runningScore", title = "GSE15330_HSC_VS_LYMPHOID_PRIMED_MULTIPOTENT_PROGENITOR_DN")
```

![](RU_RNAseq_p4_files/figure-html/unnamed-chunk-25-1.png)

---
## fgsea

[**fgsea**](https://bioconductor.org/packages/release/bioc/html/fgsea.html) is an alternative tool to run GSEA. If you are running a lot of these tests or want a greater level of customization of the testing parameters it could be a better option.

It doesn't have the same amount of plotting options, but it it very very fast and also gives lower level control.

---
## GSEA vs Fisher test

We have presented two options for testing for gene set enrichment.

Fisher:
* Great for curated gene sets like clusters or filtered DGE results

GSEA: 
* Great for subtle coordinated changes and avoids using arbitrary cutoffs

If you're unsure, try both and review the results.

---
## Time for an exercise

[Link_to_exercises](../../exercises/exercises/RNAseq_part4_exercise.html)

[Link_to_answers](../../exercises/answers/RNAseq_part4_answers.html)