Analysis of RNAseq data in R and Bioconductor (part 2)

class: middle, inverse, title-slide

.title[
# Analysis of RNAseq data in R and Bioconductor (part 2)
]
.subtitle[
## <html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html><br />
Bioinformatics Resource Center - Rockefeller University
]
.author[
### <a href="http://rockefelleruniversity.github.io/RU_RNAseq/" class="uri">http://rockefelleruniversity.github.io/RU_RNAseq/</a>
]
.author[
### <a href="mailto:brc@rockefeller.edu" class="email">brc@rockefeller.edu</a>
]

---

## What we will cover

* Session 1: Alignment and counting

* _Session 2: Differential gene expression analysis_

* Session 3: Visualizing results through PCA and clustering

* Session 4: Gene Set Analysis

* Session 5: Differential Transcript Utilization analysis

---
## The data

Over these sessions we will review data from Christina Leslie's lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR486LMB/) and activated T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR726DNP/).

---
## The data

Several intermediate files have already been made for you to use. You can download the course content from GitHub [here](https://github.com/RockefellerUniversity/RU_RNAseq/archive/master.zip). Once downloaded you will need to unzip the folder and then navigate into the *r_course* directory of the download. This will mean the paths in the code are correct.

``` r
setwd("Path/to/Download/RU_RNAseq-master/r_course")
```

---
## The data

All Gene and Exon counts can be found as .RData obects in **data** directory

- Counts in genes can be found at - **data/GeneCounts.Rdata**
- Counts in disjoint exons can be found at - **data/ExonCounts.Rdata**

- Salmon transcript quantification output directories can be found under - **data/Salmon/**

---
class: inverse, center, middle

# Counting with multiple RNAseq datasets

---

## Counts from SummarizeOverlaps

In our last session we counted our reads in genes using the summarize overlaps function to generate our **RangedSummarizedExperiment** object. We did this with a single BAM file.

To be able to do differntial gene expression analysis we need replicates. We can use similar approaches for handle multiple files. First I am using a **BamFileList** of all the T-Reg samples to easily organize my samples and save memory with the **yieldSize** parameter.

``` r
library(Rsamtools)
bamFilesToCount <- c("Sorted_Treg_1.bam", "Sorted_Treg_2.bam", "Sorted_Treg_act_1.bam",
    "Sorted_Treg_act_2.bam", "Sorted_Treg_act_3.bam")
names(bamFilesToCount) <- c("Sorted_Treg_1", "Sorted_Treg_2", "Sorted_Treg_act_1",
    "Sorted_Treg_act_2", "Sorted_Treg_act_3")
myBams <- BamFileList(bamFilesToCount, yieldSize = 10000)
```

---
## Counts from SummarizeOverlaps

We can count over this BamFileList in the same way we did for a single sample to produce our **RangedSummarizedExperiment** object of counts in genes across all samples.

``` r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicAlignments)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
geneCounts <- summarizeOverlaps(geneExons, myBams, ignore.strand = TRUE)
geneCounts
```

```
## class: RangedSummarizedExperiment 
## dim: 24100 5 
## metadata(0):
## assays(1): ''
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(0):
```

---
## Speeding up SummarizeOverlaps

Bioconductor has a simple, unified system for parallelization through the [**BiocParallel** package.]("https://bioconductor.org/packages/release/bioc/html/BiocParallel.html") The BiocParallel package allows for differing parallel approaches including multiple cores on the same machine and distributed computing across large cluster environments (i.e a HPC).

Simply by installing and loading library, we are ready for parallelization.

``` r
library(BiocParallel)
```

---
## Speeding up SummarizeOverlaps

We can control the parallelization in BiocParallel often choosing either a serial mode (no parallelization) with **SerialParam()** function or the number of cores to use with **MulticoreParam(workers=NUMBEROFCORES)**.

We use the **register()** function to set the desired parallelization. Once registered you can then start counting using a parallelization.

``` r
paramMulti <- MulticoreParam(workers = 2)
paramSerial <- SerialParam()
register(paramSerial)
```

---
## Counts from SummarizeOverlaps

The counts object has already been made for you to save time. It is in the *data* directory

The counrs object is a **RangedSummarizedExperiment**, which contains our counts as a matrix accessible by the **assay()** function.

Each row is named by its Entrez gene ID.

``` r
load("data/GeneCounts.RData")
```

``` r
assay(geneCounts)[1:2, ]
```

```
##        Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1 Sorted_T_reg_act_2
## 497097              0              1                  0                  0
## 19888               0              0                  0                  0
##        Sorted_T_reg_act_3
## 497097                  0
## 19888                  11
```

---
## Counts from SummarizeOverlaps

We can retrieve the GRangesList we counted by the **rowRanges()** accessor function.

Each GRangesList element contains exons used in counting, has Entrez gene ID and is in matching order as rows in our count table.

``` r
rowRanges(geneCounts)[1:2, ]
```

```
## GRangesList object of length 2:
## $`497097`
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames          ranges strand |   exon_id   exon_name
##          <Rle>       <IRanges>  <Rle> | <integer> <character>
##   [1]     chr1 3214482-3216968      - |      8008        <NA>
##   [2]     chr1 3421702-3421901      - |      8009        <NA>
##   [3]     chr1 3670552-3671498      - |      8012        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
## 
## $`19888`
## GRanges object with 6 ranges and 2 metadata columns:
##       seqnames          ranges strand |   exon_id   exon_name
##          <Rle>       <IRanges>  <Rle> | <integer> <character>
##   [1]     chr1 4290846-4293012      - |      8013        <NA>
##   [2]     chr1 4343507-4350091      - |      8014        <NA>
##   [3]     chr1 4351910-4352081      - |      8015        <NA>
##   [4]     chr1 4352202-4352837      - |      8016        <NA>
##   [5]     chr1 4360200-4360314      - |      8017        <NA>
##   [6]     chr1 4409170-4409241      - |      8018        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
```

---
class: inverse, center, middle

# Differential gene expression analysis

---

## Differential expression with RNAseq

To perform differential expression between our sample groups we will need to perform some essential steps.

- Transform and/or normalize our counts.

- Calculate significance of differences between groups from normalized/transformed counts.

---
## Log transform or as counts

Two broadly different approaches to RNAseq analysis could be.

- Log2 transform, normalize, use standard test for normal distribution to find differential expression.

- Normalize and use test suitable for count data to find differential expression.

---
## Normalization

As with microarrays, for the majority of normalization approaches we assume that the composition of RNA species across samples is similar and total RNA levels as the same.

To account for technical differences in library composition, many of these approached identify scaling factors for each sample which may be used to normalized observed/measured counts.

Simple measures such as reads per million (RPM) are scaled to total mapped reads. More sophisticated methods, such as used in the package **DESeq2** find scaling factors from the distribution of genes relative log expression versus their respective average across all samples.

.pull-left[
<div align="center">
<img src="imgs/beforeNorm.png" alt="offset" height="200" width="400">
</div>

]
.pull-right[
<div align="center">
<img src="imgs/afterNorm.png" alt="offset" height="200" width="400">
</div>

]

---
## Mean and variance relationship

Most approaches seek to model the relationship between mean expression and variance of expression in order to model and shrink variance of individual genes. This will assist us to detect differential expression with a small number of replicates.

![](imgs/meanVariance.png)

---
## Differential Expression Packages

Options for differential expression analysis in RNAseq data are available in R including DESeq2, EdgeR and Voom/Limma.

Both EdgeR and DESeq2 works with non-transformed count data to detect differential expression.

Voom transforms count data into log2 values with associated mean dependent weighting for use in Limma differential expression functions.

We will use the DESeq2 software to identify differential expression.

---
## DESeq2

To use DESeq2 we must first produce a dataframe of our Sample groups.

The DESeq2 package contains a workflow for assessing local changes in fragment/read abundance between replicated conditions. This workflow includes normalization, variance estimation, outlier removal/replacement as well as significance testing suited to high throughput sequencing data (i.e. integer counts).

To use DESeq2 we must first produce a dataframe of our Sample groups.

``` r
metaData <- data.frame(Group = c("Naive", "Naive", "Act", "Act", "Act"), row.names = colnames(geneCounts))
metaData
```

```
##                    Group
## Sorted_T_reg_1     Naive
## Sorted_T_reg_2     Naive
## Sorted_T_reg_act_1   Act
## Sorted_T_reg_act_2   Act
## Sorted_T_reg_act_3   Act
```

---
## Creating a DESeq2 object

We can use the **DESeqDataSetFromMatrix()** function to create a DESeq2 object.

We must provide our matrix of counts to countData parameter, our metadata data.frame to colData parameter and we include to an optional parameter of rowRanges the non-redundant peak set we can counted on.

Finally we provide the name of the column in our metadata data.frame within which we wish to test to the design parameter.

``` r
countMatrix <- assay(geneCounts)
countGRanges <- rowRanges(geneCounts)
dds <- DESeqDataSetFromMatrix(countMatrix, colData = metaData, design = ~Group, rowRanges = countGRanges)
dds
```

```
## class: DESeqDataSet 
## dim: 24100 5 
## metadata(1): version
## assays(1): counts
## rownames(24100): 497097 19888 ... 100040911 170942
## rowData names(0):
## colnames(5): Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1
##   Sorted_T_reg_act_2 Sorted_T_reg_act_3
## colData names(1): Group
```

---
## Creating a DESeq2 object

As an alternative input, we can first update our **RangedSummarizedExperiment** object to include our metadata using the **colData** accessor.

``` r
colData(geneCounts)$Group <- metaData$Group
geneCounts
```

---
## Creating a DESeq2 object

The most recent version of the DESeq2 tools have a bug at the moment caused by a dependency: SummarizedExperiment. It is a little complicated, but you may see this error when creating DESeq2 objects, because R doesn't fully understand what they are.

```
Error in validObject(.Object) : 
  invalid class “DESeqDataSet” object: superclass "ExpData" not defined in the environment of the object's class
```

There is a fix by running:

``` r
setClassUnion("ExpData", c("matrix", "SummarizedExperiment"))
```

---
## Creating a DESeq2 object

Now we can make use of the  **DESeqDataSet()** function to build directly from our **RangedSummarizedExperiment** object.

We simply have to specify the **design** parameter to specify metadata column to test on.

``` r
dds <- DESeqDataSet(geneCounts, design = ~Group)
```

```
## renaming the first element in assays to 'counts'
```

``` r
dds
```

---
## DESeq function

We can now run the DESeq2 workflow on our DESeq2 object using the **DESeq()** function. This function will normalize library sizes, estimate and shrink variance and test our data in a single step.

Our DESeq2 object is updated to include useful statistics such our normalized values and variance of signal within each gene.

``` r
dds <- DESeq(dds)
```

```
## estimating size factors
```

```
## estimating dispersions
```

```
## gene-wise dispersion estimates
```

```
## mean-dispersion relationship
```

```
## final dispersion estimates
```

```
## fitting model and testing
```

---
## Normalization

In its first step the  **DEseq()** function normalizes our data by evaluating the median expression of genes across all samples to produce a per sample normalization factor (library size).

We can retrieve normalized and unnormalized values from our DESeq2 object using the **counts()** function and specifying the **normalized** parameter as TRUE. Normalized counts are counts divided by the library scaling factor.

``` r
normCounts <- counts(dds, normalized = TRUE)
normCounts[1:2, ]
```

```
##        Sorted_T_reg_1 Sorted_T_reg_2 Sorted_T_reg_act_1 Sorted_T_reg_act_2
## 497097              0      0.9202987                  0                  0
## 19888               0      0.0000000                  0                  0
##        Sorted_T_reg_act_3
## 497097            0.00000
## 19888            10.12247
```

---
## Variance Estimation

In the second step the **DEseq()** function estimates variances and importantly shrinks variance depending on the mean. This shrinking of variance allows us to detect significant changes with low replicate number.

We can review the Variance/mean relationship and shrinkage using **plotDispEsts()** function and our **DESeq2** object. We can see the <span style="color:blue">**adjusted dispersions in blue**</span> and <span style="color:black">**original dispersion for genes in black**</span>.

``` r
plotDispEsts(dds)
```

![](RU_RNAseq_p2_files/figure-html/de6-1.png)

---
## DESeq results

Finally we can extract our contrast of interest using the **results()** function.

We must specify the **contrast** parameter, listing the metadata column of interest and the two groups to compare.

The resulting **DESeqResults** can be sorted by pvalue to list most significant changes at top of table.

``` r
myRes <- results(dds, contrast = c("Group", "Act", "Naive"))
myRes <- myRes[order(myRes$pvalue), ]
myRes[1:3, ]
```

```
## log2 fold change (MLE): Group Act vs Naive 
## Wald test p-value: Group Act vs Naive 
## DataFrame with 3 rows and 6 columns
##        baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##       <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## 54167  20595.52        2.22416 0.0513373   43.3244         0         0
## 20198   4896.29        3.41735 0.0782338   43.6812         0         0
## 20200   6444.62        3.19637 0.0748192   42.7212         0         0
```

---
## DESeqResults objects

The **DESeqResults** object contains important information from our differential expression testing.

- **baseMean** - Mean normalised values across all samples.
- **log2FoldChange** - Log2 of fold change between groups being compared.
- **pvalue** - Significance of change between groups.
- **padj** - Significance of change between groups corrected for multiple testing.

---
## DESeqResults objects

With our **DESeqResults** object we can get a very quick overview of differential expression changes using the **summary** function.

``` r
summary(myRes)
```

```
## 
## out of 19776 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 2819, 14%
## LFC < 0 (down)     : 2373, 12%
## outliers [1]       : 0, 0%
## low counts [2]     : 6443, 33%
## (mean count < 6)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
```

---
## DESeqResults objects

We can review the relationship between fold-changes and expression levels with a MA-Plot by using the **plotMA()** function and our DESeqResults object.

``` r
plotMA(myRes)
```

![](RU_RNAseq_p2_files/figure-html/dr1a-1.png)

---
## DESeqResults objects

We downweight genes with high fold change but low significance (due to low counts/high dispersion) by using the **lfcShrink()** function as we have the **results()** function.  This allows us to now use the log2FC as a measure of significance of change in our ranking for analysis and in programs such as **GSEA**.

``` r
myRes_lfc <- lfcShrink(dds, coef = "Group_Naive_vs_Act")
DESeq2::plotMA(myRes_lfc)
```

![](RU_RNAseq_p2_files/figure-html/dr1b-1.png)

---
## DESeqResults objects

We can also convert our DESeqResults objects into a standard data frame using the **as.data.frame** function.

``` r
myResAsDF <- as.data.frame(myRes)
myResAsDF[1:2, ]
```

```
##         baseMean log2FoldChange    lfcSE       stat    pvalue padj
## 497097 0.1840597      -1.353988 4.558285 -0.2970389 0.7664368   NA
## 19888  2.0244933       4.254714 3.557199  1.1960856 0.2316632   NA
```

---
class: inverse, center, middle

# Significance and Multiple Testing

---

## Multiple testing 
.pull-left[
In our analysis we are testing 1000s of genes and identifing significance of each gene separately at a set confidence interval.

If we are 95% confident a gene is differentially expressed when we look at all genes we can expect 5% to be false.

When looking across all genes we apply a multiple testing correction to account for this.
  ]
.pull-right[
<div align="center">
<img src="imgs/significant.png" alt="offset" height="600" width="400">
</div>

]

---

## Multiple testing

We could apply a multiple testing to our p-values ourselves using either the Bonferroni or Benjamini-Hockberg correction.

-  Bonferonni = (p-value of gene)*(Total genes tested)
-  Benjamini-Hockberg = ((p-value of gene)*(Total genes tested))/(rank of p-value of gene)

---

## Multiple testing

The total number of genes tested has a large effect on these multiple correction methods.

*Independent* filtering may be applied to reduce number of genes tested such as removing genes with low expression and/or variance.

**We can not simply filter to differentially expressed genes and reapply correction. This is not independent!**

---
## NA in padj

By default DEseq2 filters out low expressed genes to assist in multiple testing correction. This results in **NA** values in **padj** column for low expressed genes.

``` r
table(is.na(myResAsDF$padj))
```

```
## 
## FALSE  TRUE 
## 13333 10767
```

---
## NA in padj

We can add a column of our own adjusted p-values for all genes with a pvalue using the **p.adjust()** function.

``` r
myResAsDF$newPadj <- p.adjust(myResAsDF$pvalue)
myResAsDF[1:3, ]
```

```
##         baseMean log2FoldChange    lfcSE       stat    pvalue padj newPadj
## 497097 0.1840597      -1.353988 4.558285 -0.2970389 0.7664368   NA       1
## 19888  2.0244933       4.254714 3.557199  1.1960856 0.2316632   NA       1
## 20671  2.0122042      -3.633158 2.891907 -1.2563189 0.2090004   NA       1
```

---
## NA in padj

Typcially genes with **NA** padj values should be filtered from the table for later evaluation and functional testing.

``` r
myResAsDF <- myResAsDF[!is.na(myResAsDF$padj), ]
myResAsDF <- myResAsDF[order(myResAsDF$pvalue), ]
myResAsDF[1:3, ]
```

```
##        baseMean log2FoldChange      lfcSE     stat pvalue padj newPadj
## 54167 20595.525       2.224161 0.05133731 43.32445      0    0       0
## 20198  4896.288       3.417348 0.07823384 43.68120      0    0       0
## 20200  6444.617       3.196367 0.07481916 42.72124      0    0       0
```

---
class: inverse, center, middle

# DEseq2 and Salmon

---

## DESeq2 from Salmon

We saw at the end of last session how we can gain transcript quantification from FASTQ using Salmon.

Salmon offers a fast alternative to alignment and counting for transcript expression estimation so we would be keen to make use of this in our differential expression analysis.

To make use of Salmon transcript quantifications for differential gene expression changes we must however summarize our transcript expression estimates to gene expression estimates.

---
## tximport

The tximport package, written by same author as DESeq2, offers functions to import data from a wide range of transcript counting and quantification software and summarizes transcript expression to genes.

We can first load the Bioconductor [**tximport** package.](https://bioconductor.org/packages/release/bioc/html/tximport.html)

``` r
library(tximport)
```

---
## tximport

In order to summarize our transcripts to genes we will need to provide a data.frame of transcripts to their respective gene names.

First we read in a single Salmon quantification report to get a table containing all transcript names.

``` r
temp <- read.delim("data/Salmon/TReg_2_Quant/quant.sf")

temp[1:3, ]
```

```
##                   Name Length EffectiveLength     TPM NumReads
## 1 ENSMUST00000193812.1   1070         820.000 0.00000        0
## 2 ENSMUST00000082908.1    110           3.749 0.00000        0
## 3 ENSMUST00000192857.1    480         230.000 0.24391        2
```

---
## tximport

We can now use the **select()** function from AnnotationDbi to retrieve a transcript names (TXNAME) to gene ids (GENEID) map from our TxDb.Mmusculus.UCSC.mm10.knownGene object.

Remember that available columns and keys can be found using the **columns()** and **keytype** functions respectively.

``` r
Tx2Gene <- AnnotationDbi::select(TxDb.Mmusculus.UCSC.mm10.knownGene, keys = as.vector(temp[,
    1]), keytype = "TXNAME", columns = c("GENEID", "TXNAME"))
Tx2Gene <- Tx2Gene[!is.na(Tx2Gene$GENEID), ]
Tx2Gene[1:10, ]
```

```
##                   TXNAME GENEID
## 14  ENSMUST00000134384.7  18777
## 15 ENSMUST00000027036.10  18777
## 16  ENSMUST00000150971.7  18777
## 17  ENSMUST00000155020.1  18777
## 18  ENSMUST00000119612.8  18777
## 19  ENSMUST00000137887.7  18777
## 20  ENSMUST00000115529.7  18777
## 21  ENSMUST00000131119.1  18777
## 22  ENSMUST00000141278.1  18777
## 23 ENSMUST00000081551.13  21399
```

---
## tximport

Now we can use the **tximport** function to import and summarize our Salmon quantification files to gene level expression estimates.

We must provide the paths to Salmon *quant.sf* files, the type of file to import (here "salmon") to the **type** argument and our data.frame of transcript to gene mapping using the **tx2gene** argument.

``` r
salmonQ <- dir("data/Salmon/", recursive = T, pattern = "quant.sf", full.names = T)
salmonCounts <- tximport(salmonQ, type = "salmon", tx2gene = Tx2Gene)
```

```
## reading in files with read_tsv
```

```
## 1 2 3 4 5 
## transcripts missing from tx2gene: 38851
## summarizing abundance
## summarizing counts
## summarizing length
```

---
## tximport

The result from tximport is a list containing our summarized gene expression estimates

**Abundance** - Number of transcripts per million transcripts (TPM).  
**Counts** - Estimated counts.

``` r
salmonCounts$abundance[1:2, ]
```

```
##               [,1]     [,2]     [,3]     [,4]     [,5]
## 100009600 1.193915 1.664778 0.623252 1.117103 1.255239
## 100009609 0.079162 0.090154 0.062317 0.067593 0.041142
```

``` r
salmonCounts$counts[1:2, ]
```

```
##             [,1]    [,2]   [,3]   [,4]   [,5]
## 100009600 73.602 100.600 34.117 74.385 75.221
## 100009609 27.280  30.453 19.069 25.160 13.782
```

---
## DESeq2 from Tximport

We can now use the **DESeqDataSetFromTximport()** to build our DESeq2 object from salmon counts. We must specify the metadata to **colData** parameter and specify the column of interest in **design** parameter as we have done for **DESeqDataSetFromMatrix()**.

``` r
ddsSalmon <- DESeqDataSetFromTximport(salmonCounts, colData = metaData, design = ~Group)
```

```
## using counts and average transcript lengths from tximport
```

---
## DESeq2 from Tximport

We can then proceed as we did for previous analysis using summarizeOverlaps counts.

``` r
ddsSalmon <- DESeq(ddsSalmon)
myResS <- results(ddsSalmon, contrast = c("Group", "Act", "Naive"))
myResS <- myResS[order(myResS$pvalue), ]
myResS[1:3, ]
```

```
## log2 fold change (MLE): Group Act vs Naive 
## Wald test p-value: Group Act vs Naive 
## DataFrame with 3 rows and 6 columns
##         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
##        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
## 110454  20032.40        2.81696 0.0569895   49.4295         0         0
## 12772    4452.24        3.87109 0.1007898   38.4075         0         0
## 14939    6154.51        5.75391 0.1076499   53.4502         0         0
```

---
## Salmon and Rsubread

We can now compare the differential following Salmon quantification and Rsubread/summarizeOverlaps. Here we are showing the overlaps between genes that have a padj <0.05 and also a fold change greater then 2.

![](RU_RNAseq_p2_files/figure-html/dr1ss-1.png)

---
class: inverse, center, middle

# Adding Annotation

---

## Annotation of results table

By whatever method we have chosen to analyze differential expression we will want to add some sensible gene names or symbols to our data frame of results.

We can use the **org.db** packages to retrieve Gene Symbols for our Entrez IDs using the **select** function. Here we use the **org.db** package for mouse, **org.db**

``` r
library(org.Mm.eg.db)
eToSym <- AnnotationDbi::select(org.Mm.eg.db,
                 keys = rownames(myResAsDF),
                 keytype = "ENTREZID",
                 columns="SYMBOL")
eToSym[1:10,]
```

```
##    ENTREZID    SYMBOL
## 1     54167      Icos
## 2     20198    S100a4
## 3     20200    S100a6
## 4     12772      Ccr2
## 5     16407     Itgae
## 6     14939      Gzmb
## 7    110454      Ly6a
## 8     20715 Serpina3g
## 9     93692      Glrx
## 10    12766     Cxcr3
```

---
## Annotation of results table

Now we can merge the Entrez ID to Symbol table into our table of differential expression results.

``` r
annotatedRes <- merge(eToSym,myResAsDF,
                      by.x=1,
                      by.y=0,
                      all.x=FALSE,
                      all.y=TRUE)
annotatedRes <- annotatedRes[order(annotatedRes$pvalue),]
annotatedRes[1:3,]
```

```
##      ENTREZID SYMBOL  baseMean log2FoldChange      lfcSE     stat pvalue padj
## 1027   110454   Ly6a 17468.571       2.809022 0.04994700 56.24006      0    0
## 1690    12772   Ccr2  8529.015       4.011062 0.08476422 47.32022      0    0
## 2339    14939   Gzmb  6260.651       5.733767 0.09678177 59.24428      0    0
##      newPadj
## 1027       0
## 1690       0
## 2339       0
```

---
## Time for an exercise

[Link_to_exercises](../../exercises/exercises/RNAseq_part2_exercise.html)

[Link_to_answers](../../exercises/answers/RNAseq_part2_answers.html)