Analysis of RNAseq data in R and Bioconductor (part 3)

class: middle, inverse, title-slide

.title[
# Analysis of RNAseq data in R and Bioconductor (part 3)
]
.subtitle[
## <html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html><br />
Bioinformatics Resource Center - Rockefeller University
]
.author[
### <a href="http://rockefelleruniversity.github.io/RU_RNAseq/" class="uri">http://rockefelleruniversity.github.io/RU_RNAseq/</a>
]
.author[
### <a href="mailto:brc@rockefeller.edu" class="email">brc@rockefeller.edu</a>
]

---

## What we will cover

* Session 1: Alignment and counting

* Session 2: Differential gene expression analysis

* _Session 3: Visualizing results through PCA and clustering_

* Session 4: Gene Set Analysis

* Session 5: Differential Transcript Utilization analysis

---
## Vizualising high dimensional data

In todays session we will work with some of the RNAseq data of adult mouse tissues from Bing Ren's lab, Liver and Heart. 
- More information on liver data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CHB/)
- More information on heart data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/)
- More information on kidney data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/)

---
## The data

The Tissue RNAseq dataset have been pre-aligned and counted. The **RangedSummarizedExperiment** object with the gene counts has been saved as a .RData object in **data** directory:

**data/gC_TissueFull.RData**

---
class: inverse, center, middle

# RNAseq with multiple groups

---

## Vizualizing Biological Data

Biological data is often considered to have high dimensionality. Common techniques used to visualize genomics data include dimension reduction and/or clustering followed by the graphical representation of data as a heatmap.

These techniques makes interpretation of the data simpler to better identify patterns within our data i.e. reproducibility of replicates within groups and magnitude of changes in signal between groups.

---
## Multiple Groups

In this experiment we will be comparing three tissues. This represent a more complex experimental design than the two group comparison we have often used i.e. between activated and naive t-cells.

We still use some similar approaches, but there will also be some additional approaches that will help i.e. clustering and dimensional reduction techniques to interrogate this data.

---
## Loading in our data

The data has already been mapped and counted with *Rsubread* and *SummarizedExperiment* respectively. The *RangedSummarizedExperiment* loaded in here has the counts for all 3 tissues. The grouping metadata is the Tissue variable.

``` r
load("data/gC_TissueFull.RData")
geneCounts
```

```
## class: RangedSummarizedExperiment 
## dim: 14454 6 
## metadata(0):
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(0):
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(1): Tissue
```

---
## Creating a DESeq2 object

Don't forget that the most recent version of the DESeq2 tools has a bug at the moment caused by a dependency: SummarizedExperiment. It is a little complicated, but you may see this error when creating DESeq2 objects, because R doesn't fully understand what they are.

```
Error in validObject(.Object) : 
  invalid class “DESeqDataSet” object: superclass "ExpData" not defined in the environment of the object's class
```

There is a fix by running:

``` r
setClassUnion("ExpData", c("matrix", "SummarizedExperiment"))
```

---
## DEseq2 input

We can now set up our **DESeq2** object and run the **DESeq** workflow function

``` r
dds <- DESeqDataSet(geneCounts, design = ~Tissue)
dds <- DESeq(dds)
```

---
## Running DEseq2

To extract comparisons we can simply specify the tissues of interest to the results function.

``` r
LiverVskidney <- results(dds, c("Tissue", "Liver", "Kidney"))
heartVsLiver <- results(dds, c("Tissue", "Heart", "Liver"))
heartVskidney <- results(dds, c("Tissue", "Heart", "Kidney"))
heartVskidney
```

```
## log2 fold change (MLE): Tissue Heart vs Kidney 
## Wald test p-value: Tissue Heart vs Kidney 
## DataFrame with 14454 rows and 6 columns
##         baseMean log2FoldChange     lfcSE      stat      pvalue        padj
##        <numeric>      <numeric> <numeric> <numeric>   <numeric>   <numeric>
## 20671    64.5915       1.675554  0.400136   4.18746 2.82090e-05 1.11084e-04
## 27395   354.2077       1.069639  0.276556   3.86771 1.09862e-04 3.93252e-04
## 18777   517.6560      -1.976956  0.342107  -5.77876 7.52525e-09 4.66423e-08
## 21399   107.3182      -0.373288  0.315755  -1.18221 2.37124e-01 3.56834e-01
## 108664  370.2050      -2.062351  0.368121  -5.60237 2.11435e-08 1.24993e-07
## ...          ...            ...       ...       ...         ...         ...
## 20592   150.6694      0.0434984  0.283835  0.153253  0.87819903    0.919818
## 26908   264.7052      0.2109728  0.287845  0.732938  0.46359624    0.592105
## 22290   124.0370      0.0786872  0.459615  0.171203  0.86406453    0.909363
## 26900   375.4670      0.8709573  0.330544  2.634922  0.00841566    0.020596
## 170942   10.3089     -1.3065830  0.873286 -1.496168  0.13460985    0.226951
```

---
## Comparing multiple results

To identify genes specifically upregulated in Heart versus other tissues we just need to do a simple data merge.

First we convert **DESeqResults** objects into a data.frame, remove NAs and extract most interesting columns to us.

``` r
heartVsLiverDF <- as.data.frame(heartVsLiver)
heartVskidneyDF <- as.data.frame(heartVskidney)

heartVsLiverDF <- heartVsLiverDF[!is.na(heartVsLiverDF$padj), ]
heartVskidneyDF <- heartVskidneyDF[!is.na(heartVskidneyDF$padj), ]

heartVsLiverDF <- heartVsLiverDF[, c("log2FoldChange", "padj")]
heartVskidneyDF <- heartVskidneyDF[, c("log2FoldChange", "padj")]
```

---
## Comparing multiple results

We can then update the column names and merge our data.frame to have a single table of most useful information. The *by=0* means that it will use the rownames (which contain the Gene IDs) as the common feature.

``` r
colnames(heartVskidneyDF) <- paste0("HeartVsKidney", "_", colnames(heartVskidneyDF))
colnames(heartVsLiverDF) <- paste0("HeartVsLiver", "_", colnames(heartVsLiverDF))
fullTable <- merge(heartVsLiverDF, heartVskidneyDF, by = 0)
fullTable[1:2, ]
```

```
##   Row.names HeartVsLiver_log2FoldChange HeartVsLiver_padj
## 1    100017                  -0.6209016         0.1688893
## 2    100019                  -0.6718217         0.1137017
##   HeartVsKidney_log2FoldChange HeartVsKidney_padj
## 1                    0.1789343          0.7431574
## 2                   -0.1426205          0.7860500
```

---
## Comparing multiple results

Now we can extract our genes are significantly upregulated in Heart in both conditions.

``` r
upInHeart <- fullTable$HeartVsLiver_log2FoldChange > 0 & fullTable$HeartVsKidney_log2FoldChange >
    0 & fullTable$HeartVsLiver_padj < 0.05 & fullTable$HeartVsKidney_padj < 0.05
upInHeartTable <- fullTable[upInHeart, ]
upInHeartTable[1:2, ]
```

```
##    Row.names HeartVsLiver_log2FoldChange HeartVsLiver_padj
## 12 100038347                    4.927612      2.564094e-49
## 24 100038575                    2.434627      1.389225e-02
##    HeartVsKidney_log2FoldChange HeartVsKidney_padj
## 12                     4.362883       5.756669e-51
## 24                     3.184769       3.529865e-04
```

---
## Do genes overlap?

We can also make a logical data.frame of whether a gene was upregulated in Heart for both Liver and Kidney comparisons.

``` r
forVenn <- data.frame(UpvsLiver = fullTable$HeartVsLiver_log2FoldChange > 0 & fullTable$HeartVsLiver_padj <
    0.05, UpvsKidney = fullTable$HeartVsKidney_log2FoldChange > 0 & fullTable$HeartVsKidney_padj <
    0.05)
forVenn[1:3, ]
```

```
##   UpvsLiver UpvsKidney
## 1     FALSE      FALSE
## 2     FALSE      FALSE
## 3     FALSE       TRUE
```

---
## Do genes overlap?

We can use the **eulerr** package to make a quick plot to show the overlap between our groups. These plots are nice as the area is proportional to the size of the groups.

For comparisons greater than 2, we would recommend switching to a [Upset Plot](https://r-graph-gallery.com/upset-plot.html).

``` r
library(eulerr)
fit <- euler(forVenn)
plot(fit)
```

![](RU_RNAseq_p3_files/figure-html/euler-1.png)

---
## A word of caution

Doing this kind of overlap anlysis can be super useful and illuminating for your research. But you need to be careful.

At this point you may feel happy concluding that there are several genes specific to either Liver/Kidney/Heart.

Be careful with statements about the biology if there could also be a technical component i.e. *"There are 2201 Heart specific genes."*

Often what is overlooked is the variance in each group. If one group is more variable then others, then any test involving that group will have reduced power

---
## All done?

A variety of visualizations will help you understand the data better. This includes MA and volcano plots which you have seen already, but also some additional visualizations and transformations like PCA and heatmaps. We will dig into those shortly.

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-8-1.png)

---
## All done?

What we can see is that there is more variance in the Heart group.

Lets follow a hypothetical:

* If we look at Heart vs Kidney and Liver vs Kidney it is likely that Heart vs Kidney will have less ability to deduce biologically significant genes due to technical differences.

* As a result we could conclude that there are some Liver vs Kidney specific genes relative to Heart vs Kidney, but actually it is a technical bias.

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-9-1.png)

---
class: inverse, center, middle

# Visualizing Counts

---

## Visualizing RNAseq data

One of the first steps of working with count data for visualization is commonly to transform the integer count data to log2 scale. To do this we will need to add some artificial value (pseudocount) to zeros in our counts prior to log transform (since the log2 of zero is infinite).

The DEseq2 **normTransform()** will add a 1 to our normalized counts prior to log2 transform and return a **DESeqTransform** object.

``` r
normLog2Counts <- normTransform(dds)
normLog2Counts
```

```
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor
```

---
## Visualizing RNAseq data

We can extract our normalized and transformed counts from the **DESeqTransform** object using the **assay()** function.

``` r
matrixOfNorm <- assay(normLog2Counts)
boxplot(matrixOfNorm, las = 2, names = c("Heart_1", "Heart_2", "Kidney_1", "Kidney_2",
    "Liver_1", "Liver_2"))
```

![](RU_RNAseq_p3_files/figure-html/gnTM-1.png)

---
## Visualizing RNAseq data

When visualizing our signal however we now will have a similar problem to differential analysis: smaller counts having higher variance. This may cause changes in smaller counts to have undue influence in visualization and clustering.

``` r
library(vsn)
vsn::meanSdPlot(matrixOfNorm)
```

![](RU_RNAseq_p3_files/figure-html/gnTMP-1.png)

---
## Visualizing RNAseq data

We can apply an **rlog** transformation to our data using the **rlog()** function which will attempt to shrink the variance for genes based on their mean expression.

``` r
rlogTissue <- rlog(dds)
rlogTissue
```

```
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(27): baseMean baseVar ... dispFit rlogIntercept
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor
```

---
## Visualizing RNAseq data

Again we can extract the matrix of transformed counts with the **assay()** function and plot the mean/variance relationship. If we look at the axis we can see the shrinkage of variance for low count genes.

*Check the y-axis limits*

.pull-left[

``` r
rlogMatrix <- assay(rlogTissue)
vsn::meanSdPlot(rlogMatrix)
```

![](RU_RNAseq_p3_files/figure-html/gRLM-1.png)
  ]
  
.pull-right[

``` r
library(vsn)
vsn::meanSdPlot(matrixOfNorm)
```

![](RU_RNAseq_p3_files/figure-html/gnTMP2-1.png)
  ]

---
## Other good visualizations

The most important check and visualization for any kind of genomic data is typically looking at the tracks. This is based on creating bigwig files and using a genome browser like [IGV](https://rockefelleruniversity.github.io/IGV_course/) to visualize them.

Often I will check some housekeeping genes and my favorite significant genes this way.

We won't be going through making these files, but you can create them from your bam files in R by following this guide [here](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicScores_In_Bioconductor.html).

---
## Other good visualizations

Volcano plots are popular lots as they compare significance to fold change. Quickly you can assess the amplitude of changes in your pairwise comparisons.

There isn't a Volcano plot option in DESeq2, but it is easy to make yourself with ggplot2.

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-11-1.png)

---
## Other good visualizations

``` r
ggplot(as.data.frame(LiverVskidney), aes(x = log2FoldChange, y = -log10(padj))) +
    geom_point() + theme_minimal() + ggtitle("LiverVskidney")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-12-1.png)

---
## Other good visualizations

*EnhancedVolcano* is a package designed specifically for making Volcanoes. You can see a lot of additional information right away. There is a lot of customization.

You can just check *?EnhancedVolcano* to review customization options.

``` r
library(EnhancedVolcano)

EnhancedVolcano(LiverVskidney, lab = rownames(LiverVskidney), x = "log2FoldChange",
    y = "padj")
```

---
## Other good visualizations

``` r
EnhancedVolcano(LiverVskidney, lab = rownames(LiverVskidney), x = "log2FoldChange",
    y = "padj")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-15-1.png)

---
class: inverse, center, middle

# Dimension reduction

---

## Dimension reduction

Since we have often have measured 1000s of genes over multiple samples/groups we will often try and simplify this too a few dimensions or meta/eigen genes which represent major patterns of signal across samples found.

We hope the strongest patterns or sources of variation in our data are correlated with sample groups and so dimension reduction offers a methods to method to visually identify reproducibility of samples.

Common methods of dimension reduction include Principal Component Analysis, MultiFactorial Scaling and Non-negative Matrix Factorization.

---
## PCA

We can see PCA in action with our data simply by using the DESeq2's **plotPCA()** function and our **DESeqTransform** object from our rlog transformation.

We must also provide a metadata column to colour samples by to the **intgroup** parameter and we set the **ntop** parameter to use all genes in PCA (by default it is top 500).

``` r
plotPCA(rlogTissue, intgroup = "Tissue", ntop = nrow(rlogTissue))
```

![](RU_RNAseq_p3_files/figure-html/gPCA-1.png)

---
## PCA

This PCA show the separation of samples by their group and the localization of samples with groups.

Since PC1 here explains 51% of total variances between samples and PC2 explains 44%, the reduction of dimensions can be seen to explain much of the changes among samples in 2 dimensions.

Of further note is the separation of samples across PC1 but the lack of separation of Heart and Kidney samples along PC2.

![](RU_RNAseq_p3_files/figure-html/gPCA3-1.png)

---
## PCA

PCA is often used to simply visualize sample similarity, but we can extract further information of the patterns of expression corresponding PCs by performing the PCA analysis ourselves instead of using the easy DESeq2 function.

We can use the **prcomp()** function with a transposition of our matrix to perform our prinicipal component analysis. The mappings of samples to PCs can be found in the **x** slot of the **prcomp** object.

``` r
pcRes <- prcomp(t(rlogMatrix))
class(pcRes)
```

```
## [1] "prcomp"
```

``` r
pcRes$x[1:2, ]
```

```
##                      PC1      PC2        PC3       PC4        PC5          PC6
## Sorted_Heart_1 -103.8999 52.45800 -0.2401797  25.22797  1.0609927 4.039824e-14
## Sorted_Heart_2 -117.8517 46.07786  0.4445612 -23.96120 -0.7919848 3.580122e-14
```

---
## PCA

We can now reproduce the previous plot from DEseq2 in basic graphics from this.

``` r
plot(pcRes$x, col = colData(rlogTissue)$Tissue, pch = 20, cex = 2)
legend("top", legend = c("Heart", "Kidney", "Liver"), fill = unique(colData(rlogTissue)$Tissue))
```

![](RU_RNAseq_p3_files/figure-html/gPRcosmp-1.png)

---
## PCA

Now we have constructed the PCA ourselves we can investigate which genes' expression profiles influence the relative PCs.

The influence (rotation/loadings) for all genes to each PC can be found in the **rotation** slot of the **prcomp** object.

``` r
pcRes$rotation[1:5, 1:4]
```

```
##                 PC1          PC2          PC3           PC4
## 20671  -0.005479648  0.003577376 -0.006875591  0.0048625659
## 27395  -0.002020427  0.003325506 -0.002302364 -0.0045132749
## 18777   0.004615068 -0.005413345  0.008975098 -0.0028857868
## 21399   0.005568549  0.002485067  0.002615072 -0.0001255119
## 108664  0.005729029 -0.004912143  0.009991580 -0.0031039532
```

---
## PCA

To investigate the separation of Kidney samples along the negative axis of PC2 I can then look at which genes most negatively contribute to PC2.

Here we order by the most negative values for PC2 and select the top 100.

``` r
PC2markers <- sort(pcRes$rotation[, 2], decreasing = FALSE)[1:100]
PC2markers[1:10]
```

```
##       20505       22242       16483       56727       77337       57394 
## -0.06497365 -0.06001728 -0.05896360 -0.05896220 -0.05595648 -0.05505202 
##       18399       20495       22598       20730 
## -0.05307506 -0.05256188 -0.05205661 -0.05175604
```

---
## PCA

To investigate the gene expression profiles associated with PC2 we can now plot the log2 foldchanges (or directional statistics) from our pairwise comparisons for our PC2 most influencial genes.

``` r
PC2_hVsl <- heartVsLiver$stat[rownames(heartVsLiver) %in% names(PC2markers)]
PC2_hVsk <- heartVskidney$stat[rownames(heartVskidney) %in% names(PC2markers)]
PC2_LVsk <- LiverVskidney$stat[rownames(LiverVskidney) %in% names(PC2markers)]
```

---
## PCA

From the boxplot it is clear to see that the top100 genes are all specifically upregulated in Kidney tissue.

``` r
boxplot(PC2_hVsl, PC2_hVsk, PC2_LVsk, names = c("Heart/Liver", "Heart/Kidney", "Liver/Kidney"),
    ylab = "log2FC")
```

![](RU_RNAseq_p3_files/figure-html/gPRcompRotf-1.png)

---
class: inverse, center, middle

# Sample-to-Sample correlation

---

## Sample-to-Sample correlation

Another common step in quality control is to assess the correlation between expression profiles of samples.

We can assess correlation between all samples in a matrix by using the **cor()** function.

``` r
sampleCor <- cor(rlogMatrix)
sampleCor
```

```
##                 Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2
## Sorted_Heart_1       1.0000000      0.9835321       0.7144527       0.7189597
## Sorted_Heart_2       0.9835321      1.0000000       0.7190675       0.7229253
## Sorted_Kidney_1      0.7144527      0.7190675       1.0000000       0.9929131
## Sorted_Kidney_2      0.7189597      0.7229253       0.9929131       1.0000000
## Sorted_Liver_1       0.7156978      0.6883444       0.7344165       0.7336117
## Sorted_Liver_2       0.7186525      0.6918428       0.7366287       0.7396193
##                 Sorted_Liver_1 Sorted_Liver_2
## Sorted_Heart_1       0.7156978      0.7186525
## Sorted_Heart_2       0.6883444      0.6918428
## Sorted_Kidney_1      0.7344165      0.7366287
## Sorted_Kidney_2      0.7336117      0.7396193
## Sorted_Liver_1       1.0000000      0.9714750
## Sorted_Liver_2       0.9714750      1.0000000
```

---
## Sample-to-Sample correlation

We can visualize the the correlation matrix using a heatmap following sample clustering.

First, we need to convert our correlation matrix into a distance measure to be use in clustering by subtracting from 1 to give dissimilarity measure and converting with the **as.dist()** to a **dist** object.

We then create a matrix of distance values to plot in the heatmap from this using **as.matrix()** function.

``` r
sampleDists <- as.dist(1 - cor(rlogMatrix))
sampleDistMatrix <- as.matrix(sampleDists)
```

---
## Sample-to-Sample correlation

We can use the **pheatmap** library's pheatmap function to cluster our data by similarity and produce our heatmaps. We provide our matrix of sample distances as well as our **dist** object to the **clustering_distance_rows** and **clustering_distance_cols** function.

By default hierarchical clustering will group samples based on their gene expression similarity into a dendrogram with between sample similarity illustrated by branch length.

``` r
library(pheatmap)
pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists)
```

![](RU_RNAseq_p3_files/figure-html/gSampleDistc-1.png)

---
## Sample-to-Sample correlation

We can use the **brewer.pal** and **colorRampPalette()** function to create a white to blue scale. We cover this in more depth for using with [**ggplot** scales](https://rockefelleruniversity.github.io/Plotting_In_R/r_course/presentations/slides/ggplot2.html#60).

``` r
library(RColorBrewer)
blueColours <- brewer.pal(9, "Blues")
colors <- colorRampPalette(rev(blueColours))(255)
plot(1:255, rep(1, 255), col = colors, pch = 20, cex = 20, ann = FALSE, yaxt = "n")
```

![](RU_RNAseq_p3_files/figure-html/gSampleDistca-1.png)

---
## Sample-to-Sample correlation

We can provide a slightly nicer scale for our distance measure heatmap to the **color** parameter in the **pheatmap** function.

``` r
pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists,
    color = colors)
```

![](RU_RNAseq_p3_files/figure-html/gSampleDistd-1.png)

---
## Sample-to-Sample correlation

Finally we can add some column annotation to highlight group membership. We must provide annotation as a data.frame of metadata we wish to include with rownames matching column names.

Fortunately that is exactly as we have set up from DEseq2. We can extract metadata from the DESeq2 object with **colData()** function and provide it to the **annotation_col** parameter.

``` r
annoCol <- as.data.frame(colData(dds))
pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists,
    color = colors, annotation_col = annoCol)
```

![](RU_RNAseq_p3_files/figure-html/gSampleDiste-1.png)

---
class: inverse, center, middle

# Batch issues

---

## Batch Issues

When we look at our PCA and Sample-to-Sample correlation we may start to see patterns in the data that could indicate some QC issues. Thus far we have been looking at pretty clean datasets.

This is not always the case. Sometimes technical artifacts are the dominant trend in the data:

* Sequencing/Experiment date
* Donor
* RNA quality
* etc..

---
## Our data

This dataset is from a recent paper out of the Fuchs lab and has a clear problem with batch - a large part of the variance in the data is associated simply with the replicate. This can be salvageable though (and they handled this appropriately during data analysis in the [paper](https://pubmed.ncbi.nlm.nih.gov/36450983/)).

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-28-1.png)

---
## Our data

Here we can see that the experimental variables of *Cell Type* and *Treatment* have trends across our PCA, but it is not crystal clear. Some of the experimental variance is being clouded by technical batch issues.

.pull-left[

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-29-1.png)

]

.pull-right[

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-30-1.png)

]

---
## Our data

The data is accession [GSE190411](https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE190411) and a counts table has been uploaded which we can use for our analysis: *GSE190411_Yuan2021_PAP_SCC_RNAseq_counts.txt.gz*

You can also use a downloaded version of this file in our *data* directory.

**NOTE: The data is already been normalized. DESeq2 needs integers. We can round all our data up to integers with the ceiling() function**

``` r
library(ggplot2)

my_counts <- read.delim("data/GSE190411_Yuan2021_PAP_SCC_RNAseq_counts.txt")
rownames(my_counts) <- make.unique(my_counts$gene)
my_counts <- data.matrix(my_counts[, -1])
my_counts <- ceiling(my_counts)
head(my_counts)
```

```
##               PAPmneg1 PAPmneg2 PAPmpos1 PAPmpos2 SCCmneg1 SCCmneg2 SCCmpos1
## 0610005C13Rik        0        0        0        0        2        0        0
## 0610006L08Rik        0        0        0        0        0        0        0
## 0610009B22Rik      427      405      767      424      461      586      500
## 0610009E02Rik       17        2      176       16      260       84        8
## 0610009L18Rik       11        4        6        4       27        6       30
## 0610009O20Rik      615      754      517      694      715      825      874
##               SCCmpos2
## 0610005C13Rik        1
## 0610006L08Rik        0
## 0610009B22Rik      606
## 0610009E02Rik        6
## 0610009L18Rik        5
## 0610009O20Rik      980
```

---
## Meta Data

Let's now construct a data frame containing the meta data for this data.

``` r
CellLine <- factor(c("PAP", "PAP", "PAP", "PAP", "SCC", "SCC", "SCC", "SCC"))
Treatment <- factor(c("Neg", "Neg", "Pos", "Pos", "Neg", "Neg", "Pos", "Pos"))
Rep <- factor(c("R1", "R2", "R1", "R2", "R1", "R2", "R1", "R2"))

coldata <- data.frame(CellLine = CellLine, Treatment = Treatment, Batch = Rep, row.names = colnames(my_counts))
head(coldata)
```

```
##          CellLine Treatment Batch
## PAPmneg1      PAP       Neg    R1
## PAPmneg2      PAP       Neg    R2
## PAPmpos1      PAP       Pos    R1
## PAPmpos2      PAP       Pos    R2
## SCCmneg1      SCC       Neg    R1
## SCCmneg2      SCC       Neg    R2
```

---
## Build our DEseq2 object

Let's now combine our matrix and metadata into a DESeq2 object. We can then recreate the PCA with the batch issues.

``` r
dds <- DESeqDataSetFromMatrix(countData = my_counts, colData = coldata, design = ~Treatment)
myrlog <- rlog(dds)
```

``` r
plotPCA(myrlog, intgroup = "Batch") + ggtitle("Batch Effects Across Replicates")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-36-1.png)

---
## Batch Correction Tools

Now we are ready to run batch correction. There are multiple tools that are popular for batch correction:

* [limma](https://www.bioconductor.org/packages/release/bioc/html/limma.html)
* [ComBat](https://bioconductor.org/packages/release/bioc/html/sva.html)
* [RUVSeq](https://bioconductor.org/packages/release/bioc/html/RUVSeq.html)

They all use slightly different inputs (normalized vs raw counts), and have different nuances in use case i.e. is it clear where the batch is coming from?

We will focus on limma as it is simplest and very effective for most data sets. Using these tools we can correct our count matrix.

---
## Limma

To correct our count matrix we can use the *removeBatchEffect()* function. We provide normalized counts and also batch information. In this case we provide categorical information about what replicate the sample belongs to. But in other cases you could provide numerical information like QC metrics.

``` r
library(limma)

corrected_matrix <- removeBatchEffect(assay(myrlog), batch = coldata$Batch)
```

---
## Limma

We can now assign our corrected matrix back into our rlog object. This means we will be able to use the deseq2 *plotPCA()* function. We will do this in a duplicate of the rlog so we can compare the results.

``` r
my_corrected_rlog <- myrlog
assay(my_corrected_rlog) <- corrected_matrix
```

.pull-left[

``` r
plotPCA(my_corrected_rlog, intgroup = "Batch") + ggtitle("Batch Effects Across Replicates - Corrected")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-39-1.png)

]

.pull-right[

``` r
plotPCA(myrlog, intgroup = "Batch") + ggtitle("Batch Effects Across Replicates - Uncorrected")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-40-1.png)

]

---
## Limma

We can see much better seperation of the PCA between Cell Lines.

.pull-left[

``` r
plotPCA(my_corrected_rlog, intgroup = "CellLine") + ggtitle("Batch Effects Across Replicates - Corrected")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-41-1.png)

]

.pull-right[

``` r
plotPCA(myrlog, intgroup = "CellLine") + ggtitle("Batch Effects Across Replicates - Uncorrected")
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-42-1.png)

]

---
## Limma

Treatment is maybe mildly improved (seperate better on PC1 the major source of variation). Though maybe there are avenues for further improvement i.e. other include other confounders or use other correction tools.

.pull-left[

``` r
plotPCA(my_corrected_rlog, intgroup = "Treatment") + ggtitle("Batch Effects Across Replicates - Corrected")
```

```
## using ntop=500 top features by variance
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-43-1.png)

]

.pull-right[

``` r
plotPCA(myrlog, intgroup = "Treatment") + ggtitle("Batch Effects Across Replicates - Uncorrected")
```

```
## using ntop=500 top features by variance
```

![](RU_RNAseq_p3_files/figure-html/unnamed-chunk-44-1.png)

]

---
## DESeq2 and batch

We have successfully corrected our counts. This is useful for visualizations of the data like PCA. Intuitively it would follow that you would then just run DESeq2. But for running differentials there are better ways to handle batch issues. We can do this by providing a slightly more complex design model to DEseq2.

For standard DESeq2 we simply provide the group information that we want to run differentials across i.e. *~Treatment*

When there is additional batch we incorporate our confounder information into our model i.e. *~Batch+Treatment*

---
## DESeq2 and batch

.pull-left[
**Standard**

``` r
dds <- DESeqDataSetFromMatrix(countData = my_counts, colData = coldata, design = ~Treatment)
dds <- DESeq(dds)
```

]
  
.pull-right[  
**Corrected**

``` r
dds_corrected <- DESeqDataSetFromMatrix(countData = my_counts, colData = coldata,
    design = ~Batch + Treatment)
dds_corrected <- DESeq(dds_corrected)
```

]
  
It also worth noting that a similar approach can be used to ask much more complex questions about your data i.e. testing for additive affects of multiple groups. There is some good material on using linear models in R [here](https://rafalab.dfci.harvard.edu/pages/harvardx.html).

---
## DESeq2 and batch

We can then run our differentials. Typically you are not expecting massive changes, but we can clearly see there are more significant genes and less outliers. This suggest that despite the more complex model, we have increased our ability to find statistical significance.

.pull-left[
**Standard**

``` r
my_res <- results(dds, contrast = c("Treatment", "Pos", "Neg"))
summary(my_res)
```

```
## 
## out of 31327 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 22, 0.07%
## LFC < 0 (down)     : 82, 0.26%
## outliers [1]       : 214, 0.68%
## low counts [2]     : 9382, 30%
## (mean count < 1)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
```

]
  
.pull-right[  
**Corrected**

``` r
my_res_corrected <- results(dds_corrected, contrast = c("Treatment", "Pos", "Neg"))
summary(my_res_corrected)
```

```
## 
## out of 31327 with nonzero total read count
## adjusted p-value < 0.1
## LFC > 0 (up)       : 130, 0.41%
## LFC < 0 (down)     : 238, 0.76%
## outliers [1]       : 0, 0%
## low counts [2]     : 12900, 41%
## (mean count < 4)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
```

]

---
## Findng the batch

We have guided you through our example, but when you are working on your own data it can be hard to figure out what is driving issues in your data.

We can use *plotPCA()* and *prcomp()* to get into our principle components and label metadata features. But sometimes it can be more intuitive to work interactively as opposed to making hundreds of PCA plots with different colors.

The pcaEpxlorer package can help with this. It is a shiny app designed for helping understand the trends in your data by allowing you to quickly and easily work with your datasets.

---
## pcaExplorer

To use *pcaExplorer* we will want a dds object and the transformed dataset (i.e. the rlogs).

``` r
library(pcaExplorer)

pcaExplorer(dds = dds, dst = myrlog)
```

---
class: inverse, center, middle

# Clustering Analysis

---

## Clustering genes and samples

We can use the same methods of clustering samples to cluster genes with similar expression patterns together. Clustering genes will allow us to identify the major patterns of gene expressions within our data and to group genes with similar expression profiles for review and functional analysis.

---
## Clustering Inputs

Minimizing the number of genes considered for clustering helps speed things up. To reduce the dataset we can subset to genes that are highly variable using something akin to an ANOVA test.

With DESeq2 we can identify genes significantly changing across groups by comparing our models with and without our groups of interest using the **results** function.

- Tissue model

```
~ Tissue
```

- No groups model

```
~ 1
```

---
## Our Data

Lets use our kidney, heart and liver data  from Bing Ren's lab again.

``` r
load("data/gC_TissueFull.RData")
dds <- DESeqDataSet(geneCounts, design = ~Tissue)
```

```
## renaming the first element in assays to 'counts'
```

```
## Warning in DESeqDataSet(geneCounts, design = ~Tissue): some variables in design
## formula are characters, converting to factors
```

---
## Likelihood-ratio test

To run a LRT we must set the parameter of **reduced** to our alternative model of no groups. We also set the **test** parameter to **LRT** to allow us to compare the models.

``` r
dds2 <- DESeq(dds, test = "LRT", reduced = ~1)
acrossGroups <- results(dds2)
acrossGroups <- acrossGroups[order(acrossGroups$pvalue), ]
acrossGroups[1:3, ]
```

```
## log2 fold change (MLE): Tissue Liver vs Heart 
## LRT p-value: '~ Tissue' vs '~ 1' 
## DataFrame with 3 rows and 6 columns
##        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
##       <numeric>      <numeric> <numeric> <numeric>    <numeric>    <numeric>
## 16483  124208.2       -2.27414  1.584829   1543.84  0.00000e+00  0.00000e+00
## 17888   43684.0      -14.85201  0.654258   1524.73  0.00000e+00  0.00000e+00
## 21956   16543.9      -15.03634  1.059070   1389.39 1.98399e-302 9.55889e-299
```

---
## Likelihood-ratio test

We can use the **plotCounts()** to get review expression profile of a gene, one at a time. We define the gene of interest to **gene** parameter and **intgroup** to specify metadata column to group counts by.

``` r
plotCounts(dds2, gene = "17888", intgroup = "Tissue")
```

![](RU_RNAseq_p3_files/figure-html/tissue12-1.png)

---
## Filter with LRT results

Clustering is typically done on the transformed normalized counts. To make it easier to find interesting clusters we can subset our rlog transformed gene expression matrix to those genes significant in our LRT test. This filters the ~45% of genes that are not changing across our experiment.

``` r
sigChanges <- rownames(acrossGroups)[acrossGroups$padj < 0.01 & !is.na(acrossGroups$padj)]
sigMat <- rlogMatrix[rownames(rlogMatrix) %in% sigChanges, ]
nrow(rlogMatrix)
```

```
## [1] 14454
```

``` r
nrow(sigMat)
```

```
## [1] 8094
```