Viz_part_3.utf8

class: center, middle, inverse, title-slide

# Visualizing Genomics Data (part 3)
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### <a href="http://rockefelleruniversity.github.io/RU_VisualizingGenomicsData/" class="uri">http://rockefelleruniversity.github.io/RU_VisualizingGenomicsData/</a>

---

##The Course

* Dimension reduction
* PCA and loadings
* Clustering
* Grouping by kmeans
* Testing gene clusters

---
##Reminder of file types

In this session we will be dealing with a range of data types. For more information on file types you can revisit our material. 
 
* [File Formats](https://rockefelleruniversity.github.io/Genomic_Data/). 
 
For more information on visualizing genomics data in browsers you can visit our IGV course. 
 
* [IGV](https://rockefelleruniversity.github.io/IGV_course/). 
---
##Reminder of data types in Bioconductor

We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below. 
 
* [Bioconductor](https://rockefelleruniversity.github.io/Bioconductor_Introduction/)
* [Genomic Intervals](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicIntervals_In_Bioconductor.html)
* [Genomic Scores](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicScores_In_Bioconductor.html)
* [Sequences](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/SequencesInBioconductor.html)
* [Gene Models](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicFeatures_In_Bioconductor.html)
* [Alignments](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/AlignedDataInBioconductor.html)
* [ChIP-seq](http://rockefelleruniversity.github.io/RU_ChIPseq/)
* [ATAC-seq](http://rockefelleruniversity.github.io/RU_ATACseq/)
* [RNA-seq](http://rockefelleruniversity.github.io/RU_RNAseq/)
 
---
##Materials

All material for this course can be found on github. 
* [Visualizing Genomics Data](https://github.com/RockefellerUniversity/RU_VisualizingGenomicsData) 
 
Or can be downloaded as a zip archive from here.  
* [Download zip](https://github.com/RockefellerUniversity/RU_VisualizingGenomicsData/archive/master.zip)

---
##Materials

Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath. 
 
* **r_course/presentations/Slides/** 
Presentations as an HTML slide show. 
* **r_course/presentations/exercises/** 
Some tasks/examples to work through.

---
##Materials

All data to run code in the presentations and in the practicals is available in the zip archive. This includes coverage as bigWig files, aligned reads as BAM files and genomic intervals stored as BED files. 
 
We also include some RData files containing precompiled results from querying database (in case of external server downtime).

All data can be found under the **data** directory 
 
**data/** 
 
---
##Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.  
 
You may navigate to the unarchived visualizingGenomicsData folder in the Rstudio menu 
 
**Session -> Set Working Directory -> Choose Directory** 
 
or in the console.

```r
setwd("/PathToMyDownload/VisualizingGenomicsData/r_course/") 
# e.g. setwd("~/Downloads/VisualizingGenomicsData/r_course") 
```
 
 
 
---
## Recap

Previously we have reviewed how we can review signal and annotation over individual sites using and genome browser such as IGV and or programmatically using Gviz.

.pull-left[
![](imgs/IGV_SplicingExample.png)
]

.pull-right[
![](imgs/cc-1.png)
]

---
## Vizualising Biological data

With many high throughput sequencing experiments we are able to gain a genome wide view of our assays (in RNAseq we measure every gene's expression, in ChIPseq we identify all events for a user-defined transcription factor/mark, in ATACseq we evaluate all open/accessible regions in the genome.)
 
![](imgs/igvUsles.png)

---
## Vizualizing Biological Data

Biological data is often considered to have high dimensionality. Common techniques used to visualize genomics data include dimension reduction and/or clustering followed by the graphical representation of data as a heatmap.

These techniques makes interpretation of the data simpler to better identify patterns within our data i.e.  reproducibility of replicates within groups and magnitude of changes in signal between groups.

---
## Vizualising high dimensional Biological data.

In todays session we will work with some of the RNAseq data of adult mouse tissues from Bing Ren's lab, Liver and Heart. 
- More information on liver data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CHB/)
- More information on heart data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/)
- More information on Kidney data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/)

This represent a more complex experimental design than the two group comparison we have often used i.e. between activated and naive t-cells. We will use some clustering and dimensional reduction techniques to interrogate this data.

---
## Counts from SummariseOverlaps

In our RNAseq session we counted our reads in genes using the summarise overlaps function to generate our **RangedSummarizedExperiment** object.

First we create our BamFileList object to control memort usage.

```r
library(Rsamtools)
bamFilesToCount <- c("Sorted_Heart_1.bam","Sorted_Heart_2.bam",
                     "Sorted_Kidney_1.bam","Sorted_Kidney_2.bam",
                     "Sorted_Liver_1.bam","Sorted_Liver_2.bam")
myBams <- BamFileList(bamFilesToCount,yieldSize = 10000)
```

---
## Counts from SummariseOverlaps

With this BamFileList we can summarise from BAMs to produce our **RangedSummarizedExperiment** object of counts in genes across all samples.

We add in some metadata information for use with DESeq2 to our object with the **colData()** function.

```r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicAlignments)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene,by="gene")
geneCounts <- summarizeOverlaps(geneExons,myBams,
                                    ignore.strand = TRUE)
colData(geneCounts)$tissue <- c("Heart", "Heart","Kidney","Kidney","Liver","Liver") 
```

---
## DESeq2

From here we can  **DESeqDataSet()** function to build directly from our **RangedSummarizedExperiment** object specifying the  metadata column to test on to the **design** parameter.

We then use the **DESeq()** fuction to normalise, fit variance, remove outliers and we extract comparisons of interest using the **results()** function.

```r
dds <- DESeqDataSet(geneCounts,design = ~Tissue)
dds <- DESeq(dds)
heartVsLiver <- results(dds,c("Tissue","Heart","Liver"))
heartVskidney <- results(dds,c("Tissue","Heart","Kidney"))
LiverVskidney <- results(dds,c("Tissue","Liver","Kidney"))
```

---
## DESeq2

Further to our pair-wise comparisons we can extract an anova-like statistic identifying genes changing expression between conditions using the *LRT* test with our reduced model (no groups model).

```r
dds2 <- DESeq(dds,test="LRT",reduced = ~1)
AllChanges <- results(dds2)
```

---
## Visualizing RNAseq data

One of the first steps of working with count data for visualization is commonly to transform the integer count data to log2 scale. To do this we will need to add some artificial value (pseudocount) to zeros in our counts prior to log transform (since the log2 of zero is infinite).

The DEseq2 **normTransform()** will add a 1 to our normalized counts prior to log2 transform and return a **DESeqTransform** object.

```r
normLog2Counts <- normTransform(dds)
normLog2Counts
```

```
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor
```

---
## Visualizing RNAseq data

We can extract our normalized and transformed counts from the **DESeqTransform** object using the **assay()** function.

```r
matrixOfNorm <- assay(normLog2Counts)
boxplot(matrixOfNorm, las=2, names=c("Heart_1","Heart_2", "Kidney_1","Kidney_2", "Liver_1", "Liver_2"))
```

![](Viz_part_3_files/figure-html/gnTM-1.png)

---
## Visualizing RNAseq data

When visualizing our signal however we now will have a similar problem with smaller counts having higher variance. This may cause changes in smaller counts to have undue influence in visualization and clustering.

```r
library(vsn)
vsn::meanSdPlot(matrixOfNorm)
```

![](Viz_part_3_files/figure-html/gnTMP-1.png)

---
## Visualizing RNAseq data

We can apply an **rlog** transformation to our data using the **rlog()** function which will attempt to shrink the variance for genes based on their mean expression.

```r
rlogTissue <- rlog(dds)
rlogTissue
```

```
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(27): baseMean baseVar ... dispFit rlogIntercept
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor
```

---
## Visualizing RNAseq data.

Again we can extract the matrix of transformed counts with the **assay()** function and plot the mean/variance relationship. If we look at the axis we can see the shrinkage of variance for low count genes.

.pull-left[

```r
rlogMatrix <- assay(rlogTissue)
vsn::meanSdPlot(rlogMatrix)
```

![](Viz_part_3_files/figure-html/gRLM-1.png)
  ]
  
.pull-right[

```r
library(vsn)
vsn::meanSdPlot(matrixOfNorm)
```

![](Viz_part_3_files/figure-html/gnTMP2-1.png)
  ]

---
## Dimension reduction

Since we have often have measured 1000s of genes over multiple samples/groups we will often try and simplify this too a few dimensions or meta/eigen genes which represent major patterns of signal across samples found.

We hope the strongest patterns or sources of variation in our data are correlated with sample groups and so dimension reduction offers a methods to method to visually identify reproducibility of samples.

Common methods of dimension reduction include Principal Component Analysis, MultiFactorial Scaling and Non-negative Matrix Factorization.

---
## PCA

We can see PCA in action with our data simply by using the DESeq2's **plotPCA()** function and our **DESeqTransform** object from our rlog transformation.

We must also provide a metadata column to colour samples by to the **intgroup** parameter and we set the **ntop** parameter to use all genes in PCA (by default it is top 500).

```r
plotPCA(rlogTissue,
        intgroup="Tissue",
        ntop=nrow(rlogTissue))
```

![](Viz_part_3_files/figure-html/gPCA-1.png)

---
## PCA

This PCA show the separation of samples by their group and the localization of samples with groups.

Since PC1 here explains 51% of total variances between samples and PC2 explains 44%, the reduction of dimensions can be seen to explain much of the changes among samples in 2 dimensions.

Of further note is the separation of samples across PC1 but the lack of separation of Heart and Kidney samples along PC2.

![](Viz_part_3_files/figure-html/gPCA3-1.png)

---
## PCA

PCA is often used to simply visualize sample similarity, but we can extract further information of the patterns of expression corresponding PCs by performing the PCA analysis ourselves.

We can use the **prcomp()** function with a transposition of our matrix to perform our prinicipal component analysis. The mappings of samples to PCs can be found in the **x** slot of the **prcomp** object.

```r
pcRes <- prcomp(t(rlogMatrix))
class(pcRes)
```

```
## [1] "prcomp"
```

```r
pcRes$x[1:2,]
```

```
##                      PC1      PC2        PC3       PC4        PC5          PC6
## Sorted_Heart_1 -103.8999 52.45800 -0.2401797  25.22797  1.0609927 2.437217e-13
## Sorted_Heart_2 -117.8517 46.07786  0.4445612 -23.96120 -0.7919848 2.426392e-13
```

---
## PCA

We can now reproduce the previous plot from DEseq2 in basic graphics from this.

```r
plot(pcRes$x,
     col=colData(rlogTissue)$Tissue,
     pch=20,
     cex=2)
legend("top",legend = c("Heart","Kidney","Liver"),
       fill=unique(colData(rlogTissue)$Tissue))
```

![](Viz_part_3_files/figure-html/gPRcosmp-1.png)

---
## PCA

Now we have constucted the PCA ourselves we can investigate which genes' expression profiles influence the relative PCs.

The influence (rotation/loadings) for all genes to each PC can be found in the **rotation** slot of the **prcomp** object.

```r
pcRes$rotation[1:5,1:4]
```

```
##                 PC1          PC2          PC3           PC4
## 20671  -0.005479648  0.003577376 -0.006875591  0.0048625659
## 27395  -0.002020427  0.003325506 -0.002302364 -0.0045132749
## 18777   0.004615068 -0.005413345  0.008975098 -0.0028857868
## 21399   0.005568549  0.002485067  0.002615072 -0.0001255119
## 108664  0.005729029 -0.004912143  0.009991580 -0.0031039532
```
---
## PCA

To investigate the seperation of Kidney samples along the negative axis of PC2 I can then look at which genes most negatively contribute to PC2.

Here we order by the most negative values for PC2 and select the top 100.

```r
PC2markers <- sort(pcRes$rotation[,2],decreasing = FALSE)[1:100]
PC2markers[1:10]
```

```
##       20505       22242       16483       56727       77337       57394 
## -0.06497365 -0.06001728 -0.05896360 -0.05896220 -0.05595648 -0.05505202 
##       18399       20495       22598       20730 
## -0.05307506 -0.05256188 -0.05205661 -0.05175604
```

---
## PCA

To investigate the gene expression profiles associated with PC2 we can now plot the log2 foldchanges ( or directional statistics) from our pairwise comparisons for our PC2 most influencial genes.

From the boxplot it is clear to see that the top 100 genes are all specifically upregulated in Kidney tissue.

```r
PC2_hVsl <- heartVsLiver$stat[rownames(heartVsLiver) %in% names(PC2markers)]
PC2_hVsk <- heartVskidney$stat[rownames(heartVskidney) %in% names(PC2markers)]
PC2_LVsk <- LiverVskidney$stat[rownames(LiverVskidney) %in% names(PC2markers)]
```

---
## PCA

From the boxplot it is clear to see that the top100 genes are all specifically upregulated in Kidney tissue.

```r
boxplot(PC2_hVsl,PC2_hVsk,PC2_LVsk,
        names=c("Heart/Liver","Heart/Kidney","Liver/Kidney"),ylab="log2FC")
```

![](Viz_part_3_files/figure-html/gPRcompRotf-1.png)

---
## Sample-to-Sample correlation

Another common step in quality control is to assess the correlation between expression profiles of samples.

We can assess correlation between all samples in a matrix by using the **cor()** function.

```r
sampleCor <- cor(rlogMatrix)
sampleCor
```

```
##                 Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2
## Sorted_Heart_1       1.0000000      0.9835321       0.7144527       0.7189597
## Sorted_Heart_2       0.9835321      1.0000000       0.7190675       0.7229253
## Sorted_Kidney_1      0.7144527      0.7190675       1.0000000       0.9929131
## Sorted_Kidney_2      0.7189597      0.7229253       0.9929131       1.0000000
## Sorted_Liver_1       0.7156978      0.6883444       0.7344165       0.7336117
## Sorted_Liver_2       0.7186525      0.6918428       0.7366287       0.7396193
##                 Sorted_Liver_1 Sorted_Liver_2
## Sorted_Heart_1       0.7156978      0.7186525
## Sorted_Heart_2       0.6883444      0.6918428
## Sorted_Kidney_1      0.7344165      0.7366287
## Sorted_Kidney_2      0.7336117      0.7396193
## Sorted_Liver_1       1.0000000      0.9714750
## Sorted_Liver_2       0.9714750      1.0000000
```

---
## Sample-to-Sample correlation

We can visualize the the correlation matrix using a heatmap following sample clustering.

First, we need to convert our correlation matrix into a distance measure to be use in clustering by subtracting from 1 to give dissimilarity measure and converting with the **as.dist()** to a **dist** object.

We then create a matrix of distance values to plot in the heatmap from this using **as.matrix()** function.

```r
library(pheatmap)
sampleDists <- as.dist(1-cor(rlogMatrix))
sampleDistMatrix <- as.matrix(sampleDists)
```

---
## Sample-to-Sample correlation

We can use the **pheatmap** library's pheatmap function to cluster our data by similarity and produce our heatmaps. We provide our matrix of sample distances as well as our **dist** object to the **clustering_distance_rows** and **clustering_distance_cols** function.

By default hierarchical clustering will group samples based on their gene expression similarity into a dendrogram with between sample similarity illustrated by branch length.

```r
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists)
```

![](Viz_part_3_files/figure-html/gSampleDistc-1.png)

---
## Sample-to-Sample correlation

We can use the **brewer.pal** and **colorRampPalette()** function to create a white to blue scale [as we have with **ggplot** scales](https://rockefelleruniversity.github.io/Plotting_In_R/r_course/presentations/slides/ggplot2.html#60).

```r
library(RColorBrewer)
blueColours <- brewer.pal(9, "Blues")
colors <- colorRampPalette(rev(blueColours))(255)
plot(1:255,rep(1,255),
     col=colors,pch=20,cex=20,ann=FALSE,
     yaxt="n")
```

![](Viz_part_3_files/figure-html/gSampleDistca-1.png)

---
## Sample-to-Sample correlation

We can provide a slightly nicer scale for our distance measure heatmap to the **color** parameter in the **pheatmap** function.

```r
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         color = colors)
```

![](Viz_part_3_files/figure-html/gSampleDistd-1.png)

---
## Sample-to-Sample correlation

Finally we can add some column annotation to highlight group membership. We must provide annotation as a data.frame of metadata we wish to include with rownames matching column names.

Fortunetely that is exactly as we have set up from DEseq2. We can extract metadata from the DESeq2 object with **colData()** function  and provide to the **annotation_col** parameter.

```r
annoCol <- as.data.frame(colData(dds))
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         color = colors,annotation_col = annoCo)
```

![](Viz_part_3_files/figure-html/gSampleDistl-1.png)