##The Course

  • Dimension reduction
  • PCA and loadings
  • Clustering
  • Grouping by kmeans
  • Testing gene clusters

##Reminder of file types

In this session we will be dealing with a range of data types. For more information on file types you can revisit our material.

For more information on visualizing genomics data in browsers you can visit our IGV course.

##Reminder of data types in Bioconductor

We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below.

##Materials

All material for this course can be found on github. * Visualizing Genomics Data

Or can be downloaded as a zip archive from here.
* Download zip

##Materials

Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.

  • r_course/presentations/Slides/ Presentations as an HTML slide show.
  • r_course/presentations/exercises/ Some tasks/examples to work through.

##Materials

All data to run code in the presentations and in the practicals is available in the zip archive. This includes coverage as bigWig files, aligned reads as BAM files and genomic intervals stored as BED files.

We also include some RData files containing precompiled results from querying database (in case of external server downtime).

All data can be found under the data directory

data/

##Set the Working directory

Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.

You may navigate to the unarchived visualizingGenomicsData folder in the Rstudio menu

Session -> Set Working Directory -> Choose Directory

or in the console.

setwd("/PathToMyDownload/VisualizingGenomicsData/r_course/") 
# e.g. setwd("~/Downloads/VisualizingGenomicsData/r_course") 

Recap

Previously we have reviewed how we can review signal and annotation over individual sites using and genome browser such as IGV and or programmatically using Gviz.

]

]

Vizualising Biological data

With many high throughput sequencing experiments we are able to gain a genome wide view of our assays (in RNAseq we measure every gene’s expression, in ChIPseq we identify all events for a user-defined transcription factor/mark, in ATACseq we evaluate all open/accessible regions in the genome.)

Vizualizing Biological Data

Biological data is often considered to have high dimensionality. Common techniques used to visualize genomics data include dimension reduction and/or clustering followed by the graphical representation of data as a heatmap.

These techniques makes interpretation of the data simpler to better identify patterns within our data i.e. reproducibility of replicates within groups and magnitude of changes in signal between groups.

igv

Vizualising high dimensional Biological data.

In todays session we will work with some of the RNAseq data of adult mouse tissues from Bing Ren’s lab, Liver and Heart. - More information on liver data can be found here - More information on heart data can be found here - More information on Kidney data can be found here

This represent a more complex experimental design than the two group comparison we have often used i.e. between activated and naive t-cells. We will use some clustering and dimensional reduction techniques to interrogate this data.

Counts from SummariseOverlaps

In our RNAseq session we counted our reads in genes using the summarise overlaps function to generate our RangedSummarizedExperiment object.

First we create our BamFileList object to control memort usage.

library(Rsamtools)
bamFilesToCount <- c("Sorted_Heart_1.bam","Sorted_Heart_2.bam",
                     "Sorted_Kidney_1.bam","Sorted_Kidney_2.bam",
                     "Sorted_Liver_1.bam","Sorted_Liver_2.bam")
myBams <- BamFileList(bamFilesToCount,yieldSize = 10000)

Counts from SummariseOverlaps

With this BamFileList we can summarise from BAMs to produce our RangedSummarizedExperiment object of counts in genes across all samples.

We add in some metadata information for use with DESeq2 to our object with the colData() function.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(GenomicAlignments)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene,by="gene")
geneCounts <- summarizeOverlaps(geneExons,myBams,
                                    ignore.strand = TRUE)
colData(geneCounts)$tissue <- c("Heart", "Heart","Kidney","Kidney","Liver","Liver") 

DESeq2

From here we can DESeqDataSet() function to build directly from our RangedSummarizedExperiment object specifying the metadata column to test on to the design parameter.

We then use the DESeq() fuction to normalise, fit variance, remove outliers and we extract comparisons of interest using the results() function.

dds <- DESeqDataSet(geneCounts,design = ~Tissue)
dds <- DESeq(dds)
heartVsLiver <- results(dds,c("Tissue","Heart","Liver"))
heartVskidney <- results(dds,c("Tissue","Heart","Kidney"))
LiverVskidney <- results(dds,c("Tissue","Liver","Kidney"))

DESeq2

Further to our pair-wise comparisons we can extract an anova-like statistic identifying genes changing expression between conditions using the LRT test with our reduced model (no groups model).

dds2 <- DESeq(dds,test="LRT",reduced = ~1)
AllChanges <- results(dds2)

Visualizing RNAseq data

One of the first steps of working with count data for visualization is commonly to transform the integer count data to log2 scale. To do this we will need to add some artificial value (pseudocount) to zeros in our counts prior to log transform (since the log2 of zero is infinite).

The DEseq2 normTransform() will add a 1 to our normalized counts prior to log2 transform and return a DESeqTransform object.

normLog2Counts <- normTransform(dds)
normLog2Counts
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor

Visualizing RNAseq data

We can extract our normalized and transformed counts from the DESeqTransform object using the assay() function.

matrixOfNorm <- assay(normLog2Counts)
boxplot(matrixOfNorm, las=2, names=c("Heart_1","Heart_2", "Kidney_1","Kidney_2", "Liver_1", "Liver_2"))

Visualizing RNAseq data

When visualizing our signal however we now will have a similar problem with smaller counts having higher variance. This may cause changes in smaller counts to have undue influence in visualization and clustering.

library(vsn)
vsn::meanSdPlot(matrixOfNorm)

Visualizing RNAseq data

We can apply an rlog transformation to our data using the rlog() function which will attempt to shrink the variance for genes based on their mean expression.

rlogTissue <- rlog(dds)
rlogTissue
## class: DESeqTransform 
## dim: 14454 6 
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(27): baseMean baseVar ... dispFit rlogIntercept
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
##   Sorted_Liver_2
## colData names(2): Tissue sizeFactor

Visualizing RNAseq data.

Again we can extract the matrix of transformed counts with the assay() function and plot the mean/variance relationship. If we look at the axis we can see the shrinkage of variance for low count genes.

rlogMatrix <- assay(rlogTissue)
vsn::meanSdPlot(rlogMatrix)

library(vsn)
vsn::meanSdPlot(matrixOfNorm)

Dimension reduction

Since we have often have measured 1000s of genes over multiple samples/groups we will often try and simplify this too a few dimensions or meta/eigen genes which represent major patterns of signal across samples found.

We hope the strongest patterns or sources of variation in our data are correlated with sample groups and so dimension reduction offers a methods to method to visually identify reproducibility of samples.

Common methods of dimension reduction include Principal Component Analysis, MultiFactorial Scaling and Non-negative Matrix Factorization.

igv

PCA

We can see PCA in action with our data simply by using the DESeq2’s plotPCA() function and our DESeqTransform object from our rlog transformation.

We must also provide a metadata column to colour samples by to the intgroup parameter and we set the ntop parameter to use all genes in PCA (by default it is top 500).

plotPCA(rlogTissue,
        intgroup="Tissue",
        ntop=nrow(rlogTissue))

PCA

This PCA show the separation of samples by their group and the localization of samples with groups.

Since PC1 here explains 51% of total variances between samples and PC2 explains 44%, the reduction of dimensions can be seen to explain much of the changes among samples in 2 dimensions.

Of further note is the separation of samples across PC1 but the lack of separation of Heart and Kidney samples along PC2.

PCA

PCA is often used to simply visualize sample similarity, but we can extract further information of the patterns of expression corresponding PCs by performing the PCA analysis ourselves.

We can use the prcomp() function with a transposition of our matrix to perform our prinicipal component analysis. The mappings of samples to PCs can be found in the x slot of the prcomp object.

pcRes <- prcomp(t(rlogMatrix))
class(pcRes)
## [1] "prcomp"
pcRes$x[1:2,]
##                      PC1      PC2        PC3       PC4        PC5          PC6
## Sorted_Heart_1 -103.8999 52.45800 -0.2401797  25.22797  1.0609927 2.437217e-13
## Sorted_Heart_2 -117.8517 46.07786  0.4445612 -23.96120 -0.7919848 2.426392e-13

PCA

We can now reproduce the previous plot from DEseq2 in basic graphics from this.

plot(pcRes$x,
     col=colData(rlogTissue)$Tissue,
     pch=20,
     cex=2)
legend("top",legend = c("Heart","Kidney","Liver"),
       fill=unique(colData(rlogTissue)$Tissue))

PCA

Now we have constucted the PCA ourselves we can investigate which genes’ expression profiles influence the relative PCs.

The influence (rotation/loadings) for all genes to each PC can be found in the rotation slot of the prcomp object.

pcRes$rotation[1:5,1:4]
##                 PC1          PC2          PC3           PC4
## 20671  -0.005479648  0.003577376 -0.006875591  0.0048625659
## 27395  -0.002020427  0.003325506 -0.002302364 -0.0045132749
## 18777   0.004615068 -0.005413345  0.008975098 -0.0028857868
## 21399   0.005568549  0.002485067  0.002615072 -0.0001255119
## 108664  0.005729029 -0.004912143  0.009991580 -0.0031039532

PCA

To investigate the seperation of Kidney samples along the negative axis of PC2 I can then look at which genes most negatively contribute to PC2.

Here we order by the most negative values for PC2 and select the top 100.

PC2markers <- sort(pcRes$rotation[,2],decreasing = FALSE)[1:100]
PC2markers[1:10]
##       20505       22242       16483       56727       77337       57394 
## -0.06497365 -0.06001728 -0.05896360 -0.05896220 -0.05595648 -0.05505202 
##       18399       20495       22598       20730 
## -0.05307506 -0.05256188 -0.05205661 -0.05175604

PCA

To investigate the gene expression profiles associated with PC2 we can now plot the log2 foldchanges ( or directional statistics) from our pairwise comparisons for our PC2 most influencial genes.

From the boxplot it is clear to see that the top 100 genes are all specifically upregulated in Kidney tissue.

PC2_hVsl <- heartVsLiver$stat[rownames(heartVsLiver) %in% names(PC2markers)]
PC2_hVsk <- heartVskidney$stat[rownames(heartVskidney) %in% names(PC2markers)]
PC2_LVsk <- LiverVskidney$stat[rownames(LiverVskidney) %in% names(PC2markers)]

PCA

From the boxplot it is clear to see that the top100 genes are all specifically upregulated in Kidney tissue.

boxplot(PC2_hVsl,PC2_hVsk,PC2_LVsk,
        names=c("Heart/Liver","Heart/Kidney","Liver/Kidney"),ylab="log2FC")

Sample-to-Sample correlation

Another common step in quality control is to assess the correlation between expression profiles of samples.

We can assess correlation between all samples in a matrix by using the cor() function.

sampleCor <- cor(rlogMatrix)
sampleCor
##                 Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2
## Sorted_Heart_1       1.0000000      0.9835321       0.7144527       0.7189597
## Sorted_Heart_2       0.9835321      1.0000000       0.7190675       0.7229253
## Sorted_Kidney_1      0.7144527      0.7190675       1.0000000       0.9929131
## Sorted_Kidney_2      0.7189597      0.7229253       0.9929131       1.0000000
## Sorted_Liver_1       0.7156978      0.6883444       0.7344165       0.7336117
## Sorted_Liver_2       0.7186525      0.6918428       0.7366287       0.7396193
##                 Sorted_Liver_1 Sorted_Liver_2
## Sorted_Heart_1       0.7156978      0.7186525
## Sorted_Heart_2       0.6883444      0.6918428
## Sorted_Kidney_1      0.7344165      0.7366287
## Sorted_Kidney_2      0.7336117      0.7396193
## Sorted_Liver_1       1.0000000      0.9714750
## Sorted_Liver_2       0.9714750      1.0000000

Sample-to-Sample correlation

We can visualize the the correlation matrix using a heatmap following sample clustering.

First, we need to convert our correlation matrix into a distance measure to be use in clustering by subtracting from 1 to give dissimilarity measure and converting with the as.dist() to a dist object.

We then create a matrix of distance values to plot in the heatmap from this using as.matrix() function.

library(pheatmap)
sampleDists <- as.dist(1-cor(rlogMatrix))
sampleDistMatrix <- as.matrix(sampleDists)

Sample-to-Sample correlation

We can use the pheatmap library’s pheatmap function to cluster our data by similarity and produce our heatmaps. We provide our matrix of sample distances as well as our dist object to the clustering_distance_rows and clustering_distance_cols function.

By default hierarchical clustering will group samples based on their gene expression similarity into a dendrogram with between sample similarity illustrated by branch length.

pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists)

Sample-to-Sample correlation

We can use the brewer.pal and colorRampPalette() function to create a white to blue scale as we have with ggplot scales.

library(RColorBrewer)
blueColours <- brewer.pal(9, "Blues")
colors <- colorRampPalette(rev(blueColours))(255)
plot(1:255,rep(1,255),
     col=colors,pch=20,cex=20,ann=FALSE,
     yaxt="n")

Sample-to-Sample correlation

We can provide a slightly nicer scale for our distance measure heatmap to the color parameter in the pheatmap function.

pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         color = colors)

Sample-to-Sample correlation

Finally we can add some column annotation to highlight group membership. We must provide annotation as a data.frame of metadata we wish to include with rownames matching column names.

Fortunetely that is exactly as we have set up from DEseq2. We can extract metadata from the DESeq2 object with colData() function and provide to the annotation_col parameter.

annoCol <- as.data.frame(colData(dds))
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         color = colors,annotation_col = annoCo)

Clustering genes and samples

We can use the same methods of clustering samples to cluster genes with similar expression patterns together.Clustering genes will allow us to visually identify the major patterns of gene expressions with our data and to group genes with similar expression profiles for review and functional analysis.

This technique will hopefully allow us to disentangle the complex sets of comparison found in evaluating significant differences found from our DESeq2 LRT test of changes in genes’ expression between any tissue.

First then lets subset our rlog transformed gene expression matrix to those genes significant in our LRT test.

sigChanges <- rownames(AllChanges)[AllChanges$padj < 0.01 & !is.na(AllChanges$padj)]
sigMat <- rlogMatrix[rownames(rlogMatrix) %in% sigChanges,]
nrow(sigMat)
## [1] 8094

Clustering genes and samples

We can pass our filtered matrix of expression to the pheatmap() function and set the scale parameter to row to allow for clustering of relative changes in gene expression. Additionally due to the large number of genes, we turn rowname off with the show_rownames function.

library(pheatmap)
pheatmap(sigMat,
         scale="row",
         show_rownames = FALSE)

Clustering genes and samples

pheatmap(sigMat,
         scale="row",
         show_rownames = FALSE)

Clustering genes and samples

Now we have a visual representation of changes in gene expression across samples we can use the clustering to derive groups of genes with similar expression patterns. Gene with similar expression profiles may share functional roles and so we can use these groups to further evaluate our gene expression data.

Many approaches to identifying clustered groups of genes exist including K-means, SOM and HOPACH.

K-means is implemented with the pheatmap package and so we can simply provide a desired number of clusters to the kmeans_k parameter.

library(pheatmap)
set.seed(153)
k <-   pheatmap(sigMat,
           scale="row",kmeans_k = 7)

Clustering genes and samples

The resulting plot no longer shows our individual genes but the average relative expression of genes within a cluster.

The heatmap rownames show the cluster name and importantly the number of genes within each cluster.

library(pheatmap)
set.seed(153)
k <-   pheatmap(sigMat,
           scale="row",kmeans_k = 7)

Clustering genes and samples

The pheatmap() function returns information on clustering is return as a list and following k-means clustering the assignment of genes to clusters can be extracted from this.

names(k$kmeans)
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
clusterDF <- as.data.frame(factor(k$kmeans$cluster))
colnames(clusterDF) <- "Cluster"
clusterDF[1:10,,drop=FALSE]
##           Cluster
## 20671           1
## 27395           3
## 18777           4
## 21399           5
## 108664          4
## 319263          6
## 76187           4
## 70675           6
## 73824           5
## 100039596       4

Clustering genes and samples

We can now plot our full heatmap highlighting the membership of genes to clusters.

We add an additional row annotation by providing a data.frame of desired annotation with rownames matching between our annotation data.frame and our rlog transformed matrix to the annotation_row parameter.

OrderByCluster <- sigMat[order(clusterDF$Cluster),]

pheatmap(OrderByCluster,
           scale="row",annotation_row = clusterDF,
           show_rownames = FALSE,cluster_rows = FALSE)

Clustering genes and samples

OrderByCluster <- sigMat[order(clusterDF$Cluster),]

pheatmap(OrderByCluster,
           scale="row",annotation_row = clusterDF,
           show_rownames = FALSE,cluster_rows = FALSE)

Identifying optimal clusters

Methods exist to identify the optimal number of clusters within your data.

One such method is to assess the silhoutte score at different successive cluster numbers and choose the cluster number with the highest mean silhoutte score.

The Silhouette method evaluates the similarity of cluster members to the similarity between clusters as below.

For all genes/samples, the dissimilarity for a cluster member to its own cluster , ai, is calculated as the mean distance between a cluster member and all other members of that cluster. Further to this the minimun, mean dissimilarity of the cluster member to members of other clusters is calculated, bi.

S i = ( b i a i ) / m a x ( a i , b i )

Identifying number of clusters

We can use the NbClust package to calculate the Silhoutte scores over successive cluster numbers. We supply the scaled matrix to the NbClust function and set the min and maximum cluster numbers to try using the min.nc and max.nc respectively.

We can retrieve the optimal cluster number from the Best.nc slot of our result list. Here we see the number is lower at 3, a cluster for every sample group’s unique gene expression signature.

library(NbClust)
rowScaledMat <- t(scale(t(sigMat)))
clusterNum <- NbClust(rowScaledMat,distance = "euclidean",
          min.nc = 2, max.nc = 12, 
          method = "kmeans", index ="silhouette")

clusterNum$Best.nc
## Number_clusters     Value_Index 
##          3.0000          0.4956

Identifying number of clusters

We can the use the Best.partition slot to extract the cluster membership as we did with pheatmap.

We can arrange our matrix by using the match function between the row names of our matrix and the names of genes in our new cluster membership vector.

clusterNum$Best.partition[1:10]
##     20671     27395     18777     21399    108664    319263     76187     70675 
##         2         2         1         3         1         3         1         3 
##     73824 100039596 
##         3         1
orderedCluster <- sort(clusterNum$Best.partition)
sigMat <- sigMat[match(names(orderedCluster),rownames(sigMat)),]

Identifying number of clusters

We can now visualize the new clustering alongside our old clustering.

pheatmap(sigMat,
           scale="row",annotation_row = clusterDF,
           show_rownames = FALSE,cluster_rows = FALSE)

Testing gene clusters

Now we have some genes clusters from our expression data, we will want to investigate these clusters for any enrichment of functional terms within them.

Here we extract all genes in cluster 1, which appear to be our heart specific genes.

heartSpecific <- rownames(clusterDF[clusterDF$Cluster == 1,,drop=FALSE])
heartSpecific[1:10]
##  [1] "20671"  "109294" "240725" "240726" "14048"  "29819"  "57339"  "74229" 
##  [9] "13518"  "214855"

Testing gene clusters

We can now build our logical vector of all genes ( from those with Padj values from LRT test) with the genes in cluster marked as TRUE.

We will use the logical vector for geneset testing using goseq.

bckGround <- rownames(AllChanges)[!is.na(AllChanges$padj)]
heartSpecific <- bckGround %in% heartSpecific
names(heartSpecific) <- bckGround
heartSpecific[1:10]
##  20671  27395  18777  21399 108664  18387  12421 319263  59014  76187 
##   TRUE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE  FALSE

Testing gene clusters

We can now pass our named logical vector to the nullp and goseq function seen before in geneset testing in RNAseq and ChIPseq. Here we test against the GO Biological Processes genesets.

library(goseq)
toTest <- nullp(heartSpecific,"mm10",id = "knownGene")

Testing gene clusters

From our GO test we can see enrichment for some heart specific terms such as muscle development.

heartRes <- goseq(toTest,"mm10","knownGene",test.cats = "GO:BP")
heartRes[1:5,]
##         category over_represented_pvalue under_represented_pvalue numDEInCat
## 9597  GO:0061061            5.223231e-38                        1        186
## 10934 GO:0072359            7.728431e-34                        1        256
## 8038  GO:0048646            7.238112e-29                        1        240
## 8977  GO:0060047            4.576488e-28                        1         83
## 973   GO:0003012            5.109725e-28                        1        120
##       numInCat                                                     term
## 9597       581                             muscle structure development
## 10934      982                           circulatory system development
## 8038       960 anatomical structure formation involved in morphogenesis
## 8977       185                                        heart contraction
## 973        341                                    muscle system process
##       ontology
## 9597        BP
## 10934       BP
## 8038        BP
## 8977        BP
## 973         BP

PCA

Finally we can use our heatmaps to get a better idea of the pattern of expression associated with PC1 plot we saw earlier.

plotPCA(rlogTissue,
        intgroup="Tissue",
        ntop=nrow(rlogTissue))

PCA

We can extract and sort our expression matrix by the loadings (influence) of genes on PC1.

Here we use the match() fucntion to sort our rlog transformed gene expression matrix to the order of PC1 influence

PC1_rnk <- sort(pcRes$rotation[,1],decreasing = TRUE)
PC1_mat <- sigMat[match(names(PC1_rnk),rownames(sigMat),nomatch = 0),]
PC1_mat[1:3,]
##           Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2
## 100039206       4.648873       3.247752        2.567747        2.576385
## 100039028       6.094549       3.833855        3.491466        3.326275
## 100041658       4.525257       3.561079        2.328869        2.338274
##           Sorted_Liver_1 Sorted_Liver_2
## 100039206       14.08565       11.12317
## 100039028       15.12364       11.96021
## 100041658       12.81601       10.45862

PCA

Now we plot our heatmap of gene expression ordered by PC1 influence to reveal the PC1’s expression profile.

pheatmap(PC1_mat,
         scale="row",
         cluster_rows=FALSE,
         show_rownames = FALSE,annotation_col = annoCol
         )

##Exercises

Time for exercises! Link here

##Solutions

Time for solutions! Link here