Session 1: Alignment and counting
Session 2: Differential gene expression analysis
Session 3: Visualizing results through PCA and clustering
Session 4: Gene Set Analysis
Session 5: Differential Transcript Utilization analysis
In todays session we will work with some of the RNAseq data of adult mouse tissues from Bing Ren’s lab, Liver and Heart. - More information on liver data can be found here - More information on heart data can be found here - More information on kidney data can be found here
The Tissue RNAseq dataset have been pre-aligned and counted. The RangedSummarizedExperiment object with the gene counts has been saved as a .RData object in data directory:
data/gC_TissueFull.RData
Biological data is often considered to have high dimensionality. Common techniques used to visualize genomics data include dimension reduction and/or clustering followed by the graphical representation of data as a heatmap.
These techniques makes interpretation of the data simpler to better identify patterns within our data i.e. reproducibility of replicates within groups and magnitude of changes in signal between groups.
In this experiment we will be comparing three tissues. This represent a more complex experimental design than the two group comparison we have often used i.e. between activated and naive t-cells.
We still use some similar approaches, but there will also be some additional approaches that will help i.e. clustering and dimensional reduction techniques to interrogate this data.
The data has already been mapped and counted with Rsubread and SummarizedExperiment respectively. The RangedSummarizedExperiment loaded in here has the counts for all 3 tissues. The grouping metadata is the Tissue variable.
load("data/gC_TissueFull.RData")
geneCounts
## class: RangedSummarizedExperiment
## dim: 14454 6
## metadata(0):
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(0):
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
## Sorted_Liver_2
## colData names(1): Tissue
We can now set up our DESeq2 object and run the DESeq workflow function
<- DESeqDataSet(geneCounts, design = ~Tissue)
dds <- DESeq(dds) dds
To extract comparisons we can simply specify the tissues of interest to the results function.
<- results(dds, c("Tissue", "Liver", "Kidney"))
LiverVskidney <- results(dds, c("Tissue", "Heart", "Liver"))
heartVsLiver <- results(dds, c("Tissue", "Heart", "Kidney"))
heartVskidney heartVskidney
## log2 fold change (MLE): Tissue Heart vs Kidney
## Wald test p-value: Tissue Heart vs Kidney
## DataFrame with 14454 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 20671 64.5915 1.675554 0.400136 4.18746 2.82090e-05 1.11084e-04
## 27395 354.2077 1.069639 0.276556 3.86771 1.09862e-04 3.93252e-04
## 18777 517.6560 -1.976956 0.342107 -5.77876 7.52525e-09 4.66423e-08
## 21399 107.3182 -0.373288 0.315755 -1.18221 2.37124e-01 3.56834e-01
## 108664 370.2050 -2.062351 0.368121 -5.60237 2.11435e-08 1.24993e-07
## ... ... ... ... ... ... ...
## 20592 150.6694 0.0434984 0.283835 0.153253 0.87819903 0.919818
## 26908 264.7052 0.2109728 0.287845 0.732938 0.46359624 0.592105
## 22290 124.0370 0.0786872 0.459615 0.171203 0.86406453 0.909363
## 26900 375.4670 0.8709573 0.330544 2.634922 0.00841566 0.020596
## 170942 10.3089 -1.3065830 0.873286 -1.496168 0.13460985 0.226951
To identify genes specifically upregulated in Heart versus other tissues we just need to do a simple data merge.
First we convert DESeqResults objects into a data.frame, remove NAs and extract most interesting columns to us.
<- as.data.frame(heartVsLiver)
heartVsLiverDF <- as.data.frame(heartVskidney)
heartVskidneyDF
<- heartVsLiverDF[!is.na(heartVsLiverDF$padj), ]
heartVsLiverDF <- heartVskidneyDF[!is.na(heartVskidneyDF$padj), ]
heartVskidneyDF
<- heartVsLiverDF[, c("log2FoldChange", "padj")]
heartVsLiverDF <- heartVskidneyDF[, c("log2FoldChange", "padj")] heartVskidneyDF
We can then update the column names and merge our data.frame to have a single table of most useful information. The by=0 means that it will use the rownames (which contain the Gene IDs) as the common feature.
colnames(heartVskidneyDF) <- paste0("HeartVsKidney", "_", colnames(heartVskidneyDF))
colnames(heartVsLiverDF) <- paste0("HeartVsLiver", "_", colnames(heartVsLiverDF))
<- merge(heartVsLiverDF, heartVskidneyDF, by = 0)
fullTable 1:2, ] fullTable[
## Row.names HeartVsLiver_log2FoldChange HeartVsLiver_padj
## 1 100017 -0.6209016 0.1688893
## 2 100019 -0.6718217 0.1137017
## HeartVsKidney_log2FoldChange HeartVsKidney_padj
## 1 0.1789343 0.7431574
## 2 -0.1426205 0.7860500
Now we can extract our genes are significantly upregulated in Heart in both conditions.
<- fullTable$HeartVsLiver_log2FoldChange > 0 & fullTable$HeartVsKidney_log2FoldChange >
upInHeart 0 & fullTable$HeartVsLiver_padj < 0.05 & fullTable$HeartVsKidney_padj < 0.05
<- fullTable[upInHeart, ]
upInHeartTable 1:2, ] upInHeartTable[
## Row.names HeartVsLiver_log2FoldChange HeartVsLiver_padj
## 12 100038347 4.927612 2.564094e-49
## 24 100038575 2.434627 1.389225e-02
## HeartVsKidney_log2FoldChange HeartVsKidney_padj
## 12 4.362883 5.756669e-51
## 24 3.184769 3.529865e-04
We can also make a logical data.frame of whether a gene was upregulated in Heart for both Liver and Kidney comparisons.
<- data.frame(UpvsLiver = fullTable$HeartVsLiver_log2FoldChange > 0 & fullTable$HeartVsLiver_padj <
forVenn 0.05, UpvsKidney = fullTable$HeartVsKidney_log2FoldChange > 0 & fullTable$HeartVsKidney_padj <
0.05)
1:3, ] forVenn[
## UpvsLiver UpvsKidney
## 1 FALSE FALSE
## 2 FALSE FALSE
## 3 FALSE TRUE
We can use limmas’ vennDiagram() to produce a venn diagram of overlap in upregulation in heart vs Liver or Kidney.
library(limma)
vennDiagram(forVenn)
One of the first steps of working with count data for visualization is commonly to transform the integer count data to log2 scale. To do this we will need to add some artificial value (pseudocount) to zeros in our counts prior to log transform (since the log2 of zero is infinite).
The DEseq2 normTransform() will add a 1 to our normalized counts prior to log2 transform and return a DESeqTransform object.
<- normTransform(dds)
normLog2Counts normLog2Counts
## class: DESeqTransform
## dim: 14454 6
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(26): baseMean baseVar ... deviance maxCooks
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
## Sorted_Liver_2
## colData names(2): Tissue sizeFactor
We can extract our normalized and transformed counts from the DESeqTransform object using the assay() function.
<- assay(normLog2Counts)
matrixOfNorm boxplot(matrixOfNorm, las = 2, names = c("Heart_1", "Heart_2", "Kidney_1", "Kidney_2",
"Liver_1", "Liver_2"))
When visualizing our signal however we now will have a similar problem with smaller counts having higher variance. This may cause changes in smaller counts to have undue influence in visualization and clustering.
library(vsn)
::meanSdPlot(matrixOfNorm) vsn
We can apply an rlog transformation to our data using the rlog() function which will attempt to shrink the variance for genes based on their mean expression.
<- rlog(dds)
rlogTissue rlogTissue
## class: DESeqTransform
## dim: 14454 6
## metadata(1): version
## assays(1): ''
## rownames(14454): 20671 27395 ... 26900 170942
## rowData names(27): baseMean baseVar ... dispFit rlogIntercept
## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1
## Sorted_Liver_2
## colData names(2): Tissue sizeFactor
Again we can extract the matrix of transformed counts with the assay() function and plot the mean/variance relationship. If we look at the axis we can see the shrinkage of variance for low count genes.
<- assay(rlogTissue)
rlogMatrix ::meanSdPlot(rlogMatrix) vsn
library(vsn)
::meanSdPlot(matrixOfNorm) vsn
Since we have often have measured 1000s of genes over multiple samples/groups we will often try and simplify this too a few dimensions or meta/eigen genes which represent major patterns of signal across samples found.
We hope the strongest patterns or sources of variation in our data are correlated with sample groups and so dimension reduction offers a methods to method to visually identify reproducibility of samples.
Common methods of dimension reduction include Principal Component Analysis, MultiFactorial Scaling and Non-negative Matrix Factorization.
We can see PCA in action with our data simply by using the DESeq2’s plotPCA() function and our DESeqTransform object from our rlog transformation.
We must also provide a metadata column to colour samples by to the intgroup parameter and we set the ntop parameter to use all genes in PCA (by default it is top 500).
plotPCA(rlogTissue, intgroup = "Tissue", ntop = nrow(rlogTissue))
This PCA show the separation of samples by their group and the localization of samples with groups.
Since PC1 here explains 51% of total variances between samples and PC2 explains 44%, the reduction of dimensions can be seen to explain much of the changes among samples in 2 dimensions.
Of further note is the separation of samples across PC1 but the lack of separation of Heart and Kidney samples along PC2.
PCA is often used to simply visualize sample similarity, but we can extract further information of the patterns of expression corresponding PCs by performing the PCA analysis ourselves.
We can use the prcomp() function with a transposition of our matrix to perform our prinicipal component analysis. The mappings of samples to PCs can be found in the x slot of the prcomp object.
<- prcomp(t(rlogMatrix))
pcRes class(pcRes)
## [1] "prcomp"
$x[1:2, ] pcRes
## PC1 PC2 PC3 PC4 PC5 PC6
## Sorted_Heart_1 -103.8999 52.45800 -0.2401797 25.22797 1.0609927 2.437217e-13
## Sorted_Heart_2 -117.8517 46.07786 0.4445612 -23.96120 -0.7919848 2.426392e-13
We can now reproduce the previous plot from DEseq2 in basic graphics from this.
plot(pcRes$x, col = colData(rlogTissue)$Tissue, pch = 20, cex = 2)
legend("top", legend = c("Heart", "Kidney", "Liver"), fill = unique(colData(rlogTissue)$Tissue))
Now we have constructed the PCA ourselves we can investigate which genes’ expression profiles influence the relative PCs.
The influence (rotation/loadings) for all genes to each PC can be found in the rotation slot of the prcomp object.
$rotation[1:5, 1:4] pcRes
## PC1 PC2 PC3 PC4
## 20671 -0.005479648 0.003577376 -0.006875591 0.0048625659
## 27395 -0.002020427 0.003325506 -0.002302364 -0.0045132749
## 18777 0.004615068 -0.005413345 0.008975098 -0.0028857868
## 21399 0.005568549 0.002485067 0.002615072 -0.0001255119
## 108664 0.005729029 -0.004912143 0.009991580 -0.0031039532
To investigate the separation of Kidney samples along the negative axis of PC2 I can then look at which genes most negatively contribute to PC2.
Here we order by the most negative values for PC2 and select the top 100.
<- sort(pcRes$rotation[, 2], decreasing = FALSE)[1:100]
PC2markers 1:10] PC2markers[
## 20505 22242 16483 56727 77337 57394
## -0.06497365 -0.06001728 -0.05896360 -0.05896220 -0.05595648 -0.05505202
## 18399 20495 22598 20730
## -0.05307506 -0.05256188 -0.05205661 -0.05175604
To investigate the gene expression profiles associated with PC2 we can now plot the log2 foldchanges (or directional statistics) from our pairwise comparisons for our PC2 most influencial genes.
<- heartVsLiver$stat[rownames(heartVsLiver) %in% names(PC2markers)]
PC2_hVsl <- heartVskidney$stat[rownames(heartVskidney) %in% names(PC2markers)]
PC2_hVsk <- LiverVskidney$stat[rownames(LiverVskidney) %in% names(PC2markers)] PC2_LVsk
From the boxplot it is clear to see that the top100 genes are all specifically upregulated in Kidney tissue.
boxplot(PC2_hVsl, PC2_hVsk, PC2_LVsk, names = c("Heart/Liver", "Heart/Kidney", "Liver/Kidney"),
ylab = "log2FC")
Another common step in quality control is to assess the correlation between expression profiles of samples.
We can assess correlation between all samples in a matrix by using the cor() function.
<- cor(rlogMatrix)
sampleCor sampleCor
## Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2
## Sorted_Heart_1 1.0000000 0.9835321 0.7144527 0.7189597
## Sorted_Heart_2 0.9835321 1.0000000 0.7190675 0.7229253
## Sorted_Kidney_1 0.7144527 0.7190675 1.0000000 0.9929131
## Sorted_Kidney_2 0.7189597 0.7229253 0.9929131 1.0000000
## Sorted_Liver_1 0.7156978 0.6883444 0.7344165 0.7336117
## Sorted_Liver_2 0.7186525 0.6918428 0.7366287 0.7396193
## Sorted_Liver_1 Sorted_Liver_2
## Sorted_Heart_1 0.7156978 0.7186525
## Sorted_Heart_2 0.6883444 0.6918428
## Sorted_Kidney_1 0.7344165 0.7366287
## Sorted_Kidney_2 0.7336117 0.7396193
## Sorted_Liver_1 1.0000000 0.9714750
## Sorted_Liver_2 0.9714750 1.0000000
We can visualize the the correlation matrix using a heatmap following sample clustering.
First, we need to convert our correlation matrix into a distance measure to be use in clustering by subtracting from 1 to give dissimilarity measure and converting with the as.dist() to a dist object.
We then create a matrix of distance values to plot in the heatmap from this using as.matrix() function.
<- as.dist(1 - cor(rlogMatrix))
sampleDists <- as.matrix(sampleDists) sampleDistMatrix
We can use the pheatmap library’s pheatmap function to cluster our data by similarity and produce our heatmaps. We provide our matrix of sample distances as well as our dist object to the clustering_distance_rows and clustering_distance_cols function.
By default hierarchical clustering will group samples based on their gene expression similarity into a dendrogram with between sample similarity illustrated by branch length.
library(pheatmap)
pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists)
We can use the brewer.pal and colorRampPalette() function to create a white to blue scale. We cover this in more depth for using with ggplot scales.
library(RColorBrewer)
<- brewer.pal(9, "Blues")
blueColours <- colorRampPalette(rev(blueColours))(255)
colors plot(1:255, rep(1, 255), col = colors, pch = 20, cex = 20, ann = FALSE, yaxt = "n")
We can provide a slightly nicer scale for our distance measure heatmap to the color parameter in the pheatmap function.
pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists,
color = colors)
Finally we can add some column annotation to highlight group membership. We must provide annotation as a data.frame of metadata we wish to include with rownames matching column names.
Fortunately that is exactly as we have set up from DEseq2. We can extract metadata from the DESeq2 object with colData() function and provide it to the annotation_col parameter.
<- as.data.frame(colData(dds))
annoCol pheatmap(sampleDistMatrix, clustering_distance_rows = sampleDists, clustering_distance_cols = sampleDists,
color = colors, annotation_col = annoCol)
We can use the same methods of clustering samples to cluster genes with similar expression patterns together. Clustering genes will allow us to identify the major patterns of gene expressions within our data and to group genes with similar expression profiles for review and functional analysis.
Minimizing the number of genes considered for clustering helps speed things up. To reduce the dataset we can subset to genes that are highly variable using something akin to an ANOVA test.
With DESeq2 we can identify genes significantly changing across groups by comparing our models with and without our groups of interest using the results function.
~ Tissue
~ 1
To run a LRT we must set the parameter of reduced to our alternative model of no groups. We also set the test parameter to LRT to allow us to compare the models.
<- DESeq(dds, test = "LRT", reduced = ~1)
dds2 <- results(dds2)
acrossGroups <- acrossGroups[order(acrossGroups$pvalue), ]
acrossGroups 1:3, ] acrossGroups[
## log2 fold change (MLE): Tissue Liver vs Heart
## LRT p-value: '~ Tissue' vs '~ 1'
## DataFrame with 3 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 16483 124208.2 -2.27414 1.584829 1543.84 0.00000e+00 0.00000e+00
## 17888 43684.0 -14.85201 0.654258 1524.73 0.00000e+00 0.00000e+00
## 21956 16543.9 -15.03634 1.059070 1389.39 1.98399e-302 9.55889e-299
We can use the plotCounts() to get review expression profile of a gene, one at a time. We define the gene of interest to gene parameter and intgroup to specify metadata column to group counts by.
plotCounts(dds2, gene = "17888", intgroup = "Tissue")
Clustering is done on the counts. We can subset our rlog transformed gene expression matrix to those genes significant in our LRT test. This filters the ~45% of genes that are not chganing across our experiment.
<- rownames(acrossGroups)[acrossGroups$padj < 0.01 & !is.na(acrossGroups$padj)]
sigChanges <- rlogMatrix[rownames(rlogMatrix) %in% sigChanges, ]
sigMat nrow(rlogMatrix)
## [1] 14454
nrow(sigMat)
## [1] 8094
We can pass our filtered matrix of expression to the pheatmap() function and set the scale parameter to row to allow for clustering of relative changes in gene expression (this does a by gene Z-score). Additionally due to the large number of genes, we turn rowname off with the show_rownames function.
library(pheatmap)
pheatmap(sigMat, scale = "row", show_rownames = FALSE)
pheatmap(sigMat, scale = "row", show_rownames = FALSE)
Now we have a visual representation of changes in gene expression across samples we can use the clustering to derive groups of genes with similar expression patterns. Gene with similar expression profiles may share functional roles and so we can use these groups to further evaluate our gene expression data.
Many approaches to identifying clustered groups of genes exist including K-means, SOM and HOPACH.
The pheatmap package has in built methods for K means and hierarchical clustering. For K means we can simply provide a desired number of clusters to the kmeans_k parameter. For the moment we will just pick 7.
library(pheatmap)
set.seed(153)
<- pheatmap(sigMat, scale = "row", kmeans_k = 7) k
The resulting plot no longer shows our individual genes but the average relative expression of genes within a cluster.
The heatmap rownames show the cluster name and importantly the number of genes within each cluster.
library(pheatmap)
set.seed(153)
<- pheatmap(sigMat, scale = "row", kmeans_k = 7) k
The pheatmap() function returns information on clustering. This is returned as a list, from which the K-means clustering the assignment of genes to clusters can be extracted.
names(k$kmeans)
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
<- as.data.frame(factor(k$kmeans$cluster))
clusterDF colnames(clusterDF) <- "Cluster"
1:10, , drop = FALSE] clusterDF[
## Cluster
## 20671 1
## 27395 3
## 18777 4
## 21399 5
## 108664 4
## 319263 6
## 76187 4
## 70675 6
## 73824 5
## 100039596 4
We can now plot our full heatmap highlighting the membership of genes to clusters.
We add an additional row annotation by providing a data.frame of desired annotation with rownames matching between our annotation data.frame and our rlog transformed matrix to the annotation_row parameter.
<- sigMat[order(clusterDF$Cluster), ]
OrderByCluster
pheatmap(OrderByCluster, scale = "row", annotation_row = clusterDF, show_rownames = FALSE,
cluster_rows = FALSE)
<- sigMat[order(clusterDF$Cluster), ]
OrderByCluster
pheatmap(OrderByCluster, scale = "row", annotation_row = clusterDF, show_rownames = FALSE,
cluster_rows = FALSE)
When doing clustering you will need to optimize the number of clusters selected. Methods exist to help identify the ideal number.
One such method is to assess the silhoutte score at different successive cluster numbers and choose the cluster number with the highest mean silhoutte score.
The Silhouette method evaluates the similarity of cluster members to the similarity between clusters as below.
For all genes/samples, the dissimilarity for a cluster member to its own cluster , , is calculated as the mean distance between a cluster member and all other members of that cluster. Further to this the minimun, mean dissimilarity of the cluster member to members of other clusters is calculated, .
We can use the NbClust package to calculate the Silhoutte scores over successive cluster numbers. We supply the scaled matrix to the NbClust function and set the min and maximum cluster numbers to try using the min.nc and max.nc respectively.
We can retrieve the optimal cluster number from the Best.nc slot of our result list. Here we see the number is lower at 3. Maybe a cluster for every sample group’s unique gene expression signature.
library(NbClust)
<- t(scale(t(sigMat)))
rowScaledMat <- NbClust(rowScaledMat, distance = "euclidean", min.nc = 2, max.nc = 12,
clusterNum method = "kmeans", index = "silhouette")
$Best.nc clusterNum
## Number_clusters Value_Index
## 3.0000 0.4956
We can the use the Best.partition slot to extract the cluster membership as we did with pheatmap.
We can arrange our matrix by using the match function between the row names of our matrix and the names of genes in our new cluster membership vector.
$Best.partition[1:10] clusterNum
## 20671 27395 18777 21399 108664 319263 76187 70675
## 2 2 1 3 1 3 1 3
## 73824 100039596
## 3 1
<- sort(clusterNum$Best.partition)
orderedCluster <- sigMat[match(names(orderedCluster), rownames(sigMat)), ] sigMat
We can now visualize the new clustering alongside our old clustering. In this case the order of genes is determined by our optimized clustering. While our labeled clusters were derived from our original pheatmap clustering. We can clearly see how these clusters overlap.
pheatmap(sigMat, scale = "row", annotation_row = clusterDF, show_rownames = FALSE,
cluster_rows = FALSE)