class: center, middle, inverse, title-slide # Visualizing Genomics Data (part 3)
###
http://rockefelleruniversity.github.io/RU_VisualizingGenomicsData/
--- ##The Course * Dimension reduction * PCA and loadings * Clustering * Grouping by kmeans * Testing gene clusters <!-- --- --> --- ##Reminder of file types In this session we will be dealing with a range of data types. For more information on file types you can revisit our material. * [File Formats](https://rockefelleruniversity.github.io/Genomic_Data/). For more information on visualizing genomics data in browsers you can visit our IGV course. * [IGV](https://rockefelleruniversity.github.io/IGV_course/). --- ##Reminder of data types in Bioconductor We will also encounter and make use of many data structures and data types which we have seen throughout our courses on HTS data. You can revisit this material to refresh on HTS data analysis in Bioconductor and R below. * [Bioconductor](https://rockefelleruniversity.github.io/Bioconductor_Introduction/) * [Genomic Intervals](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicIntervals_In_Bioconductor.html) * [Genomic Scores](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicScores_In_Bioconductor.html) * [Sequences](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/SequencesInBioconductor.html) * [Gene Models](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/GenomicFeatures_In_Bioconductor.html) * [Alignments](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/singlepage/AlignedDataInBioconductor.html) * [ChIP-seq](http://rockefelleruniversity.github.io/RU_ChIPseq/) * [ATAC-seq](http://rockefelleruniversity.github.io/RU_ATACseq/) * [RNA-seq](http://rockefelleruniversity.github.io/RU_RNAseq/) --- ##Materials All material for this course can be found on github. * [Visualizing Genomics Data](https://github.com/RockefellerUniversity/RU_VisualizingGenomicsData) Or can be downloaded as a zip archive from here. * [Download zip](https://github.com/RockefellerUniversity/RU_VisualizingGenomicsData/archive/master.zip) --- ##Materials Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath. * **r_course/presentations/Slides/** Presentations as an HTML slide show. * **r_course/presentations/exercises/** Some tasks/examples to work through. --- ##Materials All data to run code in the presentations and in the practicals is available in the zip archive. This includes coverage as bigWig files, aligned reads as BAM files and genomic intervals stored as BED files. We also include some RData files containing precompiled results from querying database (in case of external server downtime). All data can be found under the **data** directory **data/** --- ##Set the Working directory Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived. You may navigate to the unarchived visualizingGenomicsData folder in the Rstudio menu **Session -> Set Working Directory -> Choose Directory** or in the console. ```r setwd("/PathToMyDownload/VisualizingGenomicsData/r_course/") # e.g. setwd("~/Downloads/VisualizingGenomicsData/r_course") ``` --- ## Recap Previously we have reviewed how we can review signal and annotation over individual sites using and genome browser such as IGV and or programmatically using Gviz. .pull-left[  ] .pull-right[  ] --- ## Vizualising Biological data With many high throughput sequencing experiments we are able to gain a genome wide view of our assays (in RNAseq we measure every gene's expression, in ChIPseq we identify all events for a user-defined transcription factor/mark, in ATACseq we evaluate all open/accessible regions in the genome.)  --- ## Vizualizing Biological Data Biological data is often considered to have high dimensionality. Common techniques used to visualize genomics data include dimension reduction and/or clustering followed by the graphical representation of data as a heatmap. These techniques makes interpretation of the data simpler to better identify patterns within our data i.e. reproducibility of replicates within groups and magnitude of changes in signal between groups. <div align="center"> <img src="imgs/large.png" alt="igv" height="400" width="800"> </div> --- ## Vizualising high dimensional Biological data. In todays session we will work with some of the RNAseq data of adult mouse tissues from Bing Ren's lab, Liver and Heart. - More information on liver data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CHB/) - More information on heart data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/) - More information on Kidney data [can be found here ](https://www.encodeproject.org/experiments/ENCSR000CGZ/) This represent a more complex experimental design than the two group comparison we have often used i.e. between activated and naive t-cells. We will use some clustering and dimensional reduction techniques to interrogate this data. --- ## Counts from SummariseOverlaps In our RNAseq session we counted our reads in genes using the summarise overlaps function to generate our **RangedSummarizedExperiment** object. First we create our BamFileList object to control memort usage. ```r library(Rsamtools) bamFilesToCount <- c("Sorted_Heart_1.bam","Sorted_Heart_2.bam", "Sorted_Kidney_1.bam","Sorted_Kidney_2.bam", "Sorted_Liver_1.bam","Sorted_Liver_2.bam") myBams <- BamFileList(bamFilesToCount,yieldSize = 10000) ``` --- ## Counts from SummariseOverlaps With this BamFileList we can summarise from BAMs to produce our **RangedSummarizedExperiment** object of counts in genes across all samples. We add in some metadata information for use with DESeq2 to our object with the **colData()** function. ```r library(TxDb.Mmusculus.UCSC.mm10.knownGene) library(GenomicAlignments) geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene,by="gene") geneCounts <- summarizeOverlaps(geneExons,myBams, ignore.strand = TRUE) colData(geneCounts)$tissue <- c("Heart", "Heart","Kidney","Kidney","Liver","Liver") ``` --- ## DESeq2 From here we can **DESeqDataSet()** function to build directly from our **RangedSummarizedExperiment** object specifying the metadata column to test on to the **design** parameter. We then use the **DESeq()** fuction to normalise, fit variance, remove outliers and we extract comparisons of interest using the **results()** function. ```r dds <- DESeqDataSet(geneCounts,design = ~Tissue) dds <- DESeq(dds) heartVsLiver <- results(dds,c("Tissue","Heart","Liver")) heartVskidney <- results(dds,c("Tissue","Heart","Kidney")) LiverVskidney <- results(dds,c("Tissue","Liver","Kidney")) ``` --- ## DESeq2 Further to our pair-wise comparisons we can extract an anova-like statistic identifying genes changing expression between conditions using the *LRT* test with our reduced model (no groups model). ```r dds2 <- DESeq(dds,test="LRT",reduced = ~1) AllChanges <- results(dds2) ``` --- ## Visualizing RNAseq data One of the first steps of working with count data for visualization is commonly to transform the integer count data to log2 scale. To do this we will need to add some artificial value (pseudocount) to zeros in our counts prior to log transform (since the log2 of zero is infinite). The DEseq2 **normTransform()** will add a 1 to our normalized counts prior to log2 transform and return a **DESeqTransform** object. ```r normLog2Counts <- normTransform(dds) normLog2Counts ``` ``` ## class: DESeqTransform ## dim: 14454 6 ## metadata(1): version ## assays(1): '' ## rownames(14454): 20671 27395 ... 26900 170942 ## rowData names(26): baseMean baseVar ... deviance maxCooks ## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1 ## Sorted_Liver_2 ## colData names(2): Tissue sizeFactor ``` --- ## Visualizing RNAseq data We can extract our normalized and transformed counts from the **DESeqTransform** object using the **assay()** function. ```r matrixOfNorm <- assay(normLog2Counts) boxplot(matrixOfNorm, las=2, names=c("Heart_1","Heart_2", "Kidney_1","Kidney_2", "Liver_1", "Liver_2")) ``` <!-- --> --- ## Visualizing RNAseq data When visualizing our signal however we now will have a similar problem with smaller counts having higher variance. This may cause changes in smaller counts to have undue influence in visualization and clustering. ```r library(vsn) vsn::meanSdPlot(matrixOfNorm) ``` <!-- --> --- ## Visualizing RNAseq data We can apply an **rlog** transformation to our data using the **rlog()** function which will attempt to shrink the variance for genes based on their mean expression. ```r rlogTissue <- rlog(dds) rlogTissue ``` ``` ## class: DESeqTransform ## dim: 14454 6 ## metadata(1): version ## assays(1): '' ## rownames(14454): 20671 27395 ... 26900 170942 ## rowData names(27): baseMean baseVar ... dispFit rlogIntercept ## colnames(6): Sorted_Heart_1 Sorted_Heart_2 ... Sorted_Liver_1 ## Sorted_Liver_2 ## colData names(2): Tissue sizeFactor ``` --- ## Visualizing RNAseq data. Again we can extract the matrix of transformed counts with the **assay()** function and plot the mean/variance relationship. If we look at the axis we can see the shrinkage of variance for low count genes. .pull-left[ ```r rlogMatrix <- assay(rlogTissue) vsn::meanSdPlot(rlogMatrix) ``` <!-- --> ] .pull-right[ ```r library(vsn) vsn::meanSdPlot(matrixOfNorm) ``` <!-- --> ] --- ## Dimension reduction Since we have often have measured 1000s of genes over multiple samples/groups we will often try and simplify this too a few dimensions or meta/eigen genes which represent major patterns of signal across samples found. We hope the strongest patterns or sources of variation in our data are correlated with sample groups and so dimension reduction offers a methods to method to visually identify reproducibility of samples. Common methods of dimension reduction include Principal Component Analysis, MultiFactorial Scaling and Non-negative Matrix Factorization. <div align="center"> <img src="imgs/metaGeneFull.png" alt="igv" height="300" width="800"> </div> --- ## PCA We can see PCA in action with our data simply by using the DESeq2's **plotPCA()** function and our **DESeqTransform** object from our rlog transformation. We must also provide a metadata column to colour samples by to the **intgroup** parameter and we set the **ntop** parameter to use all genes in PCA (by default it is top 500). ```r plotPCA(rlogTissue, intgroup="Tissue", ntop=nrow(rlogTissue)) ``` <!-- --> --- ## PCA This PCA show the separation of samples by their group and the localization of samples with groups. Since PC1 here explains 51% of total variances between samples and PC2 explains 44%, the reduction of dimensions can be seen to explain much of the changes among samples in 2 dimensions. Of further note is the separation of samples across PC1 but the lack of separation of Heart and Kidney samples along PC2. <!-- --> --- ## PCA PCA is often used to simply visualize sample similarity, but we can extract further information of the patterns of expression corresponding PCs by performing the PCA analysis ourselves. We can use the **prcomp()** function with a transposition of our matrix to perform our prinicipal component analysis. The mappings of samples to PCs can be found in the **x** slot of the **prcomp** object. ```r pcRes <- prcomp(t(rlogMatrix)) class(pcRes) ``` ``` ## [1] "prcomp" ``` ```r pcRes$x[1:2,] ``` ``` ## PC1 PC2 PC3 PC4 PC5 PC6 ## Sorted_Heart_1 -103.8999 52.45800 -0.2401797 25.22797 1.0609927 2.437217e-13 ## Sorted_Heart_2 -117.8517 46.07786 0.4445612 -23.96120 -0.7919848 2.426392e-13 ``` --- ## PCA We can now reproduce the previous plot from DEseq2 in basic graphics from this. ```r plot(pcRes$x, col=colData(rlogTissue)$Tissue, pch=20, cex=2) legend("top",legend = c("Heart","Kidney","Liver"), fill=unique(colData(rlogTissue)$Tissue)) ``` <!-- --> --- ## PCA Now we have constucted the PCA ourselves we can investigate which genes' expression profiles influence the relative PCs. The influence (rotation/loadings) for all genes to each PC can be found in the **rotation** slot of the **prcomp** object. ```r pcRes$rotation[1:5,1:4] ``` ``` ## PC1 PC2 PC3 PC4 ## 20671 -0.005479648 0.003577376 -0.006875591 0.0048625659 ## 27395 -0.002020427 0.003325506 -0.002302364 -0.0045132749 ## 18777 0.004615068 -0.005413345 0.008975098 -0.0028857868 ## 21399 0.005568549 0.002485067 0.002615072 -0.0001255119 ## 108664 0.005729029 -0.004912143 0.009991580 -0.0031039532 ``` --- ## PCA To investigate the seperation of Kidney samples along the negative axis of PC2 I can then look at which genes most negatively contribute to PC2. Here we order by the most negative values for PC2 and select the top 100. ```r PC2markers <- sort(pcRes$rotation[,2],decreasing = FALSE)[1:100] PC2markers[1:10] ``` ``` ## 20505 22242 16483 56727 77337 57394 ## -0.06497365 -0.06001728 -0.05896360 -0.05896220 -0.05595648 -0.05505202 ## 18399 20495 22598 20730 ## -0.05307506 -0.05256188 -0.05205661 -0.05175604 ``` --- ## PCA To investigate the gene expression profiles associated with PC2 we can now plot the log2 foldchanges ( or directional statistics) from our pairwise comparisons for our PC2 most influencial genes. From the boxplot it is clear to see that the top 100 genes are all specifically upregulated in Kidney tissue. ```r PC2_hVsl <- heartVsLiver$stat[rownames(heartVsLiver) %in% names(PC2markers)] PC2_hVsk <- heartVskidney$stat[rownames(heartVskidney) %in% names(PC2markers)] PC2_LVsk <- LiverVskidney$stat[rownames(LiverVskidney) %in% names(PC2markers)] ``` --- ## PCA From the boxplot it is clear to see that the top100 genes are all specifically upregulated in Kidney tissue. ```r boxplot(PC2_hVsl,PC2_hVsk,PC2_LVsk, names=c("Heart/Liver","Heart/Kidney","Liver/Kidney"),ylab="log2FC") ``` <!-- --> --- ## Sample-to-Sample correlation Another common step in quality control is to assess the correlation between expression profiles of samples. We can assess correlation between all samples in a matrix by using the **cor()** function. ```r sampleCor <- cor(rlogMatrix) sampleCor ``` ``` ## Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2 ## Sorted_Heart_1 1.0000000 0.9835321 0.7144527 0.7189597 ## Sorted_Heart_2 0.9835321 1.0000000 0.7190675 0.7229253 ## Sorted_Kidney_1 0.7144527 0.7190675 1.0000000 0.9929131 ## Sorted_Kidney_2 0.7189597 0.7229253 0.9929131 1.0000000 ## Sorted_Liver_1 0.7156978 0.6883444 0.7344165 0.7336117 ## Sorted_Liver_2 0.7186525 0.6918428 0.7366287 0.7396193 ## Sorted_Liver_1 Sorted_Liver_2 ## Sorted_Heart_1 0.7156978 0.7186525 ## Sorted_Heart_2 0.6883444 0.6918428 ## Sorted_Kidney_1 0.7344165 0.7366287 ## Sorted_Kidney_2 0.7336117 0.7396193 ## Sorted_Liver_1 1.0000000 0.9714750 ## Sorted_Liver_2 0.9714750 1.0000000 ``` --- ## Sample-to-Sample correlation We can visualize the the correlation matrix using a heatmap following sample clustering. First, we need to convert our correlation matrix into a distance measure to be use in clustering by subtracting from 1 to give dissimilarity measure and converting with the **as.dist()** to a **dist** object. We then create a matrix of distance values to plot in the heatmap from this using **as.matrix()** function. ```r library(pheatmap) sampleDists <- as.dist(1-cor(rlogMatrix)) sampleDistMatrix <- as.matrix(sampleDists) ``` --- ## Sample-to-Sample correlation We can use the **pheatmap** library's pheatmap function to cluster our data by similarity and produce our heatmaps. We provide our matrix of sample distances as well as our **dist** object to the **clustering_distance_rows** and **clustering_distance_cols** function. By default hierarchical clustering will group samples based on their gene expression similarity into a dendrogram with between sample similarity illustrated by branch length. ```r pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists) ``` <!-- --> --- ## Sample-to-Sample correlation We can use the **brewer.pal** and **colorRampPalette()** function to create a white to blue scale [as we have with **ggplot** scales](https://rockefelleruniversity.github.io/Plotting_In_R/r_course/presentations/slides/ggplot2.html#60). ```r library(RColorBrewer) blueColours <- brewer.pal(9, "Blues") colors <- colorRampPalette(rev(blueColours))(255) plot(1:255,rep(1,255), col=colors,pch=20,cex=20,ann=FALSE, yaxt="n") ``` <!-- --> --- ## Sample-to-Sample correlation We can provide a slightly nicer scale for our distance measure heatmap to the **color** parameter in the **pheatmap** function. ```r pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists, color = colors) ``` <!-- --> --- ## Sample-to-Sample correlation Finally we can add some column annotation to highlight group membership. We must provide annotation as a data.frame of metadata we wish to include with rownames matching column names. Fortunetely that is exactly as we have set up from DEseq2. We can extract metadata from the DESeq2 object with **colData()** function and provide to the **annotation_col** parameter. ```r annoCol <- as.data.frame(colData(dds)) pheatmap(sampleDistMatrix, clustering_distance_rows=sampleDists, clustering_distance_cols=sampleDists, color = colors,annotation_col = annoCo) ``` <!-- --> --- ## Clustering genes and samples We can use the same methods of clustering samples to cluster genes with similar expression patterns together.Clustering genes will allow us to visually identify the major patterns of gene expressions with our data and to group genes with similar expression profiles for review and functional analysis. This technique will hopefully allow us to disentangle the complex sets of comparison found in evaluating significant differences found from our DESeq2 LRT test of changes in genes' expression between any tissue. First then lets subset our rlog transformed gene expression matrix to those genes significant in our LRT test. ```r sigChanges <- rownames(AllChanges)[AllChanges$padj < 0.01 & !is.na(AllChanges$padj)] sigMat <- rlogMatrix[rownames(rlogMatrix) %in% sigChanges,] nrow(sigMat) ``` ``` ## [1] 8094 ``` --- ## Clustering genes and samples We can pass our filtered matrix of expression to the **pheatmap()** function and set the **scale** parameter to *row* to allow for clustering of relative changes in gene expression. Additionally due to the large number of genes, we turn rowname off with the **show_rownames** function. ```r library(pheatmap) pheatmap(sigMat, scale="row", show_rownames = FALSE) ``` <!-- --> --- ## Clustering genes and samples ```r pheatmap(sigMat, scale="row", show_rownames = FALSE) ``` <!-- --> --- ## Clustering genes and samples Now we have a visual representation of changes in gene expression across samples we can use the clustering to derive groups of genes with similar expression patterns. Gene with similar expression profiles may share functional roles and so we can use these groups to further evaluate our gene expression data. Many approaches to identifying clustered groups of genes exist including K-means, SOM and HOPACH. K-means is implemented with the pheatmap package and so we can simply provide a desired number of clusters to the **kmeans_k** parameter. ```r library(pheatmap) set.seed(153) k <- pheatmap(sigMat, scale="row",kmeans_k = 7) ``` --- ## Clustering genes and samples The resulting plot no longer shows our individual genes but the average relative expression of genes within a cluster. The heatmap rownames show the cluster name and importantly the number of genes within each cluster. ```r library(pheatmap) set.seed(153) k <- pheatmap(sigMat, scale="row",kmeans_k = 7) ``` <!-- --> --- ## Clustering genes and samples The **pheatmap()** function returns information on clustering is return as a list and following k-means clustering the assignment of genes to clusters can be extracted from this. ```r names(k$kmeans) ``` ``` ## [1] "cluster" "centers" "totss" "withinss" "tot.withinss" ## [6] "betweenss" "size" "iter" "ifault" ``` ```r clusterDF <- as.data.frame(factor(k$kmeans$cluster)) colnames(clusterDF) <- "Cluster" clusterDF[1:10,,drop=FALSE] ``` ``` ## Cluster ## 20671 1 ## 27395 3 ## 18777 4 ## 21399 5 ## 108664 4 ## 319263 6 ## 76187 4 ## 70675 6 ## 73824 5 ## 100039596 4 ``` --- ## Clustering genes and samples We can now plot our full heatmap highlighting the membership of genes to clusters. We add an additional row annotation by providing a data.frame of desired annotation with rownames matching between our annotation data.frame and our rlog transformed matrix to the **annotation_row** parameter. ```r OrderByCluster <- sigMat[order(clusterDF$Cluster),] pheatmap(OrderByCluster, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE) ``` --- ## Clustering genes and samples ```r OrderByCluster <- sigMat[order(clusterDF$Cluster),] pheatmap(OrderByCluster, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE) ``` <!-- --> --- ## Identifying optimal clusters Methods exist to identify the optimal number of clusters within your data. One such method is to assess the **silhoutte** score at different successive cluster numbers and choose the cluster number with the highest mean **silhoutte** score. The **Silhouette** method evaluates the similarity of cluster members to the similarity between clusters as below. For all genes/samples, the dissimilarity for a cluster member to its own cluster , <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>a</mi><mi>i</mi></msub></math>, is calculated as the mean distance between a cluster member and all other members of that cluster. Further to this the minimun, mean dissimilarity of the cluster member to members of other clusters is calculated, <math xmlns="http://www.w3.org/1998/Math/MathML"><msub><mi>b</mi><mi>i</mi></msub></math>. <math xmlns="http://www.w3.org/1998/Math/MathML"> <msub> <mi>S</mi> <mi>i</mi> </msub> <mo>=</mo> <mo stretchy="false">(</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo>−<!-- − --></mo> <msub> <mi>a</mi> <mi>i</mi> </msub> <mo stretchy="false">)</mo> <mrow class="MJX-TeXAtom-ORD"> <mo>/</mo> </mrow> <mi>m</mi> <mi>a</mi> <mi>x</mi> <mo stretchy="false">(</mo> <msub> <mi>a</mi> <mi>i</mi> </msub> <mo>,</mo> <msub> <mi>b</mi> <mi>i</mi> </msub> <mo stretchy="false">)</mo> </math> --- ## Identifying number of clusters We can use the NbClust package to calculate the **Silhoutte** scores over successive cluster numbers. We supply the scaled matrix to the *NbClust* function and set the min and maximum cluster numbers to try using the **min.nc** and **max.nc** respectively. We can retrieve the optimal cluster number from the **Best.nc** slot of our result list. Here we see the number is lower at 3, a cluster for every sample group's unique gene expression signature. ```r library(NbClust) rowScaledMat <- t(scale(t(sigMat))) clusterNum <- NbClust(rowScaledMat,distance = "euclidean", min.nc = 2, max.nc = 12, method = "kmeans", index ="silhouette") clusterNum$Best.nc ``` ``` ## Number_clusters Value_Index ## 3.0000 0.4956 ``` --- ## Identifying number of clusters We can the use the **Best.partition** slot to extract the cluster membership as we did with pheatmap. We can arrange our matrix by using the **match** function between the row names of our matrix and the names of genes in our new cluster membership vector. ```r clusterNum$Best.partition[1:10] ``` ``` ## 20671 27395 18777 21399 108664 319263 76187 70675 ## 2 2 1 3 1 3 1 3 ## 73824 100039596 ## 3 1 ``` ```r orderedCluster <- sort(clusterNum$Best.partition) sigMat <- sigMat[match(names(orderedCluster),rownames(sigMat)),] ``` --- ## Identifying number of clusters We can now visualize the new clustering alongside our old clustering. ```r pheatmap(sigMat, scale="row",annotation_row = clusterDF, show_rownames = FALSE,cluster_rows = FALSE) ``` <!-- --> --- ## Testing gene clusters Now we have some genes clusters from our expression data, we will want to investigate these clusters for any enrichment of functional terms within them. Here we extract all genes in cluster 1, which appear to be our heart specific genes. ```r heartSpecific <- rownames(clusterDF[clusterDF$Cluster == 1,,drop=FALSE]) heartSpecific[1:10] ``` ``` ## [1] "20671" "109294" "240725" "240726" "14048" "29819" "57339" "74229" ## [9] "13518" "214855" ``` --- ## Testing gene clusters We can now build our logical vector of all genes ( from those with Padj values from LRT test) with the genes in cluster marked as TRUE. We will use the logical vector for geneset testing using **goseq**. ```r bckGround <- rownames(AllChanges)[!is.na(AllChanges$padj)] heartSpecific <- bckGround %in% heartSpecific names(heartSpecific) <- bckGround heartSpecific[1:10] ``` ``` ## 20671 27395 18777 21399 108664 18387 12421 319263 59014 76187 ## TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ``` --- ## Testing gene clusters We can now pass our named logical vector to the **nullp** and **goseq** function seen before in geneset testing in RNAseq and ChIPseq. Here we test against the GO Biological Processes genesets. ```r library(goseq) toTest <- nullp(heartSpecific,"mm10",id = "knownGene") ``` <!-- --> --- ## Testing gene clusters From our GO test we can see enrichment for some heart specific terms such as *muscle development*. ```r heartRes <- goseq(toTest,"mm10","knownGene",test.cats = "GO:BP") heartRes[1:5,] ``` ``` ## category over_represented_pvalue under_represented_pvalue numDEInCat ## 9597 GO:0061061 5.223231e-38 1 186 ## 10934 GO:0072359 7.728431e-34 1 256 ## 8038 GO:0048646 7.238112e-29 1 240 ## 8977 GO:0060047 4.576488e-28 1 83 ## 973 GO:0003012 5.109725e-28 1 120 ## numInCat term ## 9597 581 muscle structure development ## 10934 982 circulatory system development ## 8038 960 anatomical structure formation involved in morphogenesis ## 8977 185 heart contraction ## 973 341 muscle system process ## ontology ## 9597 BP ## 10934 BP ## 8038 BP ## 8977 BP ## 973 BP ``` --- ## PCA Finally we can use our heatmaps to get a better idea of the pattern of expression associated with PC1 plot we saw earlier. ```r plotPCA(rlogTissue, intgroup="Tissue", ntop=nrow(rlogTissue)) ``` <!-- --> --- ## PCA We can extract and sort our expression matrix by the loadings (influence) of genes on PC1. Here we use the **match()** fucntion to sort our rlog transformed gene expression matrix to the order of PC1 influence ```r PC1_rnk <- sort(pcRes$rotation[,1],decreasing = TRUE) PC1_mat <- sigMat[match(names(PC1_rnk),rownames(sigMat),nomatch = 0),] PC1_mat[1:3,] ``` ``` ## Sorted_Heart_1 Sorted_Heart_2 Sorted_Kidney_1 Sorted_Kidney_2 ## 100039206 4.648873 3.247752 2.567747 2.576385 ## 100039028 6.094549 3.833855 3.491466 3.326275 ## 100041658 4.525257 3.561079 2.328869 2.338274 ## Sorted_Liver_1 Sorted_Liver_2 ## 100039206 14.08565 11.12317 ## 100039028 15.12364 11.96021 ## 100041658 12.81601 10.45862 ``` --- ## PCA Now we plot our heatmap of gene expression ordered by PC1 influence to reveal the PC1's expression profile. ```r pheatmap(PC1_mat, scale="row", cluster_rows=FALSE, show_rownames = FALSE,annotation_col = annoCol ) ``` <!-- --> --- ##Exercises Time for exercises! [Link here](../../exercises/exercises/Viz_part3_exercise.html) --- ##Solutions Time for solutions! [Link here](../../exercises/answers/Viz_part3_answers.html)