class: center, middle, inverse, title-slide # Genomic Variants ~ Session 3
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/RU_GenomicVariants/
--- ## About this session In this session, we will demonstrate how to use several advanced analysis approaches to decipher variants from multiple samples. In this situation, the Mutation Annotation Format (MAF) would be a good choice for storing and presenting your data. For a detailesd overview of MAF file structure please refer to our previous course on [Genomic Formats](http://rockefelleruniversity.github.io/Genomic_Data//presentations/slides/GenomicsData.html#36). Today we will beusing an acute myeloid leukemia (AML) dataset (tcga-laml), which was fetched from [The Cancer Genome Atlas data portal](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The dataset is in the following path "data/tcga_laml.maf". --- ## Outline of this session - Handle MAF with maftools - Get summary by samples and genes - Functional analysis - Mutational signatures --- class: inverse, center, middle # Working with MAF files <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## MAF format As with many data files we can very simply read MAF files into R. ```r laml_tab <- read.delim("data/tcga_laml.maf", sep = "\t") laml_tab[1:2, ] ``` ``` ## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome ## 1 ABCA10 10349 genome.wustl.edu 37 17 ## 2 ABCA4 24 genome.wustl.edu 37 1 ## Start_Position End_position Strand Variant_Classification Variant_Type ## 1 67170917 67170917 + Splice_Site SNP ## 2 94490594 94490594 + Missense_Mutation SNP ## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode ## 1 T T C TCGA-AB-2988 ## 2 C C T TCGA-AB-2869 ## Protein_Change i_TumorVAF_WU i_transcript_name ## 1 p.K960R 45.66 NM_080282.3 ## 2 p.R1517H 38.12 NM_000350.2 ``` --- ## How many mutations per sample ```r tbl <- table(laml_tab$Tumor_Sample_Barcode) hist(tbl, breaks = 10, xlab = "Mutations") ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_samInfo_advan-1.png" style="display: block; margin: auto;" /> --- ## [maftools](https://www.bioconductor.org/packages/release/bioc/html/maftools.html) Bioconductor has a specialist package for handling these files. <img src="imgs/vcfMan_fig7r.png" width="50%" style="display: block; margin: auto;" /> --- ## Load MAF file ```r library(maftools) laml <- read.maf("data/tcga_laml.maf.gz") ``` ``` ## -Reading ## -Validating ## -Silent variants: 475 ## -Summarizing ## -Processing clinical data ## --Missing clinical data ## -Finished in 2.700s elapsed (0.371s cpu) ``` ```r class(laml) ``` ``` ## [1] "MAF" ## attr(,"package") ## [1] "maftools" ``` --- ## Summary by samples or genes maftools provides two functions to generate summary table for by samples or by genes - getSampleSummary: a matrix, count of mutation classes in each sample - getGeneSummary: a matrix, count of mutation classes in each gene --- ## How many mutations in each sample - *getSampleSummary()* - a matrix of mutation classes in each sample ```r sample_sum <- getSampleSummary(laml) sample_sum[1:2, ] ``` ``` ## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del ## 1: TCGA-AB-3009 0 5 0 ## 2: TCGA-AB-2807 1 0 1 ## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total ## 1: 1 25 2 1 34 ## 2: 0 16 3 4 25 ``` --- ## Mutation classes in each sample - convert into table for ggplot - proportion: count of mutation / total mutations for each sample ```r var_to <- sample_sum$total names(var_to) <- sample_sum$Tumor_Sample_Barcode sample_sum <- dplyr::select(sample_sum, -total) melt_dat <- reshape2::melt(sample_sum, id = "Tumor_Sample_Barcode") melt_dat[1:3, ] ``` ``` ## Tumor_Sample_Barcode variable value ## 1 TCGA-AB-3009 Frame_Shift_Del 0 ## 2 TCGA-AB-2807 Frame_Shift_Del 1 ## 3 TCGA-AB-2959 Frame_Shift_Del 0 ``` --- ## Mutation classes in each sample - convert into table for ggplot - proportion: count of mutation / total mutations for each sample ```r melt_dat$totalVar <- var_to[match(melt_dat$Tumor_Sample_Barcode, names(var_to))] melt_dat$prop <- melt_dat$value/melt_dat$totalVar head(melt_dat) ``` ``` ## Tumor_Sample_Barcode variable value totalVar prop ## 1 TCGA-AB-3009 Frame_Shift_Del 0 34 0.00 ## 2 TCGA-AB-2807 Frame_Shift_Del 1 25 0.04 ## 3 TCGA-AB-2959 Frame_Shift_Del 0 23 0.00 ## 4 TCGA-AB-3002 Frame_Shift_Del 0 21 0.00 ## 5 TCGA-AB-2849 Frame_Shift_Del 0 20 0.00 ## 6 TCGA-AB-2923 Frame_Shift_Del 1 20 0.05 ``` --- ## Mutation classes in each sample ```r ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=value,fill=variable))+ geom_bar(stat='identity',position = 'stack')+ labs(x="",y="Mutations",fill="")+ theme(axis.text.x=element_blank()) ``` --- ## Mutation classes in each sample <img src="RU_GenomicVariant_session3_files/figure-html/mult_samSumPlot_advan3-1.png" style="display: block; margin: auto;" /> --- ## Mutation classes in each sample ```r ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=prop,fill=variable))+ geom_bar(stat='identity',position = 'stack')+ labs(x="",y="Proportion",fill="")+ theme(axis.text.x=element_blank()) ``` --- ## Mutation classes in each sample <img src="RU_GenomicVariant_session3_files/figure-html/mult_samSumPlotPres_advan5-1.png" style="display: block; margin: auto;" /> --- ## How many mutations in each gene - *getGeneSummary()* - A matrix, counts of each mutation classes in each gene ```r gene_sum <- getGeneSummary(laml) gene_sum[1:2, ] ``` ``` ## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins ## 1: FLT3 0 0 1 33 ## 2: DNMT3A 4 0 0 0 ## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples ## 1: 15 0 3 52 52 ## 2: 39 5 6 54 48 ## AlteredSamples ## 1: 52 ## 2: 48 ``` --- class: inverse, center, middle # Functional analysis of mutations <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Any mutation hotspot? Mutation hotspot means mutation in a given position detected in more samples than in other positions. Most activating mutations have mutation hotspots. A classic example are the *G12*, *G13*, or *Q61* mutation in KRAS. In contrast, loss of function mutations are usually with no mutation hotspot. The **lollipop plot** is a fancy plot of mutations in each position of a given gene. It's very intuitive for demonstrating mutation hotspots. --- ## Mutation hotspots? - *lollipopPlot()* ```r lollipopPlot(maf = laml, gene = 'NRAS', AACol = 'Protein_Change', showMutationRate = TRUE, labelPos = "all") ``` --- ## Mutation hotspots? <img src="RU_GenomicVariant_session3_files/figure-html/mult_lolli_advan_eval2-1.png" style="display: block; margin: auto;" /> --- ## Interaction between mutations When mutations are send to interact, this means any concordance or exclusivity of two given gene mutations. If two gene mutations concordantly identified in the same patient, it indicates the two mutations could have synergistic effect in cancer or they are in causality. If two gene mutations are exclusive (i.e. not present in the same patient), this could suggest that they lead to similar consequence, therefore there is no selective pressure for both to mutate. The **oncoplot** allows comparison between genes to look for interactions between mutants. --- ## Oncoplot for top 5 mutated genes ```r oncoplot(maf = laml, top = 5) ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_oncoplot_advan-1.png" style="display: block; margin: auto;" /> **IDH1 and IDH2 mutations could be exlusive.** --- ## Pathway analysis maftools provides two functions for pathway analysis. - **OncogenicPathways()**: calculate mutations detected in each pathway and the fraction of samples affected - **PlotOncogenicPathways()**: make a waterfall plot for a given pathway --- ## Any enriched pathways? ```r OncogenicPathways(maf = laml) ``` ``` ## Pathway N n_affected_genes fraction_affected Mutated_samples ## 1: PI3K 29 1 0.03448276 1 ## 2: NRF2 3 1 0.33333333 1 ## 3: TP53 6 2 0.33333333 15 ## 4: WNT 68 3 0.04411765 4 ## 5: MYC 13 3 0.23076923 3 ## 6: NOTCH 71 6 0.08450704 8 ## 7: Hippo 38 7 0.18421053 7 ## 8: RTK-RAS 85 18 0.21176471 97 ## Fraction_mutated_samples ## 1: 0.005181347 ## 2: 0.005181347 ## 3: 0.077720207 ## 4: 0.020725389 ## 5: 0.015544041 ## 6: 0.041450777 ## 7: 0.036269430 ## 8: 0.502590674 ``` --- ## Any enriched pathways? <img src="RU_GenomicVariant_session3_files/figure-html/mult_pathPlotWF_advan2-1.png" style="display: block; margin: auto;" /> Mutations in **RTK-RAS pathway** is much higher than others --- ## Mutations enriched in pathways? ```r PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS") ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_pathPlotWF_advan3-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Mutation Signatures <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- [Mutational signatures](https://cancer.sanger.ac.uk/cosmic/signatures) mean the patterrns of trinucleotide substitution. As mentioned in [session 1](https://rockefelleruniversity.github.io/RU_GenomicVariants/presentations/slides/RU_GenomicVariant_session1.html#44), the pattern may reflect to exposure of spontaneous/enzymatic deamination and defect of DNA repair, etc. This is another common area of study for cancer and genome integrity research. - Example: T-W Chen, C-C Lee, and Y-S Chang et al, *Nat Commun*(2017), 8:465 [link](https://www.nature.com/articles/s41467-017-00493-9) --- ## Steps for mutational signature analysis - Evaluate nucleotide substitutions - Calculate trinucleotide matrix - Estimate Signatures - Compare to databases - Compare signature and gene mutations --- ## Ti/Tv plot - Evaluate single-nucleotide substitution - **titv()** ```r laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE) plotTiTv(res = laml.titv) ``` --- ## Ti/Tv plot <img src="RU_GenomicVariant_session3_files/figure-html/mult_mutSig_TiTv_advan2-1.png" style="display: block; margin: auto;" /> The majority of mutations are **C>T**. --- ## Trinucleotide matrix - calcualte trinucleotide substituion pattern - **trinucleotideMatrix()** ```r library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE) laml.tnm = trinucleotideMatrix(maf = laml, prefix = 'chr', add = TRUE, ref_genome = "BSgenome.Hsapiens.UCSC.hg19") ``` ``` ## -Extracting 5' and 3' adjacent bases ## -Extracting +/- 20bp around mutated bases for background C>T estimation ## -Estimating APOBEC enrichment scores ## --Performing one-way Fisher's test for APOBEC enrichment ## ---APOBEC related mutations are enriched in 3.315 % of samples (APOBEC enrichment score > 2 ; 6 of 181 samples) ## -Creating mutation matrix ## --matrix of dimension 188x96 ``` --- ## Trinucleotide matrix ```r dim(laml.tnm$nmf_matrix) ``` ``` ## [1] 188 96 ``` ```r laml.tnm$nmf_matrix[1, ] ``` ``` ## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C C[C>A]G C[C>A]T G[C>A]A G[C>A]C ## 0 0 0 0 0 0 0 0 0 0 ## G[C>A]G G[C>A]T T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T ## 0 0 0 0 0 0 0 0 0 0 ## C[C>G]A C[C>G]C C[C>G]G C[C>G]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C ## 0 0 0 0 0 0 0 0 0 0 ## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T C[C>T]A C[C>T]C C[C>T]G C[C>T]T ## 0 0 0 0 1 0 1 0 1 1 ## G[C>T]A G[C>T]C G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T A[T>A]A A[T>A]C ## 0 0 1 1 1 0 0 0 0 0 ## A[T>A]G A[T>A]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T ## 1 0 0 0 1 0 0 0 0 0 ## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C A[T>C]G A[T>C]T C[T>C]A C[T>C]C ## 0 0 0 0 0 0 0 0 0 0 ## C[T>C]G C[T>C]T G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C T[T>C]G T[T>C]T ## 1 0 0 0 0 0 0 0 0 0 ## A[T>G]A A[T>G]C A[T>G]G A[T>G]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C ## 0 0 0 0 0 0 0 0 0 0 ## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T ## 0 0 0 0 0 0 ``` --- ## Trinucleotide pattern in TCGA-AB-3009 Lets focus on a single sample ```r tarSam_triNuc <- laml.tnm$nmf_matrix["TCGA-AB-3009", ] tarSam_triNuc[1:2] ``` ``` ## A[C>A]A A[C>A]C ## 1 1 ``` ```r yd <- data.frame(triNuc = names(tarSam_triNuc), count = tarSam_triNuc, stringsAsFactors = FALSE) yd$cat <- gsub("(.*)\\[(.*)\\](.*)", "\\2", yd$triNuc) yd$num <- seq(1, length(yd$triNuc)) ``` --- ## Trinucleotide pattern in TCGA-AB-3009 ```r ggplot(yd,aes(x=num,y=count,fill=cat))+ geom_bar(stat='identity')+ labs(x="",y="Counts",fill="")+ theme(axis.text.x=element_blank()) ``` --- ## Trinucleotide pattern in TCGA-AB-3009 <img src="RU_GenomicVariant_session3_files/figure-html/mult_mutSig_triMutPat_advan4-1.png" style="display: block; margin: auto;" /> **What does this pattern mean?** --- ## Estimate number of signatures In this step, [Cophenetic correlation](https://en.wikipedia.org/wiki/Cophenetic_correlation), *which represents how faithfully a clustering results represent the original data*, would be used to estimate how many signatures could be identified in this dataset. ```r library('NMF') laml.sign <- estimateSignatures(mat = laml.tnm, nTry = 6, pConstant = 0.1, parallel = 1) ``` --- ## Estimate number of signautres <img src="RU_GenomicVariant_session3_files/figure-html/mult_mutSig_sigEst_advan3-1.png" style="display: block; margin: auto;" /> --- ## Extract signautres - Based on cophenetic plot - In this case, number of signatures = 3 ```r laml.sig.ext <- extractSignatures(mat = laml.tnm, n = 3, pConstant = 0.1, parallel = 1) laml.sig.ext$signatures[1:5,] # use for mapping to mutational signature database ``` ``` ## Signature_1 Signature_2 Signature_3 ## A[C>A]A 0.009675202 0.004069562 0.011908054 ## A[C>A]C 0.025910582 0.005584385 0.008636836 ## A[C>A]G 0.010011263 0.002505450 0.005986448 ## A[C>A]T 0.007482971 0.007651267 0.006813378 ## C[C>A]A 0.012005658 0.007071996 0.010397992 ``` --- ## What do the signtures stand for? - Map to mutational signatures databases eg. *COSMIC* ```r laml.og30.cosm = compareSignatures(nmfRes = laml.sig.ext, sig_db = "legacy") laml.og30.cosm$cosine_similarities[,1:5] ``` ``` ## COSMIC_1 COSMIC_2 COSMIC_3 COSMIC_4 COSMIC_5 ## Signature_1 0.840 0.099 0.231 0.198 0.375 ## Signature_2 0.577 0.153 0.313 0.253 0.405 ## Signature_3 0.328 0.229 0.803 0.586 0.851 ``` --- ## Map the signatures to COSMIC database ```r pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE) ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_mapSigPres_advan-1.png" style="display: block; margin: auto;" /> --- ## plot signatures ~ COSMIC - plots signature pattern and annotate basing on COSMIC database ```r plotSignatures(nmfRes = laml.sig.ext, title_size = 1.2, contributions = FALSE, show_title = TRUE, sig_db = 'legacy') ``` --- ## plot signatures ~ COSMIC <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigCOS_advan2-1.png" style="display: block; margin: auto;" /> --- ## Map the signatures to SBS database ```r laml.sign.sbs = compareSignatures(nmfRes = laml.sig.ext, sig_db = "SBS") laml.sign.sbs$cosine_similarities[, 1:5] ``` ``` ## SBS1 SBS2 SBS3 SBS4 SBS5 ## Signature_1 0.858 0.084 0.215 0.155 0.298 ## Signature_2 0.482 0.124 0.314 0.203 0.440 ## Signature_3 0.075 0.186 0.836 0.487 0.823 ``` --- ## Map the signatures to SBS database ```r pheatmap::pheatmap(mat = laml.sign.sbs$cosine_similarities, cluster_rows = FALSE) ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_mapSigSBS_advan2-1.png" style="display: block; margin: auto;" /> --- ## plot signatures ~ SBS - plots signature pattern and annotate basing on SBS database ```r plotSignatures(nmfRes = laml.sig.ext, title_size = 1.2, contributions = FALSE, show_title = TRUE, sig_db = 'SBS') ``` --- ## plot signatures ~ SBS <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigSBS_advan2-1.png" style="display: block; margin: auto;" /> --- ## Mutational signautes in each sample - Display mutational signatures exposure in each sample ```r plotSignatures(nmfRes = laml.sig.ext, title_size = 0.8, contributions = TRUE, show_title = TRUE) ``` --- ## Mutational signautes exposure in each sample <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigSAM_advan2-1.png" style="display: block; margin: auto;" /> --- ## Enrichment analysis - k-mean clustering based on the exposure of mutational signatures - Assign dominate signatures based on k-mean clustering - Mutations in samples with different signatures ```r laml.se = signatureEnrichment(maf = laml, sig_res = laml.sig.ext) ``` --- ## Enrichment analysis <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_enrich_advan2-1.png" style="display: block; margin: auto;" /> --- ## Genes associated with signatures - Genes mutated in samples with particular mutational signature ```r laml.se$groupwise_comparision[1:2, ] ``` ``` ## Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2 ## 1: IDH2 Signature_2 Rest 14 of 65 6 of 123 ## 2: NPM1 Signature_3 Rest 18 of 63 14 of 125 ## p_value OR OR_low OR_high fdr ## 1: 0.0008227612 5.299864 1.790632 17.818258 0.04936567 ## 2: 0.0039456214 3.149946 1.351675 7.491285 0.08958865 ``` --- ## Genes associated with signatures ```r plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05) ``` <img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_enrichGenePlot_advan1-1.png" height="70%" style="display: block; margin: auto;" /> --- ## Exercise Time - [Exercise](../../exercises/exercises/GenomicVariant_session3_exercise.html) - [Answer](../../exercises/answers/GenomicVariant_session3_answers.html)