RU_GenomicVariant_session3.utf8

class: center, middle, inverse, title-slide

# Genomic Variants ~ Session 3
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/RU_GenomicVariants/" class="uri">http://rockefelleruniversity.github.io/RU_GenomicVariants/</a>

---

## About this session
In this session, we will demonstrate how to use several advanced analysis approaches to decipher variants from multiple samples. In this situation, the Mutation Annotation Format (MAF) would be a good choice for storing and presenting your data. For a detailesd overview of MAF file structure please refer to our previous course on [Genomic Formats](http://rockefelleruniversity.github.io/Genomic_Data//presentations/slides/GenomicsData.html#36).

Today we will beusing an acute myeloid leukemia (AML) dataset (tcga-laml), which was fetched from [The Cancer Genome Atlas data portal](https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga). The dataset is in the following path "data/tcga_laml.maf".

---
## Outline of this session
- Handle MAF with maftools
- Get summary by samples and genes
- Functional analysis
- Mutational signatures

---
class: inverse, center, middle

# Working with MAF files

---

## MAF format
As with many data files we can very simply read MAF files into R.

```r
laml_tab <- read.delim("data/tcga_laml.maf", sep = "\t")
laml_tab[1:2, ]
```

```
##   Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1      ABCA10          10349 genome.wustl.edu         37         17
## 2       ABCA4             24 genome.wustl.edu         37          1
##   Start_Position End_position Strand Variant_Classification Variant_Type
## 1       67170917     67170917      +            Splice_Site          SNP
## 2       94490594     94490594      +      Missense_Mutation          SNP
##   Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1                T                 T                 C         TCGA-AB-2988
## 2                C                 C                 T         TCGA-AB-2869
##   Protein_Change i_TumorVAF_WU i_transcript_name
## 1        p.K960R         45.66       NM_080282.3
## 2       p.R1517H         38.12       NM_000350.2
```

---
## How many mutations per sample

```r
tbl <- table(laml_tab$Tumor_Sample_Barcode)
hist(tbl, breaks = 10, xlab = "Mutations")
```

---
## [maftools](https://www.bioconductor.org/packages/release/bioc/html/maftools.html)
Bioconductor has a specialist package for handling these files. 
<img src="imgs/vcfMan_fig7r.png" width="50%" style="display: block; margin: auto;" />

---
## Load MAF file

```r
library(maftools)
laml <- read.maf("data/tcga_laml.maf.gz")
```

```
## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 2.700s elapsed (0.371s cpu)
```

```r
class(laml)
```

```
## [1] "MAF"
## attr(,"package")
## [1] "maftools"
```

---
## Summary by samples or genes
maftools provides two functions to generate summary table for by samples or by genes
- getSampleSummary: a matrix, count of mutation classes in each sample
- getGeneSummary: a matrix, count of mutation classes in each gene

---
## How many mutations in each sample
- *getSampleSummary()*
- a matrix of mutation classes in each sample

```r
sample_sum <- getSampleSummary(laml)
sample_sum[1:2, ]
```

```
##    Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1:         TCGA-AB-3009               0               5            0
## 2:         TCGA-AB-2807               1               0            1
##    In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1:            1                25                 2           1    34
## 2:            0                16                 3           4    25
```

---
## Mutation classes in each sample
- convert into table for ggplot
- proportion: count of mutation / total mutations for each sample

```r
var_to <- sample_sum$total
names(var_to) <- sample_sum$Tumor_Sample_Barcode
sample_sum <- dplyr::select(sample_sum, -total)
melt_dat <- reshape2::melt(sample_sum, id = "Tumor_Sample_Barcode")
melt_dat[1:3, ]
```

```
##   Tumor_Sample_Barcode        variable value
## 1         TCGA-AB-3009 Frame_Shift_Del     0
## 2         TCGA-AB-2807 Frame_Shift_Del     1
## 3         TCGA-AB-2959 Frame_Shift_Del     0
```

---
## Mutation classes in each sample
- convert into table for ggplot
- proportion: count of mutation / total mutations for each sample

```r
melt_dat$totalVar <- var_to[match(melt_dat$Tumor_Sample_Barcode, names(var_to))]
melt_dat$prop <- melt_dat$value/melt_dat$totalVar
head(melt_dat)
```

```
##   Tumor_Sample_Barcode        variable value totalVar prop
## 1         TCGA-AB-3009 Frame_Shift_Del     0       34 0.00
## 2         TCGA-AB-2807 Frame_Shift_Del     1       25 0.04
## 3         TCGA-AB-2959 Frame_Shift_Del     0       23 0.00
## 4         TCGA-AB-3002 Frame_Shift_Del     0       21 0.00
## 5         TCGA-AB-2849 Frame_Shift_Del     0       20 0.00
## 6         TCGA-AB-2923 Frame_Shift_Del     1       20 0.05
```

---
## Mutation classes in each sample

```r
ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=value,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Mutations",fill="")+
  theme(axis.text.x=element_blank())
```

---
## Mutation classes in each sample
<img src="RU_GenomicVariant_session3_files/figure-html/mult_samSumPlot_advan3-1.png" style="display: block; margin: auto;" />

---
## Mutation classes in each sample

```r
ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=prop,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Proportion",fill="")+
  theme(axis.text.x=element_blank())
```

---
## Mutation classes in each sample
<img src="RU_GenomicVariant_session3_files/figure-html/mult_samSumPlotPres_advan5-1.png" style="display: block; margin: auto;" />

---
## How many mutations in each gene
- *getGeneSummary()*
- A matrix, counts of each mutation classes in each gene

```r
gene_sum <- getGeneSummary(laml)
gene_sum[1:2, ]
```

```
##    Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1:        FLT3               0               0            1           33
## 2:      DNMT3A               4               0            0            0
##    Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1:                15                 0           3    52             52
## 2:                39                 5           6    54             48
##    AlteredSamples
## 1:             52
## 2:             48
```

---
class: inverse, center, middle

# Functional analysis of mutations

---

## Any mutation hotspot?
Mutation hotspot means mutation in a given position detected in more samples than in other positions. Most activating mutations have mutation hotspots. A classic example  are the *G12*, *G13*, or *Q61* mutation in KRAS. In contrast, loss of function mutations are usually with no mutation hotspot. The **lollipop plot** is a fancy plot of mutations in each position of a given gene. It's very intuitive for demonstrating mutation hotspots.

---
## Mutation hotspots? 
- *lollipopPlot()*

```r
lollipopPlot(maf = laml, 
             gene = 'NRAS', 
             AACol = 'Protein_Change', 
             showMutationRate = TRUE,
             labelPos = "all")
```

---
## Mutation hotspots? 
<img src="RU_GenomicVariant_session3_files/figure-html/mult_lolli_advan_eval2-1.png" style="display: block; margin: auto;" />

---
## Interaction between mutations
When mutations are send to interact, this means any concordance or exclusivity of two given gene mutations. If two gene mutations concordantly identified in the same patient, it indicates the two mutations could have synergistic effect in cancer or they are in causality. If two gene mutations are exclusive (i.e. not present in the same patient), this could suggest that they lead to similar consequence, therefore there is no selective pressure for both to mutate. The **oncoplot** allows comparison between genes to look for interactions between mutants.

---
## Oncoplot for top 5 mutated genes

```r
oncoplot(maf = laml, top = 5)
```

<img src="RU_GenomicVariant_session3_files/figure-html/mult_oncoplot_advan-1.png" style="display: block; margin: auto;" />
**IDH1 and IDH2 mutations could be exlusive.**

---
## Pathway analysis
maftools provides two functions for pathway analysis.
- **OncogenicPathways()**: calculate mutations detected in each pathway and the fraction of samples affected
- **PlotOncogenicPathways()**: make a waterfall plot for a given pathway

---
## Any enriched pathways?

```r
OncogenicPathways(maf = laml)
```

```
##    Pathway  N n_affected_genes fraction_affected Mutated_samples
## 1:    PI3K 29                1        0.03448276               1
## 2:    NRF2  3                1        0.33333333               1
## 3:    TP53  6                2        0.33333333              15
## 4:     WNT 68                3        0.04411765               4
## 5:     MYC 13                3        0.23076923               3
## 6:   NOTCH 71                6        0.08450704               8
## 7:   Hippo 38                7        0.18421053               7
## 8: RTK-RAS 85               18        0.21176471              97
##    Fraction_mutated_samples
## 1:              0.005181347
## 2:              0.005181347
## 3:              0.077720207
## 4:              0.020725389
## 5:              0.015544041
## 6:              0.041450777
## 7:              0.036269430
## 8:              0.502590674
```

---
## Any enriched pathways?
<img src="RU_GenomicVariant_session3_files/figure-html/mult_pathPlotWF_advan2-1.png" style="display: block; margin: auto;" />
Mutations in **RTK-RAS pathway** is much higher than others

---
## Mutations enriched in pathways?

```r
PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")
```

---
class: inverse, center, middle

# Mutation Signatures

---

[Mutational signatures](https://cancer.sanger.ac.uk/cosmic/signatures) mean the patterrns of trinucleotide substitution. As mentioned in [session 1](https://rockefelleruniversity.github.io/RU_GenomicVariants/presentations/slides/RU_GenomicVariant_session1.html#44), the pattern may reflect to exposure of spontaneous/enzymatic deamination and defect of DNA repair, etc.

This is another common area of study for cancer and genome integrity research.
- Example: T-W Chen, C-C Lee, and Y-S Chang et al, *Nat Commun*(2017), 8:465 [link](https://www.nature.com/articles/s41467-017-00493-9)

---
## Steps for mutational signature analysis
- Evaluate nucleotide substitutions
- Calculate trinucleotide matrix
- Estimate Signatures
- Compare to databases
- Compare signature and gene mutations

---
## Ti/Tv plot
- Evaluate single-nucleotide substitution
- **titv()**

```r
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv)
```

---
## Ti/Tv plot
<img src="RU_GenomicVariant_session3_files/figure-html/mult_mutSig_TiTv_advan2-1.png" style="display: block; margin: auto;" />
The majority of mutations are **C>T**.

---
## Trinucleotide matrix
- calcualte trinucleotide substituion pattern
- **trinucleotideMatrix()**

```r
library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
laml.tnm = trinucleotideMatrix(maf = laml,
                               prefix = 'chr', 
                               add = TRUE, 
                               ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
```

```
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
## -Creating mutation matrix
## --matrix of dimension 188x96
```

---
## Trinucleotide matrix

```r
dim(laml.tnm$nmf_matrix)
```

```
## [1] 188  96
```

```r
laml.tnm$nmf_matrix[1, ]
```

```
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C C[C>A]G C[C>A]T G[C>A]A G[C>A]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[C>A]G G[C>A]T T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T 
##       0       0       0       0       0       0       0       0       0       0 
## C[C>G]A C[C>G]C C[C>G]G C[C>G]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T C[C>T]A C[C>T]C C[C>T]G C[C>T]T 
##       0       0       0       0       1       0       1       0       1       1 
## G[C>T]A G[C>T]C G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T A[T>A]A A[T>A]C 
##       0       0       1       1       1       0       0       0       0       0 
## A[T>A]G A[T>A]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T 
##       1       0       0       0       1       0       0       0       0       0 
## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C A[T>C]G A[T>C]T C[T>C]A C[T>C]C 
##       0       0       0       0       0       0       0       0       0       0 
## C[T>C]G C[T>C]T G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C T[T>C]G T[T>C]T 
##       1       0       0       0       0       0       0       0       0       0 
## A[T>G]A A[T>G]C A[T>G]G A[T>G]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T 
##       0       0       0       0       0       0
```

---
## Trinucleotide pattern in TCGA-AB-3009
Lets focus on a single sample

```r
tarSam_triNuc <- laml.tnm$nmf_matrix["TCGA-AB-3009", ]
tarSam_triNuc[1:2]
```

```
## A[C>A]A A[C>A]C 
##       1       1
```

```r
yd <- data.frame(triNuc = names(tarSam_triNuc), count = tarSam_triNuc, stringsAsFactors = FALSE)
yd$cat <- gsub("(.*)\\[(.*)\\](.*)", "\\2", yd$triNuc)
yd$num <- seq(1, length(yd$triNuc))
```

---
## Trinucleotide pattern in TCGA-AB-3009

```r
ggplot(yd,aes(x=num,y=count,fill=cat))+
  geom_bar(stat='identity')+
  labs(x="",y="Counts",fill="")+
  theme(axis.text.x=element_blank())
```

---
## Trinucleotide pattern in TCGA-AB-3009
<img src="RU_GenomicVariant_session3_files/figure-html/mult_mutSig_triMutPat_advan4-1.png" style="display: block; margin: auto;" />
**What does this pattern mean?**

---
## Estimate number of signatures
In this step, [Cophenetic correlation](https://en.wikipedia.org/wiki/Cophenetic_correlation), *which represents how faithfully a clustering results represent the original data*, would be used to estimate how many signatures could be identified in this dataset.

```r
library('NMF')
laml.sign <- estimateSignatures(mat = laml.tnm,
                                nTry = 6,
                                pConstant = 0.1,
                                parallel = 1)
```

---
## Estimate number of signautres

---
## Extract signautres
- Based on cophenetic plot
- In this case, number of signatures = 3

```r
laml.sig.ext <- extractSignatures(mat = laml.tnm, 
                                  n = 3,
                                  pConstant = 0.1,
                                  parallel = 1)
laml.sig.ext$signatures[1:5,] # use for mapping to mutational signature database
```

```
##         Signature_1 Signature_2 Signature_3
## A[C>A]A 0.009675202 0.004069562 0.011908054
## A[C>A]C 0.025910582 0.005584385 0.008636836
## A[C>A]G 0.010011263 0.002505450 0.005986448
## A[C>A]T 0.007482971 0.007651267 0.006813378
## C[C>A]A 0.012005658 0.007071996 0.010397992
```

---
## What do the signtures stand for?
- Map to mutational signatures databases eg. *COSMIC*

```r
laml.og30.cosm = compareSignatures(nmfRes = laml.sig.ext,
                                   sig_db = "legacy")
laml.og30.cosm$cosine_similarities[,1:5]
```

```
##             COSMIC_1 COSMIC_2 COSMIC_3 COSMIC_4 COSMIC_5
## Signature_1    0.840    0.099    0.231    0.198    0.375
## Signature_2    0.577    0.153    0.313    0.253    0.405
## Signature_3    0.328    0.229    0.803    0.586    0.851
```

---
## Map the signatures to COSMIC database

```r
pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE)
```

---
## plot signatures ~ COSMIC
- plots signature pattern and annotate basing on COSMIC database

```r
plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'legacy')
```

---
## plot signatures ~ COSMIC
<img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigCOS_advan2-1.png" style="display: block; margin: auto;" />

---
## Map the signatures to SBS database

```r
laml.sign.sbs = compareSignatures(nmfRes = laml.sig.ext, sig_db = "SBS")
laml.sign.sbs$cosine_similarities[, 1:5]
```

```
##              SBS1  SBS2  SBS3  SBS4  SBS5
## Signature_1 0.858 0.084 0.215 0.155 0.298
## Signature_2 0.482 0.124 0.314 0.203 0.440
## Signature_3 0.075 0.186 0.836 0.487 0.823
```

---
## Map the signatures to SBS database

```r
pheatmap::pheatmap(mat = laml.sign.sbs$cosine_similarities, cluster_rows = FALSE)
```

---
## plot signatures ~ SBS
- plots signature pattern and annotate basing on SBS database

```r
plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'SBS')
```

---
## plot signatures ~ SBS
<img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigSBS_advan2-1.png" style="display: block; margin: auto;" />

---
## Mutational signautes in each sample
- Display mutational signatures exposure in each sample

```r
plotSignatures(nmfRes = laml.sig.ext,
               title_size = 0.8,
               contributions = TRUE,
               show_title = TRUE)
```

---
## Mutational signautes exposure in each sample
<img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_plotSigSAM_advan2-1.png" style="display: block; margin: auto;" />

---
## Enrichment analysis
- k-mean clustering based on the exposure of mutational signatures
- Assign dominate signatures based on k-mean clustering
- Mutations in samples with different signatures

```r
laml.se = signatureEnrichment(maf = laml, 
                              sig_res = laml.sig.ext)
```

---
## Enrichment analysis
<img src="RU_GenomicVariant_session3_files/figure-html/mult_muSig_enrich_advan2-1.png" style="display: block; margin: auto;" />

---
## Genes associated with signatures
- Genes mutated in samples with particular mutational signature

```r
laml.se$groupwise_comparision[1:2, ]
```

```
##    Hugo_Symbol      Group1 Group2 n_mutated_group1 n_mutated_group2
## 1:        IDH2 Signature_2   Rest         14 of 65         6 of 123
## 2:        NPM1 Signature_3   Rest         18 of 63        14 of 125
##         p_value       OR   OR_low   OR_high        fdr
## 1: 0.0008227612 5.299864 1.790632 17.818258 0.04936567
## 2: 0.0039456214 3.149946 1.351675  7.491285 0.08958865
```

---
## Genes associated with signatures

```r
plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)
```

---
## Exercise Time
- [Exercise](../../exercises/exercises/GenomicVariant_session3_exercise.html)
- [Answer](../../exercises/answers/GenomicVariant_session3_answers.html)