Genomic Variants (part 3)

About this session

In this session, we will demonstrate how to use several advanced analysis approaches to decipher variants from multiple samples. In this situation, the Mutation Annotation Format (MAF) would be a good choice for storing and presenting your data. For a detailesd overview of MAF file structure please refer to our previous course on Genomic Formats.

Today we will beusing an acute myeloid leukemia (AML) dataset (tcga-laml), which was fetched from The Cancer Genome Atlas data portal. The dataset is in the following path “data/tcga_laml.maf”.

Outline of this session

Handle MAF with maftools
Get summary by samples and genes
Functional analysis
Mutational signatures

Working with MAF files

MAF format

As with many data files we can very simply read MAF files into R.

laml_tab <- read.delim("data/tcga_laml.maf", sep = "\t")
laml_tab[1:2, ]

##   Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1      ABCA10          10349 genome.wustl.edu         37         17
## 2       ABCA4             24 genome.wustl.edu         37          1
##   Start_Position End_position Strand Variant_Classification Variant_Type
## 1       67170917     67170917      +            Splice_Site          SNP
## 2       94490594     94490594      +      Missense_Mutation          SNP
##   Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1                T                 T                 C         TCGA-AB-2988
## 2                C                 C                 T         TCGA-AB-2869
##   Protein_Change i_TumorVAF_WU i_transcript_name
## 1        p.K960R         45.66       NM_080282.3
## 2       p.R1517H         38.12       NM_000350.2

How many mutations per sample

tbl <- table(laml_tab$Tumor_Sample_Barcode)
hist(tbl, breaks = 10, xlab = "Mutations")

maftools

Bioconductor has a specialist package for handling these files.

Load MAF file

library(maftools)
laml <- read.maf("data/tcga_laml.maf.gz")

## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.305s elapsed (0.305s cpu)

class(laml)

## [1] "MAF"
## attr(,"package")
## [1] "maftools"

Summary by samples or genes

maftools provides two functions to generate summary table for by samples or by genes - getSampleSummary: a matrix, count of mutation classes in each sample - getGeneSummary: a matrix, count of mutation classes in each gene

How many mutations in each sample

getSampleSummary()
a matrix of mutation classes in each sample

sample_sum <- getSampleSummary(laml)
sample_sum[1:2, ]

##    Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1:         TCGA-AB-3009               0               5            0
## 2:         TCGA-AB-2807               1               0            1
##    In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1:            1                25                 2           1    34
## 2:            0                16                 3           4    25

Mutation classes in each sample

convert into table for ggplot
proportion: count of mutation / total mutations for each sample

var_to <- sample_sum$total
names(var_to) <- sample_sum$Tumor_Sample_Barcode
sample_sum <- dplyr::select(sample_sum, -total)
melt_dat <- reshape2::melt(sample_sum, id = "Tumor_Sample_Barcode")
melt_dat[1:3, ]

##   Tumor_Sample_Barcode        variable value
## 1         TCGA-AB-3009 Frame_Shift_Del     0
## 2         TCGA-AB-2807 Frame_Shift_Del     1
## 3         TCGA-AB-2959 Frame_Shift_Del     0

Mutation classes in each sample

convert into table for ggplot
proportion: count of mutation / total mutations for each sample

melt_dat$totalVar <- var_to[match(melt_dat$Tumor_Sample_Barcode, names(var_to))]
melt_dat$prop <- melt_dat$value/melt_dat$totalVar
head(melt_dat)

##   Tumor_Sample_Barcode        variable value totalVar prop
## 1         TCGA-AB-3009 Frame_Shift_Del     0       34 0.00
## 2         TCGA-AB-2807 Frame_Shift_Del     1       25 0.04
## 3         TCGA-AB-2959 Frame_Shift_Del     0       23 0.00
## 4         TCGA-AB-3002 Frame_Shift_Del     0       21 0.00
## 5         TCGA-AB-2849 Frame_Shift_Del     0       20 0.00
## 6         TCGA-AB-2923 Frame_Shift_Del     1       20 0.05

Mutation classes in each sample

ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=value,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Mutations",fill="")+
  theme(axis.text.x=element_blank())

Mutation classes in each sample

ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=prop,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Proportion",fill="")+
  theme(axis.text.x=element_blank())

Mutation classes in each sample

How many mutations in each gene

getGeneSummary()
A matrix, counts of each mutation classes in each gene

gene_sum <- getGeneSummary(laml)
gene_sum[1:2, ]

##    Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1:        FLT3               0               0            1           33
## 2:      DNMT3A               4               0            0            0
##    Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1:                15                 0           3    52             52
## 2:                39                 5           6    54             48
##    AlteredSamples
## 1:             52
## 2:             48

Functional analysis of mutations

Any mutation hotspot?

Mutation hotspot means mutation in a given position detected in more samples than in other positions. Most activating mutations have mutation hotspots. A classic example are the G12, G13, or Q61 mutation in KRAS. In contrast, loss of function mutations are usually with no mutation hotspot. The lollipop plot is a fancy plot of mutations in each position of a given gene. It’s very intuitive for demonstrating mutation hotspots.

Mutation hotspots?

lollipopPlot()

lollipopPlot(maf = laml, 
             gene = 'NRAS', 
             AACol = 'Protein_Change', 
             showMutationRate = TRUE,
             labelPos = "all")

Mutation hotspots?

Interaction between mutations

When mutations are send to interact, this means any concordance or exclusivity of two given gene mutations. If two gene mutations concordantly identified in the same patient, it indicates the two mutations could have synergistic effect in cancer or they are in causality. If two gene mutations are exclusive (i.e. not present in the same patient), this could suggest that they lead to similar consequence, therefore there is no selective pressure for both to mutate. The oncoplot allows comparison between genes to look for interactions between mutants.

Oncoplot for top 5 mutated genes

oncoplot(maf = laml, top = 5)

IDH1 and IDH2 mutations could be exlusive.

Pathway analysis

maftools provides two functions for pathway analysis. - OncogenicPathways(): calculate mutations detected in each pathway and the fraction of samples affected - PlotOncogenicPathways(): make a waterfall plot for a given pathway

Any enriched pathways?

OncogenicPathways(maf = laml)

##    Pathway  N n_affected_genes fraction_affected Mutated_samples
## 1:    PI3K 29                1        0.03448276               1
## 2:    NRF2  3                1        0.33333333               1
## 3:    TP53  6                2        0.33333333              15
## 4:     WNT 68                3        0.04411765               4
## 5:     MYC 13                3        0.23076923               3
## 6:   NOTCH 71                6        0.08450704               8
## 7:   Hippo 38                7        0.18421053               7
## 8: RTK-RAS 85               18        0.21176471              97
##    Fraction_mutated_samples
## 1:              0.005181347
## 2:              0.005181347
## 3:              0.077720207
## 4:              0.020725389
## 5:              0.015544041
## 6:              0.041450777
## 7:              0.036269430
## 8:              0.502590674

Any enriched pathways?

Mutations in RTK-RAS pathway is much higher than others

Mutations enriched in pathways?

PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")

Mutation Signatures

Mutational signatures mean the patterrns of trinucleotide substitution. As mentioned in session 1, the pattern may reflect to exposure of spontaneous/enzymatic deamination and defect of DNA repair, etc.

This is another common area of study for cancer and genome integrity research. - Example: T-W Chen, C-C Lee, and Y-S Chang et al, Nat Commun(2017), 8:465 link

Steps for mutational signature analysis

Evaluate nucleotide substitutions
Calculate trinucleotide matrix
Estimate Signatures
Compare to databases
Compare signature and gene mutations

Ti/Tv plot

Evaluate single-nucleotide substitution
titv()

laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv)

Ti/Tv plot

The majority of mutations are C>T.

Trinucleotide matrix

calcualte trinucleotide substituion pattern
trinucleotideMatrix()

library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
laml.tnm = trinucleotideMatrix(maf = laml,
                               prefix = 'chr', 
                               add = TRUE, 
                               ref_genome = "BSgenome.Hsapiens.UCSC.hg19")

## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
## -Creating mutation matrix
## --matrix of dimension 188x96

Trinucleotide matrix

dim(laml.tnm$nmf_matrix)

## [1] 188  96

laml.tnm$nmf_matrix[1, ]

## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C C[C>A]G C[C>A]T G[C>A]A G[C>A]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[C>A]G G[C>A]T T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T 
##       0       0       0       0       0       0       0       0       0       0 
## C[C>G]A C[C>G]C C[C>G]G C[C>G]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T C[C>T]A C[C>T]C C[C>T]G C[C>T]T 
##       0       0       0       0       1       0       1       0       1       1 
## G[C>T]A G[C>T]C G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T A[T>A]A A[T>A]C 
##       0       0       1       1       1       0       0       0       0       0 
## A[T>A]G A[T>A]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T 
##       1       0       0       0       1       0       0       0       0       0 
## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C A[T>C]G A[T>C]T C[T>C]A C[T>C]C 
##       0       0       0       0       0       0       0       0       0       0 
## C[T>C]G C[T>C]T G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C T[T>C]G T[T>C]T 
##       1       0       0       0       0       0       0       0       0       0 
## A[T>G]A A[T>G]C A[T>G]G A[T>G]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T 
##       0       0       0       0       0       0

Trinucleotide pattern in TCGA-AB-3009

Lets focus on a single sample

tarSam_triNuc <- laml.tnm$nmf_matrix["TCGA-AB-3009", ]
tarSam_triNuc[1:2]

## A[C>A]A A[C>A]C 
##       1       1

yd <- data.frame(triNuc = names(tarSam_triNuc), count = tarSam_triNuc, stringsAsFactors = FALSE)
yd$cat <- gsub("(.*)\\[(.*)\\](.*)", "\\2", yd$triNuc)
yd$num <- seq(1, length(yd$triNuc))

Trinucleotide pattern in TCGA-AB-3009

ggplot(yd,aes(x=num,y=count,fill=cat))+
  geom_bar(stat='identity')+
  labs(x="",y="Counts",fill="")+
  theme(axis.text.x=element_blank())

Trinucleotide pattern in TCGA-AB-3009

What does this pattern mean?

Estimate number of signatures

In this step, Cophenetic correlation, which represents how faithfully a clustering results represent the original data, would be used to estimate how many signatures could be identified in this dataset.

library('NMF')
laml.sign <- estimateSignatures(mat = laml.tnm,
                                nTry = 6,
                                pConstant = 0.1,
                                parallel = 1)

Estimate number of signautres

Extract signautres

Based on cophenetic plot
In this case, number of signatures = 3

laml.sig.ext <- extractSignatures(mat = laml.tnm, 
                                  n = 3,
                                  pConstant = 0.1,
                                  parallel = 1)
laml.sig.ext$signatures[1:5,] # use for mapping to mutational signature database

##         Signature_1 Signature_2 Signature_3
## A[C>A]A 0.009675202 0.004069562 0.011908054
## A[C>A]C 0.025910582 0.005584385 0.008636836
## A[C>A]G 0.010011263 0.002505450 0.005986448
## A[C>A]T 0.007482971 0.007651267 0.006813378
## C[C>A]A 0.012005658 0.007071996 0.010397992

What do the signtures stand for?

Map to mutational signatures databases eg. COSMIC

laml.og30.cosm = compareSignatures(nmfRes = laml.sig.ext,
                                   sig_db = "legacy")
laml.og30.cosm$cosine_similarities[,1:5]

##             COSMIC_1 COSMIC_2 COSMIC_3 COSMIC_4 COSMIC_5
## Signature_1    0.840    0.099    0.231    0.198    0.375
## Signature_2    0.577    0.153    0.313    0.253    0.405
## Signature_3    0.328    0.229    0.803    0.586    0.851

Map the signatures to COSMIC database

pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE)

plot signatures ~ COSMIC

plots signature pattern and annotate basing on COSMIC database

plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'legacy')

plot signatures ~ COSMIC

Map the signatures to SBS database

laml.sign.sbs = compareSignatures(nmfRes = laml.sig.ext, sig_db = "SBS")
laml.sign.sbs$cosine_similarities[, 1:5]

##              SBS1  SBS2  SBS3  SBS4  SBS5
## Signature_1 0.858 0.084 0.215 0.155 0.298
## Signature_2 0.482 0.124 0.314 0.203 0.440
## Signature_3 0.075 0.186 0.836 0.487 0.823

Map the signatures to SBS database

pheatmap::pheatmap(mat = laml.sign.sbs$cosine_similarities, cluster_rows = FALSE)

plot signatures ~ SBS

plots signature pattern and annotate basing on SBS database

plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'SBS')

plot signatures ~ SBS

Mutational signautes in each sample

Display mutational signatures exposure in each sample

plotSignatures(nmfRes = laml.sig.ext,
               title_size = 0.8,
               contributions = TRUE,
               show_title = TRUE)

Mutational signautes exposure in each sample

Enrichment analysis

k-mean clustering based on the exposure of mutational signatures
Assign dominate signatures based on k-mean clustering
Mutations in samples with different signatures

laml.se = signatureEnrichment(maf = laml, 
                              sig_res = laml.sig.ext)

Enrichment analysis

Genes associated with signatures

Genes mutated in samples with particular mutational signature

laml.se$groupwise_comparision[1:2, ]

##    Hugo_Symbol      Group1 Group2 n_mutated_group1 n_mutated_group2
## 1:        IDH2 Signature_2   Rest         14 of 65         6 of 123
## 2:        NPM1 Signature_3   Rest         18 of 63        14 of 125
##         p_value       OR   OR_low   OR_high        fdr
## 1: 0.0008227612 5.299864 1.790632 17.818258 0.04936567
## 2: 0.0039456214 3.149946 1.351675  7.491285 0.08958865

Genes associated with signatures

plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)

Genomic Variants ~ Session 3

Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/RU_GenomicVariants/

Genomic Variants (part 3)

About this session

Outline of this session

Working with MAF files

MAF format

How many mutations per sample

maftools

Load MAF file

Summary by samples or genes

How many mutations in each sample

Mutation classes in each sample

Mutation classes in each sample

Mutation classes in each sample

Mutation classes in each sample

Mutation classes in each sample

Mutation classes in each sample

How many mutations in each gene

Functional analysis of mutations

Any mutation hotspot?

Mutation hotspots?

Mutation hotspots?

Interaction between mutations

Oncoplot for top 5 mutated genes

Pathway analysis

Any enriched pathways?

Any enriched pathways?

Mutations enriched in pathways?

Mutation Signatures

Steps for mutational signature analysis

Ti/Tv plot

Ti/Tv plot

Trinucleotide matrix

Trinucleotide matrix

Trinucleotide pattern in TCGA-AB-3009

Trinucleotide pattern in TCGA-AB-3009

Trinucleotide pattern in TCGA-AB-3009

Estimate number of signatures

Estimate number of signautres

Extract signautres

What do the signtures stand for?

Map the signatures to COSMIC database

plot signatures ~ COSMIC

plot signatures ~ COSMIC

Map the signatures to SBS database

Map the signatures to SBS database

plot signatures ~ SBS

plot signatures ~ SBS

Mutational signautes in each sample

Mutational signautes exposure in each sample

Enrichment analysis

Enrichment analysis

Genes associated with signatures

Genes associated with signatures

Exercise Time