Genomic Variants (part 3)


About this session

In this session, we will demonstrate how to use several advanced analysis approaches to decipher variants from multiple samples. In this situation, the Mutation Annotation Format (MAF) would be a good choice for storing and presenting your data. For a detailesd overview of MAF file structure please refer to our previous course on Genomic Formats.

Today we will beusing an acute myeloid leukemia (AML) dataset (tcga-laml), which was fetched from The Cancer Genome Atlas data portal. The dataset is in the following path “data/tcga_laml.maf”.

Outline of this session

  • Handle MAF with maftools
  • Get summary by samples and genes
  • Functional analysis
  • Mutational signatures

Working with MAF files


MAF format

As with many data files we can very simply read MAF files into R.

laml_tab <- read.delim("data/tcga_laml.maf", sep = "\t")
laml_tab[1:2, ]
##   Hugo_Symbol Entrez_Gene_Id           Center NCBI_Build Chromosome
## 1      ABCA10          10349 genome.wustl.edu         37         17
## 2       ABCA4             24 genome.wustl.edu         37          1
##   Start_Position End_position Strand Variant_Classification Variant_Type
## 1       67170917     67170917      +            Splice_Site          SNP
## 2       94490594     94490594      +      Missense_Mutation          SNP
##   Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1                T                 T                 C         TCGA-AB-2988
## 2                C                 C                 T         TCGA-AB-2869
##   Protein_Change i_TumorVAF_WU i_transcript_name
## 1        p.K960R         45.66       NM_080282.3
## 2       p.R1517H         38.12       NM_000350.2

How many mutations per sample

tbl <- table(laml_tab$Tumor_Sample_Barcode)
hist(tbl, breaks = 10, xlab = "Mutations")

maftools

Bioconductor has a specialist package for handling these files.

Load MAF file

library(maftools)
laml <- read.maf("data/tcga_laml.maf.gz")
## -Reading
## -Validating
## -Silent variants: 475 
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.305s elapsed (0.305s cpu)
class(laml)
## [1] "MAF"
## attr(,"package")
## [1] "maftools"

Summary by samples or genes

maftools provides two functions to generate summary table for by samples or by genes - getSampleSummary: a matrix, count of mutation classes in each sample - getGeneSummary: a matrix, count of mutation classes in each gene

How many mutations in each sample

  • getSampleSummary()
  • a matrix of mutation classes in each sample
sample_sum <- getSampleSummary(laml)
sample_sum[1:2, ]
##    Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1:         TCGA-AB-3009               0               5            0
## 2:         TCGA-AB-2807               1               0            1
##    In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1:            1                25                 2           1    34
## 2:            0                16                 3           4    25

Mutation classes in each sample

  • convert into table for ggplot
  • proportion: count of mutation / total mutations for each sample
var_to <- sample_sum$total
names(var_to) <- sample_sum$Tumor_Sample_Barcode
sample_sum <- dplyr::select(sample_sum, -total)
melt_dat <- reshape2::melt(sample_sum, id = "Tumor_Sample_Barcode")
melt_dat[1:3, ]
##   Tumor_Sample_Barcode        variable value
## 1         TCGA-AB-3009 Frame_Shift_Del     0
## 2         TCGA-AB-2807 Frame_Shift_Del     1
## 3         TCGA-AB-2959 Frame_Shift_Del     0

Mutation classes in each sample

  • convert into table for ggplot
  • proportion: count of mutation / total mutations for each sample
melt_dat$totalVar <- var_to[match(melt_dat$Tumor_Sample_Barcode, names(var_to))]
melt_dat$prop <- melt_dat$value/melt_dat$totalVar
head(melt_dat)
##   Tumor_Sample_Barcode        variable value totalVar prop
## 1         TCGA-AB-3009 Frame_Shift_Del     0       34 0.00
## 2         TCGA-AB-2807 Frame_Shift_Del     1       25 0.04
## 3         TCGA-AB-2959 Frame_Shift_Del     0       23 0.00
## 4         TCGA-AB-3002 Frame_Shift_Del     0       21 0.00
## 5         TCGA-AB-2849 Frame_Shift_Del     0       20 0.00
## 6         TCGA-AB-2923 Frame_Shift_Del     1       20 0.05

Mutation classes in each sample

ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=value,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Mutations",fill="")+
  theme(axis.text.x=element_blank())

Mutation classes in each sample

Mutation classes in each sample

ggplot(melt_dat,aes(x=Tumor_Sample_Barcode,y=prop,fill=variable))+
  geom_bar(stat='identity',position = 'stack')+
  labs(x="",y="Proportion",fill="")+
  theme(axis.text.x=element_blank())

Mutation classes in each sample

How many mutations in each gene

  • getGeneSummary()
  • A matrix, counts of each mutation classes in each gene
gene_sum <- getGeneSummary(laml)
gene_sum[1:2, ]
##    Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1:        FLT3               0               0            1           33
## 2:      DNMT3A               4               0            0            0
##    Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1:                15                 0           3    52             52
## 2:                39                 5           6    54             48
##    AlteredSamples
## 1:             52
## 2:             48

Functional analysis of mutations


Any mutation hotspot?

Mutation hotspot means mutation in a given position detected in more samples than in other positions. Most activating mutations have mutation hotspots. A classic example are the G12, G13, or Q61 mutation in KRAS. In contrast, loss of function mutations are usually with no mutation hotspot. The lollipop plot is a fancy plot of mutations in each position of a given gene. It’s very intuitive for demonstrating mutation hotspots.

Mutation hotspots?

  • lollipopPlot()
lollipopPlot(maf = laml, 
             gene = 'NRAS', 
             AACol = 'Protein_Change', 
             showMutationRate = TRUE,
             labelPos = "all")

Mutation hotspots?

Interaction between mutations

When mutations are send to interact, this means any concordance or exclusivity of two given gene mutations. If two gene mutations concordantly identified in the same patient, it indicates the two mutations could have synergistic effect in cancer or they are in causality. If two gene mutations are exclusive (i.e. not present in the same patient), this could suggest that they lead to similar consequence, therefore there is no selective pressure for both to mutate. The oncoplot allows comparison between genes to look for interactions between mutants.

Oncoplot for top 5 mutated genes

oncoplot(maf = laml, top = 5)

IDH1 and IDH2 mutations could be exlusive.

Pathway analysis

maftools provides two functions for pathway analysis. - OncogenicPathways(): calculate mutations detected in each pathway and the fraction of samples affected - PlotOncogenicPathways(): make a waterfall plot for a given pathway

Any enriched pathways?

OncogenicPathways(maf = laml)
##    Pathway  N n_affected_genes fraction_affected Mutated_samples
## 1:    PI3K 29                1        0.03448276               1
## 2:    NRF2  3                1        0.33333333               1
## 3:    TP53  6                2        0.33333333              15
## 4:     WNT 68                3        0.04411765               4
## 5:     MYC 13                3        0.23076923               3
## 6:   NOTCH 71                6        0.08450704               8
## 7:   Hippo 38                7        0.18421053               7
## 8: RTK-RAS 85               18        0.21176471              97
##    Fraction_mutated_samples
## 1:              0.005181347
## 2:              0.005181347
## 3:              0.077720207
## 4:              0.020725389
## 5:              0.015544041
## 6:              0.041450777
## 7:              0.036269430
## 8:              0.502590674

Any enriched pathways?

Mutations in RTK-RAS pathway is much higher than others

Mutations enriched in pathways?

PlotOncogenicPathways(maf = laml, pathways = "RTK-RAS")

Mutation Signatures


Mutational signatures mean the patterrns of trinucleotide substitution. As mentioned in session 1, the pattern may reflect to exposure of spontaneous/enzymatic deamination and defect of DNA repair, etc.

This is another common area of study for cancer and genome integrity research. - Example: T-W Chen, C-C Lee, and Y-S Chang et al, Nat Commun(2017), 8:465 link

Steps for mutational signature analysis

  • Evaluate nucleotide substitutions
  • Calculate trinucleotide matrix
  • Estimate Signatures
  • Compare to databases
  • Compare signature and gene mutations

Ti/Tv plot

  • Evaluate single-nucleotide substitution
  • titv()
laml.titv = titv(maf = laml, plot = FALSE, useSyn = TRUE)
plotTiTv(res = laml.titv)

Ti/Tv plot

The majority of mutations are C>T.

Trinucleotide matrix

  • calcualte trinucleotide substituion pattern
  • trinucleotideMatrix()
library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
laml.tnm = trinucleotideMatrix(maf = laml,
                               prefix = 'chr', 
                               add = TRUE, 
                               ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in  3.315 % of samples (APOBEC enrichment score > 2 ;  6  of  181  samples)
## -Creating mutation matrix
## --matrix of dimension 188x96

Trinucleotide matrix

dim(laml.tnm$nmf_matrix)
## [1] 188  96
laml.tnm$nmf_matrix[1, ]
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C C[C>A]G C[C>A]T G[C>A]A G[C>A]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[C>A]G G[C>A]T T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T 
##       0       0       0       0       0       0       0       0       0       0 
## C[C>G]A C[C>G]C C[C>G]G C[C>G]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T C[C>T]A C[C>T]C C[C>T]G C[C>T]T 
##       0       0       0       0       1       0       1       0       1       1 
## G[C>T]A G[C>T]C G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T A[T>A]A A[T>A]C 
##       0       0       1       1       1       0       0       0       0       0 
## A[T>A]G A[T>A]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T 
##       1       0       0       0       1       0       0       0       0       0 
## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C A[T>C]G A[T>C]T C[T>C]A C[T>C]C 
##       0       0       0       0       0       0       0       0       0       0 
## C[T>C]G C[T>C]T G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C T[T>C]G T[T>C]T 
##       1       0       0       0       0       0       0       0       0       0 
## A[T>G]A A[T>G]C A[T>G]G A[T>G]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C 
##       0       0       0       0       0       0       0       0       0       0 
## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T 
##       0       0       0       0       0       0

Trinucleotide pattern in TCGA-AB-3009

Lets focus on a single sample

tarSam_triNuc <- laml.tnm$nmf_matrix["TCGA-AB-3009", ]
tarSam_triNuc[1:2]
## A[C>A]A A[C>A]C 
##       1       1
yd <- data.frame(triNuc = names(tarSam_triNuc), count = tarSam_triNuc, stringsAsFactors = FALSE)
yd$cat <- gsub("(.*)\\[(.*)\\](.*)", "\\2", yd$triNuc)
yd$num <- seq(1, length(yd$triNuc))

Trinucleotide pattern in TCGA-AB-3009

ggplot(yd,aes(x=num,y=count,fill=cat))+
  geom_bar(stat='identity')+
  labs(x="",y="Counts",fill="")+
  theme(axis.text.x=element_blank())

Trinucleotide pattern in TCGA-AB-3009

What does this pattern mean?

Estimate number of signatures

In this step, Cophenetic correlation, which represents how faithfully a clustering results represent the original data, would be used to estimate how many signatures could be identified in this dataset.

library('NMF')
laml.sign <- estimateSignatures(mat = laml.tnm,
                                nTry = 6,
                                pConstant = 0.1,
                                parallel = 1)

Estimate number of signautres

Extract signautres

  • Based on cophenetic plot
  • In this case, number of signatures = 3
laml.sig.ext <- extractSignatures(mat = laml.tnm, 
                                  n = 3,
                                  pConstant = 0.1,
                                  parallel = 1)
laml.sig.ext$signatures[1:5,] # use for mapping to mutational signature database
##         Signature_1 Signature_2 Signature_3
## A[C>A]A 0.009675202 0.004069562 0.011908054
## A[C>A]C 0.025910582 0.005584385 0.008636836
## A[C>A]G 0.010011263 0.002505450 0.005986448
## A[C>A]T 0.007482971 0.007651267 0.006813378
## C[C>A]A 0.012005658 0.007071996 0.010397992

What do the signtures stand for?

  • Map to mutational signatures databases eg. COSMIC
laml.og30.cosm = compareSignatures(nmfRes = laml.sig.ext,
                                   sig_db = "legacy")
laml.og30.cosm$cosine_similarities[,1:5]
##             COSMIC_1 COSMIC_2 COSMIC_3 COSMIC_4 COSMIC_5
## Signature_1    0.840    0.099    0.231    0.198    0.375
## Signature_2    0.577    0.153    0.313    0.253    0.405
## Signature_3    0.328    0.229    0.803    0.586    0.851

Map the signatures to COSMIC database

pheatmap::pheatmap(mat = laml.og30.cosm$cosine_similarities, cluster_rows = FALSE)

plot signatures ~ COSMIC

  • plots signature pattern and annotate basing on COSMIC database
plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'legacy')

plot signatures ~ COSMIC

Map the signatures to SBS database

laml.sign.sbs = compareSignatures(nmfRes = laml.sig.ext, sig_db = "SBS")
laml.sign.sbs$cosine_similarities[, 1:5]
##              SBS1  SBS2  SBS3  SBS4  SBS5
## Signature_1 0.858 0.084 0.215 0.155 0.298
## Signature_2 0.482 0.124 0.314 0.203 0.440
## Signature_3 0.075 0.186 0.836 0.487 0.823

Map the signatures to SBS database

pheatmap::pheatmap(mat = laml.sign.sbs$cosine_similarities, cluster_rows = FALSE)

plot signatures ~ SBS

  • plots signature pattern and annotate basing on SBS database
plotSignatures(nmfRes = laml.sig.ext, 
               title_size = 1.2,
               contributions = FALSE,
               show_title = TRUE,
               sig_db = 'SBS')

plot signatures ~ SBS

Mutational signautes in each sample

  • Display mutational signatures exposure in each sample
plotSignatures(nmfRes = laml.sig.ext,
               title_size = 0.8,
               contributions = TRUE,
               show_title = TRUE)

Mutational signautes exposure in each sample

Enrichment analysis

  • k-mean clustering based on the exposure of mutational signatures
  • Assign dominate signatures based on k-mean clustering
  • Mutations in samples with different signatures
laml.se = signatureEnrichment(maf = laml, 
                              sig_res = laml.sig.ext)

Enrichment analysis

Genes associated with signatures

  • Genes mutated in samples with particular mutational signature
laml.se$groupwise_comparision[1:2, ]
##    Hugo_Symbol      Group1 Group2 n_mutated_group1 n_mutated_group2
## 1:        IDH2 Signature_2   Rest         14 of 65         6 of 123
## 2:        NPM1 Signature_3   Rest         18 of 63        14 of 125
##         p_value       OR   OR_low   OR_high        fdr
## 1: 0.0008227612 5.299864 1.790632 17.818258 0.04936567
## 2: 0.0039456214 3.149946 1.351675  7.491285 0.08958865

Genes associated with signatures

plotEnrichmentResults(enrich_res = laml.se, pVal = 0.05)

Exercise Time