In this session, we will demonstrate how to use several advanced analysis approaches to decipher variants from multiple samples. In this situation, the Mutation Annotation Format (MAF) would be a good choice for storing and presenting your data. For a detailesd overview of MAF file structure please refer to our previous course on Genomic Formats.
Today we will beusing an acute myeloid leukemia (AML) dataset (tcga-laml), which was fetched from The Cancer Genome Atlas data portal. The dataset is in the following path “data/tcga_laml.maf”.
As with many data files we can very simply read MAF files into R.
## Hugo_Symbol Entrez_Gene_Id Center NCBI_Build Chromosome
## 1 ABCA10 10349 genome.wustl.edu 37 17
## 2 ABCA4 24 genome.wustl.edu 37 1
## Start_Position End_position Strand Variant_Classification Variant_Type
## 1 67170917 67170917 + Splice_Site SNP
## 2 94490594 94490594 + Missense_Mutation SNP
## Reference_Allele Tumor_Seq_Allele1 Tumor_Seq_Allele2 Tumor_Sample_Barcode
## 1 T T C TCGA-AB-2988
## 2 C C T TCGA-AB-2869
## Protein_Change i_TumorVAF_WU i_transcript_name
## 1 p.K960R 45.66 NM_080282.3
## 2 p.R1517H 38.12 NM_000350.2
Bioconductor has a specialist package for handling these files.
## -Reading
## -Validating
## -Silent variants: 475
## -Summarizing
## -Processing clinical data
## --Missing clinical data
## -Finished in 0.305s elapsed (0.305s cpu)
## [1] "MAF"
## attr(,"package")
## [1] "maftools"
maftools provides two functions to generate summary table for by samples or by genes - getSampleSummary: a matrix, count of mutation classes in each sample - getGeneSummary: a matrix, count of mutation classes in each gene
## Tumor_Sample_Barcode Frame_Shift_Del Frame_Shift_Ins In_Frame_Del
## 1: TCGA-AB-3009 0 5 0
## 2: TCGA-AB-2807 1 0 1
## In_Frame_Ins Missense_Mutation Nonsense_Mutation Splice_Site total
## 1: 1 25 2 1 34
## 2: 0 16 3 4 25
var_to <- sample_sum$total
names(var_to) <- sample_sum$Tumor_Sample_Barcode
sample_sum <- dplyr::select(sample_sum, -total)
melt_dat <- reshape2::melt(sample_sum, id = "Tumor_Sample_Barcode")
melt_dat[1:3, ]
## Tumor_Sample_Barcode variable value
## 1 TCGA-AB-3009 Frame_Shift_Del 0
## 2 TCGA-AB-2807 Frame_Shift_Del 1
## 3 TCGA-AB-2959 Frame_Shift_Del 0
melt_dat$totalVar <- var_to[match(melt_dat$Tumor_Sample_Barcode, names(var_to))]
melt_dat$prop <- melt_dat$value/melt_dat$totalVar
head(melt_dat)
## Tumor_Sample_Barcode variable value totalVar prop
## 1 TCGA-AB-3009 Frame_Shift_Del 0 34 0.00
## 2 TCGA-AB-2807 Frame_Shift_Del 1 25 0.04
## 3 TCGA-AB-2959 Frame_Shift_Del 0 23 0.00
## 4 TCGA-AB-3002 Frame_Shift_Del 0 21 0.00
## 5 TCGA-AB-2849 Frame_Shift_Del 0 20 0.00
## 6 TCGA-AB-2923 Frame_Shift_Del 1 20 0.05
## Hugo_Symbol Frame_Shift_Del Frame_Shift_Ins In_Frame_Del In_Frame_Ins
## 1: FLT3 0 0 1 33
## 2: DNMT3A 4 0 0 0
## Missense_Mutation Nonsense_Mutation Splice_Site total MutatedSamples
## 1: 15 0 3 52 52
## 2: 39 5 6 54 48
## AlteredSamples
## 1: 52
## 2: 48
Mutation hotspot means mutation in a given position detected in more samples than in other positions. Most activating mutations have mutation hotspots. A classic example are the G12, G13, or Q61 mutation in KRAS. In contrast, loss of function mutations are usually with no mutation hotspot. The lollipop plot is a fancy plot of mutations in each position of a given gene. It’s very intuitive for demonstrating mutation hotspots.
When mutations are send to interact, this means any concordance or exclusivity of two given gene mutations. If two gene mutations concordantly identified in the same patient, it indicates the two mutations could have synergistic effect in cancer or they are in causality. If two gene mutations are exclusive (i.e. not present in the same patient), this could suggest that they lead to similar consequence, therefore there is no selective pressure for both to mutate. The oncoplot allows comparison between genes to look for interactions between mutants.
IDH1 and IDH2 mutations could be exlusive.
maftools provides two functions for pathway analysis. - OncogenicPathways(): calculate mutations detected in each pathway and the fraction of samples affected - PlotOncogenicPathways(): make a waterfall plot for a given pathway
## Pathway N n_affected_genes fraction_affected Mutated_samples
## 1: PI3K 29 1 0.03448276 1
## 2: NRF2 3 1 0.33333333 1
## 3: TP53 6 2 0.33333333 15
## 4: WNT 68 3 0.04411765 4
## 5: MYC 13 3 0.23076923 3
## 6: NOTCH 71 6 0.08450704 8
## 7: Hippo 38 7 0.18421053 7
## 8: RTK-RAS 85 18 0.21176471 97
## Fraction_mutated_samples
## 1: 0.005181347
## 2: 0.005181347
## 3: 0.077720207
## 4: 0.020725389
## 5: 0.015544041
## 6: 0.041450777
## 7: 0.036269430
## 8: 0.502590674
Mutations in RTK-RAS pathway is much higher than others
Mutational signatures mean the patterrns of trinucleotide substitution. As mentioned in session 1, the pattern may reflect to exposure of spontaneous/enzymatic deamination and defect of DNA repair, etc.
This is another common area of study for cancer and genome integrity research. - Example: T-W Chen, C-C Lee, and Y-S Chang et al, Nat Commun(2017), 8:465 link
The majority of mutations are C>T.
library(BSgenome.Hsapiens.UCSC.hg19, quietly = TRUE)
laml.tnm = trinucleotideMatrix(maf = laml,
prefix = 'chr',
add = TRUE,
ref_genome = "BSgenome.Hsapiens.UCSC.hg19")
## -Extracting 5' and 3' adjacent bases
## -Extracting +/- 20bp around mutated bases for background C>T estimation
## -Estimating APOBEC enrichment scores
## --Performing one-way Fisher's test for APOBEC enrichment
## ---APOBEC related mutations are enriched in 3.315 % of samples (APOBEC enrichment score > 2 ; 6 of 181 samples)
## -Creating mutation matrix
## --matrix of dimension 188x96
## [1] 188 96
## A[C>A]A A[C>A]C A[C>A]G A[C>A]T C[C>A]A C[C>A]C C[C>A]G C[C>A]T G[C>A]A G[C>A]C
## 0 0 0 0 0 0 0 0 0 0
## G[C>A]G G[C>A]T T[C>A]A T[C>A]C T[C>A]G T[C>A]T A[C>G]A A[C>G]C A[C>G]G A[C>G]T
## 0 0 0 0 0 0 0 0 0 0
## C[C>G]A C[C>G]C C[C>G]G C[C>G]T G[C>G]A G[C>G]C G[C>G]G G[C>G]T T[C>G]A T[C>G]C
## 0 0 0 0 0 0 0 0 0 0
## T[C>G]G T[C>G]T A[C>T]A A[C>T]C A[C>T]G A[C>T]T C[C>T]A C[C>T]C C[C>T]G C[C>T]T
## 0 0 0 0 1 0 1 0 1 1
## G[C>T]A G[C>T]C G[C>T]G G[C>T]T T[C>T]A T[C>T]C T[C>T]G T[C>T]T A[T>A]A A[T>A]C
## 0 0 1 1 1 0 0 0 0 0
## A[T>A]G A[T>A]T C[T>A]A C[T>A]C C[T>A]G C[T>A]T G[T>A]A G[T>A]C G[T>A]G G[T>A]T
## 1 0 0 0 1 0 0 0 0 0
## T[T>A]A T[T>A]C T[T>A]G T[T>A]T A[T>C]A A[T>C]C A[T>C]G A[T>C]T C[T>C]A C[T>C]C
## 0 0 0 0 0 0 0 0 0 0
## C[T>C]G C[T>C]T G[T>C]A G[T>C]C G[T>C]G G[T>C]T T[T>C]A T[T>C]C T[T>C]G T[T>C]T
## 1 0 0 0 0 0 0 0 0 0
## A[T>G]A A[T>G]C A[T>G]G A[T>G]T C[T>G]A C[T>G]C C[T>G]G C[T>G]T G[T>G]A G[T>G]C
## 0 0 0 0 0 0 0 0 0 0
## G[T>G]G G[T>G]T T[T>G]A T[T>G]C T[T>G]G T[T>G]T
## 0 0 0 0 0 0
Lets focus on a single sample
## A[C>A]A A[C>A]C
## 1 1
What does this pattern mean?
In this step, Cophenetic correlation, which represents how faithfully a clustering results represent the original data, would be used to estimate how many signatures could be identified in this dataset.
laml.sig.ext <- extractSignatures(mat = laml.tnm,
n = 3,
pConstant = 0.1,
parallel = 1)
laml.sig.ext$signatures[1:5,] # use for mapping to mutational signature database
## Signature_1 Signature_2 Signature_3
## A[C>A]A 0.009675202 0.004069562 0.011908054
## A[C>A]C 0.025910582 0.005584385 0.008636836
## A[C>A]G 0.010011263 0.002505450 0.005986448
## A[C>A]T 0.007482971 0.007651267 0.006813378
## C[C>A]A 0.012005658 0.007071996 0.010397992
laml.og30.cosm = compareSignatures(nmfRes = laml.sig.ext,
sig_db = "legacy")
laml.og30.cosm$cosine_similarities[,1:5]
## COSMIC_1 COSMIC_2 COSMIC_3 COSMIC_4 COSMIC_5
## Signature_1 0.840 0.099 0.231 0.198 0.375
## Signature_2 0.577 0.153 0.313 0.253 0.405
## Signature_3 0.328 0.229 0.803 0.586 0.851
laml.sign.sbs = compareSignatures(nmfRes = laml.sig.ext, sig_db = "SBS")
laml.sign.sbs$cosine_similarities[, 1:5]
## SBS1 SBS2 SBS3 SBS4 SBS5
## Signature_1 0.858 0.084 0.215 0.155 0.298
## Signature_2 0.482 0.124 0.314 0.203 0.440
## Signature_3 0.075 0.186 0.836 0.487 0.823
## Hugo_Symbol Group1 Group2 n_mutated_group1 n_mutated_group2
## 1: IDH2 Signature_2 Rest 14 of 65 6 of 123
## 2: NPM1 Signature_3 Rest 18 of 63 14 of 125
## p_value OR OR_low OR_high fdr
## 1: 0.0008227612 5.299864 1.790632 17.818258 0.04936567
## 2: 0.0039456214 3.149946 1.351675 7.491285 0.08958865