Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in previous section. In this session, we want to demonstrate how to manipulate VCF file with VariantAnnotation.
We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session: - Raw data from SRA: PRJNA185252 - Reference genome: BSgenome.Hsapiens.UCSC.hg19 - Variant calling procedure is based on GATK Best Practice + Align to reference genome: Burrows-Wheeler Aligner, bwa-mem + Sort and remove duplicates: Picard-tools, SortSam/MarkDuplicates + Recalibrate Base Quality Scores: GATK, BaseRecalibrator/ApplyBQSR + Variant Calling: GATK, HaplotypeCaller/GenotypeGVCFs - Demo data: SAMN01882168 (“data/SAMN01882168_filt.vcf.gz”)
## class: CollapsedVCF
## dim: 72244 1
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 19 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
## info(header(vcf)):
## Number Type Description
## AC A Integer Allele count in genotypes, for each ALT al...
## AF A Float Allele Frequency, for each ALT allele, in ...
## AN 1 Integer Total number of alleles in called genotypes
## BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
## ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
## DP 1 Integer Approximate read depth; some reads may hav...
## DS 0 Flag Were any of the samples downsampled?
## END 1 Integer Stop position of the interval
## ExcessHet 1 Float Phred-scaled p-value for exact test of exc...
## FS 1 Float Phred-scaled p-value using Fisher's exact ...
## InbreedingCoeff 1 Float Inbreeding coefficient as estimated from t...
## MLEAC A Integer Maximum likelihood expectation (MLE) for t...
## MLEAF A Float Maximum likelihood expectation (MLE) for t...
## MQ 1 Float RMS Mapping Quality
## MQRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
## QD 1 Float Variant Confidence/Quality by Depth
## RAW_MQ 1 Float Raw data for RMS Mapping Quality
## ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
## SOR 1 Float Symmetric Odds Ratio of 2x2 contingency ta...
## geno(vcf):
## List of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## AD R Integer Allelic depths for the ref and alt alleles in the o...
## DP 1 Integer Approximate read depth (reads with MQ=255 or with b...
## GQ 1 Integer Genotype Quality
## MIN_DP 1 Integer Minimum DP observed within the GVCF block
## PGT 1 String Physical phasing haplotype information, describing ...
## PID 1 String Physical phasing ID information, where each unique ...
## PL G Integer Normalized, Phred-scaled likelihoods for genotypes ...
## RGQ 1 Integer Unconditional reference genotype confidence, encode...
## SB 4 Integer Per-sample component statistics which comprise the ...
As with objects we have met in other courses, there are accessor functions to grab the contents of our VCF object.
## class: VCFHeader
## samples(1): SAMN01882168
## meta(5): fileformat reference source GATKCommandLine contig
## fixed(2): FILTER ALT
## info(19): AC AF ... ReadPosRankSum SOR
## geno(10): GT AD ... RGQ SB
## DataFrameList of length 5
## names(5): fileformat reference source GATKCommandLine contig
Once we have the META extracted, we can use the dollar sign to extract specific fields.
## DataFrame with 1 row and 1 column
## Value
## <character>
## fileformat VCFv4.2
## DataFrame with 1 row and 1 column
## Value
## <character>
## source GenotypeGVCFs
Once we have the META extracted, we can use the dollar sign to extract specific fields.
## DataFrame with 25 rows and 2 columns
## length assembly
## <character> <character>
## chr1 249250621 BSgenome.Hsapiens.UC..
## chr2 243199373 BSgenome.Hsapiens.UC..
## chr3 198022430 BSgenome.Hsapiens.UC..
## chr4 191154276 BSgenome.Hsapiens.UC..
## chr5 180915260 BSgenome.Hsapiens.UC..
## ... ... ...
## chr21 48129895 BSgenome.Hsapiens.UC..
## chr22 51304566 BSgenome.Hsapiens.UC..
## chrX 155270560 BSgenome.Hsapiens.UC..
## chrY 59373566 BSgenome.Hsapiens.UC..
## chrM 16571 BSgenome.Hsapiens.UC..
The variant information is recorded in a VRange object. This is a form of GRanges object which we regularly use and are integral to Bioconductor. It includes chromosome, start position, end potision, reference allele, and alternative allele.
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815337_G/T chr1 815337 * | NA G
## chr1:815341_G/A chr1 815341 * | NA G
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## chr1:815337_G/T T 32.77 .
## chr1:815341_G/A A 32.77 .
## -------
## seqinfo: 25 sequences from hg19 genome
It’s not a strand-specific library. This is indicated by the "", which means unstranded. *
We can use similar accessors to those used for GRanges to get position information. - chromosome: seqnames() - start position: start() - end position: end()
## [1] "chr1" "chr1"
## [1] 815337 815341
## [1] 815337 815341
## DNAStringSet object of length 2:
## width seq
## [1] 1 G
## [2] 1 G
## [1] "G" "G"
The alt() function can be used to get the alternative allele for a variant. The result is a DNAStringSetList as alternative alleles could be multiple changes.
## DNAStringSetList of length 2
## [[1]] T
## [[2]] A
## [[1]]
## 1-letter DNAString object
## seq: T
##
## [[2]]
## 1-letter DNAString object
## seq: A
## [1] "T" "A"
## DataFrame with 2 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## AC A Integer Allele count in geno..
## AF A Float Allele Frequency, fo..
By calling info() directly on the whole VCF object, we get information for every variant.
## DataFrame with 2 rows and 19 columns
## AC AF AN BaseQRankSum
## <IntegerList> <NumericList> <integer> <numeric>
## chr1:815337_G/T 1 0.5 2 0.084
## chr1:815341_G/A 1 0.5 2 2.740
## ClippingRankSum DP DS END ExcessHet
## <numeric> <integer> <logical> <integer> <numeric>
## chr1:815337_G/T 0 11 FALSE NA 3.0103
## chr1:815341_G/A 0 11 FALSE NA 3.0103
## FS InbreedingCoeff MLEAC MLEAF MQ
## <numeric> <numeric> <IntegerList> <NumericList> <numeric>
## chr1:815337_G/T 3.09 NA 1 0.5 48.39
## chr1:815341_G/A 3.09 NA 1 0.5 48.39
## MQRankSum QD RAW_MQ ReadPosRankSum SOR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## chr1:815337_G/T -2.744 2.98 NA -1.083 2.093
## chr1:815341_G/A -2.744 2.98 NA -1.200 2.093
## DataFrame with 2 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## AD R Integer Allelic depths for t..
By calling geno() directly on the whole VCF object, we get information for every variant.
## [1] "GT: Genotype"
## chr1:815337_G/T chr1:815341_G/A
## "0/1" "0/1"
##
## 0/1 1/1 1/2
## 70215 1853 176
## [1] "DP: Approximate read depth (reads with MQ=255 or with bad mates are filtered)"
## chr1:815337_G/T chr1:815341_G/A
## 11 11
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 10.00 14.00 24.41 21.00 4136.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] "GQ: Genotype Quality"
## chr1:815337_G/T chr1:815341_G/A
## 61 61
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 90.00 99.00 88.41 99.00 99.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## DNAStringSet object of length 2:
## width seq
## [1] 1 G
## [2] 1 G
## [[1]]
## 1-letter DNAString object
## seq: T
## [[1]]
## [1] "T"
## [[1]]
## [1] 8 3
var_2 <- rownames(geno(vcf)$GT)[geno(vcf)$GT=="1/2"]
varTab2 <- data.frame(variant=names(rd)[names(rd) %in% var_2],
chr=as.vector(seqnames(rd)[names(rd) %in% var_2]),
start=start(rd)[names(rd) %in% var_2],
end=end(rd)[names(rd) %in% var_2],
refBase=unlist(lapply(lapply(
alt(vcf)[rownames(vcf) %in% var_2],`[[`,1),as.character)),
altBase=unlist(lapply(lapply(
alt(vcf)[rownames(vcf) %in% var_2],`[[`,2),as.character)),
refCount=unlist(lapply(
geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,2)),
altCount=unlist(
lapply(geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,3)),
genoType=geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_2],
gtQuality=geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_2],
stringsAsFactors = FALSE)
## variant chr start end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1
## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1
## gtQuality
## 1 61
## 2 61
# differentiate SNP/INS/DEL/Others
for (k in 1:length(varTab$variant)) {
if (width(varTab$refBase[k]) < width(varTab$altBase[k])) {
varTab$mutType[k] <- "INS"
} else if (width(varTab$refBase[k]) > width(varTab$altBase[k])) {
varTab$mutType[k] <- "DEL"
} else if (width(varTab$refBase[k]) == 1 & width(varTab$altBase[k]) == 1) {
varTab$mutType[k] <- "SNP"
} else {
varTab$mutType[k] <- "Others"
}
}
#
tbl <- table(varTab$mutType)
tbl_dat <- as.data.frame(tbl)
tbl
##
## DEL INS Others SNP
## 4351 3174 5 64714
# Transition (Ti)
ti <- c("A>G","G>A","C>T","T>C")
# Transveersion (Tv)
tv <- c("A>T","A>C","G>T","G>C","C>A","C>G","T>A","T>G")
varTab$nuSub <- paste0(varTab$refBase,">",varTab$altBase)
varTab$TiTv[varTab$nuSub %in% ti] <- "Ti"
varTab$TiTv[varTab$nuSub %in% tv] <- "Tv"
varTab[1:2,]
## variant chr start end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1
## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1
## gtQuality mutType nuSub TiTv
## 1 61 SNP G>T Tv
## 2 61 SNP G>A Ti
varX <- varTab[varTab$mutType == "SNP", ]
tbl <- table(varX$nuSub)
tbl_dat <- as.data.frame(tbl)
tbl
##
## A>C A>G A>T C>A C>G C>T G>A G>C G>T T>A T>C T>G
## 3062 9380 3083 3448 2830 10393 10579 2793 3515 3143 9424 3064
Other constraints may bias nucleotide transitions. One of these is the trinucleotide motif around the variant (from -1 to +1). This pattern could reflect different causes for the mutations.
One such example is APOBEC, a deaminase leading to C>T substitutions in the motif T[C>T]W. So, the proportion of T[C>T]W in overall C>T substitutions could reflect to APOBEC activity. More applications of the trinucleotide motifs will be discussed in session3.
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicFeatures)
library(stringr)
#
rd_idx <- str_split(names(rd), "_", simplify = T)
rd_idx[1:2, ]
## [,1] [,2]
## [1,] "chr1:815337" "G/T"
## [2,] "chr1:815341" "G/A"
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815520_C/T chr1 815520 * | NA C
## chr1:815573_C/T chr1 815573 * | NA C
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## chr1:815520_C/T T 916.77 .
## chr1:815573_C/T T 70.77 .
## -------
## seqinfo: 25 sequences from hg19 genome
We can use the getSeq() function to get the sequence underlying these variants, as we have done with GRanges.
rd_sub$triNu <- getSeq(Hsapiens,
seqnames(rd_sub),
start=start(rd_sub)-1,
end=end(rd_sub)+1)
rd_sub[1:2]
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815520_C/T chr1 815520 * | NA C
## chr1:815573_C/T chr1 815573 * | NA C
## ALT QUAL FILTER triNu
## <DNAStringSetList> <numeric> <character> <DNAStringSet>
## chr1:815520_C/T T 916.77 . CCT
## chr1:815573_C/T T 70.77 . ACA
## -------
## seqinfo: 25 sequences from hg19 genome
##
## ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT
## 1176 535 1096 879 619 429 544 641 554 387 511 556 633 535 584 706
## Var1 Freq
## 1 ACA 1176
## 2 ACC 535
## 3 ACG 1096
## 4 ACT 879
## 5 CCA 619
## 6 CCC 429
## 7 CCG 544
## 8 CCT 641
## 9 GCA 554
## 10 GCC 387
## 11 GCG 511
## 12 GCT 556
## 13 TCA 633
## 14 TCC 535
## 15 TCG 584
## 16 TCT 706
# collapse by APOBEC_target
apobec_dat <- aggregate(Freq ~ APOBEC_target, tbl_dat, FUN = sum, na.rm = TRUE)
apobec_dat
## APOBEC_target Freq
## 1 FALSE 9046
## 2 TRUE 1339
Probably not APOBEC-enriched