+ - 0:00:00
Notes for current slide
Notes for next slide

Genomic Variants ~ Session 1


Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/RU_GenomicVariants/

1 / 61

About this session

Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in previous section. In this session, we want to demonstrate how to manipulate VCF file with VariantAnnotation.

We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session:

2 / 61

Outline of this session

  • Load VCF file into a vcf object
  • Survey and retrieve information from each field
    • META field: general information of the VCF file
    • Fixed field: variants, usually in VRange format
    • Genotype field: information for each sample
    • Integrate information into single table
  • Differentiate variant types
  • Retrieve nucleotide substitution pattern
  • Trinucleotide motif analysis
3 / 61

VCF files


4 / 61

Load VCF file

library(VariantAnnotation)
vcf <- readVcf("data/SAMN01882168_filt.vcf.gz", "hg19")
vcf
## class: CollapsedVCF
## dim: 72244 1
## rowRanges(vcf):
## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
## DataFrame with 19 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
## info(header(vcf)):
## Number Type Description
## AC A Integer Allele count in genotypes, for each ALT al...
## AF A Float Allele Frequency, for each ALT allele, in ...
## AN 1 Integer Total number of alleles in called genotypes
## BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
## ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
## DP 1 Integer Approximate read depth; some reads may hav...
## DS 0 Flag Were any of the samples downsampled?
## END 1 Integer Stop position of the interval
## ExcessHet 1 Float Phred-scaled p-value for exact test of exc...
## FS 1 Float Phred-scaled p-value using Fisher's exact ...
## InbreedingCoeff 1 Float Inbreeding coefficient as estimated from t...
## MLEAC A Integer Maximum likelihood expectation (MLE) for t...
## MLEAF A Float Maximum likelihood expectation (MLE) for t...
## MQ 1 Float RMS Mapping Quality
## MQRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt...
## QD 1 Float Variant Confidence/Quality by Depth
## RAW_MQ 1 Float Raw data for RMS Mapping Quality
## ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt...
## SOR 1 Float Symmetric Odds Ratio of 2x2 contingency ta...
## geno(vcf):
## List of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
## geno(header(vcf)):
## Number Type Description
## GT 1 String Genotype
## AD R Integer Allelic depths for the ref and alt alleles in the o...
## DP 1 Integer Approximate read depth (reads with MQ=255 or with b...
## GQ 1 Integer Genotype Quality
## MIN_DP 1 Integer Minimum DP observed within the GVCF block
## PGT 1 String Physical phasing haplotype information, describing ...
## PID 1 String Physical phasing ID information, where each unique ...
## PL G Integer Normalized, Phred-scaled likelihoods for genotypes ...
## RGQ 1 Integer Unconditional reference genotype confidence, encode...
## SB 4 Integer Per-sample component statistics which comprise the ...
6 / 61

Overview

As with objects we have met in other courses, there are accessor functions to grab the contents of our VCF object.

header(vcf)
## class: VCFHeader
## samples(1): SAMN01882168
## meta(5): fileformat reference source GATKCommandLine contig
## fixed(2): FILTER ALT
## info(19): AC AF ... ReadPosRankSum SOR
## geno(10): GT AD ... RGQ SB
  • META field: general information of the VCF file; meta
  • FIX field: variants, usually in VRange format; rowRange, info
  • GENOTYPE field: information for each sample; geno
7 / 61

How many samples are in this VCF?

sampleID <- samples(header(vcf))
sampleID
## [1] "SAMN01882168"
8 / 61

What is in the META field?

meta(header(vcf))
## DataFrameList of length 5
## names(5): fileformat reference source GATKCommandLine contig
9 / 61

Retrieve information in the META field

Once we have the META extracted, we can use the dollar sign to extract specific fields.

# File format
meta(header(vcf))$fileformat
## DataFrame with 1 row and 1 column
## Value
## <character>
## fileformat VCFv4.2
# Source used for variant calling
meta(header(vcf))$source
## DataFrame with 1 row and 1 column
## Value
## <character>
## source GenotypeGVCFs
10 / 61

Retrieve information in the META field

Once we have the META extracted, we can use the dollar sign to extract specific fields.

meta(header(vcf))$contig
## DataFrame with 25 rows and 2 columns
## length assembly
## <character> <character>
## chr1 249250621 BSgenome.Hsapiens.UC..
## chr2 243199373 BSgenome.Hsapiens.UC..
## chr3 198022430 BSgenome.Hsapiens.UC..
## chr4 191154276 BSgenome.Hsapiens.UC..
## chr5 180915260 BSgenome.Hsapiens.UC..
## ... ... ...
## chr21 48129895 BSgenome.Hsapiens.UC..
## chr22 51304566 BSgenome.Hsapiens.UC..
## chrX 155270560 BSgenome.Hsapiens.UC..
## chrY 59373566 BSgenome.Hsapiens.UC..
## chrM 16571 BSgenome.Hsapiens.UC..
11 / 61

Variants information (VRange format)

The variant information is recorded in a VRange object. This is a form of GRanges object which we regularly use and are integral to Bioconductor. It includes chromosome, start position, end potision, reference allele, and alternative allele.

rd <- rowRanges(vcf)
rd[1:2]
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815337_G/T chr1 815337 * | NA G
## chr1:815341_G/A chr1 815341 * | NA G
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## chr1:815337_G/T T 32.77 .
## chr1:815341_G/A A 32.77 .
## -------
## seqinfo: 25 sequences from hg19 genome

It's not a strand-specific library. This is indicated by the "", which means unstranded. *

12 / 61

Retrieving variation information

We can use similar accessors to those used for GRanges to get position information.

  • chromosome: seqnames()
  • start position: start()
  • end position: end()
13 / 61

Position of the variations

as.vector(seqnames(rd)[1:2]) # Chromosome
## [1] "chr1" "chr1"
start(rd)[1:2] # Start position
## [1] 815337 815341
end(rd)[1:2] # End position
## [1] 815337 815341
14 / 61

How to get the reference alleles?

  • The ref() function can be used to get the reference allele for a variant. The result is a DNAStringSet.
refBase <- ref(vcf)
refBase[1:2]
## DNAStringSet object of length 2:
## width seq
## [1] 1 G
## [2] 1 G
  • Use as.character to covert it to character
refBase <- as.character(refBase)
refBase[1:2]
## [1] "G" "G"
15 / 61

How to get alternative alleles?

The alt() function can be used to get the alternative allele for a variant. The result is a DNAStringSetList as alternative alleles could be multiple changes.

altBase <- alt(vcf)
alt(vcf)[1:2]
## DNAStringSetList of length 2
## [[1]] T
## [[2]] A
16 / 61

How to get alternative alleles?

  • As it is a list, getting a character back is more complicated. First we can extract vectors from list by using lapply(). Specifically at this point we are subsetting the data to only the first variant.
# get the 1st vector of the list
altBase <- lapply(altBase, `[[`, 1)
altBase[1:2]
## [[1]]
## 1-letter DNAString object
## seq: T
##
## [[2]]
## 1-letter DNAString object
## seq: A
  • Convert DNAString object to character
altBase <- unlist(lapply(altBase, as.character))
altBase[1:2]
## [1] "T" "A"
17 / 61

INFO section from FIXED field

  • Integrated information from all the samples
  • Annotation information would be recorded in this section
  • All information is stored as a data frame
info(header(vcf))[1:2, ]
## DataFrame with 2 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## AC A Integer Allele count in geno..
## AF A Float Allele Frequency, fo..
18 / 61

INFO section

By calling info() directly on the whole VCF object, we get information for every variant.

info(vcf)[1:2, ]
## DataFrame with 2 rows and 19 columns
## AC AF AN BaseQRankSum
## <IntegerList> <NumericList> <integer> <numeric>
## chr1:815337_G/T 1 0.5 2 0.084
## chr1:815341_G/A 1 0.5 2 2.740
## ClippingRankSum DP DS END ExcessHet
## <numeric> <integer> <logical> <integer> <numeric>
## chr1:815337_G/T 0 11 FALSE NA 3.0103
## chr1:815341_G/A 0 11 FALSE NA 3.0103
## FS InbreedingCoeff MLEAC MLEAF MQ
## <numeric> <numeric> <IntegerList> <NumericList> <numeric>
## chr1:815337_G/T 3.09 NA 1 0.5 48.39
## chr1:815341_G/A 3.09 NA 1 0.5 48.39
## MQRankSum QD RAW_MQ ReadPosRankSum SOR
## <numeric> <numeric> <numeric> <numeric> <numeric>
## chr1:815337_G/T -2.744 2.98 NA -1.083 2.093
## chr1:815341_G/A -2.744 2.98 NA -1.200 2.093
19 / 61

GENOTYPE field

  • Separated by individual samples
geno(header(vcf))[1:2, ]
## DataFrame with 2 rows and 3 columns
## Number Type Description
## <character> <character> <character>
## GT 1 String Genotype
## AD R Integer Allelic depths for t..
20 / 61

Genotype information for each sample

By calling geno() directly on the whole VCF object, we get information for every variant.

paste0("GT: ", geno(header(vcf))[1, 3])
## [1] "GT: Genotype"
matGT <- geno(vcf)$GT
matGT[1:2, ]
## chr1:815337_G/T chr1:815341_G/A
## "0/1" "0/1"
21 / 61

GT Types

tbl <- table(geno(vcf)$GT)
tbl_dat <- as.data.frame(tbl)
tbl
##
## 0/1 1/1 1/2
## 70215 1853 176
  • 0/1: heterozygous mutations, one allele is the same as reference sequence
  • 1/1: homozygous mutations, both alleles are different from reference sequence
  • 1/2: heterozygous mutations, both alleles are different from reference sequence
22 / 61

GT Types

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
geom_bar(stat='Identity')+
labs(x="",y="Counts",fill="")+
theme_classic()
23 / 61

GT Types

24 / 61

Depth for each sample (DP)

paste0("DP: ", geno(header(vcf))[3, 3])
## [1] "DP: Approximate read depth (reads with MQ=255 or with bad mates are filtered)"
matDP <- geno(vcf)$DP
matDP[1:2, ]
## chr1:815337_G/T chr1:815341_G/A
## 11 11
25 / 61

DP Distribution

summary(as.vector(matDP))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 10.00 14.00 24.41 21.00 4136.00
ggplot(as.data.frame(matDP),aes(x=SAMN01882168))+geom_histogram()+
labs(x="",y="Counts")+
scale_x_log10()+
theme_classic()
26 / 61

DP Distribution

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

27 / 61

Genotype calling quality (GQ)

paste0("GQ: ", geno(header(vcf))[4, 3])
## [1] "GQ: Genotype Quality"
matGQ <- geno(vcf)$GQ
matGQ[1:2, ]
## chr1:815337_G/T chr1:815341_G/A
## 61 61
28 / 61

GQ distribution

summary(as.vector(matGQ))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 90.00 99.00 88.41 99.00 99.00
ggplot(as.data.frame(matGQ),aes(x=SAMN01882168))+geom_histogram()+
labs(x="",y="Counts")+
scale_x_log10()+
theme_classic()
29 / 61

GQ Distribution

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

30 / 61

Gathering variant information


31 / 61

Gathering information ~ GT 0/1 and 1/1

  • select variants with GT 0/1 or 1/1
var_1 <- rownames(geno(vcf)$GT)[
geno(vcf)$GT=="0/1" |
geno(vcf)$GT=="1/1"]
  • Extract variant information
varTab1 <- data.frame(variant=names(rd)[names(rd) %in% var_1],
chr=as.vector(seqnames(rd)[names(rd) %in% var_1]),
start=start(rd)[names(rd) %in% var_1],
end=end(rd)[names(rd) %in% var_1],
stringsAsFactors = FALSE)
32 / 61

Gathering information ~ GT 0/1 and 1/1

  • Ref alleles are retrieved from ref(vcf)
ref_base <- ref(vcf)[rownames(vcf) %in% var_1]
ref_base[1:2]
## DNAStringSet object of length 2:
## width seq
## [1] 1 G
## [2] 1 G
varTab1$refBase <- as.character(ref_base)
33 / 61

Gathering information ~ GT 0/1 and 1/1

  • Alt alleles are retrieved from alt(vcf)
alt_base <- lapply(alt(vcf)[rownames(vcf) %in% var_1],`[[`,1)
alt_base[1]
## [[1]]
## 1-letter DNAString object
## seq: T
alt_base <- lapply(alt_base,as.character)
alt_base[1]
## [[1]]
## [1] "T"
varTab1$altBase <- unlist(alt_base)
34 / 61

Gathering information ~ GT 0/1 and 1/1

  • Extract counts from AD
adCount <- geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_1]
adCount[1]
## [[1]]
## [1] 8 3
varTab1$refCount <- unlist(lapply(adCount,`[[`,1))
varTab1$altCount <- unlist(lapply(adCount,`[[`,2))
35 / 61

Gathering information ~ GT 0/1 and 1/1

  • genoType: genotype (GT)
  • gtQuality: genotyping quality (GQ)
varTab1$genoType <- geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_1]
varTab1$gtQuality <- geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_1]
36 / 61

Gathering information ~ genotype 1/2

  • Heterozygous mutations, both alleles are different from the reference sequence.
  • Both ref allele and alt allele have to be retrieved from alt(vcf)
var_2 <- rownames(geno(vcf)$GT)[geno(vcf)$GT=="1/2"]
varTab2 <- data.frame(variant=names(rd)[names(rd) %in% var_2],
chr=as.vector(seqnames(rd)[names(rd) %in% var_2]),
start=start(rd)[names(rd) %in% var_2],
end=end(rd)[names(rd) %in% var_2],
refBase=unlist(lapply(lapply(
alt(vcf)[rownames(vcf) %in% var_2],`[[`,1),as.character)),
altBase=unlist(lapply(lapply(
alt(vcf)[rownames(vcf) %in% var_2],`[[`,2),as.character)),
refCount=unlist(lapply(
geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,2)),
altCount=unlist(
lapply(geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,3)),
genoType=geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_2],
gtQuality=geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_2],
stringsAsFactors = FALSE)
37 / 61

Merged datatable

varTab <- rbind(varTab1, varTab2)
varTab[1:2, ]
## variant chr start end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1
## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1
## gtQuality
## 1 61
## 2 61
38 / 61

Differentiate variant types

  • SNP: single-nucleotide substitutions
  • DEL: deletions
  • INS: insertions
  • Others: complicated variations, such as Ins/Del or Inversion
39 / 61

Variant types

# differentiate SNP/INS/DEL/Others
for (k in 1:length(varTab$variant)) {
if (width(varTab$refBase[k]) < width(varTab$altBase[k])) {
varTab$mutType[k] <- "INS"
} else if (width(varTab$refBase[k]) > width(varTab$altBase[k])) {
varTab$mutType[k] <- "DEL"
} else if (width(varTab$refBase[k]) == 1 & width(varTab$altBase[k]) == 1) {
varTab$mutType[k] <- "SNP"
} else {
varTab$mutType[k] <- "Others"
}
}
#
tbl <- table(varTab$mutType)
tbl_dat <- as.data.frame(tbl)
tbl
##
## DEL INS Others SNP
## 4351 3174 5 64714
40 / 61

Variant types

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
geom_bar(stat = 'identity')+
labs(x="",y="Mutations",fill="")+
theme_classic()
41 / 61

Variant types

42 / 61

Nucleotide substitution pattern

  • only SNPs
  • Transition (Ti): purine-to-purine, pyrimidine-to-pyrimidine
  • Transversion (Tv): purine-to-pyrimidine, pyrimidine-to-purine
43 / 61

Nucleotide substitution

# Transition (Ti)
ti <- c("A>G","G>A","C>T","T>C")
# Transveersion (Tv)
tv <- c("A>T","A>C","G>T","G>C","C>A","C>G","T>A","T>G")
varTab$nuSub <- paste0(varTab$refBase,">",varTab$altBase)
varTab$TiTv[varTab$nuSub %in% ti] <- "Ti"
varTab$TiTv[varTab$nuSub %in% tv] <- "Tv"
varTab[1:2,]
## variant chr start end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1
## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1
## gtQuality mutType nuSub TiTv
## 1 61 SNP G>T Tv
## 2 61 SNP G>A Ti
44 / 61

Nucleotide substitution

varX <- varTab[varTab$mutType == "SNP", ]
tbl <- table(varX$nuSub)
tbl_dat <- as.data.frame(tbl)
tbl
##
## A>C A>G A>T C>A C>G C>T G>A G>C G>T T>A T>C T>G
## 3062 9380 3083 3448 2830 10393 10579 2793 3515 3143 9424 3064
45 / 61

Nucleotide substitution

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
geom_bar(stat = 'identity')+
labs(x="",y="Mutations",fill="")+
theme(legend.position = "none")
46 / 61

Nucleotide substitution

47 / 61

Ti/Tv

tbl <- table(varX$TiTv)
tbl_dat <- as.data.frame(tbl)
tbl
##
## Ti Tv
## 39776 24938
48 / 61

Ti/Tv

ggplot(as.data.frame(table(varX$TiTv)),aes(x=Var1,y=Freq,fill=Var1))+
geom_bar(stat = 'identity')+labs(x="",y="Mutations",fill="")+
theme(legend.position = "none")
49 / 61

Ti/Tv

50 / 61

Trinucleotide motif analysis


51 / 61

Trinucleotide motif analysis

Other constraints may bias nucleotide transitions. One of these is the trinucleotide motif around the variant (from -1 to +1). This pattern could reflect different causes for the mutations.

One such example is APOBEC, a deaminase leading to C>T substitutions in the motif T[C>T]W. So, the proportion of T[C>T]W in overall C>T substitutions could reflect to APOBEC activity. More applications of the trinucleotide motifs will be discussed in session3.

52 / 61

Extract C>T substituion

library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicFeatures)
library(stringr)
#
rd_idx <- str_split(names(rd), "_", simplify = T)
rd_idx[1:2, ]
## [,1] [,2]
## [1,] "chr1:815337" "G/T"
## [2,] "chr1:815341" "G/A"
rd_sub <- rd[rd_idx[, 2] == "C/T"]
rd_sub[1:2, ]
## GRanges object with 2 ranges and 5 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815520_C/T chr1 815520 * | NA C
## chr1:815573_C/T chr1 815573 * | NA C
## ALT QUAL FILTER
## <DNAStringSetList> <numeric> <character>
## chr1:815520_C/T T 916.77 .
## chr1:815573_C/T T 70.77 .
## -------
## seqinfo: 25 sequences from hg19 genome
53 / 61

Extract sequences beneath the mutation from -1 to +1

We can use the getSeq() function to get the sequence underlying these variants, as we have done with GRanges.

rd_sub$triNu <- getSeq(Hsapiens,
seqnames(rd_sub),
start=start(rd_sub)-1,
end=end(rd_sub)+1)
rd_sub[1:2]
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand | paramRangeID REF
## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet>
## chr1:815520_C/T chr1 815520 * | NA C
## chr1:815573_C/T chr1 815573 * | NA C
## ALT QUAL FILTER triNu
## <DNAStringSetList> <numeric> <character> <DNAStringSet>
## chr1:815520_C/T T 916.77 . CCT
## chr1:815573_C/T T 70.77 . ACA
## -------
## seqinfo: 25 sequences from hg19 genome
54 / 61

Trinucleotide pattern

tbl <- table(rd_sub$triNu)
tbl_dat <- as.data.frame(tbl)
tbl
##
## ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT
## 1176 535 1096 879 619 429 544 641 554 387 511 556 633 535 584 706
55 / 61

Trinucleotide pattern

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
geom_bar(stat='identity')+
labs(x="",y="Variants",fill="")+
theme(legend.position = "none")
56 / 61

Trinucleotide pattern

57 / 61

APOBEC activation in this sample?

  • APOBEC target: TCW (TCA/TCT)
# TCW: TCA/TCT
tbl_dat$APOBEC_target <- tbl_dat$Var1 %in% c("TCA", "TCT")
tbl_dat[1:2]
## Var1 Freq
## 1 ACA 1176
## 2 ACC 535
## 3 ACG 1096
## 4 ACT 879
## 5 CCA 619
## 6 CCC 429
## 7 CCG 544
## 8 CCT 641
## 9 GCA 554
## 10 GCC 387
## 11 GCG 511
## 12 GCT 556
## 13 TCA 633
## 14 TCC 535
## 15 TCG 584
## 16 TCT 706
# collapse by APOBEC_target
apobec_dat <- aggregate(Freq ~ APOBEC_target, tbl_dat, FUN = sum, na.rm = TRUE)
apobec_dat
## APOBEC_target Freq
## 1 FALSE 9046
## 2 TRUE 1339
58 / 61

APOBEC activation in this sample?

ggplot(apobec_dat,aes(x=APOBEC_target,y=Freq,fill=APOBEC_target))+
geom_bar(stat='identity')+
labs(x="",y="Variants",fill="")+
theme(legend.position = "none")
59 / 61

APOBEC activation in this sample?

Probably not APOBEC-enriched

60 / 61

Exercise Time

61 / 61

About this session

Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in previous section. In this session, we want to demonstrate how to manipulate VCF file with VariantAnnotation.

We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session:

2 / 61
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow