Genomic Variants (part 1)


About this session

Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in previous section. In this session, we want to demonstrate how to manipulate VCF file with VariantAnnotation.

We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session: - Raw data from SRA: PRJNA185252 - Reference genome: BSgenome.Hsapiens.UCSC.hg19 - Variant calling procedure is based on GATK Best Practice + Align to reference genome: Burrows-Wheeler Aligner, bwa-mem + Sort and remove duplicates: Picard-tools, SortSam/MarkDuplicates + Recalibrate Base Quality Scores: GATK, BaseRecalibrator/ApplyBQSR + Variant Calling: GATK, HaplotypeCaller/GenotypeGVCFs - Demo data: SAMN01882168 (“data/SAMN01882168_filt.vcf.gz”)

Outline of this session

  • Load VCF file into a vcf object
  • Survey and retrieve information from each field
    • META field: general information of the VCF file
    • Fixed field: variants, usually in VRange format
    • Genotype field: information for each sample
    • Integrate information into single table
  • Differentiate variant types
  • Retrieve nucleotide substitution pattern
  • Trinucleotide motif analysis

VCF files


Load VCF file

library(VariantAnnotation)
vcf <- readVcf("data/SAMN01882168_filt.vcf.gz", "hg19")
vcf
## class: CollapsedVCF 
## dim: 72244 1 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 19 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
## info(header(vcf)):
##                    Number Type    Description                                  
##    AC              A      Integer Allele count in genotypes, for each ALT al...
##    AF              A      Float   Allele Frequency, for each ALT allele, in ...
##    AN              1      Integer Total number of alleles in called genotypes  
##    BaseQRankSum    1      Float   Z-score from Wilcoxon rank sum test of Alt...
##    ClippingRankSum 1      Float   Z-score From Wilcoxon rank sum test of Alt...
##    DP              1      Integer Approximate read depth; some reads may hav...
##    DS              0      Flag    Were any of the samples downsampled?         
##    END             1      Integer Stop position of the interval                
##    ExcessHet       1      Float   Phred-scaled p-value for exact test of exc...
##    FS              1      Float   Phred-scaled p-value using Fisher's exact ...
##    InbreedingCoeff 1      Float   Inbreeding coefficient as estimated from t...
##    MLEAC           A      Integer Maximum likelihood expectation (MLE) for t...
##    MLEAF           A      Float   Maximum likelihood expectation (MLE) for t...
##    MQ              1      Float   RMS Mapping Quality                          
##    MQRankSum       1      Float   Z-score From Wilcoxon rank sum test of Alt...
##    QD              1      Float   Variant Confidence/Quality by Depth          
##    RAW_MQ          1      Float   Raw data for RMS Mapping Quality             
##    ReadPosRankSum  1      Float   Z-score from Wilcoxon rank sum test of Alt...
##    SOR             1      Float   Symmetric Odds Ratio of 2x2 contingency ta...
## geno(vcf):
##   List of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
## geno(header(vcf)):
##           Number Type    Description                                           
##    GT     1      String  Genotype                                              
##    AD     R      Integer Allelic depths for the ref and alt alleles in the o...
##    DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
##    GQ     1      Integer Genotype Quality                                      
##    MIN_DP 1      Integer Minimum DP observed within the GVCF block             
##    PGT    1      String  Physical phasing haplotype information, describing ...
##    PID    1      String  Physical phasing ID information, where each unique ...
##    PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
##    RGQ    1      Integer Unconditional reference genotype confidence, encode...
##    SB     4      Integer Per-sample component statistics which comprise the ...

Overview

As with objects we have met in other courses, there are accessor functions to grab the contents of our VCF object.

header(vcf)
## class: VCFHeader 
## samples(1): SAMN01882168
## meta(5): fileformat reference source GATKCommandLine contig
## fixed(2): FILTER ALT
## info(19): AC AF ... ReadPosRankSum SOR
## geno(10): GT AD ... RGQ SB
  • META field: general information of the VCF file; meta
  • FIX field: variants, usually in VRange format; rowRange, info
  • GENOTYPE field: information for each sample; geno

How many samples are in this VCF?

sampleID <- samples(header(vcf))
sampleID
## [1] "SAMN01882168"

What is in the META field?

meta(header(vcf))
## DataFrameList of length 5
## names(5): fileformat reference source GATKCommandLine contig

Retrieve information in the META field

Once we have the META extracted, we can use the dollar sign to extract specific fields.

# File format
meta(header(vcf))$fileformat
## DataFrame with 1 row and 1 column
##                  Value
##            <character>
## fileformat     VCFv4.2
# Source used for variant calling
meta(header(vcf))$source
## DataFrame with 1 row and 1 column
##                Value
##          <character>
## source GenotypeGVCFs

Retrieve information in the META field

Once we have the META extracted, we can use the dollar sign to extract specific fields.

meta(header(vcf))$contig
## DataFrame with 25 rows and 2 columns
##            length               assembly
##       <character>            <character>
## chr1    249250621 BSgenome.Hsapiens.UC..
## chr2    243199373 BSgenome.Hsapiens.UC..
## chr3    198022430 BSgenome.Hsapiens.UC..
## chr4    191154276 BSgenome.Hsapiens.UC..
## chr5    180915260 BSgenome.Hsapiens.UC..
## ...           ...                    ...
## chr21    48129895 BSgenome.Hsapiens.UC..
## chr22    51304566 BSgenome.Hsapiens.UC..
## chrX    155270560 BSgenome.Hsapiens.UC..
## chrY     59373566 BSgenome.Hsapiens.UC..
## chrM        16571 BSgenome.Hsapiens.UC..

Variants information (VRange format)

The variant information is recorded in a VRange object. This is a form of GRanges object which we regularly use and are integral to Bioconductor. It includes chromosome, start position, end potision, reference allele, and alternative allele.

rd <- rowRanges(vcf)
rd[1:2]
## GRanges object with 2 ranges and 5 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815337_G/T     chr1    815337      * |           NA              G
##   chr1:815341_G/A     chr1    815341      * |           NA              G
##                                  ALT      QUAL      FILTER
##                   <DNAStringSetList> <numeric> <character>
##   chr1:815337_G/T                  T     32.77           .
##   chr1:815341_G/A                  A     32.77           .
##   -------
##   seqinfo: 25 sequences from hg19 genome

It’s not a strand-specific library. This is indicated by the "", which means unstranded. *

Retrieving variation information

We can use similar accessors to those used for GRanges to get position information. - chromosome: seqnames() - start position: start() - end position: end()

Position of the variations

as.vector(seqnames(rd)[1:2])  # Chromosome
## [1] "chr1" "chr1"
start(rd)[1:2]  # Start position
## [1] 815337 815341
end(rd)[1:2]  # End position
## [1] 815337 815341

How to get the reference alleles?

  • The ref() function can be used to get the reference allele for a variant. The result is a DNAStringSet.
refBase <- ref(vcf)
refBase[1:2]
## DNAStringSet object of length 2:
##     width seq
## [1]     1 G
## [2]     1 G
  • Use as.character to covert it to character
refBase <- as.character(refBase)
refBase[1:2]
## [1] "G" "G"

How to get alternative alleles?

The alt() function can be used to get the alternative allele for a variant. The result is a DNAStringSetList as alternative alleles could be multiple changes.

altBase <- alt(vcf)
alt(vcf)[1:2]
## DNAStringSetList of length 2
## [[1]] T
## [[2]] A

How to get alternative alleles?

  • As it is a list, getting a character back is more complicated. First we can extract vectors from list by using lapply(). Specifically at this point we are subsetting the data to only the first variant.
# get the 1st vector of the list
altBase <- lapply(altBase, `[[`, 1)
altBase[1:2]
## [[1]]
## 1-letter DNAString object
## seq: T
## 
## [[2]]
## 1-letter DNAString object
## seq: A
  • Convert DNAString object to character
altBase <- unlist(lapply(altBase, as.character))
altBase[1:2]
## [1] "T" "A"

INFO section from FIXED field

  • Integrated information from all the samples
  • Annotation information would be recorded in this section
  • All information is stored as a data frame
info(header(vcf))[1:2, ]
## DataFrame with 2 rows and 3 columns
##         Number        Type            Description
##    <character> <character>            <character>
## AC           A     Integer Allele count in geno..
## AF           A       Float Allele Frequency, fo..

INFO section

By calling info() directly on the whole VCF object, we get information for every variant.

info(vcf)[1:2, ]
## DataFrame with 2 rows and 19 columns
##                            AC            AF        AN BaseQRankSum
##                 <IntegerList> <NumericList> <integer>    <numeric>
## chr1:815337_G/T             1           0.5         2        0.084
## chr1:815341_G/A             1           0.5         2        2.740
##                 ClippingRankSum        DP        DS       END ExcessHet
##                       <numeric> <integer> <logical> <integer> <numeric>
## chr1:815337_G/T               0        11     FALSE        NA    3.0103
## chr1:815341_G/A               0        11     FALSE        NA    3.0103
##                        FS InbreedingCoeff         MLEAC         MLEAF        MQ
##                 <numeric>       <numeric> <IntegerList> <NumericList> <numeric>
## chr1:815337_G/T      3.09              NA             1           0.5     48.39
## chr1:815341_G/A      3.09              NA             1           0.5     48.39
##                 MQRankSum        QD    RAW_MQ ReadPosRankSum       SOR
##                 <numeric> <numeric> <numeric>      <numeric> <numeric>
## chr1:815337_G/T    -2.744      2.98        NA         -1.083     2.093
## chr1:815341_G/A    -2.744      2.98        NA         -1.200     2.093

GENOTYPE field

  • Separated by individual samples
geno(header(vcf))[1:2, ]
## DataFrame with 2 rows and 3 columns
##         Number        Type            Description
##    <character> <character>            <character>
## GT           1      String               Genotype
## AD           R     Integer Allelic depths for t..

Genotype information for each sample

By calling geno() directly on the whole VCF object, we get information for every variant.

paste0("GT: ", geno(header(vcf))[1, 3])
## [1] "GT: Genotype"
matGT <- geno(vcf)$GT
matGT[1:2, ]
## chr1:815337_G/T chr1:815341_G/A 
##           "0/1"           "0/1"

GT Types

tbl <- table(geno(vcf)$GT)
tbl_dat <- as.data.frame(tbl)
tbl
## 
##   0/1   1/1   1/2 
## 70215  1853   176
  • 0/1: heterozygous mutations, one allele is the same as reference sequence
  • 1/1: homozygous mutations, both alleles are different from reference sequence
  • 1/2: heterozygous mutations, both alleles are different from reference sequence

GT Types

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat='Identity')+
  labs(x="",y="Counts",fill="")+
  theme_classic()

GT Types

Depth for each sample (DP)

paste0("DP: ", geno(header(vcf))[3, 3])
## [1] "DP: Approximate read depth (reads with MQ=255 or with bad mates are filtered)"
matDP <- geno(vcf)$DP
matDP[1:2, ]
## chr1:815337_G/T chr1:815341_G/A 
##              11              11

DP Distribution

summary(as.vector(matDP))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   10.00   14.00   24.41   21.00 4136.00
ggplot(as.data.frame(matDP),aes(x=SAMN01882168))+geom_histogram()+
  labs(x="",y="Counts")+
  scale_x_log10()+
  theme_classic()

DP Distribution

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Genotype calling quality (GQ)

paste0("GQ: ", geno(header(vcf))[4, 3])
## [1] "GQ: Genotype Quality"
matGQ <- geno(vcf)$GQ
matGQ[1:2, ]
## chr1:815337_G/T chr1:815341_G/A 
##              61              61

GQ distribution

summary(as.vector(matGQ))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   90.00   99.00   88.41   99.00   99.00
ggplot(as.data.frame(matGQ),aes(x=SAMN01882168))+geom_histogram()+
  labs(x="",y="Counts")+
  scale_x_log10()+
  theme_classic()

GQ Distribution

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Gathering variant information


Gathering information ~ GT 0/1 and 1/1

  • select variants with GT 0/1 or 1/1
var_1 <- rownames(geno(vcf)$GT)[
  geno(vcf)$GT=="0/1" | 
    geno(vcf)$GT=="1/1"]
  • Extract variant information
varTab1 <- data.frame(variant=names(rd)[names(rd) %in% var_1],
                      chr=as.vector(seqnames(rd)[names(rd) %in% var_1]),
                      start=start(rd)[names(rd) %in% var_1],
                      end=end(rd)[names(rd) %in% var_1],
                      stringsAsFactors = FALSE)

Gathering information ~ GT 0/1 and 1/1

  • Ref alleles are retrieved from ref(vcf)
ref_base <- ref(vcf)[rownames(vcf) %in% var_1]
ref_base[1:2]
## DNAStringSet object of length 2:
##     width seq
## [1]     1 G
## [2]     1 G
varTab1$refBase <- as.character(ref_base)

Gathering information ~ GT 0/1 and 1/1

  • Alt alleles are retrieved from alt(vcf)
alt_base <- lapply(alt(vcf)[rownames(vcf) %in% var_1],`[[`,1)
alt_base[1]
## [[1]]
## 1-letter DNAString object
## seq: T
alt_base <- lapply(alt_base,as.character)
alt_base[1]
## [[1]]
## [1] "T"
varTab1$altBase <- unlist(alt_base)

Gathering information ~ GT 0/1 and 1/1

  • Extract counts from AD
adCount <- geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_1]
adCount[1]
## [[1]]
## [1] 8 3
varTab1$refCount <- unlist(lapply(adCount,`[[`,1))
varTab1$altCount <- unlist(lapply(adCount,`[[`,2))

Gathering information ~ GT 0/1 and 1/1

  • genoType: genotype (GT)
  • gtQuality: genotyping quality (GQ)
varTab1$genoType <- geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_1]
varTab1$gtQuality <- geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_1]

Gathering information ~ genotype 1/2

  • Heterozygous mutations, both alleles are different from the reference sequence.
  • Both ref allele and alt allele have to be retrieved from alt(vcf)
var_2 <- rownames(geno(vcf)$GT)[geno(vcf)$GT=="1/2"]
varTab2 <- data.frame(variant=names(rd)[names(rd) %in% var_2],
                      chr=as.vector(seqnames(rd)[names(rd) %in% var_2]),
                      start=start(rd)[names(rd) %in% var_2],
                      end=end(rd)[names(rd) %in% var_2],
                      refBase=unlist(lapply(lapply(
                        alt(vcf)[rownames(vcf) %in% var_2],`[[`,1),as.character)),
                      altBase=unlist(lapply(lapply(
                        alt(vcf)[rownames(vcf) %in% var_2],`[[`,2),as.character)),
                      refCount=unlist(lapply(
                        geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,2)),
                      altCount=unlist(
                        lapply(geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,3)),
                      genoType=geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_2],
                      gtQuality=geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_2],
                      stringsAsFactors = FALSE)

Merged datatable

varTab <- rbind(varTab1, varTab2)
varTab[1:2, ]
##           variant  chr  start    end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337       G       T        8        3      0/1
## 2 chr1:815341_G/A chr1 815341 815341       G       A        8        3      0/1
##   gtQuality
## 1        61
## 2        61

Differentiate variant types

  • SNP: single-nucleotide substitutions
  • DEL: deletions
  • INS: insertions
  • Others: complicated variations, such as Ins/Del or Inversion

Variant types

# differentiate SNP/INS/DEL/Others
for (k in 1:length(varTab$variant)) {
    if (width(varTab$refBase[k]) < width(varTab$altBase[k])) {
        varTab$mutType[k] <- "INS"
    } else if (width(varTab$refBase[k]) > width(varTab$altBase[k])) {
        varTab$mutType[k] <- "DEL"
    } else if (width(varTab$refBase[k]) == 1 & width(varTab$altBase[k]) == 1) {
        varTab$mutType[k] <- "SNP"
    } else {
        varTab$mutType[k] <- "Others"
    }
}
# 
tbl <- table(varTab$mutType)
tbl_dat <- as.data.frame(tbl)
tbl
## 
##    DEL    INS Others    SNP 
##   4351   3174      5  64714

Variant types

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+
  labs(x="",y="Mutations",fill="")+
  theme_classic()

Variant types

Nucleotide substitution pattern

  • only SNPs
  • Transition (Ti): purine-to-purine, pyrimidine-to-pyrimidine
  • Transversion (Tv): purine-to-pyrimidine, pyrimidine-to-purine

Nucleotide substitution

# Transition (Ti)
ti <- c("A>G","G>A","C>T","T>C")
# Transveersion (Tv)
tv <- c("A>T","A>C","G>T","G>C","C>A","C>G","T>A","T>G")
varTab$nuSub <- paste0(varTab$refBase,">",varTab$altBase)
varTab$TiTv[varTab$nuSub %in% ti] <- "Ti"
varTab$TiTv[varTab$nuSub %in% tv] <- "Tv"
varTab[1:2,]
##           variant  chr  start    end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337       G       T        8        3      0/1
## 2 chr1:815341_G/A chr1 815341 815341       G       A        8        3      0/1
##   gtQuality mutType nuSub TiTv
## 1        61     SNP   G>T   Tv
## 2        61     SNP   G>A   Ti

Nucleotide substitution

varX <- varTab[varTab$mutType == "SNP", ]
tbl <- table(varX$nuSub)
tbl_dat <- as.data.frame(tbl)
tbl
## 
##   A>C   A>G   A>T   C>A   C>G   C>T   G>A   G>C   G>T   T>A   T>C   T>G 
##  3062  9380  3083  3448  2830 10393 10579  2793  3515  3143  9424  3064

Nucleotide substitution

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+
  labs(x="",y="Mutations",fill="")+
  theme(legend.position = "none")

Nucleotide substitution

Ti/Tv

tbl <- table(varX$TiTv)
tbl_dat <- as.data.frame(tbl)
tbl
## 
##    Ti    Tv 
## 39776 24938

Ti/Tv

ggplot(as.data.frame(table(varX$TiTv)),aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+labs(x="",y="Mutations",fill="")+
  theme(legend.position = "none")

Ti/Tv

Trinucleotide motif analysis


Trinucleotide motif analysis

Other constraints may bias nucleotide transitions. One of these is the trinucleotide motif around the variant (from -1 to +1). This pattern could reflect different causes for the mutations.

One such example is APOBEC, a deaminase leading to C>T substitutions in the motif T[C>T]W. So, the proportion of T[C>T]W in overall C>T substitutions could reflect to APOBEC activity. More applications of the trinucleotide motifs will be discussed in session3.

Extract C>T substituion

library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicFeatures)
library(stringr)
# 
rd_idx <- str_split(names(rd), "_", simplify = T)
rd_idx[1:2, ]
##      [,1]          [,2] 
## [1,] "chr1:815337" "G/T"
## [2,] "chr1:815341" "G/A"
rd_sub <- rd[rd_idx[, 2] == "C/T"]
rd_sub[1:2, ]
## GRanges object with 2 ranges and 5 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815520_C/T     chr1    815520      * |           NA              C
##   chr1:815573_C/T     chr1    815573      * |           NA              C
##                                  ALT      QUAL      FILTER
##                   <DNAStringSetList> <numeric> <character>
##   chr1:815520_C/T                  T    916.77           .
##   chr1:815573_C/T                  T     70.77           .
##   -------
##   seqinfo: 25 sequences from hg19 genome

Extract sequences beneath the mutation from -1 to +1

We can use the getSeq() function to get the sequence underlying these variants, as we have done with GRanges.

rd_sub$triNu <- getSeq(Hsapiens,
                 seqnames(rd_sub),
                 start=start(rd_sub)-1,
                 end=end(rd_sub)+1)
rd_sub[1:2]
## GRanges object with 2 ranges and 6 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815520_C/T     chr1    815520      * |           NA              C
##   chr1:815573_C/T     chr1    815573      * |           NA              C
##                                  ALT      QUAL      FILTER          triNu
##                   <DNAStringSetList> <numeric> <character> <DNAStringSet>
##   chr1:815520_C/T                  T    916.77           .            CCT
##   chr1:815573_C/T                  T     70.77           .            ACA
##   -------
##   seqinfo: 25 sequences from hg19 genome

Trinucleotide pattern

tbl <- table(rd_sub$triNu)
tbl_dat <- as.data.frame(tbl)
tbl
## 
##  ACA  ACC  ACG  ACT  CCA  CCC  CCG  CCT  GCA  GCC  GCG  GCT  TCA  TCC  TCG  TCT 
## 1176  535 1096  879  619  429  544  641  554  387  511  556  633  535  584  706

Trinucleotide pattern

ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat='identity')+
  labs(x="",y="Variants",fill="")+
  theme(legend.position = "none")

Trinucleotide pattern

APOBEC activation in this sample?

  • APOBEC target: TCW (TCA/TCT)
# TCW: TCA/TCT
tbl_dat$APOBEC_target <- tbl_dat$Var1 %in% c("TCA", "TCT")
tbl_dat[1:2]
##    Var1 Freq
## 1   ACA 1176
## 2   ACC  535
## 3   ACG 1096
## 4   ACT  879
## 5   CCA  619
## 6   CCC  429
## 7   CCG  544
## 8   CCT  641
## 9   GCA  554
## 10  GCC  387
## 11  GCG  511
## 12  GCT  556
## 13  TCA  633
## 14  TCC  535
## 15  TCG  584
## 16  TCT  706
# collapse by APOBEC_target
apobec_dat <- aggregate(Freq ~ APOBEC_target, tbl_dat, FUN = sum, na.rm = TRUE)
apobec_dat
##   APOBEC_target Freq
## 1         FALSE 9046
## 2          TRUE 1339

APOBEC activation in this sample?

ggplot(apobec_dat,aes(x=APOBEC_target,y=Freq,fill=APOBEC_target))+
  geom_bar(stat='identity')+
  labs(x="",y="Variants",fill="")+
  theme(legend.position = "none")

APOBEC activation in this sample?

Probably not APOBEC-enriched

Exercise Time