Genomic Variant Manipulation

In this exercise, we will practice manipulating VCF files. Use the VCF file "data/SAMN01882168_filt.vcf.gz" to answer the following questions.

  1. Read in the VCF file and make a VRanges object.

  2. Extract the genotype fields from the VCF header and explain the abbreviations.

## DataFrame with 10 rows and 3 columns
##             Number        Type            Description
##        <character> <character>            <character>
## GT               1      String               Genotype
## AD               R     Integer Allelic depths for t..
## DP               1     Integer Approximate read dep..
## GQ               1     Integer       Genotype Quality
## MIN_DP           1     Integer Minimum DP observed ..
## PGT              1      String Physical phasing hap..
## PID              1      String Physical phasing ID ..
## PL               G     Integer Normalized, Phred-sc..
## RGQ              1     Integer Unconditional refere..
## SB               4     Integer Per-sample component..
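
One possible way to approach questions 1 and 2 is sketched below. It assumes that the Bioconductor package VariantAnnotation is installed and that the file sits at the relative path given above. The genome label "hg19" is an assumption made only for illustration, since this exercise does not state the genome build. The block is guarded, so it does nothing if the package or the file is missing.

```r
# Sketch only: assumes VariantAnnotation (Bioconductor) is installed
# and that the exercise file exists at this relative path.
vcf_path <- "data/SAMN01882168_filt.vcf.gz"

# ASSUMPTION: the genome build is not stated in this exercise;
# replace "hg19" with the build the variants were actually called on.
genome_label <- "hg19"

if (requireNamespace("VariantAnnotation", quietly = TRUE) && file.exists(vcf_path)) {
  library(VariantAnnotation)

  # Read the whole file into a VCF object.
  vcf <- readVcf(vcf_path, genome_label)

  # Coerce to a VRanges object (one row per variant and sample).
  vr <- as(vcf, "VRanges")

  # The definitions of the genotype (FORMAT) fields live in the header;
  # this is the kind of DataFrame shown in the output above.
  print(geno(header(vcf)))
}
```

Reading the Description column of that DataFrame is a convenient starting point for explaining each abbreviation (GT, AD, DP, GQ and so on).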

  3. Count the number of variants on each chromosome.

  4. Subset to just the variants on chr21.
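
The counting and subsetting steps can be illustrated on toy data with base R alone. The `chr` vector below is made up for this sketch; with a real VCF object it would presumably come from something like `as.character(seqnames(rowRanges(vcf)))`, which is an assumption about the workflow and not part of this exercise sheet.

```r
# Toy chromosome labels, one per variant (hypothetical data).
chr <- c("chr1", "chr1", "chr21", "chr2", "chr21", "chr21")

# Number of variants on each chromosome.
var_per_chr <- table(chr)
print(var_per_chr)

# Logical index marking the variants located on chr21;
# the same index could be used to subset a VCF object by row.
is_chr21 <- chr == "chr21"
chr21_only <- chr[is_chr21]
print(length(chr21_only))
```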

  5. Extract the GT information from the VCF subset and make a bar chart showing the number of variants for each genotype.

  6. Extract the DP information from the VCF subset and make a histogram.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   13.00   19.00   25.05   30.00  629.00
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

  7. Extract the GQ information from the VCF subset and make a histogram.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   99.00   99.00   91.32   99.00   99.00
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 12 rows containing non-finite values (stat_bin).
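
The warnings above are consistent with a log-scaled x-axis being applied to GQ values that include 0 (the summary reports a minimum of 0.00): the logarithm of 0 is `-Inf`, and such values are dropped before binning. The base R sketch below reproduces the effect on made-up GQ values and shows one common workaround, adding a pseudocount of 1 before taking the logarithm. Whether the original plot used this transformation is an assumption.

```r
# Toy GQ values including zeros (hypothetical data).
gq <- c(0, 0, 35, 62, 81, 99, 99, 99)

# log10(0) is -Inf, so these entries would be dropped by a log-scaled histogram.
log_gq <- log10(gq)
n_dropped <- sum(!is.finite(log_gq))
print(n_dropped)

# Workaround: shift by a pseudocount so every value stays finite.
log_gq_shifted <- log10(gq + 1)
print(all(is.finite(log_gq_shifted)))
```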

  8. Make a data.frame containing the information shown below.
##              variant   chr   start     end refBase altBase refCount altCount
## 1 chr21:9412358_TA/T chr21 9412358 9412359      TA       T        6        4
## 2  chr21:9413584_G/A chr21 9413584 9413584       G       A        3        8
##   genoType gtQuality
## 1      0/1        81
## 2      0/1        62
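
A minimal base R sketch of how such a table could be assembled is given below. It parses the two variant names shown above and computes `end` as `start + nchar(refBase) - 1`, which matches both example rows. The count, genotype and quality vectors are typed in by hand from those rows; in a real workflow they would presumably be taken from the AD, GT and GQ genotype fields, which is an assumption about the intended solution.

```r
# The two variant names from the example output above.
variant <- c("chr21:9412358_TA/T", "chr21:9413584_G/A")

# Values copied by hand from the example rows (stand-ins for AD, GT and GQ).
refCount  <- c(6, 3)
altCount  <- c(4, 8)
genoType  <- c("0/1", "0/1")
gtQuality <- c(81, 62)

# Name layout: <chr>:<start>_<ref>/<alt>
pattern <- "^(.+):([0-9]+)_([^/]+)/(.+)$"
chr     <- sub(pattern, "\\1", variant)
start   <- as.integer(sub(pattern, "\\2", variant))
refBase <- sub(pattern, "\\3", variant)
altBase <- sub(pattern, "\\4", variant)

# The end coordinate covers the full reference allele.
end <- start + nchar(refBase) - 1L

var_df <- data.frame(variant, chr, start, end, refBase, altBase,
                     refCount, altCount, genoType, gtQuality,
                     stringsAsFactors = FALSE)
print(var_df)
```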

  9. Classify the variants by type (SNP/INS/DEL/Others) and count the variants of each type.
## 
##  DEL  INS  SNP 
##  612  482 9536
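
One simple way to assign these types is to compare allele lengths, as in the base R sketch below. The rule used here (both alleles of length 1 is a SNP, a longer reference allele is a deletion, a longer alternate allele is an insertion, anything else is "Others") is an assumption for illustration, and the alleles are toy data, so the counts will not match the output above.

```r
# Toy reference and alternate alleles (hypothetical data).
refBase <- c("TA", "G", "C", "AT")
altBase <- c("T", "A", "CAG", "GC")

ref_len <- nchar(refBase)
alt_len <- nchar(altBase)

# Classify each variant by comparing allele lengths.
mutType <- ifelse(ref_len == 1 & alt_len == 1, "SNP",
           ifelse(ref_len > alt_len, "DEL",
           ifelse(ref_len < alt_len, "INS", "Others")))

# Count the variants of each type.
type_counts <- table(mutType)
print(type_counts)
```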

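  10. Evaluate nucleotide substitutions: annotate each variant with its substitution (for example G>A) and, for SNPs, classify it as a transition (Ti) or transversion (Tv).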
##              variant   chr   start     end refBase altBase refCount altCount
## 1 chr21:9412358_TA/T chr21 9412358 9412359      TA       T        6        4
## 2  chr21:9413584_G/A chr21 9413584 9413584       G       A        3        8
##   genoType gtQuality mutType nuSub TiTv
## 1      0/1        81     DEL  TA>T <NA>
## 2      0/1        62     SNP   G>A   Ti
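
The `nuSub` and `TiTv` columns can be derived as in the base R sketch below, shown on toy alleles. Transitions are the purine-to-purine (A<->G) and pyrimidine-to-pyrimidine (C<->T) changes; every other single-base substitution is a transversion. Non-SNP variants receive `NA`, mirroring the DEL row above. The variable names follow the column names in the output, but the data are made up.

```r
# Toy alleles and variant types (hypothetical data).
refBase <- c("TA", "G", "C", "A", "T")
altBase <- c("T", "A", "G", "G", "C")
mutType <- c("DEL", "SNP", "SNP", "SNP", "SNP")

# Substitution label, e.g. "G>A".
nuSub <- paste0(refBase, ">", altBase)

# The four possible transitions; all other SNP changes are transversions.
transitions <- c("A>G", "G>A", "C>T", "T>C")

TiTv <- ifelse(mutType != "SNP", NA_character_,
               ifelse(nuSub %in% transitions, "Ti", "Tv"))
print(data.frame(nuSub, mutType, TiTv))

# Ti/Tv ratio over the SNPs only.
titv_ratio <- sum(TiTv == "Ti", na.rm = TRUE) / sum(TiTv == "Tv", na.rm = TRUE)
print(titv_ratio)
```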