RU_GenomicVariant_session1.utf8

class: center, middle, inverse, title-slide

# Genomic Variants ~ Session 1
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/RU_GenomicVariants/" class="uri">http://rockefelleruniversity.github.io/RU_GenomicVariants/</a>

---

## About this session
Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in [previous section](http://rockefelleruniversity.github.io/Genomic_Data//presentations/slides/GenomicsData.html#35). In this session, we want to demonstrate how to manipulate VCF file with [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html).

We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session:
- Raw data from SRA: [PRJNA185252](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA185252)
- Reference genome: [BSgenome.Hsapiens.UCSC.hg19](https://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html)
- Variant calling procedure is based on [GATK Best Practice](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-)
  + Align to reference genome: [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/bwa.shtml), bwa-mem
  + Sort and remove duplicates: [Picard-tools](http://broadinstitute.github.io/picard/command-line-overview.html), [SortSam](https://broadinstitute.github.io/picard/command-line-overview.html#SortSam)/[MarkDuplicates](https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates)
  + Recalibrate Base Quality Scores: [GATK](https://gatk.broadinstitute.org/hc/en-us/articles/360036433132--Tool-Documentation-Index), [BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036434532-BaseRecalibrator)/[ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360036433652-ApplyBQSR)
  + Variant Calling: [GATK](https://gatk.broadinstitute.org/hc/en-us/articles/360036433132--Tool-Documentation-Index), [HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360036803991-HaplotypeCaller)/[GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360036806011-GenotypeGVCFs)
- Demo data: SAMN01882168 ("data/SAMN01882168_filt.vcf.gz")

---
## Outline of this session
- Load VCF file into a vcf object
- Survey and retrieve information from each field
  + META field: general information of the VCF file
  + Fixed field: variants, usually in VRange format
  + Genotype field: information for each sample
  + Integrate information into single table
- Differentiate variant types
- Retrieve nucleotide substitution pattern
- Trinucleotide motif analysis

---
class: inverse, center, middle

# VCF files

---

## [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html)
<img src="imgs/vcfMan_fig2.png" width="75%" style="display: block; margin: auto;" />

---
## Load VCF file

```r
library(VariantAnnotation)
vcf <- readVcf("data/SAMN01882168_filt.vcf.gz", "hg19")
vcf
```

```
## class: CollapsedVCF 
## dim: 72244 1 
## rowRanges(vcf):
##   GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER
## info(vcf):
##   DataFrame with 19 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ...
## info(header(vcf)):
##                    Number Type    Description                                  
##    AC              A      Integer Allele count in genotypes, for each ALT al...
##    AF              A      Float   Allele Frequency, for each ALT allele, in ...
##    AN              1      Integer Total number of alleles in called genotypes  
##    BaseQRankSum    1      Float   Z-score from Wilcoxon rank sum test of Alt...
##    ClippingRankSum 1      Float   Z-score From Wilcoxon rank sum test of Alt...
##    DP              1      Integer Approximate read depth; some reads may hav...
##    DS              0      Flag    Were any of the samples downsampled?         
##    END             1      Integer Stop position of the interval                
##    ExcessHet       1      Float   Phred-scaled p-value for exact test of exc...
##    FS              1      Float   Phred-scaled p-value using Fisher's exact ...
##    InbreedingCoeff 1      Float   Inbreeding coefficient as estimated from t...
##    MLEAC           A      Integer Maximum likelihood expectation (MLE) for t...
##    MLEAF           A      Float   Maximum likelihood expectation (MLE) for t...
##    MQ              1      Float   RMS Mapping Quality                          
##    MQRankSum       1      Float   Z-score From Wilcoxon rank sum test of Alt...
##    QD              1      Float   Variant Confidence/Quality by Depth          
##    RAW_MQ          1      Float   Raw data for RMS Mapping Quality             
##    ReadPosRankSum  1      Float   Z-score from Wilcoxon rank sum test of Alt...
##    SOR             1      Float   Symmetric Odds Ratio of 2x2 contingency ta...
## geno(vcf):
##   List of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB
## geno(header(vcf)):
##           Number Type    Description                                           
##    GT     1      String  Genotype                                              
##    AD     R      Integer Allelic depths for the ref and alt alleles in the o...
##    DP     1      Integer Approximate read depth (reads with MQ=255 or with b...
##    GQ     1      Integer Genotype Quality                                      
##    MIN_DP 1      Integer Minimum DP observed within the GVCF block             
##    PGT    1      String  Physical phasing haplotype information, describing ...
##    PID    1      String  Physical phasing ID information, where each unique ...
##    PL     G      Integer Normalized, Phred-scaled likelihoods for genotypes ...
##    RGQ    1      Integer Unconditional reference genotype confidence, encode...
##    SB     4      Integer Per-sample component statistics which comprise the ...
```

---
## Overview
As with objects we have met in other courses, there are accessor functions to grab the contents of our VCF object.

```r
header(vcf)
```

```
## class: VCFHeader 
## samples(1): SAMN01882168
## meta(5): fileformat reference source GATKCommandLine contig
## fixed(2): FILTER ALT
## info(19): AC AF ... ReadPosRankSum SOR
## geno(10): GT AD ... RGQ SB
```
- META field: general information of the VCF file; ***meta***
- FIX field: variants, usually in VRange format; ***rowRange***, ***info***
- GENOTYPE field: information for each sample; ***geno***

---
## How many samples are in this VCF?

```r
sampleID <- samples(header(vcf))
sampleID
```

```
## [1] "SAMN01882168"
```

---
## What is in the META field?

```r
meta(header(vcf))
```

```
## DataFrameList of length 5
## names(5): fileformat reference source GATKCommandLine contig
```

---
## Retrieve information in the META field
Once we have the META extracted, we can use the dollar sign to extract specific fields.

```r
# File format
meta(header(vcf))$fileformat
```

```
## DataFrame with 1 row and 1 column
##                  Value
##            <character>
## fileformat     VCFv4.2
```

```r
# Source used for variant calling
meta(header(vcf))$source
```

```
## DataFrame with 1 row and 1 column
##                Value
##          <character>
## source GenotypeGVCFs
```

---
## Retrieve information in the META field
Once we have the META extracted, we can use the dollar sign to extract specific fields.

```r
meta(header(vcf))$contig
```

```
## DataFrame with 25 rows and 2 columns
##            length               assembly
##       <character>            <character>
## chr1    249250621 BSgenome.Hsapiens.UC..
## chr2    243199373 BSgenome.Hsapiens.UC..
## chr3    198022430 BSgenome.Hsapiens.UC..
## chr4    191154276 BSgenome.Hsapiens.UC..
## chr5    180915260 BSgenome.Hsapiens.UC..
## ...           ...                    ...
## chr21    48129895 BSgenome.Hsapiens.UC..
## chr22    51304566 BSgenome.Hsapiens.UC..
## chrX    155270560 BSgenome.Hsapiens.UC..
## chrY     59373566 BSgenome.Hsapiens.UC..
## chrM        16571 BSgenome.Hsapiens.UC..
```

---
## Variants information (VRange format)
The variant information is recorded in a VRange object. This is a form of [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#6) object which we regularly use and are integral to Bioconductor. It includes chromosome, start position, end potision, reference allele, and alternative allele.

```r
rd <- rowRanges(vcf)
rd[1:2]
```

```
## GRanges object with 2 ranges and 5 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815337_G/T     chr1    815337      * |           NA              G
##   chr1:815341_G/A     chr1    815341      * |           NA              G
##                                  ALT      QUAL      FILTER
##                   <DNAStringSetList> <numeric> <character>
##   chr1:815337_G/T                  T     32.77           .
##   chr1:815341_G/A                  A     32.77           .
##   -------
##   seqinfo: 25 sequences from hg19 genome
```
***It's not a strand-specific library. This is indicated by the "*", which means unstranded. ***

---
## Retrieving variation information
We can use similar accessors to those used for [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#16) to get position information. 
- chromosome: seqnames()
- start position: start()
- end position: end()

---
## Position of the variations

```r
as.vector(seqnames(rd)[1:2])  # Chromosome
```

```
## [1] "chr1" "chr1"
```

```r
start(rd)[1:2]  # Start position
```

```
## [1] 815337 815341
```

```r
end(rd)[1:2]  # End position
```

```
## [1] 815337 815341
```

---
## How to get the reference alleles?
- The *ref()* function can be used to get the reference allele for a variant. The result is a DNAStringSet.

```r
refBase <- ref(vcf)
refBase[1:2]
```

```
## DNAStringSet object of length 2:
##     width seq
## [1]     1 G
## [2]     1 G
```
- Use as.character to covert it to character

```r
refBase <- as.character(refBase)
refBase[1:2]
```

```
## [1] "G" "G"
```

---
## How to get alternative alleles?
The *alt()* function can be used to get the alternative allele for a variant. The result is a DNAStringSetList ***as alternative alleles could be multiple changes.***

```r
altBase <- alt(vcf)
alt(vcf)[1:2]
```

```
## DNAStringSetList of length 2
## [[1]] T
## [[2]] A
```

---
## How to get alternative alleles?
- As it is a list, getting a character back is more complicated. First we can extract vectors from list by using lapply(). Specifically at this point we are subsetting the data to only the first variant.

```r
# get the 1st vector of the list
altBase <- lapply(altBase, `[[`, 1)
altBase[1:2]
```

```
## [[1]]
## 1-letter DNAString object
## seq: T
## 
## [[2]]
## 1-letter DNAString object
## seq: A
```
- Convert DNAString object to character

```r
altBase <- unlist(lapply(altBase, as.character))
altBase[1:2]
```

```
## [1] "T" "A"
```

---
# INFO section from FIXED field
- **Integrated information from all the samples**
- Annotation information would be recorded in this section
- All information is stored as a *data frame*

```r
info(header(vcf))[1:2, ]
```

```
## DataFrame with 2 rows and 3 columns
##         Number        Type            Description
##    <character> <character>            <character>
## AC           A     Integer Allele count in geno..
## AF           A       Float Allele Frequency, fo..
```

---
## INFO section
By calling *info()* directly on the whole VCF object, we get information for every variant.

```r
info(vcf)[1:2, ]
```

```
## DataFrame with 2 rows and 19 columns
##                            AC            AF        AN BaseQRankSum
##                 <IntegerList> <NumericList> <integer>    <numeric>
## chr1:815337_G/T             1           0.5         2        0.084
## chr1:815341_G/A             1           0.5         2        2.740
##                 ClippingRankSum        DP        DS       END ExcessHet
##                       <numeric> <integer> <logical> <integer> <numeric>
## chr1:815337_G/T               0        11     FALSE        NA    3.0103
## chr1:815341_G/A               0        11     FALSE        NA    3.0103
##                        FS InbreedingCoeff         MLEAC         MLEAF        MQ
##                 <numeric>       <numeric> <IntegerList> <NumericList> <numeric>
## chr1:815337_G/T      3.09              NA             1           0.5     48.39
## chr1:815341_G/A      3.09              NA             1           0.5     48.39
##                 MQRankSum        QD    RAW_MQ ReadPosRankSum       SOR
##                 <numeric> <numeric> <numeric>      <numeric> <numeric>
## chr1:815337_G/T    -2.744      2.98        NA         -1.083     2.093
## chr1:815341_G/A    -2.744      2.98        NA         -1.200     2.093
```

---
# GENOTYPE field
- Separated by individual samples

```r
geno(header(vcf))[1:2, ]
```

```
## DataFrame with 2 rows and 3 columns
##         Number        Type            Description
##    <character> <character>            <character>
## GT           1      String               Genotype
## AD           R     Integer Allelic depths for t..
```

---
## Genotype information for each sample
By calling *geno()* directly on the whole VCF object, we get information for every variant.

```r
paste0("GT: ", geno(header(vcf))[1, 3])
```

```
## [1] "GT: Genotype"
```

```r
matGT <- geno(vcf)$GT
matGT[1:2, ]
```

```
## chr1:815337_G/T chr1:815341_G/A 
##           "0/1"           "0/1"
```

---
## GT Types

```r
tbl <- table(geno(vcf)$GT)
tbl_dat <- as.data.frame(tbl)
tbl
```

```
## 
##   0/1   1/1   1/2 
## 70215  1853   176
```
- 0/1: *heterozygous mutations, one allele is the same as reference sequence*
- 1/1: *homozygous mutations, both alleles are different from reference sequence*
- 1/2: *heterozygous mutations, both alleles are different from reference sequence*

---
## GT Types

```r
ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat='Identity')+
  labs(x="",y="Counts",fill="")+
  theme_classic()
```

---
## GT Types
<img src="RU_GenomicVariant_session1_files/figure-html/genoGT_varMan_disp2-1.png" style="display: block; margin: auto;" />

---
## Depth for each sample (DP)

```r
paste0("DP: ", geno(header(vcf))[3, 3])
```

```
## [1] "DP: Approximate read depth (reads with MQ=255 or with bad mates are filtered)"
```

```r
matDP <- geno(vcf)$DP
matDP[1:2, ]
```

```
## chr1:815337_G/T chr1:815341_G/A 
##              11              11
```

---
## DP Distribution

```r
summary(as.vector(matDP))
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00   10.00   14.00   24.41   21.00 4136.00
```

```r
ggplot(as.data.frame(matDP),aes(x=SAMN01882168))+geom_histogram()+
  labs(x="",y="Counts")+
  scale_x_log10()+
  theme_classic()
```

---
## DP Distribution

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

---
## Genotype calling quality (GQ)
- The confidence in genotype calling
- Expressed as [Phred quality score](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores)

```r
paste0("GQ: ", geno(header(vcf))[4, 3])
```

```
## [1] "GQ: Genotype Quality"
```

```r
matGQ <- geno(vcf)$GQ
matGQ[1:2, ]
```

```
## chr1:815337_G/T chr1:815341_G/A 
##              61              61
```

---
## GQ distribution

```r
summary(as.vector(matGQ))
```

```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   90.00   99.00   88.41   99.00   99.00
```

```r
ggplot(as.data.frame(matGQ),aes(x=SAMN01882168))+geom_histogram()+
  labs(x="",y="Counts")+
  scale_x_log10()+
  theme_classic()
```

---
## GQ Distribution

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

---
class: inverse, center, middle

# Gathering variant information

---

## Gathering information ~ GT 0/1 and 1/1
- select variants with GT 0/1 or 1/1

```r
var_1 <- rownames(geno(vcf)$GT)[
  geno(vcf)$GT=="0/1" | 
    geno(vcf)$GT=="1/1"]
```
- Extract variant information

```r
varTab1 <- data.frame(variant=names(rd)[names(rd) %in% var_1],
                      chr=as.vector(seqnames(rd)[names(rd) %in% var_1]),
                      start=start(rd)[names(rd) %in% var_1],
                      end=end(rd)[names(rd) %in% var_1],
                      stringsAsFactors = FALSE)
```

---
## Gathering information ~ GT 0/1 and 1/1
- Ref alleles are retrieved from ref(vcf)

```r
ref_base <- ref(vcf)[rownames(vcf) %in% var_1]
ref_base[1:2]
```

```
## DNAStringSet object of length 2:
##     width seq
## [1]     1 G
## [2]     1 G
```

```r
varTab1$refBase <- as.character(ref_base)
```

---
## Gathering information ~ GT 0/1 and 1/1
- Alt alleles are retrieved from alt(vcf)

```r
alt_base <- lapply(alt(vcf)[rownames(vcf) %in% var_1],`[[`,1)
alt_base[1]
```

```
## [[1]]
## 1-letter DNAString object
## seq: T
```

```r
alt_base <- lapply(alt_base,as.character)
alt_base[1]
```

```
## [[1]]
## [1] "T"
```

```r
varTab1$altBase <- unlist(alt_base)
```

---
## Gathering information ~ GT 0/1 and 1/1
- Extract counts from AD

```r
adCount <- geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_1]
adCount[1]
```

```
## [[1]]
## [1] 8 3
```

```r
varTab1$refCount <- unlist(lapply(adCount,`[[`,1))
varTab1$altCount <- unlist(lapply(adCount,`[[`,2))
```

---
## Gathering information ~ GT 0/1 and 1/1
- genoType: genotype (GT)
- gtQuality: genotyping quality (GQ)

```r
varTab1$genoType <- geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_1]
varTab1$gtQuality <- geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_1]
```

---
## Gathering information ~ genotype 1/2
- Heterozygous mutations, both alleles are different from the reference sequence.
- Both ref allele and alt allele have to be retrieved from alt(vcf)

```r
var_2 <- rownames(geno(vcf)$GT)[geno(vcf)$GT=="1/2"]
varTab2 <- data.frame(variant=names(rd)[names(rd) %in% var_2],
                      chr=as.vector(seqnames(rd)[names(rd) %in% var_2]),
                      start=start(rd)[names(rd) %in% var_2],
                      end=end(rd)[names(rd) %in% var_2],
                      refBase=unlist(lapply(lapply(
                        alt(vcf)[rownames(vcf) %in% var_2],`[[`,1),as.character)),
                      altBase=unlist(lapply(lapply(
                        alt(vcf)[rownames(vcf) %in% var_2],`[[`,2),as.character)),
                      refCount=unlist(lapply(
                        geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,2)),
                      altCount=unlist(
                        lapply(geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,3)),
                      genoType=geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_2],
                      gtQuality=geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_2],
                      stringsAsFactors = FALSE)
```

---
## Merged datatable

```r
varTab <- rbind(varTab1, varTab2)
varTab[1:2, ]
```

```
##           variant  chr  start    end refBase altBase refCount altCount genoType
## 1 chr1:815337_G/T chr1 815337 815337       G       T        8        3      0/1
## 2 chr1:815341_G/A chr1 815341 815341       G       A        8        3      0/1
##   gtQuality
## 1        61
## 2        61
```

---
## Differentiate variant types
- SNP: single-nucleotide substitutions
- DEL: deletions
- INS: insertions
- Others: complicated variations, such as Ins/Del or Inversion

---
## Variant types

```r
# differentiate SNP/INS/DEL/Others
for (k in 1:length(varTab$variant)) {
    if (width(varTab$refBase[k]) < width(varTab$altBase[k])) {
        varTab$mutType[k] <- "INS"
    } else if (width(varTab$refBase[k]) > width(varTab$altBase[k])) {
        varTab$mutType[k] <- "DEL"
    } else if (width(varTab$refBase[k]) == 1 & width(varTab$altBase[k]) == 1) {
        varTab$mutType[k] <- "SNP"
    } else {
        varTab$mutType[k] <- "Others"
    }
}
# 
tbl <- table(varTab$mutType)
tbl_dat <- as.data.frame(tbl)
tbl
```

```
## 
##    DEL    INS Others    SNP 
##   4351   3174      5  64714
```

---
## Variant types

```r
ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+
  labs(x="",y="Mutations",fill="")+
  theme_classic()
```

---
## Variant types
<img src="RU_GenomicVariant_session1_files/figure-html/mutType_pres2-1.png" style="display: block; margin: auto;" />

---
## Nucleotide substitution pattern
- only SNPs
- Transition (Ti): purine-to-purine, pyrimidine-to-pyrimidine
- Transversion (Tv): purine-to-pyrimidine, pyrimidine-to-purine

---
## Nucleotide substitution

```r
# Transition (Ti)
ti <- c("A>G","G>A","C>T","T>C")
# Transveersion (Tv)
tv <- c("A>T","A>C","G>T","G>C","C>A","C>G","T>A","T>G")
varTab$nuSub <- paste0(varTab$refBase,">",varTab$altBase)
varTab$TiTv[varTab$nuSub %in% ti] <- "Ti"
varTab$TiTv[varTab$nuSub %in% tv] <- "Tv"
varTab[1:2,]
```

---
## Nucleotide substitution

```r
varX <- varTab[varTab$mutType == "SNP", ]
tbl <- table(varX$nuSub)
tbl_dat <- as.data.frame(tbl)
tbl
```

```
## 
##   A>C   A>G   A>T   C>A   C>G   C>T   G>A   G>C   G>T   T>A   T>C   T>G 
##  3062  9380  3083  3448  2830 10393 10579  2793  3515  3143  9424  3064
```

---
## Nucleotide substitution

```r
ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+
  labs(x="",y="Mutations",fill="")+
  theme(legend.position = "none")
```

---
## Nucleotide substitution
<img src="RU_GenomicVariant_session1_files/figure-html/TiTv_pres_nuSubPres2-1.png" style="display: block; margin: auto;" />

---
## Ti/Tv

```r
tbl <- table(varX$TiTv)
tbl_dat <- as.data.frame(tbl)
tbl
```

```
## 
##    Ti    Tv 
## 39776 24938
```

---
## Ti/Tv

```r
ggplot(as.data.frame(table(varX$TiTv)),aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat = 'identity')+labs(x="",y="Mutations",fill="")+
  theme(legend.position = "none")
```

---
## Ti/Tv
<img src="RU_GenomicVariant_session1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" />

---
class: inverse, center, middle

# Trinucleotide motif analysis

---

## Trinucleotide motif analysis
Other constraints may bias nucleotide transitions. One of these is the trinucleotide motif around the variant (from -1 to +1). This pattern could reflect different causes for the mutations.

One such example is APOBEC, a deaminase leading to C>T substitutions in the motif **T[C>T]W**. So, the proportion of **T[C>T]W** in overall C>T substitutions could reflect to APOBEC activity. More applications of the trinucleotide motifs will be discussed in [session3](https://rockefelleruniversity.github.io/RU_GenomicVariants/presentations/slides/RU_GenomicVariant_session3.html#19).

---
## Extract C>T substituion

```r
library(BSgenome.Hsapiens.UCSC.hg19)
library(GenomicFeatures)
library(stringr)
# 
rd_idx <- str_split(names(rd), "_", simplify = T)
rd_idx[1:2, ]
```

```
##      [,1]          [,2] 
## [1,] "chr1:815337" "G/T"
## [2,] "chr1:815341" "G/A"
```

```r
rd_sub <- rd[rd_idx[, 2] == "C/T"]
rd_sub[1:2, ]
```

```
## GRanges object with 2 ranges and 5 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815520_C/T     chr1    815520      * |           NA              C
##   chr1:815573_C/T     chr1    815573      * |           NA              C
##                                  ALT      QUAL      FILTER
##                   <DNAStringSetList> <numeric> <character>
##   chr1:815520_C/T                  T    916.77           .
##   chr1:815573_C/T                  T     70.77           .
##   -------
##   seqinfo: 25 sequences from hg19 genome
```

---
## Extract sequences beneath the mutation from -1 to +1
We can use the *getSeq()* function to get the sequence underlying these variants, as we have done with [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#63).

```r
rd_sub$triNu <- getSeq(Hsapiens,
                 seqnames(rd_sub),
                 start=start(rd_sub)-1,
                 end=end(rd_sub)+1)
rd_sub[1:2]
```

```
## GRanges object with 2 ranges and 6 metadata columns:
##                   seqnames    ranges strand | paramRangeID            REF
##                      <Rle> <IRanges>  <Rle> |     <factor> <DNAStringSet>
##   chr1:815520_C/T     chr1    815520      * |           NA              C
##   chr1:815573_C/T     chr1    815573      * |           NA              C
##                                  ALT      QUAL      FILTER          triNu
##                   <DNAStringSetList> <numeric> <character> <DNAStringSet>
##   chr1:815520_C/T                  T    916.77           .            CCT
##   chr1:815573_C/T                  T     70.77           .            ACA
##   -------
##   seqinfo: 25 sequences from hg19 genome
```

---
## Trinucleotide pattern

```r
tbl <- table(rd_sub$triNu)
tbl_dat <- as.data.frame(tbl)
tbl
```

```
## 
##  ACA  ACC  ACG  ACT  CCA  CCC  CCG  CCT  GCA  GCC  GCG  GCT  TCA  TCC  TCG  TCT 
## 1176  535 1096  879  619  429  544  641  554  387  511  556  633  535  584  706
```

---
## Trinucleotide pattern

```r
ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+
  geom_bar(stat='identity')+
  labs(x="",y="Variants",fill="")+
  theme(legend.position = "none")
```

---
## Trinucleotide pattern
<img src="RU_GenomicVariant_session1_files/figure-html/motif_seqPat_advan3-1.png" style="display: block; margin: auto;" />

---
## APOBEC activation in this sample?
- APOBEC target: TCW (TCA/TCT)

```r
# TCW: TCA/TCT
tbl_dat$APOBEC_target <- tbl_dat$Var1 %in% c("TCA", "TCT")
tbl_dat[1:2]
```

```
##    Var1 Freq
## 1   ACA 1176
## 2   ACC  535
## 3   ACG 1096
## 4   ACT  879
## 5   CCA  619
## 6   CCC  429
## 7   CCG  544
## 8   CCT  641
## 9   GCA  554
## 10  GCC  387
## 11  GCG  511
## 12  GCT  556
## 13  TCA  633
## 14  TCC  535
## 15  TCG  584
## 16  TCT  706
```

```r
# collapse by APOBEC_target
apobec_dat <- aggregate(Freq ~ APOBEC_target, tbl_dat, FUN = sum, na.rm = TRUE)
apobec_dat
```

```
##   APOBEC_target Freq
## 1         FALSE 9046
## 2          TRUE 1339
```

---
## APOBEC activation in this sample?

```r
ggplot(apobec_dat,aes(x=APOBEC_target,y=Freq,fill=APOBEC_target))+
  geom_bar(stat='identity')+
  labs(x="",y="Variants",fill="")+
  theme(legend.position = "none")
```

---
## APOBEC activation in this sample?
<img src="RU_GenomicVariant_session1_files/figure-html/motif_ApoTar_advan3-1.png" style="display: block; margin: auto;" />
***Probably not APOBEC-enriched***

---
## Exercise Time
- [Exercise](../../exercises/exercises/GenomicVariant_session1_exercise.html)
- [Answer](../../exercises/answers/GenomicVariant_session1_answers.html)