class: center, middle, inverse, title-slide # Genomic Variants ~ Session 1
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/RU_GenomicVariants/
--- ## About this session Variant Call Format (VCF) is widely used file format to record genomic variants. We introduced the VCF file format in [previous section](http://rockefelleruniversity.github.io/Genomic_Data//presentations/slides/GenomicsData.html#35). In this session, we want to demonstrate how to manipulate VCF file with [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html). We will be using a study on bladder cancer from Wei Wangs group at BGI-Shenzhen as our dataset in this session: - Raw data from SRA: [PRJNA185252](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA185252) - Reference genome: [BSgenome.Hsapiens.UCSC.hg19](https://bioconductor.org/packages/release/data/annotation/html/BSgenome.Hsapiens.UCSC.hg19.html) - Variant calling procedure is based on [GATK Best Practice](https://gatk.broadinstitute.org/hc/en-us/articles/360035535932-Germline-short-variant-discovery-SNPs-Indels-) + Align to reference genome: [Burrows-Wheeler Aligner](http://bio-bwa.sourceforge.net/bwa.shtml), bwa-mem + Sort and remove duplicates: [Picard-tools](http://broadinstitute.github.io/picard/command-line-overview.html), [SortSam](https://broadinstitute.github.io/picard/command-line-overview.html#SortSam)/[MarkDuplicates](https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates) + Recalibrate Base Quality Scores: [GATK](https://gatk.broadinstitute.org/hc/en-us/articles/360036433132--Tool-Documentation-Index), [BaseRecalibrator](https://gatk.broadinstitute.org/hc/en-us/articles/360036434532-BaseRecalibrator)/[ApplyBQSR](https://gatk.broadinstitute.org/hc/en-us/articles/360036433652-ApplyBQSR) + Variant Calling: [GATK](https://gatk.broadinstitute.org/hc/en-us/articles/360036433132--Tool-Documentation-Index), [HaplotypeCaller](https://gatk.broadinstitute.org/hc/en-us/articles/360036803991-HaplotypeCaller)/[GenotypeGVCFs](https://gatk.broadinstitute.org/hc/en-us/articles/360036806011-GenotypeGVCFs) - Demo data: SAMN01882168 ("data/SAMN01882168_filt.vcf.gz") --- ## Outline of this session - Load VCF file into a vcf object - Survey and retrieve information from each field + META field: general information of the VCF file + Fixed field: variants, usually in VRange format + Genotype field: information for each sample + Integrate information into single table - Differentiate variant types - Retrieve nucleotide substitution pattern - Trinucleotide motif analysis --- class: inverse, center, middle # VCF files <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## [VariantAnnotation](https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html) <img src="imgs/vcfMan_fig2.png" width="75%" style="display: block; margin: auto;" /> --- ## Load VCF file ```r library(VariantAnnotation) vcf <- readVcf("data/SAMN01882168_filt.vcf.gz", "hg19") vcf ``` ``` ## class: CollapsedVCF ## dim: 72244 1 ## rowRanges(vcf): ## GRanges with 5 metadata columns: paramRangeID, REF, ALT, QUAL, FILTER ## info(vcf): ## DataFrame with 19 columns: AC, AF, AN, BaseQRankSum, ClippingRankSum, DP, ... ## info(header(vcf)): ## Number Type Description ## AC A Integer Allele count in genotypes, for each ALT al... ## AF A Float Allele Frequency, for each ALT allele, in ... ## AN 1 Integer Total number of alleles in called genotypes ## BaseQRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt... ## ClippingRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt... ## DP 1 Integer Approximate read depth; some reads may hav... ## DS 0 Flag Were any of the samples downsampled? ## END 1 Integer Stop position of the interval ## ExcessHet 1 Float Phred-scaled p-value for exact test of exc... ## FS 1 Float Phred-scaled p-value using Fisher's exact ... ## InbreedingCoeff 1 Float Inbreeding coefficient as estimated from t... ## MLEAC A Integer Maximum likelihood expectation (MLE) for t... ## MLEAF A Float Maximum likelihood expectation (MLE) for t... ## MQ 1 Float RMS Mapping Quality ## MQRankSum 1 Float Z-score From Wilcoxon rank sum test of Alt... ## QD 1 Float Variant Confidence/Quality by Depth ## RAW_MQ 1 Float Raw data for RMS Mapping Quality ## ReadPosRankSum 1 Float Z-score from Wilcoxon rank sum test of Alt... ## SOR 1 Float Symmetric Odds Ratio of 2x2 contingency ta... ## geno(vcf): ## List of length 10: GT, AD, DP, GQ, MIN_DP, PGT, PID, PL, RGQ, SB ## geno(header(vcf)): ## Number Type Description ## GT 1 String Genotype ## AD R Integer Allelic depths for the ref and alt alleles in the o... ## DP 1 Integer Approximate read depth (reads with MQ=255 or with b... ## GQ 1 Integer Genotype Quality ## MIN_DP 1 Integer Minimum DP observed within the GVCF block ## PGT 1 String Physical phasing haplotype information, describing ... ## PID 1 String Physical phasing ID information, where each unique ... ## PL G Integer Normalized, Phred-scaled likelihoods for genotypes ... ## RGQ 1 Integer Unconditional reference genotype confidence, encode... ## SB 4 Integer Per-sample component statistics which comprise the ... ``` --- ## Overview As with objects we have met in other courses, there are accessor functions to grab the contents of our VCF object. ```r header(vcf) ``` ``` ## class: VCFHeader ## samples(1): SAMN01882168 ## meta(5): fileformat reference source GATKCommandLine contig ## fixed(2): FILTER ALT ## info(19): AC AF ... ReadPosRankSum SOR ## geno(10): GT AD ... RGQ SB ``` - META field: general information of the VCF file; ***meta*** - FIX field: variants, usually in VRange format; ***rowRange***, ***info*** - GENOTYPE field: information for each sample; ***geno*** --- ## How many samples are in this VCF? ```r sampleID <- samples(header(vcf)) sampleID ``` ``` ## [1] "SAMN01882168" ``` --- ## What is in the META field? ```r meta(header(vcf)) ``` ``` ## DataFrameList of length 5 ## names(5): fileformat reference source GATKCommandLine contig ``` --- ## Retrieve information in the META field Once we have the META extracted, we can use the dollar sign to extract specific fields. ```r # File format meta(header(vcf))$fileformat ``` ``` ## DataFrame with 1 row and 1 column ## Value ## <character> ## fileformat VCFv4.2 ``` ```r # Source used for variant calling meta(header(vcf))$source ``` ``` ## DataFrame with 1 row and 1 column ## Value ## <character> ## source GenotypeGVCFs ``` --- ## Retrieve information in the META field Once we have the META extracted, we can use the dollar sign to extract specific fields. ```r meta(header(vcf))$contig ``` ``` ## DataFrame with 25 rows and 2 columns ## length assembly ## <character> <character> ## chr1 249250621 BSgenome.Hsapiens.UC.. ## chr2 243199373 BSgenome.Hsapiens.UC.. ## chr3 198022430 BSgenome.Hsapiens.UC.. ## chr4 191154276 BSgenome.Hsapiens.UC.. ## chr5 180915260 BSgenome.Hsapiens.UC.. ## ... ... ... ## chr21 48129895 BSgenome.Hsapiens.UC.. ## chr22 51304566 BSgenome.Hsapiens.UC.. ## chrX 155270560 BSgenome.Hsapiens.UC.. ## chrY 59373566 BSgenome.Hsapiens.UC.. ## chrM 16571 BSgenome.Hsapiens.UC.. ``` --- ## Variants information (VRange format) The variant information is recorded in a VRange object. This is a form of [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#6) object which we regularly use and are integral to Bioconductor. It includes chromosome, start position, end potision, reference allele, and alternative allele. ```r rd <- rowRanges(vcf) rd[1:2] ``` ``` ## GRanges object with 2 ranges and 5 metadata columns: ## seqnames ranges strand | paramRangeID REF ## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> ## chr1:815337_G/T chr1 815337 * | NA G ## chr1:815341_G/A chr1 815341 * | NA G ## ALT QUAL FILTER ## <DNAStringSetList> <numeric> <character> ## chr1:815337_G/T T 32.77 . ## chr1:815341_G/A A 32.77 . ## ------- ## seqinfo: 25 sequences from hg19 genome ``` ***It's not a strand-specific library. This is indicated by the "*", which means unstranded. *** --- ## Retrieving variation information We can use similar accessors to those used for [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#16) to get position information. - chromosome: seqnames() - start position: start() - end position: end() --- ## Position of the variations ```r as.vector(seqnames(rd)[1:2]) # Chromosome ``` ``` ## [1] "chr1" "chr1" ``` ```r start(rd)[1:2] # Start position ``` ``` ## [1] 815337 815341 ``` ```r end(rd)[1:2] # End position ``` ``` ## [1] 815337 815341 ``` --- ## How to get the reference alleles? - The *ref()* function can be used to get the reference allele for a variant. The result is a DNAStringSet. ```r refBase <- ref(vcf) refBase[1:2] ``` ``` ## DNAStringSet object of length 2: ## width seq ## [1] 1 G ## [2] 1 G ``` - Use as.character to covert it to character ```r refBase <- as.character(refBase) refBase[1:2] ``` ``` ## [1] "G" "G" ``` --- ## How to get alternative alleles? The *alt()* function can be used to get the alternative allele for a variant. The result is a DNAStringSetList ***as alternative alleles could be multiple changes.*** ```r altBase <- alt(vcf) alt(vcf)[1:2] ``` ``` ## DNAStringSetList of length 2 ## [[1]] T ## [[2]] A ``` --- ## How to get alternative alleles? - As it is a list, getting a character back is more complicated. First we can extract vectors from list by using lapply(). Specifically at this point we are subsetting the data to only the first variant. ```r # get the 1st vector of the list altBase <- lapply(altBase, `[[`, 1) altBase[1:2] ``` ``` ## [[1]] ## 1-letter DNAString object ## seq: T ## ## [[2]] ## 1-letter DNAString object ## seq: A ``` - Convert DNAString object to character ```r altBase <- unlist(lapply(altBase, as.character)) altBase[1:2] ``` ``` ## [1] "T" "A" ``` --- # INFO section from FIXED field - **Integrated information from all the samples** - Annotation information would be recorded in this section - All information is stored as a *data frame* ```r info(header(vcf))[1:2, ] ``` ``` ## DataFrame with 2 rows and 3 columns ## Number Type Description ## <character> <character> <character> ## AC A Integer Allele count in geno.. ## AF A Float Allele Frequency, fo.. ``` --- ## INFO section By calling *info()* directly on the whole VCF object, we get information for every variant. ```r info(vcf)[1:2, ] ``` ``` ## DataFrame with 2 rows and 19 columns ## AC AF AN BaseQRankSum ## <IntegerList> <NumericList> <integer> <numeric> ## chr1:815337_G/T 1 0.5 2 0.084 ## chr1:815341_G/A 1 0.5 2 2.740 ## ClippingRankSum DP DS END ExcessHet ## <numeric> <integer> <logical> <integer> <numeric> ## chr1:815337_G/T 0 11 FALSE NA 3.0103 ## chr1:815341_G/A 0 11 FALSE NA 3.0103 ## FS InbreedingCoeff MLEAC MLEAF MQ ## <numeric> <numeric> <IntegerList> <NumericList> <numeric> ## chr1:815337_G/T 3.09 NA 1 0.5 48.39 ## chr1:815341_G/A 3.09 NA 1 0.5 48.39 ## MQRankSum QD RAW_MQ ReadPosRankSum SOR ## <numeric> <numeric> <numeric> <numeric> <numeric> ## chr1:815337_G/T -2.744 2.98 NA -1.083 2.093 ## chr1:815341_G/A -2.744 2.98 NA -1.200 2.093 ``` --- # GENOTYPE field - Separated by individual samples ```r geno(header(vcf))[1:2, ] ``` ``` ## DataFrame with 2 rows and 3 columns ## Number Type Description ## <character> <character> <character> ## GT 1 String Genotype ## AD R Integer Allelic depths for t.. ``` --- ## Genotype information for each sample By calling *geno()* directly on the whole VCF object, we get information for every variant. ```r paste0("GT: ", geno(header(vcf))[1, 3]) ``` ``` ## [1] "GT: Genotype" ``` ```r matGT <- geno(vcf)$GT matGT[1:2, ] ``` ``` ## chr1:815337_G/T chr1:815341_G/A ## "0/1" "0/1" ``` --- ## GT Types ```r tbl <- table(geno(vcf)$GT) tbl_dat <- as.data.frame(tbl) tbl ``` ``` ## ## 0/1 1/1 1/2 ## 70215 1853 176 ``` - 0/1: *heterozygous mutations, one allele is the same as reference sequence* - 1/1: *homozygous mutations, both alleles are different from reference sequence* - 1/2: *heterozygous mutations, both alleles are different from reference sequence* --- ## GT Types ```r ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+ geom_bar(stat='Identity')+ labs(x="",y="Counts",fill="")+ theme_classic() ``` --- ## GT Types <img src="RU_GenomicVariant_session1_files/figure-html/genoGT_varMan_disp2-1.png" style="display: block; margin: auto;" /> --- ## Depth for each sample (DP) ```r paste0("DP: ", geno(header(vcf))[3, 3]) ``` ``` ## [1] "DP: Approximate read depth (reads with MQ=255 or with bad mates are filtered)" ``` ```r matDP <- geno(vcf)$DP matDP[1:2, ] ``` ``` ## chr1:815337_G/T chr1:815341_G/A ## 11 11 ``` --- ## DP Distribution ```r summary(as.vector(matDP)) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 2.00 10.00 14.00 24.41 21.00 4136.00 ``` ```r ggplot(as.data.frame(matDP),aes(x=SAMN01882168))+geom_histogram()+ labs(x="",y="Counts")+ scale_x_log10()+ theme_classic() ``` --- ## DP Distribution ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="RU_GenomicVariant_session1_files/figure-html/genoDP_varMan_distPres2-1.png" style="display: block; margin: auto;" /> --- ## Genotype calling quality (GQ) - The confidence in genotype calling - Expressed as [Phred quality score](https://gatk.broadinstitute.org/hc/en-us/articles/360035531872-Phred-scaled-quality-scores) ```r paste0("GQ: ", geno(header(vcf))[4, 3]) ``` ``` ## [1] "GQ: Genotype Quality" ``` ```r matGQ <- geno(vcf)$GQ matGQ[1:2, ] ``` ``` ## chr1:815337_G/T chr1:815341_G/A ## 61 61 ``` --- ## GQ distribution ```r summary(as.vector(matGQ)) ``` ``` ## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 0.00 90.00 99.00 88.41 99.00 99.00 ``` ```r ggplot(as.data.frame(matGQ),aes(x=SAMN01882168))+geom_histogram()+ labs(x="",y="Counts")+ scale_x_log10()+ theme_classic() ``` --- ## GQ Distribution ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <img src="RU_GenomicVariant_session1_files/figure-html/genoGQ_varMan_distPres2-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Gathering variant information <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Gathering information ~ GT 0/1 and 1/1 - select variants with GT 0/1 or 1/1 ```r var_1 <- rownames(geno(vcf)$GT)[ geno(vcf)$GT=="0/1" | geno(vcf)$GT=="1/1"] ``` - Extract variant information ```r varTab1 <- data.frame(variant=names(rd)[names(rd) %in% var_1], chr=as.vector(seqnames(rd)[names(rd) %in% var_1]), start=start(rd)[names(rd) %in% var_1], end=end(rd)[names(rd) %in% var_1], stringsAsFactors = FALSE) ``` --- ## Gathering information ~ GT 0/1 and 1/1 - Ref alleles are retrieved from ref(vcf) ```r ref_base <- ref(vcf)[rownames(vcf) %in% var_1] ref_base[1:2] ``` ``` ## DNAStringSet object of length 2: ## width seq ## [1] 1 G ## [2] 1 G ``` ```r varTab1$refBase <- as.character(ref_base) ``` --- ## Gathering information ~ GT 0/1 and 1/1 - Alt alleles are retrieved from alt(vcf) ```r alt_base <- lapply(alt(vcf)[rownames(vcf) %in% var_1],`[[`,1) alt_base[1] ``` ``` ## [[1]] ## 1-letter DNAString object ## seq: T ``` ```r alt_base <- lapply(alt_base,as.character) alt_base[1] ``` ``` ## [[1]] ## [1] "T" ``` ```r varTab1$altBase <- unlist(alt_base) ``` --- ## Gathering information ~ GT 0/1 and 1/1 - Extract counts from AD ```r adCount <- geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_1] adCount[1] ``` ``` ## [[1]] ## [1] 8 3 ``` ```r varTab1$refCount <- unlist(lapply(adCount,`[[`,1)) varTab1$altCount <- unlist(lapply(adCount,`[[`,2)) ``` --- ## Gathering information ~ GT 0/1 and 1/1 - genoType: genotype (GT) - gtQuality: genotyping quality (GQ) ```r varTab1$genoType <- geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_1] varTab1$gtQuality <- geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_1] ``` --- ## Gathering information ~ genotype 1/2 - Heterozygous mutations, both alleles are different from the reference sequence. - Both ref allele and alt allele have to be retrieved from alt(vcf) ```r var_2 <- rownames(geno(vcf)$GT)[geno(vcf)$GT=="1/2"] varTab2 <- data.frame(variant=names(rd)[names(rd) %in% var_2], chr=as.vector(seqnames(rd)[names(rd) %in% var_2]), start=start(rd)[names(rd) %in% var_2], end=end(rd)[names(rd) %in% var_2], refBase=unlist(lapply(lapply( alt(vcf)[rownames(vcf) %in% var_2],`[[`,1),as.character)), altBase=unlist(lapply(lapply( alt(vcf)[rownames(vcf) %in% var_2],`[[`,2),as.character)), refCount=unlist(lapply( geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,2)), altCount=unlist( lapply(geno(vcf)$AD[rownames(geno(vcf)$AD) %in% var_2],`[[`,3)), genoType=geno(vcf)$GT[rownames(geno(vcf)$GT) %in% var_2], gtQuality=geno(vcf)$GQ[rownames(geno(vcf)$GQ) %in% var_2], stringsAsFactors = FALSE) ``` --- ## Merged datatable ```r varTab <- rbind(varTab1, varTab2) varTab[1:2, ] ``` ``` ## variant chr start end refBase altBase refCount altCount genoType ## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1 ## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1 ## gtQuality ## 1 61 ## 2 61 ``` --- ## Differentiate variant types - SNP: single-nucleotide substitutions - DEL: deletions - INS: insertions - Others: complicated variations, such as Ins/Del or Inversion --- ## Variant types ```r # differentiate SNP/INS/DEL/Others for (k in 1:length(varTab$variant)) { if (width(varTab$refBase[k]) < width(varTab$altBase[k])) { varTab$mutType[k] <- "INS" } else if (width(varTab$refBase[k]) > width(varTab$altBase[k])) { varTab$mutType[k] <- "DEL" } else if (width(varTab$refBase[k]) == 1 & width(varTab$altBase[k]) == 1) { varTab$mutType[k] <- "SNP" } else { varTab$mutType[k] <- "Others" } } # tbl <- table(varTab$mutType) tbl_dat <- as.data.frame(tbl) tbl ``` ``` ## ## DEL INS Others SNP ## 4351 3174 5 64714 ``` --- ## Variant types ```r ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+ geom_bar(stat = 'identity')+ labs(x="",y="Mutations",fill="")+ theme_classic() ``` --- ## Variant types <img src="RU_GenomicVariant_session1_files/figure-html/mutType_pres2-1.png" style="display: block; margin: auto;" /> --- ## Nucleotide substitution pattern - only SNPs - Transition (Ti): purine-to-purine, pyrimidine-to-pyrimidine - Transversion (Tv): purine-to-pyrimidine, pyrimidine-to-purine --- ## Nucleotide substitution ```r # Transition (Ti) ti <- c("A>G","G>A","C>T","T>C") # Transveersion (Tv) tv <- c("A>T","A>C","G>T","G>C","C>A","C>G","T>A","T>G") varTab$nuSub <- paste0(varTab$refBase,">",varTab$altBase) varTab$TiTv[varTab$nuSub %in% ti] <- "Ti" varTab$TiTv[varTab$nuSub %in% tv] <- "Tv" varTab[1:2,] ``` ``` ## variant chr start end refBase altBase refCount altCount genoType ## 1 chr1:815337_G/T chr1 815337 815337 G T 8 3 0/1 ## 2 chr1:815341_G/A chr1 815341 815341 G A 8 3 0/1 ## gtQuality mutType nuSub TiTv ## 1 61 SNP G>T Tv ## 2 61 SNP G>A Ti ``` --- ## Nucleotide substitution ```r varX <- varTab[varTab$mutType == "SNP", ] tbl <- table(varX$nuSub) tbl_dat <- as.data.frame(tbl) tbl ``` ``` ## ## A>C A>G A>T C>A C>G C>T G>A G>C G>T T>A T>C T>G ## 3062 9380 3083 3448 2830 10393 10579 2793 3515 3143 9424 3064 ``` --- ## Nucleotide substitution ```r ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+ geom_bar(stat = 'identity')+ labs(x="",y="Mutations",fill="")+ theme(legend.position = "none") ``` --- ## Nucleotide substitution <img src="RU_GenomicVariant_session1_files/figure-html/TiTv_pres_nuSubPres2-1.png" style="display: block; margin: auto;" /> --- ## Ti/Tv ```r tbl <- table(varX$TiTv) tbl_dat <- as.data.frame(tbl) tbl ``` ``` ## ## Ti Tv ## 39776 24938 ``` --- ## Ti/Tv ```r ggplot(as.data.frame(table(varX$TiTv)),aes(x=Var1,y=Freq,fill=Var1))+ geom_bar(stat = 'identity')+labs(x="",y="Mutations",fill="")+ theme(legend.position = "none") ``` --- ## Ti/Tv <img src="RU_GenomicVariant_session1_files/figure-html/unnamed-chunk-4-1.png" style="display: block; margin: auto;" /> --- class: inverse, center, middle # Trinucleotide motif analysis <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Trinucleotide motif analysis Other constraints may bias nucleotide transitions. One of these is the trinucleotide motif around the variant (from -1 to +1). This pattern could reflect different causes for the mutations. One such example is APOBEC, a deaminase leading to C>T substitutions in the motif **T[C>T]W**. So, the proportion of **T[C>T]W** in overall C>T substitutions could reflect to APOBEC activity. More applications of the trinucleotide motifs will be discussed in [session3](https://rockefelleruniversity.github.io/RU_GenomicVariants/presentations/slides/RU_GenomicVariant_session3.html#19). --- ## Extract C>T substituion ```r library(BSgenome.Hsapiens.UCSC.hg19) library(GenomicFeatures) library(stringr) # rd_idx <- str_split(names(rd), "_", simplify = T) rd_idx[1:2, ] ``` ``` ## [,1] [,2] ## [1,] "chr1:815337" "G/T" ## [2,] "chr1:815341" "G/A" ``` ```r rd_sub <- rd[rd_idx[, 2] == "C/T"] rd_sub[1:2, ] ``` ``` ## GRanges object with 2 ranges and 5 metadata columns: ## seqnames ranges strand | paramRangeID REF ## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> ## chr1:815520_C/T chr1 815520 * | NA C ## chr1:815573_C/T chr1 815573 * | NA C ## ALT QUAL FILTER ## <DNAStringSetList> <numeric> <character> ## chr1:815520_C/T T 916.77 . ## chr1:815573_C/T T 70.77 . ## ------- ## seqinfo: 25 sequences from hg19 genome ``` --- ## Extract sequences beneath the mutation from -1 to +1 We can use the *getSeq()* function to get the sequence underlying these variants, as we have done with [GRanges](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#63). ```r rd_sub$triNu <- getSeq(Hsapiens, seqnames(rd_sub), start=start(rd_sub)-1, end=end(rd_sub)+1) rd_sub[1:2] ``` ``` ## GRanges object with 2 ranges and 6 metadata columns: ## seqnames ranges strand | paramRangeID REF ## <Rle> <IRanges> <Rle> | <factor> <DNAStringSet> ## chr1:815520_C/T chr1 815520 * | NA C ## chr1:815573_C/T chr1 815573 * | NA C ## ALT QUAL FILTER triNu ## <DNAStringSetList> <numeric> <character> <DNAStringSet> ## chr1:815520_C/T T 916.77 . CCT ## chr1:815573_C/T T 70.77 . ACA ## ------- ## seqinfo: 25 sequences from hg19 genome ``` --- ## Trinucleotide pattern ```r tbl <- table(rd_sub$triNu) tbl_dat <- as.data.frame(tbl) tbl ``` ``` ## ## ACA ACC ACG ACT CCA CCC CCG CCT GCA GCC GCG GCT TCA TCC TCG TCT ## 1176 535 1096 879 619 429 544 641 554 387 511 556 633 535 584 706 ``` --- ## Trinucleotide pattern ```r ggplot(tbl_dat,aes(x=Var1,y=Freq,fill=Var1))+ geom_bar(stat='identity')+ labs(x="",y="Variants",fill="")+ theme(legend.position = "none") ``` --- ## Trinucleotide pattern <img src="RU_GenomicVariant_session1_files/figure-html/motif_seqPat_advan3-1.png" style="display: block; margin: auto;" /> --- ## APOBEC activation in this sample? - APOBEC target: TCW (TCA/TCT) ```r # TCW: TCA/TCT tbl_dat$APOBEC_target <- tbl_dat$Var1 %in% c("TCA", "TCT") tbl_dat[1:2] ``` ``` ## Var1 Freq ## 1 ACA 1176 ## 2 ACC 535 ## 3 ACG 1096 ## 4 ACT 879 ## 5 CCA 619 ## 6 CCC 429 ## 7 CCG 544 ## 8 CCT 641 ## 9 GCA 554 ## 10 GCC 387 ## 11 GCG 511 ## 12 GCT 556 ## 13 TCA 633 ## 14 TCC 535 ## 15 TCG 584 ## 16 TCT 706 ``` ```r # collapse by APOBEC_target apobec_dat <- aggregate(Freq ~ APOBEC_target, tbl_dat, FUN = sum, na.rm = TRUE) apobec_dat ``` ``` ## APOBEC_target Freq ## 1 FALSE 9046 ## 2 TRUE 1339 ``` --- ## APOBEC activation in this sample? ```r ggplot(apobec_dat,aes(x=APOBEC_target,y=Freq,fill=APOBEC_target))+ geom_bar(stat='identity')+ labs(x="",y="Variants",fill="")+ theme(legend.position = "none") ``` --- ## APOBEC activation in this sample? <img src="RU_GenomicVariant_session1_files/figure-html/motif_ApoTar_advan3-1.png" style="display: block; margin: auto;" /> ***Probably not APOBEC-enriched*** --- ## Exercise Time - [Exercise](../../exercises/exercises/GenomicVariant_session1_exercise.html) - [Answer](../../exercises/answers/GenomicVariant_session1_answers.html)