class: center, middle, inverse, title-slide # Sequences In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- # Genomic Sequences As we have seen earlier, genomic sequences are often stored in FASTA format. <div align="center"> <img src="imgs/newFASTA.png" alt="igv" height="400" width="800"> </div> --- # Genomic Sequences in Bioconductor Genomic sequences can be handled in Bioconductor using the functions in the **Biostrings** package and full genome sequences for many model organisms are contained within the [**BSgenome** packages](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData). <div align="center"> <img src="imgs/bsgenomeExample_2020.png" alt="igv" height="415" width="400"> </div> --- # BSgenome packages To make use of a BSgenome package we must first install and load the library. The resulting BSgenome object (here BSgenome.Mmusculus.UCSC.mm10) contains the full genome sequence for the model organism. ```r library(BSgenome.Mmusculus.UCSC.mm10) class(BSgenome.Mmusculus.UCSC.mm10) ``` ``` ## [1] "BSgenome" ## attr(,"package") ## [1] "BSgenome" ``` --- # BSgenome.Mmusculus.UCSC.mm10 ```r BSgenome.Mmusculus.UCSC.mm10 ``` ``` ## Mouse genome: ## # organism: Mus musculus (Mouse) ## # genome: mm10 ## # provider: UCSC ## # release date: Dec. 2011 ## # 66 sequences: ## # chr1 chr2 chr3 ## # chr4 chr5 chr6 ## # chr7 chr8 chr9 ## # chr10 chr11 chr12 ## # chr13 chr14 chr15 ## # ... ... ... ## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379 ## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383 ## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389 ## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393 ## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304 ## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator ## # to access a given sequence) ``` --- # Data in BSgenome objects Simply by typing the object name we have retrieved some important information. Importantly it tells us how we can access information in the object. - **seqnames()** function to retrieve all contig names. - **$** and **[[]]** accessors to retrieve a given contig's sequence. --- # Contig Names We can retrieve contig names using **seqnames()** function. ```r contigNames <- seqnames(BSgenome.Mmusculus.UCSC.mm10) contigNames[1:22] ``` ``` ## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9" ## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18" ## [19] "chr19" "chrX" "chrY" "chrM" ``` --- # Contig Lengths We can also retrieve contig lengths using **seqlengths()** function. ```r contigLengths <- seqlengths(BSgenome.Mmusculus.UCSC.mm10) contigLengths[1:4] ``` ``` ## chr1 chr2 chr3 chr4 ## 195471971 182113224 160039680 156508116 ``` --- # Retrieving Sequence Information Now we know sequence names, we can retrieve sequence information using either the **$** or **[[]]** accessors. Here we use the data.frame and list accessor **$** with name of contig of interest ```r chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10$chr19 chr19_Seq ``` ``` ## 61431566-letter DNAString object ## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ``` --- # Retrieving Sequence Information now we use list accessor **[[]]** with name of contig of interest ```r chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10[["chr19"]] chr19_Seq ``` ``` ## 61431566-letter DNAString object ## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ``` --- # Biostrings and DNAString The returned sequence object is a **DNAString** object from the **Biostrings** package. DNAStrings are efficient objects for storing and accessing large sequences. ```r class(chr19_Seq) ``` ``` ## [1] "DNAString" ## attr(,"package") ## [1] "Biostrings" ``` **?DNAString** for full help on DNAString objects. --- # Subsetting a DNAString Set We can subset a DNAString object just like indexing a normal vector ```r chr19_Seq[1:10000000] ``` ``` ## 10000000-letter DNAString object ## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG ``` --- # Subsetting a DNAString Set Or we can use the Biostrings function **subseq()** on our DNAString. Note that **subseq()** is a faster implementation of subsetting for DNAString objects. ```r subseq(chr19_Seq,start=1,end=10000000) ``` ``` ## 10000000-letter DNAString object ## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG ``` --- # Useful Functions The Biostrings package has many useful functions for handling sequence data. We can review the frequency of bases using the **alphabetFrequency()** function. ```r alphabetFrequency(chr19_Seq) ``` ``` ## A C G T M R W S ## 16732680 12449343 12420880 16602953 0 0 0 0 ## Y K V H D B N - ## 0 0 0 0 0 0 3225710 0 ## + . ## 0 0 ``` --- # Useful Functions We can get the complement,reverse and reverse complement of sequence using the **complement()**, **reverse()**, **reverseComplement()**. ```r chr19_SeqComp <- complement(chr19_Seq) chr19_SeqRev <- reverse(chr19_Seq) chr19_SeqRevComp <- reverseComplement(chr19_Seq[10000000:10000010]) chr19_Seq[10000000:10000010] ``` ``` ## 11-letter DNAString object ## seq: GTAATGTACAG ``` ```r chr19_SeqRevComp ``` ``` ## 11-letter DNAString object ## seq: CTGTACATTAC ``` --- # Useful Functions We can get the translation of our sequence using the **translate()**. ```r length(chr19_Seq[10000000:10000008]) ``` ``` ## [1] 9 ``` ```r chr19_SeqTranslation <- translate(chr19_Seq[10000000:10000008]) chr19_SeqTranslation ``` ``` ## 3-letter AAString object ## seq: VMY ``` --- # Useful Functions We can get also match and count the occurrence of patterns in our sequence using **matchPattern()** and **countPattern()**. To count occurrence of patterns we provide our sequence to match to the **pattern** argument and the sequence to search within as a DNAstring object to the **subject** argument. ```r chr19_Count <- countPattern(pattern="ATCTGCAATG", subject=chr19_Seq) chr19_Count ``` ``` ## [1] 86 ``` --- # Useful Functions We may not always expect a perfect match for our sequence. We can search for close matches by setting the degree of specificity to the **max.mismatch** and **min.mismatch** arguments. By default both **max.mismatch** and **min.mismatch** are set to 0. ```r chr19_Count <- countPattern(pattern="ATCTGCAATG", subject=chr19_Seq, max.mismatch = 2, min.mismatch = 0) chr19_Count ``` ``` ## [1] 34456 ``` --- # Useful Functions We can also use IUPAC codes to account for ambiguity in the sequence we will match. We can see the availble IUPAC code by viewing the Biostring named vector **IUPAC_CODE_MAP**. ```r IUPAC_CODE_MAP ``` ``` ## A C G T M R W S Y K V ## "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG" ## H D B N ## "ACT" "AGT" "CGT" "ACGT" ``` --- # Useful Functions We can make use of IUPAC codes in our search by setting the **fixed** argument to false. ```r chr19_Count <- countPattern(pattern="RYKHBNKYSRR", subject=chr19_Seq, fixed=FALSE) chr19_Count ``` ``` ## [1] 3303152 ``` --- # Useful Functions Typically we will search both strands (Watson and Crick/forward and reverse) to identify any patterns. ```r chr19_Count_Watson <- countPattern(pattern="ATCTGCAATG", subject=chr19_Seq) chr19_Count_Crick <- countPattern(pattern="ATCTGCAATG", subject=reverseComplement(chr19_Seq) ) Total_chr19_Count <- chr19_Count_Watson+chr19_Count_Crick ``` --- #Writing to a FASTA File The **Biostrings** package contains useful functions to read and write to a FASTA file. To write our DNAString object to a FASTA file we can use the **writeXStringSet()** function. First we will convert our DNAString to a DNAStringSet object using the **DNAStringSet()** function and set names using the **names()** function. ```r chr19_SeqSet <- DNAStringSet(chr19_Seq[10000000:10000010]) names(chr19_SeqSet) <- "chr19" writeXStringSet(chr19_SeqSet,filepath = "data/chr19_Seq.fa") ``` --- #Reading a FASTA File Now we can read our a FASTA file back to DNAStringSet object using the **readDNAStringSet()** function. We can access our DNAString from the DNAStringSet object again using **$** and **[[]]** functions. ```r chr19_FromFASTA <- readDNAStringSet(filepath = "data/chr19_Seq.fa") chr19_FromFASTA$chr19 ``` ``` ## 11-letter DNAString object ## seq: GTAATGTACAG ``` --- # Time for an exercise. [Link_to_exercises](../../exercises/exercises/fastaAndBiostrings_exercise.html) [Link_to_answers](../../exercises/answers/fastaAndBiostrings_answers.html)