SequencesInBioconductor.knit

class: center, middle, inverse, title-slide

# Sequences In Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

# Genomic Sequences

As we have seen earlier, genomic sequences are often stored in FASTA format.

---

# Genomic Sequences in Bioconductor

Genomic sequences can be handled in Bioconductor using the functions in the **Biostrings** package and full genome sequences for many model organisms are contained within the [**BSgenome** packages](https://bioconductor.org/packages/release/BiocViews.html#___AnnotationData).

---

# BSgenome packages

To make use of a BSgenome package we must first install and load the library.

The resulting BSgenome object (here BSgenome.Mmusculus.UCSC.mm10) contains the full genome sequence for the model organism.

```r
library(BSgenome.Mmusculus.UCSC.mm10)
class(BSgenome.Mmusculus.UCSC.mm10)
```

```
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
```
---
# BSgenome.Mmusculus.UCSC.mm10

```r
BSgenome.Mmusculus.UCSC.mm10
```

```
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrUn_GL456372       chrUn_GL456378       chrUn_GL456379      
## #   chrUn_GL456381       chrUn_GL456382       chrUn_GL456383      
## #   chrUn_GL456385       chrUn_GL456387       chrUn_GL456389      
## #   chrUn_GL456390       chrUn_GL456392       chrUn_GL456393      
## #   chrUn_GL456394       chrUn_GL456396       chrUn_JH584304      
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
```
---
# Data in BSgenome objects

Simply by typing the object name we have retrieved some important information.

Importantly it tells us how we can access information in the object.

- **seqnames()** function to retrieve all contig names.
- **$** and **[[]]** accessors to retrieve a given contig's sequence.
---
# Contig Names

We can retrieve contig names using **seqnames()** function.

```r
contigNames <- seqnames(BSgenome.Mmusculus.UCSC.mm10)
contigNames[1:22]
```

```
##  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chrX"  "chrY"  "chrM"
```
---

# Contig Lengths

We can also retrieve contig lengths using **seqlengths()** function.

```r
contigLengths <- seqlengths(BSgenome.Mmusculus.UCSC.mm10)
contigLengths[1:4]
```

```
##      chr1      chr2      chr3      chr4 
## 195471971 182113224 160039680 156508116
```

---

# Retrieving Sequence Information

Now we know sequence names, we can retrieve sequence information using either the **$** or **[[]]** accessors.

Here we use the data.frame and list accessor **$** with name of contig of interest

```r
chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10$chr19
chr19_Seq
```

```
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```
---
# Retrieving Sequence Information

now we use list accessor **[[]]** with name of contig of interest

```r
chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10[["chr19"]]
chr19_Seq
```

```
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
```

---

# Biostrings and DNAString

The returned sequence object is a **DNAString** object from the **Biostrings** package. DNAStrings are efficient objects for storing and accessing large sequences.

```r
class(chr19_Seq)
```

```
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
```

**?DNAString** for full help on DNAString objects.
---

# Subsetting a DNAString Set

We can subset a DNAString object just like indexing a normal vector

```r
chr19_Seq[1:10000000]
```

```
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
```

---

# Subsetting a DNAString Set

Or we can use the Biostrings function **subseq()** on our DNAString.

Note that **subseq()** is a faster implementation of subsetting for DNAString objects.

```r
subseq(chr19_Seq,start=1,end=10000000)
```

```
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
```

---

# Useful Functions

The Biostrings package has many useful functions for handling sequence data.

We can review the frequency of bases using the **alphabetFrequency()** function.

```r
alphabetFrequency(chr19_Seq)
```

```
##        A        C        G        T        M        R        W        S 
## 16732680 12449343 12420880 16602953        0        0        0        0 
##        Y        K        V        H        D        B        N        - 
##        0        0        0        0        0        0  3225710        0 
##        +        . 
##        0        0
```

---

# Useful Functions

We can get the complement,reverse and reverse complement of sequence using the **complement()**, **reverse()**, **reverseComplement()**.

```r
chr19_SeqComp <- complement(chr19_Seq)
chr19_SeqRev <- reverse(chr19_Seq)
chr19_SeqRevComp <- reverseComplement(chr19_Seq[10000000:10000010])
chr19_Seq[10000000:10000010]
```

```
## 11-letter DNAString object
## seq: GTAATGTACAG
```

```r
chr19_SeqRevComp
```

```
## 11-letter DNAString object
## seq: CTGTACATTAC
```

---

# Useful Functions

We can get the translation of our sequence using the **translate()**.

```r
length(chr19_Seq[10000000:10000008])
```

```
## [1] 9
```

```r
chr19_SeqTranslation <- translate(chr19_Seq[10000000:10000008])
chr19_SeqTranslation
```

```
## 3-letter AAString object
## seq: VMY
```

---

# Useful Functions

We can get also match and count the occurrence of patterns in our sequence using **matchPattern()** and **countPattern()**.

To count occurrence of patterns we provide our sequence to match to the **pattern** argument and the sequence to search within as a DNAstring object to the **subject** argument.

```r
chr19_Count <- countPattern(pattern="ATCTGCAATG",
                            subject=chr19_Seq)
chr19_Count
```

```
## [1] 86
```

---

# Useful Functions

We may not always expect a perfect match for our sequence. We can search for close matches by setting the degree of specificity to the **max.mismatch** and **min.mismatch** arguments.

By default both **max.mismatch** and **min.mismatch**  are set to 0.

```r
chr19_Count <- countPattern(pattern="ATCTGCAATG",
                            subject=chr19_Seq,
                            max.mismatch = 2,
                            min.mismatch = 0)
chr19_Count
```

```
## [1] 34456
```

---

# Useful Functions

We can also use IUPAC codes to account for ambiguity in the sequence we will match.

We can see the availble IUPAC code by viewing the Biostring named vector **IUPAC_CODE_MAP**.

```r
IUPAC_CODE_MAP
```

```
##      A      C      G      T      M      R      W      S      Y      K      V 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
##      H      D      B      N 
##  "ACT"  "AGT"  "CGT" "ACGT"
```

---

# Useful Functions

We can make use of IUPAC codes in our search by setting the **fixed** argument to false.

```r
chr19_Count <- countPattern(pattern="RYKHBNKYSRR",
                            subject=chr19_Seq,
                            fixed=FALSE)
chr19_Count
```

```
## [1] 3303152
```

---

# Useful Functions

Typically we will search both strands (Watson and Crick/forward and reverse) to identify any patterns.

```r
chr19_Count_Watson <- countPattern(pattern="ATCTGCAATG",
                                    subject=chr19_Seq)
chr19_Count_Crick <- countPattern(pattern="ATCTGCAATG",
                                    subject=reverseComplement(chr19_Seq)
                                   )
Total_chr19_Count <- chr19_Count_Watson+chr19_Count_Crick
```

---

#Writing to a FASTA File

The **Biostrings** package contains useful functions to read and write to a FASTA file.

To write our DNAString object to a FASTA file we can use the **writeXStringSet()** function.

First we will convert our DNAString to a DNAStringSet object using the **DNAStringSet()** function and set names using the **names()** function.

```r
chr19_SeqSet <- DNAStringSet(chr19_Seq[10000000:10000010])
names(chr19_SeqSet) <- "chr19"
writeXStringSet(chr19_SeqSet,filepath = "data/chr19_Seq.fa")
```

---

#Reading a FASTA File

Now we can read our a FASTA file back to DNAStringSet object using the **readDNAStringSet()** function.

We can access our DNAString from the DNAStringSet object again using **$** and **[[]]** functions.

```r
chr19_FromFASTA <- readDNAStringSet(filepath = "data/chr19_Seq.fa")
chr19_FromFASTA$chr19
```

```
## 11-letter DNAString object
## seq: GTAATGTACAG
```

---

# Time for an exercise.

[Link_to_exercises](../../exercises/exercises/fastaAndBiostrings_exercise.html)

[Link_to_answers](../../exercises/answers/fastaAndBiostrings_answers.html)