As we have seen earlier, genomic sequences are often stored in FASTA format.
Genomic sequences can be handled in Bioconductor using the functions in the Biostrings package and full genome sequences for many model organisms are contained within the BSgenome packages.
To make use of a BSgenome package we must first install and load the library.
The resulting BSgenome object (here BSgenome.Mmusculus.UCSC.mm10) contains the full genome sequence for the model organism.
library(BSgenome.Mmusculus.UCSC.mm10)
class(BSgenome.Mmusculus.UCSC.mm10)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
Simply by typing the object name we have retrieved some important information.
Importantly it tells us how we can access information in the object.
We can retrieve contig names using seqnames() function.
<- seqnames(BSgenome.Mmusculus.UCSC.mm10)
contigNames 1:22] contigNames[
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chrX" "chrY" "chrM"
We can also retrieve contig lengths using seqlengths() function.
<- seqlengths(BSgenome.Mmusculus.UCSC.mm10)
contigLengths 1:4] contigLengths[
## chr1 chr2 chr3 chr4
## 195471971 182113224 160039680 156508116
Now we know sequence names, we can retrieve sequence information using either the $ or [[]] accessors.
Here we use the data.frame and list accessor $ with name of contig of interest
<- BSgenome.Mmusculus.UCSC.mm10$chr19
chr19_Seq chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
now we use list accessor [[]] with name of contig of interest
<- BSgenome.Mmusculus.UCSC.mm10[["chr19"]]
chr19_Seq chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
The returned sequence object is a DNAString object from the Biostrings package. DNAStrings are efficient objects for storing and accessing large sequences.
class(chr19_Seq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"
?DNAString for full help on DNAString objects.
We can subset a DNAString object just like indexing a normal vector
1:10000000] chr19_Seq[
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
Or we can use the Biostrings function subseq() on our DNAString.
Note that subseq() is a faster implementation of subsetting for DNAString objects.
subseq(chr19_Seq,start=1,end=10000000)
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
The Biostrings package has many useful functions for handling sequence data.
We can review the frequency of bases using the alphabetFrequency() function.
alphabetFrequency(chr19_Seq)
## A C G T M R W S
## 16732680 12449343 12420880 16602953 0 0 0 0
## Y K V H D B N -
## 0 0 0 0 0 0 3225710 0
## + .
## 0 0
We can get the complement,reverse and reverse complement of sequence using the complement(), reverse(), reverseComplement().
<- complement(chr19_Seq)
chr19_SeqComp <- reverse(chr19_Seq)
chr19_SeqRev <- reverseComplement(chr19_Seq[10000000:10000010])
chr19_SeqRevComp 10000000:10000010] chr19_Seq[
## 11-letter DNAString object
## seq: GTAATGTACAG
chr19_SeqRevComp
## 11-letter DNAString object
## seq: CTGTACATTAC
We can get the translation of our sequence using the translate().
length(chr19_Seq[10000000:10000008])
## [1] 9
<- translate(chr19_Seq[10000000:10000008])
chr19_SeqTranslation chr19_SeqTranslation
## 3-letter AAString object
## seq: VMY
We can get also match and count the occurrence of patterns in our sequence using matchPattern() and countPattern().
To count occurrence of patterns we provide our sequence to match to the pattern argument and the sequence to search within as a DNAstring object to the subject argument.
<- countPattern(pattern="ATCTGCAATG",
chr19_Count subject=chr19_Seq)
chr19_Count
## [1] 86
We may not always expect a perfect match for our sequence. We can search for close matches by setting the degree of specificity to the max.mismatch and min.mismatch arguments.
By default both max.mismatch and min.mismatch are set to 0.
<- countPattern(pattern="ATCTGCAATG",
chr19_Count subject=chr19_Seq,
max.mismatch = 2,
min.mismatch = 0)
chr19_Count
## [1] 34456
We can also use IUPAC codes to account for ambiguity in the sequence we will match.
We can see the availble IUPAC code by viewing the Biostring named vector IUPAC_CODE_MAP.
IUPAC_CODE_MAP
## A C G T M R W S Y K V
## "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
## H D B N
## "ACT" "AGT" "CGT" "ACGT"
We can make use of IUPAC codes in our search by setting the fixed argument to false.
<- countPattern(pattern="RYKHBNKYSRR",
chr19_Count subject=chr19_Seq,
fixed=FALSE)
chr19_Count
## [1] 3303152
Typically we will search both strands (Watson and Crick/forward and reverse) to identify any patterns.
<- countPattern(pattern="ATCTGCAATG",
chr19_Count_Watson subject=chr19_Seq)
<- countPattern(pattern="ATCTGCAATG",
chr19_Count_Crick subject=reverseComplement(chr19_Seq)
)<- chr19_Count_Watson+chr19_Count_Crick Total_chr19_Count
#Writing to a FASTA File
The Biostrings package contains useful functions to read and write to a FASTA file.
To write our DNAString object to a FASTA file we can use the writeXStringSet() function.
First we will convert our DNAString to a DNAStringSet object using the DNAStringSet() function and set names using the names() function.
<- DNAStringSet(chr19_Seq[10000000:10000010])
chr19_SeqSet names(chr19_SeqSet) <- "chr19"
writeXStringSet(chr19_SeqSet,filepath = "data/chr19_Seq.fa")
#Reading a FASTA File
Now we can read our a FASTA file back to DNAStringSet object using the readDNAStringSet() function.
We can access our DNAString from the DNAStringSet object again using $ and [[]] functions.
<- readDNAStringSet(filepath = "data/chr19_Seq.fa")
chr19_FromFASTA $chr19 chr19_FromFASTA
## 11-letter DNAString object
## seq: GTAATGTACAG