Genomic Sequences

As we have seen earlier, genomic sequences are often stored in FASTA format.

igv

Genomic Sequences in Bioconductor

Genomic sequences can be handled in Bioconductor using the functions in the Biostrings package and full genome sequences for many model organisms are contained within the BSgenome packages.

igv

BSgenome packages

To make use of a BSgenome package we must first install and load the library.

The resulting BSgenome object (here BSgenome.Mmusculus.UCSC.mm10) contains the full genome sequence for the model organism.

library(BSgenome.Mmusculus.UCSC.mm10)
class(BSgenome.Mmusculus.UCSC.mm10)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"

BSgenome.Mmusculus.UCSC.mm10

BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrUn_GL456372       chrUn_GL456378       chrUn_GL456379      
## #   chrUn_GL456381       chrUn_GL456382       chrUn_GL456383      
## #   chrUn_GL456385       chrUn_GL456387       chrUn_GL456389      
## #   chrUn_GL456390       chrUn_GL456392       chrUn_GL456393      
## #   chrUn_GL456394       chrUn_GL456396       chrUn_JH584304      
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

Data in BSgenome objects

Simply by typing the object name we have retrieved some important information.

Importantly it tells us how we can access information in the object.

  • seqnames() function to retrieve all contig names.
  • $ and [[]] accessors to retrieve a given contig’s sequence.

Contig Names

We can retrieve contig names using seqnames() function.

contigNames <- seqnames(BSgenome.Mmusculus.UCSC.mm10)
contigNames[1:22]
##  [1] "chr1"  "chr2"  "chr3"  "chr4"  "chr5"  "chr6"  "chr7"  "chr8"  "chr9" 
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chrX"  "chrY"  "chrM"

Contig Lengths

We can also retrieve contig lengths using seqlengths() function.

contigLengths <- seqlengths(BSgenome.Mmusculus.UCSC.mm10)
contigLengths[1:4]
##      chr1      chr2      chr3      chr4 
## 195471971 182113224 160039680 156508116

Retrieving Sequence Information

Now we know sequence names, we can retrieve sequence information using either the $ or [[]] accessors.

Here we use the data.frame and list accessor $ with name of contig of interest

chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10$chr19
chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Retrieving Sequence Information

now we use list accessor [[]] with name of contig of interest

chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10[["chr19"]]
chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Biostrings and DNAString

The returned sequence object is a DNAString object from the Biostrings package. DNAStrings are efficient objects for storing and accessing large sequences.

class(chr19_Seq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

?DNAString for full help on DNAString objects.

Subsetting a DNAString Set

We can subset a DNAString object just like indexing a normal vector

chr19_Seq[1:10000000]
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG

Subsetting a DNAString Set

Or we can use the Biostrings function subseq() on our DNAString.

Note that subseq() is a faster implementation of subsetting for DNAString objects.

subseq(chr19_Seq,start=1,end=10000000)
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG

Useful Functions

The Biostrings package has many useful functions for handling sequence data.

We can review the frequency of bases using the alphabetFrequency() function.

alphabetFrequency(chr19_Seq)
##        A        C        G        T        M        R        W        S 
## 16732680 12449343 12420880 16602953        0        0        0        0 
##        Y        K        V        H        D        B        N        - 
##        0        0        0        0        0        0  3225710        0 
##        +        . 
##        0        0

Useful Functions

We can get the complement,reverse and reverse complement of sequence using the complement(), reverse(), reverseComplement().

chr19_SeqComp <- complement(chr19_Seq)
chr19_SeqRev <- reverse(chr19_Seq)
chr19_SeqRevComp <- reverseComplement(chr19_Seq[10000000:10000010])
chr19_Seq[10000000:10000010]
## 11-letter DNAString object
## seq: GTAATGTACAG
chr19_SeqRevComp
## 11-letter DNAString object
## seq: CTGTACATTAC

Useful Functions

We can get the translation of our sequence using the translate().

length(chr19_Seq[10000000:10000008])
## [1] 9
chr19_SeqTranslation <- translate(chr19_Seq[10000000:10000008])
chr19_SeqTranslation
## 3-letter AAString object
## seq: VMY

Useful Functions

We can get also match and count the occurrence of patterns in our sequence using matchPattern() and countPattern().

To count occurrence of patterns we provide our sequence to match to the pattern argument and the sequence to search within as a DNAstring object to the subject argument.

chr19_Count <- countPattern(pattern="ATCTGCAATG",
                            subject=chr19_Seq)
chr19_Count
## [1] 86

Useful Functions

We may not always expect a perfect match for our sequence. We can search for close matches by setting the degree of specificity to the max.mismatch and min.mismatch arguments.

By default both max.mismatch and min.mismatch are set to 0.

chr19_Count <- countPattern(pattern="ATCTGCAATG",
                            subject=chr19_Seq,
                            max.mismatch = 2,
                            min.mismatch = 0)
chr19_Count
## [1] 34456

Useful Functions

We can also use IUPAC codes to account for ambiguity in the sequence we will match.

We can see the availble IUPAC code by viewing the Biostring named vector IUPAC_CODE_MAP.

IUPAC_CODE_MAP
##      A      C      G      T      M      R      W      S      Y      K      V 
##    "A"    "C"    "G"    "T"   "AC"   "AG"   "AT"   "CG"   "CT"   "GT"  "ACG" 
##      H      D      B      N 
##  "ACT"  "AGT"  "CGT" "ACGT"

Useful Functions

We can make use of IUPAC codes in our search by setting the fixed argument to false.

chr19_Count <- countPattern(pattern="RYKHBNKYSRR",
                            subject=chr19_Seq,
                            fixed=FALSE)
chr19_Count
## [1] 3303152

Useful Functions

Typically we will search both strands (Watson and Crick/forward and reverse) to identify any patterns.

chr19_Count_Watson <- countPattern(pattern="ATCTGCAATG",
                                    subject=chr19_Seq)
chr19_Count_Crick <- countPattern(pattern="ATCTGCAATG",
                                    subject=reverseComplement(chr19_Seq)
                                   )
Total_chr19_Count <- chr19_Count_Watson+chr19_Count_Crick

#Writing to a FASTA File

The Biostrings package contains useful functions to read and write to a FASTA file.

To write our DNAString object to a FASTA file we can use the writeXStringSet() function.

First we will convert our DNAString to a DNAStringSet object using the DNAStringSet() function and set names using the names() function.

chr19_SeqSet <- DNAStringSet(chr19_Seq[10000000:10000010])
names(chr19_SeqSet) <- "chr19"
writeXStringSet(chr19_SeqSet,filepath = "data/chr19_Seq.fa")

#Reading a FASTA File

Now we can read our a FASTA file back to DNAStringSet object using the readDNAStringSet() function.

We can access our DNAString from the DNAStringSet object again using $ and [[]] functions.

chr19_FromFASTA <- readDNAStringSet(filepath = "data/chr19_Seq.fa")
chr19_FromFASTA$chr19
## 11-letter DNAString object
## seq: GTAATGTACAG

Time for an exercise.

Link_to_exercises

Link_to_answers