+ - 0:00:00
Notes for current slide
Notes for next slide

Sequences In Bioconductor


Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Bioconductor_Introduction/

1 / 24

Genomic Sequences

As we have seen earlier, genomic sequences are often stored in FASTA format.

igv
2 / 24

Genomic Sequences in Bioconductor

Genomic sequences can be handled in Bioconductor using the functions in the Biostrings package and full genome sequences for many model organisms are contained within the BSgenome packages.

igv
3 / 24

BSgenome packages

To make use of a BSgenome package we must first install and load the library.

The resulting BSgenome object (here BSgenome.Mmusculus.UCSC.mm10) contains the full genome sequence for the model organism.

library(BSgenome.Mmusculus.UCSC.mm10)
class(BSgenome.Mmusculus.UCSC.mm10)
## [1] "BSgenome"
## attr(,"package")
## [1] "BSgenome"
4 / 24

BSgenome.Mmusculus.UCSC.mm10

BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
5 / 24

Data in BSgenome objects

Simply by typing the object name we have retrieved some important information.

Importantly it tells us how we can access information in the object.

  • seqnames() function to retrieve all contig names.
  • $ and [[]] accessors to retrieve a given contig's sequence.
6 / 24

Contig Names

We can retrieve contig names using seqnames() function.

contigNames <- seqnames(BSgenome.Mmusculus.UCSC.mm10)
contigNames[1:22]
## [1] "chr1" "chr2" "chr3" "chr4" "chr5" "chr6" "chr7" "chr8" "chr9"
## [10] "chr10" "chr11" "chr12" "chr13" "chr14" "chr15" "chr16" "chr17" "chr18"
## [19] "chr19" "chrX" "chrY" "chrM"
7 / 24

Contig Lengths

We can also retrieve contig lengths using seqlengths() function.

contigLengths <- seqlengths(BSgenome.Mmusculus.UCSC.mm10)
contigLengths[1:4]
## chr1 chr2 chr3 chr4
## 195471971 182113224 160039680 156508116
8 / 24

Retrieving Sequence Information

Now we know sequence names, we can retrieve sequence information using either the $ or [[]] accessors.

Here we use the data.frame and list accessor $ with name of contig of interest

chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10$chr19
chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
9 / 24

Retrieving Sequence Information

now we use list accessor [[]] with name of contig of interest

chr19_Seq <- BSgenome.Mmusculus.UCSC.mm10[["chr19"]]
chr19_Seq
## 61431566-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
10 / 24

Biostrings and DNAString

The returned sequence object is a DNAString object from the Biostrings package. DNAStrings are efficient objects for storing and accessing large sequences.

class(chr19_Seq)
## [1] "DNAString"
## attr(,"package")
## [1] "Biostrings"

?DNAString for full help on DNAString objects.

11 / 24

Subsetting a DNAString Set

We can subset a DNAString object just like indexing a normal vector

chr19_Seq[1:10000000]
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
12 / 24

Subsetting a DNAString Set

Or we can use the Biostrings function subseq() on our DNAString.

Note that subseq() is a faster implementation of subsetting for DNAString objects.

subseq(chr19_Seq,start=1,end=10000000)
## 10000000-letter DNAString object
## seq: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...GCAGACCCAACTTGCCCAGGTGCGGGTTTTCCATTG
13 / 24

Useful Functions

The Biostrings package has many useful functions for handling sequence data.

We can review the frequency of bases using the alphabetFrequency() function.

alphabetFrequency(chr19_Seq)
## A C G T M R W S
## 16732680 12449343 12420880 16602953 0 0 0 0
## Y K V H D B N -
## 0 0 0 0 0 0 3225710 0
## + .
## 0 0
14 / 24

Useful Functions

We can get the complement,reverse and reverse complement of sequence using the complement(), reverse(), reverseComplement().

chr19_SeqComp <- complement(chr19_Seq)
chr19_SeqRev <- reverse(chr19_Seq)
chr19_SeqRevComp <- reverseComplement(chr19_Seq[10000000:10000010])
chr19_Seq[10000000:10000010]
## 11-letter DNAString object
## seq: GTAATGTACAG
chr19_SeqRevComp
## 11-letter DNAString object
## seq: CTGTACATTAC
15 / 24

Useful Functions

We can get the translation of our sequence using the translate().

length(chr19_Seq[10000000:10000008])
## [1] 9
chr19_SeqTranslation <- translate(chr19_Seq[10000000:10000008])
chr19_SeqTranslation
## 3-letter AAString object
## seq: VMY
16 / 24

Useful Functions

We can get also match and count the occurrence of patterns in our sequence using matchPattern() and countPattern().

To count occurrence of patterns we provide our sequence to match to the pattern argument and the sequence to search within as a DNAstring object to the subject argument.

chr19_Count <- countPattern(pattern="ATCTGCAATG",
subject=chr19_Seq)
chr19_Count
## [1] 86
17 / 24

Useful Functions

We may not always expect a perfect match for our sequence. We can search for close matches by setting the degree of specificity to the max.mismatch and min.mismatch arguments.

By default both max.mismatch and min.mismatch are set to 0.

chr19_Count <- countPattern(pattern="ATCTGCAATG",
subject=chr19_Seq,
max.mismatch = 2,
min.mismatch = 0)
chr19_Count
## [1] 34456
18 / 24

Useful Functions

We can also use IUPAC codes to account for ambiguity in the sequence we will match.

We can see the availble IUPAC code by viewing the Biostring named vector IUPAC_CODE_MAP.

IUPAC_CODE_MAP
## A C G T M R W S Y K V
## "A" "C" "G" "T" "AC" "AG" "AT" "CG" "CT" "GT" "ACG"
## H D B N
## "ACT" "AGT" "CGT" "ACGT"
19 / 24

Useful Functions

We can make use of IUPAC codes in our search by setting the fixed argument to false.

chr19_Count <- countPattern(pattern="RYKHBNKYSRR",
subject=chr19_Seq,
fixed=FALSE)
chr19_Count
## [1] 3303152
20 / 24

Useful Functions

Typically we will search both strands (Watson and Crick/forward and reverse) to identify any patterns.

chr19_Count_Watson <- countPattern(pattern="ATCTGCAATG",
subject=chr19_Seq)
chr19_Count_Crick <- countPattern(pattern="ATCTGCAATG",
subject=reverseComplement(chr19_Seq)
)
Total_chr19_Count <- chr19_Count_Watson+chr19_Count_Crick
21 / 24

Writing to a FASTA File

The Biostrings package contains useful functions to read and write to a FASTA file.

To write our DNAString object to a FASTA file we can use the writeXStringSet() function.

First we will convert our DNAString to a DNAStringSet object using the DNAStringSet() function and set names using the names() function.

chr19_SeqSet <- DNAStringSet(chr19_Seq[10000000:10000010])
names(chr19_SeqSet) <- "chr19"
writeXStringSet(chr19_SeqSet,filepath = "data/chr19_Seq.fa")
22 / 24

Reading a FASTA File

Now we can read our a FASTA file back to DNAStringSet object using the readDNAStringSet() function.

We can access our DNAString from the DNAStringSet object again using $ and [[]] functions.

chr19_FromFASTA <- readDNAStringSet(filepath = "data/chr19_Seq.fa")
chr19_FromFASTA$chr19
## 11-letter DNAString object
## seq: GTAATGTACAG
23 / 24

Time for an exercise.

Link_to_exercises

Link_to_answers

24 / 24

Genomic Sequences

As we have seen earlier, genomic sequences are often stored in FASTA format.

igv
2 / 24
Paused

Help

Keyboard shortcuts

, , Pg Up, k Go to previous slide
, , Pg Dn, Space, j Go to next slide
Home Go to first slide
End Go to last slide
Number + Return Go to specific slide
b / m / f Toggle blackout / mirrored / fullscreen mode
c Clone slideshow
p Toggle presenter mode
t Restart the presentation timer
?, h Toggle this help
Esc Back to slideshow