Working with BSgenome objects using Biostrings

In these exercises will gain some experience working with the BSgenome packages.

  1. Load the library BSgenome.Hsapiens.UCSC.hg19
suppressPackageStartupMessages(library(BSgenome.Hsapiens.UCSC.hg19))
library(BSgenome.Hsapiens.UCSC.hg19)
  1. Count the number of contigs.
length(seqnames(BSgenome.Hsapiens.UCSC.hg19))
## [1] 298
  1. Give the sum of lengths of the 3 smallest chromosomes.
contigLengths <- seqlengths(BSgenome.Hsapiens.UCSC.hg19)
sum(contigLengths[order(contigLengths,decreasing = FALSE)][1:3])
## [1] 35839
  1. How many unknown bases - base N - are in chromosome 20
alpFreq <-alphabetFrequency(BSgenome.Hsapiens.UCSC.hg19$chr20)
alpFreq["N"]
##       N 
## 3520000
  1. Create a barchart of the total number of the A,T,C,G bases on chromosome 20.
library(ggplot2)
alpFreq <-alphabetFrequency(BSgenome.Hsapiens.UCSC.hg19$chr20)
atcgFreq <- alpFreq[c("A","T","C","G")]
atcgFreqDF <- data.frame(Bases=names(atcgFreq),Frequency=atcgFreq)
ggplot(atcgFreqDF,aes(y=Frequency,x=Bases,fill=Bases))+geom_bar(stat = "identity")+theme_minimal()

  1. Extract the sequence from chromosome 20 at position 1,000,000 to 1,000,020 and retrieve the complement sequence
mySeq <- BSgenome.Hsapiens.UCSC.hg19$chr20[1000000:1000020]
complement(mySeq)
## 21-letter DNAString object
## seq: CACCCTCTCTTGACCTTGTTC
  1. Write this complement sequence to a FASTA file.
mySet <- DNAStringSet(complement(mySeq))
names(mySet) <- "testFasta"
writeXStringSet(mySet,filepath ="data/testFasta.fa")
  1. Look up the position of MYC in IGV (Human hg19) and find the genomic coordinates of its first exon.
chr8:128748315-128748869
  1. Extract the sequence for the first exon.
mySeq <- BSgenome.Hsapiens.UCSC.hg19$chr8[128748315:128748869]
  1. Compare the sequence to that found in IGV and identify start of translated region in gene

  2. Count the number of classical start codons (ATG) in the first exon.

countPattern("ATG",mySeq)
## [1] 0
  1. Use IGV to review translation start codon for Gapdh and similarly count occurrence of ATG in exon 2 of NM_001289745 transcript. (chr12:6643976-6644027)
gapdhSeq <- BSgenome.Hsapiens.UCSC.hg19$chr12[6643976:6644027]
countPattern("ATG",gapdhSeq)
## [1] 1