Working with Genomic Intervals

In this exercise we will work with some ChIP-seq peak calls for the Encode experiment ENCSR000ERN

  1. Install the GenomicRanges and rtracklayer package
install.packages("BiocManager")
require(BiocManager)
install("GenomicRanges",update = FALSE)
install("rtracklayer",update = FALSE)
  1. Read in the file data/Myc_Ch12_1_withInput_Input_Ch12_peaks.xls and create a GRanges object which includes values of fold_enrichment and - log10 pvalue provided in file.
suppressPackageStartupMessages(library(rtracklayer))
suppressPackageStartupMessages(library(GenomicRanges))
"%in%" <- BiocGenerics::"%in%"
library(rtracklayer)
library(GenomicRanges)
MYCpeaks <- read.delim("data/Myc_Ch12_1_withInput_Input_Ch12_peaks.xls",sep="\t",comment.char = "#")
MYCgranges <- GRanges(MYCpeaks$chr,IRanges(MYCpeaks$start,MYCpeaks$end),
                       FE=MYCpeaks$fold_enrichment,
                       minuslog10Pval=MYCpeaks$X.log10.pvalue.)
  1. Create a boxplot of the fold enrichments in genomic intervals over every chromosome.
MYCDF <- as.data.frame(MYCgranges)
library(ggplot2)
ggplot(MYCDF,aes(x=seqnames,y=FE,color=seqnames))+geom_boxplot()+theme_bw()

  1. Create a GRanges object of genomic intervals on chromosome 1 with scores greater than 10 and pvalue less than 0.0001 and export as BED file filteredMyc.bed.
MYCfilteredGRanges <- MYCgranges[MYCgranges$FE > 10 & MYCgranges$minuslog10Pval > -log10(0.0001)]
MYCfilteredGRangesChr1 <- MYCfilteredGRanges[seqnames(MYCfilteredGRanges) %in% "chr1"]
export.bed(MYCfilteredGRangesChr1,"filteredMyc.bed")
  1. Read in TXT file of gene positions (containing contig - named as seqnames column-, genomic start and end) from the file data/mm10_GenePosForIGV.txt and export to a BED file Genes.bed
IGVGenePositions <- read.delim("data/mm10_GenePosForIGV.txt",sep="\t")

genePos <- GRanges(IGVGenePositions$seqnames,
                   IRanges(IGVGenePositions$start,IGVGenePositions$end),
                   strand = IGVGenePositions$strand)

names(genePos) <- IGVGenePositions$gene_id
export.bed(genePos,con="Genes.bed")
  1. Create a GRanges of the transcriptional start site positions of every gene (1bp exact TSS).
tssPos <- resize(genePos,width = 1,fix = "start")
  1. Extend this GRanges to be +/- 500 bp around the transciptional start sites.
tssPosExt <- resize(tssPos,width = 1000,fix = "center")
  1. Create a BED file (called filteredMycOnTSS.bed) of our Myc peaks from question 4 which overlap our new GRanges of +/- 500 bp around transciptional start sites.
MYCforBED <- MYCfilteredGRangesChr1[MYCfilteredGRangesChr1 %over% tssPosExt]
export.bed(MYCforBED,con="filteredMycOnTSS.bed")
  1. Import the generated BED files (filteredMycOnTSS.bed,filteredMyc.bed, Genes.bed) into IGV on mm10 genome.

Download the signal p-value bigwig from the encode portal for replicate 1 from experiment ENCSR000ERN and capture an image of bigWig signal over our BED intervals over igfbp2 gene.

10. Import the data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed BED file to a GRanges. The 5th column in file represents summit height.

mySummits <- import.bed("data/Myc_Ch12_1_withInput_Input_Ch12_summits.bed")
  1. Generate density plots of summit heights for summits overlapping and not overlapping our extended TSS positions
mySummits$Overlap <- ifelse(mySummits %over% tssPosExt,"TSS","Not_TSS")
mySummitsDF <- as.data.frame(mySummits)
ggplot(mySummitsDF,aes(x=score,fill=Overlap))+geom_density(alpha=0.5)+
  facet_wrap(~seqnames)+theme_bw()

  1. Filter the Summits GRanges to the top 500 ranked by the GRanges score column.
orderSummits500 <- mySummits[order(mySummits$score,decreasing = TRUE)][1:500,]
  1. Extend these top 500 Summits GRanges to 50bps around the centre of the GRanges intervals. Extract the sequences under the peaks and write to a file.
centredSummits <- resize(orderSummits500,width = 50,fix="center")
library(BSgenome.Mmusculus.UCSC.mm10)
centredSummitsSeq <- getSeq(BSgenome.Mmusculus.UCSC.mm10,centredSummits)
writeXStringSet(centredSummitsSeq,filepath = "top500bySummit.fa")