Chromatin immunoprecipitation followed by deep sequencing (ChIPseq) is a well-established technique that allows for the genome-wide identification of transcription factor binding sites and epigenetic marks.
Our raw ChIPseq sequencing data will be in FASTQ format.
In this ChIPseq workshop we will be investigating the genome wide binding patterns of the transcription factor Myc in mouse MEL and Ch12 cell lines.
We can retrieve the raw sequencing data from the ENCODE website.
Here we download the sequencing data for the Myc ChIPseq from the mouse MEL cell line, sample ENCSR000EUA (replicate 1), using the ENCODE portal.
The direct link to the raw sequencing reads in FASTQ format can be found here.
Download the FASTQ for the other Myc MEL replicate from sample ENCSR000EUA. The direct link is here.
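As a minimal sketch, we could also retrieve the file from within R. The URL below is illustrative of the ENCODE portal's download pattern and should be replaced with the direct link from the portal.
# Illustrative only: substitute the direct FASTQ link from the ENCODE portal.
fastqURL <- "https://www.encodeproject.org/files/ENCFF001NQP/@@download/ENCFF001NQP.fastq.gz"
download.file(fastqURL, destfile = "~/Downloads/ENCFF001NQP.fastq.gz", mode = "wb")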
Once we have downloaded the raw FASTQ data, we can use the ShortRead package to review our sequence data quality.
We have previously reviewed how to work with raw sequencing data in the FASTQ in Bioconductor session.
First we load the ShortRead library.
library(ShortRead)
First we will review the raw sequencing reads using functions in the ShortRead package. This is similar to the QC we performed for RNAseq.
We do not need to review all reads in the file to gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.
Note that when we subsample we retrieve random reads from across the entire FASTQ file. This is important, as FASTQ files are often ordered by their position on the sequencer.
We can subsample from a FASTQ file using functions in the ShortRead package.
Here we use the FastqSampler() and yield() functions to randomly sample a defined number of reads from a FASTQ file. We subsample 1 million reads, which should be enough to get an understanding of the quality of the data.
fqSample <- FastqSampler("~/Downloads/ENCFF001NQP.fastq.gz", n = 10^6)
fastq <- yield(fqSample)
The resulting object is a ShortReadQ object showing information on the number of cycles, base pairs in reads, and number of reads in memory.
fastq
## class: ShortReadQ
## length: 1000000 reads; width: 36 cycles
If we wish, we can access information from the FASTQ file using our familiar accessor functions.
readSequences <- sread(fastq)
readQuality <- quality(fastq)
readIDs <- id(fastq)
readSequences
## DNAStringSet object of length 1000000:
## width seq
## [1] 36 ATGAGAGTTCCTCTTTCTTACACATGTTTTTTTTTT
## [2] 36 GGTCANTGTGTTCAGTGTATGCTGCACTTACATTCC
## [3] 36 CTACCTGCTTCTTATCCAGCCCTCTCTTGTAATAGG
## [4] 36 GAATTGTTGATAATAACCTTATGCTTCTGTTGCTTA
## [5] 36 ATTCGTGGAGAGATAATGCGTGTATTTGGTTTTGTC
## ... ... ...
## [999996] 36 GAAATTCCAAAAACTATTTTTAGAACTTTACATATG
## [999997] 36 GTGGGGGCAGCAGACAAGTCCGGGGGAACAGTGAGC
## [999998] 36 CAAACAAACAAAACAAAACAAAACAAAAGAGAAGCA
## [999999] 36 TTGTATCCAGGAGAACCTTAGAATGTTCAGTGATGT
## [1000000] 36 AGGGACCGGCAAGTATTTCCCGCCTCATGTTTTGTC
We can check some simple quality metrics for our subsampled FASTQ data.
First, we can review the overall read quality scores.
We use the alphabetScore() function on our reads' qualities to retrieve the summed quality for every read in our subsample.
readQuality <- quality(fastq)
readQualities <- alphabetScore(readQuality)
readQualities[1:10]
## [1] 1109 1002 1190 1196 868 72 805 816 1041 1082
We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.
library(ggplot2)
toPlot <- data.frame(ReadQ = readQualities)
ggplot(toPlot, aes(x = ReadQ)) + geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
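This message is simply a reminder from ggplot2 that the default of 30 bins was used. If we prefer, we can set the number of bins ourselves (a minor tweak, not part of the original workflow).
# Re-plot the quality score histogram with an explicit number of bins
ggplot(toPlot, aes(x = ReadQ)) + geom_histogram(bins = 100) + theme_minimal()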
We can review the occurrence of DNA bases within reads, as well as the occurrence of DNA bases across sequencing cycles, using the alphabetFrequency() and alphabetByCycle() functions respectively.
Here we check the overall frequency of A, G, C, T and N (unknown bases) in our sequence reads.
readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3, ]
## A C G T M R W S Y K V H D B N - + .
## [1,] 6 6 4 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 6 8 8 13 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [3,] 6 12 5 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.
summed__AlpFreq <- colSums(readSequences_AlpFreq)
summed__AlpFreq[c("A", "C", "G", "T", "N")]
## A C G T N
## 10028851 7841813 7650350 10104255 374731
We can review DNA base occurrence by cycle using the alphabetByCycle() function.
readSequences_AlpbyCycle <- alphabetByCycle(readSequences)
readSequences_AlpbyCycle[1:4, 1:10]
## cycle
## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## A 307393 328036 282541 275320 276282 277553 278262 281389 271127 278521
## C 179812 173882 205519 223468 210473 220596 213693 217991 220137 222141
## G 242744 246446 227531 212323 212634 214609 210106 210020 213682 210753
## T 265988 248240 279490 279258 290871 277974 288655 279444 284173 277758
We often plot this to visualise the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame.
AFreq <- readSequences_AlpbyCycle["A", ]
CFreq <- readSequences_AlpbyCycle["C", ]
GFreq <- readSequences_AlpbyCycle["G", ]
TFreq <- readSequences_AlpbyCycle["T", ]
toPlot <- data.frame(Count = c(AFreq, CFreq, GFreq, TFreq), Cycle = rep(1:36, 4),
    Base = rep(c("A", "C", "G", "T"), each = 36))
Now we can plot the frequencies using ggplot2.
ggplot(toPlot, aes(y = Count, x = Cycle, colour = Base)) + geom_line() + theme_bw()
We can also assess mean read quality over cycles. This will allow us to identify whether there are any issues with quality dropping off over time.
For this we first use as(readQuality, "matrix") to translate our ASCII-encoded quality scores into numeric quality scores.
qualAsMatrix <- as(readQuality, "matrix")
qualAsMatrix[1:2, ]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 33 33 32 33 33 26 26 16 27 29 32 33 30 22
## [2,] 33 33 18 25 22 4 20 33 31 33 30 33 32 28
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] 30 33 33 31 25 32 33 34 33 33 33 33
## [2,] 32 31 33 24 33 31 33 27 30 31 27 19
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## [1,] 25 33 33 34 34 32 33 33 32 32
## [2,] 25 33 30 30 28 24 30 29 25 22
We can now visualise qualities across cycles using a boxplot.
boxplot(qualAsMatrix[1:1000, ])
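If we want a single summary per cycle rather than the full distribution, a small sketch (not part of the original workflow) is to take the column means of the quality matrix using base R.
# Mean numeric quality score at each cycle across the subsampled reads
cycleMeans <- colMeans(qualAsMatrix)
plot(cycleMeans, type = "b", xlab = "Cycle", ylab = "Mean quality score")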
In this case the distribution of read quality scores and read qualities over cycles looks okay. We will often want to assess FASTQ samples together to see if any samples stick out by these metrics.
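A minimal sketch of how we might do this, assuming both Myc MEL replicate FASTQs have been downloaded (the second file path below is a placeholder):
# Placeholder paths -- substitute the FASTQ files downloaded from ENCODE.
fastqFiles <- c("~/Downloads/ENCFF001NQP.fastq.gz", "~/Downloads/Myc_MEL_rep2.fastq.gz")
sampleSummary <- sapply(fastqFiles, function(f) {
    fq <- yield(FastqSampler(f, n = 10^5))
    c(meanReadQuality = mean(alphabetScore(fq)),
      percentN = 100 * sum(alphabetFrequency(sread(fq))[, "N"])/sum(width(fq)))
})
sampleSummary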
Here we observed a second population of low quality scores, so we will remove reads with low quality scores and a high number of unknown (N) bases.
We will want to conserve our memory usage so we can deal with loading large files.
Here we set up a FastqStreamer object to read in 100,000 reads at a time.
fqStreamer <- FastqStreamer("~/Downloads/ENCFF001NQP.fastq.gz", n = 1e+05)
Now we loop through the file, filter reads, and write out a FASTQ of our filtered reads.
We are filtering out low quality reads and reads with many nonspecific (N) base calls.
TotalReads <- 0
TotalReadsFilt <- 0
while (length(fq <- yield(fqStreamer)) > 0) {
    # Keep a running total of all reads seen
    TotalReads <- TotalReads + length(fq)
    # Keep reads with a summed quality score above 300
    filt1 <- fq[alphabetScore(fq) > 300]
    # Keep reads with fewer than 10 N base calls
    filt2 <- filt1[alphabetFrequency(sread(filt1))[, "N"] < 10]
    TotalReadsFilt <- TotalReadsFilt + length(filt2)
    # Append the filtered chunk to the output FASTQ
    writeFastq(filt2, "filtered_ENCFF001NQP.fastq.gz", mode = "a")
}
TotalReads
## [1] 25555179
TotalReadsFilt
## [1] 22864597
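We can also express this as the fraction of reads retained after filtering.
# Roughly 0.89, i.e. ~89% of reads pass both filters
TotalReadsFilt/TotalReads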
Following assessment of read quality and any read filtering we have applied, we will want to align our reads to the genome in order to identify genomic locations showing enrichment of aligned reads above background.
Since ChIPseq reads align contiguously against our reference genome (there is no splicing to account for), we can use the genomic aligners we have seen in previous sessions. The resulting BAM file will contain the aligned sequence reads for use in further analysis.
First we need to retrieve the sequence information for the genome of interest in FASTA format.
We can use the BSgenome libraries to retrieve the full sequence information.
For the mouse mm10 genome we load the package BSgenome.Mmusculus.UCSC.mm10.
library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## # chr1 chr2 chr3
## # chr4 chr5 chr6
## # chr7 chr8 chr9
## # chr10 chr11 chr12
## # chr13 chr14 chr15
## # ... ... ...
## # chrUn_GL456372 chrUn_GL456378 chrUn_GL456379
## # chrUn_GL456381 chrUn_GL456382 chrUn_GL456383
## # chrUn_GL456385 chrUn_GL456387 chrUn_GL456389
## # chrUn_GL456390 chrUn_GL456392 chrUn_GL456393
## # chrUn_GL456394 chrUn_GL456396 chrUn_JH584304
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)
We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a DNAStringSet object from the retrieved sequences.
<- paste0("chr", c(1:19, "X", "Y", "M"))
mainChromosomes <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]])
mainChrSeq names(mainChrSeq) <- mainChromosomes
<- DNAStringSet(mainChrSeq)
mainChrSeqSet mainChrSeqSet
## DNAStringSet object of length 22:
## width seq names
## [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1
## [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2
## [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3
## [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4
## [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5
## ... ... ...
## [18] 90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18
## [19] 61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19
## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX
## [21] 91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY
## [22] 16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM
Now that we have a DNAStringSet object, we can use writeXStringSet() to create our FASTA file of sequences to align to.
writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa")
We will be aligning using the subread algorithm from the folks behind the subread aligner, so we can use the Rsubread package. Before we attempt to align our FASTQ files, we will first need to build an index from our reference genome using the buildindex() function.
REMEMBER: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or desktop.
Luckily we did this for RNAseq, so hopefully you will already have a built index.
library(Rsubread)
buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000,
indexSplit = TRUE)
We can align our raw sequence data in FASTQ format to the new FASTA file of our mm10 genome sequence using the Rsubread package. Specifically, we will be using the align() function, as this utilizes the subread genomic alignment algorithm.
Note that here we set the phredOffset to 64, since this FASTQ uses the Phred+64 quality encoding. Rsubread may complain if this is set incorrectly.
<- align("mm10_mainchrs", "filtered_ENCFF001NQP.fastq.gz", output_format = "BAM",
myMapped output_file = "Myc_Mel_1.bam", type = "dna", phredOffset = 64, nthreads = 4)
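One way to sanity-check the phredOffset choice (a sketch, assuming the subsampled fastq object from earlier is still in memory) is to look at the quality class that ShortRead detected; this detection is heuristic, so treat it as a guide rather than a guarantee.
# "SFastqQuality" indicates Solexa/Illumina 1.3+ (Phred+64) encoding,
# while "FastqQuality" indicates Phred+33.
class(quality(fastq))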
One of the most well-known families of alignment algorithms is the Bowtie family. We can access Bowtie2 with the Rbowtie2 package.
The QuasR package allows access to the original Bowtie aligner, but it is a little slow and memory hungry. Check out our session dedicated to alignment for using QuasR.
library(Rbowtie2)
As with Rsubread, the Rbowtie2 package requires us first to create an index to align to.
We can do this using the bowtie2_build() function, specifying our FASTA file and the desired name of the index.
bowtie2_build(references = "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", bt2Index = file.path("BSgenome.Mmusculus.UCSC.mm10.mainChrs"))
We can then align our FASTQ data using the bowtie2() function, specifying our newly created index, the desired name of the SAM output, and an uncompressed FASTQ.
We will therefore first need to uncompress our FASTQ. Here we use the remove = FALSE setting to keep the original compressed FASTQ.
library(R.utils)
gunzip("filtered_ENCFF001NQP.fastq.gz", remove = FALSE)
bowtie2(bt2Index = "BSgenome.Mmusculus.UCSC.mm10.mainChrs", samOutput = "ENCFF001NQP.sam",
seq1 = "filtered_ENCFF001NQP.fastq")
Since Rbowtie2 outputs a SAM file, we will need to convert it to a BAM file. We can do this with the Rsamtools asBam() function.
bowtieBam <- asBam("ENCFF001NQP.sam")
An important consideration when using Rbowtie2 is its input and output of uncompressed files.
On the command line we may stream inputs to Bowtie2, but in R this isn't an option (yet!).
We would need to make sure we delete any temporary files created (SAM and/or uncompressed FASTQ) to avoid filling up our hard drives. We can delete files in R using the unlink() function.
unlink("ENCFF001NQP.sam")
As before, we sort and index our files using the Rsamtools package's sortBam() and indexBam() functions respectively.
The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R.
library(Rsamtools)
sortBam("Myc_Mel_1.bam", "SR_Myc_Mel_rep1")
indexBam("SR_Myc_Mel_rep1.bam")
Now that we have the index for the BAM file, we can retrieve and plot the number of mapped reads using the idxstatsBam() function.
mappedReads <- idxstatsBam("SR_Myc_Mel_rep1.bam")
TotalMapped <- sum(mappedReads[, "mapped"])
ggplot(mappedReads, aes(x = seqnames, y = mapped)) + geom_bar(stat = "identity") +
    coord_flip()
We can also create a bigWig from our sorted, indexed BAM file to allow us to quickly review our data in IGV.
First we use the coverage() function to create an RleList object containing our coverage scores.
forBigWig <- coverage("SR_Myc_Mel_rep1.bam")
forBigWig
We can now export our RleList object as a bigWig using the rtracklayer package's export.bw() function.
library(rtracklayer)
export.bw(forBigWig, con = "SR_Myc_Mel_rep1.bw")
We may wish to normalize our coverage to allow us to compare enrichment across samples.
We can use the weight parameter in coverage() to scale each read's contribution to one million divided by the total number of mapped reads, giving coverage in reads per million.
<- coverage("SR_Myc_Mel_rep1.bam", weight = (10^6)/TotalMapped)
forBigWig
forBigWigexport.bw(forBigWig, con = "SR_Myc_Mel_rep1_weighted.bw")