ChIPseq (part 1)


ChIPseq Introduction

Chromatin immunoprecipitation, followed by deep sequencing (ChIPseq) is a well established technique which allows for the genome wide identification of transcription factor binding sites and epigenetic marks.

igv

ChIPseq Introduction

igv

]

  • Cross-linked and protein bound DNA.
  • Enrichment by antibody for specific protein or DNA state.
  • End repair, A-tailed and Illumina adapters added.
  • Fragments sequenced from either one/both ends. ]

The data

Our raw ChIPseq sequencing data will be in FASTQ format.

igv

The data

In this ChIPseq workshop we will be investigating the genome wide binding patterns of the transcription factor Myc in mouse MEL and Ch12 cell lines.

We can retrieve the raw sequencing data from Encode website.

Here we download the sequencing data for the Myc ChIPseq from the Mouse MEL cell line, sample ENCSR000EUA (replicate 1), using the Encode portal.

The direct link to the raw sequecing reads in FASTQ format can be found here.

Download the FASTQ for the other Myc MEL replicate from sample ENCSR000EUA. Direct link is here.

Working with raw ChIPseq data


Working with raw ChIPseq data

Once we have downloaded the raw FASTQ data we can use the ShortRead package to review our sequence data quality.

We have reviewed how to work with raw sequencing data in the FASTQ in Bioconductor session.

First we load the ShortRead library.

library(ShortRead)

Working with raw ChIPseq data

First we will review the raw sequencing reads using functions in the ShortRead package. This is similar to our QC we performed for RNAseq.

We do not need to review all reads in the file to can gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.

Note when we subsample we retrieve random reads from across the entire FASTQ file. This is important as FASTQ files are often ordered by their position on the sequencer.

Reading raw ChIPseq data

We can subsample from a FASTQ file using functions in ShortRead package.

Here we use the FastqSampler and yield function to randomly sample a defined number of reads from a FASTQ file. Here we subsample 1 million reads. This should be enough to have an understanding of the quality of the data.

fqSample <- FastqSampler("~/Downloads/ENCFF001NQP.fastq.gz", n = 10^6)
fastq <- yield(fqSample)

Working with raw ChIPseq data

The resulting object is a ShortReadQ object showing information on the number of cycles, base pairs in reads, and number of reads in memory.

fastq
## class: ShortReadQ
## length: 1000000 reads; width: 36 cycles

Raw ChIPseq data QC

If we wished, we can assess information from the FASTQ file using our familiar accessor functions.

  • sread() - Retrieve sequence of reads.
  • quality() - Retrieve quality of reads as ASCII scores.
  • id() - Retrieve IDs of reads.
readSequences <- sread(fastq)
readQuality <- quality(fastq)
readIDs <- id(fastq)
readSequences
## DNAStringSet object of length 1000000:
##           width seq
##       [1]    36 ATGAGAGTTCCTCTTTCTTACACATGTTTTTTTTTT
##       [2]    36 GGTCANTGTGTTCAGTGTATGCTGCACTTACATTCC
##       [3]    36 CTACCTGCTTCTTATCCAGCCCTCTCTTGTAATAGG
##       [4]    36 GAATTGTTGATAATAACCTTATGCTTCTGTTGCTTA
##       [5]    36 ATTCGTGGAGAGATAATGCGTGTATTTGGTTTTGTC
##       ...   ... ...
##  [999996]    36 GAAATTCCAAAAACTATTTTTAGAACTTTACATATG
##  [999997]    36 GTGGGGGCAGCAGACAAGTCCGGGGGAACAGTGAGC
##  [999998]    36 CAAACAAACAAAACAAAACAAAACAAAAGAGAAGCA
##  [999999]    36 TTGTATCCAGGAGAACCTTAGAATGTTCAGTGATGT
## [1000000]    36 AGGGACCGGCAAGTATTTCCCGCCTCATGTTTTGTC

Quality with raw ChIPseq data

We can check some simple quality metrics for our subsampled FASTQ data.

First, we can review the overall reads’ quality scores.

We use the alphabetScore() function with our read’s qualitys to retrieve the sum quality for every read from our subsample.

readQuality <- quality(fastq)
readQualities <- alphabetScore(readQuality)
readQualities[1:10]
##  [1] 1109 1002 1190 1196  868   72  805  816 1041 1082

Quality with raw ChIPseq data

We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.

library(ggplot2)
toPlot <- data.frame(ReadQ = readQualities)
ggplot(toPlot, aes(x = ReadQ)) + geom_histogram() + theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Base frequency with raw ChIPseq data

We can review the occurrence of DNA bases within reads and well as the occurrence of DNA bases across sequencing cycles using the alphabetFrequency() and alphabetByCycle() functions respectively.

Here we check the overall frequency of A, G, C, T and N (unknown bases) in our sequence reads.

readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3, ]
##      A  C G  T M R W S Y K V H D B N - + .
## [1,] 6  6 4 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 6  8 8 13 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [3,] 6 12 5 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Base frequency with raw ChIPseq data

Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.

summed__AlpFreq <- colSums(readSequences_AlpFreq)
summed__AlpFreq[c("A", "C", "G", "T", "N")]
##        A        C        G        T        N 
## 10028851  7841813  7650350 10104255   374731

Assess by cycle with raw ChIPseq data

We can review DNA base occurrence by cycle using the alphabetByCycle() function.

readSequences_AlpbyCycle <- alphabetByCycle(readSequences)
readSequences_AlpbyCycle[1:4, 1:10]
##         cycle
## alphabet   [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
##        A 307393 328036 282541 275320 276282 277553 278262 281389 271127 278521
##        C 179812 173882 205519 223468 210473 220596 213693 217991 220137 222141
##        G 242744 246446 227531 212323 212634 214609 210106 210020 213682 210753
##        T 265988 248240 279490 279258 290871 277974 288655 279444 284173 277758

Assess by cycle with raw ChIPseq data

We often plot this to visualise the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame.

AFreq <- readSequences_AlpbyCycle["A", ]
CFreq <- readSequences_AlpbyCycle["C", ]
GFreq <- readSequences_AlpbyCycle["G", ]
TFreq <- readSequences_AlpbyCycle["T", ]
toPlot <- data.frame(Count = c(AFreq, CFreq, GFreq, TFreq), Cycle = rep(1:36, 4),
    Base = rep(c("A", "C", "G", "T"), each = 36))

Assess by cycle with raw ChIPseq data

Now we can plot the frequencies using ggplot2

ggplot(toPlot, aes(y = Count, x = Cycle, colour = Base)) + geom_line() + theme_bw()

Assess by cycle with raw ChIPseq data

We can also assess mean read quality over cycles. This will allow us to identify whether there are any isses with quality dropping off over time.

For this we use the as(read_quality,“matrix”) function first to translate our ASCII quality scores to numeric quality scores.

qualAsMatrix <- as(readQuality, "matrix")
qualAsMatrix[1:2, ]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]   33   33   32   33   33   26   26   16   27    29    32    33    30    22
## [2,]   33   33   18   25   22    4   20   33   31    33    30    33    32    28
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]    30    33    33    31    25    32    33    34    33    33    33    33
## [2,]    32    31    33    24    33    31    33    27    30    31    27    19
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
## [1,]    25    33    33    34    34    32    33    33    32    32
## [2,]    25    33    30    30    28    24    30    29    25    22

Assess by cycle with raw ChIPseq data

We can now visualise qualities across cycles using a boxplot.

boxplot(qualAsMatrix[1:1000, ])

Assess by cycle with raw ChIPseq data

In this case the distribution of reads quality scores and read qualities over time look okay. We will often want to access FASTQ samples together to see if any samples stick out by these metrics.

Here we observed a second population of low quality scores so will remove some reads with low quality scores and high unknown bases.

Filtering data


Filtering data

We will want to conserve our memory usage to allow us to deal with loading large files.

Here we set up a FastqStreamer object to read in 100000 reads at a time.

fqStreamer <- FastqStreamer("~/Downloads/ENCFF001NQP.fastq.gz", n = 1e+05)

Filtering data

Now we loop through file, filter reads and write out a FASTQ of our filtered reads.

We are filtering low quality reads and reads with many nonspecific (N) base calls.

TotalReads <- 0
TotalReadsFilt <- 0
while (length(fq <- yield(fqStreamer)) > 0) {
    TotalReads <- TotalReads + length(fq)
    filt1 <- fq[alphabetScore(fq) > 300]
    filt2 <- filt1[alphabetFrequency(sread(filt1))[, "N"] < 10]
    TotalReadsFilt <- TotalReadsFilt + length(filt2)
    writeFastq(filt2, "filtered_ENCFF001NQP.fastq.gz", mode = "a")
}
TotalReads
## [1] 25555179
TotalReadsFilt
## [1] 22864597

Aligning data


Aligning ChIPseq reads

Following assessment of read quality and any read filtering we applied, we will want to align our reads to the genome so as to identify any genomic locations showing enrichment for aligned reads above background.

Since ChIPseq reads will align continously against our reference genome we can use our genomic aligners we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.

igv

Creating a reference genome

First we need to retrieve the sequence information for the genome of interest in FASTA format

We can use the BSgenome libraries to retrieve the full sequence information.

For the mouse mm10 genome we load the package BSgenome.Mmusculus.UCSC.mm10.

library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Dec. 2011
## # 66 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrUn_GL456372       chrUn_GL456378       chrUn_GL456379      
## #   chrUn_GL456381       chrUn_GL456382       chrUn_GL456383      
## #   chrUn_GL456385       chrUn_GL456387       chrUn_GL456389      
## #   chrUn_GL456390       chrUn_GL456392       chrUn_GL456393      
## #   chrUn_GL456394       chrUn_GL456396       chrUn_JH584304      
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

Creating a reference genome

We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a DNAStringSet object from the retrieved sequences.

mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M"))
mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
mainChrSeqSet
## DNAStringSet object of length 22:
##          width seq                                          names               
##  [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1
##  [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2
##  [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3
##  [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4
##  [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5
##  ...       ... ...
## [18]  90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18
## [19]  61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19
## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX
## [21]  91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY
## [22]     16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM

Creating a reference genome.

Now we have a DNAStringSet object we can use the writeXStringSet to create our FASTA file of sequences to align to.

writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa")

Creating an Rsubread index

We will be aligning using the subjunc algorithm from the folks behind subread. We can therefore use the Rsubread package. Before we attempt to align our FASTQ files, we will need to first build an index from our reference genome using the buildindex() function.

The buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from.

REMEMBER: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or Desktop.

Luckily we did this for RNAseq, so hopefully you will already have a built index.

library(Rsubread)
buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000,
    indexSplit = TRUE)

Rsubread ChIPseq alignment

We can align our raw sequence data in FASTQ format to the new FASTA file of our mm10 genome sequence using the Rsubread package. Specifically we will be using the align function as this utilizes the subread genomic alignment algorithm.

The align() function accepts arguments for the index to align to, the FASTQ to align, the name of output BAM, the mode of alignment (rna or dna) and the phredOffset.

Note that here we set the phredOffset to be 64. Rsubread may get upset if we set this wrong.

myMapped <- align("mm10_mainchrs", "filtered_ENCFF001NQP.fastq.gz", output_format = "BAM",
    output_file = "Myc_Mel_1.bam", type = "dna", phredOffset = 64, nthreads = 4)

Rbowtie2 ChIPseq alignment

One of the most well known group of alignment alogorthims are the Bowtie family. We can access Bowtie2 with the Rbowtie2 package.

The QuasR package allows access to the original Bowtie aligner, but it is a little slow and memory hungry. Check out our session dedicated to alignment for using QuasR.

library(Rbowtie2)

Rbowtie2 ChIPseq alignment

As with Rsubread, the Rbowtie2 package requires us to first to create an index to align to.

We can do this using the bowtie2_build() function, specifying our FASTA file and desired name of index.

bowtie2_build(references = "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", bt2Index = file.path("BSgenome.Mmusculus.UCSC.mm10.mainChrs"))

Rbowtie2 ChIPseq alignment

We can then align our FASTQ data using the bowtie2() function specifying our newly created index, the desired name of SAM output and an uncompressed FASTQ.

We will need to uncompress our FASTQ then first. Here we use the remove is FALSE setting to maintain the original compressed FASTQ.

library(R.utils)
gunzip("filtered_ENCFF001NQP.fastq.gz", remove = FALSE)

bowtie2(bt2Index = "BSgenome.Mmusculus.UCSC.mm10.mainChrs", samOutput = "ENCFF001NQP.sam",
    seq1 = "filtered_ENCFF001NQP.fastq")

Rbowtie2 ChIPseq alignment

Since Rbowtie2 also outputs a SAM file, we would need to need to convert to a BAM file. We can do this with the RSamtools asBam() function.

bowtieBam <- asBam("ENCFF001NQP.sam")

Rbowtie2 ChIPseq alignment

An important consideration when using Rbowtie2 is its input and output of uncompressed files.

On the command line we may stream inputs to Rbowtie2, but in R this isnt an option (yet!)

We would need to make sure we delete any temporary files created (SAM and/or uncompressed FASTQ) to avoid filling up our hard drives. We can delete files in R using the unlink() function.

unlink("ENCFF001NQP.sam")

Sort and index reads

As before, we sort and index our files using the Rsamtools packages sortBam() and indexBam() functions respectively.

The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R.

library(Rsamtools)
sortBam("Myc_Mel_1.bam", "SR_Myc_Mel_rep1")
indexBam("SR_Myc_Mel_rep1.bam")

Mapped data


Mapped reads

Now we have the index for the BAM file, we can retrieve and plot the number of mapped reads using the idxstatsBam() function.

mappedReads <- idxstatsBam("SR_Myc_Mel_rep1.bam")
TotalMapped <- sum(mappedReads[, "mapped"])
ggplot(mappedReads, aes(x = seqnames, y = mapped)) + geom_bar(stat = "identity") +
    coord_flip()

Create a bigWig

We can also create a bigWig from our sorted, indexed BAM file to allow us to quickly review our data in IGV.

First we use the coverage() function to create an RLElist object containing our coverage scores.

forBigWig <- coverage("SR_Myc_Mel_rep1.bam")
forBigWig

Create a bigWig

We can now export our RLElist object as a bigWig using the rtracklayer package’s export.bw() function.

library(rtracklayer)
export.bw(forBigWig, con = "SR_Myc_Mel_rep1.bw")

Create a bigWig

We may wish to normalize our coverage to allow us to compare enrichment across samples.

We can use the weight parameter in the coverage() to scale our reads to the number of mapped reads multiplied by a million (reads per million).

forBigWig <- coverage("SR_Myc_Mel_rep1.bam", weight = (10^6)/TotalMapped)
forBigWig
export.bw(forBigWig, con = "SR_Myc_Mel_rep1_weighted.bw")

BAM and bigWig

Time for an exercise!

Exercise on ChIPseq data can be found here

Answers to exercise

Answers can be found here

R code for solutions can be found here