ChIPseq In Bioconductor (part1)

class: center, middle, inverse, title-slide

.title[
# ChIPseq In Bioconductor (part1)<br />
<html><br />
<br />
<hr color='#EB811B' size=1px width=796px><br />
</html>
]
.author[
### Rockefeller University, Bioinformatics Resource Centre
]
.date[
### <a href="http://rockefelleruniversity.github.io/RU_ChIPseq/" class="uri">http://rockefelleruniversity.github.io/RU_ChIPseq/</a>
]

---

## ChIPseq Introduction

Chromatin immunoprecipitation followed by deep sequencing (**ChIPseq**) is a well-established technique that allows genome-wide identification of transcription factor binding sites and epigenetic marks.

---
## ChIPseq Introduction

.pull-left[
<div align="center">
<img src="imgs/chipOverview2.png" alt="igv" height="500" width="300">
</div>
]

.pull-right[

* Proteins are cross-linked to the DNA they bind.
* An antibody enriches for a specific protein or DNA state.
* Fragments are end-repaired and A-tailed, and Illumina adapters are added.
* Fragments are sequenced from one or both ends.
]

---
## The data

Our raw ChIPseq sequencing data will be in FASTQ format.
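
As a reminder of what that format looks like, each read in a FASTQ file occupies four lines: an identifier beginning with `@`, the sequence, a separator beginning with `+`, and one quality character per base. The sketch below parses a single record in base R. The read ID and quality string are invented for illustration and are not taken from the ENCODE file.

```r
# One FASTQ record is four lines. The ID and quality string here are
# made up for illustration; only the layout matters.
fastqRecord <- c(
  "@exampleRead1",
  "ATGAGAGTTCCTCTTTCTTACACATGTTTTTTTTTT",
  "+",
  strrep("I", 36)
)

record <- list(
  id       = sub("^@", "", fastqRecord[1]),  # drop the leading "@"
  sequence = fastqRecord[2],
  quality  = fastqRecord[4]
)

# Sequence and quality strings must have the same length.
nchar(record$sequence) == nchar(record$quality)
```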

---
## The data

In this ChIPseq workshop we will be investigating the genome-wide binding patterns of the transcription factor Myc in mouse MEL and Ch12 cell lines.

We can retrieve the raw sequencing data from the ENCODE website.

Here we download the sequencing data for the Myc ChIPseq from the mouse MEL cell line, [sample **ENCSR000EUA** (replicate 1)](https://www.encodeproject.org/experiments/ENCSR000EUA/), using the ENCODE portal.

The direct link to the raw sequencing reads in FASTQ format can be found [here](https://www.encodeproject.org/files/ENCFF001NQP/@@download/ENCFF001NQP.fastq.gz).

Download the FASTQ for the other Myc MEL replicate from [sample ENCSR000EUA](https://www.encodeproject.org/experiments/ENCSR000EUA/). The direct link is [here](https://www.encodeproject.org/files/ENCFF001NQQ/@@download/ENCFF001NQQ.fastq.gz).
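
If you would rather fetch the files from within R, a minimal sketch using base R's `download.file()` follows. The helper names `encodeFastqUrl()` and `downloadEncodeFastq()` and the `~/Downloads` destination are assumptions for illustration. The URL pattern simply mirrors the two direct links above.

```r
# Build the ENCODE download URL for a file accession, following the
# pattern of the direct links above. Hypothetical helper.
encodeFastqUrl <- function(accession) {
  sprintf("https://www.encodeproject.org/files/%s/@@download/%s.fastq.gz",
          accession, accession)
}

# Download one FASTQ unless it is already present. Hypothetical helper;
# destDir is an assumed location.
downloadEncodeFastq <- function(accession, destDir = "~/Downloads") {
  destFile <- file.path(path.expand(destDir), paste0(accession, ".fastq.gz"))
  if (!file.exists(destFile)) {
    # mode = "wb" keeps the gzipped file intact on Windows.
    download.file(encodeFastqUrl(accession), destFile, mode = "wb")
  }
  destFile
}

mycMelAccessions <- c("ENCFF001NQP", "ENCFF001NQQ")
mycMelUrls <- encodeFastqUrl(mycMelAccessions)

# Not run here, as the files are large:
# fastqFiles <- vapply(mycMelAccessions, downloadEncodeFastq, character(1))
```

For large files you may need to raise R's `timeout` option (for example `options(timeout = 600)`) before downloading.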

---
class: inverse, center, middle

# Working with raw ChIPseq data

---

## Working with raw ChIPseq data

Once we have downloaded the raw FASTQ data we can use the [ShortRead package](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) to review our sequence data quality.

We have reviewed how to work with raw sequencing data in the [**FASTQ in Bioconductor** session.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#1)

First we load the [ShortRead library.](https://bioconductor.org/packages/release/bioc/html/ShortRead.html)

``` r
library(ShortRead)
```

---
## Working with raw ChIPseq data

First we will review the raw sequencing reads using functions in the [ShortRead package.](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) This is similar to the QC we performed for RNAseq.

We do not need to review all reads in the file to gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.

Note that when we subsample we retrieve random reads from across the entire FASTQ file. This is important because reads in a FASTQ file are often ordered by their position on the sequencer, so the first reads in the file may not be representative.
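
To see why this matters, the toy simulation below (base R, synthetic numbers, not the Myc data) builds a "file" in which read quality drifts downward with file position, then compares the first 1,000 reads with a random sample of 1,000. The size of the drift is an arbitrary assumption chosen to make the effect visible.

```r
set.seed(42)  # make the toy example reproducible

nReads <- 100000
# Synthetic mean quality per read that declines along the file.
# The 35 -> 25 drift and the noise level are arbitrary assumptions.
readQ <- seq(35, 25, length.out = nReads) + rnorm(nReads, sd = 2)

firstReads  <- readQ[seq_len(1000)]         # reads from the top of the file
randomReads <- readQ[sample(nReads, 1000)]  # reads from across the file

# Compare each estimate with the mean over all reads.
c(all = mean(readQ), first = mean(firstReads), random = mean(randomReads))
```

In this toy setting the random sample tracks the overall mean, while the first reads overestimate it.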

---
## Reading raw ChIPseq data

We can subsample from a FASTQ file using functions in the **ShortRead** package.

Here we use the [**FastqSampler** and **yield** functions](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#41) to randomly sample a defined number of reads from a FASTQ file. Here we subsample 1 million reads. This should be enough to give us an understanding of the quality of the data.

``` r
fqSample <- FastqSampler("~/Downloads/ENCFF001NQP.fastq.gz", n = 10^6)
fastq <- yield(fqSample)
```

---
## Working with raw ChIPseq data

The resulting object is a [ShortReadQ object](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#10) showing information on the number of cycles, base pairs in reads, and number of reads in memory.

``` r
fastq
```

```
## class: ShortReadQ
## length: 1000000 reads; width: 36 cycles
```

---

## Raw ChIPseq data QC

If we wish, we can access information from the FASTQ file using our [familiar accessor functions.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#15)

* **sread()** - Retrieve sequence of reads.
* **quality()** - Retrieve quality of reads as ASCII scores.
* **id()** - Retrieve IDs of reads.

``` r
readSequences <- sread(fastq)
readQuality <- quality(fastq)
readIDs <- id(fastq)
readSequences
```

```
## DNAStringSet object of length 1000000:
##           width seq
##       [1]    36 ATGAGAGTTCCTCTTTCTTACACATGTTTTTTTTTT
##       [2]    36 GGTCANTGTGTTCAGTGTATGCTGCACTTACATTCC
##       [3]    36 CTACCTGCTTCTTATCCAGCCCTCTCTTGTAATAGG
##       [4]    36 GAATTGTTGATAATAACCTTATGCTTCTGTTGCTTA
##       [5]    36 ATTCGTGGAGAGATAATGCGTGTATTTGGTTTTGTC
##       ...   ... ...
##  [999996]    36 GAAATTCCAAAAACTATTTTTAGAACTTTACATATG
##  [999997]    36 GTGGGGGCAGCAGACAAGTCCGGGGGAACAGTGAGC
##  [999998]    36 CAAACAAACAAAACAAAACAAAACAAAAGAGAAGCA
##  [999999]    36 TTGTATCCAGGAGAACCTTAGAATGTTCAGTGATGT
## [1000000]    36 AGGGACCGGCAAGTATTTCCCGCCTCATGTTTTGTC
```

---
## Quality with raw ChIPseq data

We can check some simple quality metrics for our subsampled FASTQ data.

First, we can review the overall quality scores of the reads.

We use the [**alphabetScore()** function with our reads' qualities](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#28) to retrieve the summed quality for every read in our subsample.

``` r
readQuality <- quality(fastq)
readQualities <- alphabetScore(readQuality)
readQualities[1:10]
```

```
##  [1] 1109 1002 1190 1196  868   72  805  816 1041 1082
```
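
To make these numbers concrete, here is a sketch of the same kind of calculation in base R for a single quality string. It assumes the common Phred+33 encoding, where each character's ASCII code minus 33 is the base quality. The quality string is invented for illustration and `phredScores()` is a hypothetical helper, not a ShortRead function.

```r
# Convert an ASCII quality string to per-base Phred scores.
# Assumes Phred+33 encoding; hypothetical helper.
phredScores <- function(qualityString, offset = 33L) {
  utf8ToInt(qualityString) - offset
}

exampleQuality <- "IIIIIIII55####"  # made-up quality string
perBase <- phredScores(exampleQuality)
perBase

# Summed quality for the read: the kind of per-read total
# that alphabetScore() reports.
sum(perBase)
```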

---
## Quality with raw ChIPseq data

We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.

``` r
library(ggplot2)
toPlot <- data.frame(ReadQ = readQualities)
ggplot(toPlot, aes(x = ReadQ)) + geom_histogram() + theme_minimal()
```

```
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
```

![](ChIPseq_In_Bioconductor_files/figure-html/mycRep1ReadsQScoresPlot-1.png)
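
Because every read in this sample is 36 cycles long, a summed score can be put on the familiar per-base Phred scale by dividing by the read length. A small sketch using the ten summed scores printed earlier follows. For reads of varying length you would divide by each read's own width instead.

```r
# The first ten summed qualities shown earlier in this session.
summedQ <- c(1109, 1002, 1190, 1196, 868, 72, 805, 816, 1041, 1082)
readLength <- 36  # all reads in this sample are 36 cycles

# Mean quality per base for each read.
meanPerBase <- summedQ / readLength
round(meanPerBase, 1)
```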

---

## Base frequency with raw ChIPseq data

We can review the occurrence of DNA bases within reads as well as the occurrence of DNA bases across sequencing cycles using the [**alphabetFrequency()**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#18) and [**alphabetByCycle()**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#30) functions respectively.

Here we check the overall frequency of **A, G, C, T and N (unknown bases)** in our sequence reads.

``` r
readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3, ]
```

```
##      A  C G  T M R W S Y K V H D B N - + .
## [1,] 6  6 4 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 6  8 8 13 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [3,] 6 12 5 13 0 0 0 0 0 0 0 0 0 0 0 0 0 0
```
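
As a cross-check on what these counts mean, the first two reads shown earlier can be tallied by hand in base R. This is only an illustration of the idea, not how `alphabetFrequency()` is implemented, and `countBases()` is a hypothetical helper.

```r
# First two reads from the subsample, copied from the earlier output.
reads <- c("ATGAGAGTTCCTCTTTCTTACACATGTTTTTTTTTT",
           "GGTCANTGTGTTCAGTGTATGCTGCACTTACATTCC")

# Count A, C, G, T and N in one read. Hypothetical helper.
countBases <- function(read, alphabet = c("A", "C", "G", "T", "N")) {
  bases <- factor(strsplit(read, "")[[1]], levels = alphabet)
  as.integer(table(bases))
}

# One row per read, one column per base.
baseCounts <- t(vapply(reads, countBases, integer(5), USE.NAMES = FALSE))
colnames(baseCounts) <- c("A", "C", "G", "T", "N")
baseCounts
```

The two rows agree with the A, C, G, T and N columns of the first two rows in the `alphabetFrequency()` output.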

---
## Base frequency with raw ChIPseq data

Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.

``` r
summed__AlpFreq <- colSums(readSequences_AlpFreq)
summed__AlpFreq[c("A", "C", "G", "T", "N")]
```

```
##        A        C        G        T        N 
## 10028851  7841813  7650350 10104255   374731
```

---