ChIPseq (part 2)

Data

In todays session we will continue to review the Myc ChIPseq we were working on in our last session.

This include Myc ChIPseq for MEL and Ch12 celllines as well as their input controls.

Information and files for the Myc ChIPseq in MEL cell line can be found here

Information and files for the Myc ChIPseq in Ch12 cell line can be found here

Input control can be found for MEL cell line can be found here

Input control can be found for Ch12 cell line can be found here.

Quality Control

ChIPseq has many sources of potential noise including * Varying efficiency of antibodies * Non-specific binding * Library complexity * ChIP artifacts and background

Many of these sources of noise can be assessed using some well established methodology.

Quality Control References

For some discussions:

Encode quality metrics.

Large-scale quality analysis of published ChIPseq data. Marinov GK, Kundaje A, Park PJ, Wold BJ. G3 (Bethesda). 2014 Feb 19;4(2)

Overestimation of artefact duplicates in ChIPseq.

Systematic evaluation of factors influencing ChIPseq fidelity.Nat Methods. Chen Y, Negre N, Li Q, Mieczkowska JO, Slattery M, Liu T, Zhang Y, Kim TK, He HH, Zieba J, Ruan Y, Bickel PJ, Myers RM, Wold BJ, White KP, Lieb JD, Liu XS. 2012 Jun;9(6)

When and what QC is useful.

Impact of artifact removal on ChIP quality metrics in ChIPseq and ChIP-exo data.Front Genet. 2014 Apr 10;5:75.Carroll TS, Liang Z, Salama R, Stark R, de Santiago I

Always have an appropriate input

Input samples are typically made from fragmented DNA prior to IP enrichment.
Allows for control of artefact regions which occur across samples.
NEVER run ChIPseq without considering which input to use.

For example: When using tumour samples for ChIPseq, it is important to have matched input samples. Differing conditions of same tissue may share common input.

Quality metrics for ChIPseq

The ChIPQC package wraps some of the metrics into a Bioconductor package and takes care to measure these metrics under the appropriate condition.

To run a single sample we can use the ChIPQCsample() function, the relevant unfiltered BAM file and we are recommended to supply a blacklist as a BED file or GRanges and Genome name.

You can find a Blacklist for most genomes at Anshul Kundaje’s site or directly from the Encode websites

QCresult <- ChIPQCsample(reads = "/pathTo/myChIPreads.bam", genome = "mm10", blacklist = "/pathTo/mm10_Blacklist.bed")

Quality control with ChIPQC

We download the blacklist for mm10 from Encode

We can then provide an initial analysis of our ChIPseq samples quality using the ChIPQCsample() function from the ChIPQC package.

Here we evaluate the quality of samples we aligned in the prior session with Rsubread. The returned object is a ChIPQCsample object.

library(ChIPQC)
toBlkList <- "~/Downloads/ENCFF547MET.bed.gz"
chipqc_MycMel_rep1 <- ChIPQCsample("SR_Myc_Mel_rep1.bam", annotation = "mm10", blacklist = toBlkList,
    chromosomes = paste0("chr", 1:10))
class(chipqc_MycMel_rep1)

## [1] "ChIPQCsample"
## attr(,"package")
## [1] "ChIPQC"

Quality control with ChIPQC

We can display our ChIPQCsample object which will show a basic summary of our ChIPseq quality.

chipqc_MycMel_rep1

##                  ChIPQCsample

## Number of Mapped reads: 10355431

## Number of Mapped reads passing MapQ filter: 8445736

## Percentage Of Reads as Non-Duplicates (NRF): 100(0)

## Percentage Of Reads in Blacklisted Regions: 8

## SSD: 4.15396194667776

## Fragment Length Cross-Coverage:

## Relative Cross-Coverage:

## Percentage Of Reads in GenomicFeature:

##                         ProportionOfCounts
## BlackList                       0.09376116
## LongPromoter20000to2000         0.39108812
## Promoters2000to500              0.06601580
## Promoters500                    0.05096595
## All5utrs                        0.02653351
## Alltranscripts                  0.73319436
## Allcds                          0.09148486
## Allintrons                      0.61883381
## All3utrs                        0.02559232

## Percentage Of Reads in Peaks: NA

## Number of Peaks: 0

## GRanges object with 0 ranges and 0 metadata columns:
##    seqnames    ranges strand
##       <Rle> <IRanges>  <Rle>
##   -------
##   seqinfo: no sequences

QC of multiple samples

It is best to review ChIPseq quality versus your input controls and other Myc samples which we are using (or even external data if you do not have your own).

This will allow us to identify expected patterns of ChIPseq enrichment in our samples versus controls as well as spot any outlier samples by these metrics.

We can run ChIPQCsample() on all our samples of interest using an lapply.

bamsToQC <- c("Sorted_Myc_Ch12_1.bam", "Sorted_Myc_Ch12_2.bam", "Sorted_Myc_MEL_1.bam", "Sorted_Myc_MEL_2.bam",
    "Sorted_Input_MEL.bam", "Sorted_Input_Ch12.bam")
myQC <- bplapply(bamsToQC, ChIPQCsample, annotation = "mm10", blacklist = toBlkList, chromosomes = paste0("chr",
    1:10))
names(myQC) <- bamsToQC

QC of multiple samples

All ChIPQC functions can work with a named list of ChIPQCsample objects to aggregate scores into table as well as plots.

Here we use the QCmetrics() function to give an overview of quality metrics.

QCmetrics(myQC)

##                          Reads Map% Filt% Dup% ReadL FragL RelCC  SSD RiP% RiBL%
## Sorted_Myc_Ch12_1.bam 10993821  100  21.9    0    35   186 1.070 3.82   NA  12.8
## Sorted_Myc_Ch12_2.bam 10060460  100  18.4    0    35   146 1.310 2.84   NA  10.1
## Sorted_Myc_MEL_1.bam  10111080  100  20.8    0    35   177 1.220 3.42   NA  11.9
## Sorted_Myc_MEL_2.bam  10686984  100  23.4    0    35   177 1.050 4.20   NA  14.0
## Sorted_Input_MEL.bam   8591703  100  23.6    0    36   184 0.620 4.51   NA  15.9
## Sorted_Input_Ch12.bam 10429244  100  23.7    0    35   182 0.275 4.26   NA  12.8

Assessing fragment length

The prediction of fragment length is an essential part of ChIPseq affecting peaks calling, summit identification and coverage profiles.

The use of cross-correlation or cross-coverage allows for an assessment of reads clustering by strand and so a measure of quality.

Assessing fragment length

offset

Assessing fragment length

In ChIPseq typically short single end reads of dsDNA.
dsDNA single end sequencing means
5’ of fragment will be sequenced on “+” strand
3’ of fragment end will be on “-” strand.
Although we only have partial sequence of strand, with predicted fragment length we can predict the whole fragment
“+” reads should extend only in positive direction
“-” reads only in negative

Assessing fragment length

offset

Assessing fragment length

offset

Assessing fragment length

offset

$offset$

Cross-coverage plot

offset

Cross-coverage plot

The plotCC function can be used to plot our cross-coverage profiles

The plotCC() function accepts our list of ChIPQC sample objects and a facetBy argument to allow us to group our cross-coverage profiles.

plotCC(myQC, facetBy = "Sample")

Cross-coverage plot

We can include metadata as a data.frame where the first column is our sample names to allow us to group our plot in different ways.

myMeta <- data.frame(Sample = names(myQC), Tissue = c("Ch12", "Ch12", "MEL", "MEL", "MEL", "Ch12"),
    Antibody = c(rep("Myc", 4), rep("Input", 2)))
myMeta

##                  Sample Tissue Antibody
## 1 Sorted_Myc_Ch12_1.bam   Ch12      Myc
## 2 Sorted_Myc_Ch12_2.bam   Ch12      Myc
## 3  Sorted_Myc_MEL_1.bam    MEL      Myc
## 4  Sorted_Myc_MEL_2.bam    MEL      Myc
## 5  Sorted_Input_MEL.bam    MEL    Input
## 6 Sorted_Input_Ch12.bam   Ch12    Input

Cross-coverage plot

We can now include our metadata to the addMetaData argument which will allow us to facetBy the supplied metadata columns.

Additionally here we use the colourBy parameter to add colour to antibody groups.

plotCC(myQC, facetBy = "Tissue", addMetaData = myMeta, colourBy = "Antibody")

Cross-coverage plot

All plots in ChIPQC are in fact built in ggplot2 so we can edit and update our plot like all ggplot objects.

plotCC(myQC, facetBy = "Tissue", addMetaData = myMeta, colourBy = "Antibody") + theme_bw() +
    ggtitle("ChIPQC results")

Blacklists and SSD

Blacklists

ChIPseq will often show the presence of common artifacts, such as ultra-high signal regions. Such regions can confound peak calling, fragment length estimation and QC metrics. Anshul Kundaje created the DAC blacklist as a reference to help deal with these regions.

offset

Blacklist affects many metrics

offset

Blacklists and SSD

SSD is one of these measures that is sensitive to blacklisted artifacts. SSD is a measure of standard deviation of signal across the genome with higher scores reflecting significant pile-up of reads. SSD can therefore be used to assess both the extent of ultra high signals and the signal. But first blacklisted regions must be removed.

offset

Standardized Standard Deviation

ChIPQC calculates SSD before and after removing signal coming from Blacklisted regions.

The plotSSD() function plots samples’s pre-blacklisting score in red and post-blacklisting score in blue.

Higher scores for pre-blacklisted SSD can suggest a strong background signal in blacklisted regions for that sample.

plotSSD(myQC) + xlim(0, 5)

Standardized Standard Deviation

Since SSD score is strongly affected by blacklisting it may be necessary to change the axis to see any differences between samples for post-blacklisting scores.

Higher post-blacklisted SSD scores reflect samples with stronger peak signal.

plotSSD(myQC) + xlim(0.2, 0.8)

Library complexity and enrichment

Library complexity

A potential source of noise in ChIPseq is overamplification of the ChIPseq library during a PCR step. This can lead to large number of duplicate reads which may confound peak calling.

Duplication

We should compare our duplication rate across samples to identify any sample experiencing overamplification and so potential of a lower complexity.

The flagtagcounts() function reports can report the number of duplicates and total mapped reads and so from there we can calculate our duplication rate.

myFlags <- flagtagcounts(myQC)
myFlags["DuplicateByChIPQC", ]/myFlags["Mapped", ]

## Sorted_Myc_Ch12_1.bam Sorted_Myc_Ch12_2.bam  Sorted_Myc_MEL_1.bam  Sorted_Myc_MEL_2.bam 
##            0.07203883            0.08633293            0.15987076            0.06253850 
##  Sorted_Input_MEL.bam Sorted_Input_Ch12.bam 
##            0.14269010            0.13873057

Enrichment for reads across genes

We can also use ChIPQC to review where our distribution of reads across gene features using the plotRegi() function.

Here we expect ChIPseq signal to be stronger in 5’UTRs and promoters when compared to input samples.

p <- plotRegi(myQC)

Enrichment for reads across genes.

Peak Calling

Peak calling

To identify regions of Myc transcription factor binding we can use a peak caller.

Although many peak callers are available in R and beyond, the most popular and widely used peak caller remains MACS2.

MACS2 calls peaks in a few simple steps.

Predict fragment length.
Shift reads to center of predicted fragment.
Scan across genome and identify regions enriched compare to control samples using count based statistic.

Installing programs that aren’t in R

There is no R package for MACS2 (well it just got released but we haven’t tested it yet). It is possible to install it on Mac and Linux using the Anaconda package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them.

Though there is no MACS2 R package, we can interact with the anaconda package system using the R package Herper. This was created by us here at the BRC and is part of Bioconductor.

BiocManager::install("Herper")
library(Herper)

Install MACS2 with Herper

First, we will use Herper to install MACS2 with the install_CondaTools function. We just need to tell install_CondaTools what tool/s you want and the name of the environment you want to build.

salmon_paths <- install_CondaTools(tools = "macs2", env = "ChIPseq_analysis")
salmon_paths

Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which MACS2 will be installed. When you run the function it prints out where MACS2 is installed. There are additional arguments to control where miniconda is installed using pathToMiniConda and also update an existing environment with updateEnv.

Running MACS2

To run MACS2 to we simply need to supply.

A BAM file to find enriched regions in. (specified after -t)
A Name for peak calls (specified after –name).
An output folder to write peaks into (specified after –outdir).
Optionally, but highly recommended, we can identify a control to compare to (specified after –c).

macs2 callpeak -t Sorted_Myc_MEL_1.bam
               –name Mel_Rep1 
               –-outdir PeakDirectory
               -c Sorted_Input_MEL.bam

Running MACS2 in R

Herper allows us to run conda packages from within R. MACS2 has been installed into ChIPseq_analysis. So we can use this environment from R using with_CondaEnv().

myChIP <- "Sorted_Myc_MEL_1.bam"
myControl <- "Sorted_Input_MEL.bam"

with_CondaEnv("ChIPseq_analysis",
                      system2(command="macs2",args =c("callpeak", 
                      "-t", myChIP,
                      "-n", "Mel_Rep1",
                      "–-outdir","PeakDirectory",
                      "-c", myControl)),
                        
                        stdout = TRUE))

Working with Peaks

MACS peak calls can be found in our specied output directory with the suffix and extension “_peaks.xls”.

MACS peaks come as a tab seperated file thinly disguised as a “.xls”.

In addition to the genomic coordinates of peaks, these files contain useful information on the samples, parameters and version used for peak calling at the top.

macsPeaks <- "data/Mel1_peaks.xls"

macsPeaks_DF <- read.delim(macsPeaks)
macsPeaks_DF[1:8, ]
## [1] "# Command line: callpeak -t Sorted_Myc_MEL_1.bam -n Mel1 -c Sorted_Input_MEL.bam"
## [2] "# ARGUMENTS LIST:"                                                               
## [3] "# name = Mel1"                                                                   
## [4] "# format = AUTO"                                                                 
## [5] "# ChIP-seq file = ['Sorted_Myc_MEL_1.bam']"                                      
## [6] "# control file = ['Sorted_Input_MEL.bam']"                                       
## [7] "# effective genome size = 2.70e+09"                                              
## [8] "# band width = 300"

Importing MACS peaks

We can import peak files therefore using read.delim function. Note we have set comment.char argument to # to exclude additional information on peak calling parameters stored within the MACS peak file.

macsPeaks <- "data/Mel1_peaks.xls"

macsPeaks_DF <- read.delim(macsPeaks, comment.char = "#")
macsPeaks_DF[1:2, ]
##    chr   start     end length abs_summit pileup X.log10.pvalue. fold_enrichment
## 1 chr1 4785371 4785642    272    4785563  20.89        10.66553         5.33590
## 2 chr1 5082993 5083247    255    5083123  33.42        12.68072         4.30257
##   X.log10.qvalue.        name
## 1         7.37727 Mel1_peak_1
## 2         9.27344 Mel1_peak_2

Converting MACS peaks

Now we have the information in a table we can create a GRanges object.

GRanges objects are made of chromosome names and intervals stored as IRanges.

library(GenomicRanges)
macsPeaks_GR <- GRanges(seqnames = macsPeaks_DF[, "chr"], IRanges(macsPeaks_DF[, "start"], macsPeaks_DF[,
    "end"]))
macsPeaks_GR
## GRanges object with 16757 ranges and 0 metadata columns:
##           seqnames            ranges strand
##              <Rle>         <IRanges>  <Rle>
##       [1]     chr1   4785371-4785642      *
##       [2]     chr1   5082993-5083247      *
##       [3]     chr1   7397544-7398115      *
##       [4]     chr1   7616290-7616433      *
##       [5]     chr1   8134747-8134893      *
##       ...      ...               ...    ...
##   [16753]     chrY   2657144-2657294      *
##   [16754]     chrY 90784142-90784289      *
##   [16755]     chrY 90818471-90818771      *
##   [16756]     chrY 90824549-90824905      *
##   [16757]     chrY 90825407-90825575      *
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

Peaks as GRanges

As we have seen before elements in GRanges can accessed and set using various GRanges functions. Here we can deconstruct our object back to contig names and interval ranges.

seqnames(macsPeaks_GR)

## factor-Rle of length 16757 with 21 runs
##   Lengths:   916   822  1795   582   596   470 ...   787  1089   836   954   395     7
##   Values : chr1  chr10 chr11 chr12 chr13 chr14 ... chr6  chr7  chr8  chr9  chrX  chrY 
## Levels(21): chr1 chr10 chr11 chr12 chr13 chr14 chr15 ... chr5 chr6 chr7 chr8 chr9 chrX chrY

ranges(macsPeaks_GR)

## IRanges object with 16757 ranges and 0 metadata columns:
##               start       end     width
##           <integer> <integer> <integer>
##       [1]   4785371   4785642       272
##       [2]   5082993   5083247       255
##       [3]   7397544   7398115       572
##       [4]   7616290   7616433       144
##       [5]   8134747   8134893       147
##       ...       ...       ...       ...
##   [16753]   2657144   2657294       151
##   [16754]  90784142  90784289       148
##   [16755]  90818471  90818771       301
##   [16756]  90824549  90824905       357
##   [16757]  90825407  90825575       169

Peaks as GRanges

GRanges objects may have metadata attached. Here we attach some useful information on our peaks including the summit position and the fold enrichment over input.

mcols(macsPeaks_GR) <- macsPeaks_DF[, c("abs_summit", "fold_enrichment")]
macsPeaks_GR

## GRanges object with 16757 ranges and 2 metadata columns:
##           seqnames            ranges strand | abs_summit fold_enrichment
##              <Rle>         <IRanges>  <Rle> |  <integer>       <numeric>
##       [1]     chr1   4785371-4785642      * |    4785563         5.33590
##       [2]     chr1   5082993-5083247      * |    5083123         4.30257
##       [3]     chr1   7397544-7398115      * |    7397837         9.58306
##       [4]     chr1   7616290-7616433      * |    7616380         7.65860
##       [5]     chr1   8134747-8134893      * |    8134873         2.94486
##       ...      ...               ...    ... .        ...             ...
##   [16753]     chrY   2657144-2657294      * |    2657196         4.25981
##   [16754]     chrY 90784142-90784289      * |   90784227         6.91704
##   [16755]     chrY 90818471-90818771      * |   90818595         5.63525
##   [16756]     chrY 90824549-90824905      * |   90824771         5.21309
##   [16757]     chrY 90825407-90825575      * |   90825485         3.09097
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

Importing MACS peaks - rtrackalyer

Also included in the MACS2 output directory is a ‘.narrowPeak’ file. This is a type of interval/bed file and as we have done before, we can import these with the rtracklayer package. The ‘format’ argument must be set to ‘narrowPeak’.

library(rtracklayer)
macsPeaks_GR_np <- import("data/Mel1_peaks.narrowPeak", format = "narrowPeak")
macsPeaks_GR_np
## GRanges object with 16757 ranges and 6 metadata columns:
##           seqnames            ranges strand |            name     score signalValue
##              <Rle>         <IRanges>  <Rle> |     <character> <numeric>   <numeric>
##       [1]     chr1   4785371-4785642      * |     Mel1_peak_1        73     5.33590
##       [2]     chr1   5082993-5083247      * |     Mel1_peak_2        92     4.30257
##       [3]     chr1   7397544-7398115      * |     Mel1_peak_3       222     9.58306
##       [4]     chr1   7616290-7616433      * |     Mel1_peak_4        73     7.65860
##       [5]     chr1   8134747-8134893      * |     Mel1_peak_5        29     2.94486
##       ...      ...               ...    ... .             ...       ...         ...
##   [16753]     chrY   2657144-2657294      * | Mel1_peak_16753        21     4.25981
##   [16754]     chrY 90784142-90784289      * | Mel1_peak_16754       160     6.91704
##   [16755]     chrY 90818471-90818771      * | Mel1_peak_16755        74     5.63525
##   [16756]     chrY 90824549-90824905      * | Mel1_peak_16756        92     5.21309
##   [16757]     chrY 90825407-90825575      * | Mel1_peak_16757        24     3.09097
##              pValue    qValue      peak
##           <numeric> <numeric> <integer>
##       [1]  10.66553   7.37727       192
##       [2]  12.68072   9.27344       130
##       [3]  26.16516  22.20383       293
##       [4]  10.68563   7.39636        90
##       [5]   5.86828   2.96627       126
##       ...       ...       ...       ...
##   [16753]   4.98931   2.19466        52
##   [16754]  19.76291  16.02769        85
##   [16755]  10.73033   7.44072       124
##   [16756]  12.62006   9.21552       222
##   [16757]   5.26630   2.43505        78
##   -------
##   seqinfo: 21 sequences from an unspecified genome; no seqlengths

Filter peaks in blacklisted regions

We will want to remove any peaks overlapping blacklisted regions prior to any downstream analysis. We can do this using simple overlapping with GRanges objects.

library(rtracklayer)
blkList <- import.bed(toBlkList)
macsPeaks_GR <- macsPeaks_GR[!macsPeaks_GR %over% blkList]

Peak Annotation

Annotation of peaks to genes

So far we have been working with ChIPseq peaks corresponding to transcription factor binding. Transcription factors, as implied in the name, can affect the expression of their target genes.

The target of a transcription factor is hard to assertain from ChIPseq data alone and so often we will annotate peaks to genes by a simple set of rules:

Peaks are typically annotated to a gene if * They overlap the gene. * The gene is the closest (and within a minimum distance).

Peak annotation

A useful package for annotation of peaks to genes is ChIPseeker.

By using pre-defined annotation in the from of a TXDB object for mouse (mm10 genome), ChIPseeker will provide us with an overview of where peaks land in the gene and distance to TSS sites.

First load the libraries we require for the next part.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
library(org.Mm.eg.db)
library(GenomeInfoDb)
library(ChIPseeker)

Peak annotation

The annotatePeak function accepts a GRanges object of the regions to annotate, a TXDB object for gene locations and a database object name to retrieve gene names from.

peakAnno <- annotatePeak(macsPeaks_GR, tssRegion = c(-500, 500), TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
    annoDb = "org.Mm.eg.db")

## >> preparing features information...      2025-06-05 20:01:16 
## >> identifying nearest features...        2025-06-05 20:01:16 
## >> calculating distance from peak to TSS...   2025-06-05 20:01:17 
## >> assigning genomic annotation...        2025-06-05 20:01:17 
## >> adding gene annotation...          2025-06-05 20:01:19 
## >> assigning chromosome lengths           2025-06-05 20:01:19 
## >> done...                    2025-06-05 20:01:19

class(peakAnno)

## [1] "csAnno"
## attr(,"package")
## [1] "ChIPseeker"

Peak annotation

The result is a csAnno object containing annotation for peaks and overall annotation statistics.

peakAnno

## Annotated peaks generated by ChIPseeker
## 16753/16753  peaks were annotated
## Genomic Annotation Summary:
##              Feature  Frequency
## 9           Promoter 41.3299111
## 4             5' UTR  0.3581448
## 3             3' UTR  1.9697965
## 1           1st Exon  1.9280129
## 7         Other Exon  2.4891064
## 2         1st Intron 13.1319764
## 8       Other Intron 16.1284546
## 6 Downstream (<=300)  0.1969796
## 5  Distal Intergenic 22.4676177

Peak annotation

The csAnno object contains the information on annotation of individual peaks to genes.

To extract this from the csAnno object the ChIPseeker functions as.GRanges or as.data.frame can be used to produce the respective object with peaks and their associated genes.

peakAnno_GR <- as.GRanges(peakAnno)
peakAnno_DF <- as.data.frame(peakAnno)

Peak annotation

peakAnno_GR[1:2, ]

## GRanges object with 2 ranges and 14 metadata columns:
##       seqnames          ranges strand | abs_summit fold_enrichment  annotation   geneChr
##          <Rle>       <IRanges>  <Rle> |  <integer>       <numeric> <character> <integer>
##   [1]     chr1 4785371-4785642      * |    4785563         5.33590    Promoter         1
##   [2]     chr1 5082993-5083247      * |    5083123         4.30257    Promoter         1
##       geneStart   geneEnd geneLength geneStrand      geneId          transcriptId
##       <integer> <integer>  <integer>  <integer> <character>           <character>
##   [1]   4783572   4785692       2121          2       27395  ENSMUST00000132625.1
##   [2]   5083080   5162529      79450          1      108664 ENSMUST00000044369.12
##       distanceToTSS            ENSEMBL      SYMBOL               GENENAME
##           <numeric>        <character> <character>            <character>
##   [1]            50 ENSMUSG00000033845      Mrpl15 mitochondrial riboso..
##   [2]             0 ENSMUSG00000033793     Atp6v1h ATPase, H+ transport..
##   -------
##   seqinfo: 21 sequences from mm10 genome

Vizualising peak annotation

Now we have the annotated peaks from ChIPseeker we can use some of ChIPseeker’s plotting functions to display distribution of peaks in gene features. Here we use the plotAnnoBar function to plot this as a bar chart but plotAnnoPie would produce a similar plot as a pie chart.

plotAnnoBar(peakAnno)

Vizualising peak annotation

Similarly we can plot the distribution of peaks around TSS sites.

plotDistToTSS(peakAnno)

Vizualising peak annotation

ChIPseeker can also offer a succinct plot to describe the overlap between annotations.

upsetplot(peakAnno, vennpie = F)

Time for an exercise!

Exercise on ChIP-seq data can be found here

Answers to exercise

Answers can be found here

R code for solutions can be found here

ChIPseq In Bioconductor (part2)

Rockefeller University, Bioinformatics Resource Center

http://rockefelleruniversity.github.io/RU_ChIPseq/

ChIPseq (part 2)

Data

Quality Control

Quality Control

Quality Control References

Always have an appropriate input

Quality metrics for ChIPseq

Quality control with ChIPQC

Quality control with ChIPQC

QC of multiple samples

QC of multiple samples

Assessing fragment length

Assessing fragment length

Assessing fragment length

Assessing fragment length

Assessing fragment length

Assessing fragment length

Assessing fragment length

Cross-coverage plot

Cross-coverage plot

Cross-coverage plot

Cross-coverage plot

Cross-coverage plot

Blacklists and SSD

Blacklists

Blacklist affects many metrics

Blacklists and SSD

Standardized Standard Deviation

Standardized Standard Deviation

Library complexity and enrichment

Library complexity

Duplication

Enrichment for reads across genes

Enrichment for reads across genes.

Peak Calling

Peak calling

Installing programs that aren’t in R

Install MACS2 with Herper

Running MACS2

Running MACS2 in R

Working with Peaks

Importing MACS peaks

Converting MACS peaks

Peaks as GRanges

Peaks as GRanges

Importing MACS peaks - rtrackalyer

Filter peaks in blacklisted regions

Peak Annotation

Annotation of peaks to genes

Peak annotation

Peak annotation

Peak annotation

Peak annotation

Peak annotation

Vizualising peak annotation

Vizualising peak annotation

Vizualising peak annotation

Time for an exercise!

Answers to exercise