ATAC-seq, MNase-seq and DNase-seq

ATAC,MNase,DNase - https://www.the-scientist.com/?articles.view/articleNo/44772/title/Reveling-in-the-Revealed/

DNase-seq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites.
MNase-seq - Enzymatic digestion to extract signal repesenting nucleosome positioning.
ATAC-seq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.

Working with ATAC-seq data in R/Bioconductor

Introduction

In this workshop we will look at some of the basics of ATAC-seq analysis in R using publically available data.

We will look at the alignment, post-alignment processing, plotting of ATAC-seq data over TSSs and motif occurrences, peak calling for nucleosome free regions and differential ATAC-seq signal.

Covering today.

Aligning ATAC-seq data.
Plotting fragment lengths from ATAC-seq
Creation of nucleosome free signal tracks.
Peak calling for nucleosome free regions.
Annotation and functional enrichment analysis of ATAC-seq.
Differential nucleosome free regions from ATAC-seq data.

The Sequencing Data.

In this workshop we will make use of three sets of published data.

The first dataset is from original ATAC-seq paper.

Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf

In particular, we will make use of the ATACseq_50k_Rep2 sample GEO - GSM1155958 Data can be retrieved in fastq format from ENA

SAMN02192806 - here

For the second dataset we take ATAC-seq generated by Bing Ren at UCSD as part of the Encode consortium. Links to data and sample information are included in list below.

Liver day 12 - ENCSR302LIV
Kidney day 15 - ENCSR023QZX
Hindbrain day 12 - ENCSR088UYE

Finally I have processed some of the data from Christina Leslie’ lab at MSKCC exactly as described in this workshop so we can review some of the characteristics of ATAC-seq data alongside the same data processed by Encode’s pipeline.

The raw data and processed BAM file is available from Encodes portal

T-Reg - ENCSR724UJS

The Reference Data

For ATAC-seq analysis we will require a few pieces of reference data.

This includes-

Reference genome in fasta format - IGenomes provides reference genomes for major organims. If you want the exact reference i use here you can download from here
Gene models - We will retrieve these from TxDb Bioconductor annotation packages.
Blacklists - Artefact regions specific to genomes. These can be found in Encode portal here

The Tools.

In this workshop we will concentrate on using R and Bioconductor to process your data.

Bioconductor has many packages to handle sequencing data including BAM files, BED files, bigWigs and Fasta files. Some of these tools may require large memory or care to process data in chunks. We process full size ATAC-seq data routinely in R but for some examples in this practical we will use just one chromosome to save time.

Required R packages and versions.

We will need the latest version of R v(>=3.4) and Bioconductor (v3.6) with the following packages.

# install.packages(knitr)

# install.packages(rmdformats)

# install.packages(dplyr)

# install.packages(DT)

# install.packages(tidyr)

# install.packages(ggplot2)

# install.packages(magrittr)

# install.packages(devtools)

# source('https://bioconductor.org/biocLite.R')

# ## Needed for mac and Linux only biocLite(Rsubread) ##

# biocLite(Rsamtools)

# biocLite(GenomicAlignments)

# biocLite(TxDb.Hsapiens.UCSC.hg19.knownGene)

# biocLite(soGGi)

# biocLite(rtracklayer)

# biocLite(ChIPQC)

# biocLite(ChIPseeker)

# biocLite(rGREAT)

# biocLite(limma)

# biocLite(DESeq2)

# biocLite(TxDb.Mmusculus.UCSC.mm10.knownGene)

# biocLite(tracktables)

# biocLite(clusterProfiler)

# biocLite(org.Mm.eg.db)

# biocLite(MotifDb)

# biocLite(Biostrings)

# biocLite(BSgenome.Hsapiens.UCSC.hg19)

# # Finally we need development version of soGGi (named here 1.10.4) # not
# version on Bioconductor (1.10.0)
# devtools::install_github('ThomasCarroll/soGGi')

IGV genome browser

We will also make use of IGV genome browser to review some results.

IGV is available from Broad here.

Processed Data in workshop

We start with public sequencing data in links below and use reference data in Bioconductor and/or IGenomes.

We do not run to all the steps together today so i include processed data from various points in analysis at links below

Essentials

BAM file and BAI index from our alignment/sorting/indexing.

SAMN02192806 - Greenleaf BAM - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing.
SAMN02192806 - Greenleaf BAI index - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing.

Small BAM, peak calls and directory structure.

ATAC_Workshop_Essential.zip - Require additional workshop files and directory structure.

Once you have downloaded the above and unzipped ATAC_Workshop.zip, you should move the Sorted_ATAC_50K_2.bam and Sorted_ATAC_50K_2.bam.bai file into ATAC_Workshop/ATAC_Data/ATAC_BAM/

You should also copy the RU_ATAC_Workshop.Rmd to ATAC_Workshop/ directory and open then to make sure all relative paths are correct.

Not essential

Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.

ATAC_Workshop.zip - Additional workshop files and directory structure.

Bigwigs for IGV.

Bigwigs - BigWigs to review in IGV.

Greenleaf dataset fastq to BAM processing.

In this section we will work alittle with the Greenleaf dataset. We will process one sample of the Greenleaf data from fastq to BAM to allow us to review some of the features of ATAC-seq data and to create some processed files for review and further analysis.

Creating a reference genome

The IGenomes repositories allow us to retrieve a consistent set of reference annotation for our genome of choice.

Here i have renamed the Fasta to something more meaningful in isolation – hg19_Genome.fa.

We will use the Rsubread package to align our ATAC-seq data so first we must build an appropriate index from our Fasta for the Rsubread aligner.

This takes awhile so run this after workshop if interested

Takes ~ 30 minutes on 3.1 GHz Intel Core i7 Mac pro

library(Rsubread)

genome <- "ATAC_Data/ATAC_Reference/hg19_Genome.fa"
indexForSubread <- gsub("\\.fa$", "", genome)

buildindex(indexForSubread, genome, indexSplit = FALSE)

Aligning Sequence Reads to the Genome.

Now we have a reference genome, we can align our paired-end ATAC-seq reads using Rsubread.

Here we can use a standard alignment for DNA but we increase the maximum allowed fragment length to capture long fragments representing poly-nucleosome signal.

The maximum allowed fragment length set here is based on parameters used within Greenleaf study.

This takes awhile so run this after workshop if interested

Takes ~ 50 minutes on 3.1 GHz Intel Core i7 Mac pro

read1 <- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz"
read2 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz"
outBAM <- "ATAC_50K_2.bam"

align(indexForSubread, readfile1 = read1, readfile2 = read2, output_file = outBAM, 
    nthreads = 2, type = 1, unique = TRUE, maxFragLength = 2000)

Sorting and Indexing

Following alignment we would want to sort and index our BAM file for use with external tool.

First we sort our aligned data by sequence order (not Read Name here).

We then index our file allowing for rapid access of particular genomic locations by other programs (e.g IGV, Samtools) and by R/Bioconductor packaes we will use.

This takes awhile so run this after workshop if interested

Takes ~ 15 minutes on 3.1 GHz Intel Core i7 Mac pro

library(Rsamtools)
sortedBAM <- file.path(dirname(outBAM), paste0("Sorted_", basename(outBAM)))

sortBam(outBAM, gsub("\\.bam", "", basename(sortedBAM)))
indexBam(sortedBAM)

So around ~ 1hr 30 minutes in total to get to here.

sortedBAM <- "ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam"

Number of mapped reads

We can now check the mapping rate to make sure we do not have any problems with the aligmnent. Here we use Rsubread’s propmapped to give us number of mapped, unmapped and mapping percentage

This takes awhile so run this after workshop if interested

Takes ~ 15 minutes on 3.1 GHz Intel Core i7 Mac pro

library(Rsubread)
pmapped <- propmapped(sortedBAM)
pmapped

                                    Sample NumTotal NumMapped PropMapped
1 ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam 56598621  55521254   0.980965

Distribution of mapped reads

We can see we have a great mapping rate but we would also want to check the distribution of mapped reads across chromosomes.

ATAC-seq is known have high signal on the mitochondrial chromosomes and so we can check for that here.

In this example, we see an extreme case where the mapping rate to mitochondrial genome is extremely high.

library(Rsamtools)
library(ggplot2)
library(magrittr)

idxstatsBam(sortedBAM) %>% ggplot(aes(seqnames, mapped, fill = seqnames)) + 
    geom_bar(stat = "identity") + coord_flip()

Lets see if it is always that bad.

So we take some data from Bing Ren’s lab (processed from fastq) and from our neighbour’s at MSKCC and see how it looks there.

We will also take some ChIP-seq to identify a potential similar trend there.

Distribution of reads

Greenleaf dataset post-alignment processing.

Now we have the processed the ATAC-seq paired-end data we can start to work with alignments.

We will identify the expected fragment length distribution and split our data into that representing nucleosome free and mono-nucleosomes.

Reading in mapped reads

We will read our newly aligned data using the GenomicAlignments package.

Here we only read properly paired, uniquely mapped reads mapping to Chromosome 20. We also retrieve the read name, mapq scores and importantly the insert sizes.

library(GenomicAlignments)

atacReads <- readGAlignmentPairs(sortedBAM, param = ScanBamParam(mapqFilter = 1, 
    flag = scanBamFlag(isPaired = TRUE, isProperPair = TRUE), what = c("qname", 
        "mapq", "isize"), which = GRanges("chr20", IRanges(1, 63025520))))
# length(atacReads)
atacReads

GAlignmentPairs object with 279210 pairs, strandMode=1, and 0 metadata columns:
           seqnames strand   :               ranges  --
              <Rle>  <Rle>   :            <IRanges>  --
       [1]    chr20      +   :       [60000, 60029]  --
       [2]    chr20      +   :       [60000, 60029]  --
       [3]    chr20      +   :       [60392, 60441]  --
       [4]    chr20      +   :       [60436, 60485]  --
       [5]    chr20      -   :       [60665, 60714]  --
       ...      ...    ... ...                  ... ...
  [279206]    chr20      -   : [62962073, 62962122]  --
  [279207]    chr20      +   : [62962999, 62963048]  --
  [279208]    chr20      -   : [62964817, 62964866]  --
  [279209]    chr20      -   : [62965224, 62965273]  --
  [279210]    chr20      +   : [62965378, 62965427]  --
                         ranges
                      <IRanges>
       [1]       [60361, 60410]
       [2]       [60361, 60410]
       [3]       [60570, 60619]
       [4]       [60446, 60495]
       [5]       [60487, 60536]
       ...                  ...
  [279206] [62961997, 62962046]
  [279207] [62963057, 62963106]
  [279208] [62964689, 62964738]
  [279209] [62965221, 62965270]
  [279210] [62965437, 62965486]
  -------
  seqinfo: 25 sequences from an unspecified genome

Retrieving insert sizes

Now we have read in the paired aligned data into R, we can retreive the insert sizes from the elementMetadata attached to reads.

Since properly paired reads will have same insert size length we extract insert sizes from read1.

atacReads_read1 <- GenomicAlignments::first(atacReads)
insertSizes <- abs(elementMetadata(atacReads_read1)$isize)
head(insertSizes)

[1] 431 431 228  60 228 176

Plotting insert sizes

ATAC-seq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and poly-nucleosome fractions.

We can use the newly acquired insert lengths for chromosome 20 to plot the distribution of all fragment lengths.

library(magrittr)
library(dplyr)
library(ggplot2)
fragLenPlot <- table(insertSizes) %>% data.frame %>% rename(InsertSize = insertSizes, 
    Count = Freq) %>% mutate(InsertSize = as.numeric(as.vector(InsertSize)), 
    Count = as.numeric(as.vector(Count))) %>% ggplot(aes(x = InsertSize, y = Count)) + 
    geom_line()

fragLenPlot + theme_bw()

fragLenPlot + scale_y_continuous(trans = "log2") + theme_bw()

Plotting insert sizes with Greenleaf open, mono- and di-nucleosome profiles

This looks very similar to the image from the Greenleaf paper.

We can now annotate our nucleosome free (< 100bp), mono-nucleosome (180bp-247bp) and di-nucleosome (315-437) as in the Greenleaf study.

fragLenPlot + geom_vline(xintercept = c(180, 247), colour = "red") + geom_vline(xintercept = c(315, 
    437), colour = "darkblue") + geom_vline(xintercept = c(100), colour = "darkgreen") + 
    theme_bw()

fragLenPlot + scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180, 
    247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") + 
    geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw()

Again we can make a quick comparison with some other datasets.

Fragment length distributions

Plotting ATAC-seq signal of TSSs (Retrieving TSSs regions)

From the fragment length plots shown above it would seem we have evidence of data coming both from nucleosome occupied and nucleosome free regions.

We expect nucleosome free regions to be present at TSSs of active genes and nucleosome signal to be be more enriched in regions surrounding the TSS

To investigate this we will produce a meta-TSS plot of average signal of the different fragment length signals across TSSs.

First then we need a set of TSS positions which we can retrieve from the gene models in the TxDb.Hsapiens.UCSC.hg19.knownGene package.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TSSs <- resize(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), fix = "start", 1)
TSSs

GRanges object with 23056 ranges and 1 metadata column:
        seqnames                 ranges strand |     gene_id
           <Rle>              <IRanges>  <Rle> | <character>
      1    chr19 [ 58874214,  58874214]      - |           1
     10     chr8 [ 18248755,  18248755]      + |          10
    100    chr20 [ 43280376,  43280376]      - |         100
   1000    chr18 [ 25757445,  25757445]      - |        1000
  10000     chr1 [244006886, 244006886]      - |       10000
    ...      ...                    ...    ... .         ...
   9991     chr9 [115095944, 115095944]      - |        9991
   9992    chr21 [ 35736323,  35736323]      + |        9992
   9993    chr22 [ 19109967,  19109967]      - |        9993
   9994     chr6 [ 90539619,  90539619]      + |        9994
   9997    chr22 [ 50964905,  50964905]      - |        9997
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

Plotting ATAC-seq signal of TSSs (Creating open, mono- and di-nucleosome signal profiles)

Now we have all the TSS positions across the genome we can use the soGGi package to make plots of average signal across TSSs for different fragment length ranges.

This takes awhile so run this after workshop if interested

Takes ~ 20 minutes on 3.1 GHz Intel Core i7 Mac pro

library(soGGi)

# Nucleosome free
nucFree <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point", 
    format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100, 
    forceFragment = 50)

# Mononucleosome
monoNuc <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point", 
    format = "bam", paired = TRUE, minFragmentLength = 180, maxFragmentLength = 240, 
    forceFragment = 80)

# Dinucleosome
diNuc <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point", 
    format = "bam", paired = TRUE, minFragmentLength = 315, maxFragmentLength = 437, 
    forceFragment = 160)

# nucFree_gL <- nucFree monoNuc_gL <- monoNuc diNuc_gL <- diNuc
# save(monoNuc_gL,nucFree_gL,diNuc_gL,file='ATAC_Data/ATAC_RData/gL_soGGiResults.RData')

Plotting ATAC-seq signal of TSSs (Plotting open, mono- and di-nucleosome signal profiles)

From the meta-TSS plots of different fragment lengths we can see the expected distribution of signal.

Nucleosome free fragments having a peak around the the TSS and nucleosome occupied signal peaking around the TSS with a strong peak at the +1 nucleosome.

library(soGGi)
load(file = "ATAC_Data/ATAC_RData/gL_soGGiResults.RData")
plotRegion(nucFree_gL, outliers = 0.01)

plotRegion(monoNuc_gL, outliers = 0.01)

plotRegion(diNuc_gL, outliers = 0.01)

Again we can look at some of other datasets for mono-nucleosomes to get and idea what we may see.

Mononnucleosome plots

Subsetting ATAC-seq reads files by insert sizes.

We may wish to divide our aligned reads into reads representing nucleosome free and nucleosome occupied.

Here we create BAM files for the reads representing nucleosome free, mono and di nucleosome by using insert sizes to filtering read.

atacReads_Open <- atacReads[insertSizes < 100, ]
atacReads_MonoNuc <- atacReads[insertSizes > 180 & insertSizes < 240, ]
atacReads_diNuc <- atacReads[insertSizes > 315 & insertSizes < 437, ]

Creating BAM files split by insert sizes.

The resulting reads can be written back to BAM files for use in other parts of our analysis or for visualisation in programs such as IGV by functions in the rtracklayer package.

openRegionBam <- gsub("\\.bam", "_openRegions\\.bam", sortedBAM)
monoNucBam <- gsub("\\.bam", "_monoNuc\\.bam", sortedBAM)
diNucBam <- gsub("\\.bam", "_diNuc\\.bam", sortedBAM)

library(rtracklayer)
export(atacReads_Open, openRegionBam, format = "bam")
export(atacReads_MonoNuc, monoNucBam, format = "bam")
# export(atacReads_Open,diNucBam,format = 'bam')

Creating an open region bigWig.

We can make it significantly quicker to review the pile-up of ATAC-seq signal in a genome browser by creating a bigWig file.

Additional normalisation to total mapped reads and duplicate removal could be applied at this point.

openRegionBigWig <- gsub("\\.bam", "_openRegions\\.bw", sortedBAM)
openRegionRPMBigWig <- gsub("\\.bam", "_openRegionsRPM\\.bw", sortedBAM)
atacFragments_Open <- granges(atacReads_Open)
export.bw(coverage(atacFragments_Open), openRegionBigWig)

bigWig in IGV

Greenleaf dataset - Finding Open Regions.

A common goal in ATAC-seq is to identify open chromatin regions/nucleosome free regions.

For the remainder of the workshop we will look at analysing the nucleosome free portions of the our ATAC-seq data.

Peak calling for nucleosome free regions.

As we discussed, the reads with insert sizes less than 100bp represent fragments coming from open chromatin and around transcription factors bound to the DNA.

There are many methods available to call nucleosome free regions from ATAC-seq data with many borrowed from ChIP-seq analysis.

Peak calling using MACS2

One very popular and standard peak caller for ATAC-seq is MAC2.

MACS2 is well established for identifying punctate peaks found in ChIP-seq data from transcription factors.

MACS2 website can be found here with information on parameters and use cases.

Single end peak calling.

With single end sequencing from ATAC-seq we do not know how long the fragments are.

To identify open regions therefore requires some different parameters for MACS2 peak calling.

One strategy employed is to shift read 5’ ends by -100 and then extend from this by 200bp. Considering the expected size of our nucleosome free fragments this should provide a pile-up over nucelosome regions suitable for MACS2 window size.

MACS2 callpeak -t singleEnd.bam --nomodel --shift -100 --extsize 200 --format BAM -g MyGenome

Alternatively for the nucleosome occupied data we can adjust shift and extension to centre the signal on nucleosome centres (nucleosomes wrapped in 147bp of DNA).

MACS2 callpeak -t singleEnd.bam --nomodel --shift 37 --extsize 73 --format BAM -g MyGenome

Paired end peak calling.

If we have sequenced paired-end data then we do know the fragment lengths and can provide pre-filter BAM files to MACS2.

We have to simply tell MACS2 that the data is paired using the format argument.

By default MACS2 will guess it is single end BAM.

MACS2 callpeak -t pairedEnd.bam -f BAMPE --outdir path/to/output/ --name pairedEndPeakName -g MyGenome

Following peak calling we would get 3 files.

Name.narrowPeak – Narrow peak format suitable for IGV and further analysis
Name_peaks.xls – Peak table suitable for review in excel.(not actually a xls but a tsv)
summits.bed – Summit positions for peaks useful for finding motifs and plotting

QC for low quality, duplicates and signal distribution.

Before we remove any data we can get a quick assessment of our reads in peaks, duplication rate, low quality reads and reads in artefact regions from ChIPQC.

library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)

blkList <- import.bed("ATAC_Data/ATAC_blacklists/ENCFF001TDO.bed.gz")
openRegionPeaks <- "ATAC_Data/ATAC_Peaks/Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak"

qcRes <- ChIPQCsample("ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam", 
    peaks = openRegionPeaks, annotation = "hg19", chromosomes = "chr20", blacklist = blkList, 
    verboseT = FALSE)

done
Calculating coverage
Calculating Summits on  chr20  ..[1] 1

QCmetrics(qcRes) %>% t %>% data.frame %>% dplyr:::select(Reads, starts_with(c("Filt")), 
    starts_with(c("RiP")), starts_with(c("RiBL"))) %>% datatable(rownames = NULL)

flagtagcounts(qcRes) %>% t %>% data.frame %>% mutate(Dup_Percent = (DuplicateByChIPQC/Mapped) * 
    100) %>% dplyr:::select(Mapped, Dup_Percent) %>% datatable(rownames = NULL)

Blacklisted Regions

A potential confounder of analysis and a source of noise in ATAC-seq (and ChIPseq/DNAseq etc) are high signal arteact regions within the genome.

Such regions have been previously defined for many model organisms and are available from the encode portal here.

blacklist in IGV

Remove blacklisted peaks

Since blacklisted regions may confound our analysis we remove any peaks which have been called there.

Removing blacklists too early can hide some of the qc issues in your data. The blacklist should always be considered in your analysis and recommended to removed data from these regions once QC is considered.

MacsCalls_chr20 <- granges(qcRes[seqnames(qcRes) %in% "chr20"])

data.frame(Blacklisted = sum(MacsCalls_chr20 %over% blkList), Not_Blacklisted = sum(!MacsCalls_chr20 %over% 
    blkList))

  Blacklisted Not_Blacklisted
1           2             596

MacsCalls_chr20_filtered <- MacsCalls_chr20[!MacsCalls_chr20 %over% blkList]

Greenleaf dataset - Annotating Open Regions.

It is often of interest to associate identified open/nucleosome free regions to genomic features such as genes and enhancers.

Once annotated to genes or enhancers’ genes, we can start to associate ATAC-seq data to characteristics of these genes. (functional annotation, expression changes, other epigenetic states)

Annotating peaks to genes.

A simple approach to annotating nucleosome free region to genes is to associate regions to their closest gene or within a window around a genes transcriptional start site.

We can use the chipseeker library to identify genes closest to our regions and to give us simple summaries and visualisations of this annotation.

We use the gene models from TxDb.Hsapiens.UCSC.hg19.knownGene and supply this to ChIPseeker packages annotatePeak function.

ChIPseeker’s csAnno object will then show breakdown of percentages of peaks in genomic regions.

library(ChIPseeker)
MacsCalls_chr20_filteredAnno <- annotatePeak(MacsCalls_chr20_filtered, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene)

>> preparing features information...         2018-02-16 16:32:33 
>> identifying nearest features...       2018-02-16 16:32:33 
>> calculating distance from peak to TSS...  2018-02-16 16:32:33 
>> assigning genomic annotation...       2018-02-16 16:32:33 
>> assigning chromosome lengths          2018-02-16 16:32:45 
>> done...                   2018-02-16 16:32:45

MacsCalls_chr20_filteredAnno

Annotated peaks generated by ChIPseeker
596/596  peaks were annotated
Genomic Annotation Summary:
              Feature  Frequency
9    Promoter (<=1kb) 40.2684564
10   Promoter (1-2kb)  2.0134228
11   Promoter (2-3kb)  2.1812081
4              5' UTR  0.1677852
3              3' UTR  1.0067114
1            1st Exon  0.1677852
7          Other Exon  1.5100671
2          1st Intron  6.0402685
8        Other Intron 19.9664430
6  Downstream (<=3kb)  0.8389262
5   Distal Intergenic 25.8389262

Displaying annotation distribution

As well as showing us a table of the annotation distribution we can visualise this using the plotAnnoPie and plotAnnoBar functions.

plotAnnoPie(MacsCalls_chr20_filteredAnno)

plotAnnoBar(MacsCalls_chr20_filteredAnno)

Visualising overlaps between genomic regions.

It is important to note that ChIPseeker is making some critical choices for you here. An important decision is the order of preference in annotating our peaks when they fall in multiple gene regions .. ie. (5’UTR and promoter).

The overlap between annotation can be seen by way of an upset plot.

upsetplot(MacsCalls_chr20_filteredAnno)

Retrieving annotated Nucleosome-free regions.

With this information we can then subset our peaks/nuc free regions to those only landing in TSS regions (+/- 500)

MacsGranges_Anno <- as.GRanges(MacsCalls_chr20_filteredAnno)
TSS_MacsGranges_Anno <- MacsGranges_Anno[abs(MacsGranges_Anno$distanceToTSS) < 
    500]
TSS_MacsGranges_Anno

GRanges object with 216 ranges and 9 metadata columns:
        seqnames               ranges strand |       annotation   geneChr
           <Rle>            <IRanges>  <Rle> |      <character> <integer>
    [1]    chr20     [278001, 278160]      * | Promoter (<=1kb)        20
    [2]    chr20     [305954, 306099]      * | Promoter (<=1kb)        20
    [3]    chr20     [327710, 327787]      * | Promoter (<=1kb)        20
    [4]    chr20     [443103, 443346]      * | Promoter (<=1kb)        20
    [5]    chr20     [524349, 524527]      * | Promoter (<=1kb)        20
    ...      ...                  ...    ... .              ...       ...
  [212]    chr20 [62289923, 62290008]      * | Promoter (<=1kb)        20
  [213]    chr20 [62370927, 62371002]      * | Promoter (<=1kb)        20
  [214]    chr20 [62496497, 62496604]      * | Promoter (<=1kb)        20
  [215]    chr20 [62611409, 62611498]      * | Promoter (<=1kb)        20
  [216]    chr20 [62694306, 62694450]      * | Promoter (<=1kb)        20
        geneStart   geneEnd geneLength geneStrand      geneId transcriptId
        <integer> <integer>  <integer>  <integer> <character>  <character>
    [1]    278204    280963       2760          1       85364   uc002wdf.3
    [2]    306215    310872       4658          1        6666   uc002wdh.4
    [3]    327370    335512       8143          1       80023   uc002wdi.4
    [4]    416124    443187      27064          2      128637   uc002wds.3
    [5]    463338    524482      61145          2        1457   uc002wdw.1
    ...       ...       ...        ...        ...         ...          ...
  [212]  62289667  62305446      15780          1       51750   uc002yfv.2
  [213]  62371211  62375403       4193          1       56731   uc002ygq.3
  [214]  62496581  62522898      26318          1        7165   uc002ygy.3
  [215]  62605466  62611362       5897          2      140700   uc002yhn.2
  [216]  62694010  62703700       9691          1        6919   uc021wgq.1
        distanceToTSS
            <numeric>
    [1]           -44
    [2]          -116
    [3]           340
    [4]             0
    [5]             0
    ...           ...
  [212]           256
  [213]          -209
  [214]             0
  [215]           -47
  [216]           296
  -------
  seqinfo: 1 sequence from hg19 genome

Functional Analysis of Nucleosome-free regions - 1.

Another common step to ATAC-seq analysis is to identify any functional enrichment in genes associated to nucleosome free regions.

One approach is to take the genes we identified from ChIPseeker as having nuclesome free regions and test these for functional enrichment using standard tools.

Another approach which is well suited to ATAC-seq is that implemented in GREAT.

GREAT first defines potential regulatory regions for all genes and then then tests for enrichment accounting for total size of regulatory regions for a pathway/GO term/gene set.

GREAT is queried typically through its web portal here but we can take advantage of the r interface in rGREAT.

rGREAT by default will limit our queries and so we may need to be patient. This can be adjusted in rGREAT options.

We can submit our peak calls to GREAT using the submitGreatJob function and review available categories of results using availableCategories.

library(rGREAT)
seqlevelsStyle(MacsCalls_chr20_filtered) <- "UCSC"

great_Job <- submitGreatJob(MacsCalls_chr20_filtered, species = "hg19")
availableCategories(great_Job)

[1] "GO"                               "Phenotype Data and Human Disease"
[3] "Pathway Data"                     "Gene Expression"                 
[5] "Regulatory Motifs"                "Gene Families"

Functional Analysis of Nucleosome-free regions - 2.

For this example we select the results tables for the GO category using getEnrichmentTables functions and then review the results for Biological processes.

great_ResultTable = getEnrichmentTables(great_Job, category = "GO")
names(great_ResultTable)

[1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"

great_ResultTable[["GO Biological Process"]][1:4, ]

          ID                                             name
1 GO:0035723        interleukin-15-mediated signaling pathway
2 GO:2000438    negative regulation of monocyte extravasation
3 GO:2000560 positive regulation of CD24 biosynthetic process
4 GO:2000437             regulation of monocyte extravasation
  Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
1          0.0003871583      0.2307463                         15
2          0.0003871583      0.2307463                         15
3          0.0003871583      0.2307463                         15
4          0.0004297120      0.2561083                         15
  Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
1              65.00645                0.02516779     1.451510e-22
2              65.00645                0.02516779     1.451510e-22
3              65.00645                0.02516779     1.451510e-22
4              58.56896                0.02516779     6.777858e-22
  Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
1                 1     0.01973283                        1
2                 1     0.01973283                        1
3                 1     0.01973283                        1
4                 2     0.03946566                        1
  Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
1              50.67697             0.002808989                      1.0
2              50.67697             0.002808989                      1.0
3              50.67697             0.002808989                      1.0
4              25.33848             0.002808989                      0.5
  Hyper_Raw_PValue
1       0.01973283
2       0.01973283
3       0.01973283
4       0.03907735

save(great_ResultTable, file = "ATAC_Data/ATAC_RData/Great_Results.RData")

Differential ATAC-seq

We have briefly reviewed the processing and initial analysis of one ATAC-seq sample using R.

In the next part we will look at how we can identify changes in open regions using R/Bioconductor.

Here we will take an approach akin that in Diffbind and reasonably esatablished in ATAC-seq analysis.

First, We will define a set of non-redundant peaks present in at least 2 samples and use these to assess changes in nuc-free ATAC-seq signal using DESeq2.

Identifying a set of non-redundant peaks.

blacklist in IGV

Here we will use soGGi to produce merge our open regions from all samples into a set of non-redundant (no overlapping regions) open regions present in any sample.

peaks <- dir("ATAC_Data/ATAC_Peaks_forCounting/", pattern = "*.narrowPeak", 
    full.names = TRUE)
myPeaks <- lapply(peaks, ChIPQC:::GetGRanges, simple = TRUE)

names(myPeaks) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2", "Liver_1", 
    "Liver_2")
Group <- factor(c("HindBrain", "HindBrain", "Kidney", "Kidney", "Liver", "Liver"))
consensusToCount <- soGGi:::runConsensusRegions(GRangesList(myPeaks), "none")

blklist <- import.bed("ATAC_Data/ATAC_blacklists/ENCFF547MET.bed.gz")

consensusToCount <- consensusToCount[!consensusToCount %over% blklist & !seqnames(consensusToCount) %in% 
    "chrM"]

consensusToCount

GRanges object with 89654 ranges and 7 metadata columns:
          seqnames               ranges strand | HindBrain_1 HindBrain_2
             <Rle>            <IRanges>  <Rle> |   <numeric>   <numeric>
      [1]     chr1   [3130336, 3130413]      * |           0           0
      [2]     chr1   [3400112, 3400250]      * |           0           0
      [3]     chr1   [3433954, 3434095]      * |           0           0
      [4]     chr1   [3515020, 3515102]      * |           0           0
      [5]     chr1   [3575895, 3576078]      * |           0           0
      ...      ...                  ...    ... .         ...         ...
  [89650]     chrY [90805058, 90805151]      * |           0           0
  [89651]     chrY [90807391, 90807562]      * |           0           0
  [89652]     chrY [90808761, 90808841]      * |           0           1
  [89653]     chrY [90812961, 90813237]      * |           1           1
  [89654]     chrY [90829708, 90829782]      * |           0           0
           Kidney_1  Kidney_2   Liver_1   Liver_2    consensusIDs
          <numeric> <numeric> <numeric> <numeric>        <factor>
      [1]         0         1         0         0     consensus_1
      [2]         0         1         0         0     consensus_2
      [3]         1         1         0         0     consensus_3
      [4]         0         1         0         0     consensus_4
      [5]         1         1         0         0     consensus_5
      ...       ...       ...       ...       ...             ...
  [89650]         0         0         1         0 consensus_89765
  [89651]         1         0         0         1 consensus_89766
  [89652]         0         0         0         0 consensus_89767
  [89653]         1         0         1         1 consensus_89768
  [89654]         0         1         0         0 consensus_89769
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

Overlap of nucleosome free regions among replicates.

Now we have a set of non-redundant peaks we can review the correspondance between replicates by venn diagrams of peak overlaps.

Here we pass the overlap matrix generated by soGGi to limma’s vennDiagram function.

library(limma)

as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("Liver")) %>% 
    vennDiagram(main = "Overlap for Liver open regions")

as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("Kidney")) %>% 
    vennDiagram(main = "Overlap for Kidney open regions")

as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("HindBrain")) %>% 
    vennDiagram(main = "Overlap for Hindbrain open regions")

PCA of overlaps (occupancy analysis).

We can also Diffbind style PCA analysis (Occupancy analysis in Diffbind) of peak overlaps to get an overall view of correspondance between peak calls.

Here we pass the matrix of peak overlaps from soGGi to prcomp function and plot the results in ggplot2.

library(tidyr)

myPlot <- as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(-consensusIDs) %>% 
    as.matrix %>% t %>% prcomp %>% .$x %>% data.frame %>% mutate(Samples = rownames(.)) %>% 
    mutate(Group = gsub("_\\d", "", Samples)) %>% ggplot(aes(x = PC1, y = PC2, 
    colour = Group)) + geom_point(size = 5)

myPlot

Counting for differential ATAC-seq.

The presense or absense of a peak does not fully capture the changes in ATAC-seq signal observed in a genome broswer. Identifying changes of ATAC-seq signal within peaks will allow us to better capture ATAC-seq signal differences.

To do this we will borrow some methods from RNA-seq, namely DESeq2, to evaluate changes in ATAC-seq signal between groups/tissues.

First we will filter our peaks in a manner similar to Diffbind, where we keep only peaks which are present in at least two replicates.

library(Rsubread)
occurrences <- elementMetadata(consensusToCount) %>% as.data.frame %>% dplyr::select(-consensusIDs) %>% 
    rowSums

table(occurrences) %>% rev %>% cumsum

    6     5     4     3     2     1 
 6818  9147 12350 17079 36320 89654

consensusToCount <- consensusToCount[occurrences >= 2, ]

consensusToCount

GRanges object with 36320 ranges and 7 metadata columns:
          seqnames               ranges strand | HindBrain_1 HindBrain_2
             <Rle>            <IRanges>  <Rle> |   <numeric>   <numeric>
      [1]     chr1   [3433954, 3434095]      * |           0           0
      [2]     chr1   [3575895, 3576078]      * |           0           0
      [3]     chr1   [3671534, 3671885]      * |           0           0
      [4]     chr1   [3994858, 3995006]      * |           1           1
      [5]     chr1   [4560267, 4560344]      * |           0           1
      ...      ...                  ...    ... .         ...         ...
  [36316]     chrY [90784207, 90784298]      * |           0           0
  [36317]     chrY [90784944, 90785099]      * |           1           1
  [36318]     chrY [90801504, 90801651]      * |           1           0
  [36319]     chrY [90807391, 90807562]      * |           0           0
  [36320]     chrY [90812961, 90813237]      * |           1           1
           Kidney_1  Kidney_2   Liver_1   Liver_2    consensusIDs
          <numeric> <numeric> <numeric> <numeric>        <factor>
      [1]         1         1         0         0     consensus_3
      [2]         1         1         0         0     consensus_5
      [3]         1         1         1         0     consensus_8
      [4]         0         0         0         0    consensus_11
      [5]         1         0         0         0    consensus_18
      ...       ...       ...       ...       ...             ...
  [36316]         0         1         1         0 consensus_89760
  [36317]         1         1         1         1 consensus_89761
  [36318]         1         1         1         1 consensus_89764
  [36319]         1         0         0         1 consensus_89766
  [36320]         1         0         1         1 consensus_89768
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

Now we have to set of regions to count in we can use Rsubread to count paired reads landing in peaks.

Note that Rsubread allows for maximum and minimum fragment lengths!

This takes awhile so run this after workshop if interested

Takes ~ 10 minutes on 3.1 GHz Intel Core i7 Mac pro

bamsToCount <- dir("ATAC_Data/ATAC_BAM_forCounting/", full.names = TRUE, pattern = "*.\\.bam$")
# indexBam(bamsToCount)
regionsToCount <- data.frame(GeneID = paste("ID", seqnames(consensusToCount), 
    start(consensusToCount), end(consensusToCount), sep = "_"), Chr = seqnames(consensusToCount), 
    Start = start(consensusToCount), End = end(consensusToCount), Strand = strand(consensusToCount))
fcResults <- featureCounts(bamsToCount, annot.ext = regionsToCount, isPairedEnd = TRUE, 
    countMultiMappingReads = FALSE, maxFragLength = 100)
myCounts <- fcResults$counts
colnames(myCounts) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2", 
    "Liver_1", "Liver_2")
save(myCounts, file = "ATAC_Data/ATAC_RData/countsFromATAC.RData")

DESeq2 for differential ATAC-seq.

With our counts of fragments in nucleosome free regions we can now contruct a DESeq2 object and perform a PCA again but this time using signal within peaks, not just occurrence in regions.

We pass the GRanges of regions we count to DESeqDataSetFromMatrix function so as to access these from DESeq2 later.

library(DESeq2)
load("ATAC_Data/ATAC_RData/countsFromATAC.RData")
metaData <- data.frame(Group, row.names = colnames(myCounts))
atacDDS <- DESeqDataSetFromMatrix(myCounts, metaData, ~Group, rowRanges = consensusToCount)
atacDDS <- DESeq(atacDDS)
atac_Rlog <- rlog(atacDDS)
plotPCA(atac_Rlog, intgroup = "Group", ntop = nrow(atac_Rlog))

With the new DESeq2 object we can now test for any differences in ATAC-seq signal between groups.

In this example we look at differences between hindbrain and liver samples.

We return a GRanges object here to allow us to perform some more GenomicRanges operations.

library(DESeq2)
library(BSgenome.Mmusculus.UCSC.mm10)
library(tracktables)


LiverMinusHindbrain <- results(atacDDS, c("Group", "Liver", "HindBrain"), format = "GRanges")
LiverMinusHindbrain <- LiverMinusHindbrain[order(LiverMinusHindbrain$pvalue)]
LiverMinusHindbrain

GRanges object with 36320 ranges and 6 metadata columns:
                              seqnames                 ranges strand |
                                 <Rle>              <IRanges>  <Rle> |
      ID_chr9_3020206_3020334     chr9 [  3020206,   3020334]      * |
    ID_chr4_22488379_22488925     chr4 [ 22488379,  22488925]      * |
      ID_chr9_3014116_3014203     chr9 [  3014116,   3014203]      * |
    ID_chr3_82357817_82358333     chr3 [ 82357817,  82358333]      * |
  ID_chrX_143483004_143483128     chrX [143483004, 143483128]      * |
                          ...      ...                    ...    ... .
    ID_chr9_32610720_32610969     chr9   [32610720, 32610969]      * |
   ID_chr14_49245536_49245707    chr14   [49245536, 49245707]      * |
   ID_chr11_76849776_76849927    chr11   [76849776, 76849927]      * |
    ID_chr7_78103567_78103861     chr7   [78103567, 78103861]      * |
   ID_chr15_79546476_79547001    chr15   [79546476, 79547001]      * |
                                      baseMean        log2FoldChange
                                     <numeric>             <numeric>
      ID_chr9_3020206_3020334 808.082684547964     -2.07082993359443
    ID_chr4_22488379_22488925 96.8201223931814     -5.47105948747546
      ID_chr9_3014116_3014203 1150.93297203881     -2.09547966787039
    ID_chr3_82357817_82358333 109.359670001547     -3.37179393569715
  ID_chrX_143483004_143483128 1130.41224005714     -1.84455679230995
                          ...              ...                   ...
    ID_chr9_32610720_32610969 6.66631784355501   0.00134623548612957
   ID_chr14_49245536_49245707 5.30434054149352  -0.00128543817342215
   ID_chr11_76849776_76849927 5.68601304439561 -0.000711667882566276
    ID_chr7_78103567_78103861 6.11939394444521  0.000555215030505441
   ID_chr15_79546476_79547001 41.6805411145682 -6.33148740213062e-05
                                          lfcSE                  stat
                                      <numeric>             <numeric>
      ID_chr9_3020206_3020334 0.114353675887734     -18.1089931523272
    ID_chr4_22488379_22488925 0.368459509247201     -14.8484686924036
      ID_chr9_3014116_3014203 0.142226461967964     -14.7334022015003
    ID_chr3_82357817_82358333 0.262912834732154     -12.8247597312327
  ID_chrX_143483004_143483128 0.147518280630087     -12.5039200865912
                          ...               ...                   ...
    ID_chr9_32610720_32610969  2.05663183774856  0.000654582634295557
   ID_chr14_49245536_49245707  2.08212150475798 -0.000617369433284618
   ID_chr11_76849776_76849927  2.07645599521139 -0.000342731983826041
    ID_chr7_78103567_78103861  2.06416309335544  0.000268978276131709
   ID_chr15_79546476_79547001 0.322978826608929 -0.000196034132286849
                                            pvalue                 padj
                                         <numeric>            <numeric>
      ID_chr9_3020206_3020334 2.70653384394425e-73 9.83013092120553e-69
    ID_chr4_22488379_22488925 7.11759576902117e-50 1.29255539165424e-45
      ID_chr9_3014116_3014203 3.93391832760457e-49 4.76266378861993e-45
    ID_chr3_82357817_82358333 1.19155375973668e-37 1.08193081384091e-33
  ID_chrX_143483004_143483128 7.10589146356843e-36 5.16171955913611e-32
                          ...                  ...                  ...
    ID_chr9_32610720_32610969    0.999477718659624    0.999587805422335
   ID_chr14_49245536_49245707    0.999507410492162    0.999589975743462
   ID_chr11_76849776_76849927    0.999726539446966    0.999781593499471
    ID_chr7_78103567_78103861    0.999785386388871     0.99981291427748
   ID_chr15_79546476_79547001     0.99984358739346     0.99984358739346
  -------
  seqinfo: 22 sequences from an unspecified genome; no seqlengths

We can subset to only open regions within promoters and then create a table to review the results in IGV using makebedtable function in tracktables package.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
toOverLap <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, 500, 500)
LiverMinusHindbrain <- LiverMinusHindbrain[(!is.na(LiverMinusHindbrain$padj) & 
    LiverMinusHindbrain$padj < 0.05) & LiverMinusHindbrain %over% toOverLap, 
    ]
# LiverMinusHindbrain <-
# LiverMinusHindbrain[!is.na(LiverMinusHindbrain$padj) &
# LiverMinusHindbrain$padj < 0.05,]
makebedtable(LiverMinusHindbrain, "LiverMinusHindbrain.html", "ATAC_Data/")

[1] "ATAC_Data//LiverMinusHindbrain.html"

Annotation for differential ATAC-seq.

In the final part we can annotate our differential ATAC-seq regions to genes and then use gene information to test enrichment for GO sets.

Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment

library(clusterProfiler)
library(ChIPseeker)

anno_LiverMinusHindbrain <- annotatePeak(LiverMinusHindbrain, TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene)

>> preparing features information...         2018-02-16 16:33:41 
>> identifying nearest features...       2018-02-16 16:33:42 
>> calculating distance from peak to TSS...  2018-02-16 16:33:42 
>> assigning genomic annotation...       2018-02-16 16:33:42 
>> assigning chromosome lengths          2018-02-16 16:33:52 
>> done...                   2018-02-16 16:33:52

go1 <- enrichGO(as.data.frame(as.GRanges(anno_LiverMinusHindbrain)[as.GRanges(anno_LiverMinusHindbrain)$log2FoldChange > 
    0])$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)
go2 <- enrichGO(as.data.frame(as.GRanges(anno_LiverMinusHindbrain)[as.GRanges(anno_LiverMinusHindbrain)$log2FoldChange < 
    0])$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)

head(go1, 10) %>% dplyr::select(ID, Description, pvalue, p.adjust) %>% datatable(elementId = "goEle1")

head(go2, 10) %>% dplyr::select(ID, Description, pvalue, p.adjust) %>% datatable(elementId = "goEle2")

Export annotated, differential ATAC-seq.

Finally, we can write out the table of annotated differential ATAC-seq peaks to review in Excel or as bed files.

anno_LiverMinusHindbrain_GRanges <- as.GRanges(anno_LiverMinusHindbrain)
anno_LiverMinusHindbrain_GRanges_Up <- anno_LiverMinusHindbrain[elementMetadata(anno_LiverMinusHindbrain)$log2FoldChange > 
    0]
anno_LiverMinusHindbrain_GRanges_Down <- anno_LiverMinusHindbrain[elementMetadata(anno_LiverMinusHindbrain)$log2FoldChange < 
    0]
export.bed(anno_LiverMinusHindbrain_GRanges_Up, "LiverMinusHindbrain_Up.bed")
export.bed(anno_LiverMinusHindbrain_GRanges_Down, "LiverMinusHindbrain_Down.bed")

anno_LiverMinusHindbrain_df <- as.data.frame(anno_LiverMinusHindbrain)
write.table(anno_LiverMinusHindbrain_df, "LiverMinusHindbrain.csv", quote = FALSE, 
    row.names = FALSE, sep = ",")

Cutting sites from ATAC-seq data

ATAC-seq should generate shorter fragments (our nucleosome free regions) around smaller protected areas such as transcription factor binding sites.

We can therefore look for the pile-up of cut-sites around motifs of interest within different tissues/celltypes/samples.

To produce cut-sites from our BAM file we first resize our reads to 1bp and make the shift of 4/-5 bp depending on strand to adjust for expected shift from insertion of Tn5 transposase.

Here we will identify CTCF motifs passing an arbitary cut-off and then use soGGi to plot cut-sites around them

library(MotifDb)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)

CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
myRes <- matchPWM(CTCF[[1]], BSgenome.Hsapiens.UCSC.hg19[["chr20"]])
toCompare <- GRanges("chr20", ranges(myRes))

read1 <- first(atacReads_Open)
read2 <- second(atacReads_Open)
Firsts <- resize(granges(read1), fix = "start", 1)
First_Pos_toCut <- shift(granges(Firsts[strand(read1) == "+"]), 4)
First_Neg_toCut <- shift(granges(Firsts[strand(read1) == "-"]), -5)


Seconds <- resize(granges(read2), fix = "start", 1)
Second_Pos_toCut <- shift(granges(Seconds[strand(read2) == "+"]), 4)
Second_Neg_toCut <- shift(granges(Seconds[strand(read2) == "-"]), -5)

test_toCut <- c(First_Pos_toCut, First_Neg_toCut, Second_Pos_toCut, Second_Neg_toCut)
cutsCoverage <- coverage(test_toCut)
cutsCoverage20 <- cutsCoverage["chr20"]
CTCF_Cuts_open <- regionPlot(cutsCoverage20, testRanges = toCompare, style = "point", 
    format = "rlelist", distanceAround = 500)
plotRegion(CTCF_Cuts_open, outliers = 0.001) + ggtitle("NucFree Cuts Centred on CTCF")

Analysis of ATAC-seq data in R and Bioconductor

Thomas Carroll

2018-02-16

ATAC-seq, MNase-seq and DNase-seq

Working with ATAC-seq data in R/Bioconductor

Introduction

Covering today.

The Sequencing Data.

The Reference Data

The Tools.

Required R packages and versions.

IGV genome browser

Processed Data in workshop

Greenleaf dataset fastq to BAM processing.

Creating a reference genome

Aligning Sequence Reads to the Genome.

Sorting and Indexing

Number of mapped reads

Distribution of mapped reads

Greenleaf dataset post-alignment processing.

Reading in mapped reads

Retrieving insert sizes

Plotting insert sizes

Plotting insert sizes with Greenleaf open, mono- and di-nucleosome profiles

Plotting ATAC-seq signal of TSSs (Retrieving TSSs regions)

Plotting ATAC-seq signal of TSSs (Creating open, mono- and di-nucleosome signal profiles)

Plotting ATAC-seq signal of TSSs (Plotting open, mono- and di-nucleosome signal profiles)

Subsetting ATAC-seq reads files by insert sizes.

Creating BAM files split by insert sizes.

Creating an open region bigWig.

Greenleaf dataset - Finding Open Regions.

Peak calling for nucleosome free regions.

Peak calling using MACS2

Single end peak calling.

Paired end peak calling.

QC for low quality, duplicates and signal distribution.

Blacklisted Regions

Remove blacklisted peaks

Greenleaf dataset - Annotating Open Regions.

Annotating peaks to genes.

Displaying annotation distribution

Visualising overlaps between genomic regions.

Retrieving annotated Nucleosome-free regions.

Functional Analysis of Nucleosome-free regions - 1.

Functional Analysis of Nucleosome-free regions - 2.

Differential ATAC-seq

Identifying a set of non-redundant peaks.

Overlap of nucleosome free regions among replicates.

PCA of overlaps (occupancy analysis).

Counting for differential ATAC-seq.

DESeq2 for differential ATAC-seq.

Annotation for differential ATAC-seq.

Export annotated, differential ATAC-seq.

Cutting sites from ATAC-seq data