ATAC-seq, MNase-seq and DNase-seq
DNase-seq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites.
MNase-seq - Enzymatic digestion to extract signal repesenting nucleosome positioning.
ATAC-seq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.
Working with ATAC-seq data in R/Bioconductor
Introduction
In this workshop we will look at some of the basics of ATAC-seq analysis in R using publically available data.
We will look at the alignment, post-alignment processing, plotting of ATAC-seq data over TSSs and motif occurrences, peak calling for nucleosome free regions and differential ATAC-seq signal.
Covering today.
Aligning ATAC-seq data.
Plotting fragment lengths from ATAC-seq
Creation of nucleosome free signal tracks.
Peak calling for nucleosome free regions.
Annotation and functional enrichment analysis of ATAC-seq.
Differential nucleosome free regions from ATAC-seq data.
The Sequencing Data.
In this workshop we will make use of three sets of published data.
The first dataset is from original ATAC-seq paper.
Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf
In particular, we will make use of the ATACseq_50k_Rep2 sample GEO - GSM1155958 Data can be retrieved in fastq format from ENA
- SAMN02192806 - here
For the second dataset we take ATAC-seq generated by Bing Ren at UCSD as part of the Encode consortium. Links to data and sample information are included in list below.
Liver day 12 - ENCSR302LIV
Kidney day 15 - ENCSR023QZX
Hindbrain day 12 - ENCSR088UYE
Finally I have processed some of the data from Christina Leslie’ lab at MSKCC exactly as described in this workshop so we can review some of the characteristics of ATAC-seq data alongside the same data processed by Encode’s pipeline.
The raw data and processed BAM file is available from Encodes portal
- T-Reg - ENCSR724UJS
The Reference Data
For ATAC-seq analysis we will require a few pieces of reference data.
This includes-
Reference genome in fasta format - IGenomes provides reference genomes for major organims. If you want the exact reference i use here you can download from here
Gene models - We will retrieve these from TxDb Bioconductor annotation packages.
Blacklists - Artefact regions specific to genomes. These can be found in Encode portal here
The Tools.
In this workshop we will concentrate on using R and Bioconductor to process your data.
Bioconductor has many packages to handle sequencing data including BAM files, BED files, bigWigs and Fasta files. Some of these tools may require large memory or care to process data in chunks. We process full size ATAC-seq data routinely in R but for some examples in this practical we will use just one chromosome to save time.
Required R packages and versions.
We will need the latest version of R v(>=3.4) and Bioconductor (v3.6) with the following packages.
# install.packages(knitr)
# install.packages(rmdformats)
# install.packages(dplyr)
# install.packages(DT)
# install.packages(tidyr)
# install.packages(ggplot2)
# install.packages(magrittr)
# install.packages(devtools)
# source('https://bioconductor.org/biocLite.R')
# ## Needed for mac and Linux only biocLite(Rsubread) ##
# biocLite(Rsamtools)
# biocLite(GenomicAlignments)
# biocLite(TxDb.Hsapiens.UCSC.hg19.knownGene)
# biocLite(soGGi)
# biocLite(rtracklayer)
# biocLite(ChIPQC)
# biocLite(ChIPseeker)
# biocLite(rGREAT)
# biocLite(limma)
# biocLite(DESeq2)
# biocLite(TxDb.Mmusculus.UCSC.mm10.knownGene)
# biocLite(tracktables)
# biocLite(clusterProfiler)
# biocLite(org.Mm.eg.db)
# biocLite(MotifDb)
# biocLite(Biostrings)
# biocLite(BSgenome.Hsapiens.UCSC.hg19)
# # Finally we need development version of soGGi (named here 1.10.4) # not
# version on Bioconductor (1.10.0)
# devtools::install_github('ThomasCarroll/soGGi')
IGV genome browser
We will also make use of IGV genome browser to review some results.
IGV is available from Broad here.
Processed Data in workshop
We start with public sequencing data in links below and use reference data in Bioconductor and/or IGenomes.
We do not run to all the steps together today so i include processed data from various points in analysis at links below
Essentials
BAM file and BAI index from our alignment/sorting/indexing.
SAMN02192806 - Greenleaf BAM - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing.
SAMN02192806 - Greenleaf BAI index - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing.
Small BAM, peak calls and directory structure.
- ATAC_Workshop_Essential.zip - Require additional workshop files and directory structure.
Once you have downloaded the above and unzipped ATAC_Workshop.zip, you should move the Sorted_ATAC_50K_2.bam and Sorted_ATAC_50K_2.bam.bai file into ATAC_Workshop/ATAC_Data/ATAC_BAM/
You should also copy the RU_ATAC_Workshop.Rmd to ATAC_Workshop/ directory and open then to make sure all relative paths are correct.
Not essential
Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.
- ATAC_Workshop.zip - Additional workshop files and directory structure.
Bigwigs for IGV.
- Bigwigs - BigWigs to review in IGV.
Greenleaf dataset fastq to BAM processing.
In this section we will work alittle with the Greenleaf dataset. We will process one sample of the Greenleaf data from fastq to BAM to allow us to review some of the features of ATAC-seq data and to create some processed files for review and further analysis.
Creating a reference genome
The IGenomes repositories allow us to retrieve a consistent set of reference annotation for our genome of choice.
Here i have renamed the Fasta to something more meaningful in isolation – hg19_Genome.fa.
We will use the Rsubread package to align our ATAC-seq data so first we must build an appropriate index from our Fasta for the Rsubread aligner.
This takes awhile so run this after workshop if interested
Takes ~ 30 minutes on 3.1 GHz Intel Core i7 Mac pro
library(Rsubread)
genome <- "ATAC_Data/ATAC_Reference/hg19_Genome.fa"
indexForSubread <- gsub("\\.fa$", "", genome)
buildindex(indexForSubread, genome, indexSplit = FALSE)
Aligning Sequence Reads to the Genome.
Now we have a reference genome, we can align our paired-end ATAC-seq reads using Rsubread.
Here we can use a standard alignment for DNA but we increase the maximum allowed fragment length to capture long fragments representing poly-nucleosome signal.
The maximum allowed fragment length set here is based on parameters used within Greenleaf study.
This takes awhile so run this after workshop if interested
Takes ~ 50 minutes on 3.1 GHz Intel Core i7 Mac pro
read1 <- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz"
read2 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz"
outBAM <- "ATAC_50K_2.bam"
align(indexForSubread, readfile1 = read1, readfile2 = read2, output_file = outBAM,
nthreads = 2, type = 1, unique = TRUE, maxFragLength = 2000)
Sorting and Indexing
Following alignment we would want to sort and index our BAM file for use with external tool.
First we sort our aligned data by sequence order (not Read Name here).
We then index our file allowing for rapid access of particular genomic locations by other programs (e.g IGV, Samtools) and by R/Bioconductor packaes we will use.
This takes awhile so run this after workshop if interested
Takes ~ 15 minutes on 3.1 GHz Intel Core i7 Mac pro
library(Rsamtools)
sortedBAM <- file.path(dirname(outBAM), paste0("Sorted_", basename(outBAM)))
sortBam(outBAM, gsub("\\.bam", "", basename(sortedBAM)))
indexBam(sortedBAM)
So around ~ 1hr 30 minutes in total to get to here.
sortedBAM <- "ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam"
Number of mapped reads
We can now check the mapping rate to make sure we do not have any problems with the aligmnent. Here we use Rsubread’s propmapped to give us number of mapped, unmapped and mapping percentage
This takes awhile so run this after workshop if interested
Takes ~ 15 minutes on 3.1 GHz Intel Core i7 Mac pro
library(Rsubread)
pmapped <- propmapped(sortedBAM)
pmapped
Sample NumTotal NumMapped PropMapped
1 ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam 56598621 55521254 0.980965
Distribution of mapped reads
We can see we have a great mapping rate but we would also want to check the distribution of mapped reads across chromosomes.
ATAC-seq is known have high signal on the mitochondrial chromosomes and so we can check for that here.
In this example, we see an extreme case where the mapping rate to mitochondrial genome is extremely high.
library(Rsamtools)
library(ggplot2)
library(magrittr)
idxstatsBam(sortedBAM) %>% ggplot(aes(seqnames, mapped, fill = seqnames)) +
geom_bar(stat = "identity") + coord_flip()
Lets see if it is always that bad.
So we take some data from Bing Ren’s lab (processed from fastq) and from our neighbour’s at MSKCC and see how it looks there.
We will also take some ChIP-seq to identify a potential similar trend there.
Greenleaf dataset post-alignment processing.
Now we have the processed the ATAC-seq paired-end data we can start to work with alignments.
We will identify the expected fragment length distribution and split our data into that representing nucleosome free and mono-nucleosomes.
Reading in mapped reads
We will read our newly aligned data using the GenomicAlignments package.
Here we only read properly paired, uniquely mapped reads mapping to Chromosome 20. We also retrieve the read name, mapq scores and importantly the insert sizes.
library(GenomicAlignments)
atacReads <- readGAlignmentPairs(sortedBAM, param = ScanBamParam(mapqFilter = 1,
flag = scanBamFlag(isPaired = TRUE, isProperPair = TRUE), what = c("qname",
"mapq", "isize"), which = GRanges("chr20", IRanges(1, 63025520))))
# length(atacReads)
atacReads
GAlignmentPairs object with 279210 pairs, strandMode=1, and 0 metadata columns:
seqnames strand : ranges --
<Rle> <Rle> : <IRanges> --
[1] chr20 + : [60000, 60029] --
[2] chr20 + : [60000, 60029] --
[3] chr20 + : [60392, 60441] --
[4] chr20 + : [60436, 60485] --
[5] chr20 - : [60665, 60714] --
... ... ... ... ... ...
[279206] chr20 - : [62962073, 62962122] --
[279207] chr20 + : [62962999, 62963048] --
[279208] chr20 - : [62964817, 62964866] --
[279209] chr20 - : [62965224, 62965273] --
[279210] chr20 + : [62965378, 62965427] --
ranges
<IRanges>
[1] [60361, 60410]
[2] [60361, 60410]
[3] [60570, 60619]
[4] [60446, 60495]
[5] [60487, 60536]
... ...
[279206] [62961997, 62962046]
[279207] [62963057, 62963106]
[279208] [62964689, 62964738]
[279209] [62965221, 62965270]
[279210] [62965437, 62965486]
-------
seqinfo: 25 sequences from an unspecified genome
Retrieving insert sizes
Now we have read in the paired aligned data into R, we can retreive the insert sizes from the elementMetadata attached to reads.
Since properly paired reads will have same insert size length we extract insert sizes from read1.
atacReads_read1 <- GenomicAlignments::first(atacReads)
insertSizes <- abs(elementMetadata(atacReads_read1)$isize)
head(insertSizes)
[1] 431 431 228 60 228 176
Plotting insert sizes
ATAC-seq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and poly-nucleosome fractions.
We can use the newly acquired insert lengths for chromosome 20 to plot the distribution of all fragment lengths.
library(magrittr)
library(dplyr)
library(ggplot2)
fragLenPlot <- table(insertSizes) %>% data.frame %>% rename(InsertSize = insertSizes,
Count = Freq) %>% mutate(InsertSize = as.numeric(as.vector(InsertSize)),
Count = as.numeric(as.vector(Count))) %>% ggplot(aes(x = InsertSize, y = Count)) +
geom_line()
fragLenPlot + theme_bw()
fragLenPlot + scale_y_continuous(trans = "log2") + theme_bw()
Plotting insert sizes with Greenleaf open, mono- and di-nucleosome profiles
This looks very similar to the image from the Greenleaf paper.
We can now annotate our nucleosome free (< 100bp), mono-nucleosome (180bp-247bp) and di-nucleosome (315-437) as in the Greenleaf study.
fragLenPlot + geom_vline(xintercept = c(180, 247), colour = "red") + geom_vline(xintercept = c(315,
437), colour = "darkblue") + geom_vline(xintercept = c(100), colour = "darkgreen") +
theme_bw()
fragLenPlot + scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180,
247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") +
geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw()
Again we can make a quick comparison with some other datasets.
Plotting ATAC-seq signal of TSSs (Retrieving TSSs regions)
From the fragment length plots shown above it would seem we have evidence of data coming both from nucleosome occupied and nucleosome free regions.
We expect nucleosome free regions to be present at TSSs of active genes and nucleosome signal to be be more enriched in regions surrounding the TSS
To investigate this we will produce a meta-TSS plot of average signal of the different fragment length signals across TSSs.
First then we need a set of TSS positions which we can retrieve from the gene models in the TxDb.Hsapiens.UCSC.hg19.knownGene package.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TSSs <- resize(genes(TxDb.Hsapiens.UCSC.hg19.knownGene), fix = "start", 1)
TSSs
GRanges object with 23056 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <character>
1 chr19 [ 58874214, 58874214] - | 1
10 chr8 [ 18248755, 18248755] + | 10
100 chr20 [ 43280376, 43280376] - | 100
1000 chr18 [ 25757445, 25757445] - | 1000
10000 chr1 [244006886, 244006886] - | 10000
... ... ... ... . ...
9991 chr9 [115095944, 115095944] - | 9991
9992 chr21 [ 35736323, 35736323] + | 9992
9993 chr22 [ 19109967, 19109967] - | 9993
9994 chr6 [ 90539619, 90539619] + | 9994
9997 chr22 [ 50964905, 50964905] - | 9997
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
Plotting ATAC-seq signal of TSSs (Creating open, mono- and di-nucleosome signal profiles)
Now we have all the TSS positions across the genome we can use the soGGi package to make plots of average signal across TSSs for different fragment length ranges.
This takes awhile so run this after workshop if interested
Takes ~ 20 minutes on 3.1 GHz Intel Core i7 Mac pro
library(soGGi)
# Nucleosome free
nucFree <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point",
format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100,
forceFragment = 50)
# Mononucleosome
monoNuc <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point",
format = "bam", paired = TRUE, minFragmentLength = 180, maxFragmentLength = 240,
forceFragment = 80)
# Dinucleosome
diNuc <- regionPlot(bamFile = sortedBAM, testRanges = TSSs, style = "point",
format = "bam", paired = TRUE, minFragmentLength = 315, maxFragmentLength = 437,
forceFragment = 160)
# nucFree_gL <- nucFree monoNuc_gL <- monoNuc diNuc_gL <- diNuc
# save(monoNuc_gL,nucFree_gL,diNuc_gL,file='ATAC_Data/ATAC_RData/gL_soGGiResults.RData')
Plotting ATAC-seq signal of TSSs (Plotting open, mono- and di-nucleosome signal profiles)
From the meta-TSS plots of different fragment lengths we can see the expected distribution of signal.
Nucleosome free fragments having a peak around the the TSS and nucleosome occupied signal peaking around the TSS with a strong peak at the +1 nucleosome.
library(soGGi)
load(file = "ATAC_Data/ATAC_RData/gL_soGGiResults.RData")
plotRegion(nucFree_gL, outliers = 0.01)
plotRegion(monoNuc_gL, outliers = 0.01)
plotRegion(diNuc_gL, outliers = 0.01)
Again we can look at some of other datasets for mono-nucleosomes to get and idea what we may see.
Subsetting ATAC-seq reads files by insert sizes.
We may wish to divide our aligned reads into reads representing nucleosome free and nucleosome occupied.
Here we create BAM files for the reads representing nucleosome free, mono and di nucleosome by using insert sizes to filtering read.
atacReads_Open <- atacReads[insertSizes < 100, ]
atacReads_MonoNuc <- atacReads[insertSizes > 180 & insertSizes < 240, ]
atacReads_diNuc <- atacReads[insertSizes > 315 & insertSizes < 437, ]
Creating BAM files split by insert sizes.
The resulting reads can be written back to BAM files for use in other parts of our analysis or for visualisation in programs such as IGV by functions in the rtracklayer package.
openRegionBam <- gsub("\\.bam", "_openRegions\\.bam", sortedBAM)
monoNucBam <- gsub("\\.bam", "_monoNuc\\.bam", sortedBAM)
diNucBam <- gsub("\\.bam", "_diNuc\\.bam", sortedBAM)
library(rtracklayer)
export(atacReads_Open, openRegionBam, format = "bam")
export(atacReads_MonoNuc, monoNucBam, format = "bam")
# export(atacReads_Open,diNucBam,format = 'bam')
Creating an open region bigWig.
We can make it significantly quicker to review the pile-up of ATAC-seq signal in a genome browser by creating a bigWig file.
Additional normalisation to total mapped reads and duplicate removal could be applied at this point.
openRegionBigWig <- gsub("\\.bam", "_openRegions\\.bw", sortedBAM)
openRegionRPMBigWig <- gsub("\\.bam", "_openRegionsRPM\\.bw", sortedBAM)
atacFragments_Open <- granges(atacReads_Open)
export.bw(coverage(atacFragments_Open), openRegionBigWig)
Greenleaf dataset - Finding Open Regions.
A common goal in ATAC-seq is to identify open chromatin regions/nucleosome free regions.
For the remainder of the workshop we will look at analysing the nucleosome free portions of the our ATAC-seq data.
Peak calling for nucleosome free regions.
As we discussed, the reads with insert sizes less than 100bp represent fragments coming from open chromatin and around transcription factors bound to the DNA.
There are many methods available to call nucleosome free regions from ATAC-seq data with many borrowed from ChIP-seq analysis.
Peak calling using MACS2
One very popular and standard peak caller for ATAC-seq is MAC2.
MACS2 is well established for identifying punctate peaks found in ChIP-seq data from transcription factors.
MACS2 website can be found here with information on parameters and use cases.
Single end peak calling.
With single end sequencing from ATAC-seq we do not know how long the fragments are.
To identify open regions therefore requires some different parameters for MACS2 peak calling.
One strategy employed is to shift read 5’ ends by -100 and then extend from this by 200bp. Considering the expected size of our nucleosome free fragments this should provide a pile-up over nucelosome regions suitable for MACS2 window size.
MACS2 callpeak -t singleEnd.bam --nomodel --shift -100 --extsize 200 --format BAM -g MyGenome
Alternatively for the nucleosome occupied data we can adjust shift and extension to centre the signal on nucleosome centres (nucleosomes wrapped in 147bp of DNA).
MACS2 callpeak -t singleEnd.bam --nomodel --shift 37 --extsize 73 --format BAM -g MyGenome
Paired end peak calling.
If we have sequenced paired-end data then we do know the fragment lengths and can provide pre-filter BAM files to MACS2.
We have to simply tell MACS2 that the data is paired using the format argument.
By default MACS2 will guess it is single end BAM.
MACS2 callpeak -t pairedEnd.bam -f BAMPE --outdir path/to/output/ --name pairedEndPeakName -g MyGenome
Following peak calling we would get 3 files.
Name.narrowPeak – Narrow peak format suitable for IGV and further analysis
Name_peaks.xls – Peak table suitable for review in excel.(not actually a xls but a tsv)
summits.bed – Summit positions for peaks useful for finding motifs and plotting
QC for low quality, duplicates and signal distribution.
Before we remove any data we can get a quick assessment of our reads in peaks, duplication rate, low quality reads and reads in artefact regions from ChIPQC.
library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)
blkList <- import.bed("ATAC_Data/ATAC_blacklists/ENCFF001TDO.bed.gz")
openRegionPeaks <- "ATAC_Data/ATAC_Peaks/Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak"
qcRes <- ChIPQCsample("ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam",
peaks = openRegionPeaks, annotation = "hg19", chromosomes = "chr20", blacklist = blkList,
verboseT = FALSE)
done
Calculating coverage
Calculating Summits on chr20 ..[1] 1
QCmetrics(qcRes) %>% t %>% data.frame %>% dplyr:::select(Reads, starts_with(c("Filt")),
starts_with(c("RiP")), starts_with(c("RiBL"))) %>% datatable(rownames = NULL)
flagtagcounts(qcRes) %>% t %>% data.frame %>% mutate(Dup_Percent = (DuplicateByChIPQC/Mapped) *
100) %>% dplyr:::select(Mapped, Dup_Percent) %>% datatable(rownames = NULL)
Blacklisted Regions
A potential confounder of analysis and a source of noise in ATAC-seq (and ChIPseq/DNAseq etc) are high signal arteact regions within the genome.
Such regions have been previously defined for many model organisms and are available from the encode portal here.
Remove blacklisted peaks
Since blacklisted regions may confound our analysis we remove any peaks which have been called there.
Removing blacklists too early can hide some of the qc issues in your data. The blacklist should always be considered in your analysis and recommended to removed data from these regions once QC is considered.
MacsCalls_chr20 <- granges(qcRes[seqnames(qcRes) %in% "chr20"])
data.frame(Blacklisted = sum(MacsCalls_chr20 %over% blkList), Not_Blacklisted = sum(!MacsCalls_chr20 %over%
blkList))
Blacklisted Not_Blacklisted
1 2 596
MacsCalls_chr20_filtered <- MacsCalls_chr20[!MacsCalls_chr20 %over% blkList]
Greenleaf dataset - Annotating Open Regions.
It is often of interest to associate identified open/nucleosome free regions to genomic features such as genes and enhancers.
Once annotated to genes or enhancers’ genes, we can start to associate ATAC-seq data to characteristics of these genes. (functional annotation, expression changes, other epigenetic states)
Annotating peaks to genes.
A simple approach to annotating nucleosome free region to genes is to associate regions to their closest gene or within a window around a genes transcriptional start site.
We can use the chipseeker library to identify genes closest to our regions and to give us simple summaries and visualisations of this annotation.
We use the gene models from TxDb.Hsapiens.UCSC.hg19.knownGene and supply this to ChIPseeker packages annotatePeak function.
ChIPseeker’s csAnno object will then show breakdown of percentages of peaks in genomic regions.
library(ChIPseeker)
MacsCalls_chr20_filteredAnno <- annotatePeak(MacsCalls_chr20_filtered, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene)
>> preparing features information... 2018-02-16 16:32:33
>> identifying nearest features... 2018-02-16 16:32:33
>> calculating distance from peak to TSS... 2018-02-16 16:32:33
>> assigning genomic annotation... 2018-02-16 16:32:33
>> assigning chromosome lengths 2018-02-16 16:32:45
>> done... 2018-02-16 16:32:45
MacsCalls_chr20_filteredAnno
Annotated peaks generated by ChIPseeker
596/596 peaks were annotated
Genomic Annotation Summary:
Feature Frequency
9 Promoter (<=1kb) 40.2684564
10 Promoter (1-2kb) 2.0134228
11 Promoter (2-3kb) 2.1812081
4 5' UTR 0.1677852
3 3' UTR 1.0067114
1 1st Exon 0.1677852
7 Other Exon 1.5100671
2 1st Intron 6.0402685
8 Other Intron 19.9664430
6 Downstream (<=3kb) 0.8389262
5 Distal Intergenic 25.8389262
Displaying annotation distribution
As well as showing us a table of the annotation distribution we can visualise this using the plotAnnoPie and plotAnnoBar functions.
plotAnnoPie(MacsCalls_chr20_filteredAnno)
plotAnnoBar(MacsCalls_chr20_filteredAnno)
Visualising overlaps between genomic regions.
It is important to note that ChIPseeker is making some critical choices for you here. An important decision is the order of preference in annotating our peaks when they fall in multiple gene regions .. ie. (5’UTR and promoter).
The overlap between annotation can be seen by way of an upset plot.
upsetplot(MacsCalls_chr20_filteredAnno)
Retrieving annotated Nucleosome-free regions.
With this information we can then subset our peaks/nuc free regions to those only landing in TSS regions (+/- 500)
MacsGranges_Anno <- as.GRanges(MacsCalls_chr20_filteredAnno)
TSS_MacsGranges_Anno <- MacsGranges_Anno[abs(MacsGranges_Anno$distanceToTSS) <
500]
TSS_MacsGranges_Anno
GRanges object with 216 ranges and 9 metadata columns:
seqnames ranges strand | annotation geneChr
<Rle> <IRanges> <Rle> | <character> <integer>
[1] chr20 [278001, 278160] * | Promoter (<=1kb) 20
[2] chr20 [305954, 306099] * | Promoter (<=1kb) 20
[3] chr20 [327710, 327787] * | Promoter (<=1kb) 20
[4] chr20 [443103, 443346] * | Promoter (<=1kb) 20
[5] chr20 [524349, 524527] * | Promoter (<=1kb) 20
... ... ... ... . ... ...
[212] chr20 [62289923, 62290008] * | Promoter (<=1kb) 20
[213] chr20 [62370927, 62371002] * | Promoter (<=1kb) 20
[214] chr20 [62496497, 62496604] * | Promoter (<=1kb) 20
[215] chr20 [62611409, 62611498] * | Promoter (<=1kb) 20
[216] chr20 [62694306, 62694450] * | Promoter (<=1kb) 20
geneStart geneEnd geneLength geneStrand geneId transcriptId
<integer> <integer> <integer> <integer> <character> <character>
[1] 278204 280963 2760 1 85364 uc002wdf.3
[2] 306215 310872 4658 1 6666 uc002wdh.4
[3] 327370 335512 8143 1 80023 uc002wdi.4
[4] 416124 443187 27064 2 128637 uc002wds.3
[5] 463338 524482 61145 2 1457 uc002wdw.1
... ... ... ... ... ... ...
[212] 62289667 62305446 15780 1 51750 uc002yfv.2
[213] 62371211 62375403 4193 1 56731 uc002ygq.3
[214] 62496581 62522898 26318 1 7165 uc002ygy.3
[215] 62605466 62611362 5897 2 140700 uc002yhn.2
[216] 62694010 62703700 9691 1 6919 uc021wgq.1
distanceToTSS
<numeric>
[1] -44
[2] -116
[3] 340
[4] 0
[5] 0
... ...
[212] 256
[213] -209
[214] 0
[215] -47
[216] 296
-------
seqinfo: 1 sequence from hg19 genome
Functional Analysis of Nucleosome-free regions - 1.
Another common step to ATAC-seq analysis is to identify any functional enrichment in genes associated to nucleosome free regions.
One approach is to take the genes we identified from ChIPseeker as having nuclesome free regions and test these for functional enrichment using standard tools.
Another approach which is well suited to ATAC-seq is that implemented in GREAT.
GREAT first defines potential regulatory regions for all genes and then then tests for enrichment accounting for total size of regulatory regions for a pathway/GO term/gene set.
GREAT is queried typically through its web portal here but we can take advantage of the r interface in rGREAT.
rGREAT by default will limit our queries and so we may need to be patient. This can be adjusted in rGREAT options.
We can submit our peak calls to GREAT using the submitGreatJob function and review available categories of results using availableCategories.
library(rGREAT)
seqlevelsStyle(MacsCalls_chr20_filtered) <- "UCSC"
great_Job <- submitGreatJob(MacsCalls_chr20_filtered, species = "hg19")
availableCategories(great_Job)
[1] "GO" "Phenotype Data and Human Disease"
[3] "Pathway Data" "Gene Expression"
[5] "Regulatory Motifs" "Gene Families"
Functional Analysis of Nucleosome-free regions - 2.
For this example we select the results tables for the GO category using getEnrichmentTables functions and then review the results for Biological processes.
great_ResultTable = getEnrichmentTables(great_Job, category = "GO")
names(great_ResultTable)
[1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
great_ResultTable[["GO Biological Process"]][1:4, ]
ID name
1 GO:0035723 interleukin-15-mediated signaling pathway
2 GO:2000438 negative regulation of monocyte extravasation
3 GO:2000560 positive regulation of CD24 biosynthetic process
4 GO:2000437 regulation of monocyte extravasation
Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
1 0.0003871583 0.2307463 15
2 0.0003871583 0.2307463 15
3 0.0003871583 0.2307463 15
4 0.0004297120 0.2561083 15
Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
1 65.00645 0.02516779 1.451510e-22
2 65.00645 0.02516779 1.451510e-22
3 65.00645 0.02516779 1.451510e-22
4 58.56896 0.02516779 6.777858e-22
Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
1 1 0.01973283 1
2 1 0.01973283 1
3 1 0.01973283 1
4 2 0.03946566 1
Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
1 50.67697 0.002808989 1.0
2 50.67697 0.002808989 1.0
3 50.67697 0.002808989 1.0
4 25.33848 0.002808989 0.5
Hyper_Raw_PValue
1 0.01973283
2 0.01973283
3 0.01973283
4 0.03907735
save(great_ResultTable, file = "ATAC_Data/ATAC_RData/Great_Results.RData")
Differential ATAC-seq
We have briefly reviewed the processing and initial analysis of one ATAC-seq sample using R.
In the next part we will look at how we can identify changes in open regions using R/Bioconductor.
Here we will take an approach akin that in Diffbind and reasonably esatablished in ATAC-seq analysis.
First, We will define a set of non-redundant peaks present in at least 2 samples and use these to assess changes in nuc-free ATAC-seq signal using DESeq2.
Identifying a set of non-redundant peaks.
Here we will use soGGi to produce merge our open regions from all samples into a set of non-redundant (no overlapping regions) open regions present in any sample.
peaks <- dir("ATAC_Data/ATAC_Peaks_forCounting/", pattern = "*.narrowPeak",
full.names = TRUE)
myPeaks <- lapply(peaks, ChIPQC:::GetGRanges, simple = TRUE)
names(myPeaks) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2", "Liver_1",
"Liver_2")
Group <- factor(c("HindBrain", "HindBrain", "Kidney", "Kidney", "Liver", "Liver"))
consensusToCount <- soGGi:::runConsensusRegions(GRangesList(myPeaks), "none")
blklist <- import.bed("ATAC_Data/ATAC_blacklists/ENCFF547MET.bed.gz")
consensusToCount <- consensusToCount[!consensusToCount %over% blklist & !seqnames(consensusToCount) %in%
"chrM"]
consensusToCount
GRanges object with 89654 ranges and 7 metadata columns:
seqnames ranges strand | HindBrain_1 HindBrain_2
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 [3130336, 3130413] * | 0 0
[2] chr1 [3400112, 3400250] * | 0 0
[3] chr1 [3433954, 3434095] * | 0 0
[4] chr1 [3515020, 3515102] * | 0 0
[5] chr1 [3575895, 3576078] * | 0 0
... ... ... ... . ... ...
[89650] chrY [90805058, 90805151] * | 0 0
[89651] chrY [90807391, 90807562] * | 0 0
[89652] chrY [90808761, 90808841] * | 0 1
[89653] chrY [90812961, 90813237] * | 1 1
[89654] chrY [90829708, 90829782] * | 0 0
Kidney_1 Kidney_2 Liver_1 Liver_2 consensusIDs
<numeric> <numeric> <numeric> <numeric> <factor>
[1] 0 1 0 0 consensus_1
[2] 0 1 0 0 consensus_2
[3] 1 1 0 0 consensus_3
[4] 0 1 0 0 consensus_4
[5] 1 1 0 0 consensus_5
... ... ... ... ... ...
[89650] 0 0 1 0 consensus_89765
[89651] 1 0 0 1 consensus_89766
[89652] 0 0 0 0 consensus_89767
[89653] 1 0 1 1 consensus_89768
[89654] 0 1 0 0 consensus_89769
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
Overlap of nucleosome free regions among replicates.
Now we have a set of non-redundant peaks we can review the correspondance between replicates by venn diagrams of peak overlaps.
Here we pass the overlap matrix generated by soGGi to limma’s vennDiagram function.
library(limma)
as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("Liver")) %>%
vennDiagram(main = "Overlap for Liver open regions")
as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("Kidney")) %>%
vennDiagram(main = "Overlap for Kidney open regions")
as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(starts_with("HindBrain")) %>%
vennDiagram(main = "Overlap for Hindbrain open regions")
PCA of overlaps (occupancy analysis).
We can also Diffbind style PCA analysis (Occupancy analysis in Diffbind) of peak overlaps to get an overall view of correspondance between peak calls.
Here we pass the matrix of peak overlaps from soGGi to prcomp function and plot the results in ggplot2.
library(tidyr)
myPlot <- as.data.frame(elementMetadata(consensusToCount)) %>% dplyr::select(-consensusIDs) %>%
as.matrix %>% t %>% prcomp %>% .$x %>% data.frame %>% mutate(Samples = rownames(.)) %>%
mutate(Group = gsub("_\\d", "", Samples)) %>% ggplot(aes(x = PC1, y = PC2,
colour = Group)) + geom_point(size = 5)
myPlot
Counting for differential ATAC-seq.
The presense or absense of a peak does not fully capture the changes in ATAC-seq signal observed in a genome broswer. Identifying changes of ATAC-seq signal within peaks will allow us to better capture ATAC-seq signal differences.
To do this we will borrow some methods from RNA-seq, namely DESeq2, to evaluate changes in ATAC-seq signal between groups/tissues.
First we will filter our peaks in a manner similar to Diffbind, where we keep only peaks which are present in at least two replicates.
library(Rsubread)
occurrences <- elementMetadata(consensusToCount) %>% as.data.frame %>% dplyr::select(-consensusIDs) %>%
rowSums
table(occurrences) %>% rev %>% cumsum
6 5 4 3 2 1
6818 9147 12350 17079 36320 89654
consensusToCount <- consensusToCount[occurrences >= 2, ]
consensusToCount
GRanges object with 36320 ranges and 7 metadata columns:
seqnames ranges strand | HindBrain_1 HindBrain_2
<Rle> <IRanges> <Rle> | <numeric> <numeric>
[1] chr1 [3433954, 3434095] * | 0 0
[2] chr1 [3575895, 3576078] * | 0 0
[3] chr1 [3671534, 3671885] * | 0 0
[4] chr1 [3994858, 3995006] * | 1 1
[5] chr1 [4560267, 4560344] * | 0 1
... ... ... ... . ... ...
[36316] chrY [90784207, 90784298] * | 0 0
[36317] chrY [90784944, 90785099] * | 1 1
[36318] chrY [90801504, 90801651] * | 1 0
[36319] chrY [90807391, 90807562] * | 0 0
[36320] chrY [90812961, 90813237] * | 1 1
Kidney_1 Kidney_2 Liver_1 Liver_2 consensusIDs
<numeric> <numeric> <numeric> <numeric> <factor>
[1] 1 1 0 0 consensus_3
[2] 1 1 0 0 consensus_5
[3] 1 1 1 0 consensus_8
[4] 0 0 0 0 consensus_11
[5] 1 0 0 0 consensus_18
... ... ... ... ... ...
[36316] 0 1 1 0 consensus_89760
[36317] 1 1 1 1 consensus_89761
[36318] 1 1 1 1 consensus_89764
[36319] 1 0 0 1 consensus_89766
[36320] 1 0 1 1 consensus_89768
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
Now we have to set of regions to count in we can use Rsubread to count paired reads landing in peaks.
Note that Rsubread allows for maximum and minimum fragment lengths!
This takes awhile so run this after workshop if interested
Takes ~ 10 minutes on 3.1 GHz Intel Core i7 Mac pro
bamsToCount <- dir("ATAC_Data/ATAC_BAM_forCounting/", full.names = TRUE, pattern = "*.\\.bam$")
# indexBam(bamsToCount)
regionsToCount <- data.frame(GeneID = paste("ID", seqnames(consensusToCount),
start(consensusToCount), end(consensusToCount), sep = "_"), Chr = seqnames(consensusToCount),
Start = start(consensusToCount), End = end(consensusToCount), Strand = strand(consensusToCount))
fcResults <- featureCounts(bamsToCount, annot.ext = regionsToCount, isPairedEnd = TRUE,
countMultiMappingReads = FALSE, maxFragLength = 100)
myCounts <- fcResults$counts
colnames(myCounts) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2",
"Liver_1", "Liver_2")
save(myCounts, file = "ATAC_Data/ATAC_RData/countsFromATAC.RData")
DESeq2 for differential ATAC-seq.
With our counts of fragments in nucleosome free regions we can now contruct a DESeq2 object and perform a PCA again but this time using signal within peaks, not just occurrence in regions.
We pass the GRanges of regions we count to DESeqDataSetFromMatrix function so as to access these from DESeq2 later.
library(DESeq2)
load("ATAC_Data/ATAC_RData/countsFromATAC.RData")
metaData <- data.frame(Group, row.names = colnames(myCounts))
atacDDS <- DESeqDataSetFromMatrix(myCounts, metaData, ~Group, rowRanges = consensusToCount)
atacDDS <- DESeq(atacDDS)
atac_Rlog <- rlog(atacDDS)
plotPCA(atac_Rlog, intgroup = "Group", ntop = nrow(atac_Rlog))
With the new DESeq2 object we can now test for any differences in ATAC-seq signal between groups.
In this example we look at differences between hindbrain and liver samples.
We return a GRanges object here to allow us to perform some more GenomicRanges operations.
library(DESeq2)
library(BSgenome.Mmusculus.UCSC.mm10)
library(tracktables)
LiverMinusHindbrain <- results(atacDDS, c("Group", "Liver", "HindBrain"), format = "GRanges")
LiverMinusHindbrain <- LiverMinusHindbrain[order(LiverMinusHindbrain$pvalue)]
LiverMinusHindbrain
GRanges object with 36320 ranges and 6 metadata columns:
seqnames ranges strand |
<Rle> <IRanges> <Rle> |
ID_chr9_3020206_3020334 chr9 [ 3020206, 3020334] * |
ID_chr4_22488379_22488925 chr4 [ 22488379, 22488925] * |
ID_chr9_3014116_3014203 chr9 [ 3014116, 3014203] * |
ID_chr3_82357817_82358333 chr3 [ 82357817, 82358333] * |
ID_chrX_143483004_143483128 chrX [143483004, 143483128] * |
... ... ... ... .
ID_chr9_32610720_32610969 chr9 [32610720, 32610969] * |
ID_chr14_49245536_49245707 chr14 [49245536, 49245707] * |
ID_chr11_76849776_76849927 chr11 [76849776, 76849927] * |
ID_chr7_78103567_78103861 chr7 [78103567, 78103861] * |
ID_chr15_79546476_79547001 chr15 [79546476, 79547001] * |
baseMean log2FoldChange
<numeric> <numeric>
ID_chr9_3020206_3020334 808.082684547964 -2.07082993359443
ID_chr4_22488379_22488925 96.8201223931814 -5.47105948747546
ID_chr9_3014116_3014203 1150.93297203881 -2.09547966787039
ID_chr3_82357817_82358333 109.359670001547 -3.37179393569715
ID_chrX_143483004_143483128 1130.41224005714 -1.84455679230995
... ... ...
ID_chr9_32610720_32610969 6.66631784355501 0.00134623548612957
ID_chr14_49245536_49245707 5.30434054149352 -0.00128543817342215
ID_chr11_76849776_76849927 5.68601304439561 -0.000711667882566276
ID_chr7_78103567_78103861 6.11939394444521 0.000555215030505441
ID_chr15_79546476_79547001 41.6805411145682 -6.33148740213062e-05
lfcSE stat
<numeric> <numeric>
ID_chr9_3020206_3020334 0.114353675887734 -18.1089931523272
ID_chr4_22488379_22488925 0.368459509247201 -14.8484686924036
ID_chr9_3014116_3014203 0.142226461967964 -14.7334022015003
ID_chr3_82357817_82358333 0.262912834732154 -12.8247597312327
ID_chrX_143483004_143483128 0.147518280630087 -12.5039200865912
... ... ...
ID_chr9_32610720_32610969 2.05663183774856 0.000654582634295557
ID_chr14_49245536_49245707 2.08212150475798 -0.000617369433284618
ID_chr11_76849776_76849927 2.07645599521139 -0.000342731983826041
ID_chr7_78103567_78103861 2.06416309335544 0.000268978276131709
ID_chr15_79546476_79547001 0.322978826608929 -0.000196034132286849
pvalue padj
<numeric> <numeric>
ID_chr9_3020206_3020334 2.70653384394425e-73 9.83013092120553e-69
ID_chr4_22488379_22488925 7.11759576902117e-50 1.29255539165424e-45
ID_chr9_3014116_3014203 3.93391832760457e-49 4.76266378861993e-45
ID_chr3_82357817_82358333 1.19155375973668e-37 1.08193081384091e-33
ID_chrX_143483004_143483128 7.10589146356843e-36 5.16171955913611e-32
... ... ...
ID_chr9_32610720_32610969 0.999477718659624 0.999587805422335
ID_chr14_49245536_49245707 0.999507410492162 0.999589975743462
ID_chr11_76849776_76849927 0.999726539446966 0.999781593499471
ID_chr7_78103567_78103861 0.999785386388871 0.99981291427748
ID_chr15_79546476_79547001 0.99984358739346 0.99984358739346
-------
seqinfo: 22 sequences from an unspecified genome; no seqlengths
We can subset to only open regions within promoters and then create a table to review the results in IGV using makebedtable function in tracktables package.
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
toOverLap <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, 500, 500)
LiverMinusHindbrain <- LiverMinusHindbrain[(!is.na(LiverMinusHindbrain$padj) &
LiverMinusHindbrain$padj < 0.05) & LiverMinusHindbrain %over% toOverLap,
]
# LiverMinusHindbrain <-
# LiverMinusHindbrain[!is.na(LiverMinusHindbrain$padj) &
# LiverMinusHindbrain$padj < 0.05,]
makebedtable(LiverMinusHindbrain, "LiverMinusHindbrain.html", "ATAC_Data/")
[1] "ATAC_Data//LiverMinusHindbrain.html"
Annotation for differential ATAC-seq.
In the final part we can annotate our differential ATAC-seq regions to genes and then use gene information to test enrichment for GO sets.
Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment
library(clusterProfiler)
library(ChIPseeker)
anno_LiverMinusHindbrain <- annotatePeak(LiverMinusHindbrain, TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene)
>> preparing features information... 2018-02-16 16:33:41
>> identifying nearest features... 2018-02-16 16:33:42
>> calculating distance from peak to TSS... 2018-02-16 16:33:42
>> assigning genomic annotation... 2018-02-16 16:33:42
>> assigning chromosome lengths 2018-02-16 16:33:52
>> done... 2018-02-16 16:33:52
go1 <- enrichGO(as.data.frame(as.GRanges(anno_LiverMinusHindbrain)[as.GRanges(anno_LiverMinusHindbrain)$log2FoldChange >
0])$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)
go2 <- enrichGO(as.data.frame(as.GRanges(anno_LiverMinusHindbrain)[as.GRanges(anno_LiverMinusHindbrain)$log2FoldChange <
0])$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)
head(go1, 10) %>% dplyr::select(ID, Description, pvalue, p.adjust) %>% datatable(elementId = "goEle1")
head(go2, 10) %>% dplyr::select(ID, Description, pvalue, p.adjust) %>% datatable(elementId = "goEle2")
Export annotated, differential ATAC-seq.
Finally, we can write out the table of annotated differential ATAC-seq peaks to review in Excel or as bed files.
anno_LiverMinusHindbrain_GRanges <- as.GRanges(anno_LiverMinusHindbrain)
anno_LiverMinusHindbrain_GRanges_Up <- anno_LiverMinusHindbrain[elementMetadata(anno_LiverMinusHindbrain)$log2FoldChange >
0]
anno_LiverMinusHindbrain_GRanges_Down <- anno_LiverMinusHindbrain[elementMetadata(anno_LiverMinusHindbrain)$log2FoldChange <
0]
export.bed(anno_LiverMinusHindbrain_GRanges_Up, "LiverMinusHindbrain_Up.bed")
export.bed(anno_LiverMinusHindbrain_GRanges_Down, "LiverMinusHindbrain_Down.bed")
anno_LiverMinusHindbrain_df <- as.data.frame(anno_LiverMinusHindbrain)
write.table(anno_LiverMinusHindbrain_df, "LiverMinusHindbrain.csv", quote = FALSE,
row.names = FALSE, sep = ",")
Cutting sites from ATAC-seq data
ATAC-seq should generate shorter fragments (our nucleosome free regions) around smaller protected areas such as transcription factor binding sites.
We can therefore look for the pile-up of cut-sites around motifs of interest within different tissues/celltypes/samples.
To produce cut-sites from our BAM file we first resize our reads to 1bp and make the shift of 4/-5 bp depending on strand to adjust for expected shift from insertion of Tn5 transposase.
Here we will identify CTCF motifs passing an arbitary cut-off and then use soGGi to plot cut-sites around them
library(MotifDb)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
CTCF <- query(MotifDb, c("CTCF"))
CTCF <- as.list(CTCF)
myRes <- matchPWM(CTCF[[1]], BSgenome.Hsapiens.UCSC.hg19[["chr20"]])
toCompare <- GRanges("chr20", ranges(myRes))
read1 <- first(atacReads_Open)
read2 <- second(atacReads_Open)
Firsts <- resize(granges(read1), fix = "start", 1)
First_Pos_toCut <- shift(granges(Firsts[strand(read1) == "+"]), 4)
First_Neg_toCut <- shift(granges(Firsts[strand(read1) == "-"]), -5)
Seconds <- resize(granges(read2), fix = "start", 1)
Second_Pos_toCut <- shift(granges(Seconds[strand(read2) == "+"]), 4)
Second_Neg_toCut <- shift(granges(Seconds[strand(read2) == "-"]), -5)
test_toCut <- c(First_Pos_toCut, First_Neg_toCut, Second_Pos_toCut, Second_Neg_toCut)
cutsCoverage <- coverage(test_toCut)
cutsCoverage20 <- cutsCoverage["chr20"]
CTCF_Cuts_open <- regionPlot(cutsCoverage20, testRanges = toCompare, style = "point",
format = "rlelist", distanceAround = 500)
plotRegion(CTCF_Cuts_open, outliers = 0.001) + ggtitle("NucFree Cuts Centred on CTCF")