ATACseq (Assay for Transposase-Accessible Chromatin using sequencing) uses a transposase to efficiently fragment accessible DNA prior to sequencing. The result provides a method to map the accessible/open chromatin genome wide.
In contrast to other techniques, ATACseq has several advantages including
DNaseseq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites.
MNaseseq - Enzymatic digestion to extract signal repesenting nucleosome positioning.
ATACseq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.
In this session we will look at some of the basics of ATACseq processing in R using some publically available data.
We will look at the alignment, post-alignment processing and plotting of ATACseq data over TSSs.
In the ATACseq session we will make use of three sets of published data.
The first dataset is from original ATACseq paper.
Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf.
In particular, we will make use of the ATACseq_50k_Rep2 sample GEO - GSM1155958 Data can be retrieved in FASTQ format from ENA
For the second dataset we take ATACseq generated by Bing Ren at UCSD as part of the ENCODE consortium. It includes samples from several tissues in Mice. Links to data and sample information are included in list below.
Liver day 12 - ENCSR302LIV
Kidney day 15 - ENCSR023QZX
Hindbrain day 12 - ENCSR088UYE
Finally, I have processed some of the data from Christina Leslie’s lab at MSKCC exactly as described in this session, so we can review some of the characteristics of ATACseq data alongside the same data processed by ENCODE’s pipeline in the exercises.
The raw data and processed BAM file is available from ENCODEs portal
FQ files can be found can be found here for read1 and here for read2.
We will also work with the aligned data as a BAM file which can be found here.
For ATACseq analysis we will require a few pieces of reference data.
This includes:
Reference genome in fasta format - We will retrieve these from BSGenome Bioconductor annotation packages.
Gene models - We will retrieve these from TxDb Bioconductor annotation packages.
Blacklists - Artifact regions specific to genomes. These can be found in ENCODE portal here
We start with public sequencing data in links below and use reference data in Bioconductor.
Since some of these processing steps may take a little time I provide links to pre-processed results.
BAM file and BAI index from our alignment/sorting/indexing.
SAMN02192806 - Greenleaf BAM - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing.
SAMN02192806 - Greenleaf BAI index - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing.
Small BAM, peak calls and directory structure.
Once you have downloaded the above and unzipped ATAC_Workshop.zip, you should move the Sorted_ATAC_50K_2.bam and Sorted_ATAC_50K_2.bam.bai file into ATAC_Workshop/ATAC_Data/ATAC_BAM/
You should also copy the RU_ATAC_Workshop.Rmd to ATAC_Workshop/ directory and open then to make sure all relative paths are correct.
Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.
Bigwigs for IGV.
Prior to alignemnt we recommend spending some time reviewing the FASTQ files. Some basic QC checks can help us see if there any biases in your artifacts in your sequencing such as unexpected drops in read quality, or nonrandom GC content.
For guides on how to check your FASTQ quality, check out that section in our ChIPseq course.
In this section we will work a little with the Greenleaf dataset.
We will process one sample of the Greenleaf data from FASTQ to BAM to allow us to review some of the features of ATACseq data and to create some processed files for review and further analysis.
First we will need to create a reference genome to align our ATACseq data. We can create a FASTA file for alignment from the Bioconductor BSGenome objects.
We have done this previously for the ChIPseq and RNAseq analysis.
We are working with human data this time, so we will use the BSgenome.Hsapiens.UCSC.hg19 library for the hg19 genome build.
library(BSgenome.Hsapiens.UCSC.hg19)
<- paste0("chr",c(1:21,"X","Y","M"))
mainChromosomes <- lapply(mainChromosomes,
mainChrSeq function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
<- DNAStringSet(mainChrSeq)
mainChrSeqSet writeXStringSet(mainChrSeqSet,
"BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa")
For Rsubread we must build our index prior to Rsubread’s alignment steps.
Rsubread’s buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from..
Here I additional specify the parameter indexSplit as TRUE in conjunction with memory parameter set to 1000 (1000MB) to control memory usage in Rsubread alignment step.
library(Rsubread)
buildindex("BSgenome.Hsapiens.UCSC.hg19.mainChrs",
"BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa",
indexSplit = TRUE,
memory = 1000)
Now we have an index, we can align our ATACseq reads.
As ATACseq data is typically paired-end sequencing we will need to make some minor adjustments to our alignment steps.
Paired-end sequencing data will most often comes as two files typically with *_1* and *_2* or *_R1* and *_R2* in file name to denote which number in pair, a file is.
<- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz"
read1 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz" read2
Our two matched paired-end read files will (most often) contain the same number of reads and the reads will be ordered the same in both files.
The read names will match across files for paired reads with the exception of a 1 or 2 in name to signify if the read was first or second in a pair.
require(ShortRead)
<- readFastq("data/ATACSample_r1.fastq.gz")
read1 <- readFastq("data/ATACSample_r2.fastq.gz")
read2 id(read1)[1:2]
## BStringSet object of length 2:
## width seq
## [1] 59 HISEQ:236:HA03EADXX:1:1101:1147:2237 1:Y:0:TAAGGCGACTCTCTAT
## [2] 59 HISEQ:236:HA03EADXX:1:1101:1201:2248 1:N:0:TAAGGCGACTCTCTAT
id(read2)[1:2]
## BStringSet object of length 2:
## width seq
## [1] 59 HISEQ:236:HA03EADXX:1:1101:1147:2237 2:Y:0:TAAGGCGACTCTCTAT
## [2] 59 HISEQ:236:HA03EADXX:1:1101:1201:2248 2:N:0:TAAGGCGACTCTCTAT
The distance between paired reads is important in ATACseq to allow us to distinguish reads mapping from short or long fragments indicating nucleosome free and nucleosome portions of signal respectively.
Following the alignment step the insert size gives us the total distance between the start of read1 and read2.
We can use a standard alignment for DNA (as for ChIPseq), but we increase the maximum allowed fragment length to capture long fragments representing poly-nucleosome signal.
The maximum allowed fragment length set here is based on parameters used within Greenleaf study. To control maximum allowed fragment lengths I set the maxFragLength parameter to 2000. I also set unique parameter to TRUE to only include uniquely mapping reads.
align("BSgenome.Hsapiens.UCSC.hg19.mainChrs",
readfile1=read1,readfile2=read2,
output_file = "ATAC_50K_2.bam",
nthreads=2,type=1,
unique=TRUE,maxFragLength = 2000)
To use Rbowtie2 we must also build our index prior to aligment. Here we use the bowtie2_build() function specifying the parameters of our FASTA file to build index from and the desired index name.
library(Rbowtie2)
bowtie2_build(references="BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa",
bt2Index="BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2")
Once we have our index we must decompress our FASTQ files using gunzip() function.
gunzip("ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz")
gunzip("ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz")
Now we can align our FASTQ to the genome with the bowtie2() function specifying our read1 and read2 to seq1 and seq2 parameters.
Finally we can convert the output SAM file to a more useable BAM file with asBam() function.
NOTE: SAM and uncompressed FASTQ files can use a lot of disk space. Once your done, it is good to recompress the FASTQs and remove the SAM files with the unlink() function.
library(Rsamtools)
bowtie2(bt2Index = "BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2",
samOutput = "ATAC_50K_2_bowtie2.sam",
seq1 = "ATAC_Data/ATAC_FQs/SRR891269_1.fastq",
seq1 = "ATAC_Data/ATAC_FQs/SRR891269_2.fastq"
)asBam("ATAC_50K_2_bowtie2.sam")
Following alignment we would want to sort and index our BAM file for use with external tools.
First we sort our aligned data by sequence order (not Read Name here).
We then index our file allowing for rapid access of particular genomic locations by other programs (e.g IGV, Samtools) and by R/Bioconductor packages we will use.
library(Rsamtools)
<- file.path(dirname(outBAM),
sortedBAM paste0("Sorted_",basename(outBAM))
)
sortBam(outBAM,gsub("\\.bam","",basename(sortedBAM)))
indexBam(sortedBAM)
In ATACseq we will want to check the distribution of mapped reads across chromosomes. We can check the number of mapped reads on every chromosome using the idxstatsBam() function.
ATACseq is known have high signal on the mitochondrial chromosomes and so we can check for that here.
library(Rsamtools)
<- idxstatsBam(sortedBAM) mappedReads
We can now use the mapped reads data frame to make a barplot of reads across chromosomes.
In this example, we see a case where the mapping rate to mitochondrial genome is high.
library(ggplot2)
ggplot(mappedReads, aes(seqnames, mapped, fill = seqnames)) + geom_bar(stat = "identity") +
coord_flip()
Now we have processed the Greenleaf ATACseq paired-end data we can start to work with alignments.
First we will identify the expected fragment length distribution for ATACseq data. We read our newly aligned data using the GenomicAlignments package.
Here we only wants read which are properly paired so we will use the ScanBamParam() and scanBamFlag() functions to control what will be read into R.
We set the scanBamFlag() function parameters isProperPair to TRUE so as to only read in reads paired in alignment within our preset max fragment length (2000bpp)
library(GenomicAlignments)
= scanBamFlag(isProperPair = TRUE) flags
We can now use these flags with the ScanBamParam() function to read in only properly paired reads.
We additionally specify information to be read into R using the what parameter. Importantly we specify the insert size information - isize. To reduce memory footprint we read only information from chromosome 20 by specifying a GRanges object to which parameter.
= ScanBamParam(flag = flags, what = c("qname", "mapq", "isize"), which = GRanges("chr20",
myParam IRanges(1, 63025520)))
myParam
## class: ScanBamParam
## bamFlag (NA unless specified): isProperPair=TRUE
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:
## bamTagFilter:
## bamWhich: 1 ranges
## bamWhat: qname, mapq, isize
## bamMapqFilter: NA
Now we have set up the ScanBamParam object, we can use the readGAlignmentPairs() function to read in our paired-end ATACseq data in a similar way to how we read in single-end ChIP-seq data using the readGAlignments() function.
The resulting object is a GAlignmentPairs object.
<- readGAlignmentPairs(sortedBAM, param = myParam)
atacReads class(atacReads)
## [1] "GAlignmentPairs"
## attr(,"package")
## [1] "GenomicAlignments"
The GAlignmentPairs object contains information on our paired reads.
It stores information on each read in a pair in parallel GAlignments objects.
1:2, ] atacReads[
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr20 + : 60000-60029 -- 60361-60410
## [2] chr20 + : 60000-60029 -- 60361-60410
## -------
## seqinfo: 24 sequences from an unspecified genome
We access the GAlignments objects using the first() and second() accessor functions to gain information on the first or second read respectively.
<- first(atacReads)
read1 <- second(atacReads)
read2 1, ] read2[
## GAlignments object with 1 alignment and 3 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr20 - 50M 50 60361 60410 50
## njunc | qname mapq isize
## <integer> | <character> <integer> <integer>
## [1] 0 | SRR891269.17032370 40 -411
## -------
## seqinfo: 24 sequences from an unspecified genome
One of the first we can do is obtain the distribution of MapQ scores for our read1 and read2. We can access this for each read using mcols() function to access the mapq slot of the GAalignments object for each read.
<- mcols(read1)$mapq
read1MapQ <- mcols(read2)$mapq
read2MapQ 1:2] read1MapQ[
## [1] 40 40
We can then use the table() function to summarize the frequency of scores for each read in pair.
<- table(read1MapQ)
read1MapQFreqs <- table(read2MapQ)
read2MapQFreqs read1MapQFreqs
## read1MapQ
## 6 8 10 13 20 40
## 252 1003 3181 11027 26359 264748
read2MapQFreqs
## read2MapQ
## 6 8 10 13 20 40
## 318 1112 4158 12610 27950 260422
Finally we can plot the distributions of MapQ for each read in pair using ggplot2.
library(ggplot2)
<- data.frame(MapQ = c(names(read1MapQFreqs), names(read2MapQFreqs)), Frequency = c(read1MapQFreqs,
toPlot Read = c(rep("Read1", length(read1MapQFreqs)), rep("Read2",
read2MapQFreqs), length(read2MapQFreqs))))
$MapQ <- factor(toPlot$MapQ, levels = unique(sort(as.numeric(toPlot$MapQ))))
toPlotggplot(toPlot, aes(x = MapQ, y = Frequency, fill = MapQ)) + geom_bar(stat = "identity") +
facet_grid(~Read)
Now we have reads in the paired aligned data into R, we can retrieve the insert sizes from the elementMetadata() attached to GAlignments objects of each read pair.
Since properly paired reads will have the same insert size length we extract insert sizes from read1.
<- first(atacReads)
atacReads_read1 <- abs(elementMetadata(atacReads_read1)$isize)
insertSizes head(insertSizes)
## [1] 411 411 228 60 228 176
ATACseq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and polynucleosome fractions.
We can use the table() function to retrieve a vector of the occurrence of each fragment length.
<- table(insertSizes)
fragLenSizes 1:5] fragLenSizes[
## insertSizes
## 50 51 52 53 54
## 28311 2881 3050 2679 2433
Now we can use the newly acquired insert lengths for chromosome 20 to plot the distribution of all fragment lengths.
library(ggplot2)
<- data.frame(InsertSize = as.numeric(names(fragLenSizes)), Count = as.numeric(fragLenSizes))
toPlot <- ggplot(toPlot, aes(x = InsertSize, y = Count)) + geom_line() fragLenPlot
+ theme_bw() fragLenPlot
We can apply a log2 transformation to the counts to clarify the nucloesome patterning.
+ scale_y_continuous(trans = "log2") + theme_bw() fragLenPlot
This looks very similar to the image from the Greenleaf paper.
We can now annotate our nucleosome free (< 100bp), mononucleosome (180bp-247bp) and dinucleosome (315-437) as in the Greenleaf study.
+ scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180,
fragLenPlot 247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") +
geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw()
We may wish to divide our aligned reads into reads representing nucleosome free and nucleosome occupied.
Here we create BAM files for the reads representing nucleosome free, mono and di nucleosome by using insert sizes to filter read.
<- atacReads[insertSizes < 100, ]
atacReads_NucFree <- atacReads[insertSizes > 180 & insertSizes < 240, ]
atacReads_MonoNuc <- atacReads[insertSizes > 315 & insertSizes < 437, ] atacReads_diNuc
The resulting reads can be written back to BAM files for use in other parts of our analysis or for visualization in programs such as IGV by functions in the rtracklayer package.
<- gsub("\\.bam", "_nucFreeRegions\\.bam", sortedBAM)
nucFreeRegionBam <- gsub("\\.bam", "_monoNuc\\.bam", sortedBAM)
monoNucBam <- gsub("\\.bam", "_diNuc\\.bam", sortedBAM)
diNucBam
library(rtracklayer)
export(atacReads_NucFree, nucFreeRegionBam, format = "bam")
export(atacReads_MonoNuc, monoNucBam, format = "bam")
export(atacReads_diNuc, diNucBam, format = "bam")
We can recreate the full length fragments from our single end reads to evaluate duplication rate and create a fragment bigwig.
Here we use the granges() function to recreate full fragments from the paired single-end reads.
1, ] atacReads[
## GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
## seqnames strand : ranges -- ranges
## <Rle> <Rle> : <IRanges> -- <IRanges>
## [1] chr20 + : 60000-60029 -- 60361-60410
## -------
## seqinfo: 24 sequences from an unspecified genome
<- granges(atacReads)
atacFragments 1, ] atacFragments[
## GRanges object with 1 range and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 60000-60410 +
## -------
## seqinfo: 24 sequences from an unspecified genome
We can use the duplicated() function to identify our non-redundant (non-duplicate) fraction of our full length fragments.
<- sum(duplicated(atacFragments))
duplicatedFragments <- length(atacFragments)
totalFragments <- duplicatedFragments/totalFragments
duplicateRate <- 1 - duplicateRate
nonRedundantFraction nonRedundantFraction
## [1] 0.7491177
We can make it significantly quicker to review the pile-up of ATACseq signal in a genome browser by creating a bigWig file.
Additional normalization to total mapped reads could be applied at this point.
<- gsub("\\.bam", "_openRegionRPM\\.bw", sortedBAM)
openRegionRPMBigWig <- coverage(atacFragments, weight = (10^6/length(atacFragments)))
myCoverage export.bw(myCoverage, openRegionRPMBigWig)
The ATACseqQC library allows us to run many of the ATACseq QC steps we have seen in a single step. It may consume a little more memory but will allow for the inclusion of two more useful metrics called the PCR Bottleneck Coefficients (PBC1 and PBC2).
First we must install the library.
::install("ATACseqQC") BiocManager
As with ChIPQC, the ATACseqQC function contains a workflow function which will acquire much of the required QC with a single argument of the BAM file path.
Since this can be fairly memory heavy, I am just illustrating it here on a BAM file containing just the chromosome 17 reads of the ATACseq data.
library(ATACseqQC)
<- bamQC("~/Downloads/Sorted_ATAC_50K_2_ch17.bam") ATACQC
The resulting ATACQC object has many slots of QC information including duplicateRate, non-redundant fraction, distribution of signal across chromosomes, mitochondrial fraction etc.
These include the PCRbottleneckCoefficient_1 and PCRbottleneckCoefficient_2 values.
names(ATACQC)
## [1] "totalQNAMEs" "duplicateRate"
## [3] "mitochondriaRate" "properPairRate"
## [5] "unmappedRate" "hasUnmappedMateRate"
## [7] "notPassingQualityControlsRate" "nonRedundantFraction"
## [9] "PCRbottleneckCoefficient_1" "PCRbottleneckCoefficient_2"
## [11] "MAPQ" "idxstats"
PCR bottleneck coefficients identify PCR bias/overamplification which may have occurred in preparation of ATAC samples.
The PCRbottleneckCoefficient_1 is calculated as the number of positions in genome with exactly 1 read mapped uniquely compared to the number of positions with at least 1 read.
For example if we have 20 reads. 16 map uniquely to locations. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/18. We therefore have a PBC1 of 0.889
Values less than 0.7 indicate severe bottlenecking, between 0.7 and 0.9 indicate moderate bottlenecking. Greater than 0.9 show no bottlenecking.
$PCRbottleneckCoefficient_1 ATACQC
## [1] 0.7406656
The PCRbottleneckCoefficient_2 is our secondary measure of bottlenecking. It is calculated as the number of positions in genome with exactly 1 read mapped uniquely compared to the number of positions with exactly 2 reads mapping uniquely.
We can reuse our example. If we have 20 reads, 16 of which map uniquely. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/2. We therefore have a PBC2 of 8.
Values less than 1 indicate severe bottlenecking, between 1 and 3 indicate moderate bottlenecking. Greater than 3 show no bottlenecking.
$PCRbottleneckCoefficient_2 ATACQC
## [1] 3.73014
Exercise on ATACseq data can be found here