ATACseq (part 1)


ATACseq

ATACseq (Assay for Transposase-Accessible Chromatin using sequencing) uses a transposase to efficiently fragment accessible DNA prior to sequencing. The result provides a method to map the accessible/open chromatin genome wide.

In contrast to other techniques, ATACseq has several advantages including

  • Low input material required (> 10,000 cells)
  • Rapid experiment protocol (~ 4 hrs.)

offset

ATACseq, MNaseseq and DNaseseq

offset

  • DNaseseq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites.

  • MNaseseq - Enzymatic digestion to extract signal repesenting nucleosome positioning.

  • ATACseq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.

Working with ATACseq data

In this session we will look at some of the basics of ATACseq processing in R using some publically available data.

We will look at the alignment, post-alignment processing and plotting of ATACseq data over TSSs.

Data

In the ATACseq session we will make use of three sets of published data.

The first dataset is from original ATACseq paper.

Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf.

In particular, we will make use of the ATACseq_50k_Rep2 sample GEO - GSM1155958 Data can be retrieved in FASTQ format from ENA

  • SAMN02192806 - here

Data

For the second dataset we take ATACseq generated by Bing Ren at UCSD as part of the ENCODE consortium. It includes samples from several tissues in Mice. Links to data and sample information are included in list below.

Data

Finally, I have processed some of the data from Christina Leslie’s lab at MSKCC exactly as described in this session, so we can review some of the characteristics of ATACseq data alongside the same data processed by ENCODE’s pipeline in the exercises.

The raw data and processed BAM file is available from ENCODEs portal

FQ files can be found can be found here for read1 and here for read2.

We will also work with the aligned data as a BAM file which can be found here.

The Reference Data

For ATACseq analysis we will require a few pieces of reference data.

This includes:

  • Reference genome in fasta format - We will retrieve these from BSGenome Bioconductor annotation packages.

  • Gene models - We will retrieve these from TxDb Bioconductor annotation packages.

  • Blacklists - Artifact regions specific to genomes. These can be found in ENCODE portal here

Processed Data

We start with public sequencing data in links below and use reference data in Bioconductor.

Since some of these processing steps may take a little time I provide links to pre-processed results.

Essentials

BAM file and BAI index from our alignment/sorting/indexing.

Processed Data

Essentials

Small BAM, peak calls and directory structure.

Once you have downloaded the above and unzipped ATAC_Workshop.zip, you should move the Sorted_ATAC_50K_2.bam and Sorted_ATAC_50K_2.bam.bai file into ATAC_Workshop/ATAC_Data/ATAC_BAM/

You should also copy the RU_ATAC_Workshop.Rmd to ATAC_Workshop/ directory and open then to make sure all relative paths are correct.

Processed Data

Not essential

Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.

Bigwigs for IGV.

  • Bigwigs - BigWigs to review in IGV.

Alignment for ATACseq


FASTQ QC

Prior to alignemnt we recommend spending some time reviewing the FASTQ files. Some basic QC checks can help us see if there any biases in your artifacts in your sequencing such as unexpected drops in read quality, or nonrandom GC content.

For guides on how to check your FASTQ quality, check out that section in our ChIPseq course.

Align Greenleaf FASTQs

In this section we will work a little with the Greenleaf dataset.

We will process one sample of the Greenleaf data from FASTQ to BAM to allow us to review some of the features of ATACseq data and to create some processed files for review and further analysis.

Creating a reference genome

First we will need to create a reference genome to align our ATACseq data. We can create a FASTA file for alignment from the Bioconductor BSGenome objects.

We have done this previously for the ChIPseq and RNAseq analysis.

We are working with human data this time, so we will use the BSgenome.Hsapiens.UCSC.hg19 library for the hg19 genome build.

library(BSgenome.Hsapiens.UCSC.hg19)
mainChromosomes <- paste0("chr",c(1:21,"X","Y","M"))
mainChrSeq <- lapply(mainChromosomes,
                     function(x)BSgenome.Hsapiens.UCSC.hg19[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
writeXStringSet(mainChrSeqSet,
                "BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa")

Creating Rsubread index

For Rsubread we must build our index prior to Rsubread’s alignment steps.

Rsubread’s buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from..

Here I additional specify the parameter indexSplit as TRUE in conjunction with memory parameter set to 1000 (1000MB) to control memory usage in Rsubread alignment step.

library(Rsubread)
buildindex("BSgenome.Hsapiens.UCSC.hg19.mainChrs",
           "BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa",
           indexSplit = TRUE,
           memory = 1000)

Aligning Sequence Reads

Now we have an index, we can align our ATACseq reads.

As ATACseq data is typically paired-end sequencing we will need to make some minor adjustments to our alignment steps.

Paired-end sequencing data will most often comes as two files typically with *_1* and *_2* or *_R1* and *_R2* in file name to denote which number in pair, a file is.

read1 <- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz"
read2 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz"

Aligning Sequence Reads

Our two matched paired-end read files will (most often) contain the same number of reads and the reads will be ordered the same in both files.

The read names will match across files for paired reads with the exception of a 1 or 2 in name to signify if the read was first or second in a pair.

require(ShortRead)
read1 <- readFastq("data/ATACSample_r1.fastq.gz")
read2 <- readFastq("data/ATACSample_r2.fastq.gz")
id(read1)[1:2]
## BStringSet object of length 2:
##     width seq
## [1]    59 HISEQ:236:HA03EADXX:1:1101:1147:2237 1:Y:0:TAAGGCGACTCTCTAT
## [2]    59 HISEQ:236:HA03EADXX:1:1101:1201:2248 1:N:0:TAAGGCGACTCTCTAT
id(read2)[1:2]
## BStringSet object of length 2:
##     width seq
## [1]    59 HISEQ:236:HA03EADXX:1:1101:1147:2237 2:Y:0:TAAGGCGACTCTCTAT
## [2]    59 HISEQ:236:HA03EADXX:1:1101:1201:2248 2:N:0:TAAGGCGACTCTCTAT

Aligning Sequence Reads

The distance between paired reads is important in ATACseq to allow us to distinguish reads mapping from short or long fragments indicating nucleosome free and nucleosome portions of signal respectively.

Following the alignment step the insert size gives us the total distance between the start of read1 and read2.

offset

Aligning Sequence Reads

We can use a standard alignment for DNA (as for ChIPseq), but we increase the maximum allowed fragment length to capture long fragments representing poly-nucleosome signal.

The maximum allowed fragment length set here is based on parameters used within Greenleaf study. To control maximum allowed fragment lengths I set the maxFragLength parameter to 2000. I also set unique parameter to TRUE to only include uniquely mapping reads.

align("BSgenome.Hsapiens.UCSC.hg19.mainChrs",
      readfile1=read1,readfile2=read2,
      output_file = "ATAC_50K_2.bam",
      nthreads=2,type=1,
      unique=TRUE,maxFragLength = 2000)

Aligning with Rbowtie2

To use Rbowtie2 we must also build our index prior to aligment. Here we use the bowtie2_build() function specifying the parameters of our FASTA file to build index from and the desired index name.

library(Rbowtie2)
bowtie2_build(references="BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa", 
              bt2Index="BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2")

Decompressing FASTQ files

Once we have our index we must decompress our FASTQ files using gunzip() function.

gunzip("ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz")
gunzip("ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz")

Creating Rbowtie2 index

Now we can align our FASTQ to the genome with the bowtie2() function specifying our read1 and read2 to seq1 and seq2 parameters.

Finally we can convert the output SAM file to a more useable BAM file with asBam() function.

NOTE: SAM and uncompressed FASTQ files can use a lot of disk space. Once your done, it is good to recompress the FASTQs and remove the SAM files with the unlink() function.

library(Rsamtools)
bowtie2(bt2Index = "BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2",
          samOutput = "ATAC_50K_2_bowtie2.sam",
          seq1 = "ATAC_Data/ATAC_FQs/SRR891269_1.fastq",
          seq1 = "ATAC_Data/ATAC_FQs/SRR891269_2.fastq"
        )
asBam("ATAC_50K_2_bowtie2.sam")

Sorting and Indexing

Following alignment we would want to sort and index our BAM file for use with external tools.

First we sort our aligned data by sequence order (not Read Name here).

We then index our file allowing for rapid access of particular genomic locations by other programs (e.g IGV, Samtools) and by R/Bioconductor packages we will use.

library(Rsamtools)
sortedBAM <- file.path(dirname(outBAM),
                       paste0("Sorted_",basename(outBAM))
                       )

sortBam(outBAM,gsub("\\.bam","",basename(sortedBAM)))
indexBam(sortedBAM)

Distribution of mapped reads

In ATACseq we will want to check the distribution of mapped reads across chromosomes. We can check the number of mapped reads on every chromosome using the idxstatsBam() function.

ATACseq is known have high signal on the mitochondrial chromosomes and so we can check for that here.

library(Rsamtools)
mappedReads <- idxstatsBam(sortedBAM)

Distribution of mapped reads

We can now use the mapped reads data frame to make a barplot of reads across chromosomes.

In this example, we see a case where the mapping rate to mitochondrial genome is high.

library(ggplot2)

ggplot(mappedReads, aes(seqnames, mapped, fill = seqnames)) + geom_bar(stat = "identity") +
    coord_flip()

Post-alignment processing


Proper pairs

Now we have processed the Greenleaf ATACseq paired-end data we can start to work with alignments.

First we will identify the expected fragment length distribution for ATACseq data. We read our newly aligned data using the GenomicAlignments package.

Here we only wants read which are properly paired so we will use the ScanBamParam() and scanBamFlag() functions to control what will be read into R.

We set the scanBamFlag() function parameters isProperPair to TRUE so as to only read in reads paired in alignment within our preset max fragment length (2000bpp)

library(GenomicAlignments)
flags = scanBamFlag(isProperPair = TRUE)

Proper pairs

We can now use these flags with the ScanBamParam() function to read in only properly paired reads.

We additionally specify information to be read into R using the what parameter. Importantly we specify the insert size information - isize. To reduce memory footprint we read only information from chromosome 20 by specifying a GRanges object to which parameter.

myParam = ScanBamParam(flag = flags, what = c("qname", "mapq", "isize"), which = GRanges("chr20",
    IRanges(1, 63025520)))
myParam
## class: ScanBamParam
## bamFlag (NA unless specified): isProperPair=TRUE
## bamSimpleCigar: FALSE
## bamReverseComplement: FALSE
## bamTag:  
## bamTagFilter:
## bamWhich: 1 ranges
## bamWhat: qname, mapq, isize
## bamMapqFilter: NA

Proper pairs

Now we have set up the ScanBamParam object, we can use the readGAlignmentPairs() function to read in our paired-end ATACseq data in a similar way to how we read in single-end ChIP-seq data using the readGAlignments() function.

The resulting object is a GAlignmentPairs object.

atacReads <- readGAlignmentPairs(sortedBAM, param = myParam)
class(atacReads)
## [1] "GAlignmentPairs"
## attr(,"package")
## [1] "GenomicAlignments"

GAlignmentPairs

The GAlignmentPairs object contains information on our paired reads.

It stores information on each read in a pair in parallel GAlignments objects.

atacReads[1:2, ]
## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns:
##       seqnames strand :      ranges --      ranges
##          <Rle>  <Rle> :   <IRanges> --   <IRanges>
##   [1]    chr20      + : 60000-60029 -- 60361-60410
##   [2]    chr20      + : 60000-60029 -- 60361-60410
##   -------
##   seqinfo: 24 sequences from an unspecified genome

GAlignmentPairs

We access the GAlignments objects using the first() and second() accessor functions to gain information on the first or second read respectively.

read1 <- first(atacReads)
read2 <- second(atacReads)
read2[1, ]
## GAlignments object with 1 alignment and 3 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr20      -         50M        50     60361     60410        50
##           njunc |              qname      mapq     isize
##       <integer> |        <character> <integer> <integer>
##   [1]         0 | SRR891269.17032370        40      -411
##   -------
##   seqinfo: 24 sequences from an unspecified genome

Retrieve MapQ scores

One of the first we can do is obtain the distribution of MapQ scores for our read1 and read2. We can access this for each read using mcols() function to access the mapq slot of the GAalignments object for each read.

read1MapQ <- mcols(read1)$mapq
read2MapQ <- mcols(read2)$mapq
read1MapQ[1:2]
## [1] 40 40

MapQ score frequencies

We can then use the table() function to summarize the frequency of scores for each read in pair.

read1MapQFreqs <- table(read1MapQ)
read2MapQFreqs <- table(read2MapQ)
read1MapQFreqs
## read1MapQ
##      6      8     10     13     20     40 
##    252   1003   3181  11027  26359 264748
read2MapQFreqs
## read2MapQ
##      6      8     10     13     20     40 
##    318   1112   4158  12610  27950 260422

Plot MapQ scores

Finally we can plot the distributions of MapQ for each read in pair using ggplot2.

library(ggplot2)
toPlot <- data.frame(MapQ = c(names(read1MapQFreqs), names(read2MapQFreqs)), Frequency = c(read1MapQFreqs,
    read2MapQFreqs), Read = c(rep("Read1", length(read1MapQFreqs)), rep("Read2",
    length(read2MapQFreqs))))
toPlot$MapQ <- factor(toPlot$MapQ, levels = unique(sort(as.numeric(toPlot$MapQ))))
ggplot(toPlot, aes(x = MapQ, y = Frequency, fill = MapQ)) + geom_bar(stat = "identity") +
    facet_grid(~Read)

Retrieving insert sizes

Now we have reads in the paired aligned data into R, we can retrieve the insert sizes from the elementMetadata() attached to GAlignments objects of each read pair.

Since properly paired reads will have the same insert size length we extract insert sizes from read1.

atacReads_read1 <- first(atacReads)
insertSizes <- abs(elementMetadata(atacReads_read1)$isize)
head(insertSizes)
## [1] 411 411 228  60 228 176

Plotting insert sizes

ATACseq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and polynucleosome fractions.

We can use the table() function to retrieve a vector of the occurrence of each fragment length.

fragLenSizes <- table(insertSizes)
fragLenSizes[1:5]
## insertSizes
##    50    51    52    53    54 
## 28311  2881  3050  2679  2433

Plotting insert sizes

Now we can use the newly acquired insert lengths for chromosome 20 to plot the distribution of all fragment lengths.

library(ggplot2)
toPlot <- data.frame(InsertSize = as.numeric(names(fragLenSizes)), Count = as.numeric(fragLenSizes))
fragLenPlot <- ggplot(toPlot, aes(x = InsertSize, y = Count)) + geom_line()

Plotting insert sizes

fragLenPlot + theme_bw()

Plotting insert sizes

We can apply a log2 transformation to the counts to clarify the nucloesome patterning.

fragLenPlot + scale_y_continuous(trans = "log2") + theme_bw()

Plotting insert sizes

This looks very similar to the image from the Greenleaf paper.

We can now annotate our nucleosome free (< 100bp), mononucleosome (180bp-247bp) and dinucleosome (315-437) as in the Greenleaf study.

fragLenPlot + scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180,
    247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") +
    geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw()

Subsetting ATACseq reads by insert sizes

We may wish to divide our aligned reads into reads representing nucleosome free and nucleosome occupied.

Here we create BAM files for the reads representing nucleosome free, mono and di nucleosome by using insert sizes to filter read.

atacReads_NucFree <- atacReads[insertSizes < 100, ]
atacReads_MonoNuc <- atacReads[insertSizes > 180 & insertSizes < 240, ]
atacReads_diNuc <- atacReads[insertSizes > 315 & insertSizes < 437, ]

Creating BAM files split by insert sizes

The resulting reads can be written back to BAM files for use in other parts of our analysis or for visualization in programs such as IGV by functions in the rtracklayer package.

nucFreeRegionBam <- gsub("\\.bam", "_nucFreeRegions\\.bam", sortedBAM)
monoNucBam <- gsub("\\.bam", "_monoNuc\\.bam", sortedBAM)
diNucBam <- gsub("\\.bam", "_diNuc\\.bam", sortedBAM)

library(rtracklayer)
export(atacReads_NucFree, nucFreeRegionBam, format = "bam")
export(atacReads_MonoNuc, monoNucBam, format = "bam")
export(atacReads_diNuc, diNucBam, format = "bam")

Creating fragment GRanges

We can recreate the full length fragments from our single end reads to evaluate duplication rate and create a fragment bigwig.

Here we use the granges() function to recreate full fragments from the paired single-end reads.

atacReads[1, ]
## GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns:
##       seqnames strand :      ranges --      ranges
##          <Rle>  <Rle> :   <IRanges> --   <IRanges>
##   [1]    chr20      + : 60000-60029 -- 60361-60410
##   -------
##   seqinfo: 24 sequences from an unspecified genome
atacFragments <- granges(atacReads)
atacFragments[1, ]
## GRanges object with 1 range and 0 metadata columns:
##       seqnames      ranges strand
##          <Rle>   <IRanges>  <Rle>
##   [1]    chr20 60000-60410      +
##   -------
##   seqinfo: 24 sequences from an unspecified genome

Creating fragment GRanges

We can use the duplicated() function to identify our non-redundant (non-duplicate) fraction of our full length fragments.

duplicatedFragments <- sum(duplicated(atacFragments))
totalFragments <- length(atacFragments)
duplicateRate <- duplicatedFragments/totalFragments
nonRedundantFraction <- 1 - duplicateRate
nonRedundantFraction
## [1] 0.7491177

Creating an open region bigWig

We can make it significantly quicker to review the pile-up of ATACseq signal in a genome browser by creating a bigWig file.

Additional normalization to total mapped reads could be applied at this point.

openRegionRPMBigWig <- gsub("\\.bam", "_openRegionRPM\\.bw", sortedBAM)
myCoverage <- coverage(atacFragments, weight = (10^6/length(atacFragments)))
export.bw(myCoverage, openRegionRPMBigWig)

offset

ATACseqQC


ATACseqQC

The ATACseqQC library allows us to run many of the ATACseq QC steps we have seen in a single step. It may consume a little more memory but will allow for the inclusion of two more useful metrics called the PCR Bottleneck Coefficients (PBC1 and PBC2).

First we must install the library.

BiocManager::install("ATACseqQC")

ATACseqQC

As with ChIPQC, the ATACseqQC function contains a workflow function which will acquire much of the required QC with a single argument of the BAM file path.

Since this can be fairly memory heavy, I am just illustrating it here on a BAM file containing just the chromosome 17 reads of the ATACseq data.

library(ATACseqQC)
ATACQC <- bamQC("~/Downloads/Sorted_ATAC_50K_2_ch17.bam")

ATACseqQC

The resulting ATACQC object has many slots of QC information including duplicateRate, non-redundant fraction, distribution of signal across chromosomes, mitochondrial fraction etc.

These include the PCRbottleneckCoefficient_1 and PCRbottleneckCoefficient_2 values.

names(ATACQC)
##  [1] "totalQNAMEs"                   "duplicateRate"                
##  [3] "mitochondriaRate"              "properPairRate"               
##  [5] "unmappedRate"                  "hasUnmappedMateRate"          
##  [7] "notPassingQualityControlsRate" "nonRedundantFraction"         
##  [9] "PCRbottleneckCoefficient_1"    "PCRbottleneckCoefficient_2"   
## [11] "MAPQ"                          "idxstats"

PCR bottleneck coefficients

PCR bottleneck coefficients identify PCR bias/overamplification which may have occurred in preparation of ATAC samples.

The PCRbottleneckCoefficient_1 is calculated as the number of positions in genome with exactly 1 read mapped uniquely compared to the number of positions with at least 1 read.

For example if we have 20 reads. 16 map uniquely to locations. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/18. We therefore have a PBC1 of 0.889

Values less than 0.7 indicate severe bottlenecking, between 0.7 and 0.9 indicate moderate bottlenecking. Greater than 0.9 show no bottlenecking.

ATACQC$PCRbottleneckCoefficient_1
## [1] 0.7406656

PCR bottleneck coefficients

The PCRbottleneckCoefficient_2 is our secondary measure of bottlenecking. It is calculated as the number of positions in genome with exactly 1 read mapped uniquely compared to the number of positions with exactly 2 reads mapping uniquely.

We can reuse our example. If we have 20 reads, 16 of which map uniquely. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/2. We therefore have a PBC2 of 8.

Values less than 1 indicate severe bottlenecking, between 1 and 3 indicate moderate bottlenecking. Greater than 3 show no bottlenecking.

ATACQC$PCRbottleneckCoefficient_2
## [1] 3.73014

Time for an exercise!

Exercise on ATACseq data can be found here

Answers to exercise

Answers can be found here

R code for solutions can be found here