class: center, middle, inverse, title-slide # ATACseq In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/RU_ATACseq/
--- ## ATACseq ATACseq (Assay for Transposase-Accessible Chromatin using sequencing) uses a transposase to efficiently fragment accessible DNA prior to sequencing. The result provides a method to map the accessible/open chromatin genome wide. In contrast to other techniques, ATACseq has several advantages including * Low input material required (> 10,000 cells) * Rapid experiment protocol (~ 4 hrs.) <div align="center"> <img src="imgs/ATACseqP.jpeg" alt="offset" height="300" width="600"> </div> --- ## ATACseq, MNaseseq and DNaseseq <div align="center"> <img src="imgs/mnATAC.jpg" alt="offset" height="300" width="600"> </div> * DNaseseq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites. * MNaseseq - Enzymatic digestion to extract signal repesenting nucleosome positioning. * ATACseq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample. --- ## Working with ATACseq data In this session we will look at some of the basics of ATACseq processing in R using some publically available data. We will look at the alignment, post-alignment processing and plotting of ATACseq data over TSSs. --- ## Data In the ATACseq session we will make use of three sets of published data. The first dataset is from original ATACseq paper. [**Transposition of native chromatin for multimodal regulatory analysis and personal epigenomics**](https://pubmed.ncbi.nlm.nih.gov/24097267/) Jason D. Buenrostro, Paul G. Giresi, Lisa C. Zaba, Howard Y. Chang, and William J. Greenleaf. In particular, we will make use of the ATACseq_50k_Rep2 sample GEO - GSM1155958 Data can be retrieved in FASTQ format from ENA * SAMN02192806 - [here](https://www.ebi.ac.uk/ena/data/view/SAMN02192806) --- ## Data For the second dataset we take ATACseq generated by Bing Ren at UCSD as part of the ENCODE consortium. It includes samples from several tissues in Mice. Links to data and sample information are included in list below. * Liver day 12 - [ENCSR302LIV](https://www.encodeproject.org/experiments/ENCSR302LIV/) * Kidney day 15 - [ENCSR023QZX](https://www.encodeproject.org/experiments/ENCSR023QZX/) * Hindbrain day 12 - [ENCSR088UYE](https://www.encodeproject.org/experiments/ENCSR088UYE/) --- ## Data Finally, I have processed some of the data from Christina Leslie's lab at MSKCC exactly as described in this session, so we can review some of the characteristics of ATACseq data alongside the same data processed by ENCODE's pipeline in the exercises. The raw data and processed BAM file is available from ENCODEs portal * T-Reg - [ENCSR724UJS](https://www.encodeproject.org/experiments/ENCSR724UJS/) FQ files can be found can be found [here for read1](https://www.encodeproject.org/files/ENCFF175VOD/@@download/ENCFF175VOD.fastq.gz) and [here for read2](https://www.encodeproject.org/files/ENCFF447BGX/@@download/ENCFF447BGX.fastq.gz). We will also work with the aligned data as a BAM file [which can be found here.](https://www.encodeproject.org/files/ENCFF053CGD/@@download/ENCFF053CGD.bam) --- ## The Reference Data For ATACseq analysis we will require a few pieces of reference data. This includes: * Reference genome in fasta format - We will retrieve these from BSGenome Bioconductor annotation packages. * Gene models - We will retrieve these from TxDb Bioconductor annotation packages. * Blacklists - Artifact regions specific to genomes. These can be found in [ENCODE portal here](https://www.encodeproject.org/annotations/ENCSR636HFF/) --- ## Processed Data We start with public sequencing data in links below and use reference data in Bioconductor. Since some of these processing steps may take a little time I provide links to pre-processed results. ### **Essentials** BAM file and BAI index from our alignment/sorting/indexing. * [SAMN02192806 - Greenleaf BAM](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam) - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing. * [SAMN02192806 - Greenleaf BAI index](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam.bai) - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing. --- ## Processed Data ### **Essentials** Small BAM, peak calls and directory structure. * [ATAC_Workshop_Essential.zip](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop_Essential.zip) - Require additional workshop files and directory structure. Once you have downloaded the above and unzipped **ATAC_Workshop.zip**, you should move the **Sorted_ATAC_50K_2.bam** and **Sorted_ATAC_50K_2.bam.bai** file into **ATAC_Workshop/ATAC_Data/ATAC_BAM/** You should also copy the **RU_ATAC_Workshop.Rmd** to **ATAC_Workshop/** directory and open then to make sure all relative paths are correct. --- ## Processed Data ### **Not essential** Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure. * [ATAC_Workshop.zip](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop.zip) - Additional workshop files and directory structure. Bigwigs for IGV. * [Bigwigs](https://s3.amazonaws.com/rubioinformatics/ATAC_bigWigs.zip) - BigWigs to review in IGV. --- class: inverse, center, middle # Alignment for ATACseq <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## FASTQ QC Prior to alignemnt we recommend spending some time reviewing the FASTQ files. Some basic QC checks can help us see if there any biases in your artifacts in your sequencing such as unexpected drops in read quality, or nonrandom GC content. For guides on how to check your FASTQ quality, check out that section in our [ChIPseq course](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor.html#6). --- ## Align Greenleaf FASTQs In this section we will work a little with the Greenleaf dataset. We will process one sample of the Greenleaf data from FASTQ to BAM to allow us to review some of the features of ATACseq data and to create some processed files for review and further analysis. --- ## Creating a reference genome First we will need to create a reference genome to align our ATACseq data. [We can create a FASTA file for alignment from the Bioconductor BSGenome objects.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#12) We have done this previously for the [ChIPseq](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor.html#28) and [RNAseq](https://rockefelleruniversity.github.io/RU_RNAseq/presentations/slides/RU_RNAseq_p1.html#17) analysis. We are working with human data this time, so we will use the **BSgenome.Hsapiens.UCSC.hg19** library for the hg19 genome build. ```r library(BSgenome.Hsapiens.UCSC.hg19) mainChromosomes <- paste0("chr",c(1:21,"X","Y","M")) mainChrSeq <- lapply(mainChromosomes, function(x)BSgenome.Hsapiens.UCSC.hg19[[x]]) names(mainChrSeq) <- mainChromosomes mainChrSeqSet <- DNAStringSet(mainChrSeq) writeXStringSet(mainChrSeqSet, "BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa") ``` --- ## Creating Rsubread index For Rsubread we must build our index prior to Rsubread's alignment steps. Rsubread's [**buildindex()** function simply takes the parameters of our desired index name and the FASTA file to build index from.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#14). Here I additional specify the parameter **indexSplit** as TRUE in conjunction with **memory** parameter set to 1000 (1000MB) to control memory usage in Rsubread alignment step. ```r library(Rsubread) buildindex("BSgenome.Hsapiens.UCSC.hg19.mainChrs", "BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa", indexSplit = TRUE, memory = 1000) ``` --- ## Aligning Sequence Reads Now we have an index, we can align our ATACseq reads. As ATACseq data is typically paired-end sequencing we will need to make some minor adjustments to our alignment steps. Paired-end sequencing data will most often comes as two files typically with *_1* and *_2* or *_R1* and *_R2* in file name to denote which number in pair, a file is. ```r read1 <- "ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz" read2 <- "ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz" ``` --- ## Aligning Sequence Reads Our two matched paired-end read files will (most often) contain the same number of reads and the reads will be ordered the same in both files. The read names will match across files for paired reads with the exception of a 1 or 2 in name to signify if the read was first or second in a pair. ```r require(ShortRead) read1 <- readFastq("data/ATACSample_r1.fastq.gz") read2 <- readFastq("data/ATACSample_r2.fastq.gz") id(read1)[1:2] ``` ``` ## BStringSet object of length 2: ## width seq ## [1] 59 HISEQ:236:HA03EADXX:1:1101:1147:2237 1:Y:0:TAAGGCGACTCTCTAT ## [2] 59 HISEQ:236:HA03EADXX:1:1101:1201:2248 1:N:0:TAAGGCGACTCTCTAT ``` ```r id(read2)[1:2] ``` ``` ## BStringSet object of length 2: ## width seq ## [1] 59 HISEQ:236:HA03EADXX:1:1101:1147:2237 2:Y:0:TAAGGCGACTCTCTAT ## [2] 59 HISEQ:236:HA03EADXX:1:1101:1201:2248 2:N:0:TAAGGCGACTCTCTAT ``` --- ## Aligning Sequence Reads The distance between paired reads is important in ATACseq to allow us to distinguish reads mapping from short or long fragments indicating nucleosome free and nucleosome portions of signal respectively. Following the alignment step the insert size gives us the total distance between the start of read1 and read2. <div align="center"> <img src="imgs/insertSize.jpg" alt="offset" height="300" width="600"> </div> --- ## Aligning Sequence Reads We can use a standard alignment for DNA (as for ChIPseq), but we increase the maximum allowed fragment length to capture long fragments representing poly-nucleosome signal. The maximum allowed fragment length set here is based on parameters used within Greenleaf study. To control maximum allowed fragment lengths I set the **maxFragLength** parameter to 2000. I also set **unique** parameter to TRUE to only include uniquely mapping reads. ```r align("BSgenome.Hsapiens.UCSC.hg19.mainChrs", readfile1=read1,readfile2=read2, output_file = "ATAC_50K_2.bam", nthreads=2,type=1, unique=TRUE,maxFragLength = 2000) ``` --- ## Aligning with Rbowtie2 To use Rbowtie2 we must also build our index prior to aligment. Here we use the [**bowtie2_build()** function specifying the parameters of our FASTA file to build index from and the desired index name](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#32). ```r library(Rbowtie2) bowtie2_build(references="BSgenome.Hsapiens.UCSC.hg19.mainChrs.fa", bt2Index="BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2") ``` --- ## Decompressing FASTQ files Once we have our index we must decompress our FASTQ files using **gunzip()** function. ```r gunzip("ATAC_Data/ATAC_FQs/SRR891269_1.fastq.gz") gunzip("ATAC_Data/ATAC_FQs/SRR891269_2.fastq.gz") ``` --- ## Creating Rbowtie2 index Now we can align our FASTQ to the genome with the **bowtie2()** function specifying our read1 and read2 to **seq1** and **seq2** parameters. Finally we can convert the output **SAM** file to a more useable **BAM** file with **asBam()** function. NOTE: SAM and uncompressed FASTQ files can use a lot of disk space. Once your done, it is good to recompress the FASTQs and remove the SAM files with the **unlink()** function. ```r library(Rsamtools) bowtie2(bt2Index = "BSgenome.Hsapiens.UCSC.hg19.mainChrs_bowtie2", samOutput = "ATAC_50K_2_bowtie2.sam", seq1 = "ATAC_Data/ATAC_FQs/SRR891269_1.fastq", seq1 = "ATAC_Data/ATAC_FQs/SRR891269_2.fastq" ) asBam("ATAC_50K_2_bowtie2.sam") ``` --- ## Sorting and Indexing Following alignment [we would want to sort and index our BAM file for use with external tools.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignedDataInBioconductor.html#10) First we sort our aligned data by sequence order (not Read Name here). We then index our file allowing for rapid access of particular genomic locations by other programs (e.g IGV, Samtools) and by R/Bioconductor packages we will use. ```r library(Rsamtools) sortedBAM <- file.path(dirname(outBAM), paste0("Sorted_",basename(outBAM)) ) sortBam(outBAM,gsub("\\.bam","",basename(sortedBAM))) indexBam(sortedBAM) ``` --- ## Distribution of mapped reads In ATACseq we will want to check the distribution of mapped reads across chromosomes. [We can check the number of mapped reads on every chromosome using the **idxstatsBam()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignedDataInBioconductor.html#15) ATACseq is known have high signal on the mitochondrial chromosomes and so we can check for that here. ```r library(Rsamtools) mappedReads <- idxstatsBam(sortedBAM) ``` --- ## Distribution of mapped reads We can now use the mapped reads data frame to make a barplot of reads across chromosomes. In this example, we see a case where the mapping rate to mitochondrial genome is high. ```r library(ggplot2) ggplot(mappedReads, aes(seqnames, mapped, fill = seqnames)) + geom_bar(stat = "identity") + coord_flip() ``` ![](RU_ATAC_files/figure-html/quickMappingStatsPerChromosomes-1.png)<!-- --> --- class: inverse, center, middle # Post-alignment processing <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Proper pairs Now we have processed the Greenleaf ATACseq paired-end data we can start to work with alignments. First we will identify the expected fragment length distribution for ATACseq data. We read our newly aligned data using the GenomicAlignments package. Here we only wants read which are properly paired so we will use the **ScanBamParam()** and **scanBamFlag()** functions to control what will be read into R. We set the **scanBamFlag()** function parameters **isProperPair** to TRUE so as to only read in reads paired in alignment within our preset max fragment length (2000bpp) ```r library(GenomicAlignments) flags = scanBamFlag(isProperPair = TRUE) ``` --- ## Proper pairs We can now use these flags with the **ScanBamParam()** function to read in only properly paired reads. We additionally specify information to be read into R using the **what** parameter. Importantly we specify the insert size information - **isize**. To reduce memory footprint we read only information from chromosome 20 by specifying a GRanges object to **which** parameter. ```r myParam = ScanBamParam(flag = flags, what = c("qname", "mapq", "isize"), which = GRanges("chr20", IRanges(1, 63025520))) myParam ``` ``` ## class: ScanBamParam ## bamFlag (NA unless specified): isProperPair=TRUE ## bamSimpleCigar: FALSE ## bamReverseComplement: FALSE ## bamTag: ## bamTagFilter: ## bamWhich: 1 ranges ## bamWhat: qname, mapq, isize ## bamMapqFilter: NA ``` --- ## Proper pairs Now we have set up the **ScanBamParam** object, we can use the **readGAlignmentPairs()** function to read in our paired-end ATACseq data in a similar way to how we read in single-end ChIP-seq data using the **readGAlignments()** function. The resulting object is a **GAlignmentPairs** object. ```r atacReads <- readGAlignmentPairs(sortedBAM, param = myParam) class(atacReads) ``` ``` ## [1] "GAlignmentPairs" ## attr(,"package") ## [1] "GenomicAlignments" ``` --- ## GAlignmentPairs The **GAlignmentPairs** object contains information on our paired reads. It stores information on each read in a pair in parallel **GAlignments** objects. ```r atacReads[1:2, ] ``` ``` ## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns: ## seqnames strand : ranges -- ranges ## <Rle> <Rle> : <IRanges> -- <IRanges> ## [1] chr20 + : 60000-60029 -- 60361-60410 ## [2] chr20 + : 60000-60029 -- 60361-60410 ## ------- ## seqinfo: 24 sequences from an unspecified genome ``` --- ## GAlignmentPairs We access the **GAlignments** objects using the **first()** and **second()** accessor functions to gain information on the first or second read respectively. ```r read1 <- first(atacReads) read2 <- second(atacReads) read2[1, ] ``` ``` ## GAlignments object with 1 alignment and 3 metadata columns: ## seqnames strand cigar qwidth start end width ## <Rle> <Rle> <character> <integer> <integer> <integer> <integer> ## [1] chr20 - 50M 50 60361 60410 50 ## njunc | qname mapq isize ## <integer> | <character> <integer> <integer> ## [1] 0 | SRR891269.17032370 40 -411 ## ------- ## seqinfo: 24 sequences from an unspecified genome ``` --- ## Retrieve MapQ scores One of the first we can do is obtain the distribution of MapQ scores for our read1 and read2. We can access this for each read using **mcols()** function to access the mapq slot of the GAalignments object for each read. ```r read1MapQ <- mcols(read1)$mapq read2MapQ <- mcols(read2)$mapq read1MapQ[1:2] ``` ``` ## [1] 40 40 ``` --- ## MapQ score frequencies We can then use the **table()** function to summarize the frequency of scores for each read in pair. ```r read1MapQFreqs <- table(read1MapQ) read2MapQFreqs <- table(read2MapQ) read1MapQFreqs ``` ``` ## read1MapQ ## 6 8 10 13 20 40 ## 252 1003 3181 11027 26359 264748 ``` ```r read2MapQFreqs ``` ``` ## read2MapQ ## 6 8 10 13 20 40 ## 318 1112 4158 12610 27950 260422 ``` --- ## Plot MapQ scores Finally we can plot the distributions of MapQ for each read in pair using ggplot2. ```r library(ggplot2) toPlot <- data.frame(MapQ = c(names(read1MapQFreqs), names(read2MapQFreqs)), Frequency = c(read1MapQFreqs, read2MapQFreqs), Read = c(rep("Read1", length(read1MapQFreqs)), rep("Read2", length(read2MapQFreqs)))) toPlot$MapQ <- factor(toPlot$MapQ, levels = unique(sort(as.numeric(toPlot$MapQ)))) ggplot(toPlot, aes(x = MapQ, y = Frequency, fill = MapQ)) + geom_bar(stat = "identity") + facet_grid(~Read) ``` ![](RU_ATAC_files/figure-html/processData_readingInData5-1.png)<!-- --> --- ## Retrieving insert sizes Now we have reads in the paired aligned data into R, we can retrieve the insert sizes from the **elementMetadata()** attached to **GAlignments** objects of each read pair. Since properly paired reads will have the same insert size length we extract insert sizes from read1. ```r atacReads_read1 <- first(atacReads) insertSizes <- abs(elementMetadata(atacReads_read1)$isize) head(insertSizes) ``` ``` ## [1] 411 411 228 60 228 176 ``` --- ## Plotting insert sizes ATACseq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and polynucleosome fractions. We can use the **table()** function to retrieve a vector of the occurrence of each fragment length. ```r fragLenSizes <- table(insertSizes) fragLenSizes[1:5] ``` ``` ## insertSizes ## 50 51 52 53 54 ## 28311 2881 3050 2679 2433 ``` --- ## Plotting insert sizes Now we can use the newly acquired insert lengths for chromosome 20 to plot the distribution of all fragment lengths. ```r library(ggplot2) toPlot <- data.frame(InsertSize = as.numeric(names(fragLenSizes)), Count = as.numeric(fragLenSizes)) fragLenPlot <- ggplot(toPlot, aes(x = InsertSize, y = Count)) + geom_line() ``` --- ## Plotting insert sizes ```r fragLenPlot + theme_bw() ``` ![](RU_ATAC_files/figure-html/processData_plottfingFragmentLengths2-1.png)<!-- --> --- ## Plotting insert sizes We can apply a log2 transformation to the counts to clarify the nucloesome patterning. ```r fragLenPlot + scale_y_continuous(trans = "log2") + theme_bw() ``` ![](RU_ATAC_files/figure-html/processData_plottingFragmentLengths3-1.png)<!-- --> --- ## Plotting insert sizes This looks very similar to the image from the Greenleaf [paper](https://pubmed.ncbi.nlm.nih.gov/24097267/#&gid=article-figures&pid=figure-2-uid-1). We can now annotate our nucleosome free (< 100bp), mononucleosome (180bp-247bp) and dinucleosome (315-437) as in the Greenleaf study. ```r fragLenPlot + scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180, 247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") + geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw() ``` ![](RU_ATAC_files/figure-html/processData_plottingFragmentLengths24-1.png)<!-- --> --- ## Subsetting ATACseq reads by insert sizes We may wish to divide our aligned reads into reads representing nucleosome free and nucleosome occupied. Here we create BAM files for the reads representing nucleosome free, mono and di nucleosome by using insert sizes to filter read. ```r atacReads_NucFree <- atacReads[insertSizes < 100, ] atacReads_MonoNuc <- atacReads[insertSizes > 180 & insertSizes < 240, ] atacReads_diNuc <- atacReads[insertSizes > 315 & insertSizes < 437, ] ``` --- ## Creating BAM files split by insert sizes The resulting reads can be written back to BAM files for use in other parts of our analysis or for visualization in programs such as IGV by functions in the rtracklayer package. ```r nucFreeRegionBam <- gsub("\\.bam", "_nucFreeRegions\\.bam", sortedBAM) monoNucBam <- gsub("\\.bam", "_monoNuc\\.bam", sortedBAM) diNucBam <- gsub("\\.bam", "_diNuc\\.bam", sortedBAM) library(rtracklayer) export(atacReads_NucFree, nucFreeRegionBam, format = "bam") export(atacReads_MonoNuc, monoNucBam, format = "bam") export(atacReads_diNuc, diNucBam, format = "bam") ``` --- ## Creating fragment GRanges We can recreate the full length fragments from our single end reads to evaluate duplication rate and create a fragment bigwig. Here we use the **granges()** function to recreate full fragments from the paired single-end reads. ```r atacReads[1, ] ``` ``` ## GAlignmentPairs object with 1 pair, strandMode=1, and 0 metadata columns: ## seqnames strand : ranges -- ranges ## <Rle> <Rle> : <IRanges> -- <IRanges> ## [1] chr20 + : 60000-60029 -- 60361-60410 ## ------- ## seqinfo: 24 sequences from an unspecified genome ``` ```r atacFragments <- granges(atacReads) atacFragments[1, ] ``` ``` ## GRanges object with 1 range and 0 metadata columns: ## seqnames ranges strand ## <Rle> <IRanges> <Rle> ## [1] chr20 60000-60410 + ## ------- ## seqinfo: 24 sequences from an unspecified genome ``` --- ## Creating fragment GRanges We can use the **duplicated()** function to identify our non-redundant (non-duplicate) fraction of our full length fragments. ```r duplicatedFragments <- sum(duplicated(atacFragments)) totalFragments <- length(atacFragments) duplicateRate <- duplicatedFragments/totalFragments nonRedundantFraction <- 1 - duplicateRate nonRedundantFraction ``` ``` ## [1] 0.7491177 ``` --- ## Creating an open region bigWig We can make it significantly quicker to review the pile-up of ATACseq signal in a genome browser by creating a bigWig file. Additional normalization to total mapped reads could be applied at this point. ```r openRegionRPMBigWig <- gsub("\\.bam", "_openRegionRPM\\.bw", sortedBAM) myCoverage <- coverage(atacFragments, weight = (10^6/length(atacFragments))) export.bw(myCoverage, openRegionRPMBigWig) ``` <div align="center"> <img src="imgs/final.png" alt="offset" height="300" width="600"> </div> --- class: inverse, center, middle # ATACseqQC <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- --- ## ATACseqQC The **ATACseqQC** library allows us to run many of the ATACseq QC steps we have seen in a single step. It may consume a little more memory but will allow for the inclusion of two more useful metrics called the PCR Bottleneck Coefficients (PBC1 and PBC2). First we must install the library. ```r BiocManager::install("ATACseqQC") ``` --- ## ATACseqQC As with ChIPQC, the ATACseqQC function contains a workflow function which will acquire much of the required QC with a single argument of the BAM file path. Since this can be fairly memory heavy, I am just illustrating it here on a BAM file containing just the chromosome 17 reads of the ATACseq data. ```r library(ATACseqQC) ATACQC <- bamQC("~/Downloads/Sorted_ATAC_50K_2_ch17.bam") ``` --- ## ATACseqQC The resulting ATACQC object has many slots of QC information including duplicateRate, non-redundant fraction, distribution of signal across chromosomes, mitochondrial fraction etc. These include the **PCRbottleneckCoefficient_1** and **PCRbottleneckCoefficient_2** values. ```r names(ATACQC) ``` ``` ## [1] "totalQNAMEs" "duplicateRate" ## [3] "mitochondriaRate" "properPairRate" ## [5] "unmappedRate" "hasUnmappedMateRate" ## [7] "notPassingQualityControlsRate" "nonRedundantFraction" ## [9] "PCRbottleneckCoefficient_1" "PCRbottleneckCoefficient_2" ## [11] "MAPQ" "idxstats" ``` --- ## PCR bottleneck coefficients PCR bottleneck coefficients identify PCR bias/overamplification which may have occurred in preparation of ATAC samples. The **PCRbottleneckCoefficient_1** is calculated as the number of positions in genome with *exactly* 1 read mapped uniquely compared to the number of positions with *at least* 1 read. For example if we have 20 reads. 16 map uniquely to locations. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/18. We therefore have a PBC1 of 0.889 Values less than 0.7 indicate severe bottlenecking, between 0.7 and 0.9 indicate moderate bottlenecking. Greater than 0.9 show no bottlenecking. ```r ATACQC$PCRbottleneckCoefficient_1 ``` ``` ## [1] 0.7406656 ``` --- ## PCR bottleneck coefficients The **PCRbottleneckCoefficient_2** is our secondary measure of bottlenecking. It is calculated as the number of positions in genome with **exactly** 1 read mapped uniquely compared to the number of positions with **exactly** 2 reads mapping uniquely. We can reuse our example. If we have 20 reads, 16 of which map uniquely. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/2. We therefore have a PBC2 of 8. Values less than 1 indicate severe bottlenecking, between 1 and 3 indicate moderate bottlenecking. Greater than 3 show no bottlenecking. ```r ATACQC$PCRbottleneckCoefficient_2 ``` ``` ## [1] 3.73014 ``` --- ## Time for an exercise! Exercise on ATACseq data can be found [here](../../exercises/exercises/ATACseq_part1_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/ATACseq_part1_answers.html) R code for solutions can be found [here](../../exercises/answers/ATACseq_part1_answers.R)