class: middle, inverse, title-slide .title[ # Epigenomics, Session 2 ] .subtitle[ ##
Bioinformatics Resource Center - Rockefeller University ] .author[ ###
http://rockefelleruniversity.github.io/RU_course_template/
] .author[ ###
brc@rockefeller.edu
] --- ## Set the Working directory Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived. You may navigate to the unarchived RU_Course_help folder in the Rstudio menu. **Session -> Set Working Directory -> Choose Directory** or in the console. ``` r setwd("~/Downloads/ATAC.Cut-Run.ChIP-master/r_course") ``` --- ## What we will cover Earlier we learned how to asses the quality of a FASTQ file and filter the reads based on quality scores. We will take these filtered FASTQ files and continue with the analysis: * Align the FASTQ reads to the genome to generate BAM alignment files. * Identify areas of significant read build up by calling peaks. * These steps are very similar for CUT&RUN and ATACseq, but there are some differences we will highlight. --- class: inverse, center, middle # Aligning CUT&RUN and ATACseq reads <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Aligning CUT&RUN and ATACseq reads Following assessment of read quality and any read filtering we applied, we will want to align our reads to the genome so as to identify any genomic locations showing enrichment for aligned reads above background. Since CUT&RUN and ATACseq reads will align continuously against our reference genome we can use [genomic aligners](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#7), as opposed to aligners that account for splice junctions (e.g. for RNAseq). The resulting BAM file will contain aligned sequence reads for use in further analysis. <div align="center"> <img src="imgs/sam2.png" alt="igv" height="200" width="700"> </div> --- ## Creating a reference genome First we need to retrieve the sequence information for the genome of interest in [FASTA format](https://rockefelleruniversity.github.io/Genomic_Data/presentations/slides/GenomicsData.html#9). We can use the [BSgenome libraries to retrieve the full sequence information.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#4) For the mouse mm10 genome we load the package **BSgenome.Mmusculus.UCSC.mm10**. ``` r library(BSgenome.Mmusculus.UCSC.mm10) BSgenome.Mmusculus.UCSC.mm10 ``` ``` ## | BSgenome object for Mouse ## | - organism: Mus musculus ## | - provider: UCSC ## | - genome: mm10 ## | - release date: Sep 2017 ## | - 239 sequence(s): ## | chr1 chr2 chr3 ## | chr4 chr5 chr6 ## | chr7 chr8 chr9 ## | chr10 chr11 chr12 ## | chr13 chr14 chr15 ## | ... ... ... ## | chrX_KZ289094_fix chrX_KZ289095_fix chrY_JH792832_fix ## | chrY_JH792833_fix chrY_JH792834_fix chr1_KK082441_alt ## | chr11_KZ289073_alt chr11_KZ289074_alt chr11_KZ289075_alt ## | chr11_KZ289077_alt chr11_KZ289078_alt chr11_KZ289079_alt ## | chr11_KZ289080_alt chr11_KZ289081_alt ## | ## | Tips: call 'seqnames()' on the object to get all the sequence names, call ## | 'seqinfo()' to get the full sequence info, use the '$' or '[[' operator to ## | access a given sequence, see '?BSgenome' for more information. ``` --- ## Creating a reference genome We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a [**DNAStringSet** object from the retrieved sequences](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#17). ``` r mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M")) mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]]) names(mainChrSeq) <- mainChromosomes mainChrSeqSet <- DNAStringSet(mainChrSeq) mainChrSeqSet ``` ``` ## DNAStringSet object of length 22: ## width seq names ## [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1 ## [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2 ## [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3 ## [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4 ## [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5 ## ... ... ... ## [18] 90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18 ## [19] 61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19 ## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX ## [21] 91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY ## [22] 16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM ``` --- # Creating a reference genome Now we have a **DNAStringSet** object we can use the [**writeXStringSet** to create our FASTA file of sequences to align to.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#22) ``` r writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa") ``` --- ## Creating an Rsubread index We will be aligning using the **seed-and-vote** algorithm from the folks behind subread. We can therefore use the **Rsubread** package. Before we attempt to align our FASTQ files, we will need to first build an index from our reference genome using the **buildindex()** function. The [**buildindex()** function simply takes the parameters of our desired index name and the FASTA file to build index from.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#14) *NOTE: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or Desktop.* ``` r library(Rsubread) buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000, indexSplit = TRUE) ``` --- ## Aligning CUT&RUN Reads In the following slides we will focus on aligning the CUT&RUN data, but alignment of ATACseq data will be the same. In the post processing and peak calling section there are some considerations for each method that we will cover. CUT&RUN and ATACseq data are typically paired-end sequencing, and most often comes as two files typically with *_1* and *_2* or *_R1* and *_R2* in the file name to denote which number in pair a file is. The two matched paired-end read files will (most often) contain the same number of reads and the reads will be ordered the same in both files. --- ## Aligning Sequence Reads Below we read in the first 1000 reads in replicate 1 of the Week 6 CUT&RUN FASTQ file. The read names will match across files for paired reads with the exception of a 1 or 2 in name to signify if the read was first or second in a pair. ``` r require(ShortRead) # first 1000 reads in each file read1 <- readFastq("data/SOX9CNR_W6_rep1_1K_R1.fastq.gz") read2 <- readFastq("data/SOX9CNR_W6_rep1_1K_R2.fastq.gz") id(read1)[1:2] ``` ``` ## BStringSet object of length 2: ## width seq ## [1] 17 SRR20110409.9 9/1 ## [2] 19 SRR20110409.10 10/1 ``` ``` r id(read2)[1:2] ``` ``` ## BStringSet object of length 2: ## width seq ## [1] 17 SRR20110409.9 9/2 ## [2] 19 SRR20110409.10 10/2 ``` --- ## Aligning Sequence Reads Now that we have an index and our FASTQ files, we can perform the alignment. Our two FASTQ files that have been filtered with rfastp for QC metrics, and are [available on dropbox](https://www.dropbox.com/scl/fo/ktmtrypv57tpckk7ehmow/AN-f1t17gbBtFpxjdFK8WRY?rlkey=3dnog47inu8y0giqct1g0xy4q&st=zq28gbmj&dl=0) if you have built the index and want to try the alignment on your own. ``` r read1_toAlign <- "~/Downloads/SOX9CNR_W6_rep1_QC_R1.fastq.gz" read2_toAlign <- "~/Downloads/SOX9CNR_W6_rep1_QC_R2.fastq.gz" ``` --- ## Rsubread alignment of CUT&RUN We can align our raw sequence data in FASTQ format to the FASTA file of our mm10 genome sequence using the **Rsubread** package. Specifically we will be using the **align** function as this utilizes the subread genomic alignment algorithm. The [**align()** function accepts arguments](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#15) for the index to align to, the FASTQ(s) to align, the name of output BAM, the mode of alignment (rna or dna), and fragment length filters. We will go through alignment of the CUT&RUN. ``` r myMapped <- align("mm10_mainchrs", readfile1 = read1_toAlign, readfile2 = read2_toAlign, type = "dna", output_file = "SOX9CNR_W6_rep1.bam", nthreads = 4, minFragLength = 0, maxFragLength = 2000) ``` --- ## Sort and index CUT&RUN reads As before, we sort and index our files using the [**Rsamtools** packages **sortBam()** and **indexBam()** functions respectively.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignedDataInBioconductor.html#10) The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R. ``` r library(Rsamtools) sortBam("SOX9CNR_W6_rep1.bam", "SOX9CNR_W6_rep1_sorted") indexBam("SOX9CNR_W6_rep1_sorted.bam") ``` --- ## Create a bigWig We can also create a bigWig from our sorted, indexed BAM file to allow us to quickly review our data in IGV. First we use the [**coverage()** function to create an **RLElist object** containing our coverage scores.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/Summarising_Scores_In_Bioconductor.html#13) ``` r forBigWig <- coverage("SOX9CNR_W6_rep1_sorted.bam") forBigWig ``` ``` RleList of length 22 $chr1 integer-Rle of length 195471971 with 1828459 runs Lengths: 3004130 40 121 40 1365 34 6 34 113 ... 13 9 8 6 4 30 6 100490 Values : 0 1 0 1 0 1 2 1 0 ... 6 3 2 3 4 2 1 0 $chr2 integer-Rle of length 182113224 with 1855901 runs Lengths: 3050000 31 4 71 9 30 1 3 3 ... 119 39 1 61 6 34 6 100214 Values : 0 3 2 0 1 2 1 2 1 ... 0 2 1 0 1 2 1 0 ``` --- ## Create a bigWig We can now export our [**RLElist object** as a bigWig using the **rtracklayer** package's **export.bw()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicScores_In_Bioconductor.html#40) ``` r library(rtracklayer) export.bw(forBigWig, con = "SOX9CNR_W6_rep1.bw") ``` --- ## Create a bigWig We may wish to normalize our coverage to allow us to compare enrichment across samples. First we we can retrieve and plot the number of mapped reads using [the **idxstatsBam()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction//presentations/slides/AlignedDataInBioconductor.html#16) ``` r mappedReads <- idxstatsBam("SOX9CNR_W6_rep1_sorted.bam") TotalMapped <- sum(mappedReads[, "mapped"]) TotalMapped ``` ``` [1] 24848310 ``` --- ## Create a bigWig To make the normalized bigwig we can use the [**weight** parameter in the **coverage()**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/Summarising_Scores_In_Bioconductor.html#20) to scale our reads to the number of mapped reads multiplied by a million (reads per million). This bigwig is available [here on dropbox](https://www.dropbox.com/scl/fo/zsxaw6uv19ar8k0svcq4a/ABBTWY3xEREQrKGU3hnaMuQ?rlkey=gk7zdl27xagl5nhz6k8amz5yy&st=2hmzbrvp&dl=0) for download if you want to check it out. ``` r forBigWig <- coverage("SOX9CNR_W6_rep1_sorted.bam", weight = (10^6)/TotalMapped) forBigWig export.bw(forBigWig, con = "SOX9CNR_W6_rep1_weighted.bw") ``` --- # BAM and bigWig  --- class: inverse, center, middle # Post alignment processing - CUT&RUN <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Peak calling software - CUT&RUN To identify regions of SOX9 transcription factor binding we can use a **peak caller**. Although many peak callers are available in R and beyond, one the most popular and widely used peak caller remains [**MACS**](https://macs3-project.github.io/MACS/). Another peak caller, [**SEACR**](https://github.com/FredHutch/SEACR), was developed specificlly for CUT&RUN. We will use both. <img src="imgs/krt5_signal_peakintro.png" alt="igv" height="300" width="700"> --- ## Consider an appropriate input While we don't have one in this test data set, inputs are sometimes used in peak calling. They are theoretically less critical in CUT&RUN compared to ChIPseq due to lower background signal, but this is not always the case. * Input samples in CUT&RUN are typically made from IgG pulldowns * Allows for control of artefact regions which occur across samples. For example: When using tumour samples for CUT&RUN, it might be important to have matched input samples. Differing conditions of same tissue may share common input. --- ## Install software with conda and Herper We will install a version of MACS that is not available in R. It is possible to install it on Mac and Linux using the [Anaconda](https://anaconda.org/bioconda/MACS3) package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them. We can interact with the Anaconda package system using the R package [Herper](https://bioconductor.org/packages/release/bioc/html/Herper.html). This was created by us here at the BRC and is part of Bioconductor. ``` r BiocManager::install("Herper") library(Herper) ``` --- ## Install software with conda and Herper First, we will use Herper to install MACS with the `install_CondaTools` function. This will automatically install miniconda, a lightweight version of Anaconda. We just need to tell `install_CondaTools` the following: * what tool/s you want * the name of the environment you want to build. * path where the miniconda currently is or should be installed * whether you are updating a current environment (optional) We will install MACS (version 3), in addition to samtools and bedtools, which will be used to prepare our BAM files for use by MACS and SEACR. ``` r dir.create("miniconda") macs_paths <- install_CondaTools(tools = c("macs3", "samtools", "bedtools"), env = "CnR_analysis", pathToMiniConda = "miniconda") ``` --- ## Install software with conda and Herper Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which MACS, bedtools, and samtools will be installed. When you run the function it returns and prints out where the software is installed. ``` r macs_paths ``` ``` $pathToConda [1] "miniconda/bin/conda" $environment [1] "CnR_analysis" $pathToEnvBin [1] "miniconda/envs/CnR_analysis/bin" ``` --- ## Install software with conda and Herper After the software are installed, we can then use them using the `with_CondaEnv` function from Herper paired with the `system` function that allows us to run terminal commands from R. ``` r Herper::with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system("samtools sort -h")) ``` ``` samtools: invalid option -- h Usage: samtools sort [options...] [in.bam] Options: -l INT Set compression level, from 0 (uncompressed) to 9 (best) -u Output uncompressed data (equivalent to -l 0) -m INT Set maximum memory per thread; suffix K/M/G recognized [768M] -M Use minimiser for clustering unaligned/unplaced reads -R Do not use reverse strand (only compatible with -M) -K INT Kmer size to use for minimiser [20] -I FILE Order minimisers by their position in FILE FASTA -w INT Window size for minimiser indexing via -I ref.fa [100] -H Squash homopolymers when computing minimiser -n Sort by read name (natural): cannot be used with samtools index -N Sort by read name (ASCII): cannot be used with samtools index -t TAG Sort by value of TAG. Uses position as secondary index (or read name if -n is set) -o FILE Write final output to FILE rather than standard output -T PREFIX Write temporary files to PREFIX.nnnn.bam --no-PG Do not add a PG line --template-coordinate Sort by template-coordinate --input-fmt-option OPT[=VAL] Specify a single input file format option in the form of OPTION or OPTION=VALUE -O, --output-fmt FORMAT[,OPT[=VAL]]... Specify output format (SAM, BAM, CRAM) --output-fmt-option OPT[=VAL] Specify a single output file format option in the form of OPTION or OPTION=VALUE --reference FILE Reference sequence FASTA FILE [null] -@, --threads INT Number of additional threads to use [0] --write-index Automatically index the output files [off] --verbosity INT Set level of verbosity ``` --- ## Proper pairs - CUT&RUN Now that we have our aligned paired-end data contained in BAM files, and the software installed to eventually call peaks, we can assess some initial QC metrics like alignment quality and fragment length. We will read our newly aligned data using the GenomicAlignments package. Here we only want reads which are properly paired so we will use the **ScanBamParam()** and **scanBamFlag()** functions to control what will be read into R. We set the **scanBamFlag()** function parameters **isProperPair** to TRUE so as to only read in reads paired in alignment. ``` r library(GenomicAlignments) flags = scanBamFlag(isProperPair = TRUE) ``` --- ## Proper pairs - CUT&RUN We can now use these flags with the **ScanBamParam()** function to read in only properly paired reads. We additionally specify information to be read into R using the **what** parameter. Importantly we specify the insert size information - **isize**. To reduce memory footprint we read only information from chromosome 18 by specifying a GRanges object to **which** parameter. ``` r myParam = ScanBamParam(flag = flags, what = c("qname", "mapq", "isize", "flag")) myParam ``` ``` ## class: ScanBamParam ## bamFlag (NA unless specified): isProperPair=TRUE ## bamSimpleCigar: FALSE ## bamReverseComplement: FALSE ## bamTag: ## bamTagFilter: ## bamWhich: 0 ranges ## bamWhat: qname, mapq, isize, flag ## bamMapqFilter: NA ``` --- ## GAlignmentPairs - CUT&RUN Now we have set up the **ScanBamParam** object, we can use the **readGAlignmentPairs()** function to read in our paired-end CUT&RUN data using the **readGAlignments()** function. In the *data* folder is a BAM file contianing the reads from only chromosome 18 that we will read into R. The resulting object is a **GAlignmentPairs** object that contains information on our paired reads ``` r sortedBAM <- "data/SOX9CNR_W6_rep1_chr18_sorted.bam" cnrReads <- readGAlignmentPairs(sortedBAM, param = myParam) cnrReads[1:2, ] ``` ``` ## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns: ## seqnames strand : ranges -- ranges ## <Rle> <Rle> : <IRanges> -- <IRanges> ## [1] chr18 + : 3001178-3001217 -- 3001186-3001225 ## [2] chr18 + : 3002517-3002556 -- 3002544-3002583 ## ------- ## seqinfo: 22 sequences from an unspecified genome ``` --- ## GAlignmentPairs - CUT&RUN We access the **GAlignments** objects using the **first()** and **second()** accessor functions to gain information on the first or second read respectively. ``` r read1 <- GenomicAlignments::first(cnrReads) read2 <- GenomicAlignments::second(cnrReads) read2[1:2, ] ``` ``` ## GAlignments object with 2 alignments and 4 metadata columns: ## seqnames strand cigar qwidth start end width ## <Rle> <Rle> <character> <integer> <integer> <integer> <integer> ## [1] chr18 - 40M 40 3001186 3001225 40 ## [2] chr18 - 40M 40 3002544 3002583 40 ## njunc | qname mapq isize flag ## <integer> | <character> <integer> <integer> <integer> ## [1] 0 | SRR20110409.2378427 0 -48 147 ## [2] 0 | SRR20110409.10295909 10 -67 147 ## ------- ## seqinfo: 22 sequences from an unspecified genome ``` --- ## Retrieve MapQ scores - CUT&RUN One of the first we can do is obtain the distribution of MapQ scores for our read1 and read2. We can access this for each read using **mcols()** function to access the mapq slot of the GAlignments object for each read. For Rsubread, a MAPQ of 0 means that a read maps equally well to multiple locations, we want to assess if this is common and potentially remove these reads. ``` r read1MapQ <- mcols(read1)$mapq read2MapQ <- mcols(read2)$mapq read1MapQ[1:5] ``` ``` ## [1] 0 40 40 10 10 ``` --- ## MapQ score frequencies - CUT&RUN We can then use the **table()** function to summarize the frequency of scores for each read in pair. ``` r read1MapQFreqs <- table(read1MapQ) read2MapQFreqs <- table(read2MapQ) read1MapQFreqs ``` ``` ## read1MapQ ## 0 6 8 10 13 20 40 ## 3123 32 307 1848 9611 31077 357317 ``` ``` r read2MapQFreqs ``` ``` ## read2MapQ ## 0 6 8 10 13 20 40 ## 3123 50 396 2232 11486 31529 354499 ``` --- ## Plot MapQ scores - CUT&RUN Finally we can plot the distributions of MapQ for each read in pair using ggplot2. ``` r library(ggplot2) toPlot <- data.frame(MapQ = c(names(read1MapQFreqs), names(read2MapQFreqs)), Frequency = c(read1MapQFreqs, read2MapQFreqs), Read = c(rep("Read1", length(read1MapQFreqs)), rep("Read2", length(read2MapQFreqs)))) toPlot$MapQ <- factor(toPlot$MapQ, levels = unique(sort(as.numeric(toPlot$MapQ)))) ggplot(toPlot, aes(x = MapQ, y = Frequency, fill = MapQ)) + geom_bar(stat = "identity") + facet_grid(~Read) ``` <!-- --> --- ## Retrieving insert sizes - CUT&RUN Now we have reads in the paired aligned data into R, we can retrieve the insert sizes from the **mcols()** attached to **GAlignments** objects of each read pair. CUT&RUN should represent a mix of fragment lengths depending on the target. A TF will generally only be in nucleosome-free regions and will have small fragments, while a histone mark might have longer fragments. Since properly paired reads will have the same insert size length we extract insert sizes from read1. ``` r insertSizes <- abs(mcols(read1)$isize) head(insertSizes) ``` ``` ## [1] 48 67 67 67 67 67 ``` --- ## Plotting insert sizes - CUT&RUN We can use the **table()** function to retrieve a vector of the occurrence of each fragment length. ``` r fragLenSizes <- table(insertSizes) fragLenSizes[1:5] ``` ``` ## insertSizes ## 30 31 32 33 34 ## 1 574 746 1067 1386 ``` --- ## Plotting insert sizes - CUT&RUN Now we can use the newly acquired insert lengths for chromosome 18 to plot the distribution of all fragment lengths. ``` r library(ggplot2) toPlot <- data.frame(InsertSize = as.numeric(names(fragLenSizes)), Count = as.numeric(fragLenSizes)) fragLenPlot <- ggplot(toPlot, aes(x = InsertSize, y = Count)) + geom_line() fragLenPlot + theme_bw() ``` <!-- --> --- ## Plotting insert sizes - CUT&RUN We can now annotate our nucleosome free (< 120bp) regions and see a clear bi-modal distribution of reads separated by this distinction. ``` r fragLenPlot + geom_vline(xintercept = c(120), colour = "darkgreen") + theme_bw() ``` <!-- --> --- ## Filter reads - CUT&RUN We will filter out fragments where either read has a MAPQ of 0 and reads shorter than 1000bp. MAPQ of zero means a read mapped equally well to at least two places in the genome. Note: For TF CUT&RUN where we expect binding in nucleosome free regions, sometimes it is common to only take reads shorter than 120bp. In the exercises you have a chance to compare both strategies. ``` r cnrReads_filter <- cnrReads[insertSizes < 1000 & (!mcols(read1)$mapq == 0 | !mcols(read2)$mapq == 0)] ``` --- ## Remove blacklist - CUT&RUN ChIPseq and CUT&RUN will often show the presence of common artifacts, such as ultra-high signal regions. Such regions can confound peak calling, fragment length estimation and QC metrics. Anshul Kundaje created the DAC blacklist as a reference to help deal with these regions. We will use this reference to remove these reads from the BAM file. <div align="center"> <img src="imgs/blacklist.png" alt="offset" height="400" width="350"> </div> --- ## Remove blacklist regions - CUT&RUN You can find a Blacklist for most genomes at the [Boyle lab Github page](https://github.com/Boyle-Lab/Blacklist/tree/master/lists). The most recent version (V2) for mm10 is available in the *data* folder of this course. We read it into R as a GRanges object using the `import` function from rtracklayer. ``` r library(rtracklayer) blacklist <- "data/mm10-blacklist.v2.bed" bl_regions <- rtracklayer::import(blacklist) bl_regions ``` ``` ## GRanges object with 3435 ranges and 1 metadata column: ## seqnames ranges strand | name ## <Rle> <IRanges> <Rle> | <character> ## [1] chr10 1-3135400 * | High Signal Region ## [2] chr10 3218901-3276600 * | Low Mappability ## [3] chr10 3576901-3627700 * | Low Mappability ## [4] chr10 4191101-4197600 * | Low Mappability ## [5] chr10 4613501-4615400 * | High Signal Region ## ... ... ... ... . ... ## [3431] chrY 6530201-6663200 * | High Signal Region ## [3432] chrY 6760201-6835800 * | High Signal Region ## [3433] chrY 6984101-8985400 * | High Signal Region ## [3434] chrY 10638501-41003800 * | High Signal Region ## [3435] chrY 41159201-91744600 * | High Signal Region ## ------- ## seqinfo: 21 sequences from an unspecified genome; no seqlengths ``` --- ## Remove blacklist regions - CUT&RUN The `granges` function finds the range spanning the whole fragment (5' read 1 to 3' read 2) when given a GAlignmentPairs object. We then identify the fragments from this GRanges object that overlap the blacklist regions. ``` r fragment_spans <- granges(cnrReads_filter) bl_remove <- overlapsAny(fragment_spans, bl_regions) table(bl_remove) ``` ``` ## bl_remove ## FALSE TRUE ## 394289 5893 ``` --- ## Write out filtered BAM file - CUT&RUN This logical vector is then used to filter out these fragments that overlap blacklist. It is helpful to include the names for each read in the BAM file for the peak callers, so we unlist the GAlignmentPairs object so each read now has its own row, and then the names are set from the metadata of the object. Lastly, the BAM file is written to file. ``` r cnrReads_filter_noBL <- cnrReads_filter[!bl_remove] cnrReads_unlist <- unlist(cnrReads_filter_noBL) names(cnrReads_unlist) <- mcols(cnrReads_unlist)$qname filter_bam <- gsub("sorted.bam", "filter.bam", basename(sortedBAM)) rtracklayer::export(cnrReads_unlist, filter_bam, format = "bam") ``` --- ## Prepare BAM for peak calling - CUT&RUN The `samtools fixmate` function is helpful to clean up paired end BAM files. It will repair some of the metadata in the BAM file, which can be especially helpful when doing filtering and reading/writing of paired-end data in R. Below we use Herper to call samtools as the fixmate function is not available with Rsamtools. You'll notice a few extra tricks in the `with_CondaEnv` function call. We can use brackets {} to run multiple lines of code, and we also utilize the `tempfile` function to create a temporary path. These files will be deleted with the R session ends. ``` r forPeak_bam <- gsub("_filter.bam", "_forPeak.bam", filter_bam) Herper::with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", { tempBam <- paste0(tempfile(), ".bam") system(paste("samtools sort", "-n", "-o", tempBam, filter_bam, sep = " ")) system(paste("samtools fixmate", "-m", tempBam, forPeak_bam, sep = " ")) }) ``` --- ## Prepare BAM for peak calling - CUT&RUN The peak caller MACS requires the file to be sorted by coordinate. This can be done in R with the `sortBam` function. Note: This could also be done using `samtools sort` on the command line using Herper, as shown in the previous slide when we sorted by name before using `samtools fixmate`. ``` r forMacs_bam <- gsub("_forPeak.bam", "_macs.bam", forPeak_bam) sortBam(forPeak_bam, gsub(".bam", "", forMacs_bam)) indexBam(forMacs_bam) ``` --- class: inverse, center, middle # Peak calling - CUT&RUN <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## CUT&RUN peak calling with MACS MACS3 is installed and ready to use in our conda environment. The `macs3 callpeak` function is used to call peaks. To run MACS to we simply need to supply: * A BAM file to find enriched regions in. (specified after **-t**) * File format (BAMPE for paired end, specified after **-f**) * A Name for peak calls (specified after **–name**). * An output folder to write peaks into (specified after **–outdir**). * Optionally, and sometimes recommended, we can identify a control file (we don't have one, but would be specified after **–c**). Example command line function call: ``` macs3 callpeak -t pairedEnd.bam -f BAMPE --outdir path/to/output/ --n pairedEndPeakName ``` --- ## CUT&RUN peak calling with MACS If we have sequenced paired-end data then we do know the fragment lengths and can provide BAM files to MACS which have been prefiltered to properly paired We have to simply tell MACS that the data is paired using the format argument. This is important as MACS will guess it is single-end BAM by default. ``` r peaks_name <- gsub("_macs.bam", "", basename(forMacs_bam)) with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system2(command = "macs3", args = c("callpeak", "-t", forMacs_bam, "-f", "BAMPE", "--outdir", ".", "-n", peaks_name))) ``` --- ## MACS output files Following MACS peak calling we would get 3 files. * {name}.narrowPeak -- Narrow peak format suitable for IGV and further analysis * {name}.peaks.xls -- Peak table suitable for review in excel.(not actually a xls but a tsv) * {name}.summits.bed -- Summit positions for peaks useful for finding motifs and plotting <img src="imgs/macsIGV.png" alt="igv" height="450" width="800"> --- ## CUT&RUN peak calling with SEACR [SEACR](https://epigeneticsandchromatin.biomedcentral.com/articles/10.1186/s13072-019-0287-4) is a peak calling algorithm that was written by same lab that developed CUT&RUN (Henikoff). It is meant to deal with the lower read depths and sparser background signal from CUT&RUN. SEACR requires a bedgraph file, which like a BAM file contains information about genomic CUT&RUN signal. In the following slides we will make a paired end bed file, filter it and then write out as a bedgraph. SEACR is built for paired end data, and first we make a bedpe file using the function `bamtobed` from bedtools. We can use our conda environment we made previously to do this. ``` r # use the BAM with blacklist reads removed (from MACS section) bedpe <- gsub("\\.bam", "\\.bed", forPeak_bam) with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system(paste("bedtools bamtobed -bedpe -i", forPeak_bam, ">", bedpe, sep = " "))) ``` --- ## CUT&RUN peak calling with SEACR Lastly we make a bedgraph file with the `genomecov` function from bedtools. A key input to this function is a file with chromosome lengths, and we can extract this from the FASTA file. NOTE: The **chrom.lengths.txt** is already available in the 'data' folder of the course. ``` r library(dplyr) library(tibble) indexFa("BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa") seqlengths(Rsamtools::FaFile("BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa")) %>% as.data.frame() %>% rownames_to_column() %>% write.table(file = "chrom.lengths.txt", row.names = FALSE, col.names = FALSE, quote = FALSE, sep = "\t") ``` --- ## CUT&RUN peak calling with SEACR Then we make the bedgraph file with `genomecov` using the file containing chromosome lengths in the '-g' argument and the bedpe file in the '-i' argument. The result is written to a file that is our bedgraph. ``` r bedgraph <- gsub("\\.bed", "\\.bedgraph", bedpe) with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system(paste("bedtools genomecov -bg -i", bedpe, "-g", "data/chrom.lengths.txt", ">", bedgraph, sep = " "))) ``` --- ## CUT&RUN peak calling with SEACR Now we have the file required to run SEACR, but we need the SEACR executable script to call peaks. We can download this from the Github page for SEACR. After it's downloaded you might see the following error when running the script, indicating the permissions need to be changed. ``` r download.file("https://github.com/FredHutch/SEACR/archive/refs/tags/v1.4-beta.2.zip", destfile = "~/Downloads/SEACR_v1.4.zip") unzip("~/Downloads/SEACR_v1.4.zip", exdir = "~/Downloads/SEACR_v1.4") seacr_path <- "~/Downloads/SEACR_v1.4/SEACR-1.4-beta.2/SEACR_1.4.sh" system(paste(seacr_path, "-h")) ``` ``` trying URL 'https://github.com/FredHutch/SEACR/archive/refs/tags/v1.4-beta.2.zip' downloaded 3.0 MB sh: /Users/douglasbarrows/Downloads/SEACR_v1.4/SEACR-1.4-beta.2/SEACR_1.4.sh: Permission denied ``` --- ## CUT&RUN peak calling with SEACR The permissions can be changed by running the `chmod` function. Here we make it accessible to anyone, but this can be fine-tuned depending on your system. ``` r system(paste("chmod 777", seacr_path)) system(paste(seacr_path, "-h")) ``` ``` SEACR: Sparse Enrichment Analysis for CUT&RUN Usage: bash SEACR_1.4.sh -b <experimental bedgraph>.bg -c [<control bedgraph>.bg | <FDR threshold>] -n [norm | non] -m [relaxed | stringent] -o output_prefix -e [double] -r [yes | no] Required input fields: -b|--bedgraph Experimental bedgraph file -c|--control Control bedgraph file -n|--normalize Internal normalization (norm|non) -m|--mode Stringency mode (stringent|relaxed) -o|--output Output prefix Optional input fields: -e|--extension Peak extension constant for peak merging (Default=0.1) -r|--remove Remove peaks overlapping IgG peaks (yes|no, Default=yes) -h|--help Print help screen Output file: <output prefix>.[stringent | relaxed].bed (Bed file of enriched regions) Output data structure: <chr> <start> <end> <AUC> <max signal> <max signal region> Description of output fields: Field 1: Chromosome Field 2: Start coordinate Field 3: End coordinate Field 4: Total signal contained within denoted coordinates Field 5: Maximum bedgraph signal attained at any base pair within denoted coordinates Field 6: Region representing the farthest upstream and farthest downstream bases within the denoted coordinates that are represented by the maximum bedgraph signal Examples: bash SEACR_1.4.sh -b target.bedgraph -c IgG.bedgraph -n norm -m stringent -o output Calls enriched regions in target data using normalized IgG control track with stringent threshold and default 10% peak extension for merging bash SEACR_1.4.sh -b target.bedgraph -c IgG.bedgraph -n non -m relaxed -o output -e 0.5 Calls enriched regions in target data using non-normalized IgG control track with relaxed threshold and 50% peak extension for merging bash SEACR_1.4.sh -b target.bedgraph -c 0.01 -n non -m stringent -o output Calls enriched regions in target data by selecting the top 1% of regions by total signal ``` --- ## CUT&RUN peak calling with SEACR Now we can call peaks with the bedgraph file we generated and the SEACR.sh script. Here we set a few key arguments for peaks calling: * -b: the path to the bedgraph file * -c: either a path to a control file (eg IgG) or a number between 0 and 1. If its a number, then this fraction will control the top peaks returned based on the signal in the peak + below we will only return the top 1% of peaks * -n: 'norm' if using control, 'non' if not * -m: 'relaxed' or 'stringent' mode ``` r seacr_prefix <- gsub("\\.bedgraph", "_top01", bedgraph) system(paste(seacr_path, "-b", bedgraph, "-c 0.01", "-n non", "-m stringent", "-o", seacr_prefix)) ``` --- ## CUT&RUN peak calling with SEACR Following peak calling we get one file. * {prefix}.stringent.bed -- a bed file with columns showing signal information that is suitable for IGV and further analysis We can look at this peak file in IGV and compare to the MACS peak calls: <img src="imgs/macs_and_seacr_igv.png" alt="igv" height="400" width="800"> --- class: inverse, center, middle # Post alignment processing and peak calling - ATACseq <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Proper pairs - ATACseq ATACseq processing and peak calling is very similar to CUT&RUN, with minor differences. We will quickly go through the same workflow for ATACseq. ``` r flags = scanBamFlag(isProperPair = TRUE) myParam = ScanBamParam(flag = flags, what = c("qname", "mapq", "isize", "flag")) sortedBAM_atac <- "data/W6_ATAC_rep1_chr18_sorted.bam" atacReads <- readGAlignmentPairs(sortedBAM_atac, param = myParam) head(atacReads, 2) ``` ``` ## GAlignmentPairs object with 2 pairs, strandMode=1, and 0 metadata columns: ## seqnames strand : ranges -- ranges ## <Rle> <Rle> : <IRanges> -- <IRanges> ## [1] chr18 - : 3000146-3000189 -- 3000146-3000189 ## [2] chr18 + : 3000257-3000300 -- 3000280-3000325 ## ------- ## seqinfo: 44 sequences from an unspecified genome ``` --- ## Plotting insert sizes - ATACseq ATACseq should represent a mix of fragment lengths corresponding to nucleosome free, mononucleosome and polynucleosome fractions. This is different from TF CUT&RUN where we expect almost exclusively short fragements. We can use the **table()** function to retrieve a vector of the occurrence of each fragment length. ``` r atacReads_read1 <- GenomicAlignments::first(atacReads) atacReads_read2 <- GenomicAlignments::second(atacReads) insertSizes_atac <- abs(elementMetadata(atacReads_read1)$isize) fragLenSizes_atac <- table(insertSizes_atac) toPlot_atac <- data.frame(InsertSize = as.numeric(names(fragLenSizes_atac)), Count = as.numeric(fragLenSizes_atac)) fragLenPlot_atac <- ggplot(toPlot_atac, aes(x = InsertSize, y = Count)) + geom_line() + theme_bw() ``` --- ## Plotting insert sizes - ATACseq The nucleosome pattern becomes much more clear when we apply a log2 transformation to the counts. A good quality ATACseq experiment typically shows this nucleosome patterning. ``` r fragLenPlot_atac + scale_y_continuous(trans = "log2") ``` <!-- --> --- ## Plotting insert sizes - ATACseq We can now annotate our nucleosome free (< 100bp), mononucleosome (180bp-247bp) and dinucleosome (315-437) as in the Greenleaf study. ``` r fragLenPlot_atac + scale_y_continuous(trans = "log2") + geom_vline(xintercept = c(180, 247), colour = "red") + geom_vline(xintercept = c(315, 437), colour = "darkblue") + geom_vline(xintercept = c(100), colour = "darkgreen") + theme_bw() ``` <!-- --> --- ## Filter reads by MAPQ - ATACseq We will filter out fragments where either read has a MAPQ of 0. For CUT&RUN we also filtered out longer fragments, we will not do that here to get a full picture of the chromatin structure. ``` r atacReads_filter <- atacReads[!mcols(atacReads_read1)$mapq == 0 | !mcols(atacReads_read2)$mapq == 0] ``` --- ## Remove blacklist regions - ATACseq The `granges` function finds the range spanning the whole fragment (5' read 1 to 3' read 2) when given a GAlignmentPairs object. We then identify the fragments from this GRanges object that overlap the blacklist regions. ``` r # the 'bl_regions' GRanges object was generated in CUT&RUN section fragment_spans_atac <- granges(atacReads_filter) bl_remove_atac <- overlapsAny(fragment_spans_atac, bl_regions) table(bl_remove_atac) ``` --- ## Write out filtered BAM file - ATACseq This logical vector is then used to filter out these fragments that overlap blacklist. It is helpful to include the names for each read in the BAM file for the peak callers, so we unlist the GAlignmentPairs object so each read now has its own row, and then the names are set from the metadata of the object. Lastly, the BAM file is written to file. ``` r atacReads_filter_noBL <- atacReads_filter[!bl_remove_atac] atacReads_unlist <- unlist(atacReads_filter_noBL) names(atacReads_unlist) <- mcols(atacReads_unlist)$qname filter_bam_atac <- gsub("sorted.bam", "filter.bam", basename(sortedBAM_atac)) rtracklayer::export(atacReads_unlist, filter_bam_atac, format = "bam") ``` --- ## Prepare BAM for peak calling - ATACseq Similar to CUT&RUN, the BAM file needs to be cleaned up and prepared for MACS3. Here we do this using Herper and the command line version of samtools to sort by name, fix the metadata with fixmate, and lastly sort by coordinate. ``` r forPeak_bam_atac <- gsub("_filter.bam", "_macs.bam", filter_bam_atac) Herper::with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", { tempBam <- paste0(tempfile(), ".bam") tempBam2 <- paste0(tempfile(), ".bam") system(paste("samtools sort", "-n", "-o", tempBam, filter_bam_atac, sep = " ")) system(paste("samtools fixmate", "-m", tempBam, tempBam2, sep = " ")) system(paste("samtools sort -o", forPeak_bam_atac, tempBam2, sep = " ")) }) ``` --- ## ATACseq peak calling with MACS Then we call peaks with MACS3 using the BAMPE setting. While its not common to perform ATAC-seq using single end sequencing, there are some specific strategies to call peaks for single end data. In this case, see our [extended ATACseq course](https://rockefelleruniversity.github.io/RU_ATACseq/presentations/slides/RU_ATAC_part2.html#26). ``` r peaks_name_atac <- gsub("_macs.bam", "", basename(forPeak_bam_atac)) with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system2(command = "macs3", args = c("callpeak", "-t", forPeak_bam_atac, "--format", "BAMPE", "--outdir", ".", "--name", peaks_name_atac))) ``` --- ## ATACseq peak calling with MACS <img src="imgs/atac_igv.png" alt="igv" height="400" width="800"> --- ## BAM file for Nucleosome free regions It is also common to only use nucleosome free regions (e.g. < 100bp) to focus on TF binding sites. This can be helpful for some down stream analyses, such as motif analysis. In this case, we filter the GAlignmentPairs object based on isnert size. ``` r read1_noBL <- GenomicAlignments::first(atacReads_filter_noBL) insertSizes_atac_noBL <- abs(elementMetadata(read1_noBL)$isize) atacReads_NF <- atacReads_filter_noBL[insertSizes_atac_noBL < 100] atacReads_NF_unlist <- unlist(atacReads_NF) names(atacReads_NF_unlist) <- mcols(atacReads_NF_unlist)$qname NF_bam_atac <- gsub("sorted.bam", "filterNF.bam", basename(sortedBAM_atac)) rtracklayer::export(atacReads_NF_unlist, NF_bam_atac, format = "bam") ``` --- ## NF BAM for peak calling - ATACseq As before, the BAM file needs to be cleaned up and prepared for MACS3. ``` r forPeak_NFbam_atac <- gsub("_filterNF.bam", "_macsNF.bam", NF_bam_atac) Herper::with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", { tempBam <- paste0(tempfile(), ".bam") tempBam2 <- paste0(tempfile(), ".bam") system(paste("samtools sort", "-n", "-o", tempBam, NF_bam_atac, sep = " ")) system(paste("samtools fixmate", "-m", tempBam, tempBam2, sep = " ")) system(paste("samtools sort -o", forPeak_NFbam_atac, tempBam2, sep = " ")) }) ``` --- ## Call NF peaks with MACS Then we call peaks for the nucleosome free regions with MACS3 using the BAMPE setting. ``` r peaksNF_name_atac <- gsub("_macsNF.bam", "_macsNF", basename(forPeak_NFbam_atac)) with_CondaEnv("CnR_analysis", pathToMiniConda = "miniconda", system2(command = "macs3", args = c("callpeak", "-t", forPeak_NFbam_atac, "--format", "BAMPE", "--outdir", ".", "--name", peaksNF_name_atac))) ``` <img src="imgs/atac_with_NF.png" alt="igv" height="400" width="800"> --- ## Time for an exercise! Exercise on CUT&RUN data can be found [here](../../exercises/exercises/ATACCutRunChIP_AlignPeaksQC_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/ATACCutRunChIP_AlignPeaksQC_answers.html)