class: middle, inverse, title-slide .title[ # Epigenomics, Session 3 ] .subtitle[ ##
Bioinformatics Resource Center - Rockefeller University ] .author[ ###
http://rockefelleruniversity.github.io/ATAC.Cut-Run.ChIP/
] .author[ ###
brc@rockefeller.edu
] --- class: inverse, center, middle # Quality control - CUT&RUN <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Quality Control CUT&RUN has many sources of potential noise including * Varying efficiency of antibodies * Non-specific binding * Library complexity * artifacts and background Many of these sources of noise can be assessed using some well established methodology, many borrowed from ChIPseq.. --- ## Mapped reads First, we can retrieve and plot the number of mapped reads using [the **idxstatsBam()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction//presentations/slides/AlignedDataInBioconductor.html#16) ``` r mappedReads <- idxstatsBam("~/Downloads/SOX9CNR_W6_rep1_sorted.bam") mappedReads <- mappedReads[mappedReads$seqnames %in% paste0("chr", c(1:19, "X", "Y", "M")), ] TotalMapped <- sum(mappedReads[, "mapped"]) ggplot(mappedReads, aes(x = seqnames, y = mapped)) + geom_bar(stat = "identity") + coord_flip() ``` <!-- --> --- ## Quality metrics for CUT&RUN The [**ChIPQC package**](https://bioconductor.org/packages/release/bioc/html/ChIPQC.html) wraps some of the metrics into a Bioconductor package and takes care to measure these metrics under the appropriate condition. To run a single sample we can use the **ChIPQCsample()** function, the relevant **unfiltered** BAM file and we are recommended to supply a **blacklist** as a BED file or GRanges and Genome name. We can use the same blacklist bed file that we used for peak calling. This file is available in the *data* folder of this course and also at the [Boyle lab Github page](https://github.com/Boyle-Lab/Blacklist/tree/master/lists). ``` r QCresult <- ChIPQCsample(reads = "/pathTo/myCnRreads.bam", genome = "mm10", peaks = "/pathTo/myCnRpeaks.bed", blacklist = "/pathTo/mm10_Blacklist.bed") ``` --- ## Quality control with ChIPQC We can then provide an initial analysis of our CUT&RUN samples quality using the **ChIPQCsample()** function from the [**ChIPQC** package.](http://bioconductor.org/packages/stats/bioc/ChIPQC/) Here we evaluate the quality of samples we aligned in the prior session with Rsubread. The returned object is a **ChIPQCsample** object. Note: We use the BAM file before we filtered out low quality reads and blacklist so we can incorporate those reads into our QC analysis. ``` r library(ChIPQC) blklist <- rtracklayer::import.bed("data/mm10-blacklist.v2.bed") qc_sox9_rep1 <- ChIPQCsample("data/SOX9CNR_W6_rep1_chr18_sorted.bam", annotation = "mm10", peaks = "data/SOX9CNR_W6_rep1_chr18_peaks.narrowPeak", blacklist = blklist, chromosomes = "chr18") ``` ``` ## 1 / 300 2 / 300 3 / 300 4 / 300 5 / 300 6 / 300 7 / 300 8 / 300 9 / 300 10 / 300 11 / 300 12 / 300 13 / 300 14 / 300 15 / 300 16 / 300 17 / 300 18 / 300 19 / 300 20 / 300 21 / 300 22 / 300 23 / 300 24 / 300 25 / 300 26 / 300 27 / 300 28 / 300 29 / 300 30 / 300 31 / 300 32 / 300 33 / 300 34 / 300 35 / 300 36 / 300 37 / 300 38 / 300 39 / 300 40 / 300 41 / 300 42 / 300 43 / 300 44 / 300 45 / 300 46 / 300 47 / 300 48 / 300 49 / 300 50 / 300 51 / 300 52 / 300 53 / 300 54 / 300 55 / 300 56 / 300 57 / 300 58 / 300 59 / 300 60 / 300 61 / 300 62 / 300 63 / 300 64 / 300 65 / 300 66 / 300 67 / 300 68 / 300 69 / 300 70 / 300 71 / 300 72 / 300 73 / 300 74 / 300 75 / 300 76 / 300 77 / 300 78 / 300 79 / 300 80 / 300 81 / 300 82 / 300 83 / 300 84 / 300 85 / 300 86 / 300 87 / 300 88 / 300 89 / 300 90 / 300 91 / 300 92 / 300 93 / 300 94 / 300 95 / 300 96 / 300 97 / 300 98 / 300 99 / 300 100 / 300 101 / 300 102 / 300 103 / 300 104 / 300 105 / 300 106 / 300 107 / 300 108 / 300 109 / 300 110 / 300 111 / 300 112 / 300 113 / 300 114 / 300 115 / 300 116 / 300 117 / 300 118 / 300 119 / 300 120 / 300 121 / 300 122 / 300 123 / 300 124 / 300 125 / 300 126 / 300 127 / 300 128 / 300 129 / 300 130 / 300 131 / 300 132 / 300 133 / 300 134 / 300 135 / 300 136 / 300 137 / 300 138 / 300 139 / 300 140 / 300 141 / 300 142 / 300 143 / 300 144 / 300 145 / 300 146 / 300 147 / 300 148 / 300 149 / 300 150 / 300 151 / 300 152 / 300 153 / 300 154 / 300 155 / 300 156 / 300 157 / 300 158 / 300 159 / 300 160 / 300 161 / 300 162 / 300 163 / 300 164 / 300 165 / 300 166 / 300 167 / 300 168 / 300 169 / 300 170 / 300 171 / 300 172 / 300 173 / 300 174 / 300 175 / 300 176 / 300 177 / 300 178 / 300 179 / 300 180 / 300 181 / 300 182 / 300 183 / 300 184 / 300 185 / 300 186 / 300 187 / 300 188 / 300 189 / 300 190 / 300 191 / 300 192 / 300 193 / 300 194 / 300 195 / 300 196 / 300 197 / 300 198 / 300 199 / 300 200 / 300 201 / 300 202 / 300 203 / 300 204 / 300 205 / 300 206 / 300 207 / 300 208 / 300 209 / 300 210 / 300 211 / 300 212 / 300 213 / 300 214 / 300 215 / 300 216 / 300 217 / 300 218 / 300 219 / 300 220 / 300 221 / 300 222 / 300 223 / 300 224 / 300 225 / 300 226 / 300 227 / 300 228 / 300 229 / 300 230 / 300 231 / 300 232 / 300 233 / 300 234 / 300 235 / 300 236 / 300 237 / 300 238 / 300 239 / 300 240 / 300 241 / 300 242 / 300 243 / 300 244 / 300 245 / 300 246 / 300 247 / 300 248 / 300 249 / 300 250 / 300 251 / 300 252 / 300 253 / 300 254 / 300 255 / 300 256 / 300 257 / 300 258 / 300 259 / 300 260 / 300 261 / 300 262 / 300 263 / 300 264 / 300 265 / 300 266 / 300 267 / 300 268 / 300 269 / 300 270 / 300 271 / 300 272 / 300 273 / 300 274 / 300 275 / 300 276 / 300 277 / 300 278 / 300 279 / 300 280 / 300 281 / 300 282 / 300 283 / 300 284 / 300 285 / 300 286 / 300 287 / 300 288 / 300 289 / 300 290 / 300 291 / 300 292 / 300 293 / 300 294 / 300 295 / 300 296 / 300 297 / 300 298 / 300 299 / 300 300 / 300 ``` ``` ## done ## Calculating coverage ## [1] 1 ``` ``` r class(qc_sox9_rep1) ``` ``` ## [1] "ChIPQCsample" ## attr(,"package") ## [1] "ChIPQC" ``` --- ## Quality control with ChIPQC We can display our **ChIPQCsample** object which will show a basic summary of our CUT&RUN quality. ``` r qc_sox9_rep1 ``` ``` ## ChIPQCsample ``` ``` ## Number of Mapped reads: 829781 ``` ``` ## Number of Mapped reads passing MapQ filter: 788532 ``` ``` ## Percentage Of Reads as Non-Duplicates (NRF): 100(0) ``` ``` ## Percentage Of Reads in Blacklisted Regions: 2 ``` ``` ## SSD: 2.61011887557391 ``` ``` ## Fragment Length Cross-Coverage: 0.0133142837287353 ``` ``` ## Relative Cross-Coverage: 1.06351020408163 ``` ``` ## Percentage Of Reads in GenomicFeature: ``` ``` ## ProportionOfCounts ## Peaks 0.00000000 ## BlackList 0.02050392 ## LongPromoter20000to2000 0.27655060 ## Promoters2000to500 0.03991975 ## Promoters500 0.03665038 ## All5utrs 0.01701516 ## Alltranscripts 0.57769628 ## Allcds 0.01479838 ## Allintrons 0.54665125 ## All3utrs 0.01184733 ``` ``` ## Percentage Of Reads in Peaks: 0 ``` ``` ## Number of Peaks: 0 ``` ``` ## GRanges object with 0 ranges and 2 metadata columns: ## seqnames ranges strand | Counts bedRangesSummitsTemp ## <Rle> <IRanges> <Rle> | <integer> <numeric> ## ------- ## seqinfo: 1 sequence from an unspecified genome; no seqlengths ``` --- ## QC of multiple samples It is helpful to review CUT&RUN quality versus other samples (including any IgG) which we are using (or even external data if you do not have your own). This will allow us to identify expected patterns of CUT&RUN enrichment in our samples versus controls as well as spot any outlier samples by these metrics. We can run **ChIPQCsample()** on all our samples of interest using an **lapply**. First we make vectors of BAM files and peak files where the indeces of BAMs and corresponding peaks line up. If you want to try on your own, the BAM files are [available on dropbox](https://www.dropbox.com/scl/fo/f8q9iz5j1mic0wrhzei2c/AEIL42gWMI-Tc_OwFxc5wOA?rlkey=gxl9u7rqk783zuz4tezj9e3pu&st=qzh2allk&dl=0), but be aware that they are big files. ``` r bamsToQC <- c("~/Downloads/SOX9CNR_D0_rep1_sorted.bam", "~/Downloads/SOX9CNR_D0_rep2_sorted.bam", "~/Downloads/SOX9CNR_W6_rep1_sorted.bam", "~/Downloads/SOX9CNR_W6_rep2_sorted.bam") peaksToQC <- c("data/peaks/SOX9CNR_D0_rep1_macs_peaks.narrowPeak", "data/peaks/SOX9CNR_D0_rep2_macs_peaks.narrowPeak", "data/peaks/SOX9CNR_W6_rep1_macs_peaks.narrowPeak", "data/peaks/SOX9CNR_W6_rep2_macs_peaks.narrowPeak") ``` --- ## QC of multiple samples Then we can run the `ChIPQCsample` function using *lapply* while cycling through the paths to BAM and peak files. Here we limit the analyis to chromosome 18. ``` r myQC <- lapply(seq_along(bamsToQC), function(x) { ChIPQCsample(bamsToQC[x], annotation = "mm10", peaks = peaksToQC[x], blacklist = blklist, chromosomes = "chr18") }) names(myQC) <- basename(bamsToQC) ``` --- ## QC of multiple samples All ChIPQC functions can work with a named list of **ChIPQCsample** objects to aggregate scores into table as well as plots. Here we use the **QCmetrics()** function to give an overview of quality metrics. ``` r QCmetrics(myQC) ``` ``` ## Reads Map% Filt% Dup% ReadL FragL RelCC SSD RiP% ## SOX9CNR_D0_rep1_sorted.bam 3858873 100 7.49 0 36 79 2.000 3.23 5.59 ## SOX9CNR_D0_rep2_sorted.bam 2310101 100 9.05 0 34 81 3.290 3.50 1.22 ## SOX9CNR_W6_rep1_sorted.bam 835558 100 5.05 0 36 73 0.975 2.66 32.80 ## SOX9CNR_W6_rep2_sorted.bam 1600013 100 5.49 0 35 71 0.348 5.60 25.90 ## RiBL% ## SOX9CNR_D0_rep1_sorted.bam 2.70 ## SOX9CNR_D0_rep2_sorted.bam 2.72 ## SOX9CNR_W6_rep1_sorted.bam 1.91 ## SOX9CNR_W6_rep2_sorted.bam 1.97 ``` --- class: inverse, center, middle # Blacklists and SSD <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Blacklists As we discussed while calling peaks, CUT&RUN will often show the presence of common artifacts, such as ultra-high signal regions. Such regions can confound peak calling, fragment length estimation and QC metrics. Anshul Kundaje created the DAC blacklist as a reference to help deal with these regions. <div align="center"> <img src="imgs/blacklist.png" alt="offset" height="400" width="400"> </div> --- ## Blacklists and SSD SSD is one of these measures that is sensitive to blacklisted artifacts. SSD is a measure of standard deviation of signal across the genome with higher scores reflecting significant pile-up of reads. SSD can therefore be used to assess both the extent of ultra high signals and the signal. But first blacklisted regions must be removed. <div align="center"> <img src="imgs/ssdAndBlacklist.png" alt="offset" height="400" width="300"> </div> --- ## Standardized Standard Deviation ChIPQC calculates SSD before and after removing signal coming from Blacklisted regions. The **plotSSD()** function plots samples's pre-blacklisting score in **red** and post-blacklisting score in **blue**. Blacklisting does not have a huge impact on SSD in our samples, suggesting a strong peak signal. ``` r plotSSD(myQC) + xlim(0, 7) ``` <!-- --> --- ## Standardized Standard Deviation This is not always the case. Sometimes you will much higher pre-blacklist SSD than post-blacklist SSD. This would indicate a strong background signal in blacklisted regions for that sample. Below is an example of this from another dataset: <img src="imgs/blacklist_ssd.png"height="200" width="600"> --- class: inverse, center, middle # Library complexity and enrichment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Library complexity A potential source of noise in CUT&RUN is overamplification of the CUT&RUN library during a PCR step. This can lead to large number of duplicate reads which may confound peak calling.  --- ## Duplication We should compare our duplication rate across samples to identify any sample experiencing overamplification and so potential of a lower complexity. The **flagtagcounts()** function reports can report the number of duplicates and total mapped reads and so from there we can calculate our duplication rate. ``` r myFlags <- flagtagcounts(myQC) myFlags["DuplicateByChIPQC", ]/myFlags["Mapped", ] ``` ``` ## SOX9CNR_D0_rep1_sorted.bam SOX9CNR_D0_rep2_sorted.bam ## 0.18044154 0.25100764 ## SOX9CNR_W6_rep1_sorted.bam SOX9CNR_W6_rep2_sorted.bam ## 0.22392940 0.05412144 ``` --- ## Enrichment for reads across genes We can also use ChIPQC to review where our distribution of reads across gene features using the **plotRegi()** function. Here we expect CUT&RUN signal to be stronger in 5'UTRs and promoters when compared to input samples. ``` r p <- plotRegi(myQC) ``` --- ## Enrichment for reads across genes. ``` r p ``` <!-- --> --- class: inverse, center, middle # Quality control - ATACseq <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Distribution of mapped reads In ATACseq we will want to check the distribution of mapped reads across chromosomes. [We can check the number of mapped reads on every chromosome using the **idxstatsBam()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignedDataInBioconductor.html#15) ATACseq is known have high signal on the mitochondrial chromosomes and so we can check for that here. ``` r library(Rsamtools) mappedReads_atac <- idxstatsBam("W6_ATAC_rep1_sorted.bam") mappedReads_atac <- mappedReads_atac[mappedReads_atac$seqnames %in% paste0("chr", c(1:19, "X", "Y", "M")), ] ``` --- ## Distribution of mapped reads We can now use the mapped reads data frame to make a barplot of reads across chromosomes. In this example, we see a case where the mapping rate to mitochondrial genome is high. ``` r library(ggplot2) ggplot(mappedReads_atac, aes(seqnames, mapped)) + geom_bar(stat = "identity") + coord_flip() ``` <!-- --> --- class: inverse, center, middle # ATACseqQC <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## ATACseqQC The **ATACseqQC** library allows us to run many of the ATACseq QC steps we have seen in a single step. It may consume a little more memory but will allow for the inclusion of two more useful metrics called the PCR Bottleneck Coefficients (PBC1 and PBC2). First we must install the library. ``` r BiocManager::install("ATACseqQC") ``` --- ## ATACseqQC As with ChIPQC, the ATACseqQC function contains a workflow function which will acquire much of the required QC with a single argument of the BAM file path. Since this can be fairly memory heavy, I am just illustrating it here on a BAM file containing just the chromosome 18 reads of the ATACseq data. ``` r library(ATACseqQC) ``` ``` ## ``` ``` r atac_bam <- "data/W6_ATAC_rep1_chr18_sorted.bam" ATACQC <- bamQC(atac_bam) ``` --- ## ATACseqQC The resulting ATACQC object has many slots of QC information including duplicateRate, non-redundant fraction, distribution of signal across chromosomes, mitochondrial fraction etc. These include the **PCRbottleneckCoefficient_1** and **PCRbottleneckCoefficient_2** values. ``` r names(ATACQC) ``` ``` ## [1] "totalQNAMEs" "duplicateRate" ## [3] "mitochondriaRate" "properPairRate" ## [5] "unmappedRate" "hasUnmappedMateRate" ## [7] "notPassingQualityControlsRate" "nonRedundantFraction" ## [9] "PCRbottleneckCoefficient_1" "PCRbottleneckCoefficient_2" ## [11] "MAPQ" "idxstats" ``` --- ## PCR bottleneck coefficients PCR bottleneck coefficients identify PCR bias/overamplification which may have occurred in preparation of ATAC samples. The **PCRbottleneckCoefficient_1** is calculated as the number of positions in genome with *exactly* 1 read mapped uniquely compared to the number of positions with *at least* 1 read. For example if we have 20 reads. 16 map uniquely to locations. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/18. We therefore have a PBC1 of 0.889 Values less than 0.7 indicate severe bottlenecking, between 0.7 and 0.9 indicate moderate bottlenecking. Greater than 0.9 show no bottlenecking. ``` r ATACQC$PCRbottleneckCoefficient_1 ``` ``` ## [1] 0.9419972 ``` --- ## PCR bottleneck coefficients The **PCRbottleneckCoefficient_2** is our secondary measure of bottlenecking. It is calculated as the number of positions in genome with **exactly** 1 read mapped uniquely compared to the number of positions with **exactly** 2 reads mapping uniquely. We can reuse our example. If we have 20 reads, 16 of which map uniquely. 4 do not map uniquely, instead there are 2 locations, both of which have 2 reads. This would lead us to calculation 16/2. We therefore have a PBC2 of 8. Values less than 1 indicate severe bottlenecking, between 1 and 3 indicate moderate bottlenecking. Greater than 3 show no bottlenecking. ``` r ATACQC$PCRbottleneckCoefficient_2 ``` ``` ## [1] 17.22071 ``` --- ## ATACseqQC insert sizes plot In the peak calling lecture we made a plot of insert sizes. This is a very important QC readout for ATACseq, and ATACseqQC has a function that makes a similar plot by just providing a path to a BAM file. ``` r fragSize <- fragSizeDist(atac_bam, bamFiles.labels = gsub("\\.bam", "", basename(atac_bam))) ``` <!-- --> --- ## ATACseqQC insert sizes plot In addition to printing the plot, this function returns a vector of insert size distributions that could be used to manually make a plot. This BAM file has not been filtered for proper pairs, so these numbers will be different from those seen in the peak calling lecture. ``` r fragSize$W6_ATAC_rep1_chr18_sorted[1:10] ``` ``` ## ## 0 18 19 21 22 23 24 25 26 27 ## 1720 2 2 14 10 2 312 198 1714 1314 ``` --- class: inverse, center, middle # Evaluating TSS signal <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Evaluating signal over TSS regions If our shorter fragments represent the open regions around transcription factors and transcriptional machinery we would expect to see signal at transcriptional start sites. Our longer fragments will represent signal around nucleosomes and so signal should be outside of the transcriptional start sites and more present at the +1 and -1 nucleosome positions. <div align="center"> <img src="imgs/phasing.png" alt="offset" height="300" width="350"> </div> --- ## Evaluating signal over TSS regions We can create a meta-plot over all TSS regions to illustrate where our nucleosome free and nucleosome occupied fractions of signal are most prevalent. Meta-plots average or sum signal over sets of regions to identify trends in data. <div align="center"> <img src="imgs/signalOverTSS.png" alt="offset" height="300" width="600"> </div> --- ## Plotting signal over regions in R To produce meta-plots of signal over regions we can use the **soGGi** bioconductor package. We can load in **soGGi** with the BiocManager::install and library function, as we have done before. ``` r BiocManager::install("soGGi") library(soGGi) ``` --- ## Plotting regions in soGGi The soGGi library simply requires a BAM file and a GRanges of regions over which to average signal to produce the meta-plot. We wish to plot over TSS regions and so we first will need to produce a GRanges of TSS locations for hg19 genome. Thankfully we now know how to extract these regions for all genes using the **TxDB packages** and some **GenomicRanges** functions. First we can load our TxDb of interest - **TxDb.Hsapiens.UCSC.hg19.knownGene**. ``` r library(TxDb.Mmusculus.UCSC.mm10.knownGene) ``` ``` ## Warning: package 'TxDb.Mmusculus.UCSC.mm10.knownGene' was built under R version ## 4.4.1 ``` ``` ## Warning: package 'GenomicFeatures' was built under R version 4.4.1 ``` ``` ## Warning: package 'AnnotationDbi' was built under R version 4.4.1 ``` ``` r TxDb.Mmusculus.UCSC.mm10.knownGene ``` ``` ## TxDb object: ## # Db type: TxDb ## # Supporting package: GenomicFeatures ## # Data source: UCSC ## # Genome: mm10 ## # Organism: Mus musculus ## # Taxonomy ID: 10090 ## # UCSC Table: knownGene ## # UCSC Track: GENCODE VM23 ## # Resource URL: http://genome.ucsc.edu/ ## # Type of Gene ID: Entrez Gene ID ## # Full dataset: yes ## # miRBase build ID: NA ## # transcript_nrow: 142446 ## # exon_nrow: 447558 ## # cds_nrow: 243727 ## # Db created by: GenomicFeatures package from Bioconductor ## # Creation time: 2019-10-21 20:52:26 +0000 (Mon, 21 Oct 2019) ## # GenomicFeatures version at creation time: 1.37.4 ## # RSQLite version at creation time: 2.1.2 ## # DBSCHEMAVERSION: 1.2 ``` --- ## Plotting regions in soGGi We can extract gene locations (TSS to TTS) [using the **genes()** function and our **TxDb** object.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicFeatures_In_Bioconductor.html#15) ``` r genesLocations <- genes(TxDb.Mmusculus.UCSC.mm10.knownGene) ``` ``` ## 66 genes were dropped because they have exons located on both strands ## of the same reference sequence or on more than one reference sequence, ## so cannot be represented by a single genomic range. ## Use 'single.strand.genes.only=FALSE' to get all the genes in a ## GRangesList object, or use suppressMessages() to suppress this message. ``` ``` r genesLocations ``` ``` ## GRanges object with 24528 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 100009600 chr9 21062393-21073096 - | 100009600 ## 100009609 chr7 84935565-84964115 - | 100009609 ## 100009614 chr10 77711457-77712009 + | 100009614 ## 100009664 chr11 45808087-45841171 + | 100009664 ## 100012 chr4 144157557-144162663 - | 100012 ## ... ... ... ... . ... ## 99889 chr3 84496093-85887516 - | 99889 ## 99890 chr3 110246109-110250998 - | 99890 ## 99899 chr3 151730922-151749960 - | 99899 ## 99929 chr3 65528410-65555518 + | 99929 ## 99982 chr4 136550540-136602723 - | 99982 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` --- ## Plotting regions in soGGi We can then use the [**resize()** function](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#40) to extract the location of start of every gene (the TSSs) in a stranded manner. Here we set the **fix** position as the start and the width as 1. ``` r tssLocations <- resize(genesLocations, fix = "start", width = 1) tssLocations ``` ``` ## GRanges object with 24528 ranges and 1 metadata column: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <character> ## 100009600 chr9 21073096 - | 100009600 ## 100009609 chr7 84964115 - | 100009609 ## 100009614 chr10 77711457 + | 100009614 ## 100009664 chr11 45808087 + | 100009664 ## 100012 chr4 144162663 - | 100012 ## ... ... ... ... . ... ## 99889 chr3 85887516 - | 99889 ## 99890 chr3 110250998 - | 99890 ## 99899 chr3 151749960 - | 99899 ## 99929 chr3 65528410 + | 99929 ## 99982 chr4 136602723 - | 99982 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` --- ## Plotting regions in soGGi When we created our index we subset the genome to the main chromosomes. We can do this again with our TSS GRange object, and update the levels. This means the BAM and GRanges will play nicely. ``` r mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M")) tssLocations <- tssLocations[as.vector(seqnames(tssLocations)) %in% mainChromosomes] seqlevels(tssLocations) <- mainChromosomes ``` --- ## Plotting regions in soGGi The soGGi package's **regionPlot()** function requires a BAM file of data to plot supplied to **bamFile** parameter and a GRanges to plot over supplied to **testRanges** argument. ``` r library(soGGi) sortedBAM <- "~/Downloads/W6_ATAC_rep1_sorted.bam" library(Rsamtools) # Nucleosome free allSignal <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations) ``` --- ## Plotting regions in soGGi Additionally we supply information on input file format to **format** parameter, whether data is paired to **paired** parameter and type of plot to **style** parameter. We explore visualization options [visualization training](https://rockefelleruniversity.github.io/RU_VisualizingGenomicsData.html). A useful feature is that we can we can specify the minimum and maximum fragment lengths of paired reads to be used in our plot with the **minFragmentLength** and **maxFragmentLength** parameters. This allows us to select only our nucleosome free signal (< 100 base-pairs) to produce our metaplot over TSS regions. ``` r nucFree <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point", format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100, forceFragment = 50) ``` ``` ## Warning: package 'soGGi' was built under R version 4.4.1 ``` --- ## Plotting regions in soGGi Now we have our profile object we can create our metaplot using the **plotRegion()** function in **soGGi**. Here we see the expected peak of signal for our nucleosome free region in the region over the TSS. ``` r plotRegion(nucFree) ``` <!-- --> --- ## Plotting regions in soGGi We can create a plot for our mono-nucleosome signal by adjusting our **minFragmentLength** and **maxFragmentLength** parameters to those expected for nucleosome length fragments (here 180 to 240). ``` r monoNuc <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point", format = "bam", paired = TRUE, minFragmentLength = 180, maxFragmentLength = 240, forceFragment = 80) ``` --- ## Plotting regions in soGGi Similarly we can plot the mono-nucleosome signal over TSS locations using **plotRegion()** function. In this plot we can clearly see the expected +1 nucleosome signal peak as well as several other nucleosome signalpeaks ``` r plotRegion(monoNuc) ``` <!-- --> --- ## Time for an exercise! Exercise on CUT&RUN data can be found [here](../../exercises/exercises/ATACCutRunChIP_AlignPeaksQC_exercise.html) --- ## Answers to exercise Answers can be found [here](../../exercises/answers/ATACCutRunChIP_AlignPeaksQC_answers.html)