RU_ATAC_part2.knit

class: center, middle, inverse, title-slide

# ATACseq In Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/RU_ATACseq/" class="uri">http://rockefelleruniversity.github.io/RU_ATACseq/</a>

---

## Data

In this section we will work a little more with the Greenleaf dataset.

To remind you, this is the original ATACseq dataset and we are working with a single sample. More details are [here.](https://rockefelleruniversity.github.io/RU_ATACseq/presentations/slides/RU_ATAC.html#5)

We will characterize the Greenleaf data signal around TSS regions and identify and annotate open regions using peak calls with MACS2.

At the end of the session we will plot transposase cut-sites around DNA binding proteins.

---

## Data

In this session we will also start working with a larger dataset from the ENCODE consortium generated by Bing Ren at UCSD. This dataset has multiple samples for multiple tissues (Liver, Kidney and Hindbrain), allowing us to perform a differential analysis on open regions.

Links to data and sample information are included in list below.

* Liver day 12 - [ENCSR302LIV](https://www.encodeproject.org/experiments/ENCSR302LIV/)

* Kidney day 15 - [ENCSR023QZX](https://www.encodeproject.org/experiments/ENCSR023QZX/)

* Hindbrain day 12 - [ENCSR088UYE](https://www.encodeproject.org/experiments/ENCSR088UYE/)

---
## Processed Data

We start with public sequencing data in links below and use reference data in Bioconductor.

Since some of these processing steps may take a little time I provide links to pre-processed results.

**Essentials**

BAM file and BAI index from our alignment/sorting/indexing.

* [SAMN02192806 - Greenleaf BAM](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam) - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing.

* [SAMN02192806 - Greenleaf BAI index](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2.bam.bai) - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing.

---
**Essentials**

Small BAM, peak calls and directory structure.

* [ATAC_Workshop_Essential.zip](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop_Essential.zip) - Require additional workshop files and directory structure.

Once you have downloaded the above and unzipped **ATAC_Workshop.zip**, you should move the **Sorted_ATAC_50K_2.bam** and **Sorted_ATAC_50K_2.bam.bai** file into **ATAC_Workshop/ATAC_Data/ATAC_BAM/**

---
**Not essential**

Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.

* [ATAC_Workshop.zip](https://s3.amazonaws.com/rubioinformatics/ATAC_Workshop.zip)  - Additional workshop files and directory structure.

Bigwigs for IGV.

* [Bigwigs](https://s3.amazonaws.com/rubioinformatics/ATAC_bigWigs.zip) - BigWigs to review in IGV.

---
## ATACseq

* ATACseq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.

---
## Working with ATACseq

This means our data potentially contains a mixture of signal types in our data.

* We will have signal from nucleosome free regions and around transcription factors (our shorter fragments).
* A portion of our signal will be from around nucleosomes in open chromatin (longer fragments).

All our data is from open chromatin where our transposase has been able to access.

---
class: inverse, center, middle

# Evaluating TSS signal

---

##  Evaluating signal over TSS regions

If our shorter fragments represent the open regions around transcription factors and transcriptional machinery we would expect to see signal at transcriptional start sites.

Our longer fragments will represent signal around nucleosomes and so signal should be outside of the transcriptional start sites and more present at the +1 and -1 nucleosome positions.

---

##  Evaluating signal over TSS regions

We can create a meta-plot over all TSS regions to illustrate where our nucleosome free and nucleosome occupied fractions of signal are most prevalent.

Meta-plots average or sum signal over sets of regions to identify trends in data.

---

##  Plotting signal over regions in R

To produce meta-plots of signal over regions we can use the **soGGi** bioconductor package. We can load in **soGGi** with the BiocManager::install and library function, as we have done before.

```r
BiocManager::install("soGGi")
library(soGGi)
```
---

##  Plotting regions in soGGi

The soGGi library simply requires a BAM file and a GRanges of regions over which to average signal to produce the meta-plot.

We wish to plot over TSS regions and so we first will need to produce a GRanges of TSS locations for hg19 genome.

Thankfully we now know how to extract these regions for all genes using the **TxDB packages** and some **GenomicRanges** functions.

First we can load our TxDb of interest - **TxDb.Hsapiens.UCSC.hg19.knownGene**.

```r
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene
```

```
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
```

---

##  Plotting regions in soGGi

We can extract gene locations (TSS to TTS) [using the **genes()** function and our **TxDb** object.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/GenomicFeatures_In_Bioconductor.html#15)

```r
genesLocations <- genes(TxDb.Hsapiens.UCSC.hg19.knownGene)
```

```
##   403 genes were dropped because they have exons located on both strands
##   of the same reference sequence or on more than one reference sequence,
##   so cannot be represented by a single genomic range.
##   Use 'single.strand.genes.only=FALSE' to get all the genes in a
##   GRangesList object, or use suppressMessages() to suppress this message.
```

```r
genesLocations
```

```
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames              ranges strand |     gene_id
##            <Rle>           <IRanges>  <Rle> | <character>
##       1    chr19   58858172-58874214      - |           1
##      10     chr8   18248755-18258723      + |          10
##     100    chr20   43248163-43280376      - |         100
##    1000    chr18   25530930-25757445      - |        1000
##   10000     chr1 243651535-244006886      - |       10000
##     ...      ...                 ...    ... .         ...
##    9991     chr9 114979995-115095944      - |        9991
##    9992    chr21   35736323-35743440      + |        9992
##    9993    chr22   19023795-19109967      - |        9993
##    9994     chr6   90539619-90584155      + |        9994
##    9997    chr22   50961997-50964905      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
```

---
## Plotting regions in soGGi

We can then use the [**resize()** function](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/GenomicIntervals_In_Bioconductor.html#40) to extract the location of start of every gene (the TSSs) in a stranded manner.

Here we set the **fix** position as the start and the width as 1.

```r
tssLocations <- resize(genesLocations, fix = "start", width = 1)
tssLocations
```

```
## GRanges object with 23056 ranges and 1 metadata column:
##         seqnames    ranges strand |     gene_id
##            <Rle> <IRanges>  <Rle> | <character>
##       1    chr19  58874214      - |           1
##      10     chr8  18248755      + |          10
##     100    chr20  43280376      - |         100
##    1000    chr18  25757445      - |        1000
##   10000     chr1 244006886      - |       10000
##     ...      ...       ...    ... .         ...
##    9991     chr9 115095944      - |        9991
##    9992    chr21  35736323      + |        9992
##    9993    chr22  19109967      - |        9993
##    9994     chr6  90539619      + |        9994
##    9997    chr22  50964905      - |        9997
##   -------
##   seqinfo: 93 sequences (1 circular) from hg19 genome
```

---
## Plotting regions in soGGi

When we created our index we subset the genome to the main chromosomes. We can do this again with our TSS GRange object, and update the levels. This means the BAM and GRanges will play nicely.

```r
mainChromosomes <- paste0("chr", c(1:21, "X", "Y", "M"))

myindex <- (match(seqnames(tssLocations), mainChromosomes))

tssLocations <- tssLocations[as.numeric(myindex)]

seqlevels(tssLocations) <- mainChromosomes
```

---
##  Plotting regions in soGGi

The soGGi package's **regionPlot()** function requires a BAM file of data to plot supplied to **bamFile** parameter and a GRanges to plot over supplied to **testRanges** argument.

```r
library(soGGi)
sortedBAM <- "~/Downloads/ATAC_Workshop/Sorted_ATAC_50K_2.bam"

library(Rsamtools)
# Nucleosome free
allSignal <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations)
```

---
## Plotting regions in soGGi

Additionally we supply information on input file format to **format** parameter, whether data is paired to **paired** parameter and type of plot to **style** parameter. We explore visualization options [visualization training](https://rockefelleruniversity.github.io/RU_VisualizingGenomicsData.html).

A useful feature is that we can we can specify the minimum and maximum fragment lengths of paired reads to be used in our plot with the **minFragmentLength** and **maxFragmentLength** parameters. This allows us to select only our nucleosome free signal (< 100 base-pairs) to produce our metaplot over TSS regions.

```r
nucFree <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point",
    format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100,
    forceFragment = 50)
class(nucFree)
```
---
##  Plotting regions in soGGi

Now we have our profile object we can create our metaplot using the **plotRegion()** function in **soGGi**.

Here we see the expected peak of signal for our nucleosome free region in the region over the TSS.

```r
plotRegion(nucFree)
```

![](RU_ATAC_part2_files/figure-html/processData_plot-1.png)

---

##  Plotting regions in soGGi

We can create a plot for our mono-nucleosome signal by adjusting our **minFragmentLength** and **maxFragmentLength** parameters to those expected for nucleosome length fragments (here 180 to 240).

```r
monoNuc <- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point",
    format = "bam", paired = TRUE, minFragmentLength = 180, maxFragmentLength = 240,
    forceFragment = 80)
```

---
##  Plotting regions in soGGi

Similarly we can plot the mono-nucleosome signal over TSS locations using **plotRegion()** function.

In this plot we can clearly see the expected +1 nucleosome signal peak as well as several other nucleosome signalpeaks

```r
plotRegion(monoNuc)
```

![](RU_ATAC_part2_files/figure-html/processData_plot3-1.png)

---
class: inverse, center, middle

# Peak calling

---

## Finding Open Regions

A common goal in ATACseq is to identify nucleosome free regions where transcription factors are binding and/or transcriptional machinery is active. This nucleosome free signal would correspond to fragments less than one nucleosome (as defined in Greenleaf paper < 100bp)

To identify open chromatin however we could simply use all reads which have been properly paired in sequencing (< 2000bp).

For the remainder of the workshop we will look at analysing the nucleosome free portions of the ATACseq data.

---
## Peak calling for nucleosome free regions

There are many methods available to call nucleosome free regions from ATACseq data with many borrowed from ChIPseq analysis. One very popular and standard peak caller for ATACseq is MAC2.

There is no R package for MACS2 (well it just got [released](https://github.com/macs3-project/MACS) but we haven't tested it yet). It is possible to install it on Mac and Linux using the [Anaconda](https://anaconda.org/bioconda/salmon) package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them.

Though there is no MACS2 R package, we can interact with the anaconda package system using the R package [Herper](https://bioconductor.org/packages/release/bioc/html/Herper.html). This was created by us here at the BRC and is part of Bioconductor. We cover its installation and use for ChIPseq [here](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor2.html#37)

```r
BiocManager::install("Herper")
library(Herper)
```

---
## Install MACS2 with Herper

First, we will use Herper to install Salmon with the *install_CondaTools* function. We just need to tell *install_CondaTools* what tool/s you want and the name of the environment you want to build.

```r
salmon_paths <- install_CondaTools(tools = "macs2", env = "ATACseq_analysis")
salmon_paths
```

Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which MACS2 will be installed. When you run the function it prints out where MACS2 is installed. There are additional arguments to control where miniconda is installed using *pathToMiniConda* and also update an existing environment with *updateEnv*.

---
## Single-end peak calling

With single-end sequencing from ATACseq we do not know how long the fragments are. To identify open regions therefore requires some different parameters for MACS2 peak calling  compared to ChIPseq.

One strategy employed is to shift read 5' ends by -100 and then extend from this by 200bp. Considering the expected size of our nucleosome free fragments this should provide a pile-up over nucelosome regions suitable for MACS2 window size.

This is the MACS system call we would use to do this.

```
MACS2 callpeak -t singleEnd.bam --nomodel --shift -100
                --extsize 200 --format BAM -g MyGenome
```

---
## Single-end peak calling

In R we can run this system call using Herper so we can access our installed MACS2. MACS2 has been installed into *ATACseq_analysis*. So we can use this environment from R using *with_CondaEnv()*.

```r
with_CondaEnv("ATACseq_analysis",
                      system2(command="macs2",args =c("callpeak", 
                      "-t", "singleEnd.bam",
                      "--nomodel",
                      "--shift","-100",
                      "--extsize", "200",
                      "--format", "BAM",
                      "-g", "hs")),
                        stdout = TRUE))
```

---
## Single-end peak calling

Alternatively for the nucleosome occupied data we can adjust shift and extension to centre the signal on nucleosome centres (nucleosomes wrapped in 147bp of DNA).

```
MACS2 callpeak -t singleEnd.bam --nomodel --shift 37
               --extsize 73 --format BAM -g MyGenome
```

```r
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
    "-t", "singleEnd.bam", "--nomodel", "--shift", "37", "--extsize", "73", "--format",
    "BAM", "-g", "hs")), stdout = TRUE)
```

---
## Paired-end peak calling

If we have sequenced paired-end data then we do know the fragment lengths and can provide BAM files to MACS2 which have been prefiltered to properly paired (and fragment size if you want to distinguish nucleosomes from nucleosome free regions)

We have to simply tell MACS2 that the data is paired using the format argument. This is important as MACS2 will guess it is single-end BAM by default.

```
MACS2 callpeak -t pairedEnd.bam -f BAMPE 
               --outdir path/to/output/
               --name pairedEndPeakName -g MyGenome
```

```r
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
    "-t", "pairedEnd.bam", "--format", "BAMPE", "--outdir", "/Documents/ATAC_MACS2_calls/",
    "--name", "pairedEndPeakName", "-g", "hs")), stdout = TRUE)
```

---
## Paired-end peak calling
For our paired-end data here, we call peaks on our nucleosome free regions from our nucleosome free region BAM file.

```
MACS2 callpeak  -t ~/Downloads/Sorted_ATAC_50K_2_openRegions.bam
                --outdir ATAC_Data/ATAC_Peaks/ATAC_50K_2
                --name Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak
                -f BAMPE -g hs
```

```r
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
    "-t", "~/Downloads/Sorted_ATAC_50K_2_openRegions.bam", "--outdir", "ATAC_Data/ATAC_Peaks/ATAC_50K_2",
    "--name", "Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak", "-f", "BAMPE", "-g",
    "hs")), stdout = TRUE)
```

---
## Paired-end peak calling

Following peak calling we would get 3 files.

* Name.narrowPeak -- Narrow peak format suitable for IGV and further analysis

* Name_peaks.xls -- Peak table suitable for review in excel.(not actually a xls but a tsv)

* summits.bed -- Summit positions for peaks useful for finding motifs and plotting

---
## QC

Often we want to do some QC to check for low quality, duplicates and signal distribution.

Before we remove any data we can get a quick assessment of our reads in peaks, duplication rate, low quality reads and reads in artifact regions from ChIPQC.

To do this we will use the hg19 [blacklist](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor2.html#24) which you can download [here](https://www.encodeproject.org/annotations/ENCSR636HFF/)

```r
library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)

blkList <- import.bed("~/Downloads/ENCFF001TDO.bed.gz")
openRegionPeaks <- "~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_Peaks/Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak"

qcRes <- ChIPQCsample("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam",
    peaks = openRegionPeaks, annotation = "hg19", chromosomes = "chr20", blacklist = blkList,
    verboseT = FALSE)
```

---
## QC

We can use the ChIPQC package to capture some important metrics for our ATACseq data such as reads in peaks and reads in blacklist from the **QCmetrics()** function and numbers of duplicated reads from the **flagtagcounts()** funcion.

```r
myMetrics <- QCmetrics(qcRes)
myMetrics[c("RiBL%", "RiP%")]
```

```
##  RiBL%   RiP% 
##  0.342 12.300
```

```r
flgCounts <- flagtagcounts(qcRes)
DupRate <- flgCounts["DuplicateByChIPQC"]/flgCounts["Mapped"]
DupRate * 100
```

```
## DuplicateByChIPQC 
##          22.90539
```

---
##  Remove blacklisted peaks

Artifacts from sequencing and imperfect genome builds can confound our results. These artifacts have been collated together into "Blacklists" of regions. We discuss these regions in more detail [here](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor2.html#24)

Since blacklisted regions may confound our analysis we remove any peaks which have been called there.

Removing blacklists too early can hide some of the QC issues in your data. The **blacklist** should always be considered in your analysis and recommended to removed data from these regions once QC is considered.

```r
MacsCalls <- granges(qcRes[seqnames(qcRes) %in% "chr20"])

data.frame(Blacklisted = sum(MacsCalls %over% blkList), Not_Blacklisted = sum(!MacsCalls %over%
    blkList))

MacsCalls <- MacsCalls[!MacsCalls %over% blkList]
```

---
class: inverse, center, middle

# Annotating Peaks

---

## Annotating Open Regions

It is often of interest to associate identified nucleosome free regions to genomic features such as genes and enhancers.

Once annotated to genes or enhancers' genes, we can start to associate ATACseq data to characteristics of these genes. (functional annotation, expression changes, other epigenetic states)

---
## Annotating peaks to genes

A simple approach to annotating nucleosome free region to genes is to associate regions to their closest gene or within a window around a genes transcriptional start site.

We can use the [chipseeker library to identify genes closest to our regions and to give us simple summaries and visualizations of this annotation.](https://rockefelleruniversity.github.io/RU_ChIPseq/chipseq_course/Presentations/Slides/ChIPseq_In_Bioconductor2.html#48)

We use the gene models from TxDb.Hsapiens.UCSC.hg19.knownGene and supply this to ChIPseeker packages annotatePeak function.

ChIPseeker's csAnno object will then show breakdown of percentages of peaks in genomic regions.

```r
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
MacsCalls_Anno <- annotatePeak(MacsCalls, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene)
```

```
## >> preparing features information...		 2021-07-28 05:37:21 PM 
## >> identifying nearest features...		 2021-07-28 05:37:21 PM 
## >> calculating distance from peak to TSS...	 2021-07-28 05:37:21 PM 
## >> assigning genomic annotation...		 2021-07-28 05:37:21 PM 
## >> assigning chromosome lengths			 2021-07-28 05:37:36 PM 
## >> done...					 2021-07-28 05:37:36 PM
```

```r
MacsCalls_Anno
```

```
## Annotated peaks generated by ChIPseeker
## 596/596  peaks were annotated
## Genomic Annotation Summary:
##              Feature  Frequency
## 8   Promoter (<=1kb) 40.2684564
## 9   Promoter (1-2kb)  2.0134228
## 10  Promoter (2-3kb)  2.1812081
## 4             5' UTR  0.1677852
## 3             3' UTR  1.0067114
## 1           1st Exon  0.1677852
## 6         Other Exon  1.5100671
## 2         1st Intron  6.0402685
## 7       Other Intron 19.9664430
## 5  Distal Intergenic 26.6778523
```

---
## Displaying annotation distribution

As well as showing us a table of the annotation distribution we can visualize this using the plotAnnoPie and plotAnnoBar functions.

```r
plotAnnoPie(MacsCalls_Anno)
```

![](RU_ATAC_part2_files/figure-html/processData_Pie-1.png)

---

## Annotating nucleosome free regions

With this information we can then subset our peaks/nuc free regions to those only landing in TSS regions (+/- 500)

```r
MacsGR_Anno <- as.GRanges(MacsCalls_Anno)
MacsGR_TSS <- MacsGR_Anno[abs(MacsGR_Anno$distanceToTSS) < 500]
MacsGR_TSS[1, ]
```

```
## GRanges object with 1 range and 9 metadata columns:
##       seqnames        ranges strand |       annotation   geneChr geneStart
##          <Rle>     <IRanges>  <Rle> |      <character> <integer> <integer>
##   [1]    chr20 278001-278160      * | Promoter (<=1kb)        20    278204
##         geneEnd geneLength geneStrand      geneId transcriptId distanceToTSS
##       <integer>  <integer>  <integer> <character>  <character>     <numeric>
##   [1]    280963       2760          1       85364   uc002wdf.3           -44
##   -------
##   seqinfo: 1 sequence from hg19 genome
```

---
class: inverse, center, middle

# Functional analysis of peaks

---

## Functional Analysis of NF regions

Another common step to ATACseq analysis is to identify any functional enrichment in genes associated to nucleosome free regions.

One approach is to take the genes we identified from ChIPseeker as having nuclesome free regions and test these for functional enrichment using [standard tools like GOseq](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor3.html#7).

[Another approach which is well suited to ATACseq is that implemented in GREAT.](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor3.html#17)

**rGREAT by default will limit our queries and so we may need to be patient. This can be adjusted in rGREAT options.**

---
## Functional Analysis

We can submit our peak calls to GREAT using the **submitGreatJob** function and review available categories of results using **availableCategories**.

```r
library(rGREAT)
```

```
## 
## ------------------
## Note: On Aug 19 2019 GREAT released version 4 where it supports `hg38`
## genome and removes some ontologies such pathways. `submitGreatJob()`
## still takes `hg19` as default. `hg38` can be specified by the `species
## = 'hg38'` argument. To use the older versions such as 3.0.0, specify as
## `submitGreatJob(..., version = '3.0.0')`.
## ------------------
```

```r
great_Job <- submitGreatJob(MacsCalls, species = "hg19")
availableCategories(great_Job)
```

```
## [1] "GO"        "Phenotype" "Genes"
```
---
## Functional Analysis

For this example we select the results tables for the GO category using getEnrichmentTables functions and then review the results for Biological processes.

```r
great_ResultTable = getEnrichmentTables(great_Job, category = "GO")
names(great_ResultTable)
```

```
## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
```

```r
great_ResultTable[["GO Biological Process"]][1:4, ]
```

```
##           ID                                             name
## 1 GO:0060466 activation of meiosis involved in egg activation
## 2 GO:2000438    negative regulation of monocyte extravasation
## 3 GO:2000560 positive regulation of CD24 biosynthetic process
## 4 GO:0061101              neuroendocrine cell differentiation
##   Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
## 1          0.0003871583      0.2307463                         15
## 2          0.0003871583      0.2307463                         15
## 3          0.0003871583      0.2307463                         15
## 4          0.0014154550      0.8436112                         20
##   Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1              65.00645                0.02516779     1.451510e-22
## 2              65.00645                0.02516779     1.451510e-22
## 3              65.00645                0.02516779     1.451510e-22
## 4              23.70760                0.03355705     4.566609e-21
##   Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
## 1  6.360033e-19                 1     0.01967761                        1
## 2  6.360033e-19                 1     0.01967761                        1
## 3  6.360033e-19                 1     0.01967761                        1
## 4  1.500702e-17                 7     0.13774330                        2
##   Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1              50.81918             0.002739726                1.0000000
## 2              50.81918             0.002739726                1.0000000
## 3              50.81918             0.002739726                1.0000000
## 4              14.51977             0.005479452                0.2857143
##   Hyper_Raw_PValue Hyper_Adjp_BH
## 1      0.019677610             1
## 2      0.019677610             1
## 3      0.019677610             1
## 4      0.007595723             1
```

---
class: inverse, center, middle

# Differential ATACseq

---

## Differential ATACseq

We have briefly reviewed the processing and initial analysis of one ATACseq sample using R.

In the next part we will look at how we can identify changes in open regions using R/Bioconductor.

Here we will take an approach akin that in Diffbind and reasonably esatablished in ATACseq analysis.

---
## Identifying non-redundant peaks

First, We will define a set of non-redundant peaks present in at least 2 samples and use these to assess changes in nucleosome free ATACseq signal using DESeq2. [Here we use the same method for deriving consensus peaks for differentials as seen for ChIPseq.](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor4.html#22)

We take peaks across all our samples and reduce them into a single set of non-redudnant peaks. We can then create a matrix of presence/absence of these peaks over each sample.

```r
peaks <- dir("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_Peaks_forCounting/", pattern = "*.narrowPeak",
    full.names = TRUE)
```

```r
myPeaks <- lapply(peaks, ChIPQC:::GetGRanges, simple = TRUE)
allPeaksSet_nR <- reduce(unlist(GRangesList(myPeaks)))
overlap <- list()
for (i in 1:length(myPeaks)) {
    overlap[[i]] <- allPeaksSet_nR %over% myPeaks[[i]]
}
overlapMatrix <- do.call(cbind, overlap)
colnames(overlapMatrix) <- basename(peaks)
mcols(allPeaksSet_nR) <- overlapMatrix
```

---
## Identifying non-redundant peaks

```r
allPeaksSet_nR[1:2, ]
```

```
## GRanges object with 2 ranges and 6 metadata columns:
##       seqnames          ranges strand |
##          <Rle>       <IRanges>  <Rle> |
##   [1]     chr1 3130336-3130413      * |
##   [2]     chr1 3400112-3400250      * |
##       Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##   [1]                                                   FALSE
##   [2]                                                   FALSE
##       Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##   [1]                                                   FALSE
##   [2]                                                   FALSE
##       Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
##                                                  <logical>
##   [1]                                                FALSE
##   [2]                                                FALSE
##       Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
##                                                  <logical>
##   [1]                                                 TRUE
##   [2]                                                 TRUE
##       Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
##                                                 <logical>
##   [1]                                               FALSE
##   [2]                                               FALSE
##       Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
##                                                 <logical>
##   [1]                                               FALSE
##   [2]                                               FALSE
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
```
---
## Identifying non-redundant peaks

We filter out peaks in blacklists and in ChrM prior to testing to eliminate potential artifact differential calls.

```r
blklist <- import.bed("data/ENCFF547MET.bed.gz")
nrToCount <- allPeaksSet_nR[!allPeaksSet_nR %over% blklist & !seqnames(allPeaksSet_nR) %in%
    "chrM"]
nrToCount
```

```
## GRanges object with 89654 ranges and 6 metadata columns:
##           seqnames            ranges strand |
##              <Rle>         <IRanges>  <Rle> |
##       [1]     chr1   3130336-3130413      * |
##       [2]     chr1   3400112-3400250      * |
##       [3]     chr1   3433954-3434095      * |
##       [4]     chr1   3515020-3515102      * |
##       [5]     chr1   3575895-3576078      * |
##       ...      ...               ...    ... .
##   [89650]     chrY 90805058-90805151      * |
##   [89651]     chrY 90807391-90807562      * |
##   [89652]     chrY 90808761-90808841      * |
##   [89653]     chrY 90812961-90813237      * |
##   [89654]     chrY 90829708-90829782      * |
##           Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
##                                                         <logical>
##       [1]                                                   FALSE
##       [2]                                                   FALSE
##       [3]                                                   FALSE
##       [4]                                                   FALSE
##       [5]                                                   FALSE
##       ...                                                     ...
##   [89650]                                                   FALSE
##   [89651]                                                   FALSE
##   [89652]                                                   FALSE
##   [89653]                                                    TRUE
##   [89654]                                                   FALSE
##           Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
##                                                         <logical>
##       [1]                                                   FALSE
##       [2]                                                   FALSE
##       [3]                                                   FALSE
##       [4]                                                   FALSE
##       [5]                                                   FALSE
##       ...                                                     ...
##   [89650]                                                   FALSE
##   [89651]                                                   FALSE
##   [89652]                                                    TRUE
##   [89653]                                                    TRUE
##   [89654]                                                   FALSE
##           Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
##                                                      <logical>
##       [1]                                                FALSE
##       [2]                                                FALSE
##       [3]                                                 TRUE
##       [4]                                                FALSE
##       [5]                                                 TRUE
##       ...                                                  ...
##   [89650]                                                FALSE
##   [89651]                                                 TRUE
##   [89652]                                                FALSE
##   [89653]                                                 TRUE
##   [89654]                                                FALSE
##           Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
##                                                      <logical>
##       [1]                                                 TRUE
##       [2]                                                 TRUE
##       [3]                                                 TRUE
##       [4]                                                 TRUE
##       [5]                                                 TRUE
##       ...                                                  ...
##   [89650]                                                FALSE
##   [89651]                                                FALSE
##   [89652]                                                FALSE
##   [89653]                                                FALSE
##   [89654]                                                 TRUE
##           Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##       [1]                                               FALSE
##       [2]                                               FALSE
##       [3]                                               FALSE
##       [4]                                               FALSE
##       [5]                                               FALSE
##       ...                                                 ...
##   [89650]                                                TRUE
##   [89651]                                               FALSE
##   [89652]                                               FALSE
##   [89653]                                                TRUE
##   [89654]                                               FALSE
##           Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##       [1]                                               FALSE
##       [2]                                               FALSE
##       [3]                                               FALSE
##       [4]                                               FALSE
##       [5]                                               FALSE
##       ...                                                 ...
##   [89650]                                               FALSE
##   [89651]                                                TRUE
##   [89652]                                               FALSE
##   [89653]                                                TRUE
##   [89654]                                               FALSE
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
```

---
## Counting for differential ATACseq

We now identify the number of samples in which our non-redundant peaks occur. Here we use the **rowSums()** function with our occurrence matrix and select those samples occuring in at least 2 samples.

```r
library(Rsubread)
occurrences <- rowSums(as.data.frame(elementMetadata(nrToCount)))

nrToCount <- nrToCount[occurrences >= 2, ]
nrToCount
```

```
## GRanges object with 36320 ranges and 6 metadata columns:
##           seqnames            ranges strand |
##              <Rle>         <IRanges>  <Rle> |
##       [1]     chr1   3433954-3434095      * |
##       [2]     chr1   3575895-3576078      * |
##       [3]     chr1   3671534-3671885      * |
##       [4]     chr1   3994858-3995006      * |
##       [5]     chr1   4560267-4560344      * |
##       ...      ...               ...    ... .
##   [36316]     chrY 90784207-90784298      * |
##   [36317]     chrY 90784944-90785099      * |
##   [36318]     chrY 90801504-90801651      * |
##   [36319]     chrY 90807391-90807562      * |
##   [36320]     chrY 90812961-90813237      * |
##           Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
##                                                         <logical>
##       [1]                                                   FALSE
##       [2]                                                   FALSE
##       [3]                                                   FALSE
##       [4]                                                    TRUE
##       [5]                                                   FALSE
##       ...                                                     ...
##   [36316]                                                   FALSE
##   [36317]                                                    TRUE
##   [36318]                                                    TRUE
##   [36319]                                                   FALSE
##   [36320]                                                    TRUE
##           Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
##                                                         <logical>
##       [1]                                                   FALSE
##       [2]                                                   FALSE
##       [3]                                                   FALSE
##       [4]                                                    TRUE
##       [5]                                                    TRUE
##       ...                                                     ...
##   [36316]                                                   FALSE
##   [36317]                                                    TRUE
##   [36318]                                                   FALSE
##   [36319]                                                   FALSE
##   [36320]                                                    TRUE
##           Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
##                                                      <logical>
##       [1]                                                 TRUE
##       [2]                                                 TRUE
##       [3]                                                 TRUE
##       [4]                                                FALSE
##       [5]                                                 TRUE
##       ...                                                  ...
##   [36316]                                                FALSE
##   [36317]                                                 TRUE
##   [36318]                                                 TRUE
##   [36319]                                                 TRUE
##   [36320]                                                 TRUE
##           Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
##                                                      <logical>
##       [1]                                                 TRUE
##       [2]                                                 TRUE
##       [3]                                                 TRUE
##       [4]                                                FALSE
##       [5]                                                FALSE
##       ...                                                  ...
##   [36316]                                                 TRUE
##   [36317]                                                 TRUE
##   [36318]                                                 TRUE
##   [36319]                                                FALSE
##   [36320]                                                FALSE
##           Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##       [1]                                               FALSE
##       [2]                                               FALSE
##       [3]                                                TRUE
##       [4]                                               FALSE
##       [5]                                               FALSE
##       ...                                                 ...
##   [36316]                                                TRUE
##   [36317]                                                TRUE
##   [36318]                                                TRUE
##   [36319]                                               FALSE
##   [36320]                                                TRUE
##           Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
##                                                     <logical>
##       [1]                                               FALSE
##       [2]                                               FALSE
##       [3]                                               FALSE
##       [4]                                               FALSE
##       [5]                                               FALSE
##       ...                                                 ...
##   [36316]                                               FALSE
##   [36317]                                                TRUE
##   [36318]                                                TRUE
##   [36319]                                                TRUE
##   [36320]                                                TRUE
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
```
---
## Counting for differential ATACseq

Now we have to set of regions to count in we can use **summariseOverlaps()** to count paired reads landing in peaks as we have done for ChIPseq.

We have to adjust our counting for paired-end reads so we additionally set the **singleEnd** parameter to FALSE.

```r
library(GenomicAlignments)
bamsToCount <- dir("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM_forCounting/", full.names = TRUE,
    pattern = "*.\\.bam$")

myCounts <- summarizeOverlaps(consensusToCount, bamsToCount, singleEnd = FALSE)

colnames(myCounts) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2", "Liver_1",
    "Liver_2")
```
---
## DESeq2 for differential ATACseq

With our counts of fragments in nucleosome free regions we can now contruct a DESeq2 object.

We pass the GRanges of regions we count to DESeqDataSetFromMatrix function so as to access these from DESeq2 later.

```r
library(DESeq2)
load("data/myCounts.RData")
Group <- factor(c("HindBrain", "HindBrain", "Kidney", "Kidney", "Liver", "Liver"))
metaData <- data.frame(Group, row.names = colnames(myCounts))

atacDDS <- DESeqDataSetFromMatrix(assay(myCounts), metaData, ~Group, rowRanges = rowRanges(myCounts))
atacDDS <- DESeq(atacDDS)
```

```
## estimating size factors
```

```
## estimating dispersions
```

```
## gene-wise dispersion estimates
```

```
## mean-dispersion relationship
```

```
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
##    function: y = a/x + b, and a local regression fit was automatically substituted.
##    specify fitType='local' or 'mean' to avoid this message next time.
```

```
## final dispersion estimates
```

```
## fitting model and testing
```
---
## DESeq2 for differential ATACseq

With the new DESeq2 object we can now test for any differences in ATACseq signal between groups.

In this example we look at differences between the Hindbrain and Kidney samples.

We return a GRanges object here to allow us to perform some more GenomicRanges operations.

```r
KidneyMinusHindbrain <- results(atacDDS, c("Group", "Kidney", "HindBrain"), format = "GRanges")
KidneyMinusHindbrain <- KidneyMinusHindbrain[order(KidneyMinusHindbrain$pvalue)]
KidneyMinusHindbrain
```

```
## GRanges object with 36320 ranges and 6 metadata columns:
##           seqnames              ranges strand |  baseMean log2FoldChange
##              <Rle>           <IRanges>  <Rle> | <numeric>      <numeric>
##       [1]     chr4   22488379-22488925      * |   96.8199       -3.48476
##       [2]     chr1   24613428-24614006      * |  285.2533        1.46558
##       [3]     chrX 143483004-143483128      * | 1130.4129       -1.22535
##       [4]     chr2 150644059-150644430      * |   36.0214       -4.17235
##       [5]    chr15   66239217-66239619      * |   32.3408       -4.44791
##       ...      ...                 ...    ... .       ...            ...
##   [36316]     chrX   93897236-93897401      * |   2.85026              0
##   [36317]     chrX 100475652-100475834      * |   2.19112              0
##   [36318]     chrX 153334237-153334437      * |   2.26341              0
##   [36319]     chrX 164197795-164198006      * |   4.02077              0
##   [36320]     chrX 169047684-169047857      * |   2.55896              0
##               lfcSE      stat      pvalue        padj
##           <numeric> <numeric>   <numeric>   <numeric>
##       [1]  0.272029 -12.81021 1.43736e-37 5.01798e-33
##       [2]  0.161832   9.05623 1.35040e-19 2.35719e-15
##       [3]  0.148502  -8.25140 1.56553e-16 1.82181e-12
##       [4]  0.521845  -7.99539 1.29168e-15 1.12735e-11
##       [5]  0.583590  -7.62164 2.50474e-14 1.74886e-10
##       ...       ...       ...         ...         ...
##   [36316]   2.36122         0           1           1
##   [36317]   2.56424         0           1          NA
##   [36318]   2.54823         0           1          NA
##   [36319]   2.22715         0           1           1
##   [36320]   2.42940         0           1          NA
##   -------
##   seqinfo: 22 sequences from an unspecified genome; no seqlengths
```

---
## DESeq2 for differential ATACseq

We can subset to only open regions within promoters and then create a HTML table to review the results in IGV using the **makebedtable()** function in tracktables package.

You can see the result [here](https://rockefelleruniversity.github.io/RU_ATACseq/data/KidneyMinusHindbrain.html)

```
## [1] "data//KidneyMinusHindbrain.html"
```

```r
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
toOverLap <- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, 500, 500)
KidneyMinusHindbrain <- KidneyMinusHindbrain[(!is.na(KidneyMinusHindbrain$padj) &
    KidneyMinusHindbrain$padj < 0.05) & KidneyMinusHindbrain %over% toOverLap, ]
myReport <- makebedtable(KidneyMinusHindbrain, "KidneyMinusHindbrain.html", getwd())
browseURL(myReport)
```

---
## Annotation for differential ATACseq

In the final part we can annotate our differential ATACseq regions to genes and then use gene information to test enrichment for GO sets.

Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment

```r
anno_KidneyMinusHindbrain <- annotatePeak(KidneyMinusHindbrain, TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
    verbose = FALSE)
DB_ATAC <- as.data.frame(anno_KidneyMinusHindbrain)
DB_ATAC[1, ]
```

```
##   seqnames    start      end width strand baseMean log2FoldChange     lfcSE
## 1     chr4 22488379 22488925   547      * 96.81992      -3.484755 0.2720294
##        stat       pvalue         padj       annotation geneChr geneStart
## 1 -12.81021 1.437363e-37 5.017979e-33 Promoter (<=1kb)       4  22482780
##    geneEnd geneLength geneStrand geneId         transcriptId distanceToTSS
## 1 22488366       5587          2  18992 ENSMUST00000178174.2           -13
```

---
## Annotation for differential ATACseq

Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment. For alternatives have a look at our [ChIPseq course](https://rockefelleruniversity.github.io/RU_ChIPseq/presentations/slides/ChIPseq_In_Bioconductor3.html#7)

```r
library(clusterProfiler)
go <- enrichGO(DB_ATAC$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)
go[1:2, 1:6]
```

```
##                    ID                                          Description
## GO:0120036 GO:0120036 plasma membrane bounded cell projection organization
## GO:0030030 GO:0030030                         cell projection organization
##            GeneRatio    BgRatio       pvalue     p.adjust
## GO:0120036    94/511 1609/23355 1.279533e-18 5.672171e-15
## GO:0030030    95/511 1656/23355 2.713130e-18 6.013653e-15
```

---
class: inverse, center, middle

# Finding Motifs in ATACseq data

---

## Cutting sites from ATACseq data

ATACseq should generate shorter fragments (our nucleosome free regions) around smaller *protected* areas such as transcription factor binding sites.

We can therefore look for the pile-up of cut-sites around motifs of interest within different tissues/celltypes/samples.

To produce cut-sites from our BAM file we first resize our reads to 1bp and make the shift of 4/-5 bp depending on strand to adjust for expected shift from insertion of Tn5 transposase.

Here we will identify CTCF motifs passing an arbitary cut-off and then use soGGi to plot cut-sites around them. We will jump back to our Greenleaf dataset to do this.

---
## Finding motifs

We need to identify the position of CTCF motifs across the genome so first we need to know what a CTCF motif looks like.

The motifDB package contains information on Motifs from public databases such as JASPAR. Here we use the **query()** function with our motif of interest (**CTCF**) to extract the CTCF motif.

```r
library(MotifDb)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
CTCF <- query(MotifDb, c("CTCF"))
CTCF
```

```
## MotifDb object of length 19
## | Created from downloaded public sources: 2013-Aug-30
## | 19 position frequency matrices from 9 sources:
## |        HOCOMOCOv10:    3
## | HOCOMOCOv11-core-A:    2
## |              HOMER:    3
## |        JASPAR_2014:    2
## |        JASPAR_CORE:    1
## |         jaspar2016:    2
## |         jaspar2018:    3
## |          jolma2013:    1
## |       SwissRegulon:    2
## | 4 organism/s
## |           Hsapiens:   12
## |      Dmelanogaster:    3
## |          Mmusculus:    1
## |              other:    3
## Hsapiens-HOCOMOCOv10-CTCFL_HUMAN.H10MO.A 
## Hsapiens-HOCOMOCOv10-CTCF_HUMAN.H10MO.A 
## Mmusculus-HOCOMOCOv10-CTCF_MOUSE.H10MO.A 
## Hsapiens-HOCOMOCOv11-core-A-CTCFL_HUMAN.H11MO.0.A 
## Hsapiens-HOCOMOCOv11-core-A-CTCF_HUMAN.H11MO.0.A 
## ...
## Dmelanogaster-jaspar2018-CTCF-MA0531.1 
## Hsapiens-jaspar2018-CTCFL-MA1102.1 
## Hsapiens-jolma2013-CTCF 
## Hsapiens-SwissRegulon-CTCFL.SwissRegulon 
## Hsapiens-SwissRegulon-CTCF.SwissRegulon
```
---
## Finding motifs

We can extract a point weight matrix for CTCF which specifies the likelihood of a DNA base occurring in a CTCF motif. Here we extract the motif for CTCF from Human JASPAR Core database.

```r
names(CTCF)
```

```
##  [1] "Hsapiens-HOCOMOCOv10-CTCFL_HUMAN.H10MO.A"                                   
##  [2] "Hsapiens-HOCOMOCOv10-CTCF_HUMAN.H10MO.A"                                    
##  [3] "Mmusculus-HOCOMOCOv10-CTCF_MOUSE.H10MO.A"                                   
##  [4] "Hsapiens-HOCOMOCOv11-core-A-CTCFL_HUMAN.H11MO.0.A"                          
##  [5] "Hsapiens-HOCOMOCOv11-core-A-CTCF_HUMAN.H11MO.0.A"                           
##  [6] "NA-HOMER-BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer"                     
##  [7] "NA-HOMER-CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer"                  
##  [8] "NA-HOMER-CTCF-SatelliteElement(Zf?)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer"
##  [9] "Hsapiens-JASPAR_CORE-CTCF-MA0139.1"                                         
## [10] "Hsapiens-JASPAR_2014-CTCF-MA0139.1"                                         
## [11] "Dmelanogaster-JASPAR_2014-CTCF-MA0531.1"                                    
## [12] "Hsapiens-jaspar2016-CTCF-MA0139.1"                                          
## [13] "Dmelanogaster-jaspar2016-CTCF-MA0531.1"                                     
## [14] "Hsapiens-jaspar2018-CTCF-MA0139.1"                                          
## [15] "Dmelanogaster-jaspar2018-CTCF-MA0531.1"                                     
## [16] "Hsapiens-jaspar2018-CTCFL-MA1102.1"                                         
## [17] "Hsapiens-jolma2013-CTCF"                                                    
## [18] "Hsapiens-SwissRegulon-CTCFL.SwissRegulon"                                   
## [19] "Hsapiens-SwissRegulon-CTCF.SwissRegulon"
```

```r
ctcfMotif <- CTCF[[1]]
ctcfMotif[, 1:4]
```

```
##       1     2     3     4
## A 0.100 0.162 0.302 0.072
## C 0.356 0.214 0.100 0.826
## G 0.122 0.406 0.436 0.050
## T 0.422 0.218 0.162 0.052
```
---
## Visualising PWMs

We can visualise the frequency of DNA bases in our motif using the **seqLogo** package and the **seqLogo** function.

```r
library(seqLogo)
```

```
## Loading required package: grid
```

```
## 
## Attaching package: 'grid'
```

```
## The following object is masked from 'package:Biostrings':
## 
##     pattern
```

```r
seqLogo(ctcfMotif)
```

![](RU_ATAC_part2_files/figure-html/processData_motifCutsc-1.png)
---
## Searching for PWMs in DNAstring

We can now use the **matchPWM()** function with our newly acquired CTCF PWM.

Here we will search the sequence on Chr20 using the sequence provided within the BSgenome library for human **BSgenome.Hsapiens.UCSC.hg19**.

The result is a **Views** object, similar to the IRanges object. We convert

```r
myRes <- matchPWM(ctcfMotif, BSgenome.Hsapiens.UCSC.hg19[["chr20"]])
myRes
```

```
## Views on a 63025520-letter DNAString subject
## subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## views:
##             start      end width
##      [1]    60545    60563    19 [GGACCACGAGGGGCCAGAG]
##      [2]   216708   216726    19 [TGGGCTCTAGAGGGCAGGG]
##      [3]   239773   239791    19 [TTGCCACTGGGGGGAGACA]
##      [4]   246577   246595    19 [CAGCCCCCAGGTGGCAGCC]
##      [5]   335282   335300    19 [GTACCACTAGAGGGAGCTC]
##      ...      ...      ...   ... ...
##   [2327] 62917548 62917566    19 [TCGCCAGCAGGGGGTGCAC]
##   [2328] 62917607 62917625    19 [CTGCAAGCAGGGGGCGGTC]
##   [2329] 62922628 62922646    19 [GCACCAGAAGGTGGCAGGA]
##   [2330] 62922894 62922912    19 [GATCCAGCAAGGGGCGGCA]
##   [2331] 62961365 62961383    19 [CCACCACCACGTGGTGCCA]
```

---
  
## Searching for PWMs in DNAstring

We need to convert the Views object to a GRanges so we can use these in soGGi to plot cut sites over.

```r
toCompare <- GRanges("chr20", ranges(myRes))
toCompare
```

```
## GRanges object with 2331 ranges and 0 metadata columns:
##          seqnames            ranges strand
##             <Rle>         <IRanges>  <Rle>
##      [1]    chr20       60545-60563      *
##      [2]    chr20     216708-216726      *
##      [3]    chr20     239773-239791      *
##      [4]    chr20     246577-246595      *
##      [5]    chr20     335282-335300      *
##      ...      ...               ...    ...
##   [2327]    chr20 62917548-62917566      *
##   [2328]    chr20 62917607-62917625      *
##   [2329]    chr20 62922628-62922646      *
##   [2330]    chr20 62922894-62922912      *
##   [2331]    chr20 62961365-62961383      *
##   -------
##   seqinfo: 1 sequence from an unspecified genome; no seqlengths
```

---
## Shifting reads for cut-sites

To plot cut-sites we will wish to consider only the 5' end of reads and will need to adjust for a known offset of 5' reads to actual T5 cut-sites.

This will involve capturing the 5'end of reads and shifting reads on positive and negative strand by 4bp or -5bp respectively.

First we read in our nucleosome free region BAM file and extract read pairs.

```r
BAM <- "~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam"
atacReads_Open <- readGAlignmentPairs(BAM)
read1 <- first(atacReads_Open)
read2 <- second(atacReads_Open)
read2[1, ]
```

```
## GAlignments object with 1 alignment and 0 metadata columns:
##       seqnames strand       cigar    qwidth     start       end     width
##          <Rle>  <Rle> <character> <integer> <integer> <integer> <integer>
##   [1]    chr20      -         50M        50     60446     60495        50
##           njunc
##       <integer>
##   [1]         0
##   -------
##   seqinfo: 25 sequences from an unspecified genome
```

---
  
## Shifting reads for cut-sites

Now we can shift the 5' end of both reads pairs by 4bp or -5bp depending on strand. This produces a GRanges of all our cut-sites from both reads.

```r
Firsts <- resize(granges(read1), fix = "start", 1)
First_Pos_toCut <- shift(granges(Firsts[strand(read1) == "+"]), 4)
First_Neg_toCut <- shift(granges(Firsts[strand(read1) == "-"]), -5)

Seconds <- resize(granges(read2), fix = "start", 1)
Second_Pos_toCut <- shift(granges(Seconds[strand(read2) == "+"]), 4)
Second_Neg_toCut <- shift(granges(Seconds[strand(read2) == "-"]), -5)

test_toCut <- c(First_Pos_toCut, First_Neg_toCut, Second_Pos_toCut, Second_Neg_toCut)
test_toCut[1:2, ]
```

```
## GRanges object with 2 ranges and 0 metadata columns:
##       seqnames    ranges strand
##          <Rle> <IRanges>  <Rle>
##   [1]    chr20     60440      +
##   [2]    chr20     61905      +
##   -------
##   seqinfo: 25 sequences from an unspecified genome
```
---
  
## Coverage for cut-sites

Now we can use the GRanges of cut-site positions to produce an RLElist of cut-sites across the genome using the **coverage()** function.

```r
cutsCoverage <- coverage(test_toCut)
cutsCoverage20 <- cutsCoverage["chr20"]
cutsCoverage20[[1]]
```

```
## integer-Rle of length 63025520 with 215635 runs
##   Lengths: 60439     1    49     1  1414 ...  5268     1    42     1 60252
##   Values :     0     1     0     1     0 ...     0     1     0     1     0
```
---
  
## Plotting for cut-sites

We can use an RLElist with soGGi to produce a plot of cut-sites around our discovered CTCF motifs.

We change the **format** to rlelist and the **distanceAround** parameter to 500bp.

```r
CTCF_Cuts_open <- regionPlot(cutsCoverage20, testRanges = toCompare, style = "point",
    format = "rlelist", distanceAround = 500)
```

---
  
## Plotting for cut-sites

Now we can now plot our cut-sites using the **plotRegion()** function.

```r
plotRegion(CTCF_Cuts_open, outliers = 0.001) + ggtitle("NucFree Cuts Centred on CTCF") +
    theme_bw()
```

![](RU_ATAC_part2_files/figure-html/processData_motifCutsi-1.png)

---
## Time for an exercise!

Exercise on ATACseq data can be found [here](../../exercises/exercises/ATACseq_part2_exercise.html)

---
## Answers to exercise

Answers can be found [here](../../exercises/answers/ATACseq_part2_answers.html)