In this section we will work a little more with the Greenleaf dataset.
To remind you, this is the original ATACseq dataset and we are working with a single sample. More details are here.
We will characterize the Greenleaf data signal around TSS regions and identify and annotate open regions using peak calls with MACS2.
At the end of the session we will plot transposase cut-sites around DNA binding proteins.
In this session we will also start working with a larger dataset from the ENCODE consortium generated by Bing Ren at UCSD. This dataset has multiple samples for multiple tissues (Liver, Kidney and Hindbrain), allowing us to perform a differential analysis on open regions.
Links to data and sample information are included in list below.
Liver day 12 - ENCSR302LIV
Kidney day 15 - ENCSR023QZX
Hindbrain day 12 - ENCSR088UYE
We start with public sequencing data in links below and use reference data in Bioconductor.
Since some of these processing steps may take a little time I provide links to pre-processed results.
Essentials
BAM file and BAI index from our alignment/sorting/indexing.
SAMN02192806 - Greenleaf BAM - Full BAM file for Greenleaf example produced following in our Rsubread alignment, sorting and indexing.
SAMN02192806 - Greenleaf BAI index - BAI index file for BAM in Greenleaf example produced following in our alignment, sorting and indexing.
Essentials
Small BAM, peak calls and directory structure.
Once you have downloaded the above and unzipped ATAC_Workshop.zip, you should move the Sorted_ATAC_50K_2.bam and Sorted_ATAC_50K_2.bam.bai file into ATAC_Workshop/ATAC_Data/ATAC_BAM/
Not essential
Same as above but with BAMs for counting as well as small BAM, peak calls and directory structure.
Bigwigs for IGV.
This means our data potentially contains a mixture of signal types in our data.
All our data is from open chromatin where our transposase has been able to access.
If our shorter fragments represent the open regions around transcription factors and transcriptional machinery we would expect to see signal at transcriptional start sites.
Our longer fragments will represent signal around nucleosomes and so signal should be outside of the transcriptional start sites and more present at the +1 and -1 nucleosome positions.
We can create a meta-plot over all TSS regions to illustrate where our nucleosome free and nucleosome occupied fractions of signal are most prevalent.
Meta-plots average or sum signal over sets of regions to identify trends in data.
To produce meta-plots of signal over regions we can use the soGGi bioconductor package. We can load in soGGi with the BiocManager::install and library function, as we have done before.
::install("soGGi")
BiocManagerlibrary(soGGi)
The soGGi library simply requires a BAM file and a GRanges of regions over which to average signal to produce the meta-plot.
We wish to plot over TSS regions and so we first will need to produce a GRanges of TSS locations for hg19 genome.
Thankfully we now know how to extract these regions for all genes using the TxDB packages and some GenomicRanges functions.
First we can load our TxDb of interest - TxDb.Hsapiens.UCSC.hg19.knownGene.
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
TxDb.Hsapiens.UCSC.hg19.knownGene
## TxDb object:
## # Db type: TxDb
## # Supporting package: GenomicFeatures
## # Data source: UCSC
## # Genome: hg19
## # Organism: Homo sapiens
## # Taxonomy ID: 9606
## # UCSC Table: knownGene
## # Resource URL: http://genome.ucsc.edu/
## # Type of Gene ID: Entrez Gene ID
## # Full dataset: yes
## # miRBase build ID: GRCh37
## # transcript_nrow: 82960
## # exon_nrow: 289969
## # cds_nrow: 237533
## # Db created by: GenomicFeatures package from Bioconductor
## # Creation time: 2015-10-07 18:11:28 +0000 (Wed, 07 Oct 2015)
## # GenomicFeatures version at creation time: 1.21.30
## # RSQLite version at creation time: 1.0.0
## # DBSCHEMAVERSION: 1.1
We can extract gene locations (TSS to TTS) using the genes() function and our TxDb object.
<- genes(TxDb.Hsapiens.UCSC.hg19.knownGene) genesLocations
## 403 genes were dropped because they have exons located on both strands
## of the same reference sequence or on more than one reference sequence,
## so cannot be represented by a single genomic range.
## Use 'single.strand.genes.only=FALSE' to get all the genes in a
## GRangesList object, or use suppressMessages() to suppress this message.
genesLocations
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58858172-58874214 - | 1
## 10 chr8 18248755-18258723 + | 10
## 100 chr20 43248163-43280376 - | 100
## 1000 chr18 25530930-25757445 - | 1000
## 10000 chr1 243651535-244006886 - | 10000
## ... ... ... ... . ...
## 9991 chr9 114979995-115095944 - | 9991
## 9992 chr21 35736323-35743440 + | 9992
## 9993 chr22 19023795-19109967 - | 9993
## 9994 chr6 90539619-90584155 + | 9994
## 9997 chr22 50961997-50964905 - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
We can then use the resize() function to extract the location of start of every gene (the TSSs) in a stranded manner.
Here we set the fix position as the start and the width as 1.
<- resize(genesLocations, fix = "start", width = 1)
tssLocations tssLocations
## GRanges object with 23056 ranges and 1 metadata column:
## seqnames ranges strand | gene_id
## <Rle> <IRanges> <Rle> | <character>
## 1 chr19 58874214 - | 1
## 10 chr8 18248755 + | 10
## 100 chr20 43280376 - | 100
## 1000 chr18 25757445 - | 1000
## 10000 chr1 244006886 - | 10000
## ... ... ... ... . ...
## 9991 chr9 115095944 - | 9991
## 9992 chr21 35736323 + | 9992
## 9993 chr22 19109967 - | 9993
## 9994 chr6 90539619 + | 9994
## 9997 chr22 50964905 - | 9997
## -------
## seqinfo: 93 sequences (1 circular) from hg19 genome
When we created our index we subset the genome to the main chromosomes. We can do this again with our TSS GRange object, and update the levels. This means the BAM and GRanges will play nicely.
<- paste0("chr", c(1:21, "X", "Y", "M"))
mainChromosomes
<- (match(seqnames(tssLocations), mainChromosomes))
myindex
<- tssLocations[as.numeric(myindex)]
tssLocations
seqlevels(tssLocations) <- mainChromosomes
The soGGi package’s regionPlot() function requires a BAM file of data to plot supplied to bamFile parameter and a GRanges to plot over supplied to testRanges argument.
library(soGGi)
<- "~/Downloads/ATAC_Workshop/Sorted_ATAC_50K_2.bam"
sortedBAM
library(Rsamtools)
# Nucleosome free
<- regionPlot(bamFile = sortedBAM, testRanges = tssLocations) allSignal
Additionally we supply information on input file format to format parameter, whether data is paired to paired parameter and type of plot to style parameter. We explore visualization options visualization training.
A useful feature is that we can we can specify the minimum and maximum fragment lengths of paired reads to be used in our plot with the minFragmentLength and maxFragmentLength parameters. This allows us to select only our nucleosome free signal (< 100 base-pairs) to produce our metaplot over TSS regions.
<- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point",
nucFree format = "bam", paired = TRUE, minFragmentLength = 0, maxFragmentLength = 100,
forceFragment = 50)
class(nucFree)
Now we have our profile object we can create our metaplot using the plotRegion() function in soGGi.
Here we see the expected peak of signal for our nucleosome free region in the region over the TSS.
plotRegion(nucFree)
We can create a plot for our mono-nucleosome signal by adjusting our minFragmentLength and maxFragmentLength parameters to those expected for nucleosome length fragments (here 180 to 240).
<- regionPlot(bamFile = sortedBAM, testRanges = tssLocations, style = "point",
monoNuc format = "bam", paired = TRUE, minFragmentLength = 180, maxFragmentLength = 240,
forceFragment = 80)
Similarly we can plot the mono-nucleosome signal over TSS locations using plotRegion() function.
In this plot we can clearly see the expected +1 nucleosome signal peak as well as several other nucleosome signalpeaks
plotRegion(monoNuc)
A common goal in ATACseq is to identify nucleosome free regions where transcription factors are binding and/or transcriptional machinery is active. This nucleosome free signal would correspond to fragments less than one nucleosome (as defined in Greenleaf paper < 100bp)
To identify open chromatin however we could simply use all reads which have been properly paired in sequencing (< 2000bp).
For the remainder of the workshop we will look at analysing the nucleosome free portions of the ATACseq data.
There are many methods available to call nucleosome free regions from ATACseq data with many borrowed from ChIPseq analysis. One very popular and standard peak caller for ATACseq is MAC2.
There is no R package for MACS2 (well it just got released but we haven’t tested it yet). It is possible to install it on Mac and Linux using the Anaconda package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them.
Though there is no MACS2 R package, we can interact with the anaconda package system using the R package Herper. This was created by us here at the BRC and is part of Bioconductor. We cover its installation and use for ChIPseq here
::install("Herper")
BiocManagerlibrary(Herper)
First, we will use Herper to install Salmon with the install_CondaTools function. We just need to tell install_CondaTools what tool/s you want and the name of the environment you want to build.
<- install_CondaTools(tools = "macs2", env = "ATACseq_analysis")
salmon_paths salmon_paths
Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which MACS2 will be installed. When you run the function it prints out where MACS2 is installed. There are additional arguments to control where miniconda is installed using pathToMiniConda and also update an existing environment with updateEnv.
With single-end sequencing from ATACseq we do not know how long the fragments are. To identify open regions therefore requires some different parameters for MACS2 peak calling compared to ChIPseq.
One strategy employed is to shift read 5’ ends by -100 and then extend from this by 200bp. Considering the expected size of our nucleosome free fragments this should provide a pile-up over nucelosome regions suitable for MACS2 window size.
This is the MACS system call we would use to do this.
MACS2 callpeak -t singleEnd.bam --nomodel --shift -100
--extsize 200 --format BAM -g MyGenome
In R we can run this system call using Herper so we can access our installed MACS2. MACS2 has been installed into ATACseq_analysis. So we can use this environment from R using with_CondaEnv().
with_CondaEnv("ATACseq_analysis",
system2(command="macs2",args =c("callpeak",
"-t", "singleEnd.bam",
"--nomodel",
"--shift","-100",
"--extsize", "200",
"--format", "BAM",
"-g", "hs")),
stdout = TRUE))
Alternatively for the nucleosome occupied data we can adjust shift and extension to centre the signal on nucleosome centres (nucleosomes wrapped in 147bp of DNA).
MACS2 callpeak -t singleEnd.bam --nomodel --shift 37
--extsize 73 --format BAM -g MyGenome
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
"-t", "singleEnd.bam", "--nomodel", "--shift", "37", "--extsize", "73", "--format",
"BAM", "-g", "hs")), stdout = TRUE)
If we have sequenced paired-end data then we do know the fragment lengths and can provide BAM files to MACS2 which have been prefiltered to properly paired (and fragment size if you want to distinguish nucleosomes from nucleosome free regions)
We have to simply tell MACS2 that the data is paired using the format argument. This is important as MACS2 will guess it is single-end BAM by default.
MACS2 callpeak -t pairedEnd.bam -f BAMPE
--outdir path/to/output/
--name pairedEndPeakName -g MyGenome
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
"-t", "pairedEnd.bam", "--format", "BAMPE", "--outdir", "/Documents/ATAC_MACS2_calls/",
"--name", "pairedEndPeakName", "-g", "hs")), stdout = TRUE)
For our paired-end data here, we call peaks on our nucleosome free regions from our nucleosome free region BAM file.
MACS2 callpeak -t ~/Downloads/Sorted_ATAC_50K_2_openRegions.bam
--outdir ATAC_Data/ATAC_Peaks/ATAC_50K_2
--name Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak
-f BAMPE -g hs
with_CondaEnv("ATACseq_analysis", system2(command = "macs2", args = c("callpeak",
"-t", "~/Downloads/Sorted_ATAC_50K_2_openRegions.bam", "--outdir", "ATAC_Data/ATAC_Peaks/ATAC_50K_2",
"--name", "Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak", "-f", "BAMPE", "-g",
"hs")), stdout = TRUE)
Following peak calling we would get 3 files.
Name.narrowPeak – Narrow peak format suitable for IGV and further analysis
Name_peaks.xls – Peak table suitable for review in excel.(not actually a xls but a tsv)
summits.bed – Summit positions for peaks useful for finding motifs and plotting
Often we want to do some QC to check for low quality, duplicates and signal distribution.
Before we remove any data we can get a quick assessment of our reads in peaks, duplication rate, low quality reads and reads in artifact regions from ChIPQC.
To do this we will use the hg19 blacklist which you can download here
library(ChIPQC)
library(rtracklayer)
library(DT)
library(dplyr)
library(tidyr)
<- import.bed("~/Downloads/ENCFF001TDO.bed.gz")
blkList <- "~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_Peaks/Sorted_ATAC_50K_2_Small_Paired_peaks.narrowPeak"
openRegionPeaks
<- ChIPQCsample("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam",
qcRes peaks = openRegionPeaks, annotation = "hg19", chromosomes = "chr20", blacklist = blkList,
verboseT = FALSE)
We can use the ChIPQC package to capture some important metrics for our ATACseq data such as reads in peaks and reads in blacklist from the QCmetrics() function and numbers of duplicated reads from the flagtagcounts() funcion.
<- QCmetrics(qcRes)
myMetrics c("RiBL%", "RiP%")] myMetrics[
## RiBL% RiP%
## 0.342 12.300
<- flagtagcounts(qcRes)
flgCounts <- flgCounts["DuplicateByChIPQC"]/flgCounts["Mapped"]
DupRate * 100 DupRate
## DuplicateByChIPQC
## 22.90539
Artifacts from sequencing and imperfect genome builds can confound our results. These artifacts have been collated together into “Blacklists” of regions. We discuss these regions in more detail here
Since blacklisted regions may confound our analysis we remove any peaks which have been called there.
Removing blacklists too early can hide some of the QC issues in your data. The blacklist should always be considered in your analysis and recommended to removed data from these regions once QC is considered.
<- granges(qcRes[seqnames(qcRes) %in% "chr20"])
MacsCalls
data.frame(Blacklisted = sum(MacsCalls %over% blkList), Not_Blacklisted = sum(!MacsCalls %over%
blkList))
<- MacsCalls[!MacsCalls %over% blkList] MacsCalls
It is often of interest to associate identified nucleosome free regions to genomic features such as genes and enhancers.
Once annotated to genes or enhancers’ genes, we can start to associate ATACseq data to characteristics of these genes. (functional annotation, expression changes, other epigenetic states)
A simple approach to annotating nucleosome free region to genes is to associate regions to their closest gene or within a window around a genes transcriptional start site.
We can use the chipseeker library to identify genes closest to our regions and to give us simple summaries and visualizations of this annotation.
We use the gene models from TxDb.Hsapiens.UCSC.hg19.knownGene and supply this to ChIPseeker packages annotatePeak function.
ChIPseeker’s csAnno object will then show breakdown of percentages of peaks in genomic regions.
library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
<- annotatePeak(MacsCalls, TxDb = TxDb.Hsapiens.UCSC.hg19.knownGene) MacsCalls_Anno
## >> preparing features information... 2021-07-28 05:39:40 PM
## >> identifying nearest features... 2021-07-28 05:39:40 PM
## >> calculating distance from peak to TSS... 2021-07-28 05:39:40 PM
## >> assigning genomic annotation... 2021-07-28 05:39:40 PM
## >> assigning chromosome lengths 2021-07-28 05:39:52 PM
## >> done... 2021-07-28 05:39:52 PM
MacsCalls_Anno
## Annotated peaks generated by ChIPseeker
## 596/596 peaks were annotated
## Genomic Annotation Summary:
## Feature Frequency
## 8 Promoter (<=1kb) 40.2684564
## 9 Promoter (1-2kb) 2.0134228
## 10 Promoter (2-3kb) 2.1812081
## 4 5' UTR 0.1677852
## 3 3' UTR 1.0067114
## 1 1st Exon 0.1677852
## 6 Other Exon 1.5100671
## 2 1st Intron 6.0402685
## 7 Other Intron 19.9664430
## 5 Distal Intergenic 26.6778523
As well as showing us a table of the annotation distribution we can visualize this using the plotAnnoPie and plotAnnoBar functions.
plotAnnoPie(MacsCalls_Anno)
With this information we can then subset our peaks/nuc free regions to those only landing in TSS regions (+/- 500)
<- as.GRanges(MacsCalls_Anno)
MacsGR_Anno <- MacsGR_Anno[abs(MacsGR_Anno$distanceToTSS) < 500]
MacsGR_TSS 1, ] MacsGR_TSS[
## GRanges object with 1 range and 9 metadata columns:
## seqnames ranges strand | annotation geneChr geneStart
## <Rle> <IRanges> <Rle> | <character> <integer> <integer>
## [1] chr20 278001-278160 * | Promoter (<=1kb) 20 278204
## geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
## <integer> <integer> <integer> <character> <character> <numeric>
## [1] 280963 2760 1 85364 uc002wdf.3 -44
## -------
## seqinfo: 1 sequence from hg19 genome
Another common step to ATACseq analysis is to identify any functional enrichment in genes associated to nucleosome free regions.
One approach is to take the genes we identified from ChIPseeker as having nuclesome free regions and test these for functional enrichment using standard tools like GOseq.
Another approach which is well suited to ATACseq is that implemented in GREAT.
rGREAT by default will limit our queries and so we may need to be patient. This can be adjusted in rGREAT options.
We can submit our peak calls to GREAT using the submitGreatJob function and review available categories of results using availableCategories.
library(rGREAT)
<- submitGreatJob(MacsCalls, species = "hg19")
great_Job availableCategories(great_Job)
## [1] "GO" "Phenotype" "Genes"
For this example we select the results tables for the GO category using getEnrichmentTables functions and then review the results for Biological processes.
= getEnrichmentTables(great_Job, category = "GO")
great_ResultTable names(great_ResultTable)
## [1] "GO Molecular Function" "GO Biological Process" "GO Cellular Component"
"GO Biological Process"]][1:4, ] great_ResultTable[[
## ID name
## 1 GO:0060466 activation of meiosis involved in egg activation
## 2 GO:2000438 negative regulation of monocyte extravasation
## 3 GO:2000560 positive regulation of CD24 biosynthetic process
## 4 GO:0061101 neuroendocrine cell differentiation
## Binom_Genome_Fraction Binom_Expected Binom_Observed_Region_Hits
## 1 0.0003871583 0.2307463 15
## 2 0.0003871583 0.2307463 15
## 3 0.0003871583 0.2307463 15
## 4 0.0014154550 0.8436112 20
## Binom_Fold_Enrichment Binom_Region_Set_Coverage Binom_Raw_PValue
## 1 65.00645 0.02516779 1.451510e-22
## 2 65.00645 0.02516779 1.451510e-22
## 3 65.00645 0.02516779 1.451510e-22
## 4 23.70760 0.03355705 4.566609e-21
## Binom_Adjp_BH Hyper_Total_Genes Hyper_Expected Hyper_Observed_Gene_Hits
## 1 6.360033e-19 1 0.01967761 1
## 2 6.360033e-19 1 0.01967761 1
## 3 6.360033e-19 1 0.01967761 1
## 4 1.500702e-17 7 0.13774330 2
## Hyper_Fold_Enrichment Hyper_Gene_Set_Coverage Hyper_Term_Gene_Coverage
## 1 50.81918 0.002739726 1.0000000
## 2 50.81918 0.002739726 1.0000000
## 3 50.81918 0.002739726 1.0000000
## 4 14.51977 0.005479452 0.2857143
## Hyper_Raw_PValue Hyper_Adjp_BH
## 1 0.019677610 1
## 2 0.019677610 1
## 3 0.019677610 1
## 4 0.007595723 1
We have briefly reviewed the processing and initial analysis of one ATACseq sample using R.
In the next part we will look at how we can identify changes in open regions using R/Bioconductor.
Here we will take an approach akin that in Diffbind and reasonably esatablished in ATACseq analysis.
First, We will define a set of non-redundant peaks present in at least 2 samples and use these to assess changes in nucleosome free ATACseq signal using DESeq2. Here we use the same method for deriving consensus peaks for differentials as seen for ChIPseq.
We take peaks across all our samples and reduce them into a single set of non-redudnant peaks. We can then create a matrix of presence/absence of these peaks over each sample.
<- dir("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_Peaks_forCounting/", pattern = "*.narrowPeak",
peaks full.names = TRUE)
<- lapply(peaks, ChIPQC:::GetGRanges, simple = TRUE)
myPeaks <- reduce(unlist(GRangesList(myPeaks)))
allPeaksSet_nR <- list()
overlap for (i in 1:length(myPeaks)) {
<- allPeaksSet_nR %over% myPeaks[[i]]
overlap[[i]]
}<- do.call(cbind, overlap)
overlapMatrix colnames(overlapMatrix) <- basename(peaks)
mcols(allPeaksSet_nR) <- overlapMatrix
1:2, ] allPeaksSet_nR[
## GRanges object with 2 ranges and 6 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 3130336-3130413 * |
## [2] chr1 3400112-3400250 * |
## Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] TRUE
## [2] TRUE
## Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
We filter out peaks in blacklists and in ChrM prior to testing to eliminate potential artifact differential calls.
<- import.bed("data/ENCFF547MET.bed.gz")
blklist <- allPeaksSet_nR[!allPeaksSet_nR %over% blklist & !seqnames(allPeaksSet_nR) %in%
nrToCount "chrM"]
nrToCount
## GRanges object with 89654 ranges and 6 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 3130336-3130413 * |
## [2] chr1 3400112-3400250 * |
## [3] chr1 3433954-3434095 * |
## [4] chr1 3515020-3515102 * |
## [5] chr1 3575895-3576078 * |
## ... ... ... ... .
## [89650] chrY 90805058-90805151 * |
## [89651] chrY 90807391-90807562 * |
## [89652] chrY 90808761-90808841 * |
## [89653] chrY 90812961-90813237 * |
## [89654] chrY 90829708-90829782 * |
## Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [89650] FALSE
## [89651] FALSE
## [89652] FALSE
## [89653] TRUE
## [89654] FALSE
## Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [89650] FALSE
## [89651] FALSE
## [89652] TRUE
## [89653] TRUE
## [89654] FALSE
## Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] TRUE
## [4] FALSE
## [5] TRUE
## ... ...
## [89650] FALSE
## [89651] TRUE
## [89652] FALSE
## [89653] TRUE
## [89654] FALSE
## Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] TRUE
## [2] TRUE
## [3] TRUE
## [4] TRUE
## [5] TRUE
## ... ...
## [89650] FALSE
## [89651] FALSE
## [89652] FALSE
## [89653] FALSE
## [89654] TRUE
## Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [89650] TRUE
## [89651] FALSE
## [89652] FALSE
## [89653] TRUE
## [89654] FALSE
## Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [89650] FALSE
## [89651] TRUE
## [89652] FALSE
## [89653] TRUE
## [89654] FALSE
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
We now identify the number of samples in which our non-redundant peaks occur. Here we use the rowSums() function with our occurrence matrix and select those samples occuring in at least 2 samples.
library(Rsubread)
<- rowSums(as.data.frame(elementMetadata(nrToCount)))
occurrences
<- nrToCount[occurrences >= 2, ]
nrToCount nrToCount
## GRanges object with 36320 ranges and 6 metadata columns:
## seqnames ranges strand |
## <Rle> <IRanges> <Rle> |
## [1] chr1 3433954-3434095 * |
## [2] chr1 3575895-3576078 * |
## [3] chr1 3671534-3671885 * |
## [4] chr1 3994858-3995006 * |
## [5] chr1 4560267-4560344 * |
## ... ... ... ... .
## [36316] chrY 90784207-90784298 * |
## [36317] chrY 90784944-90785099 * |
## [36318] chrY 90801504-90801651 * |
## [36319] chrY 90807391-90807562 * |
## [36320] chrY 90812961-90813237 * |
## Sorted_Hindbrain_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] TRUE
## [5] FALSE
## ... ...
## [36316] FALSE
## [36317] TRUE
## [36318] TRUE
## [36319] FALSE
## [36320] TRUE
## Sorted_Hindbrain_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] TRUE
## [5] TRUE
## ... ...
## [36316] FALSE
## [36317] TRUE
## [36318] FALSE
## [36319] FALSE
## [36320] TRUE
## Sorted_Kidney_day_15_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] TRUE
## [2] TRUE
## [3] TRUE
## [4] FALSE
## [5] TRUE
## ... ...
## [36316] FALSE
## [36317] TRUE
## [36318] TRUE
## [36319] TRUE
## [36320] TRUE
## Sorted_Kidney_day_15_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] TRUE
## [2] TRUE
## [3] TRUE
## [4] FALSE
## [5] FALSE
## ... ...
## [36316] TRUE
## [36317] TRUE
## [36318] TRUE
## [36319] FALSE
## [36320] FALSE
## Sorted_Liver_day_12_1_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] TRUE
## [4] FALSE
## [5] FALSE
## ... ...
## [36316] TRUE
## [36317] TRUE
## [36318] TRUE
## [36319] FALSE
## [36320] TRUE
## Sorted_Liver_day_12_2_Small_Paired_peaks.narrowPeak
## <logical>
## [1] FALSE
## [2] FALSE
## [3] FALSE
## [4] FALSE
## [5] FALSE
## ... ...
## [36316] FALSE
## [36317] TRUE
## [36318] TRUE
## [36319] TRUE
## [36320] TRUE
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
Now we have to set of regions to count in we can use summariseOverlaps() to count paired reads landing in peaks as we have done for ChIPseq.
We have to adjust our counting for paired-end reads so we additionally set the singleEnd parameter to FALSE.
library(GenomicAlignments)
<- dir("~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM_forCounting/", full.names = TRUE,
bamsToCount pattern = "*.\\.bam$")
<- summarizeOverlaps(consensusToCount, bamsToCount, singleEnd = FALSE)
myCounts
colnames(myCounts) <- c("HindBrain_1", "HindBrain_2", "Kidney_1", "Kidney_2", "Liver_1",
"Liver_2")
With our counts of fragments in nucleosome free regions we can now contruct a DESeq2 object.
We pass the GRanges of regions we count to DESeqDataSetFromMatrix function so as to access these from DESeq2 later.
library(DESeq2)
load("data/myCounts.RData")
<- factor(c("HindBrain", "HindBrain", "Kidney", "Kidney", "Liver", "Liver"))
Group <- data.frame(Group, row.names = colnames(myCounts))
metaData
<- DESeqDataSetFromMatrix(assay(myCounts), metaData, ~Group, rowRanges = rowRanges(myCounts))
atacDDS <- DESeq(atacDDS) atacDDS
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## -- note: fitType='parametric', but the dispersion trend was not well captured by the
## function: y = a/x + b, and a local regression fit was automatically substituted.
## specify fitType='local' or 'mean' to avoid this message next time.
## final dispersion estimates
## fitting model and testing
With the new DESeq2 object we can now test for any differences in ATACseq signal between groups.
In this example we look at differences between the Hindbrain and Kidney samples.
We return a GRanges object here to allow us to perform some more GenomicRanges operations.
<- results(atacDDS, c("Group", "Kidney", "HindBrain"), format = "GRanges")
KidneyMinusHindbrain <- KidneyMinusHindbrain[order(KidneyMinusHindbrain$pvalue)]
KidneyMinusHindbrain KidneyMinusHindbrain
## GRanges object with 36320 ranges and 6 metadata columns:
## seqnames ranges strand | baseMean log2FoldChange
## <Rle> <IRanges> <Rle> | <numeric> <numeric>
## [1] chr4 22488379-22488925 * | 96.8199 -3.48476
## [2] chr1 24613428-24614006 * | 285.2533 1.46558
## [3] chrX 143483004-143483128 * | 1130.4129 -1.22535
## [4] chr2 150644059-150644430 * | 36.0214 -4.17235
## [5] chr15 66239217-66239619 * | 32.3408 -4.44791
## ... ... ... ... . ... ...
## [36316] chrX 93897236-93897401 * | 2.85026 0
## [36317] chrX 100475652-100475834 * | 2.19112 0
## [36318] chrX 153334237-153334437 * | 2.26341 0
## [36319] chrX 164197795-164198006 * | 4.02077 0
## [36320] chrX 169047684-169047857 * | 2.55896 0
## lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric>
## [1] 0.272029 -12.81021 1.43736e-37 5.01798e-33
## [2] 0.161832 9.05623 1.35040e-19 2.35719e-15
## [3] 0.148502 -8.25140 1.56553e-16 1.82181e-12
## [4] 0.521845 -7.99539 1.29168e-15 1.12735e-11
## [5] 0.583590 -7.62164 2.50474e-14 1.74886e-10
## ... ... ... ... ...
## [36316] 2.36122 0 1 1
## [36317] 2.56424 0 1 NA
## [36318] 2.54823 0 1 NA
## [36319] 2.22715 0 1 1
## [36320] 2.42940 0 1 NA
## -------
## seqinfo: 22 sequences from an unspecified genome; no seqlengths
We can subset to only open regions within promoters and then create a HTML table to review the results in IGV using the makebedtable() function in tracktables package.
You can see the result here
## [1] "data//KidneyMinusHindbrain.html"
library(TxDb.Mmusculus.UCSC.mm10.knownGene)
<- promoters(TxDb.Mmusculus.UCSC.mm10.knownGene, 500, 500)
toOverLap <- KidneyMinusHindbrain[(!is.na(KidneyMinusHindbrain$padj) &
KidneyMinusHindbrain $padj < 0.05) & KidneyMinusHindbrain %over% toOverLap, ]
KidneyMinusHindbrain<- makebedtable(KidneyMinusHindbrain, "KidneyMinusHindbrain.html", getwd())
myReport browseURL(myReport)
In the final part we can annotate our differential ATACseq regions to genes and then use gene information to test enrichment for GO sets.
Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment
<- annotatePeak(KidneyMinusHindbrain, TxDb = TxDb.Mmusculus.UCSC.mm10.knownGene,
anno_KidneyMinusHindbrain verbose = FALSE)
<- as.data.frame(anno_KidneyMinusHindbrain)
DB_ATAC 1, ] DB_ATAC[
## seqnames start end width strand baseMean log2FoldChange lfcSE
## 1 chr4 22488379 22488925 547 * 96.81992 -3.484755 0.2720294
## stat pvalue padj annotation geneChr geneStart
## 1 -12.81021 1.437363e-37 5.017979e-33 Promoter (<=1kb) 4 22482780
## geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
## 1 22488366 5587 2 18992 ENSMUST00000178174.2 -13
Since we have subset regions to those within +/- 500bp of a TSS we can use a standard enrichment analysis at this point. Here we use clusterProfiler to identify enrichment. For alternatives have a look at our ChIPseq course
library(clusterProfiler)
<- enrichGO(DB_ATAC$geneId, OrgDb = "org.Mm.eg.db", ont = "BP", maxGSSize = 5000)
go 1:2, 1:6] go[
## ID Description
## GO:0120036 GO:0120036 plasma membrane bounded cell projection organization
## GO:0030030 GO:0030030 cell projection organization
## GeneRatio BgRatio pvalue p.adjust
## GO:0120036 94/511 1609/23355 1.279533e-18 5.672171e-15
## GO:0030030 95/511 1656/23355 2.713130e-18 6.013653e-15
ATACseq should generate shorter fragments (our nucleosome free regions) around smaller protected areas such as transcription factor binding sites.
We can therefore look for the pile-up of cut-sites around motifs of interest within different tissues/celltypes/samples.
To produce cut-sites from our BAM file we first resize our reads to 1bp and make the shift of 4/-5 bp depending on strand to adjust for expected shift from insertion of Tn5 transposase.
Here we will identify CTCF motifs passing an arbitary cut-off and then use soGGi to plot cut-sites around them. We will jump back to our Greenleaf dataset to do this.
We need to identify the position of CTCF motifs across the genome so first we need to know what a CTCF motif looks like.
The motifDB package contains information on Motifs from public databases such as JASPAR. Here we use the query() function with our motif of interest (CTCF) to extract the CTCF motif.
library(MotifDb)
library(Biostrings)
library(BSgenome.Hsapiens.UCSC.hg19)
<- query(MotifDb, c("CTCF"))
CTCF CTCF
## MotifDb object of length 19
## | Created from downloaded public sources: 2013-Aug-30
## | 19 position frequency matrices from 9 sources:
## | HOCOMOCOv10: 3
## | HOCOMOCOv11-core-A: 2
## | HOMER: 3
## | JASPAR_2014: 2
## | JASPAR_CORE: 1
## | jaspar2016: 2
## | jaspar2018: 3
## | jolma2013: 1
## | SwissRegulon: 2
## | 4 organism/s
## | Hsapiens: 12
## | Dmelanogaster: 3
## | Mmusculus: 1
## | other: 3
## Hsapiens-HOCOMOCOv10-CTCFL_HUMAN.H10MO.A
## Hsapiens-HOCOMOCOv10-CTCF_HUMAN.H10MO.A
## Mmusculus-HOCOMOCOv10-CTCF_MOUSE.H10MO.A
## Hsapiens-HOCOMOCOv11-core-A-CTCFL_HUMAN.H11MO.0.A
## Hsapiens-HOCOMOCOv11-core-A-CTCF_HUMAN.H11MO.0.A
## ...
## Dmelanogaster-jaspar2018-CTCF-MA0531.1
## Hsapiens-jaspar2018-CTCFL-MA1102.1
## Hsapiens-jolma2013-CTCF
## Hsapiens-SwissRegulon-CTCFL.SwissRegulon
## Hsapiens-SwissRegulon-CTCF.SwissRegulon
We can extract a point weight matrix for CTCF which specifies the likelihood of a DNA base occurring in a CTCF motif. Here we extract the motif for CTCF from Human JASPAR Core database.
names(CTCF)
## [1] "Hsapiens-HOCOMOCOv10-CTCFL_HUMAN.H10MO.A"
## [2] "Hsapiens-HOCOMOCOv10-CTCF_HUMAN.H10MO.A"
## [3] "Mmusculus-HOCOMOCOv10-CTCF_MOUSE.H10MO.A"
## [4] "Hsapiens-HOCOMOCOv11-core-A-CTCFL_HUMAN.H11MO.0.A"
## [5] "Hsapiens-HOCOMOCOv11-core-A-CTCF_HUMAN.H11MO.0.A"
## [6] "NA-HOMER-BORIS(Zf)/K562-CTCFL-ChIP-Seq(GSE32465)/Homer"
## [7] "NA-HOMER-CTCF(Zf)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer"
## [8] "NA-HOMER-CTCF-SatelliteElement(Zf?)/CD4+-CTCF-ChIP-Seq(Barski_et_al.)/Homer"
## [9] "Hsapiens-JASPAR_CORE-CTCF-MA0139.1"
## [10] "Hsapiens-JASPAR_2014-CTCF-MA0139.1"
## [11] "Dmelanogaster-JASPAR_2014-CTCF-MA0531.1"
## [12] "Hsapiens-jaspar2016-CTCF-MA0139.1"
## [13] "Dmelanogaster-jaspar2016-CTCF-MA0531.1"
## [14] "Hsapiens-jaspar2018-CTCF-MA0139.1"
## [15] "Dmelanogaster-jaspar2018-CTCF-MA0531.1"
## [16] "Hsapiens-jaspar2018-CTCFL-MA1102.1"
## [17] "Hsapiens-jolma2013-CTCF"
## [18] "Hsapiens-SwissRegulon-CTCFL.SwissRegulon"
## [19] "Hsapiens-SwissRegulon-CTCF.SwissRegulon"
<- CTCF[[1]]
ctcfMotif 1:4] ctcfMotif[,
## 1 2 3 4
## A 0.100 0.162 0.302 0.072
## C 0.356 0.214 0.100 0.826
## G 0.122 0.406 0.436 0.050
## T 0.422 0.218 0.162 0.052
We can visualise the frequency of DNA bases in our motif using the seqLogo package and the seqLogo function.
library(seqLogo)
seqLogo(ctcfMotif)
We can now use the matchPWM() function with our newly acquired CTCF PWM.
Here we will search the sequence on Chr20 using the sequence provided within the BSgenome library for human BSgenome.Hsapiens.UCSC.hg19.
The result is a Views object, similar to the IRanges object. We convert
<- matchPWM(ctcfMotif, BSgenome.Hsapiens.UCSC.hg19[["chr20"]])
myRes myRes
## Views on a 63025520-letter DNAString subject
## subject: NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## views:
## start end width
## [1] 60545 60563 19 [GGACCACGAGGGGCCAGAG]
## [2] 216708 216726 19 [TGGGCTCTAGAGGGCAGGG]
## [3] 239773 239791 19 [TTGCCACTGGGGGGAGACA]
## [4] 246577 246595 19 [CAGCCCCCAGGTGGCAGCC]
## [5] 335282 335300 19 [GTACCACTAGAGGGAGCTC]
## ... ... ... ... ...
## [2327] 62917548 62917566 19 [TCGCCAGCAGGGGGTGCAC]
## [2328] 62917607 62917625 19 [CTGCAAGCAGGGGGCGGTC]
## [2329] 62922628 62922646 19 [GCACCAGAAGGTGGCAGGA]
## [2330] 62922894 62922912 19 [GATCCAGCAAGGGGCGGCA]
## [2331] 62961365 62961383 19 [CCACCACCACGTGGTGCCA]
We need to convert the Views object to a GRanges so we can use these in soGGi to plot cut sites over.
<- GRanges("chr20", ranges(myRes))
toCompare toCompare
## GRanges object with 2331 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 60545-60563 *
## [2] chr20 216708-216726 *
## [3] chr20 239773-239791 *
## [4] chr20 246577-246595 *
## [5] chr20 335282-335300 *
## ... ... ... ...
## [2327] chr20 62917548-62917566 *
## [2328] chr20 62917607-62917625 *
## [2329] chr20 62922628-62922646 *
## [2330] chr20 62922894-62922912 *
## [2331] chr20 62961365-62961383 *
## -------
## seqinfo: 1 sequence from an unspecified genome; no seqlengths
To plot cut-sites we will wish to consider only the 5’ end of reads and will need to adjust for a known offset of 5’ reads to actual T5 cut-sites.
This will involve capturing the 5’end of reads and shifting reads on positive and negative strand by 4bp or -5bp respectively.
First we read in our nucleosome free region BAM file and extract read pairs.
<- "~/Downloads/ATAC_Workshop/ATAC_Data/ATAC_BAM/Sorted_ATAC_50K_2_openRegions.bam"
BAM <- readGAlignmentPairs(BAM)
atacReads_Open <- first(atacReads_Open)
read1 <- second(atacReads_Open)
read2 1, ] read2[
## GAlignments object with 1 alignment and 0 metadata columns:
## seqnames strand cigar qwidth start end width
## <Rle> <Rle> <character> <integer> <integer> <integer> <integer>
## [1] chr20 - 50M 50 60446 60495 50
## njunc
## <integer>
## [1] 0
## -------
## seqinfo: 25 sequences from an unspecified genome
Now we can shift the 5’ end of both reads pairs by 4bp or -5bp depending on strand. This produces a GRanges of all our cut-sites from both reads.
<- resize(granges(read1), fix = "start", 1)
Firsts <- shift(granges(Firsts[strand(read1) == "+"]), 4)
First_Pos_toCut <- shift(granges(Firsts[strand(read1) == "-"]), -5)
First_Neg_toCut
<- resize(granges(read2), fix = "start", 1)
Seconds <- shift(granges(Seconds[strand(read2) == "+"]), 4)
Second_Pos_toCut <- shift(granges(Seconds[strand(read2) == "-"]), -5)
Second_Neg_toCut
<- c(First_Pos_toCut, First_Neg_toCut, Second_Pos_toCut, Second_Neg_toCut)
test_toCut 1:2, ] test_toCut[
## GRanges object with 2 ranges and 0 metadata columns:
## seqnames ranges strand
## <Rle> <IRanges> <Rle>
## [1] chr20 60440 +
## [2] chr20 61905 +
## -------
## seqinfo: 25 sequences from an unspecified genome
Now we can use the GRanges of cut-site positions to produce an RLElist of cut-sites across the genome using the coverage() function.
<- coverage(test_toCut)
cutsCoverage <- cutsCoverage["chr20"]
cutsCoverage20 1]] cutsCoverage20[[
## integer-Rle of length 63025520 with 215635 runs
## Lengths: 60439 1 49 1 1414 ... 5268 1 42 1 60252
## Values : 0 1 0 1 0 ... 0 1 0 1 0
We can use an RLElist with soGGi to produce a plot of cut-sites around our discovered CTCF motifs.
We change the format to rlelist and the distanceAround parameter to 500bp.
<- regionPlot(cutsCoverage20, testRanges = toCompare, style = "point",
CTCF_Cuts_open format = "rlelist", distanceAround = 500)
Now we can now plot our cut-sites using the plotRegion() function.
plotRegion(CTCF_Cuts_open, outliers = 0.001) + ggtitle("NucFree Cuts Centred on CTCF") +
theme_bw()
Exercise on ATACseq data can be found here
Answers can be found here