All prerequisites, links to material and slides for this course can be found on github.
Or can be downloaded as a zip archive from here.
Once the zip file in unarchived. All presentations as HTML slides and pages, their R code and HTML practical sheets will be available in the directories underneath.
Before running any of the code in the practicals or slides we need to set the working directory to the folder we unarchived.
You may navigate to the unarchived ATAC.Cut-Run.ChIP-master folder in the Rstudio menu.
Session -> Set Working Directory -> Choose Directory
or in the console.
There are a variety of genomics techniques to start to profile the epigenome of cells.
They fit a few different classes:
TF-bound/Histone Modifications: Cut and Run, Cut and Tag, ChIP-seq
Open Regions: ATAC-seq, MNase-seq and DNase-seq
Genome organization - Hi-C, HiCAR, Capture-C
DNaseseq - Enzymatic digestion to extract signal from open chromatin around transcription factor binding sites.
MNaseseq - Enzymatic digestion to extract signal repesenting nucleosome positioning.
ATACseq - Uses transposases and offers a method to simultaneously extract signal from transcription factors binding sites and nucleosome positions from a single sample.
ATACseq (Assay for Transposase-Accessible Chromatin using sequencing) uses a transposase to efficiently fragment accessible DNA prior to sequencing. The result provides a method to map the accessible/open chromatin genome wide.
In contrast to other techniques, ATACseq has several advantages including
Despite being different technologies that reveal differing biological phenomena the braod. strokes for analysis of all these techniques is the same:
We will review analysis of these data types with a unified approach, but highlight key differences that need to be considered between technologies as we go.
Our raw sequencing data will be in FASTQ format.
For your own data you will typically receive a fastq from the genomics group you are working with.
For published data the fastq will be available on GEO/SRA or ENA. For example we will be working today with data from the Fuchs lab: The pioneer factor SOX9 competes for epigenetic factors to switch stem cell fates
For more information on the data you can look at GEO or download the fastq files (if you want) directly from ENA:
We will work through how to run these early steps, but we will also provide intermediate files once we have completed some of the heavy processing as some of these steps will take too long to run in these sessions.
At this time we will want to review the QC of our data. We can look for features like:
There are two main approaches:
Once we have downloaded the raw FASTQ data we can use the ShortRead package to review our sequence data quality.
We have more detailed course material on how to work with raw sequencing data in the FASTQ in Bioconductor session.
First we load the ShortRead library.
First we will review the raw sequencing reads using functions in the ShortRead package.
We do not need to review all reads in the file to can gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.
Note when we subsample we retrieve random reads from across the entire FASTQ file. This is important as FASTQ files are often ordered by their position on the sequencer.
We can subsample from a FASTQ file using functions in ShortRead package.
Here we use the FastqSampler and yield function to randomly sample a defined number of reads from a FASTQ file. Here we subsample 1 million reads. This should be enough to have an understanding of the quality of the data.
NOTE: This step may take a little time, so you can skip it for now if you want.
We have provided a subsampled version of our fastq file here for you to read in directly.
The resulting object is a ShortReadQ object showing information on the number of cycles, base pairs in reads, and number of reads in memory.
## class: ShortReadQ
## length: 1000000 reads; width: 40 cycles
If we wished, we can assess information from the FASTQ file using accessor functions.
## DNAStringSet object of length 1000000:
## width seq
## [1] 40 CAACTGAATTTAGATAGGGAAGTAGCAGAAGGTTACATCT
## [2] 40 AGGTTTTTGGGGTAAAATGCTCGCATAAATGACGGATCCC
## [3] 40 CACTTCGTCTGCCAGCGGCGCAACGGCGTGGCGATGCCGG
## [4] 40 TCTGAGCTCATCGTAAGAAGGCATGGCAGCAAGGGTGTGG
## [5] 40 TCAGCTGTGGTGTGAGCTGCTTGCAAAAACATCGTGCTAG
## ... ... ...
## [999996] 40 GTACGTTAAAACAGCTGTTACACATGGATCTATGCACTTA
## [999997] 40 GTGAGAGGGTGCGCCAGAGAACCTGACAGCTTCTGGAACA
## [999998] 40 GAAGCTGTGTCATATGTCATGCTCTGGTTAAAGGTTAACT
## [999999] 40 TAGAGATGAAACCCATGTCTCAGAGCTCTTTCCTTCACAT
## [1000000] 40 TTGCATACATTAACTGGCTTGAGGTAACTATTATTTTTCC
We can review the occurrence of DNA bases within reads and well as the occurrence of DNA bases across sequencing cycles using the alphabetFrequency() and alphabetByCycle() functions respectively.
Here we check the overall frequency of A, G, C, T and N (unknown bases) in our sequence reads.
readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3, ]
## A C G T M R W S Y K V H D B N - + .
## [1,] 15 5 10 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 11 7 11 11 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 5 14 15 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.
## A C G T N
## 11744925 8497685 8344368 11412114 908
We can review DNA base occurrence by cycle using the alphabetByCycle() function.
## cycle
## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## A 336004 302683 305045 291736 297843 287897 303525 294275 277613 295295
## C 233900 185367 210355 230639 215323 211798 205233 208795 205654 207707
## G 174963 239691 197263 197345 207617 212597 205342 192901 212591 207859
## T 254822 272252 287337 280280 279217 287564 285893 304026 304142 289135
We often plot this to visualize the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame.
Now we can plot the frequencies using ggplot2
library(ggplot2)
ggplot(toPlot, aes(y = Count, x = Cycle, colour = Base)) + geom_line() + theme_bw()
library(ggplot2)
ggplot(toPlot, aes(y = Count, x = Cycle, colour = Base)) + geom_line() + ylim(150000,
350000) + theme_bw()
We can also check some the reads’ quality scores for our subsampled FASTQ data.
We use the alphabetScore() function with our read’s qualitys to retrieve the sum quality for every read from our subsample.
## [1] 1200 1200 1200 1200 1200 1200 1200 1200 1200 1200
We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.
toPlot <- data.frame(ReadQ = readQualities)
ggplot(toPlot, aes(x = ReadQ)) + geom_histogram() + theme_minimal()
In this case the distribution of reads quality scores is a little strange. Normally we see a wider range of scores. This is likely due to some kind of loss of information during some transfer/conversion.
For a more in-detail review of working with fastqs in R you can check here. This also includes how to run filters on the data based on quality.
Most of the time we will not read in our fastqs. We normally only do this if there is a clear issue that we need to take a deep dive on. Instead use software like FastP to do a comprehensive FastQ review in one simple command.
This is one of the most popular and respected QC tools around.
It does what many of the others do, but it is VERY fast. There is a wrapper in R (developed by our team) so we can run it form R with ease: Rfastp
There are many ways to customize including trimming options, barcodes and stringency. It will also handle the paired-end data simultaneously for us.
Trims low-quality bases from the 5′ and 3′ ends using a sliding window approach
Automatically detects and removes adapters
Corrects mismatched bases in overlapping regions of paired-end reads (based on quality)
Trims polyA/polyX tails from the 3′ ends, depending on library type
Filters out low-quality reads based on user-defined thresholds
To run Rfastp we simply provide our two fastq files (or just one if we are running single end). We will also provide a name for the filtered Fastq file which is output.
Running RfastP creates several files:
We can review the QC of our samples by checking out the html file. We have uploaded the result for this here
You can also review the samples computationally using the .json file.
Here is a second example report from a ChIPseq experiment of Myc from ENCODE.
Whenever considering QC we rarely rule samples out at this early stage. But we keep an eye out for consistency between samples and if any sample looks bad across multiple metrics.
Always also consider your experiment, and previous experiments that may be similar.
We can also look at some of the plots and QC metrics in R using the assigned object.
We have saved this in the data folder for you to load in data/fastp_res.RData.
## Before_QC After_QC
## total_reads 1.745457e+08 1.705088e+08
## total_bases 6.981829e+09 6.801466e+09
## q20_bases 6.934319e+09 6.801466e+09
## q30_bases 6.934319e+09 6.801466e+09
## q20_rate 9.931950e-01 1.000000e+00
## q30_rate 9.931950e-01 1.000000e+00
## read1_mean_length 4.000000e+01 3.900000e+01
## read2_mean_length 4.000000e+01 3.900000e+01
## gc_content 4.239230e-01 4.193530e-01
We typically use the default settings, but there are lots of options to control Rfastp. Specifically this could help more stringent filtering if you observe any QC problems i.e. minReadLength or qualityFilterPercent.
NOTE - this is single-end RNAseq, but for this example it doesn’t matter
The quality score dips pretty low at the start of our reads.
Trimming removes low-quality bases and ambiguous N bases, which can skew GC content at the 5′ end of reads
What threshold defines abnormal GC content? - No strict cutoff — depends on organism, library prep, and sequencing platform
Exercises for working with fastqs can be found here
Answers can be found here