In previous sessions we have seen how to handle unaligned sequence data as well as where and how to retrieve genome reference data in Fasta format.
We will want to produce a SAM/BAM file containing aligned reads in a format we can visualize in IGV and work with in our downstream analysis.
We now need a suitable aligner software to place unaligned reads onto our reference genome to produce a SAM/BAM file.
Aligner softwares can be broadly placed into two categories.
Popular genomic aligners include Bowtie, bwa, subread,GMAP.
Popular splice aware aligners include hisat2,Tophat, SpliceMap, subjunc, GSNAP, STAR.
A splice aware aligner is important for analysis of RNAseq where mRNA’s introns are spliced to stitch exons into continuous sequence.
A few of the popular aligners are wrapped up in R/Bioconductor packages allowing us to use our aligner software from R as well as make use of some of the R/Bioconductor objects we are growing to love.
The gampR packages offer convenient access to gmap/gsnap on Mac and Linux but sadly are not implemented on Windows.
The QuasR package offers an interface to Bowtie and SpliceMap on Windows, Mac and Linux and so provides access to a genomic and splice aware aligner on all systems.
The Rbowtie2 package offers a wrapper to the popular Bowtie2 software and offers significant improvement in memory and CPU usage to Bowtie packaged in QuasR.
rsubread offers convenient access to subread/subjunc, and recently got expanded from Mac and Linux only, to also include Windows. We recommend rsubread for analysis.
In this session we will be making use of some public datasets from the Encode consortium.
We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.
This RNAseq data has been generated from the human cell line GM12878 and the link to experiment can be found here or a direct link to FastQ for replicate 2 we are using can be found here.
To speed up the processing for this practical I have retrieved aligned data from Encode for the sample ENCSR297UBP and extracted reads mapping to ActB gene on hg20 human genome build. I have further downsampled these reads to only 10000 reads out of the 200000 mapping to this gene.
This sampled data can be found in data/sampledActin.fq.gz
The Rsubread package offers a fast aligner for both genomic and splice aware alignment.
Updated refrequently - Latest paper here
::install('Rsubread')
BiocManagerlibrary(Rsubread)
We will also require some reference data so lets install the BSgenome package for hg38.
::install("BSgenome.Hsapiens.UCSC.hg38")
BiocManagerlibrary(BSgenome.Hsapiens.UCSC.hg38)
To build an index for mapping we need our reference sequence in FASTA format.
Lets extract a FASTA file from out BSgenome object. Here we will create a FASTA file for just Chr7 (the location of ActB)
<- BSgenome.Hsapiens.UCSC.hg38[["chr7"]]
chr7hg38 <- DNAStringSet(list(chr7=chr7hg38))
chr7hg38Set writeXStringSet(chr7hg38Set,file="chr7.fa")
The Rsubread package requires we first build an index of our reference genome. This is what we will align our Fastq files to.
Here we use the buildindex() function, specifying the name of index to be built and the FASTA file to build the index from.
By default the memory (RAM) used when building an index is limited to 8GB. This is often too large for your laptop. You can dial this down using the memory argument.
This may take some time but you will only need to run this index step once.
buildindex("chr7","chr7.fa", memory=8000)
We can now use the align() function to align reads to the genome.
We must specify out index name, fastq file, desired output format and name of BAM file to be created.
align("chr7","data/sampledActin.fq.gz",
output_format = "BAM",
output_file = "data/Rsubread_NoSplicing_sampledActin.bam")
Now we have our aligned data as BAM formats we must perform two final operations on our BAM to make it ready for use in external programs.
These are:
The samtools software provide command line tools to handle SAM and BAM files such as indexing and sorting.
The Rsamtools package allows us to make us of the samtools functions within R.
First we can install and load the library.
::install("Rsamtools")
BiocManagerlibrary(Rsamtools)
We can use the Rsamtools function sortBam() to sort our BAM file.
The sortBam() function take as arguments the path to BAM file to sort and the prefix of sorted BAM output.
Note the prefix should not contain the .bam extension.
sortBam("data/Rsubread_NoSplicing_sampledActin.bam","SortedActB")
## [1] "SortedActB.bam"
After sorting, we can now index our sorted BAM file using the indexBAM() function.
indexBam("SortedActB.bam")
## SortedActB.bam
## "SortedActB.bam.bai"
We can get an overview of BAM file information using the quickBamFlagSummary() function.
quickBamFlagSummary("SortedActB.bam")
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 10000 | 10000 | 1 / 1
## o template has single segment.... S | 10000 | 10000 | 1 / 1
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 9748 | 9748 | 1 / 1
## - primary alignment......... S2 | 9748 | 9748 | 1 / 1
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 252 | 252 | 1 / 1
We can now review our BAM file in IGV.
But if we look for splice junctions we see we are missing some information.
We are missing reads which would align across splice junctions. We can now use the subjunc() function to align reads in a splice aware manner.
We must again specify out index name, fastq file, desired output format and name of BAM file to be created.
subjunc("chr7","data/sampledActin.fq.gz",
output_format = "BAM",
output_file = "data/RsubreadsampledActin.bam")
We can now sort and index our splice aware alignment as before.
sortBam("data/RsubreadsampledActin.bam",
"SortedActBSpliced")
## [1] "SortedActBSpliced.bam"
indexBam("SortedActBSpliced.bam")
## SortedActBSpliced.bam
## "SortedActBSpliced.bam.bai"
We can get an overview of BAM file information using the quickBamFlagSummary() function.
quickBamFlagSummary("SortedActBSpliced.bam")
## group | nb of | nb of | mean / max
## of | records | unique | records per
## records | in group | QNAMEs | unique QNAME
## All records........................ A | 10000 | 10000 | 1 / 1
## o template has single segment.... S | 10000 | 10000 | 1 / 1
## o template has multiple segments. M | 0 | 0 | NA / NA
## - first segment.............. F | 0 | 0 | NA / NA
## - last segment............... L | 0 | 0 | NA / NA
## - other segment.............. O | 0 | 0 | NA / NA
##
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
##
## Details for group S:
## o record is mapped.............. S1 | 9756 | 9756 | 1 / 1
## - primary alignment......... S2 | 9756 | 9756 | 1 / 1
## - secondary alignment....... S3 | 0 | 0 | NA / NA
## o record is unmapped............ S4 | 244 | 244 | 1 / 1
We can now review our splice aware BAM file in IGV.
Now if we compare to the original alignment from Encode we can identify where some of our unaligned reads may have gone.
The default parameters for alignment within aligners are carefully selected for alignment with ChIP-seq, ATAC-seq, WGS and splice aware alignment in RNA-seq. We will review this performance over the coming sessions.
An example here shows alignment with STAR, which often allows for very long splicing. Here you can see reads map over the neighbouring gene to an unrelated region.
The Rbowtie2 library offers an alternative genomic alignment in R using Bowtie2.
::install("Rbowtie2")
BiocManagerlibrary(Rbowtie2)
As we saw with rsubread we first need to build an index to align too. We can build the index using the bowtie2_build function supplying the path to Fasta file to references argument and path for bowtie2 index to bt2Index command.
Again, this may take some time but you will only need to run this index step once.
bowtie2_build(references="chr7.fa",
bt2Index=file.path("chr7hg38"))
Now we have created an index to align to, we can align our fastq data to this index.
First however we will need to decompress our compressed fastq files to use them with Rbowtie2. The gunzip function in R allows us to decompress file from R.
library(R.utils)
gunzip("data/sampledActin.fq.gz")
We can align our data using the main bowtie2 function and specifying the index path to bt2Index argument, the output path for sam to samOutput argument and the sequence we wish to align to the seq1 path.
bowtie2(bt2Index = "chr7hg38",
samOutput = "sampledActin.sam",
seq1 = "data/sampledActin.fq")
The bowtie2 function outputs a SAM file. We will want to produce a BAM file we can sort and index from this SAM file using the asBam function.
<- asBam("sampledActin.sam")
bamFile_Bowtie2 bamFile_Bowtie2
We can then sort and index our BAM file to ensure we are ready for IGV and downstream analysis.
sortBam(bamFile_Bowtie2,"SortedActBSpliced_bowtie")
indexBam("SortedActBSpliced_bowtie.bam")
Aligners have a variety of costs/benefits. Rbowtie2 is a significant improvement the original Bowtie, but it is not splice aware alignment. It only allows for genomic alignment.
This means that Rbowtie2 is suitable for - ATAC-seq - ChIP-seq - WGS
Rbowtie2 is NOT suitable for - RNA-seq - Ribo-seq
Some wrappers have multiple aligners built in to them.
This makes it easy to switch between options.
::install("QuasR")
BiocManagerlibrary(QuasR)
The main function for alignment in the QuasR package is the qAlign() function.
The qAlign() function requires just two arguments.
The sample table requires is a tab-delimited file listing the path to fastq(s) to be aligned and the desired sample names.
<- "data/sampledActin.fq.gz"
FileName <- "sampledActin"
SampleName <- data.frame(FileName,SampleName)
sampleTable write.table(sampleTable,file="sampleTable.txt",sep="\t",quote=FALSE,row.names = FALSE)
sampleTable
## FileName SampleName
## 1 data/sampledActin.fq.gz sampledActin
The simplest way to specify a reference genome for alignment in qAlign() is to use a BSgenome object.
Here we can simply specify the name of BSgenome object we wish to use for alignment. Here we specify the BSgenome object for hg38 BSgenome.Hsapiens.UCSC.hg38.
library(QuasR)
qAlign("sampleTable.txt", "BSgenome.Hsapiens.UCSC.hg38")
Internally, QuasR will create a FASTA file from our BSgenome object prior to alignment.
We can also provide a FASTA file directly to the qAlign() function.
qAlign("sampleTable.txt","chr7.fa")
The aligner argument is used to control which algorithm you were going to use. Here we use Rhisat2 as it is a good algorithm for mapping reads across splice junctions.
qAlign("sampleTable.txt","chr7.fa", aligner="Rhisat2")