Alignment


Alignment - Fasta and fastq

In previous sessions we have seen how to handle unaligned sequence data as well as where and how to retrieve genome reference data in Fasta format.

igv

igv

Aligned data - SAM/BAM

We will want to produce a SAM/BAM file containing aligned reads in a format we can visualize in IGV and work with in our downstream analysis.

igv

Alignment softwares

We now need a suitable aligner software to place unaligned reads onto our reference genome to produce a SAM/BAM file.

Aligner softwares can be broadly placed into two categories.

  • Genomic aligner (WGS, ChIP-seq, ATAC-seq etc).
  • Splice Aware aligner (RNA-seq, Ribo-seq).

Popular genomic aligners include Bowtie, bwa, subread,GMAP.

Popular splice aware aligners include hisat2,Tophat, SpliceMap, subjunc, GSNAP, STAR.

Splice alignment

A splice aware aligner is important for analysis of RNAseq where mRNA’s introns are spliced to stitch exons into continuous sequence.

igv

Alignment software in R

A few of the popular aligners are wrapped up in R/Bioconductor packages allowing us to use our aligner software from R as well as make use of some of the R/Bioconductor objects we are growing to love.

  • Bowtie - Rbowtie, Rbowtie2, QuasR
  • GSNAP/GMAP - gmapR
  • SpliceMap - QuasR
  • subread/subjunc - rsubread
  • hisat2 - Rhisat2, QuasR

Alignment software in R

The gampR packages offer convenient access to gmap/gsnap on Mac and Linux but sadly are not implemented on Windows.

The QuasR package offers an interface to Bowtie and SpliceMap on Windows, Mac and Linux and so provides access to a genomic and splice aware aligner on all systems.

The Rbowtie2 package offers a wrapper to the popular Bowtie2 software and offers significant improvement in memory and CPU usage to Bowtie packaged in QuasR.

rsubread offers convenient access to subread/subjunc, and recently got expanded from Mac and Linux only, to also include Windows. We recommend rsubread for analysis.

Data

In this session we will be making use of some public datasets from the Encode consortium.

We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.

This RNAseq data has been generated from the human cell line GM12878 and the link to experiment can be found here or a direct link to FastQ for replicate 2 we are using can be found here.

Data

To speed up the processing for this practical I have retrieved aligned data from Encode for the sample ENCSR297UBP and extracted reads mapping to ActB gene on hg20 human genome build. I have further downsampled these reads to only 10000 reads out of the 200000 mapping to this gene.

This sampled data can be found in data/sampledActin.fq.gz

The Rsubread package

The Rsubread package offers a fast aligner for both genomic and splice aware alignment.

Updated refrequently - Latest paper here

BiocManager::install('Rsubread')
library(Rsubread)

BSGenome packages

We will also require some reference data so lets install the BSgenome package for hg38.

BiocManager::install("BSgenome.Hsapiens.UCSC.hg38")
library(BSgenome.Hsapiens.UCSC.hg38)

Write out FASTA

To build an index for mapping we need our reference sequence in FASTA format.

Lets extract a FASTA file from out BSgenome object. Here we will create a FASTA file for just Chr7 (the location of ActB)

chr7hg38 <- BSgenome.Hsapiens.UCSC.hg38[["chr7"]]
chr7hg38Set <- DNAStringSet(list(chr7=chr7hg38))
writeXStringSet(chr7hg38Set,file="chr7.fa")

Building an index

The Rsubread package requires we first build an index of our reference genome. This is what we will align our Fastq files to.

Here we use the buildindex() function, specifying the name of index to be built and the FASTA file to build the index from.

By default the memory (RAM) used when building an index is limited to 8GB. This is often too large for your laptop. You can dial this down using the memory argument.

This may take some time but you will only need to run this index step once.

buildindex("chr7","chr7.fa", memory=8000)

Aligning

We can now use the align() function to align reads to the genome.

We must specify out index name, fastq file, desired output format and name of BAM file to be created.

align("chr7","data/sampledActin.fq.gz",
        output_format = "BAM",
        output_file = "data/Rsubread_NoSplicing_sampledActin.bam")

Aligned data in BAM

Now we have our aligned data as BAM formats we must perform two final operations on our BAM to make it ready for use in external programs.

These are:

  • Sorting (here by coordinate)
  • Indexing

Rsamtools

The samtools software provide command line tools to handle SAM and BAM files such as indexing and sorting.

The Rsamtools package allows us to make us of the samtools functions within R.

First we can install and load the library.

BiocManager::install("Rsamtools")
library(Rsamtools)

Rsamtools sorting

We can use the Rsamtools function sortBam() to sort our BAM file.

The sortBam() function take as arguments the path to BAM file to sort and the prefix of sorted BAM output.

Note the prefix should not contain the .bam extension.

sortBam("data/Rsubread_NoSplicing_sampledActin.bam","SortedActB")
## [1] "SortedActB.bam"

Rsamtools sorting

After sorting, we can now index our sorted BAM file using the indexBAM() function.

indexBam("SortedActB.bam")
##       SortedActB.bam 
## "SortedActB.bam.bai"

Rsamtools BAM overview

We can get an overview of BAM file information using the quickBamFlagSummary() function.

quickBamFlagSummary("SortedActB.bam")
##                                 group |    nb of |    nb of | mean / max
##                                    of |  records |   unique | records per
##                               records | in group |   QNAMEs | unique QNAME
## All records........................ A |    10000 |    10000 |    1 / 1
##   o template has single segment.... S |    10000 |    10000 |    1 / 1
##   o template has multiple segments. M |        0 |        0 |   NA / NA
##       - first segment.............. F |        0 |        0 |   NA / NA
##       - last segment............... L |        0 |        0 |   NA / NA
##       - other segment.............. O |        0 |        0 |   NA / NA
## 
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
## 
## Details for group S:
##   o record is mapped.............. S1 |     9748 |     9748 |    1 / 1
##       - primary alignment......... S2 |     9748 |     9748 |    1 / 1
##       - secondary alignment....... S3 |        0 |        0 |   NA / NA
##   o record is unmapped............ S4 |      252 |      252 |    1 / 1

Review in IGV

We can now review our BAM file in IGV.

igv

Review in IGV

But if we look for splice junctions we see we are missing some information.

igv

Splice aware aligning

We are missing reads which would align across splice junctions. We can now use the subjunc() function to align reads in a splice aware manner.

We must again specify out index name, fastq file, desired output format and name of BAM file to be created.

subjunc("chr7","data/sampledActin.fq.gz",
        output_format = "BAM",
        output_file = "data/RsubreadsampledActin.bam")

Sort and index our file

We can now sort and index our splice aware alignment as before.

sortBam("data/RsubreadsampledActin.bam",
        "SortedActBSpliced")
## [1] "SortedActBSpliced.bam"
indexBam("SortedActBSpliced.bam")
##       SortedActBSpliced.bam 
## "SortedActBSpliced.bam.bai"

Rsamtools BAM overview

We can get an overview of BAM file information using the quickBamFlagSummary() function.

quickBamFlagSummary("SortedActBSpliced.bam")
##                                 group |    nb of |    nb of | mean / max
##                                    of |  records |   unique | records per
##                               records | in group |   QNAMEs | unique QNAME
## All records........................ A |    10000 |    10000 |    1 / 1
##   o template has single segment.... S |    10000 |    10000 |    1 / 1
##   o template has multiple segments. M |        0 |        0 |   NA / NA
##       - first segment.............. F |        0 |        0 |   NA / NA
##       - last segment............... L |        0 |        0 |   NA / NA
##       - other segment.............. O |        0 |        0 |   NA / NA
## 
## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M.
## Indentation reflects this.
## 
## Details for group S:
##   o record is mapped.............. S1 |     9756 |     9756 |    1 / 1
##       - primary alignment......... S2 |     9756 |     9756 |    1 / 1
##       - secondary alignment....... S3 |        0 |        0 |   NA / NA
##   o record is unmapped............ S4 |      244 |      244 |    1 / 1

Review in IGV

We can now review our splice aware BAM file in IGV.

igv

Review in IGV

Now if we compare to the original alignment from Encode we can identify where some of our unaligned reads may have gone.

igv

Review in IGV

The default parameters for alignment within aligners are carefully selected for alignment with ChIP-seq, ATAC-seq, WGS and splice aware alignment in RNA-seq. We will review this performance over the coming sessions.

An example here shows alignment with STAR, which often allows for very long splicing. Here you can see reads map over the neighbouring gene to an unrelated region.

igv

The Rbowtie2 package

The Rbowtie2 library offers an alternative genomic alignment in R using Bowtie2.

BiocManager::install("Rbowtie2")
library(Rbowtie2)

The Rbowtie2 package

As we saw with rsubread we first need to build an index to align too. We can build the index using the bowtie2_build function supplying the path to Fasta file to references argument and path for bowtie2 index to bt2Index command.

Again, this may take some time but you will only need to run this index step once.

bowtie2_build(references="chr7.fa", 
                       bt2Index=file.path("chr7hg38"))

The Rbowtie2 package

Now we have created an index to align to, we can align our fastq data to this index.

First however we will need to decompress our compressed fastq files to use them with Rbowtie2. The gunzip function in R allows us to decompress file from R.

library(R.utils)

gunzip("data/sampledActin.fq.gz")

The Rbowtie2 package

We can align our data using the main bowtie2 function and specifying the index path to bt2Index argument, the output path for sam to samOutput argument and the sequence we wish to align to the seq1 path.

bowtie2(bt2Index = "chr7hg38",
          samOutput = "sampledActin.sam",
          seq1 = "data/sampledActin.fq")

The Rbowtie2 package

The bowtie2 function outputs a SAM file. We will want to produce a BAM file we can sort and index from this SAM file using the asBam function.

bamFile_Bowtie2 <- asBam("sampledActin.sam")
bamFile_Bowtie2

The Rbowtie2 package

We can then sort and index our BAM file to ensure we are ready for IGV and downstream analysis.

sortBam(bamFile_Bowtie2,"SortedActBSpliced_bowtie")
indexBam("SortedActBSpliced_bowtie.bam")

The Rbowtie2 package

Aligners have a variety of costs/benefits. Rbowtie2 is a significant improvement the original Bowtie, but it is not splice aware alignment. It only allows for genomic alignment.

This means that Rbowtie2 is suitable for - ATAC-seq - ChIP-seq - WGS

Rbowtie2 is NOT suitable for - RNA-seq - Ribo-seq

QuasR

Some wrappers have multiple aligners built in to them.

This makes it easy to switch between options.

  • Bowtie - QuasR
  • SpliceMap - QuasR
  • hisat2 - QuasR

QuasR

BiocManager::install("QuasR")
library(QuasR)

QuasR genomic alignment

The main function for alignment in the QuasR package is the qAlign() function.

The qAlign() function requires just two arguments.

  • A Samples file - Tab-delimited file containing fastq location and sample names.
  • Reference genome - Character string of BSgenome object or FASTA file location.

QuasR Sample table

The sample table requires is a tab-delimited file listing the path to fastq(s) to be aligned and the desired sample names.

FileName <- "data/sampledActin.fq.gz"
SampleName <- "sampledActin"
sampleTable <- data.frame(FileName,SampleName)
write.table(sampleTable,file="sampleTable.txt",sep="\t",quote=FALSE,row.names = FALSE)
sampleTable
##                  FileName   SampleName
## 1 data/sampledActin.fq.gz sampledActin

QuasR genomic with BSGenome object

The simplest way to specify a reference genome for alignment in qAlign() is to use a BSgenome object.

Here we can simply specify the name of BSgenome object we wish to use for alignment. Here we specify the BSgenome object for hg38 BSgenome.Hsapiens.UCSC.hg38.

library(QuasR)
qAlign("sampleTable.txt", "BSgenome.Hsapiens.UCSC.hg38")

QuasR genomic with FASTA file

Internally, QuasR will create a FASTA file from our BSgenome object prior to alignment.

We can also provide a FASTA file directly to the qAlign() function.

qAlign("sampleTable.txt","chr7.fa")

Choosing aligner with QuasR

The aligner argument is used to control which algorithm you were going to use. Here we use Rhisat2 as it is a good algorithm for mapping reads across splice junctions.

qAlign("sampleTable.txt","chr7.fa", aligner="Rhisat2")

Time for an exercise.

Link_to_exercises

Link_to_answers