class: center, middle, inverse, title-slide # Alignment In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- class: inverse, center, middle # Alignment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Alignment - Fasta and fastq In previous sessions we have seen how to handle unaligned sequence data as well as where and how to retrieve genome reference data in Fasta format. .pull-left[ <div align="center"> <img src="imgs/newFASTA.png" alt="igv" height="400" width="800"> </div> ] .pull-right[ <div align="center"> <img src="imgs/fq1.png" alt="igv" height="150" width="400"> </div> ] --- ## Aligned data - SAM/BAM We will want to produce a SAM/BAM file containing aligned reads in a format we can visualize in IGV and work with in our downstream analysis. <div align="center"> <img src="imgs/sam2.png" alt="igv" height="300" width="900"> </div> --- ## Alignment softwares We now need a suitable **aligner** software to place unaligned reads onto our reference genome to produce a SAM/BAM file. Aligner softwares can be broadly placed into two categories. * Genomic aligner (WGS, ChIP-seq, ATAC-seq etc). * Splice Aware aligner (RNA-seq, Ribo-seq). Popular genomic aligners include **Bowtie**, **bwa**, **subread**,**GMAP**. Popular splice aware aligners include **hisat2**,**Tophat**, **SpliceMap**, **subjunc**, **GSNAP**, **STAR**. --- ## Splice alignment A splice aware aligner is important for analysis of RNAseq where mRNA's introns are spliced to stitch exons into continuous sequence. <div align="center"> <img src="imgs/splice.png" alt="igv" height="350" width="700"> </div> --- ## Alignment software in R A few of the popular aligners are wrapped up in R/Bioconductor packages allowing us to use our **aligner** software from R as well as make use of some of the R/Bioconductor objects we are growing to love. * Bowtie - **Rbowtie**, **Rbowtie2**, **QuasR** * GSNAP/GMAP - **gmapR** * SpliceMap - **QuasR** * subread/subjunc - **rsubread** * hisat2 - **Rhisat2**, **QuasR** --- ## Alignment software in R The **gampR** packages offer convenient access to gmap/gsnap on Mac and Linux but sadly are not implemented on Windows. The **QuasR** package offers an interface to Bowtie and SpliceMap on Windows, Mac and Linux and so provides access to a genomic and splice aware aligner on all systems. The **Rbowtie2** package offers a wrapper to the popular Bowtie2 software and offers significant improvement in memory and CPU usage to Bowtie packaged in QuasR. **rsubread** offers convenient access to subread/subjunc, and recently got expanded from Mac and Linux only, to also include Windows. We recommend rsubread for analysis. --- ## Data In this session we will be making use of some public datasets from the Encode consortium. We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment. This RNAseq data has been generated from the human cell line **GM12878** and the link to experiment can be found [here](https://www.encodeproject.org/experiments/ENCSR297UBP/) or a direct link to FastQ for replicate 2 we are using can be found [here.](https://www.encodeproject.org/files/ENCFF000CXH/@@download/ENCFF000CXH.fastq.gz) --- ## Data To speed up the processing for this practical I have retrieved aligned data from Encode for the sample **ENCSR297UBP** and extracted reads mapping to ActB gene on hg20 human genome build. I have further downsampled these reads to only 10000 reads out of the 200000 mapping to this gene. This sampled data can be found in **data/sampledActin.fq.gz** --- ## The Rsubread package The Rsubread package offers a fast aligner for both genomic and splice aware alignment. Updated refrequently - [Latest paper here](https://academic.oup.com/nar/article/47/8/e47/5345150) ```r BiocManager::install('Rsubread') library(Rsubread) ``` --- ## BSGenome packages We will also require some reference data so lets install the BSgenome package for hg38. ```r BiocManager::install("BSgenome.Hsapiens.UCSC.hg38") library(BSgenome.Hsapiens.UCSC.hg38) ``` --- ## Write out FASTA To build an index for mapping we need our reference sequence in FASTA format. Lets extract a FASTA file from out BSgenome object. Here we will create a FASTA file for just Chr7 (the location of ActB) ```r chr7hg38 <- BSgenome.Hsapiens.UCSC.hg38[["chr7"]] chr7hg38Set <- DNAStringSet(list(chr7=chr7hg38)) writeXStringSet(chr7hg38Set,file="chr7.fa") ``` --- ## Building an index The Rsubread package requires we first build an index of our reference genome. This is what we will align our Fastq files to. Here we use the **buildindex()** function, specifying the name of index to be built and the FASTA file to build the index from. By default the memory (RAM) used when building an index is limited to 8GB. This is often too large for your laptop. You can dial this down using the memory argument. This may take some time but you will only need to run this index step once. ```r buildindex("chr7","chr7.fa", memory=8000) ``` --- ## Aligning We can now use the **align()** function to align reads to the genome. We must specify out index name, fastq file, desired output format and name of BAM file to be created. ```r align("chr7","data/sampledActin.fq.gz", output_format = "BAM", output_file = "data/Rsubread_NoSplicing_sampledActin.bam") ``` --- ## Aligned data in BAM Now we have our aligned data as BAM formats we must perform two final operations on our BAM to make it ready for use in external programs. These are: * Sorting (here by coordinate) * Indexing --- ## Rsamtools The **samtools** software provide command line tools to handle SAM and BAM files such as indexing and sorting. The **Rsamtools** package allows us to make us of the samtools functions within R. First we can install and load the library. ```r BiocManager::install("Rsamtools") library(Rsamtools) ``` --- ## Rsamtools sorting We can use the **Rsamtools** function **sortBam()** to sort our BAM file. The **sortBam()** function take as arguments the path to BAM file to sort and the **prefix** of sorted BAM output. Note the **prefix** should not contain the *.bam* extension. ```r sortBam("data/Rsubread_NoSplicing_sampledActin.bam","SortedActB") ``` ``` ## [1] "SortedActB.bam" ``` --- ## Rsamtools sorting After sorting, we can now index our sorted BAM file using the **indexBAM()** function. ```r indexBam("SortedActB.bam") ``` ``` ## SortedActB.bam ## "SortedActB.bam.bai" ``` --- ## Rsamtools BAM overview We can get an overview of BAM file information using the **quickBamFlagSummary()** function. ```r quickBamFlagSummary("SortedActB.bam") ## group | nb of | nb of | mean / max ## of | records | unique | records per ## records | in group | QNAMEs | unique QNAME ## All records........................ A | 10000 | 10000 | 1 / 1 ## o template has single segment.... S | 10000 | 10000 | 1 / 1 ## o template has multiple segments. M | 0 | 0 | NA / NA ## - first segment.............. F | 0 | 0 | NA / NA ## - last segment............... L | 0 | 0 | NA / NA ## - other segment.............. O | 0 | 0 | NA / NA ## ## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M. ## Indentation reflects this. ## ## Details for group S: ## o record is mapped.............. S1 | 9748 | 9748 | 1 / 1 ## - primary alignment......... S2 | 9748 | 9748 | 1 / 1 ## - secondary alignment....... S3 | 0 | 0 | NA / NA ## o record is unmapped............ S4 | 252 | 252 | 1 / 1 ``` --- ## Review in IGV We can now review our BAM file in IGV. <div align="center"> <img src="imgs/alnActB.png" alt="igv" height="400" width="600"> </div> --- ## Review in IGV But if we look for splice junctions we see we are missing some information. <div align="center"> <img src="imgs/alnActBnojunc.png" alt="igv" height="400" width="600"> </div> --- ## Splice aware aligning We are missing reads which would align across splice junctions. We can now use the **subjunc()** function to align reads in a splice aware manner. We must again specify out index name, fastq file, desired output format and name of BAM file to be created. ```r subjunc("chr7","data/sampledActin.fq.gz", output_format = "BAM", output_file = "data/RsubreadsampledActin.bam") ``` --- ## Sort and index our file We can now sort and index our splice aware alignment as before. ```r sortBam("data/RsubreadsampledActin.bam", "SortedActBSpliced") ``` ``` ## [1] "SortedActBSpliced.bam" ``` ```r indexBam("SortedActBSpliced.bam") ``` ``` ## SortedActBSpliced.bam ## "SortedActBSpliced.bam.bai" ``` --- ## Rsamtools BAM overview We can get an overview of BAM file information using the **quickBamFlagSummary()** function. ```r quickBamFlagSummary("SortedActBSpliced.bam") ## group | nb of | nb of | mean / max ## of | records | unique | records per ## records | in group | QNAMEs | unique QNAME ## All records........................ A | 10000 | 10000 | 1 / 1 ## o template has single segment.... S | 10000 | 10000 | 1 / 1 ## o template has multiple segments. M | 0 | 0 | NA / NA ## - first segment.............. F | 0 | 0 | NA / NA ## - last segment............... L | 0 | 0 | NA / NA ## - other segment.............. O | 0 | 0 | NA / NA ## ## Note that (S, M) is a partitioning of A, and (F, L, O) is a partitioning of M. ## Indentation reflects this. ## ## Details for group S: ## o record is mapped.............. S1 | 9756 | 9756 | 1 / 1 ## - primary alignment......... S2 | 9756 | 9756 | 1 / 1 ## - secondary alignment....... S3 | 0 | 0 | NA / NA ## o record is unmapped............ S4 | 244 | 244 | 1 / 1 ``` --- ## Review in IGV We can now review our splice aware BAM file in IGV. <div align="center"> <img src="imgs/2020_alnActjunc.png" alt="igv" height="400" width="600"> </div> --- ## Review in IGV Now if we compare to the original alignment from Encode we can identify where some of our unaligned reads may have gone. <div align="center"> <img src="imgs/originalVsNew.png" alt="igv" height="400" width="600"> </div> --- ## Review in IGV The default parameters for alignment within aligners are carefully selected for alignment with ChIP-seq, ATAC-seq, WGS and splice aware alignment in RNA-seq. We will review this performance over the coming sessions. An example here shows alignment with STAR, which often allows for very long splicing. Here you can see reads map over the neighbouring gene to an unrelated region. <div align="center"> <img src="imgs/starCrazy.png" alt="igv" height="200" width="300"> </div> --- ## The Rbowtie2 package The **Rbowtie2** library offers an alternative genomic alignment in R using Bowtie2. ```r BiocManager::install("Rbowtie2") library(Rbowtie2) ``` --- ## The Rbowtie2 package As we saw with **rsubread** we first need to build an index to align too. We can build the index using the **bowtie2_build** function supplying the path to Fasta file to **references** argument and path for bowtie2 index to **bt2Index** command. Again, this may take some time but you will only need to run this index step once. ```r bowtie2_build(references="chr7.fa", bt2Index=file.path("chr7hg38")) ``` --- ## The Rbowtie2 package Now we have created an index to align to, we can align our fastq data to this index. First however we will need to decompress our compressed fastq files to use them with Rbowtie2. The **gunzip** function in R allows us to decompress file from R. ```r library(R.utils) gunzip("data/sampledActin.fq.gz") ``` --- ## The Rbowtie2 package We can align our data using the main **bowtie2** function and specifying the index path to **bt2Index** argument, the output path for sam to **samOutput** argument and the sequence we wish to align to the **seq1** path. ```r bowtie2(bt2Index = "chr7hg38", samOutput = "sampledActin.sam", seq1 = "data/sampledActin.fq") ``` --- ## The Rbowtie2 package The **bowtie2** function outputs a SAM file. We will want to produce a BAM file we can sort and index from this SAM file using the **asBam** function. ```r bamFile_Bowtie2 <- asBam("sampledActin.sam") bamFile_Bowtie2 ``` --- ## The Rbowtie2 package We can then sort and index our BAM file to ensure we are ready for IGV and downstream analysis. ```r sortBam(bamFile_Bowtie2,"SortedActBSpliced_bowtie") indexBam("SortedActBSpliced_bowtie.bam") ``` --- ## The Rbowtie2 package Aligners have a variety of costs/benefits. **Rbowtie2** is a significant improvement the original **Bowtie**, but it is not splice aware alignment. It only allows for genomic alignment. This means that **Rbowtie2** is suitable for - ATAC-seq - ChIP-seq - WGS **Rbowtie2** is NOT suitable for - RNA-seq - Ribo-seq --- ## QuasR Some wrappers have multiple aligners built in to them. This makes it easy to switch between options. * Bowtie - **QuasR** * SpliceMap - **QuasR** * hisat2 - **QuasR** --- ## QuasR ```r BiocManager::install("QuasR") library(QuasR) ``` --- ## QuasR genomic alignment The main function for alignment in the QuasR package is the **qAlign()** function. The **qAlign()** function requires just two arguments. * A Samples file - Tab-delimited file containing fastq location and sample names. * Reference genome - Character string of **BSgenome** object or FASTA file location. --- ## QuasR Sample table The sample table requires is a tab-delimited file listing the path to fastq(s) to be aligned and the desired sample names. ```r FileName <- "data/sampledActin.fq.gz" SampleName <- "sampledActin" sampleTable <- data.frame(FileName,SampleName) write.table(sampleTable,file="sampleTable.txt",sep="\t",quote=FALSE,row.names = FALSE) sampleTable ``` ``` ## FileName SampleName ## 1 data/sampledActin.fq.gz sampledActin ``` --- ## QuasR genomic with BSGenome object The simplest way to specify a reference genome for alignment in **qAlign()** is to use a BSgenome object. Here we can simply specify the name of BSgenome object we wish to use for alignment. Here we specify the BSgenome object for hg38 **BSgenome.Hsapiens.UCSC.hg38**. ```r library(QuasR) qAlign("sampleTable.txt", "BSgenome.Hsapiens.UCSC.hg38") ``` --- ## QuasR genomic with FASTA file Internally, QuasR will create a FASTA file from our BSgenome object prior to alignment. We can also provide a FASTA file directly to the **qAlign()** function. ```r qAlign("sampleTable.txt","chr7.fa") ``` --- ## Choosing aligner with QuasR The aligner argument is used to control which algorithm you were going to use. Here we use Rhisat2 as it is a good algorithm for mapping reads across splice junctions. ```r qAlign("sampleTable.txt","chr7.fa", aligner="Rhisat2") ``` --- ## Time for an exercise. [Link_to_exercises](../../exercises/exercises/alignment_exercise.html) [Link_to_answers](../../exercises/answers/alignment_answers.html)