class: center, middle, inverse, title-slide # Analysis of RNAseq data in R and Bioconductor (part 1)
###
http://rockefelleruniversity.github.io/RU_RNAseq/
--- ## RNAseq RNAseq offers a method to simultaneously measure the genome wide expression of annotated and novel transcripts. <div align="center"> <img src="imgs/overview.png" alt="igv" height="400" width="500"> </div> --- ## RNAseq for differential gene expression A common goal in RNAseq is to identify genes and/or transcripts which are expressed at differing levels between our user defined sample groups. Absolute quantification of the expression level of gene is subject to measurement biases (GC bias in PCR and/or RNA capture/selection) so it is most appropriate to compare the levels of the expression of same gene across conditions than to compare differing genes' expression levels within a condition. This commonly referred to as differential gene or transcript expression (DGE/DTE). <div align="center"> <img src="imgs/dge.png" alt="igv" height="200" width="600"> </div> --- ## RNAseq for differential transcript usage A less frequent but important goal in RNAseq is to identify any changes in the relative abundance of transcripts for a gene between conditions. This may be able to identify changes in a genes' functional role between conditions i.e. one isoform of a protein may bind and activate, another may simply bind and obstruct other active forms. We call this differential transcript usage (DTU). <div align="center"> <img src="imgs/dtu.png" alt="igv" height="200" width="600"> </div> --- ## What we will cover * _Session 1: Alignment and counting_ * Session 2: Differential gene expression analysis * Session 3: Visualizing results through PCA and clustering * Session 4: Gene Set Analysis * Session 5: Differential Transcript Utilization analysis --- ## The data Over these sessions we will review data from Christina Leslie's lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR486LMB/) and activated T-Reg data [here](https://www.encodeproject.org/experiments/ENCSR726DNP/). --- ## The data Several intermediate files have already been made for you to use. You can download the course content from GitHub [here](https://github.com/RockefellerUniversity/RU_RNAseq/archive/master.zip). Once downloaded you will need to unzip the folder and then navigate into the *r_course* directory of the download. This will mean the paths in the code are correct. ```r setwd("Path/to/Download/RU_RNAseq-master") ``` --- class: inverse, center, middle # Working with raw RNAseq data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Working with raw RNAseq data FASTQ data can be directly read into R with the [ShortRead package](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) to review sequence data quality. We have covered how to work with raw sequencing data in the [**FASTQ in Bioconductor** session.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/FastQInBioconductor.html#1). Reading everything into R can take a lot of time and memory. Most times we just want to do some QC and maybe filter out low quality reads or trim. To do this in a fast or simple manner we can use the [*Rfastp*](https://bioconductor.org/packages/release/bioc/html/Rfastp.html) package. --- ## Our FASTQ file We can download the full FASTQ using this R code. ```r fq <- "https://www.encodeproject.org/files/ENCFF332KDA/@@download/ENCFF332KDA.fastq.gz" download.file(fq, "ENCFF332KDA.fastq.gz") ``` Alternatively a subset is available in the data directory: *data/ENCFF332KDA_sampled.fastq.gz* --- ## Rfastp To run QC with Rfastp, we just need to use the *rfastp()* function on our FASTQ file. We also need to provide the prefix we want for the outputs to *outputFastq*. ```r json_report <- rfastp(read1 = "data/ENCFF332KDA_sampled.fastq.gz", outputFastq = "ENCFF332KDA_rfastp") ``` --- ## Rfastp outputs Running RfastP creates several files: 1. XXX_R1.fastq.gz - FASTQ with poor quality reads filtered out 2. XXX.html - HTML file contains a QC report 3. XXX.json - JSON file with all the summary statistics ``` ## [1] "ENCFF332KDA_rfastp_R1.fastq.gz" "ENCFF332KDA_rfastp.html" ## [3] "ENCFF332KDA_rfastp.json" ``` --- ## Rfastp outputs in R The object saved in R is in json format. There are several functions that allow you to extract the information out into a more readable format such as the **qcSummary()** function. This details the Before/After stats. ```r qcSummary(json_report) ``` ``` ## Before_QC After_QC ## total_reads 1.000000e+06 9.967650e+05 ## total_bases 5.000000e+07 4.983825e+07 ## q20_bases 4.870494e+07 4.864858e+07 ## q30_bases 4.740574e+07 4.739273e+07 ## q20_rate 9.740990e-01 9.761290e-01 ## q30_rate 9.481150e-01 9.509310e-01 ## read1_mean_length 5.000000e+01 5.000000e+01 ## gc_content 4.823030e-01 4.821960e-01 ``` --- ## Rfastp - Base Quality The curvePlot function can also be used to plot specific QC aspects like base quality and GC content. ```r curvePlot(json_report) ``` ![](RU_RNAseq_p1_files/figure-html/unnamed-chunk-8-1.png)<!-- --> ## Rfastp - GC content ```r curvePlot(json_report, curve = "content_curves") ``` ![](RU_RNAseq_p1_files/figure-html/unnamed-chunk-9-1.png)<!-- --> --- ## Rfastp, trimming and more There are a lot of options for the *rfastp()* function that allow you to customize filtering, trimming and more. ```r `?`(rfastp) ``` --- class: inverse, center, middle # Aligning RNAseq data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Aligning RNAseq reads Following assessment of read quality, we will want to align our reads to the genome taking into account splice junctions. Not all RNAseq reads will align continuously against our reference genome. Instead they will map across splice junctions, so we need to use [splice aware aligners, that we have seen in previous sessions.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#6) The resulting BAM file will contain aligned sequence reads for use in further analysis. <div align="center"> <img src="imgs/sam2.png" alt="igv" height="200" width="600"> </div> --- ## Creating a reference genome First we need to retrieve the sequence information for the genome of interest in [FASTA format.](https://rockefelleruniversity.github.io/Genomic_Data/presentations/slides/GenomicsData.html#9) We can use the [BSgenome libraries to retrieve the full sequence information.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#4) For the mouse mm10 genome we load the package **BSgenome.Mmusculus.UCSC.mm10**. ```r library(BSgenome.Mmusculus.UCSC.mm10) BSgenome.Mmusculus.UCSC.mm10 ``` ``` ## Mouse genome: ## # organism: Mus musculus (Mouse) ## # genome: mm10 ## # provider: UCSC ## # release date: Sep 2017 ## # 239 sequences: ## # chr1 chr2 chr3 ## # chr4 chr5 chr6 ## # chr7 chr8 chr9 ## # chr10 chr11 chr12 ## # chr13 chr14 chr15 ## # ... ... ... ## # chrX_KZ289094_fix chrX_KZ289095_fix chrY_JH792832_fix ## # chrY_JH792833_fix chrY_JH792834_fix chr1_KK082441_alt ## # chr11_KZ289073_alt chr11_KZ289074_alt chr11_KZ289075_alt ## # chr11_KZ289077_alt chr11_KZ289078_alt chr11_KZ289079_alt ## # chr11_KZ289080_alt chr11_KZ289081_alt ## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator ## # to access a given sequence) ``` --- ## Creating a reference genome We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a [**DNAStringSet** object from the retrieved sequences](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#22). ```r mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M")) mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]]) names(mainChrSeq) <- mainChromosomes mainChrSeqSet <- DNAStringSet(mainChrSeq) mainChrSeqSet ``` ``` ## DNAStringSet object of length 22: ## width seq names ## [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1 ## [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2 ## [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3 ## [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4 ## [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5 ## ... ... ... ## [18] 90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18 ## [19] 61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19 ## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX ## [21] 91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY ## [22] 16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM ``` --- ## Creating a reference genome Now we have a **DNAStringSet** object we can use the [**writeXStringSet** to create our FASTA file of sequences to align to.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#22) ```r writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa") ``` --- ## Creating an Rsubread index We will be aligning using the **subjunc** algorithm from the folks behind subread. We can therefore use the **Rsubread** package. Before we attempt to align our FASTQ files, we will need to first build an index from our reference genome using the **buildindex()** function. The [**buildindex()** function simply takes the parameters of our desired index name and the FASTA file to build index from.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#14) REMEMBER: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or desktop computer. ```r library(Rsubread) buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000, indexSplit = TRUE) ``` --- ## Rsubread RNAseq alignment We can align our raw sequence data in FASTQ format to the new FASTA file of our mm10 genome sequence using the **Rsubread** package. Specifically we will be using the **subjunc** function as it is splice aware. This means it will detect reads that span introns. This is the major difference between alignment of RNAseq and other genomic technologies like DNAseq and ChIPseq where we will use the align function. We can also provide a SAF or GTF to our Rsubread call. Although largely unnecessary for gene expression estimation, this will allow us to capture non-canonical splice sites. We simply need to provide a SAF/gtf to **annot.ext** parameter and set **useAnnotation** and **isGTF** to FALSE/TRUE depending on whether we use SAF or GTF as external data. --- ## SAF format using exons() SAF format is used by Rsubread to hold feature information. We simply need a table of exons' chromosome locations (chromosome,start,end,strand) and a feature/metafeature ID. We can retrieve exon locations and their gene ids using **exons()** function. We further select only exons which are annotated to exactly 1 gene. ```r library(TxDb.Mmusculus.UCSC.mm10.knownGene) myExons <- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c("tx_id", "gene_id")) myExons <- myExons[lengths(myExons$gene_id) == 1] myExons ``` ``` ## GRanges object with 376333 ranges and 2 metadata columns: ## seqnames ranges strand | tx_id ## <Rle> <IRanges> <Rle> | <IntegerList> ## [1] chr1 4807788-4807982 + | 14 ## [2] chr1 4807823-4807982 + | 15 ## [3] chr1 4807830-4807982 + | 16 ## [4] chr1 4807892-4807982 + | 17 ## [5] chr1 4807896-4807982 + | 18 ## ... ... ... ... . ... ## [376329] chrUn_JH584304 55112-55701 - | 142445,142446 ## [376330] chrUn_JH584304 56986-57151 - | 142445,142446 ## [376331] chrUn_JH584304 58564-58835 - | 142445 ## [376332] chrUn_JH584304 58564-59690 - | 142446 ## [376333] chrUn_JH584304 59592-59667 - | 142445 ## gene_id ## <CharacterList> ## [1] 18777 ## [2] 18777 ## [3] 18777 ## [4] 18777 ## [5] 18777 ## ... ... ## [376329] 66776 ## [376330] 66776 ## [376331] 66776 ## [376332] 66776 ## [376333] 66776 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` --- ## SAF format With the exons GRanges we can create the appropriate data.frame of SAF format for Rsubread. ```r dfExons <- as.data.frame(myExons) SAF <- data.frame(GeneID = dfExons$gene_id, Chr = dfExons$seqnames, Start = dfExons$start, End = dfExons$end, Strand = dfExons$strand) ``` --- ## RNA alignment with external annotation Now we have our SAF formatted exon information we can provide the SAF data.frame to the **annot.ext** argument, set **isGTF** as FALSE and set **useAnnotation** to TRUE. ```r myMapped <- subjunc("mm10_mainchrs", "ENCFF332KDA.fastq.gz", output_format = "BAM", output_file = "Treg_1.bam", useAnnotation = TRUE, annot.ext = SAF, isGTF = FALSE, nthreads = 4) ``` --- ## Sort and index reads As before, we sort and index our files using the [**Rsamtools** packages **sortBam()** and **indexBam()** functions respectively.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignedDataInBioconductor.html#11) The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R. ```r library(Rsamtools) sortBam("Treg_1.bam", "Sorted_Treg_1") indexBam("Sorted_Treg_1.bam") ``` <div align="center"> <img src="imgs/alignedData.png" alt="igv" height="200" width="600"> </div> --- class: inverse, center, middle # Counting with aligned RNAseq data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Counting in gene models With our newly aligned reads it is now possible to assign reads to genes in order to quantify a genes' expression level (as reads) within a sample. First we need to gather our gene models of exons and splice junctions which we can use in our later counting steps. ```r library(TxDb.Mmusculus.UCSC.mm10.knownGene) geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene") class(geneExons) ``` ``` ## [1] "CompressedGRangesList" ## attr(,"package") ## [1] "GenomicRanges" ``` --- ## Counting in gene models Now we have our GRangesList with each entry corresponding to a gene and within each entry a GRanges object of the genes' exons. ```r geneExons[1:2] ``` ``` ## GRangesList object of length 2: ## $`100009600` ## GRanges object with 9 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr9 21062393-21062717 - | 233853 <NA> ## [2] chr9 21062400-21062717 - | 233855 <NA> ## [3] chr9 21062894-21062987 - | 233856 <NA> ## [4] chr9 21063314-21063396 - | 233857 <NA> ## [5] chr9 21066024-21066377 - | 233858 <NA> ## [6] chr9 21066940-21067093 - | 233859 <NA> ## [7] chr9 21066940-21067925 - | 233860 <NA> ## [8] chr9 21068030-21068117 - | 233867 <NA> ## [9] chr9 21073075-21073096 - | 233869 <NA> ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ## ## $`100009609` ## GRanges object with 8 ranges and 2 metadata columns: ## seqnames ranges strand | exon_id exon_name ## <Rle> <IRanges> <Rle> | <integer> <character> ## [1] chr7 84935565-84941088 - | 190288 <NA> ## [2] chr7 84943141-84943264 - | 190289 <NA> ## [3] chr7 84943504-84943722 - | 190290 <NA> ## [4] chr7 84943504-84947000 - | 190291 <NA> ## [5] chr7 84946200-84947000 - | 190292 <NA> ## [6] chr7 84947372-84947651 - | 190293 <NA> ## [7] chr7 84948507-84949184 - | 190294 <NA> ## [8] chr7 84963816-84964115 - | 190295 <NA> ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` --- ## Counting in gene models We can now use the **summarizeOverlaps()** to [count the reads in our BAM that overlap genes](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/Summarising_Scores_In_Bioconductor.html#30). We specify a BamFile object using the **BamFile()** function and specifying the **yieldSize** parameter to 10000 to control memory footprint. The resulting **RangedSummarizedExperiment** object containing our counts and **GRanges** object. ```r library(GenomicAlignments) myBam <- BamFile("Sorted_Treg_1.bam", yieldSize = 10000) tregGeneCounts <- summarizeOverlaps(geneExons, myBam, ignore.strand = TRUE) tregGeneCounts ``` ``` ## class: RangedSummarizedExperiment ## dim: 24594 1 ## metadata(0): ## assays(1): counts ## rownames(24594): 100009600 100009609 ... 99929 99982 ## rowData names(0): ## colnames(1): Sorted_Treg_1.bam ## colData names(0): ``` --- ## Counting in exons models If want to start to evaluate differential transcript usage we will often evaluate counts over exons and assess for changes in exon usage. <div align="center"> <img src="imgs/dtu.png" alt="igv" height="300" width="600"> </div> --- ## Counting in exons models To count over exons however we will encounter an issue of assigning reads to overlapping exons. <div align="center"> <img src="imgs/overExons.png" alt="igv" height="300" width="600"> </div> --- ## Counting in exons models We can then work to differentiate exons by collapsing exons to the nonoverlapping, disjoint regions. <div align="center"> <img src="imgs/overExons_Disjoint.png" alt="igv" height="300" width="600"> </div> --- ## Counting in exons models We can use the GenomicFeatures' package's **disjointExons()** function with our TxDb object, TxDb.Mmusculus.UCSC.mm10.knownGene, to extract a GRanges of our disjoint exons with their associated **gene_id**, **tx_name**, **exonic_part** in the metadata columns. We add some sensible names to our GRanges for use in later steps. ```r library(GenomicFeatures) nonOverlappingExons <- disjointExons(TxDb.Mmusculus.UCSC.mm10.knownGene) ``` ``` ## Warning: disjointExons() is deprecated. Please use exonicParts() instead. ``` ```r names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part, sep = "_") nonOverlappingExons[1:3, ] ``` ``` ## GRanges object with 3 ranges and 3 metadata columns: ## seqnames ranges strand | gene_id ## <Rle> <IRanges> <Rle> | <CharacterList> ## 100009600_1 chr9 21062393-21062399 - | 100009600 ## 100009600_2 chr9 21062400-21062717 - | 100009600 ## 100009600_3 chr9 21062894-21062987 - | 100009600 ## tx_name ## <CharacterList> ## 100009600_1 ENSMUST00000115494.2,ENSMUST00000216967.1 ## 100009600_2 ENSMUST00000115494.2,ENSMUST00000216967.1,ENSMUST00000213826.1 ## 100009600_3 ENSMUST00000115494.2,ENSMUST00000213826.1 ## exonic_part ## <integer> ## 100009600_1 1 ## 100009600_2 2 ## 100009600_3 3 ## ------- ## seqinfo: 66 sequences (1 circular) from mm10 genome ``` --- ## Counting in exons models We can now again use the **summarizeOverlaps()** function with our nonOverlapping, disjoint exons to identify a BAM file of interest. We need to set the **inter.feature** to FALSE to allow us to count reads overlapping multiple features (such as a read spanning between two exons). ```r tregExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE, inter.feature = FALSE) ``` --- ## Counting in exons models Now we can review our counts from our gene-level and exon-level counting. We retrieve a matrix of counts from either **RangedSummarizedExperiment** object using the **assay()** function. ```r geneCounts <- assay(tregGeneCounts) exonCounts <- assay(tregExonCounts) head(exonCounts, 2) ``` ``` ## Sorted_Treg_1.bam ## 100009600_1 0 ## 100009600_2 1 ``` ```r head(geneCounts, 2) ``` ``` ## Sorted_Treg_1.bam ## 100009600 1 ## 100009609 0 ``` --- class: inverse, center, middle # Transcript quantification with pseudo-alignment <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Transcript quantification More recently methods have developed to directly quantify transcript abundance from FASTQ files using k-mer counting. k-mer counting offers a super fast method of gaining transcript abundance estimates, but does not produce a genomic alignment so not so useful for visualization. As we are most often interested in gene expression changes this offers a fast alternative to counting for alignment and counting. The two leading pieces of software for k-mer counting * [Salmon](http://salmon.readthedocs.io/en/latest/) - Relationship with DESeq2. * [Kallisto](https://pachterlab.github.io/kallisto/about) - History of splice aware aligners and transcript quantification. --- ## Salmon Salmon is the latest in a set of k-mer counting tools from the Kingsford lab (Jellyfish and Sailfish) and offers a workflow for transcript quantification from FASTQ raw sequencing data and a FASTA file of transcript sequences. ![](imgs/salmon.png) --- ## Installing programs that aren't in R There is no R package for Salmon. It is possible to install it on Mac and Linux using the [Anaconda](https://anaconda.org/bioconda/salmon) package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them. Though there is no Salmon R package, we can interact with the anaconda package system using the R package [Herper](https://bioconductor.org/packages/release/bioc/html/Herper.html). This was created by us here at the BRC and is part of Bioconductor. ```r BiocManager::install("Herper") library(Herper) ``` --- ## Install Salmon with Herper First, we will use Herper to install Salmon with the *install_CondaTools* function. We just need to tell *install_CondaTools* what tool/s you want and the name of the environment you want to build. ```r salmon_paths <- install_CondaTools(tools = "salmon", env = "RNAseq_analysis") salmon_paths ``` Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which Salmon will be installed. When you run the function it prints out where Salmon is installed. There are additional arguments to control where miniconda is installed using *pathToMiniConda* and also update an existing environment with *updateEnv*. --- ## Reference for transcript quantification Once Salmon is installed we need to make a Salmon index. First we need a FASTA file of transcript sequences. We can use the GenomicFeatures' packages **extractTranscriptSeqs** function to retrieve a DNAstringSet object of transcript sequences using our sequence data from **BSgenome.Mmusculus.UCSC.mm10** and gene models from **TxDb.Mmusculus.UCSC.mm10.knownGene** object. ```r allTxSeq <- extractTranscriptSeqs(BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene, use.names = TRUE) allTxSeq ``` --- ## Reference for transcript quantification With our new DNAstringSet object of transcript sequences we can write a FASTA file for use in Salmon with the **writeXStringSet** function. ```r writeXStringSet(allTxSeq, "mm10Trans.fa") ``` --- ## Reference with decoy sequences We can also provide a set of decoy sequences for building an index. This allows salmon to consider similar sequences outside of transcriptomic regions when mapping. To do this we need to add the main chromosomes to our transcriptome FASTA for creating an index. ```r mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M")) mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]]) names(mainChrSeq) <- mainChromosomes mainChrSeqSet <- DNAStringSet(mainChrSeq) gentrome <- c(allTxSeq, mainChrSeqSet) ``` --- ## Reference with dummy sequence We will also need to provide a config file containing the names of sequences to be used for decoys. ```r writeXStringSet(gentrome, "mm10Gentrome.fa") write.table(as.data.frame(mainChromosomes), "decoy.txt", row.names = FALSE, col.names = FALSE, quote = FALSE) ``` --- ## Salmon index from R As with standard aligners, the first part of our quantification requires us to make an index from our FASTA file using the **Salmon index** command. We specify the transcript FASTA file (-t) and index name (-i). Here we arrange our Salmon command into a system call from R. This means we are directly triggering an OS command from within R. ```{} salmon index -i mm10Trans -t mm10Trans.fa ``` Herper allows us to run conda packages from within R. Salmon has been installed into the environment *RNAseq_analysis*. So we can use this environment from R using *with_CondaEnv()*. ```r fastaTx <- "mm10Trans.fa" indexName <- "mm10Trans" with_CondaEnv("RNAseq_analysis", system2(command="salmon",args = c("index", "-i",indexName, "-t",fastaTx), stdout = TRUE)) ``` --- ## Salmon index from R To create index with decoys we provide our transcriptome + chromosome FASTA as well as our decoy file to the -d paramter. ```r with_CondaEnv("RNAseq_analysis", system2(command="salmon",args = c("index", "-i",indexName, "-t",fastaTx, "-d decoy.txt"), stdout = TRUE)) ``` --- ## Salmon quant from R To quantify transcript abundance we use the **Salmon quant** command. We specify the index location (-i), the reads to quantify (-r), the output directory (-o) and the library type as automatic detection of type (-l A). ```{} salmon quant -i mm10Trans -r ~/Downloads/ENCFF332KDA.fastq.gz -o TReg_1_Quant -l A ``` ```r fq <- "ENCFF332KDA.fastq.gz" outDir <- "TReg_1_Quant" with_CondaEnv("RNAseq_analysis", system2(command="salmon",args = c("quant", "-i",indexName, "-o",outDir, "-l A", "-r",fq), stdout = TRUE)) ``` --- ## Salmon outputs The output from Salmon is our quantification of transcripts in the **quant.sf** file in the user defined output directory. We will use a new package **tximport** to handle these counts in next session, but for now we can review file using standard data import R libraries. ```r myQuant <- read.delim("TReg_1_Quant/quant.sf") myQuant[1:3, ] ``` ``` ## Name Length EffectiveLength TPM NumReads ## 1 ENSMUST00000193812.1 1070 820.000 0.0000 0 ## 2 ENSMUST00000082908.1 110 3.749 0.0000 0 ## 3 ENSMUST00000192857.1 480 230.000 0.8368 7 ``` --- ## Time for an exercise [Link_to_exercises](../../exercises/exercises/RNAseq_part1_exercise.html) [Link_to_answers](../../exercises/answers/RNAseq_part1_answers.html)