RNAseq (part 1)


RNAseq

RNAseq offers a method to simultaneously measure the genome wide expression of annotated and novel transcripts.

igv

RNAseq for differential gene expression

A common goal in RNAseq is to identify genes and/or transcripts which are expressed at differing levels between our user defined sample groups.

Absolute quantification of the expression level of gene is subject to measurement biases (GC bias in PCR and/or RNA capture/selection) so it is most appropriate to compare the levels of the expression of same gene across conditions than to compare differing genes’ expression levels within a condition.

This commonly referred to as differential gene or transcript expression (DGE/DTE).

igv

RNAseq for differential transcript usage

A less frequent but important goal in RNAseq is to identify any changes in the relative abundance of transcripts for a gene between conditions. This may be able to identify changes in a genes’ functional role between conditions i.e. one isoform of a protein may bind and activate, another may simply bind and obstruct other active forms. We call this differential transcript usage (DTU).

igv

What we will cover

  • Session 1: Alignment and counting

  • Session 2: Differential gene expression analysis

  • Session 3: Visualizing results through PCA and clustering

  • Session 4: Gene Set Analysis

  • Session 5: Differential Transcript Utilization analysis

The data

Over these sessions we will review data from Christina Leslie’s lab at MSKCC on T-Reg cells. This can be found on the Encode portal. T-Reg data here and activated T-Reg data here.

The data

Several intermediate files have already been made for you to use. You can download the course content from GitHub here. Once downloaded you will need to unzip the folder and then navigate into the r_course directory of the download. This will mean the paths in the code are correct.

setwd("Path/to/Download/RU_RNAseq-master")

Working with raw RNAseq data


Working with raw RNAseq data

FASTQ data can be directly read into R with the ShortRead package to review sequence data quality. We have covered how to work with raw sequencing data in the FASTQ in Bioconductor session..

Reading everything into R can take a lot of time and memory. Most times we just want to do some QC and maybe filter out low quality reads or trim. To do this in a fast or simple manner we can use the Rfastp package.

Our FASTQ file

We can download the full FASTQ using this R code.

fq <- "https://www.encodeproject.org/files/ENCFF332KDA/@@download/ENCFF332KDA.fastq.gz"
download.file(fq, "ENCFF332KDA.fastq.gz")

Alternatively a subset is available in the data directory: data/ENCFF332KDA_sampled.fastq.gz

Rfastp

To run QC with Rfastp, we just need to use the rfastp() function on our FASTQ file. We also need to provide the prefix we want for the outputs to outputFastq.

json_report <- rfastp(read1 = "data/ENCFF332KDA_sampled.fastq.gz", outputFastq = "ENCFF332KDA_rfastp")

Rfastp outputs

Running RfastP creates several files:

  1. XXX_R1.fastq.gz - FASTQ with poor quality reads filtered out
  2. XXX.html - HTML file contains a QC report
  3. XXX.json - JSON file with all the summary statistics
## [1] "ENCFF332KDA_rfastp_R1.fastq.gz" "ENCFF332KDA_rfastp.html"       
## [3] "ENCFF332KDA_rfastp.json"

Rfastp outputs in R

The object saved in R is in json format. There are several functions that allow you to extract the information out into a more readable format such as the qcSummary() function. This details the Before/After stats.

qcSummary(json_report)
##                      Before_QC     After_QC
## total_reads       1.000000e+06 9.967650e+05
## total_bases       5.000000e+07 4.983825e+07
## q20_bases         4.870494e+07 4.864858e+07
## q30_bases         4.740574e+07 4.739273e+07
## q20_rate          9.740990e-01 9.761290e-01
## q30_rate          9.481150e-01 9.509310e-01
## read1_mean_length 5.000000e+01 5.000000e+01
## gc_content        4.823030e-01 4.821960e-01

Rfastp - Base Quality

The curvePlot function can also be used to plot specific QC aspects like base quality and GC content.

curvePlot(json_report)

Rfastp - GC content

curvePlot(json_report, curve = "content_curves")

Rfastp, trimming and more

There are a lot of options for the rfastp() function that allow you to customize filtering, trimming and more.

`?`(rfastp)

Aligning RNAseq data


Aligning RNAseq reads

Following assessment of read quality, we will want to align our reads to the genome taking into account splice junctions.

Not all RNAseq reads will align continuously against our reference genome. Instead they will map across splice junctions, so we need to use splice aware aligners, that we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.

igv

Creating a reference genome

First we need to retrieve the sequence information for the genome of interest in FASTA format.

We can use the BSgenome libraries to retrieve the full sequence information.

For the mouse mm10 genome we load the package BSgenome.Mmusculus.UCSC.mm10.

library(BSgenome.Mmusculus.UCSC.mm10)
BSgenome.Mmusculus.UCSC.mm10
## Mouse genome:
## # organism: Mus musculus (Mouse)
## # genome: mm10
## # provider: UCSC
## # release date: Sep 2017
## # 239 sequences:
## #   chr1                 chr2                 chr3                
## #   chr4                 chr5                 chr6                
## #   chr7                 chr8                 chr9                
## #   chr10                chr11                chr12               
## #   chr13                chr14                chr15               
## #   ...                  ...                  ...                 
## #   chrX_KZ289094_fix    chrX_KZ289095_fix    chrY_JH792832_fix   
## #   chrY_JH792833_fix    chrY_JH792834_fix    chr1_KK082441_alt   
## #   chr11_KZ289073_alt   chr11_KZ289074_alt   chr11_KZ289075_alt  
## #   chr11_KZ289077_alt   chr11_KZ289078_alt   chr11_KZ289079_alt  
## #   chr11_KZ289080_alt   chr11_KZ289081_alt                       
## # (use 'seqnames()' to see all the sequence names, use the '$' or '[[' operator
## # to access a given sequence)

Creating a reference genome

We will only use the major chromosomes for our analysis so we may exclude random and unplaced contigs. Here we cycle through the major chromosomes and create a DNAStringSet object from the retrieved sequences.

mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M"))
mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
mainChrSeqSet
## DNAStringSet object of length 22:
##          width seq                                          names               
##  [1] 195471971 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr1
##  [2] 182113224 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr2
##  [3] 160039680 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr3
##  [4] 156508116 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr4
##  [5] 151834684 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr5
##  ...       ... ...
## [18]  90702639 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr18
## [19]  61431566 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chr19
## [20] 171031299 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrX
## [21]  91744698 NNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNN chrY
## [22]     16299 GTTAATGTAGCTTAATAACAA...TACGCAATAAACATTAACAA chrM

Creating a reference genome

Now we have a DNAStringSet object we can use the writeXStringSet to create our FASTA file of sequences to align to.

writeXStringSet(mainChrSeqSet, "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa")

Creating an Rsubread index

We will be aligning using the subjunc algorithm from the folks behind subread. We can therefore use the Rsubread package. Before we attempt to align our FASTQ files, we will need to first build an index from our reference genome using the buildindex() function.

The buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from.

REMEMBER: Building an index is memory intensive and by default is set to 8GB. This may be too large for your laptop or desktop computer.

library(Rsubread)
buildindex("mm10_mainchrs", "BSgenome.Mmusculus.UCSC.mm10.mainChrs.fa", memory = 8000,
    indexSplit = TRUE)

Rsubread RNAseq alignment

We can align our raw sequence data in FASTQ format to the new FASTA file of our mm10 genome sequence using the Rsubread package. Specifically we will be using the subjunc function as it is splice aware. This means it will detect reads that span introns. This is the major difference between alignment of RNAseq and other genomic technologies like DNAseq and ChIPseq where we will use the align function.

We can also provide a SAF or GTF to our Rsubread call. Although largely unnecessary for gene expression estimation, this will allow us to capture non-canonical splice sites.

We simply need to provide a SAF/gtf to annot.ext parameter and set useAnnotation and isGTF to FALSE/TRUE depending on whether we use SAF or GTF as external data.

SAF format using exons()

SAF format is used by Rsubread to hold feature information. We simply need a table of exons’ chromosome locations (chromosome,start,end,strand) and a feature/metafeature ID.

We can retrieve exon locations and their gene ids using exons() function. We further select only exons which are annotated to exactly 1 gene.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
myExons <- exons(TxDb.Mmusculus.UCSC.mm10.knownGene, columns = c("tx_id", "gene_id"))
myExons <- myExons[lengths(myExons$gene_id) == 1]
myExons
## GRanges object with 376333 ranges and 2 metadata columns:
##                  seqnames          ranges strand |         tx_id
##                     <Rle>       <IRanges>  <Rle> | <IntegerList>
##        [1]           chr1 4807788-4807982      + |            14
##        [2]           chr1 4807823-4807982      + |            15
##        [3]           chr1 4807830-4807982      + |            16
##        [4]           chr1 4807892-4807982      + |            17
##        [5]           chr1 4807896-4807982      + |            18
##        ...            ...             ...    ... .           ...
##   [376329] chrUn_JH584304     55112-55701      - | 142445,142446
##   [376330] chrUn_JH584304     56986-57151      - | 142445,142446
##   [376331] chrUn_JH584304     58564-58835      - |        142445
##   [376332] chrUn_JH584304     58564-59690      - |        142446
##   [376333] chrUn_JH584304     59592-59667      - |        142445
##                    gene_id
##            <CharacterList>
##        [1]           18777
##        [2]           18777
##        [3]           18777
##        [4]           18777
##        [5]           18777
##        ...             ...
##   [376329]           66776
##   [376330]           66776
##   [376331]           66776
##   [376332]           66776
##   [376333]           66776
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

SAF format

With the exons GRanges we can create the appropriate data.frame of SAF format for Rsubread.

dfExons <- as.data.frame(myExons)
SAF <- data.frame(GeneID = dfExons$gene_id, Chr = dfExons$seqnames, Start = dfExons$start,
    End = dfExons$end, Strand = dfExons$strand)

RNA alignment with external annotation

Now we have our SAF formatted exon information we can provide the SAF data.frame to the annot.ext argument, set isGTF as FALSE and set useAnnotation to TRUE.

myMapped <- subjunc("mm10_mainchrs", "ENCFF332KDA.fastq.gz", output_format = "BAM",
    output_file = "Treg_1.bam", useAnnotation = TRUE, annot.ext = SAF, isGTF = FALSE,
    nthreads = 4)

Sort and index reads

As before, we sort and index our files using the Rsamtools packages sortBam() and indexBam() functions respectively.

The resulting sorted and indexed BAM file is now ready for use in external programs such as IGV as well as for further downstream analysis in R.

library(Rsamtools)
sortBam("Treg_1.bam", "Sorted_Treg_1")
indexBam("Sorted_Treg_1.bam")

igv

Counting with aligned RNAseq data


Counting in gene models

With our newly aligned reads it is now possible to assign reads to genes in order to quantify a genes’ expression level (as reads) within a sample.

First we need to gather our gene models of exons and splice junctions which we can use in our later counting steps.

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
geneExons <- exonsBy(TxDb.Mmusculus.UCSC.mm10.knownGene, by = "gene")
class(geneExons)
## [1] "CompressedGRangesList"
## attr(,"package")
## [1] "GenomicRanges"

Counting in gene models

Now we have our GRangesList with each entry corresponding to a gene and within each entry a GRanges object of the genes’ exons.

geneExons[1:2]
## GRangesList object of length 2:
## $`100009600`
## GRanges object with 9 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr9 21062393-21062717      - |    233853        <NA>
##   [2]     chr9 21062400-21062717      - |    233855        <NA>
##   [3]     chr9 21062894-21062987      - |    233856        <NA>
##   [4]     chr9 21063314-21063396      - |    233857        <NA>
##   [5]     chr9 21066024-21066377      - |    233858        <NA>
##   [6]     chr9 21066940-21067093      - |    233859        <NA>
##   [7]     chr9 21066940-21067925      - |    233860        <NA>
##   [8]     chr9 21068030-21068117      - |    233867        <NA>
##   [9]     chr9 21073075-21073096      - |    233869        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome
## 
## $`100009609`
## GRanges object with 8 ranges and 2 metadata columns:
##       seqnames            ranges strand |   exon_id   exon_name
##          <Rle>         <IRanges>  <Rle> | <integer> <character>
##   [1]     chr7 84935565-84941088      - |    190288        <NA>
##   [2]     chr7 84943141-84943264      - |    190289        <NA>
##   [3]     chr7 84943504-84943722      - |    190290        <NA>
##   [4]     chr7 84943504-84947000      - |    190291        <NA>
##   [5]     chr7 84946200-84947000      - |    190292        <NA>
##   [6]     chr7 84947372-84947651      - |    190293        <NA>
##   [7]     chr7 84948507-84949184      - |    190294        <NA>
##   [8]     chr7 84963816-84964115      - |    190295        <NA>
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in gene models

We can now use the summarizeOverlaps() to count the reads in our BAM that overlap genes. We specify a BamFile object using the BamFile() function and specifying the yieldSize parameter to 10000 to control memory footprint.

The resulting RangedSummarizedExperiment object containing our counts and GRanges object.

library(GenomicAlignments)
myBam <- BamFile("Sorted_Treg_1.bam", yieldSize = 10000)
tregGeneCounts <- summarizeOverlaps(geneExons, myBam, ignore.strand = TRUE)
tregGeneCounts
## class: RangedSummarizedExperiment 
## dim: 24594 1 
## metadata(0):
## assays(1): counts
## rownames(24594): 100009600 100009609 ... 99929 99982
## rowData names(0):
## colnames(1): Sorted_Treg_1.bam
## colData names(0):

Counting in exons models

If want to start to evaluate differential transcript usage we will often evaluate counts over exons and assess for changes in exon usage.

igv

Counting in exons models

To count over exons however we will encounter an issue of assigning reads to overlapping exons.

igv

Counting in exons models

We can then work to differentiate exons by collapsing exons to the nonoverlapping, disjoint regions.

igv

Counting in exons models

We can use the GenomicFeatures’ package’s disjointExons() function with our TxDb object, TxDb.Mmusculus.UCSC.mm10.knownGene, to extract a GRanges of our disjoint exons with their associated gene_id, tx_name, exonic_part in the metadata columns.

We add some sensible names to our GRanges for use in later steps.

library(GenomicFeatures)
nonOverlappingExons <- disjointExons(TxDb.Mmusculus.UCSC.mm10.knownGene)
## Warning: disjointExons() is deprecated. Please use exonicParts() instead.
names(nonOverlappingExons) <- paste(mcols(nonOverlappingExons)$gene_id, mcols(nonOverlappingExons)$exonic_part,
    sep = "_")
nonOverlappingExons[1:3, ]
## GRanges object with 3 ranges and 3 metadata columns:
##               seqnames            ranges strand |         gene_id
##                  <Rle>         <IRanges>  <Rle> | <CharacterList>
##   100009600_1     chr9 21062393-21062399      - |       100009600
##   100009600_2     chr9 21062400-21062717      - |       100009600
##   100009600_3     chr9 21062894-21062987      - |       100009600
##                                                                      tx_name
##                                                              <CharacterList>
##   100009600_1                      ENSMUST00000115494.2,ENSMUST00000216967.1
##   100009600_2 ENSMUST00000115494.2,ENSMUST00000216967.1,ENSMUST00000213826.1
##   100009600_3                      ENSMUST00000115494.2,ENSMUST00000213826.1
##               exonic_part
##                 <integer>
##   100009600_1           1
##   100009600_2           2
##   100009600_3           3
##   -------
##   seqinfo: 66 sequences (1 circular) from mm10 genome

Counting in exons models

We can now again use the summarizeOverlaps() function with our nonOverlapping, disjoint exons to identify a BAM file of interest.

We need to set the inter.feature to FALSE to allow us to count reads overlapping multiple features (such as a read spanning between two exons).

tregExonCounts <- summarizeOverlaps(nonOverlappingExons, myBam, ignore.strand = TRUE,
    inter.feature = FALSE)

Counting in exons models

Now we can review our counts from our gene-level and exon-level counting.

We retrieve a matrix of counts from either RangedSummarizedExperiment object using the assay() function.

geneCounts <- assay(tregGeneCounts)
exonCounts <- assay(tregExonCounts)
head(exonCounts, 2)
##             Sorted_Treg_1.bam
## 100009600_1                 0
## 100009600_2                 1
head(geneCounts, 2)
##           Sorted_Treg_1.bam
## 100009600                 1
## 100009609                 0

Transcript quantification with pseudo-alignment


Transcript quantification

More recently methods have developed to directly quantify transcript abundance from FASTQ files using k-mer counting.

k-mer counting offers a super fast method of gaining transcript abundance estimates, but does not produce a genomic alignment so not so useful for visualization.

As we are most often interested in gene expression changes this offers a fast alternative to counting for alignment and counting.

The two leading pieces of software for k-mer counting

  • Salmon - Relationship with DESeq2.

  • Kallisto - History of splice aware aligners and transcript quantification.

Salmon

Salmon is the latest in a set of k-mer counting tools from the Kingsford lab (Jellyfish and Sailfish) and offers a workflow for transcript quantification from FASTQ raw sequencing data and a FASTA file of transcript sequences.

Installing programs that aren’t in R

There is no R package for Salmon. It is possible to install it on Mac and Linux using the Anaconda package repository (unfortunately there is not a Windows implementation). Anaconda is a huge collection of version controlled packages that can be installed through the conda package management system. With conda it is easy to create and manage environments that have a variety of different packages in them.

Though there is no Salmon R package, we can interact with the anaconda package system using the R package Herper. This was created by us here at the BRC and is part of Bioconductor.

BiocManager::install("Herper")
library(Herper)

Install Salmon with Herper

First, we will use Herper to install Salmon with the install_CondaTools function. We just need to tell install_CondaTools what tool/s you want and the name of the environment you want to build.

salmon_paths <- install_CondaTools(tools = "salmon", env = "RNAseq_analysis")
salmon_paths

Behind the scenes, Herper will install the most minimal version of conda (called miniconda), and then will create a new environment into which Salmon will be installed. When you run the function it prints out where Salmon is installed. There are additional arguments to control where miniconda is installed using pathToMiniConda and also update an existing environment with updateEnv.

Reference for transcript quantification

Once Salmon is installed we need to make a Salmon index. First we need a FASTA file of transcript sequences. We can use the GenomicFeatures’ packages extractTranscriptSeqs function to retrieve a DNAstringSet object of transcript sequences using our sequence data from BSgenome.Mmusculus.UCSC.mm10 and gene models from TxDb.Mmusculus.UCSC.mm10.knownGene object.

allTxSeq <- extractTranscriptSeqs(BSgenome.Mmusculus.UCSC.mm10, TxDb.Mmusculus.UCSC.mm10.knownGene,
    use.names = TRUE)
allTxSeq

Reference for transcript quantification

With our new DNAstringSet object of transcript sequences we can write a FASTA file for use in Salmon with the writeXStringSet function.

writeXStringSet(allTxSeq, "mm10Trans.fa")

Reference with decoy sequences

We can also provide a set of decoy sequences for building an index. This allows salmon to consider similar sequences outside of transcriptomic regions when mapping.

To do this we need to add the main chromosomes to our transcriptome FASTA for creating an index.

mainChromosomes <- paste0("chr", c(1:19, "X", "Y", "M"))
mainChrSeq <- lapply(mainChromosomes, function(x) BSgenome.Mmusculus.UCSC.mm10[[x]])
names(mainChrSeq) <- mainChromosomes
mainChrSeqSet <- DNAStringSet(mainChrSeq)
gentrome <- c(allTxSeq, mainChrSeqSet)

Reference with dummy sequence

We will also need to provide a config file containing the names of sequences to be used for decoys.

writeXStringSet(gentrome, "mm10Gentrome.fa")
write.table(as.data.frame(mainChromosomes), "decoy.txt", row.names = FALSE, col.names = FALSE,
    quote = FALSE)

Salmon index from R

As with standard aligners, the first part of our quantification requires us to make an index from our FASTA file using the Salmon index command. We specify the transcript FASTA file (-t) and index name (-i).

Here we arrange our Salmon command into a system call from R. This means we are directly triggering an OS command from within R.

salmon index -i mm10Trans -t mm10Trans.fa

Herper allows us to run conda packages from within R. Salmon has been installed into the environment RNAseq_analysis. So we can use this environment from R using with_CondaEnv().

fastaTx <- "mm10Trans.fa"
indexName <- "mm10Trans"

with_CondaEnv("RNAseq_analysis",
                      system2(command="salmon",args = c("index",
                        "-i",indexName,
                        "-t",fastaTx),
                        
                        stdout = TRUE))

Salmon index from R

To create index with decoys we provide our transcriptome + chromosome FASTA as well as our decoy file to the -d paramter.

with_CondaEnv("RNAseq_analysis",
                      system2(command="salmon",args = c("index",
                        "-i",indexName,
                        "-t",fastaTx,
                        "-d decoy.txt"),
                        
                        stdout = TRUE))

Salmon quant from R

To quantify transcript abundance we use the Salmon quant command.

We specify the index location (-i), the reads to quantify (-r), the output directory (-o) and the library type as automatic detection of type (-l A).

salmon quant -i mm10Trans -r ~/Downloads/ENCFF332KDA.fastq.gz -o TReg_1_Quant -l A
fq <- "ENCFF332KDA.fastq.gz"
outDir <- "TReg_1_Quant"

with_CondaEnv("RNAseq_analysis",
                      system2(command="salmon",args = c("quant",
                        "-i",indexName,
                        "-o",outDir,
                        "-l A",
                        "-r",fq),
                        
                        stdout = TRUE))

Salmon outputs

The output from Salmon is our quantification of transcripts in the quant.sf file in the user defined output directory.

We will use a new package tximport to handle these counts in next session, but for now we can review file using standard data import R libraries.

myQuant <- read.delim("TReg_1_Quant/quant.sf")
myQuant[1:3, ]
##                   Name Length EffectiveLength    TPM NumReads
## 1 ENSMUST00000193812.1   1070         820.000 0.0000        0
## 2 ENSMUST00000082908.1    110           3.749 0.0000        0
## 3 ENSMUST00000192857.1    480         230.000 0.8368        7

Time for an exercise

Link_to_exercises

Link_to_answers