Why We Use Illumina Sequencing

We can now massively, parallel sequence DNA fragments using illumina/solexa sequencing.
This allows us to rapidly profile abundance of specific DNA sequences in a pool of DNA sequences (or cDNA sequences).
With the full genome sequence we can identify where our abundant sequences are in linear genome.
And/or with transcriptome models we can estimate abundance of transcripts/genes.

##Principals of Illumina Sequencing

igv

##FastQ Sequences

As we have seen earlier,sequences returned from the Illumina sequences machines are often stored in FASTQ format.

$igv$

##FastQ in Bioconductor.

Illumina sequences as FastQ files can be handled in Bioconductor using the functions in the Biostrings package as well as the ShortRead package.

##ShortRead Packages

To make use of a ShortRead package we must first install and load the library.

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ShortRead")

##Data

In this session we will be making use of some public datasets from the ENCODE consortium.

We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.

This RNAseq data has been generated from the human cell line GM12878 and the link to experiment can be found here or a direct link to FastQ for replicate 2 we are using can be found here.

##Data Locally

For this session, I have taken the first 100,000 reads from the FastQ file ENCFF000CXH.fastq.gz to create data/sampled_ENCFF000CXH.fastq.gz. This should allow us to quickly perform some processing and analysis of this data as well as include the smaller file size example data in our zip archive.

This can be found in:

data/sampled_ENCFF000CXH.fastq.gz

##Data Locally

Even though we will work with smaller sample for the presentation, the ShortRead allows us to handle large sequencing datasets in a memory efficient manner.

We will review this at the end of the session.

##Reading in FastQ File

The ShortRead package allows us to import FastQ files into R using the readFastq() function. This function returns a ShortReadQ storing the information from the FastQ file (sequence, quality of sequence and unique read identifiers).

library(ShortRead)
fastQ <- readFastq("data/sampled_ENCFF000CXH.fastq.gz")
class(fastQ)

## [1] "ShortReadQ"
## attr(,"package")
## [1] "ShortRead"

##FastQ Data Object

We can get a very simple summary of the ShortReadQ object by typing in the variable name.

fastQ

## class: ShortReadQ
## length: 100000 reads; width: 76 cycles

##FastQ Data Object

We can use the familiar length() function to report the total number of reads as we have done for vectors or GRanges objects.

length(fastQ)

## [1] 100000

##FastQ Data Object

Or we can use the width() function to find the size of each read/sequence in fastQ as we have done with DNAstringSet and GRanges objects

readLengths <- width(fastQ)
readLengths[1:10]

##  [1] 76 76 76 76 76 76 76 76 76 76

##FastQ Data Object

We can subset or index ShortReadQ objects using the same methods as we have for vectors and GRanges.

Here I use a vector to retrieve first 10 reads in file [Quick reminder that 1:10 is the same as c(1,2,3,4,5,6,7,8,9,10)]

fastQ[1:10]

## class: ShortReadQ
## length: 10 reads; width: 76 cycles

##FastQ Accessors

As we have seen, just typing the object name provide a summary of total reads and maximum read length but to retrieve information such a the sequence, quality or ID we will need to use some special accessor functions.

sread() - Retrieve sequence of reads.
quality() - Retrieve quality of reads as ASCII scores.
ids() - Retrieve IDs of reads.

##FastQ Sequences

We can retrieve all read sequences using the sread() accessor function and the ShortReadQ object.

sequenceOfReads <- sread(fastQ)
class(sequenceOfReads)

## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"

##FastQ Sequences

The sequences in reads are themselves held in an object we are quite familiar with, a DNAStringSet object.

sequenceOfReads

## DNAStringSet object of length 100000:
##          width seq
##      [1]    76 NNNNNNAANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCTCTGCNN
##      [2]    76 NNNNNNTTNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCGGCGCNN
##      [3]    76 NNNNNNCANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNGGATTANN
##      [4]    76 NNNNNNTANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNAGCGCANN
##      [5]    76 NNNNNNACNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNTCCTGCNN
##      ...   ... ...
##  [99996]    76 ACTGCTAAATGCACAACATCGGGCAGGGAAG...GCCCAAGGGGAAGAAATCCGCAGCTCAACAG
##  [99997]    76 CAACAGGACTCGGTGGAGTGCGCTACTCAGG...AACATAGAGAAGGACATTGCGGCTCATATCA
##  [99998]    76 CCCCGCCGGCGGCAGCGGCTTTGGACGAGGG...GCAAGGGGTCCACCCCTACCGACGCTAGACC
##  [99999]    76 GAGATTGGCATTCCCGTGGCCCTGCTCAGCT...GACATCTTCTCGCGTTTCGGCCGCACGGGGA
## [100000]    76 ATCTCGTATGCCCTCTTCTGCTTTTAAAAAA...NTGNCGGTTNCGGCGGTCACCACACGCAGCA

##FastQ Sequences

This means we can use all the great functions from the Biostrings library on this DNAStringSet object.

Here we get the occurrence of nucleotide bases in reads. Here we get the alphabet frequency of the first two.

alpFreq <- alphabetFrequency(sequenceOfReads)
alpFreq[1:2,]

##      A C G T M R W S Y K V H D B  N - + .
## [1,] 2 3 1 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0
## [2,] 0 3 3 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0

##FastQ IDs

We can also extract the IDs for every read using the id() function.

The function again returns an object from the Biostrings packages, here the more generic BStringSet object.

idsOfReads <- id(fastQ)
class(idsOfReads)

## [1] "BStringSet"
## attr(,"package")
## [1] "Biostrings"

idsOfReads[1:2]

## BStringSet object of length 2:
##     width seq
## [1]    31 42JV5AAXX_HWI-EAS229_1:6:1:0:39
## [2]    32 42JV5AAXX_HWI-EAS229_1:6:1:0:155

##FastQ IDs

Again we can act on the BStringSet object just as we did on DNAStringSet objects.

Here we simply convert the BStringSet to a character vector with the as.character() function.

Ids <- as.character(idsOfReads)
Ids[1:4]

## [1] "42JV5AAXX_HWI-EAS229_1:6:1:0:39"  "42JV5AAXX_HWI-EAS229_1:6:1:0:155"
## [3] "42JV5AAXX_HWI-EAS229_1:6:1:0:372" "42JV5AAXX_HWI-EAS229_1:6:1:0:851"

##FastQ Quality

A fundamental difference betweem Fasta and FastQ files is the Quality scores containined in FastQ.

As we have seen, quality scores are stored as ASCII characters representing -log10 probability of base being wrong (Larger scores would be associated to more confident base calls).

A comprehensive description of phred quality can be found on the wiki page for FastQ.

##FastQ Quality

We can also extract the qualities for every read using the quality() function.

The returned object is a special extension of BStringSet object, the FastqQuality object.

quals <- quality(fastQ)
class(quals)

## [1] "FastqQuality"
## attr(,"package")
## [1] "ShortRead"

##FastQ Quality

The object itself contains our quality scores as characters in ASCII format.

quals

## class: FastqQuality
## quality:
## BStringSet object of length 100000:
##          width seq
##      [1]    76 ###############################...###############################
##      [2]    76 ###############################...###############################
##      [3]    76 ###############################...###############################
##      [4]    76 ###############################...###############################
##      [5]    76 ###############################...###############################
##      ...   ... ...
##  [99996]    76 ###############################...###############################
##  [99997]    76 BBBABB@?BABC?=@B?A8??B8???@?8@A...?=?=8=46<;8?;315689/<5,5468856<
##  [99998]    76 BBBBABBBBBBB@BA@AB9=B@';+0:3;=B...###############################
##  [99999]    76 B==35;:1<;;54;;845555;545436435...45444455544####################
## [100000]    76 ?BB<C>/6A@AC6=BB@A75B##########...###############################

##FastQ Quality

We can find out how to translate from these ASCII characters into their corresponding -log10 pvalues using the encoding() function on our FastqQuality object of qualities.

qualityEncoding <- encoding(quals)
qualityEncoding

##  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9  : 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
##  U  V  W  X  Y  Z  [ \\  ]  ^  _  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n 
## 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 
##  o  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~ 
## 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93

##FastQ Quality

We could use our named vector of quality encoding to translate our FastqQuality object into qualities one read at a time.

We can subset the FastqQuality object using standard vector indexing using [.

We can extract a BString object using list indexing [[

quals[1]

## class: FastqQuality
## quality:
## BStringSet object of length 1:
##     width seq
## [1]    76 ##################################...#################################

quals[[1]]

## 76-letter BString object
## seq: ####################################...####################################

##FastQ Quality

We can use the strsplit() to split out quality into an array of individual ASCII quality score and use the quality encoding to translate to -log10 pvalues.

The strsplit() function takes a character vector to split and a character vector to split by. Here we split by nothing "" to give us a vector of individual characters.

toTranslateList <- strsplit(as.character(quals[[1]]),"")
toTranslate <- unlist(toTranslateList)
toTranslate

##  [1] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [20] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [39] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [58] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"

##FastQ Quality

And finally we can translate vector of characters to scores using the named vector of quality encoding.

qualityEncoding[toTranslate]

## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2

##FastQ Quality

Thankfully, there are some functions available in the ShortRead package which will perform this translation for us.

We can obtain the sum -log10 pvalues using the alphabetScore() function for all reads.

readScores<- alphabetScore(quals)
readScores[1]

## [1] 152

sum(qualityEncoding[toTranslate])

## [1] 152

##FastQ Quality

We can also obtain the quality scores over cycles for all reads using the function as(myQualities, “matrix”)

matrixOfQualities <- as(quals,"matrix")
rowSums(matrixOfQualities)[1]

## [1] 152

##ShortRead Functions

The ShortRead package has many functions available to allow us to collect useful metrics from our ShortRead object.

One very useful function is the alphabetByCycle() function which provides a quick method to summarise base occurrence of cycles.

Here we apply alphabetByCycle() function to the sequence information and show the occurrence of main 4 bases over first 15 cycles.

alpByCyc <- alphabetByCycle(sequenceOfReads)
alpByCyc[1:4,1:15]

##         cycle
## alphabet  [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
##        A 35581 28074 26165 26396 25916 25073 24888 31856 24865 24992 24783
##        C 19925 17360 26879 20644 28101 22186 22014 21524 22156 21995 28825
##        G 23285 25894 25343 24627 24004 30682 23864 23881 23718 31308 24423
##        T 20613 27041 19606 26067 19529 19519 26678 20178 26740 19083 19364
##         cycle
## alphabet [,12] [,13] [,14] [,15]
##        A 24898 24829 24821 25086
##        C 28943 22270 22359 29346
##        G 24294 31085 23875 23505
##        T 19263 19163 26182 19304

##ShortRead Functions

We can also apply our new alphabetByCycle() function to the quality scores.

qualsByCyc <- alphabetByCycle(quals)
qualsByCyc[1:4,1:15]

##         cycle
## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##             0    0    0    0    0    0    0    0    0     0     0     0     0
##        !    0    0    0    0    0    0    0    0    0     0     0     0     0
##        "    0    0    0    0    0    0    0    0    0     0     0     0     0
##        # 4808 5204 5514 5728 5900 6101 6274 6490 6666  6877  7157  7452  7756
##         cycle
## alphabet [,14] [,15]
##              0     0
##        !     0     0
##        "     0     0
##        #  8056  8345

##ShortRead Functions

We can use the table function to identify the number of times a sequence appears in our FastQ file’s sequence reads.

readOccurence <- table(sequenceOfReads)
sort(readOccurence,decreasing = TRUE)[1:2]

##                                                                sequenceOfReads
## 1 ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## 2 CTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTATTCGGCGCAT
##   Freq
## 1  403
## 2  291

##ShortRead Functions

We can identify duplicated reads (potentially arising from PCR over amplification) by using the srduplicated() function and the ShortReadQ object.

This returns a logical vector identifying which reads’ sequences are duplicates (occur more than once in file). Note that the first time a sequence appears in file is not a duplicate but the second, third, fourth times etc are.

duplicates <- srduplicated(fastQ)
duplicates[1:3]

## [1] FALSE FALSE FALSE

##ShortRead Functions

We can use this now to get a quick measure of sequence duplication rate using the table function.

table(duplicates)

## duplicates
## FALSE  TRUE 
## 93437  6563

##ShortReads’ qa() Function

The ShortRead package also contains a function to generate a simple quality control report.

The qa() function accepts a FastQ file and returns a FastqQA object.

my_QA <- qa("data/sampled_ENCFF000CXH.fastq.gz")
my_QA

## class: FastqQA(10)
## QA elements (access with qa[["elt"]]):
##   readCounts: data.frame(1 3)
##   baseCalls: data.frame(1 5)
##   readQualityScore: data.frame(512 4)
##   baseQuality: data.frame(95 3)
##   alignQuality: data.frame(1 3)
##   frequentSequences: data.frame(50 4)
##   sequenceDistribution: data.frame(38 4)
##   perCycle: list(2)
##     baseCall: data.frame(380 4)
##     quality: data.frame(2416 5)
##   perTile: list(2)
##     readCounts: data.frame(0 4)
##     medianReadQualityScore: data.frame(0 4)
##   adapterContamination: data.frame(1 1)

##ShortReads’ qa() Function

We can then use the report() function to generate a simple report.

myReport <- report(my_QA)
myReport

## [1] "/tmp/RtmppVwLJW/file16333d70a21d/index.html"

If you want to keep the report, you can save it to a specific directory

report(my_QA, dest="QC_report")

## Warning in dir.create(dest, recursive = TRUE): 'QC_report' already exists

## [1] "QC_report/index.html"

##ShortReads’ qa() Function

Finally we can review the report in a browser or use the browseURL function to open it in a browser from R.

browseURL(myReport)

##Trimming

When we observe low quality at the end of reads we may wish to remove the low quality bases for later alignment to the genome. The trimTails() function trims reads from the 3’, removing bases which fall below a desired quality. The trimTails() function accepts arguments specifying the ShortReadQ object, the minimum number of successive bases required to be below quality cut-off for trimming and the actual cut-off score.

TrimmedFastq <- trimTails(fastQ,20,"5")
TrimmedFastq

## class: ShortReadQ
## length: 100000 reads; width: 19..76 cycles

##Exporting Fastq Files

Now we have trimmed our FastQ reads, we can export these reads for further analysis using the writeFastq() function.

writeFastq(TrimmedFastq,"myTrimmed_Fastq.fastq.gz")

Can something do this all for me?

There are several utility programs that will provide you with QC and trim your data for you, with less input from you. We are fans of the fastp as it does some basic QC and trims your fastqs, and it does it very quickly. To make this available in R, the BRC wrapped this in the Bioconductor package Rfastp.

library(Rfastp)

rfastp_report <- rfastp(read1 = "data/sampled_ENCFF000CXH.fastq.gz", outputFastq ="data/Rfastp_ENCFF000CXH")

Rfastp QC Summary

By default fastp will make a html report to summarize your result. But the Rfastp wrapper allows you to look at some of them in R.

dfsummary <- qcSummary(rfastp_report)
dfsummary

##                      Before_QC     After_QC
## total_reads       1.000000e+05 6.777800e+04
## total_bases       7.600000e+06 5.151128e+06
## q20_bases         4.965570e+06 4.297295e+06
## q30_bases         2.872723e+06 2.486809e+06
## q20_rate          6.533640e-01 8.342430e-01
## q30_rate          3.779900e-01 4.827700e-01
## read1_mean_length 7.600000e+01 7.600000e+01
## gc_content        4.895160e-01 4.875440e-01

Rfastp QC Plot

curvePlot(rfastp_report, curve="quality_curves")

curvePlot(rfastp_report, curve="content_curves")

##Handling Large Files

So far we have used a subsampled FastQ file to review some of the functions available to us in the ShortRead package.

The FastqSampler() and FastqStreamer() functions allow us to read portions of a FastQ file.

##Handling Large Files

For evaluating quality of our FastQ, we do not require the entire file but perhaps only 100,000 random reads to assess quality.

The FastqSampler() function allows us to specifiy how many reads we wish to sample when reading in a file. The FastqSampler() function returns a FastqSampler object we can use with yield() function to obtain a sample from

sampleToRead <- FastqSampler("data/sampled_ENCFF000CXH.fastq.gz",
                             n=100)
yield(sampleToRead)

## class: ShortReadQ
## length: 100 reads; width: 76 cycles

##Handling Large Files

The FastqStreamer() function works in a similar manner accept instead of randomly sampling the file, it allows us to read chunks of the file at a time.

sampleToRead <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz",
                              n=100)
first100Reads <- yield(sampleToRead)
second100Reads <- yield(sampleToRead)

##Handling Large Files

We can stream over entire files using a loop, here in chunks of 25000

fq <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz",
                   n=25000)
while (length(fq_stream <- yield(fq)) > 0) {
    print(length(fq_stream ))
}

## [1] 25000
## [1] 25000
## [1] 25000
## [1] 25000

##Time for an Exercise

Link_to_exercises

Link_to_answers

FastQ Sequence data In Bioconductor

Rockefeller University, Bioinformatics Resource Centre

http://rockefelleruniversity.github.io/Bioconductor_Introduction/

Why We Use Illumina Sequencing

Can something do this all for me?

Rfastp QC Summary

Rfastp QC Plot