FastQInBioconductor.knit

class: center, middle, inverse, title-slide

# FastQ Sequence data In Bioconductor
<html>
<div style="float:left">

</div>
<hr color='#EB811B' size=1px width=796px>
</html>
### Rockefeller University, Bioinformatics Resource Centre
### <a href="http://rockefelleruniversity.github.io/Bioconductor_Introduction/" class="uri">http://rockefelleruniversity.github.io/Bioconductor_Introduction/</a>

---

## Why We Use Illumina Sequencing

- We can now massively, parallel sequence DNA fragments using illumina/solexa sequencing.
- This allows us to rapidly profile abundance of specific DNA sequences in a pool of DNA sequences (or cDNA sequences).
- With the full genome sequence we can identify where our abundant sequences are in linear genome.
- And/or with transcriptome models we can estimate abundance of transcripts/genes.

---

##Principals of Illumina Sequencing

---

##FastQ Sequences

As we have seen earlier,sequences returned from the Illumina sequences machines are often stored in FASTQ format.

---

##FastQ in Bioconductor.

Illumina sequences as FastQ files can be handled in Bioconductor using the functions in the **Biostrings** package as well as the **ShortRead** package.

---

##ShortRead Packages

To make use of a ShortRead package we must first install and load the library.

```r
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("ShortRead")
```

---

##Data

In this session we will be making use of some public datasets from the ENCODE consortium.

We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment.

This RNAseq data has been generated from the human cell line **GM12878** and the link to experiment can be found [here](https://www.encodeproject.org/experiments/ENCSR297UBP/) or a direct link to FastQ for replicate 2 we are using can be found [here.](https://www.encodeproject.org/files/ENCFF000CXH/@@download/ENCFF000CXH.fastq.gz)

---

##Data Locally

For this session, I have taken the first 100,000 reads from the FastQ file **ENCFF000CXH.fastq.gz** to create **data/sampled_ENCFF000CXH.fastq.gz**. This should allow us to quickly perform some processing and analysis of this data as well as include the smaller file size example data in our zip archive.

This can be found in:

**data/sampled_ENCFF000CXH.fastq.gz**

---

##Data Locally

Even though we will work with smaller sample for the presentation, the ShortRead allows us to handle large sequencing datasets in a memory efficient manner.

We will review this at the end of the session.

---

##Reading in FastQ File

The **ShortRead** package allows us to import FastQ files into R using the **readFastq()** function. This function returns a **ShortReadQ** storing the information from the FastQ file (sequence, quality of sequence and unique read identifiers).

```r
library(ShortRead)
fastQ <- readFastq("data/sampled_ENCFF000CXH.fastq.gz")
class(fastQ)
```

```
## [1] "ShortReadQ"
## attr(,"package")
## [1] "ShortRead"
```

---

##FastQ Data Object

We can get a very simple summary of the **ShortReadQ** object by typing in the variable name.

```r
fastQ
```

```
## class: ShortReadQ
## length: 100000 reads; width: 76 cycles
```

---
##FastQ Data Object

We can use the familiar **length()** function to report the total number of reads as we have done for vectors or GRanges objects.

```r
length(fastQ)
```

```
## [1] 100000
```

---
##FastQ Data Object

Or we can use the **width()** function to find the size of each read/sequence in fastQ as we have done with DNAstringSet and GRanges objects

```r
readLengths <- width(fastQ)
readLengths[1:10]
```

```
##  [1] 76 76 76 76 76 76 76 76 76 76
```

---
##FastQ Data Object

We can subset or index **ShortReadQ** objects using the same methods as we have for vectors and GRanges.

Here I use a vector to retrieve first 10 reads in file **[Quick reminder that 1:10 is the same as c(1,2,3,4,5,6,7,8,9,10)]**

```r
fastQ[1:10]
```

```
## class: ShortReadQ
## length: 10 reads; width: 76 cycles
```

---
##FastQ Accessors

As we have seen, just typing the object name provide a summary of total reads and maximum read length but to retrieve information such a the sequence, quality or ID we will need to use some special accessor functions.

* **sread()** - Retrieve sequence of reads.
* **quality()** - Retrieve quality of reads as ASCII scores.
* **ids()** - Retrieve IDs of reads.
---

##FastQ Sequences

We can retrieve all read sequences using the **sread()** accessor function and the **ShortReadQ** object.

```r
sequenceOfReads <- sread(fastQ)
class(sequenceOfReads)
```

```
## [1] "DNAStringSet"
## attr(,"package")
## [1] "Biostrings"
```
---
##FastQ Sequences

The sequences in reads are themselves held in an object we are quite familiar with, a **DNAStringSet** object.

```r
sequenceOfReads 
```

```
## DNAStringSet object of length 100000:
##          width seq
##      [1]    76 NNNNNNAANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCTCTGCNN
##      [2]    76 NNNNNNTTNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCGGCGCNN
##      [3]    76 NNNNNNCANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNGGATTANN
##      [4]    76 NNNNNNTANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNAGCGCANN
##      [5]    76 NNNNNNACNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNTCCTGCNN
##      ...   ... ...
##  [99996]    76 ACTGCTAAATGCACAACATCGGGCAGGGAAG...GCCCAAGGGGAAGAAATCCGCAGCTCAACAG
##  [99997]    76 CAACAGGACTCGGTGGAGTGCGCTACTCAGG...AACATAGAGAAGGACATTGCGGCTCATATCA
##  [99998]    76 CCCCGCCGGCGGCAGCGGCTTTGGACGAGGG...GCAAGGGGTCCACCCCTACCGACGCTAGACC
##  [99999]    76 GAGATTGGCATTCCCGTGGCCCTGCTCAGCT...GACATCTTCTCGCGTTTCGGCCGCACGGGGA
## [100000]    76 ATCTCGTATGCCCTCTTCTGCTTTTAAAAAA...NTGNCGGTTNCGGCGGTCACCACACGCAGCA
```
---
##FastQ Sequences

This means we can use all the great functions from the **Biostrings** library on this **DNAStringSet** object.

Here we get the occurrence of nucleotide bases in reads. Here we get the alphabet frequency of the first two.

```r
alpFreq <- alphabetFrequency(sequenceOfReads)
alpFreq[1:2,]
```

```
##      A C G T M R W S Y K V H D B  N - + .
## [1,] 2 3 1 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0
## [2,] 0 3 3 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0
```

---
##FastQ IDs

We can also extract the IDs for every read using the **id()** function.

The function again returns an object from the **Biostrings** packages, here the more generic **BStringSet** object.

```r
idsOfReads <- id(fastQ)
class(idsOfReads)
```

```
## [1] "BStringSet"
## attr(,"package")
## [1] "Biostrings"
```

```r
idsOfReads[1:2]
```

```
## BStringSet object of length 2:
##     width seq
## [1]    31 42JV5AAXX_HWI-EAS229_1:6:1:0:39
## [2]    32 42JV5AAXX_HWI-EAS229_1:6:1:0:155
```
---
##FastQ IDs

Again we can act on the **BStringSet** object just as we did on **DNAStringSet** objects.

Here we simply convert the **BStringSet** to a character vector with the **as.character()** function.

```r
Ids <- as.character(idsOfReads)
Ids[1:4]
```

```
## [1] "42JV5AAXX_HWI-EAS229_1:6:1:0:39"  "42JV5AAXX_HWI-EAS229_1:6:1:0:155"
## [3] "42JV5AAXX_HWI-EAS229_1:6:1:0:372" "42JV5AAXX_HWI-EAS229_1:6:1:0:851"
```

---
##FastQ Quality

A fundamental difference betweem Fasta and FastQ files is the **Q**uality scores containined in FastQ.

As we have seen, quality scores are stored as ASCII characters representing -log10 probability of base being wrong (Larger scores would be associated to more confident base calls).

A comprehensive description of phred quality can be found on the wiki page for [FastQ](https://en.wikipedia.org/wiki/FASTQ_format#Quality).

---
##FastQ Quality

We can also extract the qualities for every read using the **quality()** function.

The returned object is a special extension of **BStringSet** object, the **FastqQuality** object.

```r
quals <- quality(fastQ)
class(quals)
```

```
## [1] "FastqQuality"
## attr(,"package")
## [1] "ShortRead"
```

---
##FastQ Quality

The object itself contains our quality scores as characters in ASCII format.

```r
quals
```

```
## class: FastqQuality
## quality:
## BStringSet object of length 100000:
##          width seq
##      [1]    76 ###############################...###############################
##      [2]    76 ###############################...###############################
##      [3]    76 ###############################...###############################
##      [4]    76 ###############################...###############################
##      [5]    76 ###############################...###############################
##      ...   ... ...
##  [99996]    76 ###############################...###############################
##  [99997]    76 BBBABB@?BABC?=@B?A8??B8???@?8@A...?=?=8=46<;8?;315689/<5,5468856<
##  [99998]    76 BBBBABBBBBBB@BA@AB9=B@';+0:3;=B...###############################
##  [99999]    76 B==35;:1<;;54;;845555;545436435...45444455544####################
## [100000]    76 ?BB<C>/6A@AC6=BB@A75B##########...###############################
```

---
##FastQ Quality

We can find out how to translate from these ASCII characters into their corresponding -log10 pvalues using the **encoding()** function on our **FastqQuality** object of qualities.

```r
qualityEncoding <- encoding(quals)
qualityEncoding
```

```
##  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9  : 
##  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 
##  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T 
## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 
##  U  V  W  X  Y  Z  [ \\  ]  ^  _  `  a  b  c  d  e  f  g  h  i  j  k  l  m  n 
## 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 
##  o  p  q  r  s  t  u  v  w  x  y  z  {  |  }  ~ 
## 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93
```
---
##FastQ Quality

We could use our named vector of quality encoding to translate our **FastqQuality** object into qualities one read at a time.

We can subset the **FastqQuality** object using standard vector indexing using **[**.

We can extract a **BString** object using list indexing **[[**

```r
quals[1]
```

```
## class: FastqQuality
## quality:
## BStringSet object of length 1:
##     width seq
## [1]    76 ##################################...#################################
```

```r
quals[[1]]
```

```
## 76-letter BString object
## seq: ####################################...####################################
```
---
##FastQ Quality

We can use the **strsplit()** to split out quality into an array of individual ASCII quality score and use the quality encoding to translate to -log10 pvalues.

The **strsplit()** function takes a character vector to split and a character vector to split by. Here we split by nothing **""** to give us a vector of individual characters.

```r
toTranslateList <- strsplit(as.character(quals[[1]]),"")
toTranslate <- unlist(toTranslateList)
toTranslate
```

```
##  [1] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [20] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [39] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
## [58] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#"
```
---
##FastQ Quality

And finally we can translate vector of characters to scores using the named vector of quality encoding.

```r
qualityEncoding[toTranslate]
```

```
## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 
## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # 
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
```

---
##FastQ Quality

Thankfully, there are some functions available in the **ShortRead** package which will perform this translation for us.

We can obtain the sum -log10 pvalues using the **alphabetScore()** function for all reads.

```r
readScores<- alphabetScore(quals)
readScores[1]
```

```
## [1] 152
```

```r
sum(qualityEncoding[toTranslate])
```

```
## [1] 152
```
---
##FastQ Quality

We can also obtain the quality scores over cycles for all reads using the function **as(_myQualities_, "matrix")**

```r
matrixOfQualities <- as(quals,"matrix")
rowSums(matrixOfQualities)[1]
```

```
## [1] 152
```

---
##ShortRead Functions

The **ShortRead** package has many functions available to allow us to collect useful metrics from our ShortRead object.

One very useful function is the **alphabetByCycle()** function which provides a quick method to summarise base occurrence of cycles.

Here we apply **alphabetByCycle()** function to the sequence information and show the occurrence of main 4 bases over first 15 cycles.

```r
alpByCyc <- alphabetByCycle(sequenceOfReads)
alpByCyc[1:4,1:15]
```

```
##         cycle
## alphabet  [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11]
##        A 35581 28074 26165 26396 25916 25073 24888 31856 24865 24992 24783
##        C 19925 17360 26879 20644 28101 22186 22014 21524 22156 21995 28825
##        G 23285 25894 25343 24627 24004 30682 23864 23881 23718 31308 24423
##        T 20613 27041 19606 26067 19529 19519 26678 20178 26740 19083 19364
##         cycle
## alphabet [,12] [,13] [,14] [,15]
##        A 24898 24829 24821 25086
##        C 28943 22270 22359 29346
##        G 24294 31085 23875 23505
##        T 19263 19163 26182 19304
```
---
##ShortRead Functions

We can also apply our new  **alphabetByCycle()** function to the quality scores.

```r
qualsByCyc <- alphabetByCycle(quals)
qualsByCyc[1:4,1:15]
```

```
##         cycle
## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
##             0    0    0    0    0    0    0    0    0     0     0     0     0
##        !    0    0    0    0    0    0    0    0    0     0     0     0     0
##        "    0    0    0    0    0    0    0    0    0     0     0     0     0
##        # 4808 5204 5514 5728 5900 6101 6274 6490 6666  6877  7157  7452  7756
##         cycle
## alphabet [,14] [,15]
##              0     0
##        !     0     0
##        "     0     0
##        #  8056  8345
```
---
##ShortRead Functions

We can use the **table** function to identify the number of times a sequence appears in our FastQ file's sequence reads.

```r
readOccurence <- table(sequenceOfReads)
sort(readOccurence,decreasing = TRUE)[1:2]
```

```
##                                                                sequenceOfReads
## 1 ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
## 2 CTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTATTCGGCGCAT
##   Freq
## 1  403
## 2  291
```

---
##ShortRead Functions

We can identify duplicated reads (potentially arising from PCR over amplification) by using the **srduplicated()** function and the **ShortReadQ** object.

This returns a logical vector identifying which reads' sequences are duplicates (occur more than once in file). Note that the first time a sequence appears in file is not a duplicate but the second, third, fourth times etc are.

```r
duplicates <- srduplicated(fastQ)
duplicates[1:3]
```

```
## [1] FALSE FALSE FALSE
```
---
##ShortRead Functions

We can use this now to get a quick measure of sequence duplication rate using the **table** function.

```r
table(duplicates)
```

```
## duplicates
## FALSE  TRUE 
## 93437  6563
```

---
##ShortReads' qa() Function

The **ShortRead** package also contains a function to generate a simple quality control report.

The **qa()** function accepts a FastQ file and returns a **FastqQA**  object.

```r
my_QA <- qa("data/sampled_ENCFF000CXH.fastq.gz")
my_QA
```

```
## class: FastqQA(10)
## QA elements (access with qa[["elt"]]):
##   readCounts: data.frame(1 3)
##   baseCalls: data.frame(1 5)
##   readQualityScore: data.frame(512 4)
##   baseQuality: data.frame(95 3)
##   alignQuality: data.frame(1 3)
##   frequentSequences: data.frame(50 4)
##   sequenceDistribution: data.frame(38 4)
##   perCycle: list(2)
##     baseCall: data.frame(380 4)
##     quality: data.frame(2416 5)
##   perTile: list(2)
##     readCounts: data.frame(0 4)
##     medianReadQualityScore: data.frame(0 4)
##   adapterContamination: data.frame(1 1)
```

---
##ShortReads' qa() Function

We can then use the **report()** function to generate a simple report.

```r
myReport <- report(my_QA)
myReport
```

```
## [1] "/tmp/RtmppVwLJW/file163311fc1a00/index.html"
```

If you want to keep the report, you can save it to a specific directory

```r
report(my_QA, dest="QC_report")
```

```
## [1] "QC_report/index.html"
```

---
##ShortReads' qa() Function

Finally we can review the report in a browser or use the browseURL function to open it in a browser from R.

```r
browseURL(myReport)
```

---
##Trimming

When we observe low quality at the end of reads we may wish to remove the low quality bases for later alignment to the genome.
The **trimTails()** function trims reads from the 3', removing bases which fall below a desired quality. 
The **trimTails()** function accepts arguments specifying the **ShortReadQ** object, the minimum number of successive bases required to be below quality cut-off for trimming and the actual cut-off score.

```r
TrimmedFastq <- trimTails(fastQ,20,"5")
TrimmedFastq
```

```
## class: ShortReadQ
## length: 100000 reads; width: 19..76 cycles
```

---
##Exporting Fastq Files

Now we have trimmed our FastQ reads, we can export these reads for further analysis using the **writeFastq()** function.

```r
writeFastq(TrimmedFastq,"myTrimmed_Fastq.fastq.gz")
```

# Can something do this all for me?

There are several utility programs that will provide you with QC and trim your data for you, with less input from you. We are fans of the [fastp](https://pubmed.ncbi.nlm.nih.gov/30423086/) as it does some basic QC and trims your fastqs, and it does it very quickly. To make this available in R, the BRC wrapped this in the Bioconductor package [Rfastp](http://www.bioconductor.org/packages/release/bioc/html/Rfastp.html).

```r
library(Rfastp)

rfastp_report <- rfastp(read1 = "data/sampled_ENCFF000CXH.fastq.gz", outputFastq ="data/Rfastp_ENCFF000CXH")
```
# Rfastp QC Summary
By default fastp will make a html report to summarize your result. But the Rfastp wrapper allows you to look at some of them in R.

```r
dfsummary <- qcSummary(rfastp_report)
dfsummary
```

```
##                      Before_QC     After_QC
## total_reads       1.000000e+05 6.777800e+04
## total_bases       7.600000e+06 5.151128e+06
## q20_bases         4.965570e+06 4.297295e+06
## q30_bases         2.872723e+06 2.486809e+06
## q20_rate          6.533640e-01 8.342430e-01
## q30_rate          3.779900e-01 4.827700e-01
## read1_mean_length 7.600000e+01 7.600000e+01
## gc_content        4.895160e-01 4.875440e-01
```

# Rfastp QC Plot
.pull-left[

```r
curvePlot(rfastp_report, curve="quality_curves")
```

![](FastQInBioconductor_files/figure-html/unnamed-chunk-4-1.png)
  ]
  
.pull-left[

```r
curvePlot(rfastp_report, curve="content_curves")
```

![](FastQInBioconductor_files/figure-html/unnamed-chunk-5-1.png)
  ]

---
##Handling Large Files

So far we have used a subsampled FastQ file to review some of the functions available to us in the ShortRead package.

The **FastqSampler()** and **FastqStreamer()** functions allow us to read portions of a FastQ file. 
---
##Handling Large Files

For evaluating quality of our FastQ, we do not require the entire file but perhaps only 100,000 random reads to assess quality.

The **FastqSampler()** function allows us to specifiy how many reads we wish to sample when reading in a file. The **FastqSampler()** function returns a FastqSampler object we can use with **yield()** function to obtain a sample from

```r
sampleToRead <- FastqSampler("data/sampled_ENCFF000CXH.fastq.gz",
                             n=100)
yield(sampleToRead)
```

```
## class: ShortReadQ
## length: 100 reads; width: 76 cycles
```

---
##Handling Large Files

The **FastqStreamer()** function works in a similar manner accept instead of randomly sampling the file, it allows us to read chunks of the file at a time.

```r
sampleToRead <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz",
                              n=100)
first100Reads <- yield(sampleToRead)
second100Reads <- yield(sampleToRead)
```
---

##Handling Large Files

We can stream over entire files using a loop, here in chunks of 25000

```r
fq <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz",
                   n=25000)
while (length(fq_stream <- yield(fq)) > 0) {
    print(length(fq_stream ))
}
```

```
## [1] 25000
## [1] 25000
## [1] 25000
## [1] 25000
```

---

##Time for an Exercise

[Link_to_exercises](../../exercises/exercises/fastq_exercise.html)

[Link_to_answers](../../exercises/answers/fastq_answers.html)