class: center, middle, inverse, title-slide # FastQ Sequence data In Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- ## Why We Use Illumina Sequencing - We can now massively, parallel sequence DNA fragments using illumina/solexa sequencing. - This allows us to rapidly profile abundance of specific DNA sequences in a pool of DNA sequences (or cDNA sequences). - With the full genome sequence we can identify where our abundant sequences are in linear genome. - And/or with transcriptome models we can estimate abundance of transcripts/genes. --- ##Principals of Illumina Sequencing <div align="center"> <img src="imgs/wps.png" alt="igv" height="500" width="800"> </div> --- ##FastQ Sequences As we have seen earlier,sequences returned from the Illumina sequences machines are often stored in FASTQ format. <div align="center"> <img src="imgs/fq1.png" alt="igv" height="200" width="600"> </div> --- ##FastQ in Bioconductor. Illumina sequences as FastQ files can be handled in Bioconductor using the functions in the **Biostrings** package as well as the **ShortRead** package. --- ##ShortRead Packages To make use of a ShortRead package we must first install and load the library. ```r if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("ShortRead") ``` --- ##Data In this session we will be making use of some public datasets from the ENCODE consortium. We will be using raw sequence reads in fastQ format which have been generated from an RNAseq experiment. This RNAseq data has been generated from the human cell line **GM12878** and the link to experiment can be found [here](https://www.encodeproject.org/experiments/ENCSR297UBP/) or a direct link to FastQ for replicate 2 we are using can be found [here.](https://www.encodeproject.org/files/ENCFF000CXH/@@download/ENCFF000CXH.fastq.gz) --- ##Data Locally For this session, I have taken the first 100,000 reads from the FastQ file **ENCFF000CXH.fastq.gz** to create **data/sampled_ENCFF000CXH.fastq.gz**. This should allow us to quickly perform some processing and analysis of this data as well as include the smaller file size example data in our zip archive. This can be found in: **data/sampled_ENCFF000CXH.fastq.gz** --- ##Data Locally Even though we will work with smaller sample for the presentation, the ShortRead allows us to handle large sequencing datasets in a memory efficient manner. We will review this at the end of the session. --- ##Reading in FastQ File The **ShortRead** package allows us to import FastQ files into R using the **readFastq()** function. This function returns a **ShortReadQ** storing the information from the FastQ file (sequence, quality of sequence and unique read identifiers). ```r library(ShortRead) fastQ <- readFastq("data/sampled_ENCFF000CXH.fastq.gz") class(fastQ) ``` ``` ## [1] "ShortReadQ" ## attr(,"package") ## [1] "ShortRead" ``` --- ##FastQ Data Object We can get a very simple summary of the **ShortReadQ** object by typing in the variable name. ```r fastQ ``` ``` ## class: ShortReadQ ## length: 100000 reads; width: 76 cycles ``` --- ##FastQ Data Object We can use the familiar **length()** function to report the total number of reads as we have done for vectors or GRanges objects. ```r length(fastQ) ``` ``` ## [1] 100000 ``` --- ##FastQ Data Object Or we can use the **width()** function to find the size of each read/sequence in fastQ as we have done with DNAstringSet and GRanges objects ```r readLengths <- width(fastQ) readLengths[1:10] ``` ``` ## [1] 76 76 76 76 76 76 76 76 76 76 ``` --- ##FastQ Data Object We can subset or index **ShortReadQ** objects using the same methods as we have for vectors and GRanges. Here I use a vector to retrieve first 10 reads in file **[Quick reminder that 1:10 is the same as c(1,2,3,4,5,6,7,8,9,10)]** ```r fastQ[1:10] ``` ``` ## class: ShortReadQ ## length: 10 reads; width: 76 cycles ``` --- ##FastQ Accessors As we have seen, just typing the object name provide a summary of total reads and maximum read length but to retrieve information such a the sequence, quality or ID we will need to use some special accessor functions. * **sread()** - Retrieve sequence of reads. * **quality()** - Retrieve quality of reads as ASCII scores. * **ids()** - Retrieve IDs of reads. --- ##FastQ Sequences We can retrieve all read sequences using the **sread()** accessor function and the **ShortReadQ** object. ```r sequenceOfReads <- sread(fastQ) class(sequenceOfReads) ``` ``` ## [1] "DNAStringSet" ## attr(,"package") ## [1] "Biostrings" ``` --- ##FastQ Sequences The sequences in reads are themselves held in an object we are quite familiar with, a **DNAStringSet** object. ```r sequenceOfReads ``` ``` ## DNAStringSet object of length 100000: ## width seq ## [1] 76 NNNNNNAANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCTCTGCNN ## [2] 76 NNNNNNTTNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNCGGCGCNN ## [3] 76 NNNNNNCANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNGGATTANN ## [4] 76 NNNNNNTANNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNAGCGCANN ## [5] 76 NNNNNNACNNNNNNNNNNNNNNNNNNNNNNN...NNNNNNNNNNNNNNNNNNNNNNNTCCTGCNN ## ... ... ... ## [99996] 76 ACTGCTAAATGCACAACATCGGGCAGGGAAG...GCCCAAGGGGAAGAAATCCGCAGCTCAACAG ## [99997] 76 CAACAGGACTCGGTGGAGTGCGCTACTCAGG...AACATAGAGAAGGACATTGCGGCTCATATCA ## [99998] 76 CCCCGCCGGCGGCAGCGGCTTTGGACGAGGG...GCAAGGGGTCCACCCCTACCGACGCTAGACC ## [99999] 76 GAGATTGGCATTCCCGTGGCCCTGCTCAGCT...GACATCTTCTCGCGTTTCGGCCGCACGGGGA ## [100000] 76 ATCTCGTATGCCCTCTTCTGCTTTTAAAAAA...NTGNCGGTTNCGGCGGTCACCACACGCAGCA ``` --- ##FastQ Sequences This means we can use all the great functions from the **Biostrings** library on this **DNAStringSet** object. Here we get the occurrence of nucleotide bases in reads. Here we get the alphabet frequency of the first two. ```r alpFreq <- alphabetFrequency(sequenceOfReads) alpFreq[1:2,] ``` ``` ## A C G T M R W S Y K V H D B N - + . ## [1,] 2 3 1 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0 ## [2,] 0 3 3 2 0 0 0 0 0 0 0 0 0 0 68 0 0 0 ``` --- ##FastQ IDs We can also extract the IDs for every read using the **id()** function. The function again returns an object from the **Biostrings** packages, here the more generic **BStringSet** object. ```r idsOfReads <- id(fastQ) class(idsOfReads) ``` ``` ## [1] "BStringSet" ## attr(,"package") ## [1] "Biostrings" ``` ```r idsOfReads[1:2] ``` ``` ## BStringSet object of length 2: ## width seq ## [1] 31 42JV5AAXX_HWI-EAS229_1:6:1:0:39 ## [2] 32 42JV5AAXX_HWI-EAS229_1:6:1:0:155 ``` --- ##FastQ IDs Again we can act on the **BStringSet** object just as we did on **DNAStringSet** objects. Here we simply convert the **BStringSet** to a character vector with the **as.character()** function. ```r Ids <- as.character(idsOfReads) Ids[1:4] ``` ``` ## [1] "42JV5AAXX_HWI-EAS229_1:6:1:0:39" "42JV5AAXX_HWI-EAS229_1:6:1:0:155" ## [3] "42JV5AAXX_HWI-EAS229_1:6:1:0:372" "42JV5AAXX_HWI-EAS229_1:6:1:0:851" ``` --- ##FastQ Quality A fundamental difference betweem Fasta and FastQ files is the **Q**uality scores containined in FastQ. As we have seen, quality scores are stored as ASCII characters representing -log10 probability of base being wrong (Larger scores would be associated to more confident base calls). A comprehensive description of phred quality can be found on the wiki page for [FastQ](https://en.wikipedia.org/wiki/FASTQ_format#Quality). --- ##FastQ Quality We can also extract the qualities for every read using the **quality()** function. The returned object is a special extension of **BStringSet** object, the **FastqQuality** object. ```r quals <- quality(fastQ) class(quals) ``` ``` ## [1] "FastqQuality" ## attr(,"package") ## [1] "ShortRead" ``` --- ##FastQ Quality The object itself contains our quality scores as characters in ASCII format. ```r quals ``` ``` ## class: FastqQuality ## quality: ## BStringSet object of length 100000: ## width seq ## [1] 76 ###############################...############################### ## [2] 76 ###############################...############################### ## [3] 76 ###############################...############################### ## [4] 76 ###############################...############################### ## [5] 76 ###############################...############################### ## ... ... ... ## [99996] 76 ###############################...############################### ## [99997] 76 BBBABB@?BABC?=@B?A8??B8???@?8@A...?=?=8=46<;8?;315689/<5,5468856< ## [99998] 76 BBBBABBBBBBB@BA@AB9=B@';+0:3;=B...############################### ## [99999] 76 B==35;:1<;;54;;845555;545436435...45444455544#################### ## [100000] 76 ?BB<C>/6A@AC6=BB@A75B##########...############################### ``` --- ##FastQ Quality We can find out how to translate from these ASCII characters into their corresponding -log10 pvalues using the **encoding()** function on our **FastqQuality** object of qualities. ```r qualityEncoding <- encoding(quals) qualityEncoding ``` ``` ## ! " # $ % & ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 ## ; < = > ? @ A B C D E F G H I J K L M N O P Q R S T ## 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 ## U V W X Y Z [ \\ ] ^ _ ` a b c d e f g h i j k l m n ## 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 ## o p q r s t u v w x y z { | } ~ ## 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 ``` --- ##FastQ Quality We could use our named vector of quality encoding to translate our **FastqQuality** object into qualities one read at a time. We can subset the **FastqQuality** object using standard vector indexing using **[**. We can extract a **BString** object using list indexing **[[** ```r quals[1] ``` ``` ## class: FastqQuality ## quality: ## BStringSet object of length 1: ## width seq ## [1] 76 ##################################...################################# ``` ```r quals[[1]] ``` ``` ## 76-letter BString object ## seq: ####################################...#################################### ``` --- ##FastQ Quality We can use the **strsplit()** to split out quality into an array of individual ASCII quality score and use the quality encoding to translate to -log10 pvalues. The **strsplit()** function takes a character vector to split and a character vector to split by. Here we split by nothing **""** to give us a vector of individual characters. ```r toTranslateList <- strsplit(as.character(quals[[1]]),"") toTranslate <- unlist(toTranslateList) toTranslate ``` ``` ## [1] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" ## [20] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" ## [39] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" ## [58] "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" "#" ``` --- ##FastQ Quality And finally we can translate vector of characters to scores using the named vector of quality encoding. ```r qualityEncoding[toTranslate] ``` ``` ## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # ## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ## # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # ## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ``` --- ##FastQ Quality Thankfully, there are some functions available in the **ShortRead** package which will perform this translation for us. We can obtain the sum -log10 pvalues using the **alphabetScore()** function for all reads. ```r readScores<- alphabetScore(quals) readScores[1] ``` ``` ## [1] 152 ``` ```r sum(qualityEncoding[toTranslate]) ``` ``` ## [1] 152 ``` --- ##FastQ Quality We can also obtain the quality scores over cycles for all reads using the function **as(_myQualities_, "matrix")** ```r matrixOfQualities <- as(quals,"matrix") rowSums(matrixOfQualities)[1] ``` ``` ## [1] 152 ``` --- ##ShortRead Functions The **ShortRead** package has many functions available to allow us to collect useful metrics from our ShortRead object. One very useful function is the **alphabetByCycle()** function which provides a quick method to summarise base occurrence of cycles. Here we apply **alphabetByCycle()** function to the sequence information and show the occurrence of main 4 bases over first 15 cycles. ```r alpByCyc <- alphabetByCycle(sequenceOfReads) alpByCyc[1:4,1:15] ``` ``` ## cycle ## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] ## A 35581 28074 26165 26396 25916 25073 24888 31856 24865 24992 24783 ## C 19925 17360 26879 20644 28101 22186 22014 21524 22156 21995 28825 ## G 23285 25894 25343 24627 24004 30682 23864 23881 23718 31308 24423 ## T 20613 27041 19606 26067 19529 19519 26678 20178 26740 19083 19364 ## cycle ## alphabet [,12] [,13] [,14] [,15] ## A 24898 24829 24821 25086 ## C 28943 22270 22359 29346 ## G 24294 31085 23875 23505 ## T 19263 19163 26182 19304 ``` --- ##ShortRead Functions We can also apply our new **alphabetByCycle()** function to the quality scores. ```r qualsByCyc <- alphabetByCycle(quals) qualsByCyc[1:4,1:15] ``` ``` ## cycle ## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] ## 0 0 0 0 0 0 0 0 0 0 0 0 0 ## ! 0 0 0 0 0 0 0 0 0 0 0 0 0 ## " 0 0 0 0 0 0 0 0 0 0 0 0 0 ## # 4808 5204 5514 5728 5900 6101 6274 6490 6666 6877 7157 7452 7756 ## cycle ## alphabet [,14] [,15] ## 0 0 ## ! 0 0 ## " 0 0 ## # 8056 8345 ``` --- ##ShortRead Functions We can use the **table** function to identify the number of times a sequence appears in our FastQ file's sequence reads. ```r readOccurence <- table(sequenceOfReads) sort(readOccurence,decreasing = TRUE)[1:2] ``` ``` ## sequenceOfReads ## 1 ANNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN ## 2 CTGATGTTCGCCGACCGTTGACTATTCTCTACAAACCACAAAGACATTGGAACACTATACCTATTATTCGGCGCAT ## Freq ## 1 403 ## 2 291 ``` --- ##ShortRead Functions We can identify duplicated reads (potentially arising from PCR over amplification) by using the **srduplicated()** function and the **ShortReadQ** object. This returns a logical vector identifying which reads' sequences are duplicates (occur more than once in file). Note that the first time a sequence appears in file is not a duplicate but the second, third, fourth times etc are. ```r duplicates <- srduplicated(fastQ) duplicates[1:3] ``` ``` ## [1] FALSE FALSE FALSE ``` --- ##ShortRead Functions We can use this now to get a quick measure of sequence duplication rate using the **table** function. ```r table(duplicates) ``` ``` ## duplicates ## FALSE TRUE ## 93437 6563 ``` --- ##ShortReads' qa() Function The **ShortRead** package also contains a function to generate a simple quality control report. The **qa()** function accepts a FastQ file and returns a **FastqQA** object. ```r my_QA <- qa("data/sampled_ENCFF000CXH.fastq.gz") my_QA ``` ``` ## class: FastqQA(10) ## QA elements (access with qa[["elt"]]): ## readCounts: data.frame(1 3) ## baseCalls: data.frame(1 5) ## readQualityScore: data.frame(512 4) ## baseQuality: data.frame(95 3) ## alignQuality: data.frame(1 3) ## frequentSequences: data.frame(50 4) ## sequenceDistribution: data.frame(38 4) ## perCycle: list(2) ## baseCall: data.frame(380 4) ## quality: data.frame(2416 5) ## perTile: list(2) ## readCounts: data.frame(0 4) ## medianReadQualityScore: data.frame(0 4) ## adapterContamination: data.frame(1 1) ``` --- ##ShortReads' qa() Function We can then use the **report()** function to generate a simple report. ```r myReport <- report(my_QA) myReport ``` ``` ## [1] "/tmp/RtmppVwLJW/file163311fc1a00/index.html" ``` If you want to keep the report, you can save it to a specific directory ```r report(my_QA, dest="QC_report") ``` ``` ## [1] "QC_report/index.html" ``` --- ##ShortReads' qa() Function Finally we can review the report in a browser or use the browseURL function to open it in a browser from R. ```r browseURL(myReport) ``` --- ##Trimming When we observe low quality at the end of reads we may wish to remove the low quality bases for later alignment to the genome. The **trimTails()** function trims reads from the 3', removing bases which fall below a desired quality. The **trimTails()** function accepts arguments specifying the **ShortReadQ** object, the minimum number of successive bases required to be below quality cut-off for trimming and the actual cut-off score. ```r TrimmedFastq <- trimTails(fastQ,20,"5") TrimmedFastq ``` ``` ## class: ShortReadQ ## length: 100000 reads; width: 19..76 cycles ``` --- ##Exporting Fastq Files Now we have trimmed our FastQ reads, we can export these reads for further analysis using the **writeFastq()** function. ```r writeFastq(TrimmedFastq,"myTrimmed_Fastq.fastq.gz") ``` # Can something do this all for me? There are several utility programs that will provide you with QC and trim your data for you, with less input from you. We are fans of the [fastp](https://pubmed.ncbi.nlm.nih.gov/30423086/) as it does some basic QC and trims your fastqs, and it does it very quickly. To make this available in R, the BRC wrapped this in the Bioconductor package [Rfastp](http://www.bioconductor.org/packages/release/bioc/html/Rfastp.html). ```r library(Rfastp) rfastp_report <- rfastp(read1 = "data/sampled_ENCFF000CXH.fastq.gz", outputFastq ="data/Rfastp_ENCFF000CXH") ``` # Rfastp QC Summary By default fastp will make a html report to summarize your result. But the Rfastp wrapper allows you to look at some of them in R. ```r dfsummary <- qcSummary(rfastp_report) dfsummary ``` ``` ## Before_QC After_QC ## total_reads 1.000000e+05 6.777800e+04 ## total_bases 7.600000e+06 5.151128e+06 ## q20_bases 4.965570e+06 4.297295e+06 ## q30_bases 2.872723e+06 2.486809e+06 ## q20_rate 6.533640e-01 8.342430e-01 ## q30_rate 3.779900e-01 4.827700e-01 ## read1_mean_length 7.600000e+01 7.600000e+01 ## gc_content 4.895160e-01 4.875440e-01 ``` # Rfastp QC Plot .pull-left[ ```r curvePlot(rfastp_report, curve="quality_curves") ``` ![](FastQInBioconductor_files/figure-html/unnamed-chunk-4-1.png)<!-- --> ] .pull-left[ ```r curvePlot(rfastp_report, curve="content_curves") ``` ![](FastQInBioconductor_files/figure-html/unnamed-chunk-5-1.png)<!-- --> ] --- ##Handling Large Files So far we have used a subsampled FastQ file to review some of the functions available to us in the ShortRead package. The **FastqSampler()** and **FastqStreamer()** functions allow us to read portions of a FastQ file. --- ##Handling Large Files For evaluating quality of our FastQ, we do not require the entire file but perhaps only 100,000 random reads to assess quality. The **FastqSampler()** function allows us to specifiy how many reads we wish to sample when reading in a file. The **FastqSampler()** function returns a FastqSampler object we can use with **yield()** function to obtain a sample from ```r sampleToRead <- FastqSampler("data/sampled_ENCFF000CXH.fastq.gz", n=100) yield(sampleToRead) ``` ``` ## class: ShortReadQ ## length: 100 reads; width: 76 cycles ``` --- ##Handling Large Files The **FastqStreamer()** function works in a similar manner accept instead of randomly sampling the file, it allows us to read chunks of the file at a time. ```r sampleToRead <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz", n=100) first100Reads <- yield(sampleToRead) second100Reads <- yield(sampleToRead) ``` --- ##Handling Large Files We can stream over entire files using a loop, here in chunks of 25000 ```r fq <- FastqStreamer("data/sampled_ENCFF000CXH.fastq.gz", n=25000) while (length(fq_stream <- yield(fq)) > 0) { print(length(fq_stream )) } ``` ``` ## [1] 25000 ## [1] 25000 ## [1] 25000 ## [1] 25000 ``` --- ##Time for an Exercise [Link_to_exercises](../../exercises/exercises/fastq_exercise.html) [Link_to_answers](../../exercises/answers/fastq_answers.html)