class: center, middle, inverse, title-slide # CrispR screen analysis using R and Bioconductor
### Rockefeller University, Bioinformatics Resource Centre ###
http://rockefelleruniversity.github.io/Bioconductor_Introduction/
--- ## High-throughput Crispr Screening High-throughput Crispr or shRNA screening provides a method to simultaneously assess the functional roles of 1000s genes under specific conditions/perturbations.  --- ## Analysis of Crispr screens We can take advantage of high throughput sequencing to assess the abundance of cells containing each CrispR/shRNA. By comparing the abundance of CrispR/shRNA in test conditions versus control conditions we can assess which genes have a functional role related to the test condition. i.e. If a set sgRNA guides targeting are enriched after drug treatment, that gene is important to susceptiabilty to the drug. --- ## Tools for analysis There are many tools available to analyse sgRNA/shRNA screens. These incude the popular PinAPL-Py and MAGeCK toolsets. [PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens](https://www.nature.com/articles/s41598-017-16193-9) [MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-014-0554-4) --- ## Tools in R. Much of the tools used in these toolsets are implemented within R (PinAPL=Py uses Bowtie, MAGeCK uses very similar normalisations to DESeq. We can therefore use some of the skills we have learnt to deal with other high throughput sequencing data and then apply it to identifying enriched sgRNA/shRNAs in our test conditions versus control. --- ## PinAPL-Py [PinAPL-Py](https://www.nature.com/articles/s41598-017-16193-9) offers a python/command line and web GUI interface to sgRNA screen analysis. We will follow some of their methods and use their test datasets in our upcoming analysis. You can find the GUI and test datasets available [here.](http://pinapl-py.ucsd.edu) --- ## Download the data We will use their test data and results data from this GUI. We can download this directly by using the download.file command ```r download.file("https://github.com/LewisLabUCSD/PinAPL-Py/archive/master.zip","pieappleData.zip") unzip("pieappleData.zip") download.file("http://pinapl-py.ucsd.edu/example-data","TestData.zip") unzip("TestData.zip") download.file("http://pinapl-py.ucsd.edu/run/download/example-run","Results.zip") unzip("Results.zip") ``` --- ## Quality control of FastQ Once we have downloaded the raw fastQ data (either from their Gtihub or from SRA) we can use the [ShortRead package](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) to review our sequence data quality. We have reviewed how to work with raw sequencing data in the [**FastQ in Bioconductor** session.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#1) First we load the [ShortRead library.](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) ```r library(ShortRead) ``` --- ## Working with FQs from Crispr screen. First we will review the raw sequencing reads using functions in the [ShortRead package.](https://bioconductor.org/packages/release/bioc/html/ShortRead.html) This is similar to our QC we performed for RNAseq and ChIPseq. We do not need to review all reads in the file to can gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory. **Remember** when we subsample we retrieve random reads from across the entire fastQ file. This is important as fastQ files are often ordered by their position on the sequencer. --- ## Crispr screen FQs. We can subsample from a fastQ file using functions in **ShortRead** package. Here we use the [**FastqSampler** and **yield** function](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#41) to randomly sample a defined number of reads from a fastQ file. Here we subsample 1 million reads. ```r fqSample <- FastqSampler("PinAPL-Py-master/Data/Tox-A_R01_98_S2_L008_R1_001_x.fastq.gz",n=10^6) fastq <- yield(fqSample) ``` --- ## Working with Crispr fastQ The resulting object is a [ShortReadQ object](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#10) showing information on the number of cycles, base pairs in reads, and number of reads in memory. ```r fastq ``` ``` ## class: ShortReadQ ## length: 500000 reads; width: 50 cycles ``` --- ## Working with Crispr fastQ If we wished, we can assess information from the fastQ file using our [familiar accessor functions.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#15) * **sread()** - Retrieve sequence of reads. * **quality()** - Retrieve quality of reads as ASCI scores. * **id()** - Retrieve IDs of reads. ```r readSequences <- sread(fastq) readQuality <- quality(fastq) readIDs <- id(fastq) readSequences ``` ``` ## DNAStringSet object of length 500000: ## width seq ## [1] 50 TCGAATCTTGTGGAAATGACGAAACACAGCCCTTGACATGACAATTGTGG ## [2] 50 TCGAATCTTGTGGAAAGGACGAAACACCGAAGAGGAATAACCCGGCCTGG ## [3] 50 TCGAATCTTGTGGAAAGGACGAAACGCAGACAATCCCCCCCCTGACATAG ## [4] 50 TCGAATCTTGTGGAAAGGGCGAAACACTGCTGGTCGAGGCGTTCCGCATG ## [5] 50 TCGAATCTTGTGGAAAGAGGAAACACCGAGCCATCCCAGCCCACTAGCGG ## ... ... ... ## [499996] 50 TCGAATCTTGTGGAAAGGACGAAACACCGGCTTCTTGTCCGCGCGCACGG ## [499997] 50 TCGAATCTTGTGGAAAGGACGAAACACCGCCCCAGACGGAAGCTCATGTG ## [499998] 50 TCGAATCTTGTGGAAAGGACGAAACACCGTGATGAATTCAGCTTTCCGGT ## [499999] 50 TCGAATCTTGTGGAAAGGACGAAACACCGAGTGCAATTGCAAGGTCATAG ## [500000] 50 TCGAATCTTGTGGAAAGGACGAAACACCATGTCCGTGTCACAACTTTGTT ``` --- ## Working with Crispr fastQ We can check some simple quality metrics for our subsampled fastQ data. First, we can review the overall reads' quality scores. We use the [**alphabetScore()** function with our read's qualitys](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#28) to retrieve the sum quality for every read from our subsample. ```r readQuality <- quality(fastq) readQualities <- alphabetScore(readQuality) readQualities[1:10] ``` ``` ## [1] 1282 1254 1095 1239 1382 1408 1324 1325 1318 1474 ``` --- ## Working with Crispr fastQ We can then produce a histogram of quality scores to get a better understanding of the distribution of scores. ```r library(ggplot2) toPlot <- data.frame(ReadQ=readQualities) ggplot(toPlot,aes(x=ReadQ))+geom_histogram()+theme_minimal() ``` ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` <!-- --> --- ## Working with Crispr fastQ We can review the occurrence of DNA bases within reads and well as the occurrence of DNA bases across sequencing cycles using the [**alphabetFrequency()**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#18) and [**alphabetByCycle()**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#30) functions respectively. Here we check the overall frequency of **A, G, C, T and N (unknown bases)** in our sequence reads. ```r readSequences <- sread(fastq) readSequences_AlpFreq <- alphabetFrequency(readSequences) readSequences_AlpFreq[1:3,] ``` ``` ## A C G T M R W S Y K V H D B N - + . ## [1,] 16 10 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [2,] 17 11 15 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ## [3,] 16 15 11 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ``` --- ## Working with Crispr fastQ Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads. ```r summed__AlpFreq <- colSums(readSequences_AlpFreq) summed__AlpFreq[c("A","C","G","T","N")] ``` ``` ## A C G T N ## 7275781 6043883 6564129 5105990 10217 ``` --- ## Working with Crispr fastQ We can review DNA base occurrence by cycle using the [**alphabetByCycle()** function.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#30) ```r readSequences_AlpbyCycle <- alphabetByCycle(readSequences) readSequences_AlpbyCycle[1:4,1:10] ``` ``` ## cycle ## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] ## A 374 7816 9634 482525 481226 257 203 172 189 241 ## C 9077 482302 7821 538 7954 9864 481477 336 248 132 ## G 7630 9323 481741 9073 750 131 482 7825 10814 488911 ## T 473741 559 804 7864 10058 489748 17838 491667 488749 10716 ``` --- ## Working with Crispr fastQ We often plot this to visualise the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame. ```r AFreq <- readSequences_AlpbyCycle["A",] CFreq <- readSequences_AlpbyCycle["C",] GFreq <- readSequences_AlpbyCycle["G",] TFreq <- readSequences_AlpbyCycle["T",] toPlot <- data.frame(Count=c(AFreq,CFreq,GFreq,TFreq), Cycle=rep(1:max(width(readSequences)),4), Base=rep(c("A","C","G","T"),each=max(width(readSequences)))) ``` --- ## Working with Crispr fastQ Now we can plot the frequencies using ggplot2 ```r ggplot(toPlot,aes(y=Count,x=Cycle,colour=Base))+geom_line()+ theme_bw() ``` <!-- --> --- ## Assess by cycle with raw Crispr data We can also assess mean read quality over cycles. This will allow us to identify whether there are any isses with quality dropping off over time. For this we use the [**as(*read_quality*,"matrix")**](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/FastQInBioconductor.html#29) function first to translate our ASCI quality scores to numeric quality scores. ```r qualAsMatrix <- as(readQuality,"matrix") qualAsMatrix[1:2,] ``` ``` ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] ## [1,] 32 32 12 32 37 41 37 41 41 22 32 37 37 32 ## [2,] 32 27 32 32 32 32 37 37 41 27 37 32 37 37 ## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] ## [1,] 12 37 12 32 32 27 22 27 37 27 22 27 ## [2,] 12 32 37 32 12 12 32 22 22 12 27 27 ## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] ## [1,] 12 12 27 12 22 12 22 22 12 12 12 27 ## [2,] 22 27 22 22 37 12 27 37 22 27 22 12 ## [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] ## [1,] 27 12 22 37 37 37 37 27 12 12 12 27 ## [2,] 22 12 27 12 27 12 22 22 12 12 12 22 ``` --- ## Assess by cycle with raw Crispr data We can now [visualise qualities across cycles using a lineplot.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/exercises/answers/fastq_answers.html) ```r toPlot <- colMeans(qualAsMatrix) plot(toPlot) ``` <!-- --> --- ## Assess by cycle with raw Crispr data In this case the distribution of reads quality scores and read qualities over time look okay. We will often want to access fastQ samples together to see if any samples stick out by these metrics. Here we observed a second population of low quality scores so will remove some reads with low quality scores and high unknown bases. --- class: inverse, center, middle # Aligning data <html><div style='float:left'></div><hr color='#EB811B' size=1px width=720px></html> --- ## Aligning reads to sgRNA library Following assessment of read quality and any read filtering we applied, we will want to align our reads to the sgRNA library so as to quantify sgRNA abundance in our samples. Since sgRNA screen reads will align continously against our sgRNA we can use [our genomic aligners we have seen in previous sessions.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#7) The resulting BAM file will contain aligned sequence reads for use in further analysis. <div align="center"> <img src="imgs/sam2.png" alt="igv" height="200" width="600"> </div> --- ## Creating a sgRNA reference First we need to retrieve the sequence information for the sgRNA guides of interest in [FASTA format](https://rockefelleruniversity.github.io/Genomic_Data/presentations/slides/GenomicsData.html#9) We can read the sequences of our sgRNA probes from the TSV file downloaded from the Github page [here](http://pinapl-py.ucsd.edu/example-data). ```r GeCKO <- read.delim("PinAPL-py_demo_data/GeCKOv21_Human.tsv") GeCKO[1:2,] ``` ``` ## gene_id UID seq ## 1 A1BG HGLibA_00001 GTCGCTGAGCTCCGATTCGA ## 2 A1BG HGLibA_00002 ACCTGTAGTTGCCGGCGTGC ``` --- ## Creating a sgRNA fasta Now we can create a [**DNAStringSet** object from the retrieved sequences](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/SequencesInBioconductor.html#17) as we have done for full chromosome sequences. ```r require(Biostrings) sgRNAs <- DNAStringSet(GeCKO$seq) names(sgRNAs) <- GeCKO$UID ``` --- ## Creating a sgRNA fasta Now we have a **DNAStringSet** object we can use the [**writeXStringSet** to create our FASTA file of sequences to align to.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/SequencesInBioconductor.html#22) ```r writeXStringSet(sgRNAs, file="GeCKO.fa") ``` --- ## Creating an Rsubread index We will be aligning using the **subjunc** algorithm from the authors of subread. We can therefore use the **Rsubread** package. Before we attempt to align our fastq files, we will need to first build an index from our reference genome using the **buildindex()** function. The [**buildindex()** function simply takes the parameters of our desired index name and the FASTA file to build index from.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/r_course/presentations/slides/AlignmentInBioconductor.html#14) The index is quite small so we dont need to split index or worry about memory parameters. ```r library(Rsubread) buildindex("GeCKO","GeCKO.fa", indexSplit=FALSE) ``` ``` ## ## ========== _____ _ _ ____ _____ ______ _____ ## ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ## ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ ## Rsubread 2.2.6 ## ## //================================= setting ==================================\\ ## || || ## || Index name : GeCKO || ## || Index space : base space || ## || Index split : no-split || ## || Repeat threshold : 100 repeats || ## || Gapped index : no || ## || || ## || Free / total memory : 5.1GB / 6.8GB || ## || || ## || Input files : 1 file in total || ## || o GeCKO.fa || ## || || ## \\============================================================================// ## ## //================================= Running ==================================\\ ## || || ## || Check the integrity of provided reference sequences ... || ## || No format issues were found || ## || Scan uninformative subreads in reference sequences ... || ## || Estimate the index size... || ## || 1.4 GB of memory is needed for index building. || ## || Build the index... || ## || Save current index block... || ## || [ 0.0% finished ] || ## || [ 10.0% finished ] || ## || [ 20.0% finished ] || ## || [ 30.0% finished ] || ## || [ 40.0% finished ] || ## || [ 50.0% finished ] || ## || [ 60.0% finished ] || ## || [ 70.0% finished ] || ## || [ 80.0% finished ] || ## || [ 90.0% finished ] || ## || [ 100.0% finished ] || ## || || ## || Total running time: 0.1 minutes. || ## || Index GeCKO was successfully built. || ## || || ## \\============================================================================// ``` --- ## Rsubread sgRNA alignment We can align our raw sequence data in fastQ format to the new FASTA file of sgRNA sequences using the **Rsubread** package. Specifically we will be using the **align** function as this utilizes the subread genomic alignment algorithm. The [**align()** function accepts arguments for the index to align to, the fastQ to align, the name of output BAM, the mode of alignment (rna or dna) and the phredOffset.](https://rockefelleruniversity.github.io/Bioconductor_Introduction/presentations/slides/AlignmentInBioconductor.html#15) ```r myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz" myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs), nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA") ``` ``` ## ## ========== _____ _ _ ____ _____ ______ _____ ## ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ## ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ ## Rsubread 2.2.6 ## ## //================================= setting ==================================\\ ## || || ## || Function : Read alignment (DNA-Seq) || ## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz || ## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) || ## || Index name : GeCKO || ## || || ## || ------------------------------------ || ## || || ## || Threads : 4 || ## || Phred offset : 33 || ## || Min votes : 3 / 10 || ## || Max mismatches : 3 || ## || Max indel length : 5 || ## || Report multi-mapping reads : no || ## || Max alignments per multi-mapping read : 1 || ## || || ## \\============================================================================// ## ## //================= Running (25-Aug-2020 18:30:01, pid=6501) =================\\ ## || || ## || Check the input reads. || ## || The input file contains base space reads. || ## || Initialise the memory objects. || ## || Estimate the mean read length. || ## || The range of Phred scores observed in the data is [2,41] || ## || Create the output BAM file. || ## || Check the index. || ## || Init the voting space. || ## || Global environment is initialised. || ## || Load the 1-th index block... || ## || The index block has been loaded. || ## || Start read mapping in chunk. || ## || 5% completed, 0.4 mins elapsed, rate=196.0k reads per second || ## || 11% completed, 0.4 mins elapsed, rate=230.9k reads per second || ## || 18% completed, 0.4 mins elapsed, rate=242.9k reads per second || ## || 25% completed, 0.4 mins elapsed, rate=243.9k reads per second || ## || 32% completed, 0.4 mins elapsed, rate=246.4k reads per second || ## || 38% completed, 0.4 mins elapsed, rate=251.4k reads per second || ## || 45% completed, 0.4 mins elapsed, rate=252.2k reads per second || ## || 51% completed, 0.4 mins elapsed, rate=253.4k reads per second || ## || 58% completed, 0.4 mins elapsed, rate=255.6k reads per second || ## || 64% completed, 0.4 mins elapsed, rate=257.9k reads per second || ## || 69% completed, 0.4 mins elapsed, rate=15.2k reads per second || ## || 72% completed, 0.4 mins elapsed, rate=15.9k reads per second || ## || 76% completed, 0.4 mins elapsed, rate=16.6k reads per second || ## || 79% completed, 0.4 mins elapsed, rate=17.2k reads per second || ## || 82% completed, 0.4 mins elapsed, rate=17.8k reads per second || ## || 86% completed, 0.4 mins elapsed, rate=18.6k reads per second || ## || 90% completed, 0.4 mins elapsed, rate=19.3k reads per second || ## || 95% completed, 0.4 mins elapsed, rate=20.2k reads per second || ## || 97% completed, 0.4 mins elapsed, rate=20.7k reads per second || ## || || ## || Completed successfully. || ## || || ## \\==================================== ====================================// ## ## //================================ Summary =================================\\ ## || || ## || Total reads : 500,000 || ## || Mapped : 0 (0.0%) || ## || Uniquely mapped : 0 || ## || Multi-mapping : 0 || ## || || ## || Unmapped : 500,000 || ## || || ## || Indels : 0 || ## || || ## || Running time : 0.4 minutes || ## || || ## \\============================================================================// ``` --- ## Rsubread sgRNA alignment We can now assess how well our alignment has done by reviewing the total number of mapped reads output by the **align** function. ```r myMapped ``` ``` ## Control_R1_S14_L008_R1_001_x.bam ## Total_reads 500000 ## Mapped_reads 0 ## Uniquely_mapped_reads 0 ## Multi_mapping_reads 0 ## Unmapped_reads 500000 ## Indels 0 ``` --- ## Rsubread sgRNA alignment To see why we have been unable to map we can look at the Rsubread manual and review the [guide for mapping to miRNAs.](https://bioconductor.org/packages/release/bioc/vignettes/Rsubread/inst/doc/SubreadUsersGuide.pdf). We can see here that the problem is that we need to reduce the number of subreads required for accepting a hit. We can control this with the **TH1** parameter which we now set ot 1. ```r myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz" myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs), nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1) ``` ``` ## ## ========== _____ _ _ ____ _____ ______ _____ ## ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ## ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ ## Rsubread 2.2.6 ## ## //================================= setting ==================================\\ ## || || ## || Function : Read alignment (DNA-Seq) || ## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz || ## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) || ## || Index name : GeCKO || ## || || ## || ------------------------------------ || ## || || ## || Threads : 4 || ## || Phred offset : 33 || ## || Min votes : 1 / 10 || ## || Max mismatches : 3 || ## || Max indel length : 5 || ## || Report multi-mapping reads : no || ## || Max alignments per multi-mapping read : 1 || ## || || ## \\============================================================================// ## ## //================= Running (25-Aug-2020 18:30:25, pid=6501) =================\\ ## || || ## || Check the input reads. || ## || The input file contains base space reads. || ## || Initialise the memory objects. || ## || Estimate the mean read length. || ## || The range of Phred scores observed in the data is [2,41] || ## || Create the output BAM file. || ## || Check the index. || ## || Init the voting space. || ## || Global environment is initialised. || ## || Load the 1-th index block... || ## || The index block has been loaded. || ## || Start read mapping in chunk. || ## || 5% completed, 0.3 mins elapsed, rate=188.3k reads per second || ## || 11% completed, 0.4 mins elapsed, rate=213.1k reads per second || ## || 17% completed, 0.4 mins elapsed, rate=223.4k reads per second || ## || 24% completed, 0.4 mins elapsed, rate=228.3k reads per second || ## || 31% completed, 0.4 mins elapsed, rate=228.7k reads per second || ## || 37% completed, 0.4 mins elapsed, rate=228.8k reads per second || ## || 44% completed, 0.4 mins elapsed, rate=230.7k reads per second || ## || 50% completed, 0.4 mins elapsed, rate=233.6k reads per second || ## || 57% completed, 0.4 mins elapsed, rate=234.3k reads per second || ## || 64% completed, 0.4 mins elapsed, rate=234.3k reads per second || ## || 70% completed, 0.4 mins elapsed, rate=15.5k reads per second || ## || 73% completed, 0.4 mins elapsed, rate=16.1k reads per second || ## || 76% completed, 0.4 mins elapsed, rate=16.7k reads per second || ## || 80% completed, 0.4 mins elapsed, rate=17.2k reads per second || ## || 83% completed, 0.4 mins elapsed, rate=17.8k reads per second || ## || 86% completed, 0.4 mins elapsed, rate=18.3k reads per second || ## || 90% completed, 0.4 mins elapsed, rate=18.8k reads per second || ## || 93% completed, 0.4 mins elapsed, rate=19.3k reads per second || ## || 96% completed, 0.4 mins elapsed, rate=19.9k reads per second || ## || 99% completed, 0.4 mins elapsed, rate=20.3k reads per second || ## || || ## || Completed successfully. || ## || || ## \\==================================== ====================================// ## ## //================================ Summary =================================\\ ## || || ## || Total reads : 500,000 || ## || Mapped : 421,608 (84.3%) || ## || Uniquely mapped : 421,608 || ## || Multi-mapping : 0 || ## || || ## || Unmapped : 78,392 || ## || || ## || Indels : 0 || ## || || ## || Running time : 0.4 minutes || ## || || ## \\============================================================================// ``` --- ## Rsubread sgRNA alignment We can now see we have mapped around 80% of the reads uniquely to sgRNA library. ```r myMapped ``` ``` ## Control_R1_S14_L008_R1_001_x.bam ## Total_reads 500000 ## Mapped_reads 421608 ## Uniquely_mapped_reads 421608 ## Multi_mapping_reads 0 ## Unmapped_reads 78392 ## Indels 0 ``` --- ## Rsubread ChIPseq alignment We will be more conservative now and exclude any reads with mismatches and indels (insertions/deletions) from our aligned data. ```r myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz" myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs), nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1, maxMismatches = 0,indels = 0) ``` ``` ## ## ========== _____ _ _ ____ _____ ______ _____ ## ===== / ____| | | | _ \| __ \| ____| /\ | __ \ ## ===== | (___ | | | | |_) | |__) | |__ / \ | | | | ## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | | ## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| | ## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/ ## Rsubread 2.2.6 ## ## //================================= setting ==================================\\ ## || || ## || Function : Read alignment (DNA-Seq) || ## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz || ## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) || ## || Index name : GeCKO || ## || || ## || ------------------------------------ || ## || || ## || Threads : 4 || ## || Phred offset : 33 || ## || Min votes : 1 / 10 || ## || Max mismatches : 0 || ## || Max indel length : 0 || ## || Report multi-mapping reads : no || ## || Max alignments per multi-mapping read : 1 || ## || || ## \\============================================================================// ## ## //================= Running (25-Aug-2020 18:30:50, pid=6501) =================\\ ## || || ## || Check the input reads. || ## || The input file contains base space reads. || ## || Initialise the memory objects. || ## || Estimate the mean read length. || ## || The range of Phred scores observed in the data is [2,41] || ## || Create the output BAM file. || ## || Check the index. || ## || Init the voting space. || ## || Global environment is initialised. || ## || Load the 1-th index block... || ## || The index block has been loaded. || ## || Start read mapping in chunk. || ## || 7% completed, 0.4 mins elapsed, rate=190.9k reads per second || ## || 15% completed, 0.4 mins elapsed, rate=213.4k reads per second || ## || 22% completed, 0.4 mins elapsed, rate=218.3k reads per second || ## || 29% completed, 0.4 mins elapsed, rate=220.6k reads per second || ## || 35% completed, 0.4 mins elapsed, rate=220.6k reads per second || ## || 42% completed, 0.4 mins elapsed, rate=221.4k reads per second || ## || 48% completed, 0.4 mins elapsed, rate=214.4k reads per second || ## || 54% completed, 0.4 mins elapsed, rate=215.5k reads per second || ## || 62% completed, 0.4 mins elapsed, rate=216.3k reads per second || ## || 69% completed, 0.4 mins elapsed, rate=14.5k reads per second || ## || 73% completed, 0.4 mins elapsed, rate=15.1k reads per second || ## || 76% completed, 0.4 mins elapsed, rate=15.7k reads per second || ## || 79% completed, 0.4 mins elapsed, rate=16.2k reads per second || ## || 83% completed, 0.4 mins elapsed, rate=16.7k reads per second || ## || 86% completed, 0.4 mins elapsed, rate=17.2k reads per second || ## || 89% completed, 0.4 mins elapsed, rate=17.7k reads per second || ## || 93% completed, 0.4 mins elapsed, rate=18.3k reads per second || ## || 96% completed, 0.4 mins elapsed, rate=18.8k reads per second || ## || 98% completed, 0.4 mins elapsed, rate=19.1k reads per second || ## || || ## || Completed successfully. || ## || || ## \\==================================== ====================================// ## ## //================================ Summary =================================\\ ## || || ## || Total reads : 500,000 || ## || Mapped : 413,480 (82.7%) || ## || Uniquely mapped : 413,480 || ## || Multi-mapping : 0 || ## || || ## || Unmapped : 86,520 || ## || || ## || Indels : 0 || ## || || ## || Running time : 0.4 minutes || ## || || ## \\============================================================================// ``` --- ## Rsubread sgRNA alignment With a little more stringency we have expected less alignment. This alignment rate is in keeping with observed results from the PinAPL-Py. ```r myMapped ``` ``` ## Control_R1_S14_L008_R1_001_x.bam ## Total_reads 500000 ## Mapped_reads 413480 ## Uniquely_mapped_reads 413480 ## Multi_mapping_reads 0 ## Unmapped_reads 86520 ## Indels 0 ``` --- ## Rsubread sgRNA aligned data We can use the **GenoomicAlignments** package to read in our newly created BAM file using the **readGAlignments** function. ```r require(GenomicAlignments) temp <- readGAlignments(gsub(".fastq.gz",".bam",myFQs)) temp ``` ``` ## GAlignments object with 413480 alignments and 0 metadata columns: ## seqnames strand cigar qwidth start end ## <Rle> <Rle> <character> <integer> <integer> <integer> ## [1] HGLibB_08643 + 29S20M1S 50 1 20 ## [2] HGLibA_47249 + 29S19M2S 50 1 19 ## [3] HGLibB_55658 + 29S19M2S 50 1 19 ## [4] HGLibA_20205 + 29S20M1S 50 1 20 ## [5] HGLibB_32397 + 29S20M1S 50 1 20 ## ... ... ... ... ... ... ... ## [413476] HGLibA_35013 + 29S19M2S 50 1 19 ## [413477] HGLibB_08748 + 29S20M1S 50 1 20 ## [413478] HGLibA_23475 + 29S19M2S 50 1 19 ## [413479] HGLibB_55809 + 29S19M2S 50 1 19 ## [413480] HGLibA_62676 + 29S20M1S 50 1 20 ## width njunc ## <integer> <integer> ## [1] 20 0 ## [2] 19 0 ## [3] 19 0 ## [4] 20 0 ## [5] 20 0 ## ... ... ... ## [413476] 19 0 ## [413477] 20 0 ## [413478] 19 0 ## [413479] 19 0 ## [413480] 20 0 ## ------- ## seqinfo: 122411 sequences from an unspecified genome ``` --- ## Rsubread sgRNA aligned data Now we can review the alignments we can see how Rsubread has dealt with read sequences longer than the sgRNA to align to. As we can see in the cigar strings, subread has softcliped the start and end of reads around the point of alignment and so removed adaptors as part of alignment. ```r temp ``` ``` ## GAlignments object with 413480 alignments and 0 metadata columns: ## seqnames strand cigar qwidth start end ## <Rle> <Rle> <character> <integer> <integer> <integer> ## [1] HGLibB_08643 + 29S20M1S 50 1 20 ## [2] HGLibA_47249 + 29S19M2S 50 1 19 ## [3] HGLibB_55658 + 29S19M2S 50 1 19 ## [4] HGLibA_20205 + 29S20M1S 50 1 20 ## [5] HGLibB_32397 + 29S20M1S 50 1 20 ## ... ... ... ... ... ... ... ## [413476] HGLibA_35013 + 29S19M2S 50 1 19 ## [413477] HGLibB_08748 + 29S20M1S 50 1 20 ## [413478] HGLibA_23475 + 29S19M2S 50 1 19 ## [413479] HGLibB_55809 + 29S19M2S 50 1 19 ## [413480] HGLibA_62676 + 29S20M1S 50 1 20 ## width njunc ## <integer> <integer> ## [1] 20 0 ## [2] 19 0 ## [3] 19 0 ## [4] 20 0 ## [5] 20 0 ## ... ... ... ## [413476] 19 0 ## [413477] 20 0 ## [413478] 19 0 ## [413479] 19 0 ## [413480] 20 0 ## ------- ## seqinfo: 122411 sequences from an unspecified genome ``` --- ## Rsubread sgRNA aligned data We can then review where the softclipping and matches occur in our read by interrogating these cigar strings using the **cigar** functions to extract them from our GAlignements object. ```r cigars <- cigar(temp) cigars[1:5] ``` ``` ## [1] "29S20M1S" "29S19M2S" "29S19M2S" "29S20M1S" "29S20M1S" ``` --- ## Rsubread sgRNA aligned data We can use the **cigarToRleList** function to turn our cigar strings into a list of RLE objects. ```r cigarRLE <- cigarToRleList(cigars) cigarRLE[1] ``` ``` ## RleList of length 1 ## [[1]] ## character-Rle of length 50 with 3 runs ## Lengths: 29 20 1 ## Values : "S" "M" "S" ``` --- ## Rsubread sgRNA aligned data And as our reads are all the same length we can now turn our list of RLEs into a matrix of cigar values with rows representing reads and columns the cycles acorss the reads. ```r cigarMat <- matrix(as.vector(unlist(cigarRLE)),ncol=50,byrow = TRUE) cigarMat[1:2,] ``` ``` ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] ## [1,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" ## [2,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" ## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26] ## [1,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" ## [2,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" ## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38] ## [1,] "S" "S" "S" "M" "M" "M" "M" "M" "M" "M" "M" "M" ## [2,] "S" "S" "S" "M" "M" "M" "M" "M" "M" "M" "M" "M" ## [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50] ## [1,] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "S" ## [2,] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "S" "S" ``` --- ## Rsubread sgRNA aligned data From this we can now get the frequency of S or M using the table function. ```r cigarFreq <- apply(cigarMat,2,table) cigarFreq[1:2,] ``` ``` ## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] ## M 41 699 824 867 868 868 873 878 878 878 878 ## S 413439 412781 412656 412613 412612 412612 412607 412602 412602 412602 412602 ## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] ## M 880 880 881 902 905 883 883 882 835 767 120 ## S 412600 412600 412599 412578 412575 412597 412597 412598 412645 412713 413360 ## [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] ## M 61 44 51 123 972 2672 31947 405383 411237 411654 412178 ## S 413419 413436 413429 413357 412508 410808 381533 8097 2243 1826 1302 ## [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] ## M 412575 412575 412575 412574 412574 412572 412571 412571 412562 412555 412547 ## S 905 905 905 906 906 908 909 909 918 925 933 ## [,45] [,46] [,47] [,48] [,49] [,50] ## M 412532 412335 409707 403963 312939 2308 ## S 948 1145 3773 9517 100541 411172 ``` --- ## Rsubread sgRNA aligned data With this frequency table we can construct a ggplot to show the distibution of soft-clips and matches across the read. ```r require(ggplot2) toPlot <- data.frame(Freq=c(cigarFreq["S",],cigarFreq["M",]), Cigar=rep(c("S","M"),each=ncol(cigarFreq)), Cycle=rep(seq(1,ncol(cigarFreq)),2)) ggplot(toPlot,aes(x=Cycle,y=Freq,colour=Cigar))+geom_line()+theme_bw() ``` <!-- --> --- ## Rsubread counts per sgRNA To obtain the counts per sgRNA we can simply take the frequncy of contig occurrence. ```r counts <- data.frame(table(seqnames(temp)),row.names = "Var1") counts[1:2,,drop=FALSE] ``` ``` ## Freq ## HGLibA_00001 0 ## HGLibA_00002 1 ``` --- ## Rsubread counts per sgRNA We can now compare to the counts from the same file retreived from PinAPL-Py. ```r ss <- read.delim("New_Example_1580342908_4/Analysis/01_Alignment_Results/ReadCounts_per_sgRNA/Control_1_GuideCounts.txt") new <- merge(ss,counts,by.x=1,by.y=0) plot(new$X0,new$Freq) ``` <!-- --> --- ## Rsubread counts per sgRNA This looked quite different for low counts. We will also add 1 more additional filter to screen out reads with less than 20 matches. ```r counts <- data.frame(table(seqnames(temp[width(temp) == 20])),row.names = "Var1") new <- merge(ss,counts,by.x=1,by.y=0) plot(new$X0,new$Freq) ``` <!-- --> --- ## Rsubread counts per sgRNA Now we can align the reads and count sgRNAs we could run the counting over each of our files of interest. I have done this already and provide a solution in the exercises based on the previous slides. For now we can work with the result stored in a matrix object called **sgRNAcounts**. ```r load("data/sgRNACounts.RData") sgRNAcounts[1:4,] ``` ``` ## Control_1 Control_2 ToxB_1 ToxB_2 ## HGLibA_00001 0 7 2 8 ## HGLibA_00002 0 0 0 0 ## HGLibA_00003 1 0 0 0 ## HGLibA_00004 9 0 0 1 ``` --- ## Rsubread counts per sgRNA To normalise our data and test for enrichment we will use the DESeq2 package. First we need to constuct our DESeqDataset object using familiar **DESeqDataSetFromMatrix** function ```r library(DESeq2) metadata <- DataFrame(Group=factor(c("Control","Control","ToxB","ToxB"), levels = c("Control","ToxB")), row.names = colnames(sgRNAcounts)) dds <- DESeqDataSetFromMatrix(sgRNAcounts,colData = metadata,design = ~Group) ``` --- ## Rsubread counts per sgRNA Now we have our DESeqDataset object we can use the DESeq function to normalise and test for enrichment. ```r dds <- DESeq(dds) ``` ``` ## estimating size factors ``` ``` ## estimating dispersions ``` ``` ## gene-wise dispersion estimates ``` ``` ## mean-dispersion relationship ``` ``` ## final dispersion estimates ``` ``` ## fitting model and testing ``` --- ## Rsubread counts per sgRNA A useful QC with sgRNA enrichment analysis is to review the distribution of counts for sgRNA guides before and after treatment. Here we extract the normalised counts and plot the distbutions of log2 counts for sgRNAs as a boxplot. ```r normCounts <- counts(dds,normalized=TRUE) boxplot(log2(normCounts+0.1)) ``` <!-- --> --- ## Rsubread counts per sgRNA To identify sgRNAs enriched in our Tox condition we can now use the standard differential analysis of DESeq2 by way of the **results** function. ```r ToxBvsControl <- results(dds,contrast=c("Group","ToxB","Control")) ToxBvsControl <- ToxBvsControl[order(ToxBvsControl$pvalue),] ToxBvsControl ``` ``` ## log2 fold change (MLE): Group ToxB vs Control ## Wald test p-value: Group ToxB vs Control ## DataFrame with 122411 rows and 6 columns ## baseMean log2FoldChange lfcSE stat pvalue ## <numeric> <numeric> <numeric> <numeric> <numeric> ## HGLibA_02011 2269.551 11.02447 1.84655 5.97030 2.36814e-09 ## HGLibA_64400 572.199 12.78926 2.38743 5.35691 8.46576e-08 ## HGLibB_02299 2910.432 9.46052 1.77452 5.33132 9.75036e-08 ## HGLibA_39135 1804.121 10.44016 2.00996 5.19422 2.05577e-07 ## HGLibB_51676 2751.242 9.63231 1.86532 5.16390 2.41858e-07 ## ... ... ... ... ... ... ## HGLibB_57015 0 NA NA NA NA ## HGLibB_57016 0 NA NA NA NA ## HGLibB_57019 0 NA NA NA NA ## HGLibB_57021 0 NA NA NA NA ## HGLibB_57026 0 NA NA NA NA ## padj ## <numeric> ## HGLibA_02011 3.09658e-05 ## HGLibA_64400 4.24986e-04 ## HGLibB_02299 4.24986e-04 ## HGLibA_39135 6.32506e-04 ## HGLibB_51676 6.32506e-04 ## ... ... ## HGLibB_57015 NA ## HGLibB_57016 NA ## HGLibB_57019 NA ## HGLibB_57021 NA ## HGLibB_57026 NA ``` --- ## Rsubread counts per sgRNA We can now compare our results to those provided by PinAPL-py. Although there us some differences the correlation here is 99%. ```r ss <- read.delim("New_Example_1580342908_4/Analysis/02_sgRNA-Ranking_Results/sgRNA_Rankings/ToxB_avg_0.01_Sidak_sgRNAList.txt") toPlt <- merge(ss,as.data.frame(ToxBvsControl),by.x=1,by.y=0) corr <- cor(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange) plot(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange,main=corr) ``` <!-- --> --- ## sgRNA to Gene We now would want to identify genes whose sgRNAs are enriched. Most sgRNA libraries will have multiple sgRNA guides for a single gene and so we can use makes use of these in selected genes of interest. Two main approaches are to either set a cut-off for number of enriched sgRNAs per gene or to produce an aggregate score of sgRNAs for a gene. Here we will employ the first strategy. --- ## sgRNA to Gene We can first identify which sgRNAs were significantly enriched in ToxB. ```r ToxBvsControl <- as.data.frame(ToxBvsControl)[order(ToxBvsControl$pvalue),] ToxBvsControl$Enriched <- !is.na(ToxBvsControl$padj) & ToxBvsControl$pvalue < 0.05 &ToxBvsControl$log2FoldChange > 0 ToxBvsControl[1:2,] ``` ``` ## baseMean log2FoldChange lfcSE stat pvalue ## HGLibA_02011 2269.5511 11.02447 1.846551 5.970302 2.368142e-09 ## HGLibA_64400 572.1991 12.78926 2.387432 5.356909 8.465756e-08 ## padj Enriched ## HGLibA_02011 3.096583e-05 TRUE ## HGLibA_64400 4.249859e-04 TRUE ``` --- ## sgRNA to Gene We also need to add some information of sgRNA to gene. This is available in the same file containing the sequence information. ```r ToxBvsControl <- merge(GeCKO,ToxBvsControl,by.x=2,by.y=0) ToxBvsControl[1:2,] ``` ``` ## UID gene_id seq baseMean log2FoldChange lfcSE ## 1 HGLibA_00001 A1BG GTCGCTGAGCTCCGATTCGA 4.588529 1.022701 2.679293 ## 2 HGLibA_00002 A1BG ACCTGTAGTTGCCGGCGTGC 0.000000 NA NA ## stat pvalue padj Enriched ## 1 0.3817054 0.7026799 0.8083972 FALSE ## 2 NA NA NA FALSE ``` --- ## sgRNA to Gene We can now loop through our sgRNA results and summarise to the gene level. ```r genes <- unique(ToxBvsControl$gene_id) listofGene <- list() for(i in 1:length(genes)){ tempRes <- ToxBvsControl[ToxBvsControl$gene_id %in% genes[i],] meanLogFC <- mean(tempRes$log2FoldChange,na.rm=TRUE) logFCs <- paste0(tempRes$log2FoldChange,collapse=";") minPvalue <- min(tempRes$pvalue,na.rm=TRUE) pvalues <- paste0(tempRes$pvalue,collapse=";") nEnriched <- sum(tempRes$Enriched,na.rm=TRUE) listofGene[[i]] <- data.frame(Gene=genes[i],meanLogFC,logFCs,minPvalue,pvalues,nEnriched) } geneTable <- do.call(rbind,listofGene) geneTable <- geneTable[order(geneTable$nEnriched,decreasing=TRUE),] ``` --- ## sgRNA to Gene Finally we can set a cut-off of enrichment by at least two guides and select genes to investigate further. ```r geneTable[1:3,] ``` ``` ## Gene meanLogFC ## 329 HBEGF 3.4286182 ## 17 CPSF3L 0.9897694 ## 28 PIP5K1B 3.2540767 ## logFCs ## 329 5.91201414165886;5.94417514624874;4.87849538275283;5.55277593166671;-0.704130722997228;-1.01162051433903 ## 17 11.4822538345679;6.51040028627386;-5.94602232979228;-2.59617588084159;-2.57900317698969;-0.93283644538519 ## 28 9.23864163813939;4.54175883040469;-5.10566945641389;4.34157590463247;NA;NA ## minPvalue ## 329 1.515112e-03 ## 17 1.616409e-06 ## 28 4.500310e-06 ## pvalues ## 329 0.00151511233063106;0.00364650882426901;0.0196783715509742;0.0289925243536325;0.800810984608005;0.839540034529576 ## 17 1.61640923806389e-06;0.00736461379039227;0.0382497535051929;0.592989202271584;0.595502117574694;0.730750118602134 ## 28 4.50030988059637e-06;0.018688113990729;0.0978052274147489;0.122899091425096;NA;NA ## nEnriched ## 329 4 ## 17 2 ## 28 2 ``` --- # Time for an exercise. [Link_to_exercises](../../exercises/exercises/crispr_exercise.html) [Link_to_answers](../../exercises/answers/crisp_answers.html)