High-throughput Crispr or shRNA screening provides a method to simultaneously assess the functional roles of 1000s genes under specific conditions/perturbations.
We can take advantage of high throughput sequencing to assess the abundance of cells containing each CrispR/shRNA.
By comparing the abundance of CrispR/shRNA in test conditions versus control conditions we can assess which genes have a functional role related to the test condition. i.e. If a set sgRNA guides targeting are enriched after drug treatment, that gene is important to susceptiabilty to the drug.
There are many tools available to analyse sgRNA/shRNA screens. These incude the popular PinAPL-Py and MAGeCK toolsets.
PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens
Much of the tools used in these toolsets are implemented within R (PinAPL=Py uses Bowtie, MAGeCK uses very similar normalisations to DESeq.
We can therefore use some of the skills we have learnt to deal with other high throughput sequencing data and then apply it to identifying enriched sgRNA/shRNAs in our test conditions versus control.
PinAPL-Py offers a python/command line and web GUI interface to sgRNA screen analysis.
We will follow some of their methods and use their test datasets in our upcoming analysis.
You can find the GUI and test datasets available here.
We will use their test data and results data from this GUI.
We can download this directly by using the download.file command
download.file("https://github.com/LewisLabUCSD/PinAPL-Py/archive/master.zip","pieappleData.zip")
unzip("pieappleData.zip")
download.file("http://pinapl-py.ucsd.edu/example-data","TestData.zip")
unzip("TestData.zip")
download.file("http://pinapl-py.ucsd.edu/run/download/example-run","Results.zip")
unzip("Results.zip")
Once we have downloaded the raw fastQ data (either from their Gtihub or from SRA) we can use the ShortRead package to review our sequence data quality.
We have reviewed how to work with raw sequencing data in the FastQ in Bioconductor session.
First we load the ShortRead library.
First we will review the raw sequencing reads using functions in the ShortRead package. This is similar to our QC we performed for RNAseq and ChIPseq.
We do not need to review all reads in the file to can gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.
Remember when we subsample we retrieve random reads from across the entire fastQ file. This is important as fastQ files are often ordered by their position on the sequencer.
We can subsample from a fastQ file using functions in ShortRead package.
Here we use the FastqSampler and yield function to randomly sample a defined number of reads from a fastQ file. Here we subsample 1 million reads.
The resulting object is a ShortReadQ object showing information on the number of cycles, base pairs in reads, and number of reads in memory.
## class: ShortReadQ
## length: 500000 reads; width: 50 cycles
If we wished, we can assess information from the fastQ file using our familiar accessor functions.
## DNAStringSet object of length 500000:
## width seq
## [1] 50 TCGAATCTTGTGGAAATGACGAAACACAGCCCTTGACATGACAATTGTGG
## [2] 50 TCGAATCTTGTGGAAAGGACGAAACACCGAAGAGGAATAACCCGGCCTGG
## [3] 50 TCGAATCTTGTGGAAAGGACGAAACGCAGACAATCCCCCCCCTGACATAG
## [4] 50 TCGAATCTTGTGGAAAGGGCGAAACACTGCTGGTCGAGGCGTTCCGCATG
## [5] 50 TCGAATCTTGTGGAAAGAGGAAACACCGAGCCATCCCAGCCCACTAGCGG
## ... ... ...
## [499996] 50 TCGAATCTTGTGGAAAGGACGAAACACCGGCTTCTTGTCCGCGCGCACGG
## [499997] 50 TCGAATCTTGTGGAAAGGACGAAACACCGCCCCAGACGGAAGCTCATGTG
## [499998] 50 TCGAATCTTGTGGAAAGGACGAAACACCGTGATGAATTCAGCTTTCCGGT
## [499999] 50 TCGAATCTTGTGGAAAGGACGAAACACCGAGTGCAATTGCAAGGTCATAG
## [500000] 50 TCGAATCTTGTGGAAAGGACGAAACACCATGTCCGTGTCACAACTTTGTT
We can check some simple quality metrics for our subsampled fastQ data.
First, we can review the overall reads’ quality scores.
We use the alphabetScore() function with our read’s qualitys to retrieve the sum quality for every read from our subsample.
## [1] 1282 1254 1095 1239 1382 1408 1324 1325 1318 1474
We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.
library(ggplot2)
toPlot <- data.frame(ReadQ=readQualities)
ggplot(toPlot,aes(x=ReadQ))+geom_histogram()+theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
We can review the occurrence of DNA bases within reads and well as the occurrence of DNA bases across sequencing cycles using the alphabetFrequency() and alphabetByCycle() functions respectively.
Here we check the overall frequency of A, G, C, T and N (unknown bases) in our sequence reads.
readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3,]
## A C G T M R W S Y K V H D B N - + .
## [1,] 16 10 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 17 11 15 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 16 15 11 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.
## A C G T N
## 7275781 6043883 6564129 5105990 10217
We can review DNA base occurrence by cycle using the alphabetByCycle() function.
## cycle
## alphabet [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## A 374 7816 9634 482525 481226 257 203 172 189 241
## C 9077 482302 7821 538 7954 9864 481477 336 248 132
## G 7630 9323 481741 9073 750 131 482 7825 10814 488911
## T 473741 559 804 7864 10058 489748 17838 491667 488749 10716
We often plot this to visualise the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame.
AFreq <- readSequences_AlpbyCycle["A",]
CFreq <- readSequences_AlpbyCycle["C",]
GFreq <- readSequences_AlpbyCycle["G",]
TFreq <- readSequences_AlpbyCycle["T",]
toPlot <- data.frame(Count=c(AFreq,CFreq,GFreq,TFreq),
Cycle=rep(1:max(width(readSequences)),4),
Base=rep(c("A","C","G","T"),each=max(width(readSequences))))
Now we can plot the frequencies using ggplot2
We can also assess mean read quality over cycles. This will allow us to identify whether there are any isses with quality dropping off over time.
For this we use the as(read_quality,“matrix”) function first to translate our ASCI quality scores to numeric quality scores.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] 32 32 12 32 37 41 37 41 41 22 32 37 37 32
## [2,] 32 27 32 32 32 32 37 37 41 27 37 32 37 37
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] 12 37 12 32 32 27 22 27 37 27 22 27
## [2,] 12 32 37 32 12 12 32 22 22 12 27 27
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,] 12 12 27 12 22 12 22 22 12 12 12 27
## [2,] 22 27 22 22 37 12 27 37 22 27 22 12
## [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,] 27 12 22 37 37 37 37 27 12 12 12 27
## [2,] 22 12 27 12 27 12 22 22 12 12 12 22
In this case the distribution of reads quality scores and read qualities over time look okay. We will often want to access fastQ samples together to see if any samples stick out by these metrics.
Here we observed a second population of low quality scores so will remove some reads with low quality scores and high unknown bases.
Following assessment of read quality and any read filtering we applied, we will want to align our reads to the sgRNA library so as to quantify sgRNA abundance in our samples.
Since sgRNA screen reads will align continously against our sgRNA we can use our genomic aligners we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.
First we need to retrieve the sequence information for the sgRNA guides of interest in FASTA format
We can read the sequences of our sgRNA probes from the TSV file downloaded from the Github page here.
## gene_id UID seq
## 1 A1BG HGLibA_00001 GTCGCTGAGCTCCGATTCGA
## 2 A1BG HGLibA_00002 ACCTGTAGTTGCCGGCGTGC
Now we can create a DNAStringSet object from the retrieved sequences as we have done for full chromosome sequences.
Now we have a DNAStringSet object we can use the writeXStringSet to create our FASTA file of sequences to align to.
We will be aligning using the subjunc algorithm from the authors of subread. We can therefore use the Rsubread package. Before we attempt to align our fastq files, we will need to first build an index from our reference genome using the buildindex() function.
The index is quite small so we dont need to split index or worry about memory parameters.
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.2.6
##
## //================================= setting ==================================\\
## || ||
## || Index name : GeCKO ||
## || Index space : base space ||
## || Index split : no-split ||
## || Repeat threshold : 100 repeats ||
## || Gapped index : no ||
## || ||
## || Free / total memory : 4.0GB / 6.8GB ||
## || ||
## || Input files : 1 file in total ||
## || o GeCKO.fa ||
## || ||
## \\============================================================================//
##
## //================================= Running ==================================\\
## || ||
## || Check the integrity of provided reference sequences ... ||
## || No format issues were found ||
## || Scan uninformative subreads in reference sequences ... ||
## || Estimate the index size... ||
## || 1.4 GB of memory is needed for index building. ||
## || Build the index... ||
## || Save current index block... ||
## || [ 0.0% finished ] ||
## || [ 10.0% finished ] ||
## || [ 20.0% finished ] ||
## || [ 30.0% finished ] ||
## || [ 40.0% finished ] ||
## || [ 50.0% finished ] ||
## || [ 60.0% finished ] ||
## || [ 70.0% finished ] ||
## || [ 80.0% finished ] ||
## || [ 90.0% finished ] ||
## || [ 100.0% finished ] ||
## || ||
## || Total running time: 0.2 minutes. ||
## || Index GeCKO was successfully built. ||
## || ||
## \\============================================================================//
We can align our raw sequence data in fastQ format to the new FASTA file of sgRNA sequences using the Rsubread package. Specifically we will be using the align function as this utilizes the subread genomic alignment algorithm.
myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA")
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.2.6
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (DNA-Seq) ||
## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz ||
## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) ||
## || Index name : GeCKO ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 4 ||
## || Phred offset : 33 ||
## || Min votes : 3 / 10 ||
## || Max mismatches : 3 ||
## || Max indel length : 5 ||
## || Report multi-mapping reads : no ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //================= Running (25-Aug-2020 18:32:37, pid=6501) =================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [2,41] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || 5% completed, 0.4 mins elapsed, rate=181.0k reads per second ||
## || 12% completed, 0.4 mins elapsed, rate=218.6k reads per second ||
## || 18% completed, 0.4 mins elapsed, rate=227.8k reads per second ||
## || 23% completed, 0.4 mins elapsed, rate=232.9k reads per second ||
## || 30% completed, 0.4 mins elapsed, rate=237.4k reads per second ||
## || 36% completed, 0.4 mins elapsed, rate=239.7k reads per second ||
## || 43% completed, 0.4 mins elapsed, rate=239.3k reads per second ||
## || 50% completed, 0.4 mins elapsed, rate=238.9k reads per second ||
## || 56% completed, 0.4 mins elapsed, rate=240.1k reads per second ||
## || 62% completed, 0.4 mins elapsed, rate=242.6k reads per second ||
## || 70% completed, 0.4 mins elapsed, rate=15.3k reads per second ||
## || 74% completed, 0.4 mins elapsed, rate=16.1k reads per second ||
## || 77% completed, 0.4 mins elapsed, rate=16.7k reads per second ||
## || 79% completed, 0.4 mins elapsed, rate=17.2k reads per second ||
## || 81% completed, 0.4 mins elapsed, rate=17.6k reads per second ||
## || 85% completed, 0.4 mins elapsed, rate=18.3k reads per second ||
## || 89% completed, 0.4 mins elapsed, rate=19.1k reads per second ||
## || 92% completed, 0.4 mins elapsed, rate=19.7k reads per second ||
## || 96% completed, 0.4 mins elapsed, rate=20.5k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total reads : 500,000 ||
## || Mapped : 0 (0.0%) ||
## || Uniquely mapped : 0 ||
## || Multi-mapping : 0 ||
## || ||
## || Unmapped : 500,000 ||
## || ||
## || Indels : 0 ||
## || ||
## || Running time : 0.4 minutes ||
## || ||
## \\============================================================================//
We can now assess how well our alignment has done by reviewing the total number of mapped reads output by the align function.
## Control_R1_S14_L008_R1_001_x.bam
## Total_reads 500000
## Mapped_reads 0
## Uniquely_mapped_reads 0
## Multi_mapping_reads 0
## Unmapped_reads 500000
## Indels 0
To see why we have been unable to map we can look at the Rsubread manual and review the guide for mapping to miRNAs..
We can see here that the problem is that we need to reduce the number of subreads required for accepting a hit. We can control this with the TH1 parameter which we now set ot 1.
myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.2.6
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (DNA-Seq) ||
## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz ||
## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) ||
## || Index name : GeCKO ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 4 ||
## || Phred offset : 33 ||
## || Min votes : 1 / 10 ||
## || Max mismatches : 3 ||
## || Max indel length : 5 ||
## || Report multi-mapping reads : no ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //================= Running (25-Aug-2020 18:33:01, pid=6501) =================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [2,41] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || 5% completed, 0.4 mins elapsed, rate=188.1k reads per second ||
## || 12% completed, 0.4 mins elapsed, rate=207.3k reads per second ||
## || 18% completed, 0.4 mins elapsed, rate=214.1k reads per second ||
## || 24% completed, 0.4 mins elapsed, rate=219.0k reads per second ||
## || 32% completed, 0.4 mins elapsed, rate=221.6k reads per second ||
## || 38% completed, 0.4 mins elapsed, rate=223.8k reads per second ||
## || 45% completed, 0.4 mins elapsed, rate=225.4k reads per second ||
## || 52% completed, 0.4 mins elapsed, rate=226.5k reads per second ||
## || 58% completed, 0.4 mins elapsed, rate=226.9k reads per second ||
## || 66% completed, 0.4 mins elapsed, rate=228.2k reads per second ||
## || 70% completed, 0.4 mins elapsed, rate=14.8k reads per second ||
## || 73% completed, 0.4 mins elapsed, rate=15.3k reads per second ||
## || 76% completed, 0.4 mins elapsed, rate=15.9k reads per second ||
## || 79% completed, 0.4 mins elapsed, rate=16.4k reads per second ||
## || 82% completed, 0.4 mins elapsed, rate=16.8k reads per second ||
## || 85% completed, 0.4 mins elapsed, rate=17.3k reads per second ||
## || 89% completed, 0.4 mins elapsed, rate=17.9k reads per second ||
## || 92% completed, 0.4 mins elapsed, rate=18.4k reads per second ||
## || 96% completed, 0.4 mins elapsed, rate=18.9k reads per second ||
## || 99% completed, 0.4 mins elapsed, rate=19.4k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total reads : 500,000 ||
## || Mapped : 421,608 (84.3%) ||
## || Uniquely mapped : 421,608 ||
## || Multi-mapping : 0 ||
## || ||
## || Unmapped : 78,392 ||
## || ||
## || Indels : 0 ||
## || ||
## || Running time : 0.4 minutes ||
## || ||
## \\============================================================================//
We can now see we have mapped around 80% of the reads uniquely to sgRNA library.
## Control_R1_S14_L008_R1_001_x.bam
## Total_reads 500000
## Mapped_reads 421608
## Uniquely_mapped_reads 421608
## Multi_mapping_reads 0
## Unmapped_reads 78392
## Indels 0
We will be more conservative now and exclude any reads with mismatches and indels (insertions/deletions) from our aligned data.
myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1,
maxMismatches = 0,indels = 0)
##
## ========== _____ _ _ ____ _____ ______ _____
## ===== / ____| | | | _ \| __ \| ____| /\ | __ \
## ===== | (___ | | | | |_) | |__) | |__ / \ | | | |
## ==== \___ \| | | | _ <| _ /| __| / /\ \ | | | |
## ==== ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
## ========== |_____/ \____/|____/|_| \_\______/_/ \_\_____/
## Rsubread 2.2.6
##
## //================================= setting ==================================\\
## || ||
## || Function : Read alignment (DNA-Seq) ||
## || Input file : Control_R1_S14_L008_R1_001_x.fastq.gz ||
## || Output file : Control_R1_S14_L008_R1_001_x.bam (BAM) ||
## || Index name : GeCKO ||
## || ||
## || ------------------------------------ ||
## || ||
## || Threads : 4 ||
## || Phred offset : 33 ||
## || Min votes : 1 / 10 ||
## || Max mismatches : 0 ||
## || Max indel length : 0 ||
## || Report multi-mapping reads : no ||
## || Max alignments per multi-mapping read : 1 ||
## || ||
## \\============================================================================//
##
## //================= Running (25-Aug-2020 18:33:26, pid=6501) =================\\
## || ||
## || Check the input reads. ||
## || The input file contains base space reads. ||
## || Initialise the memory objects. ||
## || Estimate the mean read length. ||
## || The range of Phred scores observed in the data is [2,41] ||
## || Create the output BAM file. ||
## || Check the index. ||
## || Init the voting space. ||
## || Global environment is initialised. ||
## || Load the 1-th index block... ||
## || The index block has been loaded. ||
## || Start read mapping in chunk. ||
## || 5% completed, 0.4 mins elapsed, rate=186.6k reads per second ||
## || 12% completed, 0.4 mins elapsed, rate=204.5k reads per second ||
## || 18% completed, 0.4 mins elapsed, rate=213.2k reads per second ||
## || 24% completed, 0.4 mins elapsed, rate=219.0k reads per second ||
## || 31% completed, 0.4 mins elapsed, rate=222.2k reads per second ||
## || 38% completed, 0.4 mins elapsed, rate=223.7k reads per second ||
## || 44% completed, 0.4 mins elapsed, rate=224.6k reads per second ||
## || 51% completed, 0.4 mins elapsed, rate=225.0k reads per second ||
## || 58% completed, 0.4 mins elapsed, rate=225.7k reads per second ||
## || 64% completed, 0.4 mins elapsed, rate=227.6k reads per second ||
## || 70% completed, 0.4 mins elapsed, rate=14.1k reads per second ||
## || 74% completed, 0.4 mins elapsed, rate=14.8k reads per second ||
## || 77% completed, 0.4 mins elapsed, rate=15.3k reads per second ||
## || 81% completed, 0.4 mins elapsed, rate=15.9k reads per second ||
## || 84% completed, 0.4 mins elapsed, rate=16.4k reads per second ||
## || 87% completed, 0.4 mins elapsed, rate=16.9k reads per second ||
## || 91% completed, 0.4 mins elapsed, rate=17.4k reads per second ||
## || 94% completed, 0.4 mins elapsed, rate=17.9k reads per second ||
## || 97% completed, 0.4 mins elapsed, rate=18.4k reads per second ||
## || ||
## || Completed successfully. ||
## || ||
## \\==================================== ====================================//
##
## //================================ Summary =================================\\
## || ||
## || Total reads : 500,000 ||
## || Mapped : 413,480 (82.7%) ||
## || Uniquely mapped : 413,480 ||
## || Multi-mapping : 0 ||
## || ||
## || Unmapped : 86,520 ||
## || ||
## || Indels : 0 ||
## || ||
## || Running time : 0.4 minutes ||
## || ||
## \\============================================================================//
With a little more stringency we have expected less alignment. This alignment rate is in keeping with observed results from the PinAPL-Py.
## Control_R1_S14_L008_R1_001_x.bam
## Total_reads 500000
## Mapped_reads 413480
## Uniquely_mapped_reads 413480
## Multi_mapping_reads 0
## Unmapped_reads 86520
## Indels 0
We can use the GenoomicAlignments package to read in our newly created BAM file using the readGAlignments function.
## GAlignments object with 413480 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] HGLibB_08643 + 29S20M1S 50 1 20
## [2] HGLibA_47249 + 29S19M2S 50 1 19
## [3] HGLibB_55658 + 29S19M2S 50 1 19
## [4] HGLibA_20205 + 29S20M1S 50 1 20
## [5] HGLibB_32397 + 29S20M1S 50 1 20
## ... ... ... ... ... ... ...
## [413476] HGLibA_12772 + 29S20M1S 50 1 20
## [413477] HGLibA_11892 + 28S20M2S 50 1 20
## [413478] HGLibB_42020 + 29S19M2S 50 1 19
## [413479] HGLibB_38831 + 28S20M2S 50 1 20
## [413480] HGLibA_37244 + 29S20M1S 50 1 20
## width njunc
## <integer> <integer>
## [1] 20 0
## [2] 19 0
## [3] 19 0
## [4] 20 0
## [5] 20 0
## ... ... ...
## [413476] 20 0
## [413477] 20 0
## [413478] 19 0
## [413479] 20 0
## [413480] 20 0
## -------
## seqinfo: 122411 sequences from an unspecified genome
Now we can review the alignments we can see how Rsubread has dealt with read sequences longer than the sgRNA to align to.
As we can see in the cigar strings, subread has softcliped the start and end of reads around the point of alignment and so removed adaptors as part of alignment.
## GAlignments object with 413480 alignments and 0 metadata columns:
## seqnames strand cigar qwidth start end
## <Rle> <Rle> <character> <integer> <integer> <integer>
## [1] HGLibB_08643 + 29S20M1S 50 1 20
## [2] HGLibA_47249 + 29S19M2S 50 1 19
## [3] HGLibB_55658 + 29S19M2S 50 1 19
## [4] HGLibA_20205 + 29S20M1S 50 1 20
## [5] HGLibB_32397 + 29S20M1S 50 1 20
## ... ... ... ... ... ... ...
## [413476] HGLibA_12772 + 29S20M1S 50 1 20
## [413477] HGLibA_11892 + 28S20M2S 50 1 20
## [413478] HGLibB_42020 + 29S19M2S 50 1 19
## [413479] HGLibB_38831 + 28S20M2S 50 1 20
## [413480] HGLibA_37244 + 29S20M1S 50 1 20
## width njunc
## <integer> <integer>
## [1] 20 0
## [2] 19 0
## [3] 19 0
## [4] 20 0
## [5] 20 0
## ... ... ...
## [413476] 20 0
## [413477] 20 0
## [413478] 19 0
## [413479] 20 0
## [413480] 20 0
## -------
## seqinfo: 122411 sequences from an unspecified genome
We can then review where the softclipping and matches occur in our read by interrogating these cigar strings using the cigar functions to extract them from our GAlignements object.
## [1] "29S20M1S" "29S19M2S" "29S19M2S" "29S20M1S" "29S20M1S"
We can use the cigarToRleList function to turn our cigar strings into a list of RLE objects.
## RleList of length 1
## [[1]]
## character-Rle of length 50 with 3 runs
## Lengths: 29 20 1
## Values : "S" "M" "S"
And as our reads are all the same length we can now turn our list of RLEs into a matrix of cigar values with rows representing reads and columns the cycles acorss the reads.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S"
## [2,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S"
## [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S"
## [2,] "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S" "S"
## [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,] "S" "S" "S" "M" "M" "M" "M" "M" "M" "M" "M" "M"
## [2,] "S" "S" "S" "M" "M" "M" "M" "M" "M" "M" "M" "M"
## [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "S"
## [2,] "M" "M" "M" "M" "M" "M" "M" "M" "M" "M" "S" "S"
From this we can now get the frequency of S or M using the table function.
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## M 41 699 824 867 868 868 873 878 878 878 878
## S 413439 412781 412656 412613 412612 412612 412607 412602 412602 412602 412602
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
## M 880 880 881 902 905 883 883 882 835 767 120
## S 412600 412600 412599 412578 412575 412597 412597 412598 412645 412713 413360
## [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33]
## M 61 44 51 123 972 2672 31947 405383 411237 411654 412178
## S 413419 413436 413429 413357 412508 410808 381533 8097 2243 1826 1302
## [,34] [,35] [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44]
## M 412575 412575 412575 412574 412574 412572 412571 412571 412562 412555 412547
## S 905 905 905 906 906 908 909 909 918 925 933
## [,45] [,46] [,47] [,48] [,49] [,50]
## M 412532 412335 409707 403963 312939 2308
## S 948 1145 3773 9517 100541 411172
With this frequency table we can construct a ggplot to show the distibution of soft-clips and matches across the read.
require(ggplot2)
toPlot <- data.frame(Freq=c(cigarFreq["S",],cigarFreq["M",]),
Cigar=rep(c("S","M"),each=ncol(cigarFreq)),
Cycle=rep(seq(1,ncol(cigarFreq)),2))
ggplot(toPlot,aes(x=Cycle,y=Freq,colour=Cigar))+geom_line()+theme_bw()
To obtain the counts per sgRNA we can simply take the frequncy of contig occurrence.
## Freq
## HGLibA_00001 0
## HGLibA_00002 1
We can now compare to the counts from the same file retreived from PinAPL-Py.
ss <- read.delim("New_Example_1580342908_4/Analysis/01_Alignment_Results/ReadCounts_per_sgRNA/Control_1_GuideCounts.txt")
new <- merge(ss,counts,by.x=1,by.y=0)
plot(new$X0,new$Freq)
This looked quite different for low counts. We will also add 1 more additional filter to screen out reads with less than 20 matches.
counts <- data.frame(table(seqnames(temp[width(temp) == 20])),row.names = "Var1")
new <- merge(ss,counts,by.x=1,by.y=0)
plot(new$X0,new$Freq)
Now we can align the reads and count sgRNAs we could run the counting over each of our files of interest. I have done this already and provide a solution in the exercises based on the previous slides.
For now we can work with the result stored in a matrix object called sgRNAcounts.
## Control_1 Control_2 ToxB_1 ToxB_2
## HGLibA_00001 0 7 2 8
## HGLibA_00002 0 0 0 0
## HGLibA_00003 1 0 0 0
## HGLibA_00004 9 0 0 1
To normalise our data and test for enrichment we will use the DESeq2 package.
First we need to constuct our DESeqDataset object using familiar DESeqDataSetFromMatrix function
Now we have our DESeqDataset object we can use the DESeq function to normalise and test for enrichment.
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
A useful QC with sgRNA enrichment analysis is to review the distribution of counts for sgRNA guides before and after treatment. Here we extract the normalised counts and plot the distbutions of log2 counts for sgRNAs as a boxplot.
To identify sgRNAs enriched in our Tox condition we can now use the standard differential analysis of DESeq2 by way of the results function.
ToxBvsControl <- results(dds,contrast=c("Group","ToxB","Control"))
ToxBvsControl <- ToxBvsControl[order(ToxBvsControl$pvalue),]
ToxBvsControl
## log2 fold change (MLE): Group ToxB vs Control
## Wald test p-value: Group ToxB vs Control
## DataFrame with 122411 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue
## <numeric> <numeric> <numeric> <numeric> <numeric>
## HGLibA_02011 2269.551 11.02447 1.84655 5.97030 2.36814e-09
## HGLibA_64400 572.199 12.78926 2.38743 5.35691 8.46576e-08
## HGLibB_02299 2910.432 9.46052 1.77452 5.33132 9.75036e-08
## HGLibA_39135 1804.121 10.44016 2.00996 5.19422 2.05577e-07
## HGLibB_51676 2751.242 9.63231 1.86532 5.16390 2.41858e-07
## ... ... ... ... ... ...
## HGLibB_57015 0 NA NA NA NA
## HGLibB_57016 0 NA NA NA NA
## HGLibB_57019 0 NA NA NA NA
## HGLibB_57021 0 NA NA NA NA
## HGLibB_57026 0 NA NA NA NA
## padj
## <numeric>
## HGLibA_02011 3.09658e-05
## HGLibA_64400 4.24986e-04
## HGLibB_02299 4.24986e-04
## HGLibA_39135 6.32506e-04
## HGLibB_51676 6.32506e-04
## ... ...
## HGLibB_57015 NA
## HGLibB_57016 NA
## HGLibB_57019 NA
## HGLibB_57021 NA
## HGLibB_57026 NA
We can now compare our results to those provided by PinAPL-py. Although there us some differences the correlation here is 99%.
ss <- read.delim("New_Example_1580342908_4/Analysis/02_sgRNA-Ranking_Results/sgRNA_Rankings/ToxB_avg_0.01_Sidak_sgRNAList.txt")
toPlt <- merge(ss,as.data.frame(ToxBvsControl),by.x=1,by.y=0)
corr <- cor(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange)
plot(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange,main=corr)
We now would want to identify genes whose sgRNAs are enriched. Most sgRNA libraries will have multiple sgRNA guides for a single gene and so we can use makes use of these in selected genes of interest.
Two main approaches are to either set a cut-off for number of enriched sgRNAs per gene or to produce an aggregate score of sgRNAs for a gene.
Here we will employ the first strategy.
We can first identify which sgRNAs were significantly enriched in ToxB.
ToxBvsControl <- as.data.frame(ToxBvsControl)[order(ToxBvsControl$pvalue),]
ToxBvsControl$Enriched <- !is.na(ToxBvsControl$padj) & ToxBvsControl$pvalue < 0.05 &ToxBvsControl$log2FoldChange > 0
ToxBvsControl[1:2,]
## baseMean log2FoldChange lfcSE stat pvalue
## HGLibA_02011 2269.5511 11.02447 1.846551 5.970302 2.368142e-09
## HGLibA_64400 572.1991 12.78926 2.387432 5.356909 8.465756e-08
## padj Enriched
## HGLibA_02011 3.096583e-05 TRUE
## HGLibA_64400 4.249859e-04 TRUE
We also need to add some information of sgRNA to gene. This is available in the same file containing the sequence information.
## UID gene_id seq baseMean log2FoldChange lfcSE
## 1 HGLibA_00001 A1BG GTCGCTGAGCTCCGATTCGA 4.588529 1.022701 2.679293
## 2 HGLibA_00002 A1BG ACCTGTAGTTGCCGGCGTGC 0.000000 NA NA
## stat pvalue padj Enriched
## 1 0.3817054 0.7026799 0.8083972 FALSE
## 2 NA NA NA FALSE
We can now loop through our sgRNA results and summarise to the gene level.
genes <- unique(ToxBvsControl$gene_id)
listofGene <- list()
for(i in 1:length(genes)){
tempRes <- ToxBvsControl[ToxBvsControl$gene_id %in% genes[i],]
meanLogFC <- mean(tempRes$log2FoldChange,na.rm=TRUE)
logFCs <- paste0(tempRes$log2FoldChange,collapse=";")
minPvalue <- min(tempRes$pvalue,na.rm=TRUE)
pvalues <- paste0(tempRes$pvalue,collapse=";")
nEnriched <- sum(tempRes$Enriched,na.rm=TRUE)
listofGene[[i]] <- data.frame(Gene=genes[i],meanLogFC,logFCs,minPvalue,pvalues,nEnriched)
}
geneTable <- do.call(rbind,listofGene)
geneTable <- geneTable[order(geneTable$nEnriched,decreasing=TRUE),]
Finally we can set a cut-off of enrichment by at least two guides and select genes to investigate further.
## Gene meanLogFC
## 329 HBEGF 3.4286182
## 17 CPSF3L 0.9897694
## 28 PIP5K1B 3.2540767
## logFCs
## 329 5.91201414165886;5.94417514624874;4.87849538275283;5.55277593166671;-0.704130722997228;-1.01162051433903
## 17 11.4822538345679;6.51040028627386;-5.94602232979228;-2.59617588084159;-2.57900317698969;-0.93283644538519
## 28 9.23864163813939;4.54175883040469;-5.10566945641389;4.34157590463247;NA;NA
## minPvalue
## 329 1.515112e-03
## 17 1.616409e-06
## 28 4.500310e-06
## pvalues
## 329 0.00151511233063106;0.00364650882426901;0.0196783715509742;0.0289925243536325;0.800810984608005;0.839540034529576
## 17 1.61640923806389e-06;0.00736461379039227;0.0382497535051929;0.592989202271584;0.595502117574694;0.730750118602134
## 28 4.50030988059637e-06;0.018688113990729;0.0978052274147489;0.122899091425096;NA;NA
## nEnriched
## 329 4
## 17 2
## 28 2