Crispr Screenining analysis in R


High-throughput Crispr Screening

High-throughput Crispr or shRNA screening provides a method to simultaneously assess the functional roles of 1000s genes under specific conditions/perturbations.

Analysis of Crispr screens

We can take advantage of high throughput sequencing to assess the abundance of cells containing each CrispR/shRNA.

By comparing the abundance of CrispR/shRNA in test conditions versus control conditions we can assess which genes have a functional role related to the test condition. i.e. If a set sgRNA guides targeting are enriched after drug treatment, that gene is important to susceptiabilty to the drug.

Tools for analysis

There are many tools available to analyse sgRNA/shRNA screens. These incude the popular PinAPL-Py and MAGeCK toolsets.

PinAPL-Py: A comprehensive web-application for the analysis of CRISPR/Cas9 screens

MAGeCK enables robust identification of essential genes from genome-scale CRISPR/Cas9 knockout screens

Tools in R.

Much of the tools used in these toolsets are implemented within R (PinAPL=Py uses Bowtie, MAGeCK uses very similar normalisations to DESeq.

We can therefore use some of the skills we have learnt to deal with other high throughput sequencing data and then apply it to identifying enriched sgRNA/shRNAs in our test conditions versus control.

PinAPL-Py

PinAPL-Py offers a python/command line and web GUI interface to sgRNA screen analysis.

We will follow some of their methods and use their test datasets in our upcoming analysis.

You can find the GUI and test datasets available here.

Download the data

We will use their test data and results data from this GUI.

We can download this directly by using the download.file command

download.file("https://github.com/LewisLabUCSD/PinAPL-Py/archive/master.zip","pieappleData.zip")
unzip("pieappleData.zip")
download.file("http://pinapl-py.ucsd.edu/example-data","TestData.zip")
unzip("TestData.zip")
download.file("http://pinapl-py.ucsd.edu/run/download/example-run","Results.zip")
unzip("Results.zip")

Quality control of FastQ

Once we have downloaded the raw fastQ data (either from their Gtihub or from SRA) we can use the ShortRead package to review our sequence data quality.

We have reviewed how to work with raw sequencing data in the FastQ in Bioconductor session.

First we load the ShortRead library.

library(ShortRead)

Working with FQs from Crispr screen.

First we will review the raw sequencing reads using functions in the ShortRead package. This is similar to our QC we performed for RNAseq and ChIPseq.

We do not need to review all reads in the file to can gain an understanding of data quality. We can simply review a subsample of the reads and save ourselves some time and memory.

Remember when we subsample we retrieve random reads from across the entire fastQ file. This is important as fastQ files are often ordered by their position on the sequencer.

Crispr screen FQs.

We can subsample from a fastQ file using functions in ShortRead package.

Here we use the FastqSampler and yield function to randomly sample a defined number of reads from a fastQ file. Here we subsample 1 million reads.

fqSample <- FastqSampler("PinAPL-Py-master/Data/Tox-A_R01_98_S2_L008_R1_001_x.fastq.gz",n=10^6)
fastq <- yield(fqSample)

Working with Crispr fastQ

The resulting object is a ShortReadQ object showing information on the number of cycles, base pairs in reads, and number of reads in memory.

fastq
## class: ShortReadQ
## length: 500000 reads; width: 50 cycles

Working with Crispr fastQ

If we wished, we can assess information from the fastQ file using our familiar accessor functions.

  • sread() - Retrieve sequence of reads.
  • quality() - Retrieve quality of reads as ASCI scores.
  • id() - Retrieve IDs of reads.
readSequences <- sread(fastq)
readQuality <- quality(fastq)
readIDs <- id(fastq)
readSequences
## DNAStringSet object of length 500000:
##          width seq
##      [1]    50 TCGAATCTTGTGGAAATGACGAAACACAGCCCTTGACATGACAATTGTGG
##      [2]    50 TCGAATCTTGTGGAAAGGACGAAACACCGAAGAGGAATAACCCGGCCTGG
##      [3]    50 TCGAATCTTGTGGAAAGGACGAAACGCAGACAATCCCCCCCCTGACATAG
##      [4]    50 TCGAATCTTGTGGAAAGGGCGAAACACTGCTGGTCGAGGCGTTCCGCATG
##      [5]    50 TCGAATCTTGTGGAAAGAGGAAACACCGAGCCATCCCAGCCCACTAGCGG
##      ...   ... ...
## [499996]    50 TCGAATCTTGTGGAAAGGACGAAACACCGGCTTCTTGTCCGCGCGCACGG
## [499997]    50 TCGAATCTTGTGGAAAGGACGAAACACCGCCCCAGACGGAAGCTCATGTG
## [499998]    50 TCGAATCTTGTGGAAAGGACGAAACACCGTGATGAATTCAGCTTTCCGGT
## [499999]    50 TCGAATCTTGTGGAAAGGACGAAACACCGAGTGCAATTGCAAGGTCATAG
## [500000]    50 TCGAATCTTGTGGAAAGGACGAAACACCATGTCCGTGTCACAACTTTGTT

Working with Crispr fastQ

We can check some simple quality metrics for our subsampled fastQ data.

First, we can review the overall reads’ quality scores.

We use the alphabetScore() function with our read’s qualitys to retrieve the sum quality for every read from our subsample.

readQuality <- quality(fastq)
readQualities <- alphabetScore(readQuality)
readQualities[1:10]
##  [1] 1282 1254 1095 1239 1382 1408 1324 1325 1318 1474

Working with Crispr fastQ

We can then produce a histogram of quality scores to get a better understanding of the distribution of scores.

library(ggplot2)
toPlot <- data.frame(ReadQ=readQualities)
ggplot(toPlot,aes(x=ReadQ))+geom_histogram()+theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Working with Crispr fastQ

We can review the occurrence of DNA bases within reads and well as the occurrence of DNA bases across sequencing cycles using the alphabetFrequency() and alphabetByCycle() functions respectively.

Here we check the overall frequency of A, G, C, T and N (unknown bases) in our sequence reads.

readSequences <- sread(fastq)
readSequences_AlpFreq <- alphabetFrequency(readSequences)
readSequences_AlpFreq[1:3,]
##       A  C  G  T M R W S Y K V H D B N - + .
## [1,] 16 10 12 12 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [2,] 17 11 15  7 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [3,] 16 15 11  8 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Working with Crispr fastQ

Once we have the frequency of DNA bases in our sequence reads we can retrieve the sum across all reads.

summed__AlpFreq  <- colSums(readSequences_AlpFreq)
summed__AlpFreq[c("A","C","G","T","N")]
##       A       C       G       T       N 
## 7275781 6043883 6564129 5105990   10217

Working with Crispr fastQ

We can review DNA base occurrence by cycle using the alphabetByCycle() function.

readSequences_AlpbyCycle <- alphabetByCycle(readSequences)
readSequences_AlpbyCycle[1:4,1:10]
##         cycle
## alphabet   [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]
##        A    374   7816   9634 482525 481226    257    203    172    189    241
##        C   9077 482302   7821    538   7954   9864 481477    336    248    132
##        G   7630   9323 481741   9073    750    131    482   7825  10814 488911
##        T 473741    559    804   7864  10058 489748  17838 491667 488749  10716

Working with Crispr fastQ

We often plot this to visualise the base occurrence over cycles to observe any bias. First we arrange the base frequency into a data frame.

AFreq <- readSequences_AlpbyCycle["A",]
CFreq <- readSequences_AlpbyCycle["C",]
GFreq <- readSequences_AlpbyCycle["G",]
TFreq <- readSequences_AlpbyCycle["T",]
toPlot <- data.frame(Count=c(AFreq,CFreq,GFreq,TFreq),
                     Cycle=rep(1:max(width(readSequences)),4),
                     Base=rep(c("A","C","G","T"),each=max(width(readSequences))))

Working with Crispr fastQ

Now we can plot the frequencies using ggplot2

ggplot(toPlot,aes(y=Count,x=Cycle,colour=Base))+geom_line()+
  theme_bw()

Assess by cycle with raw Crispr data

We can also assess mean read quality over cycles. This will allow us to identify whether there are any isses with quality dropping off over time.

For this we use the as(read_quality,“matrix”) function first to translate our ASCI quality scores to numeric quality scores.

qualAsMatrix <- as(readQuality,"matrix")
qualAsMatrix[1:2,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,]   32   32   12   32   37   41   37   41   41    22    32    37    37    32
## [2,]   32   27   32   32   32   32   37   37   41    27    37    32    37    37
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,]    12    37    12    32    32    27    22    27    37    27    22    27
## [2,]    12    32    37    32    12    12    32    22    22    12    27    27
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,]    12    12    27    12    22    12    22    22    12    12    12    27
## [2,]    22    27    22    22    37    12    27    37    22    27    22    12
##      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,]    27    12    22    37    37    37    37    27    12    12    12    27
## [2,]    22    12    27    12    27    12    22    22    12    12    12    22

Assess by cycle with raw Crispr data

We can now visualise qualities across cycles using a lineplot.

toPlot <- colMeans(qualAsMatrix)
plot(toPlot)

Assess by cycle with raw Crispr data

In this case the distribution of reads quality scores and read qualities over time look okay. We will often want to access fastQ samples together to see if any samples stick out by these metrics.

Here we observed a second population of low quality scores so will remove some reads with low quality scores and high unknown bases.

Aligning data


Aligning reads to sgRNA library

Following assessment of read quality and any read filtering we applied, we will want to align our reads to the sgRNA library so as to quantify sgRNA abundance in our samples.

Since sgRNA screen reads will align continously against our sgRNA we can use our genomic aligners we have seen in previous sessions. The resulting BAM file will contain aligned sequence reads for use in further analysis.

igv

Creating a sgRNA reference

First we need to retrieve the sequence information for the sgRNA guides of interest in FASTA format

We can read the sequences of our sgRNA probes from the TSV file downloaded from the Github page here.

GeCKO <- read.delim("PinAPL-py_demo_data/GeCKOv21_Human.tsv")
GeCKO[1:2,]
##   gene_id          UID                  seq
## 1    A1BG HGLibA_00001 GTCGCTGAGCTCCGATTCGA
## 2    A1BG HGLibA_00002 ACCTGTAGTTGCCGGCGTGC

Creating a sgRNA fasta

Now we can create a DNAStringSet object from the retrieved sequences as we have done for full chromosome sequences.

require(Biostrings)
sgRNAs <- DNAStringSet(GeCKO$seq)
names(sgRNAs) <- GeCKO$UID

Creating a sgRNA fasta

Now we have a DNAStringSet object we can use the writeXStringSet to create our FASTA file of sequences to align to.

writeXStringSet(sgRNAs,
                file="GeCKO.fa")

Creating an Rsubread index

We will be aligning using the subjunc algorithm from the authors of subread. We can therefore use the Rsubread package. Before we attempt to align our fastq files, we will need to first build an index from our reference genome using the buildindex() function.

The buildindex() function simply takes the parameters of our desired index name and the FASTA file to build index from.

The index is quite small so we dont need to split index or worry about memory parameters.

library(Rsubread)
buildindex("GeCKO","GeCKO.fa", 
           indexSplit=FALSE)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.2.6
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## ||                Index name : GeCKO                                          ||
## ||               Index space : base space                                     ||
## ||               Index split : no-split                                       ||
## ||          Repeat threshold : 100 repeats                                    ||
## ||              Gapped index : no                                             ||
## ||                                                                            ||
## ||       Free / total memory : 4.0GB / 6.8GB                                  ||
## ||                                                                            ||
## ||               Input files : 1 file in total                                ||
## ||                             o GeCKO.fa                                     ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================================= Running ==================================\\
## ||                                                                            ||
## || Check the integrity of provided reference sequences ...                    ||
## || No format issues were found                                                ||
## || Scan uninformative subreads in reference sequences ...                     ||
## || Estimate the index size...                                                 ||
## || 1.4 GB of memory is needed for index building.                             ||
## || Build the index...                                                         ||
## || Save current index block...                                                ||
## ||  [ 0.0% finished ]                                                         ||
## ||  [ 10.0% finished ]                                                        ||
## ||  [ 20.0% finished ]                                                        ||
## ||  [ 30.0% finished ]                                                        ||
## ||  [ 40.0% finished ]                                                        ||
## ||  [ 50.0% finished ]                                                        ||
## ||  [ 60.0% finished ]                                                        ||
## ||  [ 70.0% finished ]                                                        ||
## ||  [ 80.0% finished ]                                                        ||
## ||  [ 90.0% finished ]                                                        ||
## ||  [ 100.0% finished ]                                                       ||
## ||                                                                            ||
## ||                      Total running time: 0.2 minutes.                      ||
## ||                    Index GeCKO was successfully built.                     ||
## ||                                                                            ||
## \\============================================================================//

Rsubread sgRNA alignment

We can align our raw sequence data in fastQ format to the new FASTA file of sgRNA sequences using the Rsubread package. Specifically we will be using the align function as this utilizes the subread genomic alignment algorithm.

The align() function accepts arguments for the index to align to, the fastQ to align, the name of output BAM, the mode of alignment (rna or dna) and the phredOffset.

myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
                  nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA")
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.2.6
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (DNA-Seq)                                   ||
## || Input file    : Control_R1_S14_L008_R1_001_x.fastq.gz                      ||
## || Output file   : Control_R1_S14_L008_R1_001_x.bam (BAM)                     ||
## || Index name    : GeCKO                                                      ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 4                                  ||
## ||                          Phred offset : 33                                 ||
## ||                             Min votes : 3 / 10                             ||
## ||                        Max mismatches : 3                                  ||
## ||                      Max indel length : 5                                  ||
## ||            Report multi-mapping reads : no                                 ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================= Running (25-Aug-2020 18:32:37, pid=6501) =================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [2,41]                   ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||    5% completed, 0.4 mins elapsed, rate=181.0k reads per second            ||
## ||   12% completed, 0.4 mins elapsed, rate=218.6k reads per second            ||
## ||   18% completed, 0.4 mins elapsed, rate=227.8k reads per second            ||
## ||   23% completed, 0.4 mins elapsed, rate=232.9k reads per second            ||
## ||   30% completed, 0.4 mins elapsed, rate=237.4k reads per second            ||
## ||   36% completed, 0.4 mins elapsed, rate=239.7k reads per second            ||
## ||   43% completed, 0.4 mins elapsed, rate=239.3k reads per second            ||
## ||   50% completed, 0.4 mins elapsed, rate=238.9k reads per second            ||
## ||   56% completed, 0.4 mins elapsed, rate=240.1k reads per second            ||
## ||   62% completed, 0.4 mins elapsed, rate=242.6k reads per second            ||
## ||   70% completed, 0.4 mins elapsed, rate=15.3k reads per second             ||
## ||   74% completed, 0.4 mins elapsed, rate=16.1k reads per second             ||
## ||   77% completed, 0.4 mins elapsed, rate=16.7k reads per second             ||
## ||   79% completed, 0.4 mins elapsed, rate=17.2k reads per second             ||
## ||   81% completed, 0.4 mins elapsed, rate=17.6k reads per second             ||
## ||   85% completed, 0.4 mins elapsed, rate=18.3k reads per second             ||
## ||   89% completed, 0.4 mins elapsed, rate=19.1k reads per second             ||
## ||   92% completed, 0.4 mins elapsed, rate=19.7k reads per second             ||
## ||   96% completed, 0.4 mins elapsed, rate=20.5k reads per second             ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||                 Total reads : 500,000                                      ||
## ||                      Mapped : 0 (0.0%)                                     ||
## ||             Uniquely mapped : 0                                            ||
## ||               Multi-mapping : 0                                            ||
## ||                                                                            ||
## ||                    Unmapped : 500,000                                      ||
## ||                                                                            ||
## ||                      Indels : 0                                            ||
## ||                                                                            ||
## ||                Running time : 0.4 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//

Rsubread sgRNA alignment

We can now assess how well our alignment has done by reviewing the total number of mapped reads output by the align function.

myMapped 
##                       Control_R1_S14_L008_R1_001_x.bam
## Total_reads                                     500000
## Mapped_reads                                         0
## Uniquely_mapped_reads                                0
## Multi_mapping_reads                                  0
## Unmapped_reads                                  500000
## Indels                                               0

Rsubread sgRNA alignment

To see why we have been unable to map we can look at the Rsubread manual and review the guide for mapping to miRNAs..

We can see here that the problem is that we need to reduce the number of subreads required for accepting a hit. We can control this with the TH1 parameter which we now set ot 1.

myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
                  nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.2.6
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (DNA-Seq)                                   ||
## || Input file    : Control_R1_S14_L008_R1_001_x.fastq.gz                      ||
## || Output file   : Control_R1_S14_L008_R1_001_x.bam (BAM)                     ||
## || Index name    : GeCKO                                                      ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 4                                  ||
## ||                          Phred offset : 33                                 ||
## ||                             Min votes : 1 / 10                             ||
## ||                        Max mismatches : 3                                  ||
## ||                      Max indel length : 5                                  ||
## ||            Report multi-mapping reads : no                                 ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================= Running (25-Aug-2020 18:33:01, pid=6501) =================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [2,41]                   ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||    5% completed, 0.4 mins elapsed, rate=188.1k reads per second            ||
## ||   12% completed, 0.4 mins elapsed, rate=207.3k reads per second            ||
## ||   18% completed, 0.4 mins elapsed, rate=214.1k reads per second            ||
## ||   24% completed, 0.4 mins elapsed, rate=219.0k reads per second            ||
## ||   32% completed, 0.4 mins elapsed, rate=221.6k reads per second            ||
## ||   38% completed, 0.4 mins elapsed, rate=223.8k reads per second            ||
## ||   45% completed, 0.4 mins elapsed, rate=225.4k reads per second            ||
## ||   52% completed, 0.4 mins elapsed, rate=226.5k reads per second            ||
## ||   58% completed, 0.4 mins elapsed, rate=226.9k reads per second            ||
## ||   66% completed, 0.4 mins elapsed, rate=228.2k reads per second            ||
## ||   70% completed, 0.4 mins elapsed, rate=14.8k reads per second             ||
## ||   73% completed, 0.4 mins elapsed, rate=15.3k reads per second             ||
## ||   76% completed, 0.4 mins elapsed, rate=15.9k reads per second             ||
## ||   79% completed, 0.4 mins elapsed, rate=16.4k reads per second             ||
## ||   82% completed, 0.4 mins elapsed, rate=16.8k reads per second             ||
## ||   85% completed, 0.4 mins elapsed, rate=17.3k reads per second             ||
## ||   89% completed, 0.4 mins elapsed, rate=17.9k reads per second             ||
## ||   92% completed, 0.4 mins elapsed, rate=18.4k reads per second             ||
## ||   96% completed, 0.4 mins elapsed, rate=18.9k reads per second             ||
## ||   99% completed, 0.4 mins elapsed, rate=19.4k reads per second             ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||                 Total reads : 500,000                                      ||
## ||                      Mapped : 421,608 (84.3%)                              ||
## ||             Uniquely mapped : 421,608                                      ||
## ||               Multi-mapping : 0                                            ||
## ||                                                                            ||
## ||                    Unmapped : 78,392                                       ||
## ||                                                                            ||
## ||                      Indels : 0                                            ||
## ||                                                                            ||
## ||                Running time : 0.4 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//

Rsubread sgRNA alignment

We can now see we have mapped around 80% of the reads uniquely to sgRNA library.

myMapped 
##                       Control_R1_S14_L008_R1_001_x.bam
## Total_reads                                     500000
## Mapped_reads                                    421608
## Uniquely_mapped_reads                           421608
## Multi_mapping_reads                                  0
## Unmapped_reads                                   78392
## Indels                                               0

Rsubread ChIPseq alignment

We will be more conservative now and exclude any reads with mismatches and indels (insertions/deletions) from our aligned data.

myFQs <- "PinAPL-py_demo_data/Control_R1_S14_L008_R1_001_x.fastq.gz"
myMapped <- align("GeCKO",myFQs,output_file = gsub(".fastq.gz",".bam",myFQs),
                  nthreads=4,unique=TRUE,nBestLocations=1,type = "DNA",TH1 = 1,
                  maxMismatches = 0,indels = 0)
## 
##         ==========     _____ _    _ ____  _____  ______          _____  
##         =====         / ____| |  | |  _ \|  __ \|  ____|   /\   |  __ \ 
##           =====      | (___ | |  | | |_) | |__) | |__     /  \  | |  | |
##             ====      \___ \| |  | |  _ <|  _  /|  __|   / /\ \ | |  | |
##               ====    ____) | |__| | |_) | | \ \| |____ / ____ \| |__| |
##         ==========   |_____/ \____/|____/|_|  \_\______/_/    \_\_____/
##        Rsubread 2.2.6
## 
## //================================= setting ==================================\\
## ||                                                                            ||
## || Function      : Read alignment (DNA-Seq)                                   ||
## || Input file    : Control_R1_S14_L008_R1_001_x.fastq.gz                      ||
## || Output file   : Control_R1_S14_L008_R1_001_x.bam (BAM)                     ||
## || Index name    : GeCKO                                                      ||
## ||                                                                            ||
## ||                    ------------------------------------                    ||
## ||                                                                            ||
## ||                               Threads : 4                                  ||
## ||                          Phred offset : 33                                 ||
## ||                             Min votes : 1 / 10                             ||
## ||                        Max mismatches : 0                                  ||
## ||                      Max indel length : 0                                  ||
## ||            Report multi-mapping reads : no                                 ||
## || Max alignments per multi-mapping read : 1                                  ||
## ||                                                                            ||
## \\============================================================================//
## 
## //================= Running (25-Aug-2020 18:33:26, pid=6501) =================\\
## ||                                                                            ||
## || Check the input reads.                                                     ||
## || The input file contains base space reads.                                  ||
## || Initialise the memory objects.                                             ||
## || Estimate the mean read length.                                             ||
## || The range of Phred scores observed in the data is [2,41]                   ||
## || Create the output BAM file.                                                ||
## || Check the index.                                                           ||
## || Init the voting space.                                                     ||
## || Global environment is initialised.                                         ||
## || Load the 1-th index block...                                               ||
## || The index block has been loaded.                                           ||
## || Start read mapping in chunk.                                               ||
## ||    5% completed, 0.4 mins elapsed, rate=186.6k reads per second            ||
## ||   12% completed, 0.4 mins elapsed, rate=204.5k reads per second            ||
## ||   18% completed, 0.4 mins elapsed, rate=213.2k reads per second            ||
## ||   24% completed, 0.4 mins elapsed, rate=219.0k reads per second            ||
## ||   31% completed, 0.4 mins elapsed, rate=222.2k reads per second            ||
## ||   38% completed, 0.4 mins elapsed, rate=223.7k reads per second            ||
## ||   44% completed, 0.4 mins elapsed, rate=224.6k reads per second            ||
## ||   51% completed, 0.4 mins elapsed, rate=225.0k reads per second            ||
## ||   58% completed, 0.4 mins elapsed, rate=225.7k reads per second            ||
## ||   64% completed, 0.4 mins elapsed, rate=227.6k reads per second            ||
## ||   70% completed, 0.4 mins elapsed, rate=14.1k reads per second             ||
## ||   74% completed, 0.4 mins elapsed, rate=14.8k reads per second             ||
## ||   77% completed, 0.4 mins elapsed, rate=15.3k reads per second             ||
## ||   81% completed, 0.4 mins elapsed, rate=15.9k reads per second             ||
## ||   84% completed, 0.4 mins elapsed, rate=16.4k reads per second             ||
## ||   87% completed, 0.4 mins elapsed, rate=16.9k reads per second             ||
## ||   91% completed, 0.4 mins elapsed, rate=17.4k reads per second             ||
## ||   94% completed, 0.4 mins elapsed, rate=17.9k reads per second             ||
## ||   97% completed, 0.4 mins elapsed, rate=18.4k reads per second             ||
## ||                                                                            ||
## ||                           Completed successfully.                          ||
## ||                                                                            ||
## \\====================================    ====================================//
## 
## //================================   Summary =================================\\
## ||                                                                            ||
## ||                 Total reads : 500,000                                      ||
## ||                      Mapped : 413,480 (82.7%)                              ||
## ||             Uniquely mapped : 413,480                                      ||
## ||               Multi-mapping : 0                                            ||
## ||                                                                            ||
## ||                    Unmapped : 86,520                                       ||
## ||                                                                            ||
## ||                      Indels : 0                                            ||
## ||                                                                            ||
## ||                Running time : 0.4 minutes                                  ||
## ||                                                                            ||
## \\============================================================================//

Rsubread sgRNA alignment

With a little more stringency we have expected less alignment. This alignment rate is in keeping with observed results from the PinAPL-Py.

myMapped 
##                       Control_R1_S14_L008_R1_001_x.bam
## Total_reads                                     500000
## Mapped_reads                                    413480
## Uniquely_mapped_reads                           413480
## Multi_mapping_reads                                  0
## Unmapped_reads                                   86520
## Indels                                               0

Rsubread sgRNA aligned data

We can use the GenoomicAlignments package to read in our newly created BAM file using the readGAlignments function.

require(GenomicAlignments)
temp <- readGAlignments(gsub(".fastq.gz",".bam",myFQs))
temp
## GAlignments object with 413480 alignments and 0 metadata columns:
##                seqnames strand       cigar    qwidth     start       end
##                   <Rle>  <Rle> <character> <integer> <integer> <integer>
##        [1] HGLibB_08643      +    29S20M1S        50         1        20
##        [2] HGLibA_47249      +    29S19M2S        50         1        19
##        [3] HGLibB_55658      +    29S19M2S        50         1        19
##        [4] HGLibA_20205      +    29S20M1S        50         1        20
##        [5] HGLibB_32397      +    29S20M1S        50         1        20
##        ...          ...    ...         ...       ...       ...       ...
##   [413476] HGLibA_12772      +    29S20M1S        50         1        20
##   [413477] HGLibA_11892      +    28S20M2S        50         1        20
##   [413478] HGLibB_42020      +    29S19M2S        50         1        19
##   [413479] HGLibB_38831      +    28S20M2S        50         1        20
##   [413480] HGLibA_37244      +    29S20M1S        50         1        20
##                width     njunc
##            <integer> <integer>
##        [1]        20         0
##        [2]        19         0
##        [3]        19         0
##        [4]        20         0
##        [5]        20         0
##        ...       ...       ...
##   [413476]        20         0
##   [413477]        20         0
##   [413478]        19         0
##   [413479]        20         0
##   [413480]        20         0
##   -------
##   seqinfo: 122411 sequences from an unspecified genome

Rsubread sgRNA aligned data

Now we can review the alignments we can see how Rsubread has dealt with read sequences longer than the sgRNA to align to.

As we can see in the cigar strings, subread has softcliped the start and end of reads around the point of alignment and so removed adaptors as part of alignment.

temp
## GAlignments object with 413480 alignments and 0 metadata columns:
##                seqnames strand       cigar    qwidth     start       end
##                   <Rle>  <Rle> <character> <integer> <integer> <integer>
##        [1] HGLibB_08643      +    29S20M1S        50         1        20
##        [2] HGLibA_47249      +    29S19M2S        50         1        19
##        [3] HGLibB_55658      +    29S19M2S        50         1        19
##        [4] HGLibA_20205      +    29S20M1S        50         1        20
##        [5] HGLibB_32397      +    29S20M1S        50         1        20
##        ...          ...    ...         ...       ...       ...       ...
##   [413476] HGLibA_12772      +    29S20M1S        50         1        20
##   [413477] HGLibA_11892      +    28S20M2S        50         1        20
##   [413478] HGLibB_42020      +    29S19M2S        50         1        19
##   [413479] HGLibB_38831      +    28S20M2S        50         1        20
##   [413480] HGLibA_37244      +    29S20M1S        50         1        20
##                width     njunc
##            <integer> <integer>
##        [1]        20         0
##        [2]        19         0
##        [3]        19         0
##        [4]        20         0
##        [5]        20         0
##        ...       ...       ...
##   [413476]        20         0
##   [413477]        20         0
##   [413478]        19         0
##   [413479]        20         0
##   [413480]        20         0
##   -------
##   seqinfo: 122411 sequences from an unspecified genome

Rsubread sgRNA aligned data

We can then review where the softclipping and matches occur in our read by interrogating these cigar strings using the cigar functions to extract them from our GAlignements object.

cigars <- cigar(temp)
cigars[1:5]
## [1] "29S20M1S" "29S19M2S" "29S19M2S" "29S20M1S" "29S20M1S"

Rsubread sgRNA aligned data

We can use the cigarToRleList function to turn our cigar strings into a list of RLE objects.

cigarRLE <- cigarToRleList(cigars)
cigarRLE[1]
## RleList of length 1
## [[1]]
## character-Rle of length 50 with 3 runs
##   Lengths:  29  20   1
##   Values : "S" "M" "S"

Rsubread sgRNA aligned data

And as our reads are all the same length we can now turn our list of RLEs into a matrix of cigar values with rows representing reads and columns the cycles acorss the reads.

cigarMat <- matrix(as.vector(unlist(cigarRLE)),ncol=50,byrow = TRUE)
cigarMat[1:2,]
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"   "S"   "S"   "S"   "S"  
## [2,] "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"  "S"   "S"   "S"   "S"   "S"  
##      [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25] [,26]
## [1,] "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"  
## [2,] "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"   "S"  
##      [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36] [,37] [,38]
## [1,] "S"   "S"   "S"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"  
## [2,] "S"   "S"   "S"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"  
##      [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46] [,47] [,48] [,49] [,50]
## [1,] "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "S"  
## [2,] "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "M"   "S"   "S"

Rsubread sgRNA aligned data

From this we can now get the frequency of S or M using the table function.

cigarFreq <- apply(cigarMat,2,table)
cigarFreq[1:2,]
##     [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]   [,8]   [,9]  [,10]  [,11]
## M     41    699    824    867    868    868    873    878    878    878    878
## S 413439 412781 412656 412613 412612 412612 412607 412602 412602 412602 412602
##    [,12]  [,13]  [,14]  [,15]  [,16]  [,17]  [,18]  [,19]  [,20]  [,21]  [,22]
## M    880    880    881    902    905    883    883    882    835    767    120
## S 412600 412600 412599 412578 412575 412597 412597 412598 412645 412713 413360
##    [,23]  [,24]  [,25]  [,26]  [,27]  [,28]  [,29]  [,30]  [,31]  [,32]  [,33]
## M     61     44     51    123    972   2672  31947 405383 411237 411654 412178
## S 413419 413436 413429 413357 412508 410808 381533   8097   2243   1826   1302
##    [,34]  [,35]  [,36]  [,37]  [,38]  [,39]  [,40]  [,41]  [,42]  [,43]  [,44]
## M 412575 412575 412575 412574 412574 412572 412571 412571 412562 412555 412547
## S    905    905    905    906    906    908    909    909    918    925    933
##    [,45]  [,46]  [,47]  [,48]  [,49]  [,50]
## M 412532 412335 409707 403963 312939   2308
## S    948   1145   3773   9517 100541 411172

Rsubread sgRNA aligned data

With this frequency table we can construct a ggplot to show the distibution of soft-clips and matches across the read.

require(ggplot2)
toPlot <- data.frame(Freq=c(cigarFreq["S",],cigarFreq["M",]),
                      Cigar=rep(c("S","M"),each=ncol(cigarFreq)),
                      Cycle=rep(seq(1,ncol(cigarFreq)),2))
ggplot(toPlot,aes(x=Cycle,y=Freq,colour=Cigar))+geom_line()+theme_bw()

Rsubread counts per sgRNA

To obtain the counts per sgRNA we can simply take the frequncy of contig occurrence.

counts <- data.frame(table(seqnames(temp)),row.names = "Var1")
counts[1:2,,drop=FALSE]
##              Freq
## HGLibA_00001    0
## HGLibA_00002    1

Rsubread counts per sgRNA

We can now compare to the counts from the same file retreived from PinAPL-Py.

ss <- read.delim("New_Example_1580342908_4/Analysis/01_Alignment_Results/ReadCounts_per_sgRNA/Control_1_GuideCounts.txt")
new <- merge(ss,counts,by.x=1,by.y=0)
plot(new$X0,new$Freq)

Rsubread counts per sgRNA

This looked quite different for low counts. We will also add 1 more additional filter to screen out reads with less than 20 matches.

counts <- data.frame(table(seqnames(temp[width(temp) == 20])),row.names = "Var1")
new <- merge(ss,counts,by.x=1,by.y=0)
plot(new$X0,new$Freq)

Rsubread counts per sgRNA

Now we can align the reads and count sgRNAs we could run the counting over each of our files of interest. I have done this already and provide a solution in the exercises based on the previous slides.

For now we can work with the result stored in a matrix object called sgRNAcounts.

load("data/sgRNACounts.RData")
sgRNAcounts[1:4,]
##              Control_1 Control_2 ToxB_1 ToxB_2
## HGLibA_00001         0         7      2      8
## HGLibA_00002         0         0      0      0
## HGLibA_00003         1         0      0      0
## HGLibA_00004         9         0      0      1

Rsubread counts per sgRNA

To normalise our data and test for enrichment we will use the DESeq2 package.

First we need to constuct our DESeqDataset object using familiar DESeqDataSetFromMatrix function

library(DESeq2)
metadata <- DataFrame(Group=factor(c("Control","Control","ToxB","ToxB"),
                                   levels = c("Control","ToxB")),
                      row.names = colnames(sgRNAcounts))
dds <- DESeqDataSetFromMatrix(sgRNAcounts,colData = metadata,design = ~Group)

Rsubread counts per sgRNA

Now we have our DESeqDataset object we can use the DESeq function to normalise and test for enrichment.

dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing

Rsubread counts per sgRNA

A useful QC with sgRNA enrichment analysis is to review the distribution of counts for sgRNA guides before and after treatment. Here we extract the normalised counts and plot the distbutions of log2 counts for sgRNAs as a boxplot.

normCounts <- counts(dds,normalized=TRUE)
boxplot(log2(normCounts+0.1))

Rsubread counts per sgRNA

To identify sgRNAs enriched in our Tox condition we can now use the standard differential analysis of DESeq2 by way of the results function.

ToxBvsControl <- results(dds,contrast=c("Group","ToxB","Control"))
ToxBvsControl <- ToxBvsControl[order(ToxBvsControl$pvalue),]
ToxBvsControl
## log2 fold change (MLE): Group ToxB vs Control 
## Wald test p-value: Group ToxB vs Control 
## DataFrame with 122411 rows and 6 columns
##               baseMean log2FoldChange     lfcSE      stat      pvalue
##              <numeric>      <numeric> <numeric> <numeric>   <numeric>
## HGLibA_02011  2269.551       11.02447   1.84655   5.97030 2.36814e-09
## HGLibA_64400   572.199       12.78926   2.38743   5.35691 8.46576e-08
## HGLibB_02299  2910.432        9.46052   1.77452   5.33132 9.75036e-08
## HGLibA_39135  1804.121       10.44016   2.00996   5.19422 2.05577e-07
## HGLibB_51676  2751.242        9.63231   1.86532   5.16390 2.41858e-07
## ...                ...            ...       ...       ...         ...
## HGLibB_57015         0             NA        NA        NA          NA
## HGLibB_57016         0             NA        NA        NA          NA
## HGLibB_57019         0             NA        NA        NA          NA
## HGLibB_57021         0             NA        NA        NA          NA
## HGLibB_57026         0             NA        NA        NA          NA
##                     padj
##                <numeric>
## HGLibA_02011 3.09658e-05
## HGLibA_64400 4.24986e-04
## HGLibB_02299 4.24986e-04
## HGLibA_39135 6.32506e-04
## HGLibB_51676 6.32506e-04
## ...                  ...
## HGLibB_57015          NA
## HGLibB_57016          NA
## HGLibB_57019          NA
## HGLibB_57021          NA
## HGLibB_57026          NA

Rsubread counts per sgRNA

We can now compare our results to those provided by PinAPL-py. Although there us some differences the correlation here is 99%.

ss <- read.delim("New_Example_1580342908_4/Analysis/02_sgRNA-Ranking_Results/sgRNA_Rankings/ToxB_avg_0.01_Sidak_sgRNAList.txt")
toPlt <- merge(ss,as.data.frame(ToxBvsControl),by.x=1,by.y=0)
corr <- cor(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange)
plot(log2(toPlt[!is.na(toPlt$padj),]$fold.change),toPlt[!is.na(toPlt$padj),]$log2FoldChange,main=corr)

sgRNA to Gene

We now would want to identify genes whose sgRNAs are enriched. Most sgRNA libraries will have multiple sgRNA guides for a single gene and so we can use makes use of these in selected genes of interest.

Two main approaches are to either set a cut-off for number of enriched sgRNAs per gene or to produce an aggregate score of sgRNAs for a gene.

Here we will employ the first strategy.

sgRNA to Gene

We can first identify which sgRNAs were significantly enriched in ToxB.

ToxBvsControl <- as.data.frame(ToxBvsControl)[order(ToxBvsControl$pvalue),]
ToxBvsControl$Enriched <- !is.na(ToxBvsControl$padj) & ToxBvsControl$pvalue < 0.05 &ToxBvsControl$log2FoldChange > 0
ToxBvsControl[1:2,]
##               baseMean log2FoldChange    lfcSE     stat       pvalue
## HGLibA_02011 2269.5511       11.02447 1.846551 5.970302 2.368142e-09
## HGLibA_64400  572.1991       12.78926 2.387432 5.356909 8.465756e-08
##                      padj Enriched
## HGLibA_02011 3.096583e-05     TRUE
## HGLibA_64400 4.249859e-04     TRUE

sgRNA to Gene

We also need to add some information of sgRNA to gene. This is available in the same file containing the sequence information.

ToxBvsControl <- merge(GeCKO,ToxBvsControl,by.x=2,by.y=0)
ToxBvsControl[1:2,]
##            UID gene_id                  seq baseMean log2FoldChange    lfcSE
## 1 HGLibA_00001    A1BG GTCGCTGAGCTCCGATTCGA 4.588529       1.022701 2.679293
## 2 HGLibA_00002    A1BG ACCTGTAGTTGCCGGCGTGC 0.000000             NA       NA
##        stat    pvalue      padj Enriched
## 1 0.3817054 0.7026799 0.8083972    FALSE
## 2        NA        NA        NA    FALSE

sgRNA to Gene

We can now loop through our sgRNA results and summarise to the gene level.

genes <- unique(ToxBvsControl$gene_id)
listofGene <- list()
for(i in 1:length(genes)){
  tempRes <- ToxBvsControl[ToxBvsControl$gene_id %in% genes[i],]
  meanLogFC <- mean(tempRes$log2FoldChange,na.rm=TRUE)
  logFCs <- paste0(tempRes$log2FoldChange,collapse=";")
  minPvalue <- min(tempRes$pvalue,na.rm=TRUE) 
  pvalues <- paste0(tempRes$pvalue,collapse=";")
  nEnriched <- sum(tempRes$Enriched,na.rm=TRUE) 
  listofGene[[i]] <- data.frame(Gene=genes[i],meanLogFC,logFCs,minPvalue,pvalues,nEnriched)
}
geneTable <- do.call(rbind,listofGene)
geneTable <- geneTable[order(geneTable$nEnriched,decreasing=TRUE),]

sgRNA to Gene

Finally we can set a cut-off of enrichment by at least two guides and select genes to investigate further.

geneTable[1:3,]
##        Gene meanLogFC
## 329   HBEGF 3.4286182
## 17   CPSF3L 0.9897694
## 28  PIP5K1B 3.2540767
##                                                                                                        logFCs
## 329  5.91201414165886;5.94417514624874;4.87849538275283;5.55277593166671;-0.704130722997228;-1.01162051433903
## 17  11.4822538345679;6.51040028627386;-5.94602232979228;-2.59617588084159;-2.57900317698969;-0.93283644538519
## 28                                 9.23864163813939;4.54175883040469;-5.10566945641389;4.34157590463247;NA;NA
##        minPvalue
## 329 1.515112e-03
## 17  1.616409e-06
## 28  4.500310e-06
##                                                                                                               pvalues
## 329 0.00151511233063106;0.00364650882426901;0.0196783715509742;0.0289925243536325;0.800810984608005;0.839540034529576
## 17  1.61640923806389e-06;0.00736461379039227;0.0382497535051929;0.592989202271584;0.595502117574694;0.730750118602134
## 28                                  4.50030988059637e-06;0.018688113990729;0.0978052274147489;0.122899091425096;NA;NA
##     nEnriched
## 329         4
## 17          2
## 28          2

Time for an exercise.

Link_to_exercises

Link_to_answers