Exercise 2 - Bioconductor

These exercises are about manipulating single-cell data with Bioconductor packages. Please download the count matrix from Dropbox and load it into a SingleCellExperiment object. Alternatively, you may use the RDS file data/scSeq_CKO_sceSub.rds.

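If you want to load the whole dataset:

library(DropletUtils)
library(DropletTestFiles)
# Replace with the path to the directory holding the 10X count matrix
fname <- "~path to the 10X counting matrix"
# Read the counts into a SingleCellExperiment, using barcodes as column names
sce <- read10xCounts(fname, col.names=TRUE)
# Cache the object so it can be reloaded later with readRDS()
saveRDS(sce,"path to rds file")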

If you want to load a subset of the data:

library(DropletUtils)
# library(DropletTestFiles)
# Load the pre-built SingleCellExperiment containing a subset of the droplets
sce <- readRDS("data/scSeq_CKO_sceSub.rds")

Exercise 2 - Data manipulation with Bioconductor packages

Identify empty droplets

Draw a knee plot and identify the inflection point and the knee point.
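
One way to approach this is sketched below, assuming `sce` holds the raw, unfiltered counts loaded earlier. The helper name `plot_knee` and the colours are illustrative choices, not part of the exercise; `barcodeRanks()` from DropletUtils does the ranking and stores the knee and inflection estimates in its metadata.

```r
# Sketch: barcode-rank ("knee") plot for a SingleCellExperiment.
# `plot_knee` is a hypothetical helper name; `sce` is assumed to hold raw counts.
plot_knee <- function(sce) {
  # Rank barcodes by total UMI count
  bc_rank <- DropletUtils::barcodeRanks(SingleCellExperiment::counts(sce))
  knee <- unname(S4Vectors::metadata(bc_rank)$knee)
  inflection <- unname(S4Vectors::metadata(bc_rank)$inflection)

  # Barcodes with a total of zero cannot be drawn on a log axis,
  # so base graphics omits them with a warning
  plot(bc_rank$rank, bc_rank$total, log = "xy",
       xlab = "Rank", ylab = "Total UMI count")
  abline(h = knee, col = "dodgerblue", lty = 2)
  abline(h = inflection, col = "darkgreen", lty = 2)
  legend("bottomleft", legend = c("Knee", "Inflection"),
         col = c("dodgerblue", "darkgreen"), lty = 2)

  # Return both thresholds so they can be reused as hard cut-offs
  invisible(c(knee = knee, inflection = inflection))
}
```

Calling `cutoffs <- plot_knee(sce)` draws the plot and keeps the two thresholds for later comparison.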

## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot

Identify non-empty droplets and compare the results with those from a hard cut-off.

Set the limit to 100 (filter out droplets with UMI counts less than 100)
Use an FDR cut-off of 0.001 to call non-empty droplets
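
The two settings above could be applied roughly as follows. This is a sketch under assumptions: the "limit" is taken to be the `lower` argument of `emptyDrops()`, the helper name `call_nonempty` and the `hard_cutoff` and `seed` arguments are hypothetical, and the hard cut-off defaults to the same value as `lower` (the inflection point from a knee plot would be another reasonable choice).

```r
# Sketch: call non-empty droplets with emptyDrops and compare to a hard cut-off.
# `call_nonempty`, `hard_cutoff`, and `seed` are illustrative names.
call_nonempty <- function(sce, lower = 100, fdr = 0.001,
                          hard_cutoff = lower, seed = 100) {
  # emptyDrops computes Monte Carlo p-values, so fix the seed for reproducibility
  set.seed(seed)
  e_out <- DropletUtils::emptyDrops(SingleCellExperiment::counts(sce),
                                    lower = lower)

  # Barcodes at or below `lower` have an FDR of NA; treat them as empty
  is_cell <- !is.na(e_out$FDR) & e_out$FDR <= fdr
  above_cutoff <- e_out$Total > hard_cutoff

  list(
    stats = e_out,
    is_cell = is_cell,
    # Agreement between the emptyDrops calls and the hard cut-off
    vs_hard_cutoff = table(emptyDrops = is_cell, hard_cutoff = above_cutoff),
    # Non-significant barcodes with Limited = TRUE would suggest raising `niters`
    limited = table(Sig = e_out$FDR <= fdr, Limited = e_out$Limited)
  )
}
```

With the object loaded earlier, `res <- call_nonempty(sce)` returns the statistics and both tables, and `sce[, res$is_cell]` keeps only the droplets called as cells.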

## DataFrame with 400000 rows and 5 columns
##                        Total   LogProb    PValue   Limited       FDR
##                    <integer> <numeric> <numeric> <logical> <numeric>
## CCACGGAAGCTCTCGG-1         0        NA        NA        NA        NA
## TGTGGTAAGAGTCGGT-1         2  -11.1237 0.7561244     FALSE        NA
## CCCTCCTTCGTCTGCT-1         2  -18.2572 0.0727927     FALSE        NA
## ACTGAGTGTGTCGCTG-1         0        NA        NA        NA        NA
## CGTAGCGTCTCTGCTG-1         0        NA        NA        NA        NA
## ...                      ...       ...       ...       ...       ...
## AGAGCTTGTCCGACGT-1         0        NA        NA        NA        NA
## GCTGCGACAATAGCAA-1         0        NA        NA        NA        NA
## CGAACATAGTGAAGTT-1         2  -12.3017  0.632437     FALSE        NA
## TAGCCGGCATGTCGAT-1         0        NA        NA        NA        NA
## CGAGAAGAGCAGATCG-1         0        NA        NA        NA        NA

##    Mode   FALSE    TRUE    NA's 
## logical    3051    3943  393006

##        Limited
## Sig     FALSE TRUE
##   FALSE  3051    0
##   TRUE     62 3881

Data normalization and clustering

Normalize and cluster now that empty droplets have been removed.

Normalize and log-transform the counts
Use variance modeling to identify the top variable features
Make a UMAP plot
Perform graph-based clustering
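
The four steps above can be sketched with scran, scuttle, and scater as shown here. The helper name `normalize_and_cluster`, the number of variable genes, the value of `k`, and the choice of walktrap community detection are all assumptions made for illustration; `sce` is assumed to contain only the non-empty droplets.

```r
# Sketch: normalize, select variable genes, embed, and cluster.
# `normalize_and_cluster`, `n_hvg`, `k`, and `seed` are illustrative choices.
normalize_and_cluster <- function(sce, n_hvg = 2000, k = 10, seed = 1000) {
  set.seed(seed)

  # Deconvolution size factors on coarse pre-clusters, then log-normalization
  pre_clust <- scran::quickCluster(sce)
  sce <- scran::computeSumFactors(sce, clusters = pre_clust)
  sce <- scuttle::logNormCounts(sce)

  # Model the mean-variance trend and keep the top variable genes
  dec <- scran::modelGeneVar(sce)
  hvgs <- scran::getTopHVGs(dec, n = n_hvg)

  # PCA on the variable genes, then UMAP computed from the PCs
  sce <- scater::runPCA(sce, subset_row = hvgs)
  sce <- scater::runUMAP(sce, dimred = "PCA")

  # Graph-based clustering: shared-nearest-neighbour graph plus walktrap
  g <- scran::buildSNNGraph(sce, k = k, use.dimred = "PCA")
  sce$label <- factor(igraph::cluster_walktrap(g)$membership)

  list(sce = sce, hvgs = hvgs, dec = dec)
}
```

After `res <- normalize_and_cluster(sce)`, the embedding can be inspected with `scater::plotUMAP(res$sce, colour_by = "label")`.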

## Warning in (function (x, sizes, min.mean = NULL, positive = FALSE, scaling =
## NULL) : encountered non-positive size factor estimates

Evaluate ambient RNA contamination

Please estimate ambient RNA contamination, remove the contaminants, and validate the result using the Hba-a1 gene.
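
A possible route, sketched under assumptions, reuses the ambient profile that `emptyDrops()` stores in its metadata and passes it to `removeAmbience()` from DropletUtils. The helper name `remove_ambient` and the assay name "decontaminated" are hypothetical; `e_out` is assumed to be the DataFrame returned by `emptyDrops()` on the unfiltered counts, and `sce` is assumed to be filtered and to carry cluster labels in `sce$label`.

```r
# Sketch: subtract the estimated ambient profile cluster by cluster.
# `remove_ambient` and the "decontaminated" assay name are illustrative.
remove_ambient <- function(sce, e_out) {
  # Ambient proportions per gene, estimated from the low-count barcodes
  amb <- S4Vectors::metadata(e_out)$ambient[, 1]

  # The profile may cover only a subset of genes, so match rows first
  shared <- intersect(names(amb), rownames(sce))
  stripped <- sce[shared, ]

  # Remove the ambient contribution within each cluster
  decont <- DropletUtils::removeAmbience(
    SingleCellExperiment::counts(stripped),
    ambient = amb[shared],
    groups = stripped$label
  )

  # Store the corrected counts next to the originals
  SummarizedExperiment::assay(stripped, "decontaminated",
                              withDimnames = FALSE) <- decont
  stripped
}
```

To validate, one could compare the expression of Hba-a1 across clusters before and after correction, for example by log-normalizing the "decontaminated" assay and plotting the gene with `scater::plotExpression()`. Because the row names here are Ensembl IDs, the gene would first need to be looked up through the `Symbol` column of `rowData(sce)`.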

## ENSMUSG00000051951 ENSMUSG00000025902 ENSMUSG00000033845 ENSMUSG00000025903 
##       1.669575e-07       1.669575e-07       1.407972e-04       5.201547e-05 
## ENSMUSG00000104217 ENSMUSG00000033813 
##       1.669575e-07       6.783772e-05

Re-cluster data after ambient RNA removal

Remove doublets

Please estimate doublets and evaluate whether they are enriched in any cluster. Then try to remove the doublet cells or clusters.
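
One way to do this, offered as a sketch and not necessarily the method behind the output shown here, uses the simulation-based density score from scDblFinder. The helper names `flag_doublets` and `doublets_per_cluster` are hypothetical, `hvgs` is an optional vector of variable genes, and `sce` is assumed to carry log-normalized counts and cluster labels in `sce$label`.

```r
# Sketch: score doublets, convert scores to calls, and tabulate by cluster.
# `flag_doublets` and `doublets_per_cluster` are illustrative helper names.
flag_doublets <- function(sce, hvgs = NULL, seed = 100) {
  # Doublets are simulated at random, so fix the seed
  set.seed(seed)

  # Density of simulated doublets around each cell
  score <- scDblFinder::computeDoubletDensity(sce, subset.row = hvgs)
  sce$DoubletScore <- score

  # Threshold the scores into per-cell calls
  sce$Doublet <- scDblFinder::doubletThresholding(
    data.frame(score = score), method = "griffiths", returnType = "call"
  )
  sce
}

doublets_per_cluster <- function(sce) {
  # Row-wise proportions: fraction of each cluster assigned to each call
  tab <- table(cluster = sce$label, doublet = sce$Doublet)
  prop.table(tab, margin = 1)
}
```

After `sce <- flag_doublets(sce)`, `summary(sce$DoubletScore)` and `table(sce$Doublet)` summarize the result, and `doublets_per_cluster(sce)` shows whether any cluster is dominated by doublet calls. Cells, or whole clusters, can then be dropped by subsetting the columns of `sce` on `sce$Doublet` or `sce$label`.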

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2445  0.6309  0.8732  1.2381  7.3498

## 
##   no  yes 
## 3744  199

Advanced QC plots

Please estimate the mitochondrial content (is.mito), read counts (sum), and gene counts (detected) for each cell. Then draw the following plots:

Violin plots of is.mito, sum, and detected
Scatter plots of is.mito vs sum and detected vs sum
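
The metrics and plots listed above could be produced along these lines. This is a sketch that assumes gene symbols are stored in the `Symbol` column of `rowData(sce)` and that mouse mitochondrial genes start with "mt-"; the helper names `add_qc` and `plot_qc` are hypothetical.

```r
# Sketch: per-cell QC metrics and the requested plots.
# `add_qc`, `plot_qc`, and `mito_pattern` are illustrative names.
add_qc <- function(sce, mito_pattern = "^mt-") {
  symbols <- SummarizedExperiment::rowData(sce)$Symbol
  stopifnot(!is.null(symbols))

  # Flag mitochondrial genes by symbol
  is.mito <- grepl(mito_pattern, symbols)

  # Adds `sum`, `detected`, and `subsets_Mito_percent` to colData
  scuttle::addPerCellQC(sce, subsets = list(Mito = is.mito))
}

plot_qc <- function(sce) {
  # Omitting `x` gives a violin plot; supplying it gives a scatter plot
  list(
    violin_mito = scater::plotColData(sce, y = "subsets_Mito_percent"),
    violin_sum = scater::plotColData(sce, y = "sum"),
    violin_detected = scater::plotColData(sce, y = "detected"),
    mito_vs_sum = scater::plotColData(sce, x = "sum",
                                      y = "subsets_Mito_percent"),
    detected_vs_sum = scater::plotColData(sce, x = "sum", y = "detected")
  )
}
```

Running `sce <- add_qc(sce)` followed by `plots <- plot_qc(sce)` returns a named list of ggplot objects that can be printed one at a time.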

Estimate the variance explained by each factor and find the factor that contributes most of the variance.
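
A minimal sketch of this step uses `getVarianceExplained()` and `plotExplanatoryVariables()` from scater. The helper name `explain_variance` and the default list of variables are assumptions: they presume that `sce` has log-normalized counts and that the named columns already exist in `colData(sce)`.

```r
# Sketch: percentage of per-gene variance explained by each cell-level factor.
# `explain_variance` and the default `variables` are illustrative choices.
explain_variance <- function(sce,
                             variables = c("sum", "detected",
                                           "subsets_Mito_percent", "label")) {
  # One row per gene, one column per variable
  var_expl <- scater::getVarianceExplained(sce, variables = variables)

  # Rank variables by the median variance explained across genes
  ranked <- sort(apply(var_expl, 2, median, na.rm = TRUE), decreasing = TRUE)

  list(
    per_gene = var_expl,
    ranked = ranked,
    # Density of the per-gene values for each variable
    plot = scater::plotExplanatoryVariables(var_expl)
  )
}
```

The first entry of `explain_variance(sce)$ranked` names the factor with the largest median contribution. Genes with non-finite values are dropped from the density plot with a warning.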

## Warning: Removed 905 rows containing non-finite values (stat_density).