These exercises are about manipulating single-cell data with Bioconductor packages. Please download the count matrix from DropBox and load it into a SingleCellExperiment object. Alternatively, you may use the rds file data/scSeq_CKO_sceSub.rds.

If you want to load the whole dataset:

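library(DropletUtils)
library(DropletTestFiles)
# Placeholder path to the directory holding the 10X count matrix; replace with your own
fname <- "~path to the 10X counting matrix"
# Read the counts into a SingleCellExperiment, using the cell barcodes as column names
sce <- read10xCounts(fname, col.names=TRUE)
# Save the object so it can be reloaded later with readRDS()
saveRDS(sce,"path to rds file")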

If you want to load a subset of the data:

library(DropletUtils)
# library(DropletTestFiles)
sce <- readRDS("data/scSeq_CKO_sceSub.rds")

Exercise 2 - Data manipulation with Bioconductor packages

Identify empty droplets

  1. Draw a knee plot and identify the inflection point and the knee point.
## Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0 omitted from
## logarithmic plot
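
One possible approach to this step is `barcodeRanks()` from DropletUtils, which ranks every barcode by its total UMI count and stores the knee and inflection points in its metadata. The sketch below runs on simulated counts from `DropletUtils:::simCounts()`, an internal helper used in the package's own examples and assumed here only as a stand-in; in the exercise, `counts(sce)` would take its place.

```r
library(DropletUtils)

# Simulated stand-in for the real data (assumption); use counts(sce) in the exercise.
set.seed(0)
sim.counts <- DropletUtils:::simCounts()

# Rank barcodes by total UMI count and locate the knee and inflection points.
br <- barcodeRanks(sim.counts)
knee <- metadata(br)$knee
inflection <- metadata(br)$inflection

# Knee plot on log-log axes, with both points marked as horizontal lines.
plot(br$rank, br$total, log = "xy", xlab = "Rank", ylab = "Total UMI count")
abline(h = knee, col = "dodgerblue", lty = 2)
abline(h = inflection, col = "forestgreen", lty = 2)
legend("bottomleft", lty = 2, col = c("dodgerblue", "forestgreen"),
       legend = c("knee", "inflection"))
```

A barcode with a total count of zero cannot be drawn on a logarithmic axis, which is what the warning above reports.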

  2. Identify non-empty droplets and compare the results with those from a hard cut-off
  • set the limit to 100 (filter out droplets with UMI counts less than 100)
  • use an FDR cut-off of 0.001 for non-empty droplets
## DataFrame with 400000 rows and 5 columns
##                        Total   LogProb    PValue   Limited       FDR
##                    <integer> <numeric> <numeric> <logical> <numeric>
## CCACGGAAGCTCTCGG-1         0        NA        NA        NA        NA
## TGTGGTAAGAGTCGGT-1         2  -11.1237 0.7561244     FALSE        NA
## CCCTCCTTCGTCTGCT-1         2  -18.2572 0.0727927     FALSE        NA
## ACTGAGTGTGTCGCTG-1         0        NA        NA        NA        NA
## CGTAGCGTCTCTGCTG-1         0        NA        NA        NA        NA
## ...                      ...       ...       ...       ...       ...
## AGAGCTTGTCCGACGT-1         0        NA        NA        NA        NA
## GCTGCGACAATAGCAA-1         0        NA        NA        NA        NA
## CGAACATAGTGAAGTT-1         2  -12.3017  0.632437     FALSE        NA
## TAGCCGGCATGTCGAT-1         0        NA        NA        NA        NA
## CGAGAAGAGCAGATCG-1         0        NA        NA        NA        NA
##    Mode   FALSE    TRUE    NA's 
## logical    3051    3943  393006
##        Limited
## Sig     FALSE TRUE
##   FALSE  3051    0
##   TRUE     62 3881
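
A minimal sketch of this step, assuming `emptyDrops()` from DropletUtils is the intended tool (the columns in the output above match its result). The data are again simulated with the internal `DropletUtils:::simCounts()` helper, and the hard cut-off used for comparison (barcodes above the knee point) is an assumption, since the exercise does not say which one to use.

```r
library(DropletUtils)

# Simulated stand-in for the real data (assumption); use counts(sce) in the exercise.
set.seed(100)
sim.counts <- DropletUtils:::simCounts()

# Barcodes with a total count at or below `lower` are assumed to be empty
# and are used to estimate the ambient profile.
e.out <- emptyDrops(sim.counts, lower = 100)

# Call non-empty droplets at an FDR of 0.001; untested barcodes give NA.
is.cell <- e.out$FDR <= 0.001
summary(is.cell)

# Non-significant barcodes with Limited == TRUE would suggest raising `niters`.
table(Sig = is.cell, Limited = e.out$Limited)

# Hard cut-off for comparison (assumption: keep barcodes above the knee point).
knee <- metadata(barcodeRanks(sim.counts))$knee
hard.cell <- e.out$Total > knee
table(emptyDrops = is.cell %in% TRUE, hardCutoff = hard.cell)
```

In the exercise, the non-empty droplets could then be kept with something like `sce[, which(is.cell)]`.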

Data normalization and cluster data

  1. Normalize and cluster now that empty droplets have been removed.
## Warning in (function (x, sizes, min.mean = NULL, positive = FALSE, scaling =
## NULL) : encountered non-positive size factor estimates
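
The warning above comes from pooled size factor estimation, so one plausible workflow is deconvolution normalization with scran followed by graph-based clustering. The sketch below is an assumption about the intended steps, not the exercise's reference solution, and it runs on a mock dataset from `mockSCE()`; in the exercise, the object filtered for non-empty droplets would be used instead. The numbers of genes and components are arbitrary illustration values.

```r
library(scran)
library(scater)

# Mock dataset as a stand-in (assumption); use the filtered sce in the exercise.
set.seed(1)
sce.mock <- mockSCE(ncells = 300, ngenes = 1000)

# Deconvolution size factors: pre-cluster the cells, then pool counts within clusters.
pre.clust <- quickCluster(sce.mock)
sce.mock <- computeSumFactors(sce.mock, clusters = pre.clust)
sce.mock <- logNormCounts(sce.mock)

# Select highly variable genes and reduce dimensions with PCA.
dec <- modelGeneVar(sce.mock)
hvgs <- getTopHVGs(dec, n = 500)
sce.mock <- runPCA(sce.mock, ncomponents = 20, subset_row = hvgs)

# Graph-based clustering on the principal components.
g <- buildSNNGraph(sce.mock, use.dimred = "PCA")
colLabels(sce.mock) <- factor(igraph::cluster_walktrap(g)$membership)
table(colLabels(sce.mock))
```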

Evaluate ambient RNA contamination

  1. Please estimate ambient RNA contamination, remove the contaminating counts, and validate the result using the Hba-a1 gene.
## ENSMUSG00000051951 ENSMUSG00000025902 ENSMUSG00000033845 ENSMUSG00000025903 
##       1.669575e-07       1.669575e-07       1.407972e-04       5.201547e-05 
## ENSMUSG00000104217 ENSMUSG00000033813 
##       1.669575e-07       6.783772e-05
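
A hedged sketch of the removal step, assuming `removeAmbience()` from DropletUtils is used. The data here are made up: 1000 genes, of which only the first 100 are truly expressed, plus a low level of ambient counts in every gene. In the exercise, the ambient profile would more likely come from the unfiltered matrix (for example via `estimateAmbience()`), the groups from the clustering step, and the validation from comparing Hba-a1 expression across clusters before and after removal.

```r
library(DropletUtils)

# Made-up data (assumption): the first 100 genes are expressed, the rest are ambient only.
set.seed(10)
ngenes <- 1000
ambient <- runif(ngenes, 0, 0.1)
expr <- c(runif(100) * 10, numeric(900))
y <- matrix(rpois(ngenes * 100, expr + ambient), nrow = ngenes)

# Cluster labels for the 100 cells; random here, taken from clustering in the exercise.
clusters <- sample(3, 100, replace = TRUE)

# Remove the estimated ambient contribution within each group of cells.
removed <- removeAmbience(y, ambient, groups = clusters)

# Ambient-only genes, before and after removal.
summary(rowMeans(y[101:1000, ]))
summary(rowMeans(removed[101:1000, ]))
```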

  2. Re-cluster data after ambient RNA removal

Remove doublets

  1. Please estimate doublets and evaluate whether the doublets are enriched in any cluster. Then try to remove the doublet cells/clusters.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2445  0.6309  0.8732  1.2381  7.3498

## 
##   no  yes 
## 3744  199
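
One way to sketch this step is with `computeDoubletDensity()` from scDblFinder, which scores each cell by the local density of simulated doublets. The example runs on a mock dataset from `mockSCE()`, and the threshold (flagging the top 5% of scores) is an arbitrary assumption for illustration, not the rule used to produce the counts above.

```r
library(scDblFinder)
library(scran)
library(scater)

# Mock dataset as a stand-in (assumption); use the clustered sce in the exercise.
set.seed(2)
sce.mock <- mockSCE(ncells = 300, ngenes = 1000)
sce.mock <- logNormCounts(sce.mock)
sce.mock <- runPCA(sce.mock, ncomponents = 20)
g <- buildSNNGraph(sce.mock, use.dimred = "PCA")
colLabels(sce.mock) <- factor(igraph::cluster_walktrap(g)$membership)

# Doublet score per cell, based on simulated doublets.
dbl.score <- computeDoubletDensity(sce.mock)
summary(dbl.score)

# Flag putative doublets (assumption: top 5% of scores).
is.dbl <- dbl.score > quantile(dbl.score, 0.95)
table(ifelse(is.dbl, "yes", "no"))

# Proportion of flagged cells per cluster, to check for enrichment.
prop.table(table(cluster = colLabels(sce.mock), doublet = is.dbl), margin = 1)

# Remove the flagged cells.
sce.clean <- sce.mock[, !is.dbl]
```

A cluster made up mostly of flagged cells could be dropped as a whole instead of removing cells one by one.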

Advanced QC plots

  1. Please estimate the mitochondrial content (is.mito), read counts (sum), and gene counts (detected) for each cell. Then draw the following plots.
  • violin plots of is.mito, sum, and detected
  • scatter plots for is.mito vs sum and detected vs sum
  2. Estimate the variance explained by each factor and find the factor that contributes most of the variance.
## Warning: Removed 905 rows containing non-finite values (stat_density).
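
Both QC steps above can be sketched with scater. The example runs on a mock dataset from `mockSCE()`, and treating the first 20 genes as mitochondrial is purely an assumption for illustration; for mouse data read with `read10xCounts()`, the mitochondrial genes would more likely be found by matching gene symbols, for example `grepl("^mt-", rowData(sce)$Symbol)`.

```r
library(scater)

# Mock dataset as a stand-in (assumption); use the cleaned sce in the exercise.
set.seed(3)
sce.mock <- mockSCE(ncells = 200, ngenes = 1000)

# Assumption: pretend the first 20 genes are mitochondrial.
is.mito <- seq_len(20)
sce.mock <- addPerCellQC(sce.mock, subsets = list(Mito = is.mito))

# Violin plots of mitochondrial percentage, read counts, and gene counts.
plotColData(sce.mock, y = "subsets_Mito_percent")
plotColData(sce.mock, y = "sum")
plotColData(sce.mock, y = "detected")

# Scatter plots against the read counts.
plotColData(sce.mock, x = "sum", y = "subsets_Mito_percent")
plotColData(sce.mock, x = "sum", y = "detected")

# Percentage of variance in each gene explained by each factor.
sce.mock <- logNormCounts(sce.mock)
vars <- getVarianceExplained(
  sce.mock, variables = c("sum", "detected", "subsets_Mito_percent"))
plotExplanatoryVariables(vars)

# Rank the factors by their average contribution across genes.
sort(colMeans(vars, na.rm = TRUE), decreasing = TRUE)
```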