Introduction to single cell sequencing

Overview

High-throughput Sequencing
Emergence of single-cell
Differing single-cell assays
Differing single-cell technologies
Overview of Single-cell sequencing

Sequencing

The emergence of high-throughput sequencing technologies such as 454 (Roche) and Solexa (Illumina) sequencing allowed for the highly parallel short read sequencing of DNA molecules.

454 (Roche) ~ 2005
Solexa (Illumina) ~ 2006

Sequencing overview

overview

Sequence files - FastQ

FastQ (FASTA with Qualities)

$igv$

“@” followed by identifier.
Sequence information.
“+”
Quality scores encodes as ASCI.

Sequence files - FastQ

FastQ - Header

igv

Header for each read can contain additional information
- HS2000-887_89 - Machine name.
- 5 - Flowcell lane.
- /1 - Read 1 or 2 of pair (here read 1)

Sequence files - FastQ

FastQ - Qualities

igv

Qualities follow “+” line.
-log10 probability of sequence base being wrong.
Encoded in ASCI to save space.
Used in quality assessment and downstream analysis

Sequencing use cases

ChIP-seq - Study transcription factor binding
RNA-seq - Study genome wide transcriptomics
Whole Genome/Exome Sequencing - Study genome variation/changes
MNAseq/DNAse-seq/ATAC-seq - Study chromatin structure/accessibility
CrispR-seq - Study genome wide phenotypes
And many more..

Bulk sequencing methods

Sequencing typically performed on bulk tissue or cells.

Tissues - Liver vs Heart
Conditions - Cancer vs surrounding normal tissue
Treatments - Hep-G2 cells with/without rapamycin
Genetic perturbation - Ikzf WT/KO in B-cells.

Analysis of the bulk characteristics of data without understanding of hetergeneity of data.

FACS Cell and Nuclei sorting

Newer technologies such as TRAP from the Heintz lab or nuclei sorting allow for capture of distinct cell types based on expressed markers.

Pros - Allow for the capture of rare cell populations such as specific neuron types.

Cons - Require known markers for desired cell populations.

Single-cell sequencing

With the advent of advanced microfluidics and refined sequencing technologies, single-cell sequencing has emerged as a technology to profile individual cells from a heterogeneous population without prior knowledge of cell populations.

Pros - No prior knowledge of cell populations required. - Simultaneously assess profiles of 1000s of cells.

Cons - Low sequencing sequencing depth for individual cells (1000s vs millions of reads for bulk).

Single-cell use cases

Single-cell sequencing, as with bulk sequencing, has now been applied to the study of a wide range of differing assays.

scRNA-seq/snRNA-seq - Transcriptomics
scATAC-seq/snATAC-seq - Chromatin accessibility
CITE-seq - Surface protein expression
AIRR - TCR/BCR profiling
Spatial-seq - Spatial profiling of gene expression.

Single-cell platforms

10x Genomics (microfluidics)
Smart-seq3 (microfluidics)
Drop-seq (microfluidics)
VASA-seq (FANS)
inDrop-seq (microfluidics)
In-house solutions

Many companies offer single-cell sequencing technologies which may be used with the Illumina sequencer.

Major Single-cell platforms

Two popular major companies offer the most used technologies.

10x Genomics
Smart-seq3

Major difference between the two are the sequencing depth and coverage profiles across transcripts.

10x is 3’ biased so will not provide full length transcript coverage.
As 10x is sequencing only the 3’ of transcripts/genes, less sequencing depth is required per transcript/gene

The 10x Genomics system

overview

The 10x Genomics system - Microfluidics

$Microfluidics$

Microfluidics

The 10x Genomics system - Gel beads

beads

The 10x Genomics system - Sequencing primers

primers

The 10x Genomics system - Sequencing R1

Read 1

The 10x Genomics system - Sequencing R2

Read 2

The 10x sequence reads

The sequence reads contain:-

Information on cell identity (10x barcode)
Information on molecule identity (UMI)
Information on sample identity (Index read)
Information on the RNA molecule (transcriptome read)

From Sequences to Processed data

As with standard bulk sequencing data, the next steps are typically to align the data to a reference genome/transcriptome and summarize data to a signal matrix.

Align reads to transcriptome and summarize signal to genes for RNAseq
Align reads to genome, call enriched windows/peaks and summarize signal in peaks.

Processing scRNA-seq/snRNA-seq data.

For the processing of scRNA/snRNA from fastQ to count matrix, there are many options available to us.

Alignment and counting - Cellranger count - STAR - STARsolo - Subread cellCounts

Pseudoalignment and counting - Salmon - Alevin - Kallisto - Bustools

Droplets to counts

The output of these tools is typically a matrix of the signal attributed to cells and genes (typically read counts).

This matrix is the input for all downstream post-processing, quality control, normalization, batch correction, clustering, dimension reduction and differential expression analysis.

The output matrix is often stored in a compressed format such as:- - MEX (Market Exchange Format) - HDF5 (Hierarchical Data Format)

]

overview ]

HDF5 format

Optimized format for big data.
Contains matrix information alongside column and row information.
Loom format version of HDF5 most popular.
Efficient to work with programmatically.

MEX format

Plain/Compressed text format
Contains matrix information in tsv file
Separate row and column files contain barcode (cell) and feature (gene/transcript) information.

This is an example of a directory produced by Cell Ranger.

Cell Ranger

Cell Ranger is the typical approach we use to process 10x data. The default setting are pretty good. This is an intensive program, so we will not be running this locally on your laptops. Instead we run it on remote systems , like the HPC.

If you are working with your own data, the data will often be provided as the Cell Ranger output by the Genomics/Bioinformatics teams, like here at Rockefeller University.

Cell Ranger

Cell Ranger is a suite of tools for single cell processing and analysis available from 10X Genomics. It performs key processing steps i.e. demultiplexing, conversion to FASTQ and mapping. It is also the first chance to delve into your data sets QC.

In this session we will give a brief overview of running this tool and then dive deeper into interpreting the outputs. ]

CellRanger ]

Do I need to run this?

Often genomics centers will run it for you and deliver mtx/hdf5 files (i.e. here at Rockefeller)
Why run Cell Ranger yourself?:
- You want to integrate data sets, but they were processed by different versions of Cell Ranger
- You may want to change Cell Ranger parameters i.e. export BAMs, or force an expected cell number
- Most commonly you want to reanalyze a published data set

Cell Ranger Download

Cell Ranger is available from the 10x genomics website.
Also available are pre-baked references for Human and Mouse genomes (GRCh37/38 and GRCm37)

] ]

Cell Ranger

Cell Ranger only runs on linux machines (CentOS/RedHat 7.0+ and Ubuntu 14.04+).
Due to memory requirements, typically users run Cell Ranger on a remote server and not their own machines
To download Cell Ranger and the required reference onto a remote server, we typically use the wget command (these commands are all on the Download page)

[This will all be in terminal on the server you are using]

wget -O cellranger-8.0.0.tar.gz "https://cf.10xgenomics.com/releases/cell-exp/cellranger-8.0.0.tar.gz?Expires=1711772964&Key-Pair-Id=APKAI7S6A5RYOXBWRPDA&Signature=muvzcbqxba6d-blyYS02MVfLlzwZk6iZNQWXdaoCLnl7owW2nEN-IHwSPwdNoYl-6Xia7rr0S1sLCUQTsekGm2pQKcd0kqK~ndHK0DM7SwSVpXLlRvBV5pXt~EIlsxATVBKVeQLnUy698N-WnRlT~ahjlU-nMdpomX9-lOkF~w8gbgHBdtPXunTWfW87sSJLpHMDVENSF7TFJsXERDwDnsXyQLCuEhfGTCOnupkaATlLEr9kaeCStePKkwGyqgi1m8Ua02NNGHWPIJ6I1mDt695wo~dgptpJF4SDNRTyE-TuXrHfIqRjZB60zhWRJczFo2kpL7FCKwliE-vJ6djcSw__"

Download reference genome for Cell Ranger i.e. Human genome (GRCh38)

wget "https://cf.10xgenomics.com/supp/cell-exp/refdata-gex-GRCh38-2024-A.tar.gz"

Cell Ranger set-up

Having downloaded the software and references, we can then unpack them.

tar -xzvf cellranger-8.0.0.tar.gz
tar -xzvf refdata-gex-GRCh38-2020-A.tar.gz

Finally we can add the cellranger directory to our PATH.

export PATH=/PATH_TO_CELLRANGER_DIRECTORY/cellranger-8.0.0:$PATH

Running Cell Ranger count

Now we have the downloaded Cell Ranger software and required pre-build reference for Human (GRCh38) we can start the generation of count data from scRNA-seq/snRNA-seq fastQ data.

Typically FastQ files for your scRNA run will have been generated using the Cell Ranger mkfastq toolset to produce a directory a FastQ files.

As we have the downloaded Cell Ranger software and required pre-build reference for Human (GRCh38) we can generate count data from the FASTQ files. This will make a count matrix and associated files.

If you are analyzing single nuclei RNA-seq data remember to set the –include-introns flag.

cellranger count --id=my_run_name \
   --fastqs=PATH_TO_FASTQ_DIRECTORY \
   --transcriptome=/PATH_TO_CELLRANGER_DIRECTORY/refdata-gex-GRCh38-2020-A
   --create-bam=true

Working with custom genomes

If you are working with a genome which is not Human and/or mouse you will need to find another source for your Cell Ranger reference.

Luckily many references are pre-built by other consortiums.
We can build our own references using other tools in Cell Ranger using the mkgtf and mkref functions.
For this you just need a FASTA file (DNA sequence) and a GTF file (Gene Annotation) for your reference.

Creating your own reference

To create your own references you will need two additional files.

FASTA file - File containing the full genome sequence for the reference of interest
GTF file - File containing the gene/transcript models for the reference of interest.

FASTA file (Genome sequence)

The reference genome stored as a collection of contigs/chromosomes.
A contig is a stretch of DNA sequence encoded as A,G,C,T,N.
Typically comes in FASTA format.
- “>” line contains information on contig
- Lines following contain contig sequence

igv

GTF (Gene Transfer Format)

igv

Used to genome annotation.
Stores position, feature (exon) and meta-feature (transcript/gene) information.
Importantly for Cell Ranger Count, only features labelled as exon (column 3) will be considered for counting signal in genes
Many genomes label mitochondrial genes with CDS and not exon so these must be updated

Using Cell Ranger mkgtf

Now we have the gene models in the GTF format we can use the Cell Ranger mkgtf tools to validate our GTF and remove any unwanted annotation types using the attribute flag.

Below is an example of how 10x generated the GTF for the Human reference.

cellranger mkgtf Homo_sapiens.GRCh38.ensembl.gtf \
Homo_sapiens.GRCh38.ensembl.filtered.gtf \
                   --attribute=gene_biotype:protein_coding \
                   --attribute=gene_biotype:lncRNA \
                   --attribute=gene_biotype:antisense \
                   --attribute=gene_biotype:IG_LV_gene \
                   --attribute=gene_biotype:IG_V_gene \
                   --attribute=gene_biotype:IG_V_pseudogene \
                   --attribute=gene_biotype:IG_D_gene \
                   --attribute=gene_biotype:IG_J_gene \
                   --attribute=gene_biotype:IG_J_pseudogene \
                   --attribute=gene_biotype:IG_C_gene \
                   --attribute=gene_biotype:IG_C_pseudogene \
                   --attribute=gene_biotype:TR_V_gene \
                   --attribute=gene_biotype:TR_V_pseudogene \
                   --attribute=gene_biotype:TR_D_gene \
                   --attribute=gene_biotype:TR_J_gene \
                   --attribute=gene_biotype:TR_J_pseudogene \
                   --attribute=gene_biotype:TR_C_gene

Using Cell Ranger mkref

Following filtering of your GTF to the required biotypes, we can use the Cell Ranger mkref tool to finally create our custom reference.

cellranger mkref --genome=custom_reference \
--fasta=custom_reference.fa  \
--genes=custom_reference_filtered.gtf

Output files

Outputs

Having completed the Cell Ranger count step, the user will have created a folder, with the name set by the –id flag from the count command.

Within this folder there will be the outs/ directory which contains all the outputs generated from Cell Ranger count.

Count Matrices

The count matrices to be used for further analysis are stored in both MEX and HDF5 formats within the output directories.

The filtered matrix only contains detected, cell-associated barcodes whereas the raw contains all barcodes (background and cell-associated).

MEX format - filtered_feature_bc_matrix - raw_feature_bc_matrix

HDF5 format - filtered_feature_bc_matrix.h5 - raw_feature_bc_matrix.h5

BAM files

The outs directory may also contain a BAM file of alignments for all barcodes against the reference (possorted_genome_bam.bam) as well as an associated BAI index file (possorted_genome_bam.bam.bai). This depends on whether you put true or false in the –create-bam argument. Older versions of Cell Ranger did not have this argument and would default to producing this BAM file.

This BAM file is sometimes used in downstream analysis such as scSplit/Velocyto as well as for the generation of signal graphs such as bigWigs.

igv

]

Cloupe files

Cell Ranger also outputs cloupe.cloupe files for visualization within the 10X Loupe browser software.

This allows for the visualization of scRNA-seq/snRNA-seq as a t-sne/umap with the ability to overlay metrics of QC and gene expression onto the cells in real time.

igv

Web Summary QC

QC outputs

Assessment of the overall quality of a scRNA-seq/snRNA-seq experiment after Cell Ranger can give our first chance to dig into the quality of your dataset and gain insight any issues we might face in data analysis.

QC is essential

There are many potential issues which can arise in scRNA-seq/snRNA-seq data including:

Empty droplets
Low quality cells (dead or dying)
Ambient RNA contamination
Doublets

Huemos et al. (2023)

Metrics and Web Summary

Cell Ranger will also output summaries of useful metrics as a text file (metrics_summary.csv) and as a intuitive web-page.

Metrics include:

Counts/UMIs per cell.
Number of cells detected.
Alignment quality.
Distribution of reads in genomic features.
Sequencing saturation
t-sne/UMAP with default clustering.

Metrics and Web Summary

Assessment of the overall quality of a scRNA-seq/snRNA-seq experiment after Cell Ranger can give our first insight into any issues we might face.

Web Summary Example

We will be looking at the web summary generated from a PBMC dataset with ~1,000 cells from a healthy human donor. The full experiment details can be found on the 10X website here and you can get a copy of a web summary from Cell Ranger version 8 here.

[NOTE: Older web summaries contain largely the same information, but slightly different layout]

Web Summary overview

The web summary html file contains an interactive report describing the most essential QC for your single cell experiment as well as initial clustering and dimension reduction for your data.

The web summary also contains useful information on the input files and the versions used in this analysis for later reproducibility. ]

websum ]

Run Summary

The first thing we can review is the Run Summary information panel. (Top Left) As most people do not run Cell Ranger themselves this is important to check it matches expectations.

Sample ID - Sample name (Assigned in cellranger count)
Chemistry - The 10x chemistry used
Include introns - Whether counting was run to include intronic counts (typical for single nucleus RNAseq, but also useful for scRNAseq)
Pipeline Version - Version of Cell Ranger used

websum

Command Line Arguments

A corresponding section of the web summary is the Command Line Arguments used to run Cell Ranger. (Bottom) Again this is an important section to double-check to make sure everything was run correctly.

Input Path - Path to the input fastq files
Reference Path and Transcriptome - References used in analysis

This is now under the Experimental Design tab of Cell Ranger 9.

websum

Mapping panel

The Mapping panel highlights information on the mapping of reads to the reference genome and transcriptome. Bottom Left

Reads Mapped to Genome - Total mapped reads
Reads Mapped Confidently to Genome - Uniquely mapped reads
Reads Mapped Confidently to Exonic/Intronic/Intergenic - Uniquely mapped reads to specific regions
Reads Mapped Confidently to Transcriptome - Reads mapped to a unique gene (and consistent with slice junctions)
Reads Mapped Antisense to Gene - Reads mapped to the opposite strand of a gene

websum ]

Mapping panel

Key Metrics we look for:

Mapped to Genome > 60% (usually range 50% ~ 90%)

Mapping rate to reference genome
Check reference genome version if too low

Reads Mapped Confidently to Transcriptome > 30% (usually > 60%)

Reflection of annotation to transcriptome
Check annotation if too low

Sequencing panel

The Sequencing panel highlights information on the quality of the Illumina sequencing. Top Right.

Number of reads - Total number of paired reads in library
Number of Short Reads Skipped - Reads that were filtered out for being too short
Valid Barcodes - Proportion of barcodes matching barcodes in whitelist (~ 740 thousand)
Valid UMIs - Total number of UMIs that are not all one base and contain no unknown bases
Sequencing saturation - Unique valid barcode/UMI versus all valid barcode/UMI
Q30 scores - Assessment of sequencing qualities for barcode/UMI/RNA Reads

websum ]

Sequencing panel

Key Metrics we look for:

Q30 Bases in RNA Read > 65% (usually > 80%)

Reflects the sequencing quality
Need to check with sequencing service supplier

Sequencing Saturation > 40% (usually range 20% ~ 80%)

Reflects the complexity of libraries
Consider reconstructing library if too low

Cells panel

The Cells panel highlights some of the most important information in the report: the total number of cells captured and the distribution of counts across cells and genes. Top Right. Their importance is clear as several metrics are repeated and placed as the headline of the report.

Estimated Number of Cells - Total number of barcodes associated to at least one cell.
Mean Reads per cell - Average reads in each cell
Median Genes per Cell - Median number of genes detected (at least 1 count) per cell associated barcodes

websum ]

Cells panel

The Cells panel also has other metrics which help describe the depth and ambient RNA proportion. Top Right.

Median Reads per Cell - Median number of transcriptome reads within cell associated barcodes
Fraction Reads in Cells - Fraction of reads from valid barcode, associated to a cell and mapped to transcriptome
Total Genes Detected - Number of genes with at least 1 count

websum

Cells panel

Key Metrics we look for:

Fraction Reads in Cells > 70% (usually > 85%)

Reflects the ambient RNA contamination
Consider correcting for ambient RNA if < 90%

Median reads per cell > 20,000/cell and estimated number of cells 500 - 10,000

May be caused by the failure of cell identification
Need to check knee plot and re-evaluate cell number

Knee plot

The Cell panel also includes an interactive knee plot.

The knee plot shows:

On the x-axis, the barcodes ordered by the most frequent on the left to the least frequent on the right
On the y-axis, the frequency of each ordered barcode.
Highlighted in dark blue are the barcodes marked as associated to cells.

]

Knee plot

It is apparent that barcodes labelled blue (cell-associated barcodes) do not have a cut-off just based on the UMI count.

In newer versions of Cell Ranger a two step process is used to define cell-associated barcodes based on the EmptyDrops method (Lun et al.,2019).

First high RNA containing cells are identified based on a UMI cut-off.
Second, low UMI containing cells are used as a background training set to identify additional cell-associated barcodes not called in first step.

If required, a –force-cells flag can be used with cellranger count to identify a specific number of cell-associated barcodes.

websum

Knee plot

The Knee plot also acts a good QC tools to investigate differing types of single cell failure.

Whereas our previous knee plot represented a good sample, differing knee plot patterns can be indicative of specific problems with the single cell protocol. We will show you some examples of these below from real data.

Knee plot

In this example we see no specific cliff and knee suggesting a failure in the integration of oil, beads and samples (wetting failure) or a compromised sample.

websum

Knee plot

If there is a clog in the machine we may see a knee plot where the overall number of samples is low.

websum

Knee plot

There may be occasions where we see two sets of cliff-and-knees in our knee plot.

This could be indicative of a heterogenous sample where we have two populations of cells with differing overall RNA levels.

Knee plots should be interpreted in the context of the biology under investigation.

websum

Knee plot

It is important to know what version and parameters were used to run Cell Ranger.

This cell calling step is continually updated and it can have a dramatic affect on your results. The Web Summary on 10X Genomics for this dataset is from Cell Ranger V3.0 if you want to compare.

Cell Ranger V9.0 was released relatively recently and, as with almost every prior version, there’s been a change in default parameters (EmptyDrops false discovery rate (FDR) threshold has been lowered).

Gene Expression page

The web-summary also contains an analysis page where default dimension reduction, clustering and differential expressions between clusters has been performed.

Additionally the analysis page contains information on sequencing saturation and gene per cell vs reads per cell.

analysis

t-SNE/UMAP and clustering

The t-SNE plot shows the distribution and similarity within your data.

Review for high and low UMI cells driving t-SNE structure and/or clustering.
Expected separation between and structure across clusters may be observed within the t-SNE plot.
Identify expected clusters based on expression of marker genes

analysis

Sequence and Gene saturation

The sequence saturation and Median genes per cell plots show these calculations (as show on summary page) over successive downsampling of the data.

By reviewing the curve of the down sampled metrics we can assess whether we are approaching saturation for either of these metrics.

]

igv

]

QC issues going forward?

Early issues with QC can manifest in many ways downstream. This is from a published dataset:

Next steps

Fingers crossed there’s no QC issues.

Often at this step we wouldn’t make any decisions unless there is a clear complete failure. This is an important first step in setting expectations/preparing for what you may need to do for the dataset.

Cell Ranger 9

New Cell Ranger came out recently. We are yet to test it but new features include:

Automated annotation for human data sets [in beta]
UMAPs are made instead of tSNE
New Web Summary layout
EmptyDrops false discovery rate (FDR) threshold has been lowered
cellranger mkfastq is being deprecated
New Workflow for hashtagged data

Comparison of other scRNA-seq assays

Non-droplet based assays

While the traditional 10X scRNA-seq transcriptomic assay is droplet based, other emerging methods are using non-droplet based methods to increase efficiency.

For example, ScaleBio uses sample fixation, followed by library construction in which transcripts are sequentially barcoded using a specialized plate design.

]

scalebio ]

Non-droplet based assays

With non-droplet based methods, different QC metrics have to be considered.
For example, ambient RNA correction software is droplet-based, so you wouldn’t use these methods for ScaleBio.

]

How spatial works - Visium from 10X

Fresh or frozen tissue is fixed onto a slide, and gene expression is captured from each section of the slide.

Each captured section has barcoded spots containing oligonucleotides with spatial barcodes unique to that spot.

Tissue is stained and then permeabilized, allowing released mRNA to bind to spatially barcoded oligonucleotides present on the spots.

]

visium ]

How spatial works - Visium from 10X

The outputs are filtered and raw matrices with gene expression information for each tissue-associated barcode.

Captured spots contain more than one cell, and need to be deconvoluted to predict celltype proportions in each spot.

You can use scRNA-seq data as a reference to estimate cellular proportion based on the gene expression profiles of each spot.

There is available software to do this: - BayesPrism: Deconvolution - SpaceFold: Mapping cells onto each spot

]

spatial ]

Single-cell RNA sequencing ~ Session 1

Bioinformatics Resource Center - Rockefeller University

https://rockefelleruniversity.github.io/SingleCell_Bootcamp/

brc@rockefeller.edu

Introduction to single cell sequencing

Overview

Sequencing

Sequencing overview

Sequence files - FastQ

Sequence files - FastQ

Sequence files - FastQ

Sequencing use cases

Bulk sequencing methods

FACS Cell and Nuclei sorting

Single-cell sequencing

Single-cell use cases

Single-cell platforms

Major Single-cell platforms

The 10x Genomics system

The 10x Genomics system - Microfluidics

The 10x Genomics system - Gel beads

The 10x Genomics system - Sequencing primers

The 10x Genomics system - Sequencing R1

The 10x Genomics system - Sequencing R2

The 10x sequence reads

From Sequences to Processed data

Processing scRNA-seq/snRNA-seq data.

Droplets to counts

HDF5 format

MEX format

Cell Ranger

Cell Ranger

Cell Ranger

Do I need to run this?

Cell Ranger Download

Cell Ranger

Cell Ranger set-up

Running Cell Ranger count

Working with custom genomes

Creating your own reference

FASTA file (Genome sequence)

GTF (Gene Transfer Format)

Using Cell Ranger mkgtf

Using Cell Ranger mkref

Output files

Outputs

Count Matrices

BAM files

Cloupe files

Web Summary QC

QC outputs

QC is essential

Metrics and Web Summary

Metrics and Web Summary

Web Summary Example

Web Summary overview

Run Summary

Command Line Arguments

Mapping panel

Mapping panel

Sequencing panel

Sequencing panel

Cells panel

Cells panel

Cells panel

Knee plot

Knee plot

Knee plot

Knee plot

Knee plot

Knee plot

Knee plot

Gene Expression page

t-SNE/UMAP and clustering

Sequence and Gene saturation

QC issues going forward?

Next steps

Cell Ranger 9

Comparison of other scRNA-seq assays

Non-droplet based assays