Full image here of workflow is here: overview
scRNA seq is a very variable process and each dataset has unique QC problems. Though our simplified approach often works, we also need many more tools in our toolbox to handle more complex datasets. Often it is a case of trial and error to see what approaches work best for your data.
We are going to try out a few tools and approaches. We want to wrap some of our analysis steps into a function to simplify rerunning things.
Normalization: * Log normalization with scale factor = 10,000 * Find Variable features with vst, select top 2000 variable features
Make clusters: * Scale data with ScaleData() * Principle Component Analysis by using RunPCA() with npcs=30 PCs * Make non-linear dimensional reduction in UMAP by using RunUMAP() with dims=1:10 * Estimate Neighbors by using FindNeighbors() with dims=1:10 * Identify clusters with FindClusters() by using resolution=0.5
quick_clust <- function(seu) {
set.seed(42)
seu <- ScaleData(seu, verbose = FALSE)
seu <- RunPCA(seu, npcs = 30, verbose = FALSE)
seu <- RunUMAP(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindNeighbors(seu, reduction = "pca", dims = 1:10, verbose = FALSE)
seu <- FindClusters(seu, resolution = 0.5, verbose = FALSE)
return(seu)
}
We will discuss a number of approaches and methods we may need to use to merge datasets.
We will be using a dataset containing 4 snRNAseq samples. This is from a preprint studying Alzheimers disease. Though this dataset has 10 samples, we will be using 4 for our example. There are 2 AD samples and 2 Control samples from the entorhinal cortex taht have been collected post-mortem.
You could access the raw data from GE0: GSE287652. But we have it here already loaded into a Seurat object and after some preprocessing.
Each dataset has been individually processed, QC’d and reviewed prior to combining into a list. Sotring everything in a list will make our life easier later.
Load into Seurat object.
Log Normalized and scaled.
Filtered based on MT levels (2.5%).
Filtered based on doublets (Scrublet).
Regressed to MT.
PCA.
Clusters.
UMAP.
We can use merge() function to integrate multiple data sets, as they are. We need to provide a single seurat object as our first argument, and a list containing the rest of our Seurat objects as the second. We also need to provide Group information and a name for the project as arguments.
seu_merge <- merge(my_seu_list[[1]], my_seu_list[2:4], add.cell.ids = c("AD2b", "AD4",
"C1", "C3"), project = "Merge")
head(seu_merge, 4)
## orig.ident nCount_RNA nFeature_RNA percent.mt
## AD2b_AAACCCACAGCTGAAG-1_C1 SeuratProject 763 390 0.39318480
## AD2b_AAACGAAAGGTCGTAG-1_C1 SeuratProject 1030 501 0.00000000
## AD2b_AAACGAAGTCTTGAGT-1_C1 SeuratProject 3556 2272 0.61797753
## AD2b_AAACGAATCACCTCGT-1_C1 SeuratProject 3091 1863 0.03234153
## scrubletScore sample_id paper_cluster group
## AD2b_AAACCCACAGCTGAAG-1_C1 0.03906250 C1 0 C
## AD2b_AAACGAAAGGTCGTAG-1_C1 0.08771930 C1 0 C
## AD2b_AAACGAAGTCTTGAGT-1_C1 0.06688963 C1 12 C
## AD2b_AAACGAATCACCTCGT-1_C1 0.04857143 C1 29 C
## paper_annot
## AD2b_AAACCCACAGCTGAAG-1_C1 Excitatory Neurons
## AD2b_AAACGAAAGGTCGTAG-1_C1 Excitatory Neurons
## AD2b_AAACGAAGTCTTGAGT-1_C1 Inhibitory Neurons
## AD2b_AAACGAATCACCTCGT-1_C1 Inhibitory Neurons
Now that our data is merged we need to reprocess it. This ensures nromalization and scaling is done in a globally context. We can use our processing and clustering functions to analyse our merged dataset.
Our UMAP shows our cells are quite similar. Bu there are a few regions that are distinct depending on the condition.
Reciprocal PCA minimizes the batch effects while merging different data sets.
The steps involved in rPCA:
First we need to identify features for integration. We do this on our list of Seurat objects. This is similar to the VariableFeatures function we ran on a single dataset, but works across all 4 datasets. We will then run scaling and PCA, using these features.
We can then identify anchors. These are the features through which we will integrate our data. Once we have these features, we can then integrate our data sets together.
anchors <- FindIntegrationAnchors(my_seu_list_rpca, anchor.features = feats, reduction = "rpca")
my_seu_merge_rpca <- IntegrateData(anchorset = anchors)
my_seu_merge_rpca
## An object of class Seurat
## 28110 features across 10018 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 1 layer present: data
## 1 other assay present: RNA
To evaluate how well the merge has worked we must check the clustering. Again we must scale, and then use our quick_clust function.
To assess the integration we can use similar metrics to assessing the QC of the datasets in general.
We also check the overlap of our datasets. Generally we expect most cells between samples to overlap, butt this can be very experiment dependent.
We can see that our data sets overlay with each other. It looks slightly better than before. The dataset looked pretty good before, but now the overlap is tighter.
Our UMAP shows our cell clusters are fairly distinct, though not perfect (and the clustering itself could do with some optimization).
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now. To do this we make a heatmap and then scale it along the row to accunt for the different number of cells in each sample.
library(pheatmap)
tbl <- table(my_seu_merge_rpca$sample_id, my_seu_merge_rpca$seurat_clusters)
pheatmap(tbl, scale = "row")
A given cell type should often be clustered together. This means marker genes should be specific. This oligodendrocyte marker has quite specific distribution.
This dataset also has some annotation in the paper_annot slot. We can check the distribution of the labels. Our UMAP shows our cell types are distinct.
Harmony works in a different way to rPCA.
Fuzzy clustering - It will assign clusters, but allow cells to belong to multiple clusters, and caluclate the strength of assingmetn to each cluster.
During clustering there is a penalty term, to maintain diversity of groups i.e. a cluster with just one sample is penalized. This means any structure due to a batch effect is not hard.
The centroid of each cluster is calculated for all cells and each individual sample, then a correction factor for each sample based on these centroids.
Cells are then moved based on the dataset correction factor, weighted by their individual assignment scores for each clsuter
Repeat iteratviely until convergence.
We can prepare for Harmony in much the same way as we prepare for the simple Seurat merge: merge, normalize, scale, PCA and UMAP.
seu_merge <- merge(my_seu_list[[1]], my_seu_list[2:4], add.cell.ids = c("C1", "C3",
"AD2b", "AD4"), project = "Merge")
seu_merge <- data_proc(seu_merge)
seu_merge <- ScaleData(seu_merge)
seu_merge <- RunPCA(seu_merge)
seu_merge <- RunUMAP(seu_merge, reduction = "pca", dims = 1:10, reduction.name = "umap")
As you can see we are back with our completely separate groups.
We can use the RunHarmony() function to implement the Harmony correction.
library(harmony)
seu_merge_harmony <- RunHarmony(seu_merge, group.by.vars = "sample_id", assay.use = "RNA")
seu_merge_harmony <- RunUMAP(seu_merge_harmony, reduction = "harmony", dims = 1:10,
reduction.name = "umap_harmony")
seu_merge_harmony <- FindNeighbors(seu_merge_harmony, reduction = "harmony")
seu_merge_harmony <- FindClusters(seu_merge_harmony)
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
##
## Number of nodes: 10018
## Number of edges: 338251
##
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9148
## Number of communities: 21
## Elapsed time: 1 seconds
We can see that our data sets overlay with each other. Again, it looks slightly better than before.
Our UMAP shows our cell clusters are fairly distinct, though not perfect (and the clustering itself could do with some optimization).
We can check the numbers in each cluster. Broadly, there are similar numbers per cluster now. To do this we make a heatmap and then scale it along the row to account for the different number of cells in each sample.
library(pheatmap)
tbl <- table(seu_merge_harmony$sample_id, seu_merge_harmony$seurat_clusters)
pheatmap(tbl, scale = "row")
A given cell type should often be clustered together. This means marker genes should be specific. This oligodendrocyte marker has quite specific distribution.
This dataset also has some annotation in the paper_annot slot. We can check the distribution of the labels. Our UMAP shows our cell types are distinct.
Using heatmaps we can also check how specific each cluster is to each cell type.
tbl <- table(seu_merge_harmony$seurat_clusters, seu_merge_harmony$paper_annot)
pheatmap(tbl, scale = "row")
Broadly it seems like rPCA may be the best option in this case. Why?:
The integration in general is subtle and rPCA is considered more conservative
Sample UMAP looks cleanest
The number of cells per cluster seems more equal
The annotation seems cleaner
Harmony
Works at the cluster level
Iterative nature pushes to convergence so can be a heavy handed
Allows for more complex model terms for batch correction
Fast and scalable
rPCA
Though Harmony and rPCA are our main workhorses for integration. There are many other alternatives too.
For more systematic comparison check out this paper.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 14143263 755.4 24674155 1317.8 24674155 1317.8
## Vcells 775953080 5920.1 1739313404 13270.0 1739313404 13270.0
As mentioned before this dataset has been manually annotated.
This was done by careful assessment of the marker genes. Here we are look at Oligodendrocyte markers.
To annotate the Single-cell data sets, we can evaluate the gene expression pattern of well known cell-type specific marker genes and make a manual annotation. This is time consuming and not systematic.
Here, we will introduce two more strategies to make cell type annotations automatically: 1. Mapping and annotating query datasets with Seurat using a reference data set. link 2. Make annotation with SingleR link
For this we need a reference dataset and a query dataset. Our annotation ends up being only as good as our reference dataset. There are many sources for reference data.
Depending on the format of the data, you may be set to go straightaway, need to do some light processing, or reanalyze the whole thing.
SingleR is Bioconductor package which can be used to predict annotations of a single-cell dataset. It is maybe the most flexible way of annotating your data, as it will accept a variety of kinds of reference data including bulk and scRNAseq experiments. SingleR works with a very simple method: calculate a spearman rank correlation between reference and test dataset for each label.
Despite this simplicity, there is also scope to do more complex annotation with some advanced features i.e. to improve resolution of related labels or using multiple references. You can dig into this further in the singleR book.
Lets start out by using the Human Primary Cell Atlas. This is a collection of microarray datasets from human primary cells that have been aggregated together. To access this data we will use the celldex package.
In this case the data is stored as a SummarizedExpriment; a specialized Bioconductor format designed for holding data matrices and their metadata. The reference data for SingleR can either be in this format or as a SingleCellExpriment (another Bioconductor format specific to single cell analysis).
## class: SummarizedExperiment
## dim: 19363 713
## metadata(0):
## assays(1): logcounts
## rownames(19363): A1BG A1BG-AS1 ... ZZEF1 ZZZ3
## rowData names(0):
## colnames(713): GSM112490 GSM112491 ... GSM92233 GSM92234
## colData names(3): label.main label.fine label.ont
At the moment our test data is a Seurat object which is not Bioconductor friendly.
## An object of class Seurat
## 28110 features across 10018 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 2 layers present: data, scale.data
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
We can simply extract out a data matrix containing count information. This is good for SingleR.
## 5 x 5 sparse Matrix of class "dgCMatrix"
## AAACCCACAGCTGAAG-1_C1 AAACGAAAGGTCGTAG-1_C1 AAACGAAGTCTTGAGT-1_C1
## ADAMTSL1 0.39228540 0.37254012 1.11566489
## THSD7B 0.02286489 0.05190584 -0.01465315
## IL1RAPL2 -0.01904459 -0.08152581 0.12641628
## RELN -0.45800974 -0.30815332 -0.64158447
## AC093607.1 0.06373820 0.06256471 0.01170564
## AAACGAATCACCTCGT-1_C1 AAACGAATCTCCGAAA-1_C1
## ADAMTSL1 -0.063393072 -0.05680296
## THSD7B 0.092400788 -0.02638186
## IL1RAPL2 -0.172235822 -0.00585319
## RELN 3.338650923 0.15192283
## AC093607.1 0.007405319 -0.03035472
We are almost ready to run SingleR. We have our test and reference data. The last part we need is labels. This can just be a vector containing all the labels. We can grab this easily from our reference dataset. We can see there are several options here. Let’s just stick to label.main.
## DataFrame with 713 rows and 3 columns
## label.main label.fine label.ont
## <character> <character> <character>
## GSM112490 DC DC:monocyte-derived:.. CL:0000840
## GSM112491 DC DC:monocyte-derived:.. CL:0000840
## GSM112540 DC DC:monocyte-derived:.. CL:0000840
## GSM112541 DC DC:monocyte-derived:.. CL:0000451
## GSM112661 DC DC:monocyte-derived:.. CL:0000451
## ... ... ... ...
## GSM556665 Monocyte Monocyte:S._typhimur.. CL:0000576
## GSM92231 Neurons Neurons:Schwann_cell CL:0002573
## GSM92232 Neurons Neurons:Schwann_cell CL:0002573
## GSM92233 Neurons Neurons:Schwann_cell CL:0002573
## GSM92234 Neurons Neurons:Schwann_cell CL:0002573
library(SingleR)
pred_res <- SingleR(ref = hpcad, test = my_seu_merge_rpca_mat, labels = hpcad$label.main)
The score is generated comparing the expression levels of a cell in query dataset and the expression pattern of certain group (eg. cell types) in reference dataset. A cell would be assigned as the cell type which has highest score
## DataFrame with 2 rows and 4 columns
## scores labels
## <matrix> <character>
## AAACCCACAGCTGAAG-1_C1 0.0699139:0.0443145: 0.00782103:... Neurons
## AAACGAAAGGTCGTAG-1_C1 0.0605863:0.0201250:-0.00486433:... MSC
## delta.next pruned.labels
## <numeric> <character>
## AAACCCACAGCTGAAG-1_C1 0.01087697 Neurons
## AAACGAAAGGTCGTAG-1_C1 0.00253192 MSC
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res$scores)
rownames(mat) <- rownames(pred_res)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE, fontsize_col = 5)
We can now add our labels back to our original Seurat object by a quick assignment. This then allows us to start reviewing the annotation in the context of UMAPs and also versus our other annotation.
my_seu_merge_rpca$hpcad_singleR_labels <- pred_res$labels
summ_table <- table(my_seu_merge_rpca$hpcad_singleR_labels, my_seu_merge_rpca$paper_annot)
pheatmap(summ_table, scale = "column", fontsize_row = 5)
Lets try an alternative dataset. Often we will have data from other Seurat objects that we want to use as a reference. Here we have a processed version of human data from the Allen Brain Map. This is 10X data from 2020.
There are several interesting labels associated with this data. Lets focus on the class_label.
## orig.ident nCount_RNA nFeature_RNA
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 SeuratProject 14396 4815
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 SeuratProject 12027 4372
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 SeuratProject 16112 5280
## AAACCCACACTACTTT-LKTX_190129_01_A01 SeuratProject 2994 1649
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 SeuratProject 5202 2499
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 SeuratProject 26504 6381
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 SeuratProject 21668 5470
## AAACCCATCATAAGGA-LKTX_190129_01_A01 SeuratProject 30141 7131
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 SeuratProject 31944 6864
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 SeuratProject 32476 7328
## percent.mt class_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 0 GABAergic
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACACTACTTT-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 0 Non-Neuronal
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCATCATAAGGA-LKTX_190129_01_A01 0 Glutamatergic
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 0 Glutamatergic
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 0 Glutamatergic
## cluster_label subclass_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 Inh L1-2 SST CCNJL Sst
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 Exc L5-6 FEZF2 IFNG-AS1 L5/6 NP
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 Exc L3-5 RORB LINC01202 L5 IT
## AAACCCACACTACTTT-LKTX_190129_01_A01 Exc L2 LINC00507 GLRA3 L2/3 IT
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 Oligo L2-6 OPALIN FTH1P3 Oligo
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 Exc L5-6 FEZF2 C9orf135-AS1 L6 CT
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 Exc L3-5 FEZF2 ASGR2 L5 ET
## AAACCCATCATAAGGA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2 L5 IT
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2 L5 IT
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 Exc L5 THEMIS RGPD6 L6 CT
## cell_type_alias_label
## AAACCCAAGGATTTCC-LKTX_190129_01_A01 Inh L1-2 SST CCNJL
## AAACCCAAGTATGGCG-LKTX_190129_01_A01 Exc L5-6 FEZF2 IFNG-AS1
## AAACCCACAAAGTGTA-LKTX_190129_01_A01 Exc L3-5 RORB LINC01202
## AAACCCACACTACTTT-LKTX_190129_01_A01 Exc L2 LINC00507 GLRA3
## AAACCCACAGTGAGCA-LKTX_190129_01_A01 Oligo L2-6 OPALIN FTH1P3
## AAACCCAGTCACCCTT-LKTX_190129_01_A01 Exc L5-6 FEZF2 C9orf135-AS1
## AAACCCAGTGTCCACG-LKTX_190129_01_A01 Exc L3-5 FEZF2 ASGR2
## AAACCCATCATAAGGA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2
## AAACCCATCTGTCCCA-LKTX_190129_01_A01 Exc L3-5 RORB LNX2
## AAACGAAGTATGAAGT-LKTX_190129_01_A01 Exc L5 THEMIS RGPD6
As our reference data is in a Seurat format we can just extract out the data matrix of counts. We also already have our matrix from our test data.
pred_res2 <- SingleR(ref = GetAssayData(abm), test = my_seu_merge_rpca_mat, labels = abm$class_label)
## DataFrame with 2 rows and 4 columns
## scores labels delta.next
## <matrix> <character> <numeric>
## AAACCCACAGCTGAAG-1_C1 0.138210:0.142565:0.0505849 Glutamatergic 0.0050974
## AAACGAAAGGTCGTAG-1_C1 0.218222:0.253712:0.0876373 Glutamatergic 0.0347486
## pruned.labels
## <character>
## AAACCCACAGCTGAAG-1_C1 Glutamatergic
## AAACGAAAGGTCGTAG-1_C1 Glutamatergic
By converting to a matrix, we can check the cell type scoring using a heatmap.
mat <- as.matrix(pred_res2$scores)
rownames(mat) <- rownames(pred_res2)
pheatmap::pheatmap(mat, scale = "row", show_rownames = FALSE, fontsize_col = 5)
We can now add our labels back to our original Seurat object by a quick assignment. This then allows us to start reviewing the annotation in the context of UMAPs and also versus our other annotation.
summ_table <- table(my_seu_merge_rpca$abm_singleR_labels, my_seu_merge_rpca$paper_annot)
pheatmap(summ_table, scale = "column", fontsize_row = 5)
Seurat uses a similar approach to integration to be able to find anchors between datasets. It can then use this to transfer label information from a reference to a test dataset.
A restriction of this approach is both datasets have to be Seurat objects and therefore single cell datasets. We will use the Allen Brain Map again for this analysis.
As with integration, we find integration features in common, then scale and run PCA in the context of these specific features.
transfer_list <- list(ref = abm, query = my_seu_merge_rpca)
feats <- SelectIntegrationFeatures(transfer_list)
transfer_list <- lapply(transfer_list, function(seu, feats) {
seu <- ScaleData(seu, features = feats, verbose = FALSE)
seu <- RunPCA(seu, features = feats, verbose = FALSE)
return(seu)
}, feats)
Identify the anchors between reference and query data sets, using FindTransferAnchors(). These are essential to transfer information from our reference to our query. We can then transfer the cell type information. For each cell in query dataset, the score for each given cell type was estimated by the gene expression pattern of anchor genes using the TransferData() function.
anchors <- FindTransferAnchors(reference = transfer_list$ref, query = transfer_list$query,
dims = 1:30, reference.reduction = "pca")
pred_res3 <- TransferData(anchorset = anchors, refdata = transfer_list$ref$class_label)
## predicted.id prediction.score.GABAergic
## AAACCCACAGCTGAAG-1_C1 Non-Neuronal 0
## AAACGAAAGGTCGTAG-1_C1 Non-Neuronal 0
## prediction.score.Glutamatergic
## AAACCCACAGCTGAAG-1_C1 0.3253133
## AAACGAAAGGTCGTAG-1_C1 0.1247228
## prediction.score.Non.Neuronal prediction.score.max
## AAACCCACAGCTGAAG-1_C1 0.6746867 0.6746867
## AAACGAAAGGTCGTAG-1_C1 0.8752772 0.8752772
The cell type with highest score was assigned to the given cell. We can visualize this score with a heatmap.
mat <- as.matrix(pred_res3[, -c(1, 5)])
colnames(mat) <- gsub("prediction.score.", "", colnames(mat))
pheatmap(mat, scale = "row", show_rownames = FALSE)
We can now add our labels back to our original Seurat object by a quick assignment. This then allows us to start reviewing the annotation in the context of UMAPs and also versus our other annotation.
summ_table <- table(my_seu_merge_rpca$abm_seurat_labels, my_seu_merge_rpca$paper_annot)
pheatmap(summ_table, scale = "column", fontsize_row = 5)
Next we can compare our two annotations:
##
## FALSE TRUE
## 3437 6581
tbl <- table(my_seu_merge_rpca$abm_seurat_labels, my_seu_merge_rpca$abm_singleR_labels)
pheatmap::pheatmap(tbl, scale = "row")
The results are similar but there is a clear difference in the identification of the OPC group. These are Oligodendrocyte precursor cells, and are considered non-neural.
In this case it seems that Seurat has done a better job. But maybe with a different reference, or with the more specific group labels SingleR may perform better.
## Seurat vs SingleR annotation
This isn’t always the case. As a general rule we find:
There have also been some interesting efforts to use LLMs or Deep Learning to annotate cells, either with specialized models or generic models like ChatGPT.
We have not exhaustively tested these, but the consensus is that they provide a good quick estimate, especially when you do not have a reliable reference dataset.
Often domain specific knowledge and good reference datasets will outperform the more general LLM approaches.
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 14133235 754.8 24674155 1317.8 24674155 1317.8
## Vcells 626978247 4783.5 1739313404 13270.0 1739313404 13270.0
We will use the rPCA integrated object from earlier lecture, and you can read it in from dropbox. It is the object called integrated.rds.
## An object of class Seurat
## 28110 features across 10018 samples within 2 assays
## Active assay: integrated (2000 features, 2000 variable features)
## 2 layers present: data, scale.data
## 1 other assay present: RNA
## 2 dimensional reductions calculated: pca, umap
Compared to bulk analysis, single cell RNAseq data presents several opportunities and challenges when doing differential gene expression.
As we’ve explored so far, we can identify different cell types and then find markers for these cell types within a population of cells. We did this with the Seurat FindAllMarkers function.
In this section we will compare conditions within important subsets of cells, such as interesting cell types. We will look at gene expression differences between AD and control cells.
First, we add metadata columns that will eventually allow us to set up these comparisons.
# metadata column for condition
seu_obj$condition <- ifelse(grepl("AD", seu_obj$sample_id), "AD", "CON")
# metadata column for condition + cell type
seu_obj$celltype_condition <- paste(seu_obj$paper_annot, seu_obj$condition, sep = "_")
table(seu_obj$celltype_condition)
##
## Astrocytes_AD Astrocytes_CON Endothelial_AD
## 674 353 70
## Endothelial_CON Excitatory Neurons_AD Excitatory Neurons_CON
## 51 2338 1295
## Inhibitory Neurons_AD Inhibitory Neurons_CON Microglia_AD
## 423 461 391
## Microglia_CON OPCs_AD OPCs_CON
## 180 257 206
## Oligodendrocytes_AD Oligodendrocytes_CON Pericytes_AD
## 1974 1107 137
## Pericytes_CON
## 101
While the integrated data is critical for visualization and clustering samples together, for differential analyses we will use the counts from each individual object.
Before we start our analysis, we then need to change the active assay to ‘RNA’ (instead of ‘integrated’)
[1] “integrated”
[1] “RNA”
As we learned, a Seurat object (version 5 or later) contains separate layers for each sample within an integrated object.
[1] “counts.1” “counts.2” “counts.3” “counts.4” “data.1”
[6] “scale.data.1” “data.2” “scale.data.2” “data.3” “scale.data.3” [11]
“data.4” “scale.data.4”
Before doing differential gene expression, we first must join the layers of the Seurat object for get one counts matrix for all samples.
[1] “scale.data” “data” “counts”
Expression data at the single cell level adds another layer of complexity and how we handle this can have huge impacts on identifying differentially expressed genes.
Expression data at the single cell level adds another layer of complexity and how we handle this can have huge impacts on identifying differentially expressed genes. * How to handle a gene by cell matrix as opposed to gene by replicate?
Expression data at the single cell level adds another layer of complexity and how we handle this can have huge impacts on identifying differentially expressed genes. * How to handle a gene by cell matrix as opposed to gene by replicate? * How to handle the high dropout rate in single cell RNAseq?
counts <- GetAssayData(seu_obj, assay = "RNA", layer = "counts")
counts_mat <- as.matrix(counts)
actin_counts <- counts_mat[rownames(counts_mat) == "ACTB"]
hist(actin_counts, breaks = 50)
This is a non-parametric test that is used by default in the Seurat FindMarkers function. It is very fast and generally works well when finding markers of different and distinct cell types that have been identified through clustering.
It can also be used to find differences between conditions, though this is likely not the best choice for this comparison.
library(tibble)
library(dplyr)
Idents(seu_obj) <- "celltype_condition"
de_wilcox <- FindMarkers(seu_obj, ident.1 = "Excitatory Neurons_AD", ident.2 = "Excitatory Neurons_CON",
logfc.threshold = 0) %>%
tibble::rownames_to_column("geneID")
head(de_wilcox, 3)
## geneID p_val avg_log2FC pct.1 pct.2 p_val_adj
## 1 XIST 2.951723e-276 -3.027978 0.132 0.676 7.706949e-272
## 2 ADARB2 1.653428e-202 -1.727079 0.239 0.720 4.317100e-198
## 3 CDH13 6.851910e-138 -1.357173 0.201 0.605 1.789034e-133
Parametric methods have been developed that attempt to address the added complexities of single cell RNAseq data.
For more subtle changes within your data, such as comparing conditions within a cluster or cell type, these more advanced methods often work better.
One commonly used example is MAST (Model-based Analysis of Single-cell Transcriptomics), which uses a two part generalized linear model to account for the zero-inflated, bi-modal nature of single cell RNA seq.
MAST models both the rate of a gene being detected (i.e. expression not zero) and the continuous level of expression (conditional on expression > 0)
counts_seu <- GetAssayData(seu_obj, assay = "RNA", layer = "data")
# SNX31 was identified as differentially expressed in excitatory neurons
seu_obj$SNX31 <- counts_seu[rownames(counts_seu) == "SNX31", ]
exn_toPlot <- seu_obj@meta.data[seu_obj@meta.data$paper_annot == "Excitatory Neurons",
]
ggplot(exn_toPlot, aes(x = SNX31, fill = condition)) + geom_density() + ggtitle("SNX31 expression in AD vs CON in Excitatory Neruons") +
theme_classic() + xlim(0, 2)
Covariate or batch terms can be accounted for in the model formula * For example, the fraction of genes expressed in a cell, or the cellular detection rate (CDR), is a suggested covariate + CDR is thought to measure unwanted technical and biological factors (e.g. batch, cell size)
MAST requires a SingleCellAssay object with counts on a log scale. Importantly, zero counts should remain zero in any normalized data.
Here we are going to compare AD to control in excitatory neurons. These neurons are extracted from the Seurat object, and then we build the SingleCellAssay object.
library(MAST)
seu_exn <- subset(seu_obj, paper_annot == "Excitatory Neurons")
seu_exn_data <- GetAssayData(seu_exn, assay = "RNA", layer = "data")
# create SingleCellAssay for MAST
sca_exn <- MAST::FromMatrix(exprsArray = as.matrix(seu_exn_data), cData = seu_exn@meta.data,
fData = data.frame(gene_id = rownames(seu_exn)))
sca_exn
## class: SingleCellAssay
## dim: 26110 3633
## metadata(0):
## assays(1): et
## rownames(26110): AL627309.1 AL627309.5 ... AC022486.1 AC004556.3
## rowData names(2): gene_id primerid
## colnames(3633): AAACCCACAGCTGAAG-1_C1 AAACGAAAGGTCGTAG-1_C1 ...
## TTTGTTGGTGCGGATA-1_AD4 TTTGTTGGTTCGGCGT-1_AD4
## colData names(15): orig.ident nCount_RNA ... SNX31 wellKey
## reducedDimNames(0):
## mainExpName: NULL
## altExpNames(0):
The CDR (the fraction of genes that are detected in each cell) is calculated for each cell, then scaled (z-score).
## [,1]
## AAACCCACAGCTGAAG-1_C1 -0.9377860
## AAACGAAAGGTCGTAG-1_C1 -0.9048629
## AAACGCTGTGTCCACG-1_C1 -0.8526606
## AAAGAACAGGTCACCC-1_C1 -0.7360953
## AAAGAACCAGAAGTTA-1_C1 -0.9271082
## AAAGGATGTATTGGCT-1_C1 2.0736346
Add the CDR to the metadata and set up the grouping factor we will use for the differential analysis.
colData(sca_exn)$ngeneson <- cdr_scale
cond <- factor(colData(sca_exn)$condition, levels = c("AD", "CON"))
cond <- relevel(cond, "CON")
colData(sca_exn)$Group <- cond
colData(sca_exn)[1:3, c("paper_annot", "ngeneson", "Group")]
## DataFrame with 3 rows and 3 columns
## paper_annot ngeneson Group
## <character> <matrix> <factor>
## AAACCCACAGCTGAAG-1_C1 Excitatory Neurons -0.937786 CON
## AAACGAAAGGTCGTAG-1_C1 Excitatory Neurons -0.904863 CON
## AAACGCTGTGTCCACG-1_C1 Excitatory Neurons -0.852661 CON
To understand how CDR is contributing to the variance of our data, we can look at correlations between CDR and principal components.
We obtain the embeddings from the PCA calculation previously done on the Seurat object.
pca_embed <- data.frame(Embeddings(seu_exn, reduction = "pca"))
pca_vars <- Stdev(seu_exn, reduction = "pca")^2
total_var <- sum(pca_vars) # total variance
pct_var_explained <- pca_vars/total_var * 100
pct_var_explained <- round(pct_var_explained, 1)
for_plot <- data.frame(colData(sca_exn))
for_plot <- merge(for_plot, pca_embed, by = 0)
CDR seems to correlate with PC1 from the data.
library(ggpubr)
p1 <- ggplot(for_plot, aes(x = ngeneson, y = PC_1, color = condition)) + geom_point() +
ylab(paste0("PC_1 (", pct_var_explained[1], "%)"))
p2 <- ggplot(for_plot, aes(x = ngeneson, y = PC_2, color = condition)) + geom_point() +
ylab(paste0("PC_2 (", pct_var_explained[2], "%)"))
ggarrange(p1, p2, nrow = 1, common.legend = TRUE, legend = "bottom")
The CDR is included in the model and will be accounted for in determining differentially expressed genes.
The zlm function can take a long time when you have a high number of cells. It returns a data table with information about the test.
Note: This can tke a long time to run, the result can be loaded on next slide
MAST can take some time to run, so to move forward download the following file from dropbox: sumDT_exn.rds
The table from the zlm function contains p-values for both the continuous gene expression (component = ‘C’), gene detection rate (component = ‘D’), and the combined p-value for the hurdle method (component = ‘H’).
The log2FC is also in this table in the ‘coef’ column.
## primerid component contrast Pr(>Chisq) coef
## <char> <char> <fctr> <num> <num>
## 1: A1BG C GroupAD 0.1082826 -0.063046860
## 2: A1BG C (Intercept) NA 1.124246862
## 3: A1BG C ngeneson NA -0.572281229
## 4: A1BG D GroupAD 0.5100539 0.100513994
## 5: A1BG D (Intercept) NA -3.662571467
## 6: A1BG D ngeneson NA 1.342774638
## 7: A1BG H GroupAD 0.2216773 NA
## 8: A1BG S GroupAD NA NA
## 9: A1BG S (Intercept) NA NA
## 10: A1BG S ngeneson NA NA
## 11: A1BG logFC GroupAD NA 0.001152782
## 12: A1BG logFC ngeneson NA 0.021265736
A data frame with the fold changes and combined hurdle p-values is constructed from this MAST output.
de_mast_exn <- merge(sumDT_exn[sumDT_exn$component == "H" & sumDT_exn$contrast ==
paste0("Group", "AD"), ], sumDT_exn[sumDT_exn$component == "logFC" & sumDT_exn$contrast ==
paste0("Group", "AD"), ], by = "primerid")
de_mast_exn <- dplyr::select(de_mast_exn, primerid, coef.y, z.y, `Pr(>Chisq).x`)
names(de_mast_exn) <- c("geneID", "log2Fc", "z", "pvalue")
de_mast_exn$FDR <- p.adjust(de_mast_exn$pvalue, "fdr")
de_mast_exn <- de_mast_exn[order(de_mast_exn$FDR), ]
head(de_mast_exn, 3)
## geneID log2Fc z pvalue FDR
## <char> <num> <num> <num> <num>
## 1: XIST -1.415323 -35.84507 4.653001e-254 1.214899e-249
## 2: PLCG2 1.114275 33.01194 4.636187e-233 6.052542e-229
## 3: ADARB2 -1.142667 -30.08747 6.272292e-183 5.458985e-179
de_mast_exn$sig <- de_mast_exn$FDR < 0.05
ggplot(de_mast_exn, aes(x = log2Fc, y = -log10(pvalue), color = sig)) + geom_point() +
scale_color_manual(values = c("grey", "red")) + theme_bw()
Look at the genes that go down in AD (up in control) in a violin plot
de_mast_exn_df <- data.frame(de_mast_exn)
de_mast_exn_df <- na.omit(de_mast_exn_df)
top_down_AD <- head(de_mast_exn_df[de_mast_exn_df$log2Fc < 0, ], 5)
VlnPlot(seu_exn, features = top_down_AD$geneID, stack = T, flip = T, pt.size = 1) +
scale_x_discrete(labels = c("CON", "AD")) + NoLegend()
What about genes that barely pass the differential cutoff?
de_mast_exn_sig <- de_mast_exn_df[de_mast_exn_df$FDR < 0.05, ]
sorta_down_AD <- tail(de_mast_exn_sig[de_mast_exn_sig$log2Fc < 0, ], 5)
sorta_down_AD
## geneID log2Fc z pvalue FDR sig
## 15212 AC105230.1 -0.012419823 -2.377461 0.02896481 0.04971543 TRUE
## 15213 NRIP3 -0.012876226 -1.394409 0.02898560 0.04974785 TRUE
## 15214 CALCR -0.001228994 -1.138183 0.02899530 0.04976123 TRUE
## 15217 LINC00629 -0.003408750 -1.572098 0.02905931 0.04986125 TRUE
## 15223 AL096803.4 -0.007086156 -1.987236 0.02913228 0.04996675 TRUE
What about genes that barely pass the differential cutoff?
VlnPlot(seu_exn, features = sorta_down_AD$geneID, stack = T, flip = T, pt.size = 1) +
scale_x_discrete(labels = c("CON", "AD")) + NoLegend()
We notice that the top DE gene from our analysis is XIST, which has much higher expression in females compared to males.
This suggests that sex could be a confounding factor and is potentially a red flag that we should explore. XIST does seem to have localized expression to the group of cells that are only in one of the controls.
We can see from the sample info taken from the publication that C3 is indeed female. So we appear to have pulled out a sample dependent effect, not a group dependent effect like we are aiming for.
We will re-run MAST with sex as a covariate.
Add sex to metadata of SingleCellAssay object
Run MAST with sex as a covariate
Note: This can take a long time to run, the result can be loaded on next slide
To see results, load the following file from dropbox: sumDT_exn_sex.rds
A data frame with the fold changes and combined hurdle p-values is constructed from this MAST output.
de_mast_exn_sex <- merge(sumDT_exn_sex[sumDT_exn_sex$component == "H" & sumDT_exn_sex$contrast ==
paste0("Group", "AD"), ], sumDT_exn_sex[sumDT_exn_sex$component == "logFC" &
sumDT_exn_sex$contrast == paste0("Group", "AD"), ], by = "primerid")
de_mast_exn_sex <- dplyr::select(de_mast_exn_sex, primerid, coef.y, z.y, `Pr(>Chisq).x`)
names(de_mast_exn_sex) <- c("geneID", "log2Fc", "z", "pvalue")
de_mast_exn_sex$FDR <- p.adjust(de_mast_exn_sex$pvalue, "fdr")
de_mast_exn_sex <- de_mast_exn_sex[order(de_mast_exn_sex$FDR), ]
head(de_mast_exn_sex, 3)
## geneID log2Fc z pvalue FDR
## <char> <num> <num> <num> <num>
## 1: PLCG2 1.17957273 22.907530 9.150363e-191 2.389160e-186
## 2: FP236383.3 -0.01912773 -4.251368 3.264977e-177 4.262428e-173
## 3: PEG3 1.00613240 24.410163 8.852879e-168 7.704956e-164
MAST output with Sex as covarivate:
## geneID log2Fc z pvalue FDR
## <char> <num> <num> <num> <num>
## 1: XIST -0.3376534 -4.619174 3.986526e-12 5.897348e-11
Previous MAST output:
## geneID log2Fc z pvalue FDR sig
## <char> <num> <num> <num> <num> <lgcl>
## 1: XIST -1.415323 -35.84507 4.653001e-254 1.214899e-249 TRUE
Methods that don’t account for variation that exists between biological replicates can lead to inflated p-values and many false positives, known as the pseudoreplication bias (see here and here).
Treating each cell as an independent replicate is likely an incorrect assumption as cells from the same sample often correlate better than cells from different samples. ]
]
To help with pseudoreplication bias we can employ either a pseudobulk strategy or a mixed model with a random effect for each individual/sample
Various papers (for example, here and here) have compared these methods and it’s debatable which performs better and likely depends on the context.
Currently, the much lighter weight pseudobulk method is more commonly used and we will focus on this method.
Pseudobulk analysis can leverage the well-established tools for bulk RNAseq, such as DESeq2 and edgeR. We will show you how to use DESeq2 here.
First, the AggregateExpression function from Seurat is used to get the sum of counts for each gene across all cells grouped by metadata columns.
Importantly, DESeq2 expects raw counts that have not been normalized in anyway.
seu_exn_aggr <- Seurat::AggregateExpression(seu_exn, return.seurat = T, group.by = c("sample_id",
"condition"))
# get the raw un-normalized counts
counts_aggr <- as.matrix(GetAssayData(seu_exn_aggr, assay = "RNA", layer = "counts"))
head(counts_aggr, 3)
## AD2b_AD AD4_AD C1_CON C3_CON
## AL627309.1 25 10 77 23
## AL627309.5 55 89 48 59
## AP006222.2 0 0 7 0
Set up DESeq2 object with the aggregated counts and a data frame with column metadata. We now have reduced the analysis to two groups with only two samples each.
library(DESeq2)
column_meta <- data.frame(row.names = colnames(counts_aggr), condition = ifelse(grepl("AD",
colnames(counts_aggr)), "AD", "CON"))
column_meta$condition <- factor(column_meta$condition, levels = c("CON", "AD"))
dds_pseudo_exn <- DESeqDataSetFromMatrix(countData = counts_aggr, colData = column_meta,
design = ~condition)
colData(dds_pseudo_exn)
## DataFrame with 4 rows and 1 column
## condition
## <factor>
## AD2b_AD AD
## AD4_AD AD
## C1_CON CON
## C3_CON CON
While its not completely necessary, it is common to remove genes that are lowly expressed or are not expressed in enough samples.
Here, we keep those genes that 10 reads in at least two samples (the size of the smallest group)
smallestGroupSize <- 2
keep <- rowSums(counts(dds_pseudo_exn) >= 10) >= smallestGroupSize
table(keep)
## keep
## FALSE TRUE
## 5477 20633
The DESeq function estimates size factors and dispersions before fitting the data to a negative binomial generalized linear model to get differential statistics.
Look at the DESeq2 manual and our course on bulk RNAseq to get more information about how DESeq2 normalizes and determines differentially expressed genes.
## [1] "Intercept" "condition_AD_vs_CON"
de_pseudo_exn <- res_pseudo_exn %>%
data.frame %>%
arrange(pvalue) %>%
rownames_to_column(var = "geneID")
head(de_pseudo_exn, 3)
## geneID baseMean log2FoldChange lfcSE stat pvalue
## 1 SPRY4-AS1 803.0868 -1.959599 0.2493818 -7.857826 3.908571e-15
## 2 SNX31 459.1236 3.251960 0.4357934 7.462160 8.511567e-14
## 3 MOXD1 417.5197 -2.581761 0.3714799 -6.949936 3.654531e-12
## padj
## 1 5.875363e-11
## 2 6.397294e-10
## 3 1.460451e-08
ma <- ggplot(de_pseudo_exn, aes(x = baseMean, y = log2FoldChange)) + geom_point() +
theme_bw() + scale_x_log10() + ggtitle("MA plot")
pca <- plotPCA(rlog(dds_pseudo_exn)) + coord_cartesian() + theme_bw() + ggtitle("PCA")
ggarrange(ma, pca, ncol = 2)
It’s clear that we have many fewer differentially expressed genes using a pseudobulk strategy. This is not surprising with only two replicates per condition, but also highlights the potential for a very high false positive rate when each cell assumed to be independent.
de_pseudo_exn$sig = de_pseudo_exn$padj < 0.05
ggplot(de_pseudo_exn %>%
na.omit, aes(x = log2FoldChange, y = -log10(pvalue), color = sig)) + geom_point() +
scale_color_manual(values = c("grey", "red")) + theme_bw()
Look at the genes that go down in AD (up in control) with a violin plot
de_pseudo_exn_df <- na.omit(de_pseudo_exn)
top_down_AD_ps <- head(de_pseudo_exn_df[de_pseudo_exn_df$log2FoldChange < 0, ], 5)
VlnPlot(seu_exn, features = top_down_AD_ps$geneID, stack = T, flip = T, pt.size = 1) +
scale_x_discrete(labels = c("CON", "AD"))
Add sex to metadata of DESeq2 dds object and make PCA plot
sample_ids <- rownames(colData(dds_pseudo_exn))
colData(dds_pseudo_exn)$sex <- ifelse(sample_ids == "AD2b_AD", "female", ifelse(sample_ids ==
"AD4_AD", "male", ifelse(sample_ids == "C1_CON", "male", ifelse(sample_ids ==
"C3_CON", "female", "none!"))))
colData(dds_pseudo_exn)$sex <- as.factor(colData(dds_pseudo_exn)$sex)
plotPCA(rlog(dds_pseudo_exn), intgroup = "sex") + coord_cartesian() + theme_bw() +
ggtitle("PCA")
We notice that we now have no DE genes. This isn’t terribly surprising given that we only have two replicates per condition and the sex is driving a lot of our variance.
This doesn’t mean the genes that were identified before adding sex as a covariate are invalid, you would just approach them with caution. This likely would be an indication that more replicates were necessary to more confidently find DE genes.
dds_pseudo_exn_sex <- dds_pseudo_exn
design(dds_pseudo_exn_sex) <- ~sex + condition
dds_pseudo_exn_sex <- DESeq(dds_pseudo_exn_sex)
res_pseudo_exn_sex <- results(dds_pseudo_exn_sex, name = "condition_AD_vs_CON")
head(res_pseudo_exn_sex, 3)
## log2 fold change (MLE): condition AD vs CON
## Wald test p-value: condition AD vs CON
## DataFrame with 3 rows and 6 columns
## baseMean log2FoldChange lfcSE stat pvalue padj
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## AL627309.1 44.3124 -1.913196 1.333886 -1.434303 0.151486 0.692659
## AL627309.5 61.6567 -0.121863 0.849788 -0.143404 0.885971 0.992942
## LINC01409 589.3636 -0.408151 0.513194 -0.795315 0.426430 0.909770
A less commonly used approach for dealing with the pseudoreplication bias is the use of a mixed model with random effect for each sample.
In our case, including the sample as the random effect will model the variance that would exist in an entire population of samples (or patients) and then we include this in the model to understand if our fixed effect (AD vs control) changes while accounting for this expected population variance.
In fitting a mixed model with a random effect, MAST will often error when there are genes that have many zeroes. MAST includes the filterLowExpressedGenes function which will filter out genes that are expressed below a threshold percentage of cells.
There is a special notation within the formula to add a random effect. Below we include sample_id as a random effect
MAST can take some time to run, especially with random effects, so to move forward download the following file from dropbox: sumDT_exn_re.rds
de_mast_re <- merge(sumDT_re[sumDT_re$component == "H" & sumDT_re$contrast == paste0("Group",
"AD"), ], sumDT_re[sumDT_re$component == "logFC" & sumDT_re$contrast == paste0("Group",
"AD"), ], by = "primerid")
de_mast_re <- dplyr::select(de_mast_re, primerid, coef.y, z.y, `Pr(>Chisq).x`)
names(de_mast_re) <- c("geneID", "log2Fc", "z", "pvalue")
de_mast_re$FDR <- p.adjust(de_mast_re$pvalue, "fdr")
de_mast_re <- de_mast_re[order(de_mast_re$FDR), ]
head(de_mast_re, 3)
## geneID log2Fc z pvalue FDR
## <char> <num> <num> <num> <num>
## 1: HSP90AB1 0.7011930 14.65392 5.486664e-09 6.003508e-05
## 2: AL133166.1 0.6755124 10.12379 1.346825e-08 7.368480e-05
## 3: AC120193.1 0.6986804 10.06780 2.992759e-08 1.091559e-04
The Seurat FindMarkers function has an option to use MAST or DESeq2 with the ‘test.use’ argument, and the Seurat differential expression vignette describes this in their workflows. Shouldn’t we just use this?
p_seuD <- ggplot(seu_deseq, aes(x = avg_log2FC, y = -log10(p_val_adj), color = sig)) +
geom_point() + scale_color_manual(values = c("grey", "red")) + theme_bw() + ggtitle("FindMarkers")
p_pseudo <- ggplot(de_pseudo_exn %>%
na.omit, aes(x = log2FoldChange, y = -log10(pvalue), color = sig)) + geom_point() +
scale_color_manual(values = c("grey", "red")) + theme_bw() + ggtitle("Using DESeq2 directly")
ggarrange(p_seuD, p_pseudo, common.legend = TRUE, legend = "bottom")
Note: Running this will take a while, so load the object on the next slide to see the plot
cdr_exn <- colSums(seu_exn[["RNA"]]$data > 0)
seu_exn$ngeneson <- scale(cdr_exn)
Idents(seu_exn_aggr) <- "condition"
seu_mast <- FindMarkers(object = seu_exn, group.by = "condition", ident.1 = "AD",
ident.2 = "CON", test.use = "MAST", slot = "data", latent.vars = "ngeneson")
seu_mast$sig = seu_mast$p_val_adj < 0.05
To make the following plot, read in the the MAST output from Seurat on dropbox: seu_mast_exn.rds
p_seuM <- ggplot(seu_mast, aes(x = avg_log2FC, y = -log10(p_val_adj), color = sig)) +
geom_point() + scale_color_manual(values = c("grey", "red")) + theme_bw() + ggtitle("FindMarkers")
p_mast <- ggplot(de_mast_exn, aes(x = log2Fc, y = -log10(pvalue), color = sig)) +
geom_point() + scale_color_manual(values = c("grey", "red")) + theme_bw() + ggtitle("Using MAST directly")
ggarrange(p_seuM, p_mast, common.legend = TRUE, legend = "bottom")
Currently (Seurat v5.0) the fold change information from DESeq2 or MAST is not carried over by FindMarkers. When defining gene sets with fold change cutoffs or visualizing gene expression changes, by not using the fold changes that account for normalization done by that algorithm, you can get funky results.
Also, using the packages directly allow for a higher level of fine tuning for more complex parameters (e.g random effect in MAST).
## used (Mb) gc trigger (Mb) max used (Mb)
## Ncells 14134527 754.9 24674155 1317.8 24674155 1317.8
## Vcells 571212872 4358.1 1725144128 13161.9 1796958467 13709.8
Cellular Indexing of Transcriptomes and Epitopes by Sequencing (CITE-seq).
The CITE-Seq method labels different types of cells or different samples with hashtag-labeled antibodies After sequencing, the cells can be separated by the different antibody hashtags.
This means you can differentiate cell populations based on surface markers, or you can differentiate pooled samples.
Here, we will use a dataset from Seurat vignette, which is based on the a pooling approach: link
This data was processed by an alternative workflow called Drop-Seq for the GEX and CITE-seq-Count for the HTOs.
You can also use Cell Ranger count to do this, by providing an additional -feature-ref reference file containing the names and indexes of all of the hashtags. It will then generate counts over GEX and HTOs for you.
We have the data for you as processed matrices, with counts per gene and counts per tag for each cell.
## [1] 27117 16916
## [1] 8 16916
## AGGCCACAGCGTCTAT ATTGGTGAGTTCGCAT GTGCAGCTCAGTCCCT TAGTTGGGTCATACTG
## HTO_A 3 19 3 6
## HTO_B 0 17 13 17
## HTO_C 1 110 48 9
## HTO_D 10 50 8 33
## HTO_E 10 53 0 22
## TTCTTAGAGAAGGCCT
## HTO_A 7
## HTO_B 31
## HTO_C 26
## HTO_D 1769
## HTO_E 11
We first will create our Seurat data with the GEX data. We can then add the HTO count matrix as metadata information.
seu_obj <- CreateSeuratObject(counts = rna.mat, project = "citeSeq_demo")
seu_obj[["HTO"]] <- CreateAssayObject(counts = hto.mat)
seu_obj
## An object of class Seurat
## 27125 features across 16916 samples within 2 assays
## Active assay: RNA (27117 features, 0 variable features)
## 1 layer present: counts
## 1 other assay present: HTO
Most of the workflow is very consistent between with our regular approach. We will normalize, select features and scale the GEX data.
Finally we can cluster our data. All looks good at this point, but remember this data is pooled.
seu_obj <- quick_clust(seu_obj)
DimPlot(seu_obj, group.by = "seurat_clusters", pt.size = 0.2, label = TRUE) + NoLegend()
To be able to seperate our samples we need to demulitplex the hastag information.
This will assign each cell to a specific hash, which will correspond to a specific sample.
## HTO_A HTO_B HTO_C HTO_D HTO_E HTO_F HTO_G HTO_H
## 3 0 1 10 10 12 9 4155
We cannot handle compositional data of this type in the same way as our GEX data.
CLR is a method to process compositional data. It is a log-ratio transformation that centers the data around the geometric mean of each cell barcode.
We can review and check these values in our metadata.
##
## Doublet Negative Singlet
## 2598 346 13972
##
## Doublet HTO-H HTO-D HTO-E HTO-G HTO-F HTO-B HTO-C
## 2598 1808 1716 1487 1660 1520 1993 1873
## HTO-A Negative
## 1915 346
##
## Doublet HTO-H HTO-D HTO-E HTO-G HTO-F HTO-B HTO-C HTO-A Negative
## Doublet 2598 0 0 0 0 0 0 0 0 0
## Negative 0 0 0 0 0 0 0 0 0 346
## Singlet 0 1808 1716 1487 1660 1520 1993 1873 1915 0
We can quickly and easily subset the data base on our HTO demultiplex results.
New Cell Ranger came out last week. We are yet to test it but new features include: