An overview of methods in Single-cell RNA-seq
This article serves as an overview of the current single-cell RNA-seq data analysis landscape. Because of the popularity of two major software packages, Seurat (R) and Scanpy (Python), this document focuses on the methods implemented by these packages.
Introduction
Challenges
PCR Amplification and UMIs
scRNA-seq poses unique challenges compared to bulk RNA-seq. At the sample preparation level, the low number of mRNA transcripts in an individual cell (as opposed to the pooled material of thousands of cells in a bulk sample) necessitates more PCR amplification cycles to obtain sufficient nucleotide material for sequencing [@townes_feature_2019]. As a result, many read counts are duplicates of the original genetic material. In scRNA-seq, it is therefore advantageous to add unique molecular identifiers (UMIs) to each molecule before PCR amplification. This allows for computational de-duplication, as all PCR duplicates share the same UMI and can thus be removed. These de-duplicated read counts are known as UMI counts.
Compositional nature of RNA-seq data
Basic Quality Control for Count Matrices
This section covers quality control at the level of raw count matrices. We assume you are working with the filtered output matrices produced by 10x Genomics Cell Ranger alignment.
By basic quality control, we mean measures such as mitochondrial content, ribosomal content, number of reads per cell, number of genes per cell, etc. More advanced QC such as doublet detection and ambient RNA removal are considered in the next section.
We will describe each of these features, how to compute them in Scanpy, and how they are used to filter cells.
Common QC Metrics
Mitochondrial Fraction
Filtering by reads mapping to mitochondrial genes is so common that it deserves its own little section. A high proportion of reads coming from mitochondrial DNA is often taken as a sign of low-quality or dying cells.
This metric is calculated on the raw/UMI counts on a per-cell basis, and is defined as:
\[ \text{Percent counts mt} = \frac{\text{Reads mapping to mtDNA}}{\text{Total reads in cell}} \times 100 \]
Expectations also depend on whether you are doing single-cell or single-nucleus sequencing: in the latter, you would expect almost no mitochondrial reads, as mitochondria reside in the cytoplasm rather than the nucleus.
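For concreteness, here is a minimal sketch of this calculation on a raw count matrix. The toy matrix, the is_mt mask, and all variable names are made up for illustration and are not part of any package.

```python
import numpy as np

# Toy UMI count matrix (cells x genes) and a hypothetical mask of mitochondrial genes
rng = np.random.default_rng(0)
counts = rng.poisson(1.0, size=(100, 500))
is_mt = np.zeros(500, dtype=bool)
is_mt[:13] = True  # pretend the first 13 genes are mitochondrial

total_counts = counts.sum(axis=1)               # total reads (UMIs) per cell
mt_counts = counts[:, is_mt].sum(axis=1)        # reads mapping to mtDNA per cell
pct_counts_mt = 100 * mt_counts / total_counts  # percent mitochondrial counts per cell
```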
Other common metrics
Aside from reads mapping to mtDNA, ribosomal reads are also used for QC in some contexts, albeit less often. There is also a set of other common QC metrics calculated at the gene or cell level, and it is important to distinguish the two.
The following metrics are described in Scanpy’s documentation as well as in the scater paper [@mccarthy_scater_2017].
In Scanpy
In scanpy, the sc.pp.calculate_qc_metrics() function calculates the following gene-level QC metrics:
- n_cells_by_counts: number of cells that have at least one read mapping to this gene
- mean_counts: the mean counts of this gene across cells
- log1p_mean_counts: shifted logarithm of the mean counts
- pct_dropout_by_counts: percentage of cells that don't have a read mapping to this gene
- total_counts: total counts mapping to this gene
- log1p_total_counts: shifted logarithm of the total counts mapping to this gene
and cell-level QC metrics:
- n_genes_by_counts: number of genes with at least one read in this cell
- log1p_n_genes_by_counts: shifted log of n_genes_by_counts
- total_counts: total reads mapping to this cell
- log1p_total_counts: shifted log of total_counts
- pct_counts_in_top_X_genes: X is a number of genes, so this is the percentage of counts in this cell mapping to the top X genes. A high value may indicate low library complexity (poor capture of RNA molecules) or other technical artifacts.
- total_counts_mt: total reads mapping to mtDNA in this cell
- log1p_total_counts_mt: shifted log of total_counts_mt
- pct_counts_mt: percentage of reads mapping to mtDNA in this cell
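A typical way to compute these metrics in Scanpy looks like the following. The input file path is a placeholder, and the "MT-" prefix assumes human gene symbols.

```python
import scanpy as sc

adata = sc.read_10x_h5("filtered_feature_bc_matrix.h5")  # placeholder path to Cell Ranger output
adata.var["mt"] = adata.var_names.str.startswith("MT-")  # flag mitochondrial genes (human symbols)

# qc_vars=["mt"] adds total_counts_mt / pct_counts_mt to .obs;
# percent_top controls which pct_counts_in_top_X_genes columns are computed
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=[20], log1p=True, inplace=True)

adata.obs[["total_counts", "n_genes_by_counts", "pct_counts_mt"]].head()
```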
Filtering
After computing metrics from the raw count matrix, how do you use them to filter out low-quality cells? There are two proposed approaches: threshold filtering and adaptive filtering, each with its own strengths and weaknesses.
Threshold Filtering
Threshold filtering uses hard cutoffs to filter out potential low-quality cells. For example, you may choose to keep only cells below a cutoff of, say, ≤ 20% mitochondrial content.
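A minimal sketch of threshold filtering in Scanpy, assuming the QC metrics above have already been computed; the cutoffs are illustrative, not recommendations.

```python
# Keep cells with <= 20% mitochondrial counts and at least 200 detected genes (illustrative cutoffs)
adata = adata[adata.obs["pct_counts_mt"] <= 20].copy()
adata = adata[adata.obs["n_genes_by_counts"] >= 200].copy()
```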
Adaptive Filtering
Adaptive filtering uses the distribution of the metrics to determine outlier cells. This method assumes that most cells are of good quality, and so outliers can be identified from the distribution of cell qualities. One metric to determine if cells are outliers is the Median Absolute Deviation, defined as:
\[ \text{MAD} = \text{median}(|x_i - \text{median}(X)|) \]
It is a measure of statistical dispersion like standard deviation, but is used for filtering single cells because it is a robust statistic, meaning it is less sensitive to outliers. Since our assumption is that most cells are of good quality, this means that bad quality cells are likely outliers and we don’t want them to affect our measure of dispersion.
The decision rule is as follows:
\[ \text{Outlier}(x_i) = \begin{cases} \text{Yes}, & \text{if } |x_i - \text{median}(X)| > n \cdot \mathrm{MAD} \\ \text{No}, & \text{otherwise} \end{cases} \]
where n is some value analogous to how many standard deviations away constitute an outlier. Often, this value is something between 3 (recommended in the OSCA book) and 5 (recommended in the sc-best-practices book).
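Below is a minimal sketch of this decision rule, assuming the QC metrics are already in adata.obs; the choice of metrics and of the n values here is illustrative.

```python
import numpy as np

def is_outlier(values: np.ndarray, n: float = 5.0) -> np.ndarray:
    """Flag values more than n MADs away from the median."""
    med = np.median(values)
    mad = np.median(np.abs(values - med))
    return np.abs(values - med) > n * mad

# Flag cells that are outliers in any of the chosen metrics, then drop them
outlier = (
    is_outlier(adata.obs["log1p_total_counts"].to_numpy())
    | is_outlier(adata.obs["log1p_n_genes_by_counts"].to_numpy())
    | is_outlier(adata.obs["pct_counts_mt"].to_numpy(), n=3.0)
)
adata = adata[~outlier].copy()
```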
In studies with many samples and projects, this adaptive filtering approach is increasingly preferable, since it avoids hand-tuning thresholds for each sample.
Basic QC plots
In many instances, it helps to visualize the relationship between QC metrics to see if the filtering you choose is suitable for your dataset.
Violin plots of common metrics
These plots show the distribution of each metric across all cells.
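In Scanpy, these can be drawn with sc.pl.violin, and a scatter plot is useful for comparing metrics against each other; this assumes the metrics computed earlier are present in adata.obs.

```python
# Per-cell distributions of common QC metrics
sc.pl.violin(
    adata,
    keys=["n_genes_by_counts", "total_counts", "pct_counts_mt"],
    jitter=0.4,
    multi_panel=True,
)

# Relationship between total counts and detected genes, colored by mitochondrial fraction
sc.pl.scatter(adata, x="total_counts", y="n_genes_by_counts", color="pct_counts_mt")
```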
Advanced QC for Count Matrices
Advanced QC metrics are not always necessary, and depend largely on your scientific question. Nonetheless, they are important and are briefly covered here.
Doublet Detection
Doublets are an especially big problem for droplet-based single-cell technologies such as the popular 10x Genomics platform. A doublet occurs when more than one cell is captured in a single droplet, so the resulting library contains more genetic material than one would expect from an individual cell.
Normalization
In Seurat, normalization consists of scaling by some factor followed by log-transformation (recall that normalization encompasses scaling and transformation).
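The Scanpy equivalent of this default workflow is normalize_total followed by log1p; the target_sum of 10,000 mirrors Seurat's default scale factor, and keeping a copy of the raw counts in a layer is a common convention rather than a requirement.

```python
adata.layers["counts"] = adata.X.copy()       # keep the raw UMI counts for later use
sc.pp.normalize_total(adata, target_sum=1e4)  # scale each cell to 10,000 total counts
sc.pp.log1p(adata)                            # shifted logarithm: log(x + 1)
```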
Feature Selection
Preamble
Feature selection is the process of selecting a subset of features that are "informative" and/or highly varying. In scRNA-seq, it is used to reduce the number of genes (up to roughly 30,000 can be measured per cell) that define each droplet, thus acting as a form of dimensionality reduction. Ideally, this process removes non-informative genes that either have mostly zero counts or vary minimally across cells. For experiments that sequence single cells of a given tissue (the most common design), an underlying assumption is that only a small fraction of genes are informative and biologically varying [@townes_feature_2019], because most gene expression differences are expected to occur across tissues.
Feature selection is a very important step, as it helps to identify the variable features that are then used to generate principal components, inform clustering, and therefore cell typing. The Theis lab has benchmarked the impact of feature selection on downstream analyses in this Nature Methods article.
In this section, we will go through several popular methods. As with many things in scRNA analyses, there are many established methods. For a tabular summary, see Table 1.
Mean Variance Plot
Algorithm in Seurat
This is a relatively simple method, published in the original 2015 Seurat paper. The mean and dispersion1 of each feature are computed. Features are then binned by average expression, and a z-score for dispersion is calculated within each bin. This identifies genes whose dispersion is high relative to other genes of similar mean expression. Evidently, the number of bins is a crucial parameter to select and document.
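The following is a rough numpy/pandas sketch of the binning-and-z-scoring idea on a toy matrix, not Seurat's or Scanpy's exact implementation; the expr matrix, the number of bins, and the cutoff of 2000 genes are all illustrative.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
expr = rng.gamma(shape=2.0, scale=1.0, size=(500, 3000))  # toy normalized expression, cells x genes

mean = expr.mean(axis=0)
dispersion = expr.var(axis=0) / mean                      # variance-to-mean ratio per gene

df = pd.DataFrame({"dispersion": dispersion})
df["bin"] = pd.cut(mean, bins=20)                         # bin genes by average expression
grouped = df.groupby("bin", observed=True)["dispersion"]

# z-score dispersion within each bin: each gene is compared to genes of similar abundance
df["disp_z"] = (df["dispersion"] - grouped.transform("mean")) / grouped.transform("std")
hvg_idx = df["disp_z"].nlargest(2000).index               # top genes by within-bin dispersion z-score
```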
Implementation
Mean Variance Plot is implemented in both Seurat and Scanpy:
- Seurat: FindVariableFeatures(selection.method = "mvp")
- Scanpy: scanpy.pp.highly_variable_genes(batch_key=None, flavor="seurat")
Dispersion (Seurat)
Background
NOTE: This method is obsolete, but is included for completeness. It has no associated publication; see the code under the DISP function for implementation reference.
Algorithm in Seurat
This process selects the top n features ranked by dispersion (σ^2/μ). This is related to, but not the same as, the coefficient of variation (σ/μ), which uses the standard deviation (σ = √σ^2) rather than the variance.
Implementation
Dispersion is implemented in Seurat in FindVariableFeatures(selection.method = "disp").
Variance Stabilizing Transformation
Background
Variance Stabilizing Transformation (VST) is an applied statistics technique that involves transforming data for visualization (exploratory data analysis) and/or making the data amenable to regression modeling. VST is a class of methods rather than a single method; for example, a log-transform is a VST under some conditions. In bioinformatics, it is widely applied to RNA-sequencing data due to its discrete nature and heteroskedasticity. Its use predates single-cell RNA-seq: it appears in popular R packages such as DESeq2 for normalizing and visualizing bulk RNA-sequencing data.
Why is VST needed?
VST is used to model a mean-variance relationship (e.g. σ^2 = μ for Poisson and σ^2 = μ + αμ^2 for negative binomial), as commonly seen in RNA-seq count data where larger means come with larger variances. This matters for feature selection via highly variable genes (genes with high variance), because variances are not directly comparable when the variance is not constant as a function of the mean. Without a VST, highly expressed genes will tend to be selected because, by heteroskedasticity, they have higher variance even if they are not biologically variable. A VST is intended to reduce the dependence of the variance on the mean.
Seurat’s version of VST
It is important to note that in the Seurat workflow, VST for feature selection is independent of normalization (e.g. log-transform and scaling by some factor) because log-transformation fails to adequately account for the mean-variance relationship (especially when that relationship is more complex).
As described in the Cell paper, the mean-variance relationship is learned from the data. The un-normalized counts/UMIs are used to compute the mean and variance of each gene, which are then log10-transformed. A non-linear curve is then fitted using local polynomial regression of degree 2 (loess), which serves as a regularized estimator of the variance given the mean of a feature. This estimated function can then be used to standardize feature counts without removing higher-than-expected variation; that is, a z-score transformation with a mean-dependent expected standard deviation:
\[ z_{ij} = \frac{x_{ij} - \bar{x}_i}{\sigma_i} \]
where z_ij is the standardized value of feature i in cell j, x_ij is the raw value of feature i in cell j, x̄_i is the mean raw value of feature i, and σ_i is the expected standard deviation of feature i derived from the global mean-variance fit.
And enforcing clipping for technical outliers:
\[ z_{ij} \le \sqrt{N} \]
where N is the total number of cells.
The resulting variance of these standardized counts represents mean-adjusted variances and can be ranked to select highly variable genes (Seurat default returns the top 2000).
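Below is a condensed sketch of this procedure on toy counts. It is not Seurat's implementation (Seurat fits a degree-2 loess, whereas this sketch substitutes statsmodels' lowess smoother), and the toy matrix, smoothing fraction, and top-2000 cutoff are illustrative.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
counts = rng.negative_binomial(2, 0.3, size=(500, 3000)).astype(float)  # toy UMI counts, cells x genes

gene_mean = counts.mean(axis=0)
gene_var = counts.var(axis=0)
keep = (gene_mean > 0) & (gene_var > 0)
log_mean, log_var = np.log10(gene_mean[keep]), np.log10(gene_var[keep])

# Fit the mean-variance trend in log10 space; fitted values give an expected variance per gene
fitted_log_var = lowess(log_var, log_mean, frac=0.3, return_sorted=False)
expected_sd = np.sqrt(10 ** fitted_log_var)

# Standardize raw counts with the observed gene mean and the *expected* sd, clipping at sqrt(N)
n_cells = counts.shape[0]
z = (counts[:, keep] - gene_mean[keep]) / expected_sd
z = np.minimum(z, np.sqrt(n_cells))

# Rank genes by the variance of their standardized values and keep the top 2000
hvg_order = np.argsort(-z.var(axis=0))[:2000]
```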
Availability and Implementation
VST is implemented in both Seurat and Scanpy:
- Seurat: FindVariableFeatures(selection.method = "vst")
- Scanpy: scanpy.pp.highly_variable_genes(batch_key=None, flavor="seurat_v3" or "seurat_v3_paper")
Comments
This is currently the most common method of finding highly variable genes. In Seurat, a newer alternative is scTransform. In Scanpy, VST remains the most common method, as scTransform is not implemented in the package.
Pearson Residuals: scTransform (Seurat)
TL;DR
The method of Pearson residuals fits a negative binomial generalized linear model using sequencing depth as a covariate and takes the Pearson residuals as the normalized UMI counts. The residuals are essentially the variance left unexplained by the model, which in this case only captures sequencing depth; thus, the residuals represent the "biological variability" of interest.
Background
scTransform does not strictly fall under a feature selection method. Rather, it is a normalization method that happens to also help with feature selection. That is, it is largely intended as an alternative to the scaling (e.g. by some library size) and transformation (e.g. by log transform) approach of normalization. The most common is the Shifted Logarithm:
\[ \log\left(\frac{n_{ij}}{S} + 1\right) \]
where n_ij is the raw count of gene i in cell j and S is some scaling factor.
S is often a global constant such as S = 10,000, but is sometimes an estimated cell-specific factor (e.g. in scran). If S = 1,000,000, you get Counts Per Million (CPM). The +1 is called a pseudocount and deals with zero counts, as you cannot take log(0).
The authors argue that the strengths of this method are:
- Avoids heuristics of common normalization methods such as pseudocount addition and log-transformation.
- Does not assume a fixed library size for any cell.
- Learns the mean-variance relationship from the data.
- Data-driven VST.
In this section, we will focus on the feature selection component of scTransform.
Algorithm of V1 in Seurat
NOTE: This section is for V1, which is described in the original paper. There is an updated version V2.
The crux of why scTransform can be used to select HVGs is that it both normalizes the data and performs a variance stabilizing transformation (VST). Loosely, scTransform fits a generalized linear model, specifically a negative binomial regression model to the UMI data. It models the UMI abundance using sequencing depth as a predictor.
The GLM is formalized using a log link function:
\[ \log(\mathbb{E}(x_i)) = \beta_0 + \beta_1 \log_{10}(m) \]
where x_i is the vector of UMI counts assigned to gene i and m is the vector of molecules assigned (sequencing depth) to the cells, j, so m_j = Σ_i x_{ij}.
Negative binomial regression does not have an error term, unlike simple linear regression. We encode the model probabilistically:
\[ x_i \mid m \sim \mathrm{NB}(\mu, \theta), \qquad \mu = \exp(\beta_0 + \beta_1 \log_{10}(m)) \]
where θ is the overdispersion parameter.
As the paper showed that modeling each gene individually leads to overfitting especially for low-abundance genes due to high variance (an overestimation of true variance), they regularize all model parameters by sharing information across genes. Regularization is conducted in 3 steps:
- Fit a NB regression model on each gene, estimating the parameters.
- Use kernel regression to learn the global trend of each parameter estimate (β_0, β_1, θ) as a function of the gene's mean expression, and regularize on a per-parameter basis (i.e. share information across all genes).
- Using the regularized parameters, compute the Pearson residuals.
Quickly recall what a “normal” residual is: it is the difference between the actual and predicted value:
\[ z_{ij} = x_{ij} - \mu_{ij} \]
Pearson residuals are similar to “normal” residuals from a linear regression, but are scaled by the standard deviation to account for heteroskedasticity:
\[ z_{ij} = \frac{x_{ij} - \mu_{ij}}{\sigma_{ij}} \]
where
\[ \sigma_{ij} = \sqrt{\mu_{ij} + \frac{\mu_{ij}^2}{\theta_i}} \]
In scTransform, these z_{ij}, the Pearson residuals, are the normalized UMI counts, as they are the “variance unexplained by technical factors such as library size”.
To reduce the impact of outliers similar to in VST above, the authors enforce:
\[ z_{ij} \le \sqrt{N} \]
where N is the total number of cells.
Feature Selection using scTransform
The Pearson residuals transformation doubles as a variance stabilizing transformation, which is what allows scTransform to simultaneously normalize and select features.
With the variance stabilized, the Pearson residuals (i.e. the normalized UMI counts) can be used directly to find HVGs: for each gene, compute the variance of its Pearson residuals z_i, rank genes by this variance, and select the top n = 2000. This gives you a list of HVGs.
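A minimal numpy sketch of the idea follows. For simplicity it uses a depth-only expectation and a single fixed θ for all genes (close in spirit to the "analytic Pearson residuals" simplification), whereas scTransform itself fits and regularizes per-gene parameters; the toy matrix, θ = 100, and the top-2000 cutoff are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
counts = rng.negative_binomial(2, 0.3, size=(500, 3000)).astype(float)  # toy UMI counts, cells x genes
theta = 100.0                                                           # assumed fixed overdispersion

depth = counts.sum(axis=1, keepdims=True)        # sequencing depth per cell
gene_total = counts.sum(axis=0, keepdims=True)
mu = depth @ gene_total / counts.sum()           # expected counts under a depth-only model

z = (counts - mu) / np.sqrt(mu + mu ** 2 / theta)                    # NB Pearson residuals
z = np.clip(z, -np.sqrt(counts.shape[0]), np.sqrt(counts.shape[0]))  # clip outliers at sqrt(N)

residual_var = z.var(axis=0)                     # per-gene variance of the residuals
hvg_idx = np.argsort(-residual_var)[:2000]       # rank and keep the top 2000 genes
```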
scTransform V2
scTransform has been updated to v2. See the docs.
Implementation
scTransform is implemented in Seurat:
- Seurat: SCTransform(object, vst.flavor = "v2")
Comments
scTransform has been argued to be a poorly specified model. Supporting literature may be this and this paper(s), but I am personally not up to date on this argument. The latter paper motivated the update to scTransform V2.
If you use scTransform, use the updated version: vst.flavor = "v2".
If you use scTransform, it replaces 3 normalization functions:
- Log normalization (log(x/10000 + 1)): NormalizeData()
- Scale and center data (mean = 0, sd = 1): ScaleData()
- Select HVGs ("mvp", "vst", etc.): FindVariableFeatures()
The vars.to.regress option applies a second round of NB regression on top of the first, regularized model that uses sequencing depth.
First: \[ \log(\mathbb{E}(x_i)) = \beta_0 + \beta_1 \log_{10}(m) \]
Second: \[ \log(\mathbb{E}(x_i)) = \beta_{0\mathrm{reg}} + \beta_{1\mathrm{reg}} \log_{10}(m) + \beta_2(p) \]
where p is the additional variable to regress out (e.g. percent mitochondrial counts), the "reg" subscripts indicate fixed, regularized parameters from the first set of regressions, and β_2 is freely estimated.
Deviance
Background
As with Pearson residual methods, a central concern is the effect of the heuristic steps used in common normalization (pseudocount addition, log-transformation), as the choice of pseudocount is arbitrary and can subtly bias the transformed data [@townes_feature_2019]. Townes et al., 2019 argue that the highly variable genes approach is susceptible to this arbitrary pseudocount addition, as very small pseudocount values will inflate the variance of genes with mostly zero counts.
Therefore, in this section, we will cover the Deviance method proposed by Townes et al., 2019, which covers both feature selection and dimensionality reduction.
What is Deviance?
Deviance is a goodness-of-fit statistic for a statistical model. It generalizes the sum of squared residuals in ordinary least squares to cases where the model is fit using maximum likelihood estimation. It is often used for generalized linear models.
Deviance in scRNA-seq
Deviance is an alternative that, supposedly, is more robust to outliers.
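As a sketch of how this is used for feature selection, one closed form described by Townes et al. is the per-gene binomial deviance (a close approximation to the full multinomial model); the notation below follows the rest of this document (feature i, cell j) and is a summary, not a substitute for the paper:

\[ D_i = 2 \sum_j \left[ y_{ij} \log\frac{y_{ij}}{n_j \hat{\pi}_i} + (n_j - y_{ij}) \log\frac{n_j - y_{ij}}{n_j (1 - \hat{\pi}_i)} \right], \qquad \hat{\pi}_i = \frac{\sum_j y_{ij}}{\sum_j n_j} \]

where y_ij is the count of gene i in cell j and n_j is the total count of cell j; terms with y_ij = 0 or n_j − y_ij = 0 contribute zero. Genes with the largest deviance deviate most from a constant-expression null model and are retained as informative features.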
Table 1: Common feature selection methods in scRNA-seq.
| Approach | Software | Brief Description | Ref. |
|---|---|---|---|
| VST | S, SCP | Fit curve to Mean-variance | Cell paper |
| Mean Variance Plot | S, SCP | Rank within-bin dispersion | Seurat paper |
| Dispersion | S, SCP | Rank absolute dispersion | Code |
| Pearson Residuals | S | Use residuals from NB glm | V1, critique, V2 |
| Deviance | SCP | Multinomial Model | Paper1, Paper2 |
S = Seurat, SCP = Scanpy.
Dimension Reduction
This section covers a second round of dimensionality reduction. We first reduced dimensions by identifying HVGs; now we proceed with Principal Component Analysis.
Principal Component Analysis (PCA)
PCA is a linear dimensionality reduction method. Given a covariance matrix, it finds a set of orthonormal basis vectors, termed eigenvectors, and their associated eigenvalues. The eigenvectors are the “new” axes, composed of linear combinations of the original features, and the eigenvalues correspond to the variance explained by the corresponding eigenvector.
In single-cell analysis, we want to compute these “new” axes, which are essentially meta-features designed to maximize variance (of the original dataset) explained, and use them as the new dimensions, often much smaller than the original features.
How do we choose how many of these "new" axes to use? PCA yields at most min(n − 1, p) principal components, where n is the number of cells and p is the number of features. If we used all of the computed PCs, we would essentially not have reduced the dimensionality at all.
By convention, most people use the first 50 PCs, but there are ways to check if you need to use more. In Scanpy, you use the Variance Ratio:
\[ \frac{\text{Variance Explained by PC}}{\text{Total Variance Explained by All PCs}} \]
where
\[ \text{Total Variance Explained by All PCs} = \sum_{i=1}^{p} \text{Variance Explained by PC}_i \]
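In Scanpy this typically looks like the following; the choice of 50 components is just the conventional default mentioned above.

```python
# Run PCA (typically on the HVG-subset, scaled data) and inspect per-PC variance ratios
sc.pp.pca(adata, n_comps=50)
sc.pl.pca_variance_ratio(adata, n_pcs=50, log=True)

# The same ratios are stored by Scanpy and can be inspected directly
adata.uns["pca"]["variance_ratio"][:10]
```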
Neighborhood Graph
Clustering
Batch Correction
Here, we discuss the dreaded issue in single-cell analysis: batch effects. These are technical variations that do not reflect biological differences. The space in which these technical variations can arise is vast; commonly described batch effects include different sequencing machines, runs, technicians, institutions, etc.
Thus, we need to adjust for these differences before running any integrated analyses.
Diagnosing the quality of integration
Cell Type Identification
Differential Gene Expression Analysis
Footnotes
Dispersion here is NOT the same as the variance. It refers to the variance-to-mean ratio σ^2/μ, and is mostly used for count data.↩︎
Comments
I have not been able to find why VST is recommended over this method. There is a GitHub issue discussing this, but no clear theoretical conclusion to my understanding.