Selecting highly variable genes (HVGs) is a critical preprocessing step that profoundly impacts the performance and biological relevance of single-cell foundation models (scFMs). This article provides a comprehensive guide for researchers and drug development professionals, covering foundational concepts, methodological implementation, optimization strategies, and validation approaches for HVG selection in scFM training. Drawing on recent benchmarks and emerging methodologies, we explore how informed HVG selection enhances data integration, improves cell type annotation, and boosts model robustness for downstream clinical and biomedical applications.
What are Highly Variable Genes (HVGs) and why are they important for single-cell analysis?
Highly Variable Genes (HVGs) are genes whose expression levels show significant variation across individual cells within a homogeneous cell population. Unlike bulk RNA sequencing which analyzes averaged expression from mixed cells, single-cell RNA sequencing (scRNA-seq) can detect these cell-to-cell differences. HVGs are crucial because they are presumed to contribute strongly to cellular heterogeneity and often reflect underlying biological processes, cellular states, and key transcriptional drivers of cell identity and function. Selecting HVGs is a critical feature selection step that reduces data dimensionality, enhances computational efficiency, and improves the interpretability of downstream analyses like clustering and trajectory inference [1] [2] [3].
Why is HVG selection critical for training single-cell Foundation Models (scFMs)?
HVG selection is a fundamental preprocessing step for scFM training because it directly addresses the high dimensionality, sparsity, and noise characteristic of scRNA-seq data. By focusing on the most informative features, HVG selection reduces dimensionality, filters out technical noise, improves computational efficiency, and sharpens the biological signal available for the model to learn.
My scFM isn't performing well on downstream tasks. Could my HVG selection be the issue?
Yes, the choice of HVG method and the number of genes selected can significantly impact scFM performance. If your model is struggling, vary the number of selected genes and benchmark alternative selection methods: if one method (e.g., scran) underperforms, try another (e.g., Seurat's VST or the novel GLP method) [1] [6] [3].

How do I choose the right HVG method for my scFM project?
There is no single "best" method that outperforms all others in every scenario. Your choice should be guided by your data characteristics and project goals. The table below summarizes key methods:
Table 1: Comparison of Highly Variable Gene (HVG) Detection Methods
| Method | Underlying Model / Approach | Key Features | Considerations |
|---|---|---|---|
| Brennecke et al. | Fits a generalized linear model to the relationship between squared coefficient of variation (CV²) and mean expression [1]. | Uses DESeq's normalization; filters genes with high uncertainty. | A foundational method; may be superseded by more modern approaches. |
| scran | Fits a trend to the mean-variance relationship of log-transformed expression values using LOESS [1] [2]. | Uses a specialized pooling algorithm for normalization; decomposes variance into technical and biological components. | Robust; considered a strong performer in benchmarks. |
| Seurat (VST) | Uses a polynomial regression model to find a variance-stabilizing transformation of the mean-variance relationship [1] [6]. | Places genes into bins based on expression mean to calculate z-scores; widely used and integrated in Seurat workflows. | A common and effective default choice. |
| BASiCS | Employs a Bayesian hierarchical model to decompose variation into technical and biological components [1]. | Can use spike-in RNAs to model technical noise; can also identify lowly variable genes. | Computationally intensive; powerful for sophisticated noise modeling. |
| GLP | Uses optimized LOESS regression on the relationship between gene average expression and "positive ratio" (fraction of cells expressing the gene) [3]. | Designed to be robust to high sparsity and dropout noise in scRNA-seq data; reported to outperform other methods in some benchmarks. | A recently developed method; promising for handling noisy data. |
A practical workflow is to start with a well-established method like scran or Seurat's VST, and if downstream analysis is unsatisfactory, benchmark against alternative methods like GLP [1] [3].
The following protocol outlines a standard computational workflow for identifying HVGs, which can be applied prior to scFM training.
Inputs: A quality-controlled and normalized single-cell RNA-seq count matrix (cells x genes).
Procedure:
1. Fit the mean-variance trend: the modelGeneVar() function (e.g., in the scran package) fits a trend to the per-gene variance with respect to abundance. It then decomposes the total variance for each gene into a technical component (the fitted value) and a biological component (the residual from the trend) [2].
2. Model technical noise with spike-ins (optional): if spike-in controls were included, modelGeneVarWithSpikes() can provide a more precise estimate of technical noise by fitting a trend to the spike-in variances [2].

The following diagram illustrates the logical workflow and the key decision points.
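The decomposition step can be sketched in a few lines of Python. This is an illustrative approximation only: a simple polynomial trend stands in for scran's LOESS fit, and the cell/gene counts and simulated data are made up.

```python
import numpy as np

def decompose_variance(counts, degree=2):
    """Toy analogue of scran's modelGeneVar(): fit a trend to per-gene
    variance vs. mean, then split total variance into a technical part
    (the fitted value) and a biological part (the residual)."""
    log_expr = np.log1p(counts)                # cells x genes, log-transformed
    means = log_expr.mean(axis=0)
    variances = log_expr.var(axis=0)
    # A polynomial fit stands in for the LOESS mean-variance trend.
    coeffs = np.polyfit(means, variances, degree)
    technical = np.polyval(coeffs, means)      # fitted value = technical component
    biological = variances - technical         # residual = biological component
    return technical, biological

# Simulated counts: 200 cells x 500 genes with gene-specific rates.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=500), size=(200, 500))
tech, bio = decompose_variance(counts)
hvg_idx = np.argsort(bio)[::-1][:50]           # top 50 genes by biological variance
print(len(hvg_idx))                            # → 50
```

Genes are then ranked by the biological component, and the top genes are retained as HVGs.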
After training an scFM, it is critical to validate that the model has captured meaningful biological patterns and not just technical artifacts.
Inputs: A trained scFM, a held-out test scRNA-seq dataset with high-quality cell type annotations.
Procedure:
Table 2: Key Research Reagent Solutions for scRNA-seq and Validation
| Reagent / Tool | Function | Application Context |
|---|---|---|
| ERCC Spike-in RNAs | Exogenous RNA controls used to precisely model technical noise and improve the accuracy of HVG detection [2]. | scRNA-seq library preparation and normalization. |
| UMI Barcodes | Unique Molecular Identifiers are short random sequences that label individual mRNA molecules, allowing for accurate quantification by correcting for PCR amplification biases [9]. | scRNA-seq library preparation (e.g., in 10x Genomics, Drop-seq). |
| siRNAs / shRNAs | Small interfering RNAs or short hairpin RNAs used for transient gene knockdown to functionally validate the role of a target HVG [8]. | Functional validation in vitro (e.g., in HUVECs). |
| CRISPR-Cas9 System | A gene-editing tool used to create stable gene knockouts, providing definitive evidence for a gene's function [7] [8]. | Functional validation in vitro and in vivo. |
| FACS Antibodies | Fluorescently-labeled antibodies against cell surface or intracellular proteins for isolating specific cell populations via flow cytometry [7]. | Target population isolation and validation. |
| RNA FISH Probes | Fluorescently labeled nucleic acid probes that bind to specific RNA sequences, enabling visualization of gene expression and spatial localization in tissues [7]. | Spatial validation of HVG expression. |
FAQ 1: Why is Highly Variable Gene (HVG) selection a critical step in single-cell RNA-seq analysis? HVG selection is the process of identifying genes that exhibit significant cell-to-cell variation in expression within a seemingly homogeneous cell population. This step is crucial because it focuses downstream analyses on the genes most likely to be informative of biological heterogeneity, such as different cell types or states. Using HVGs improves computational efficiency, prevents overfitting, and enhances the performance of clustering algorithms by reducing the data dimensionality from tens of thousands of genes to a manageable set of features that capture key biological signals [2] [10]. Neglecting this step can obscure meaningful biological insights, as clustering and dimensionality reduction are highly sensitive to the choice of input genes [2].
FAQ 2: My single-cell analysis failed to identify a known rare cell population. Could HVG selection be the cause? Yes, this is a common challenge. While for abundant and well-separated cell types, even large random gene sets can perform adequately, the identification of rare or subtly different cell types is highly sensitive to the HVG selection method [10]. For instance, in a study focusing on CD4+ T cells, using the standard HVG method successfully identified a FOXP3+ T regulatory (Treg) population (~1.8% of cells), whereas using an equal number of randomly selected genes completely failed to reveal this population, even when the entire transcriptome was used [10]. This demonstrates that for subtle biological differences, a thoughtful choice of HVG method is essential.
FAQ 3: I see inconsistent results every time I re-run my HVG analysis on a subset of my data. How can I improve reproducibility? Low reproducibility in HVG selection is a recognized issue that can significantly impact downstream analyses like cell classification. A benchmarking study on hematopoietic cells revealed that the reproducibility of HVG methods—measured as the proportion of overlapping genes identified across multiple tests—varies considerably [11]. Methods like SCHS showed high reproducibility (>90%), while others, including some popular Seurat methods, showed lower reproducibility (50-70%) [11]. To overcome this, consider using a robust strategy like SIEVE (SIngle-cEll Variable gEnes), which employs multiple rounds of random sampling to identify a stable, high-confidence set of HVGs, thereby minimizing stochastic noise and improving the consistency of your results [11].
FAQ 4: How many Highly Variable Genes should I select for my analysis? The optimal number is not fixed and can depend on the complexity of your dataset and the biological question. However, using too many features can be as detrimental as using too few. Evidence suggests that for standard tasks like clustering peripheral blood mononuclear cells (PBMCs), performance plateaus after selecting a few hundred to a few thousand genes [10]. For example, in one PBMC dataset, clustering metrics reached a high level with around 725 selected genes [10]. It is recommended to avoid automatically selecting the maximum number of HVGs, as this can introduce noise. Start with a standard number (e.g., 2,000-3,000) and perform sensitivity checks to ensure your key findings are robust.
FAQ 5: How does HVG selection specifically impact the training of single-cell foundation models (scFMs)? Single-cell foundation models are pre-trained on massive single-cell datasets to learn universal biological knowledge. The choice of input genes fundamentally shapes what the model learns. HVG selection ensures the model focuses its capacity on the most biologically meaningful signals rather than technical noise or uninformative genes. A comprehensive benchmark of scFMs highlights that the input feature space is a critical factor in model performance [4]. While scFMs are robust tools, their ability to generate insightful embeddings for downstream tasks is directly influenced by the quality and relevance of the features they were trained on. A variability-centric view of feature selection aligns with the core strength of scRNA-seq—capturing cell-to-cell heterogeneity—and can empower scFMs to uncover deeper biological insights [12] [4].
The table below summarizes the performance of various HVG methods based on evaluations using hematopoietic stem/progenitor cells (HSPCs) and mature blood cells [11].
| Method | Reproducibility | Preference for Gene Expression Level | Notes on Performance |
|---|---|---|---|
| SCHS | High (>90%) | Prefers highly expressed genes | High accuracy in cell classification; robust performance. |
| Seurat (VST, SCT, DISP) | Low to Medium (50-70%) | Mix of high and low (quarter of genes are lowly expressed) | Common and accessible; performance can be improved with SIEVE. |
| M3Drop | Low (50-70%) | Selects lowly expressed genes | Lower distinguishing capability for similar cell types (e.g., HSPCs). |
| Scran | Medium (80-90%) | Prefers highly expressed genes | Does not select lowly expressed genes. |
| Scmap | Medium (80-90%) | Prefers highly expressed genes | Slightly lower cluster purity. |
| ROGUE/ROGUE_n | Medium (80-90%) | Prefers highly expressed genes | Does not select lowly expressed genes. |
| SIEVE | Very High (After application) | Shifts selected genes towards median expression | A meta-strategy applied to other methods to enhance reproducibility and biological relevance. |
The SIEVE strategy is designed to overcome the low reproducibility of many standalone HVG methods by leveraging multiple rounds of random sampling [11].
Inputs: A quality-controlled single-cell dataset (e.g., a Seurat or SingleCellExperiment object).
SIEVE Workflow for Robust HVG Selection
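The resampling core of this strategy can be sketched as follows. This is a simplified illustration of the idea, not the published SIEVE implementation: a naive dispersion-based selector stands in for the underlying HVG method, and all sizes and thresholds are made up.

```python
import numpy as np

def dispersion_hvgs(counts, n_top):
    """Naive stand-in selector: rank genes by dispersion (variance/mean)."""
    log_expr = np.log1p(counts)
    mean = log_expr.mean(axis=0) + 1e-8
    disp = log_expr.var(axis=0) / mean
    return set(np.argsort(disp)[::-1][:n_top])

def stable_hvgs(counts, n_top=100, rounds=20, frac=0.8, keep=0.9, seed=0):
    """SIEVE-style resampling: subsample cells repeatedly, run the selector
    on each subsample, and keep genes chosen in >= `keep` of all rounds."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    hits = np.zeros(counts.shape[1])
    for _ in range(rounds):
        idx = rng.choice(n_cells, size=int(frac * n_cells), replace=False)
        for g in dispersion_hvgs(counts[idx], n_top):
            hits[g] += 1
    return np.where(hits / rounds >= keep)[0]

rng = np.random.default_rng(1)
counts = rng.poisson(lam=rng.gamma(2.0, 1.0, size=300), size=(400, 300))
stable = stable_hvgs(counts)
print(stable.size)   # stable, high-confidence gene set
```

Genes that survive the frequency threshold form the reproducible HVG set that minimizes stochastic noise.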
| Reagent / Tool | Function in HVG Analysis / scRNA-seq |
|---|---|
| ERCC Spike-in RNAs | External RNA controls used to model technical noise and improve the accuracy of variance estimation during normalization and HVG selection [1] [2]. |
| scRNA-seq Analysis Packages (Seurat, scran, Scanpy) | Software suites that provide integrated implementations of various HVG discovery methods (e.g., VST, scran, M3Drop) within a complete analytical workflow [1] [13] [11]. |
| SIEVE Software | A dedicated tool for implementing the SIEVE resampling strategy to identify a robust and reproducible set of HVGs, available from https://github.com/YinanZhang522/SIEVE [11]. |
| Single-cell Foundation Models (scGPT, Geneformer) | Pre-trained deep learning models on large-scale scRNA-seq data. Proper HVG selection can inform the feature space used for fine-tuning these models on specific tasks [4]. |
Moving beyond traditional differential expression (DE), which focuses on changes in mean expression, Differential Variability (DV) analysis identifies genes with significant differences in expression variability (cell-to-cell heterogeneity) between two conditions [12]. These DV genes can offer distinct functional insights.
Method Spotlight: spline-DV
spline-DV identified Plpp1 (increased variability in high-fat diet) and Thrsp (decreased variability in high-fat diet) as top DV genes, providing insights into metabolic dysfunction that were not apparent from mean expression alone [12].
spline-DV Analysis Workflow
Q1: What is the fundamental difference between technical and biological variation in single-cell RNA-seq data? Biological variation refers to the natural, functionally relevant differences in gene expression between individual cells. This includes differences due to cell type, cell cycle stage, transcriptional bursts, and response to environmental stimuli [14]. Technical variation arises from the experimental process itself, including cell isolation, reverse transcription, cDNA amplification, and sequencing. This results in biases such as low capture efficiency, high dropout rates (where a gene is observed in one cell but not in another), and amplification noise [14] [15].
Q2: Why is it critical to account for technical variation before selecting Highly Variable Genes (HVGs) for model training? HVG selection focuses on genes that show more cell-to-cell variability than expected from technical noise alone [15]. If technical variation is not accounted for, the selected gene set will be contaminated with technical artifacts rather than true biological signals. This leads to poor performance in downstream tasks such as cell clustering, data integration, and training of single-cell foundation models (scFMs), as the model learns from noise instead of biology [6] [16].
Q3: How does poor feature selection impact the training and performance of a single-cell foundation model (scFM)? Benchmarking studies show that feature selection methods directly affect the quality of data integration and query mapping, which are foundational for building robust reference atlases [6]. Using poorly selected features can cause an scFM to learn incorrect cellular representations, reducing its ability to accurately predict cellular responses to perturbations (in-silico perturbation). For example, an open-loop scFM might have a low positive predictive value, which can be significantly improved by incorporating even a small amount of experimental perturbation data to guide feature selection in a "closed-loop" framework [17].
Q4: What are some common methods to identify and correct for technical variance? Common strategies include using ERCC spike-in RNAs to model technical noise directly [2], using UMIs to correct PCR amplification bias [9], decomposing per-gene variance into technical and biological components via mean-variance trend fitting (e.g., scran's modelGeneVar()) [2], and applying high-dimensional noise-reduction algorithms such as RECODE [21].
Problem: After integrating multiple datasets for scFM pre-training, cells cluster strongly by batch or study of origin rather than by biological cell type.
Problem: Your trained scFM performs well on its training data but fails to accurately map or make predictions for new query samples.
Problem: Predictions made by your scFM for genetic perturbations (e.g., knockout, overexpression) have a low rate of experimental validation.
This protocol is based on a robust benchmarking pipeline from a registered report in Nature Methods [6].
1. Define Evaluation Metrics: Select metrics that cover multiple performance categories, such as batch-effect removal, conservation of biological signal, label transfer accuracy, and detection of unseen cell populations [6].
2. Establish Baseline Methods: Run integrations with diverse baseline feature sets to establish performance ranges for scaling metrics. Recommended baselines include random gene subsets, the full (unselected) gene set, and a standard HVG selection.
3. Scale and Summarize Performance: Scale the metric scores for each method relative to the minimum and maximum baseline scores. Aggregate scores within each metric category to summarize performance.
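The scaling step amounts to min-max normalization of each metric against the baseline range. The metric names and scores below are placeholders for illustration.

```python
def scale_against_baselines(method_scores, baseline_scores):
    """Min-max scale each metric relative to the baseline range: 0 means
    'as bad as the worst baseline', 1 means 'as good as the best'."""
    scaled = {}
    for metric, value in method_scores.items():
        lo = min(baseline_scores[metric])
        hi = max(baseline_scores[metric])
        scaled[metric] = (value - lo) / (hi - lo) if hi > lo else 0.0
    return scaled

# Hypothetical scores: e.g., random-genes vs. all-genes baselines.
baselines = {"ARI": [0.2, 0.6], "NMI": [0.3, 0.7]}
method = {"ARI": 0.5, "NMI": 0.6}
print(round(scale_against_baselines(method, baselines)["ARI"], 2))   # → 0.75
```

Scaled scores can then be averaged within each metric category to produce the summary ranking.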
This protocol ensures valid statistical testing by treating samples, not individual cells, as experimental units [18].
1. Data Processing: Quality-control, filter, and normalize the scRNA-seq count matrix as usual.
2. Pseudobulk Aggregation: Sum raw counts across all cells belonging to the same sample (and, where relevant, the same cell type) to produce one expression profile per sample [18].
3. Differential Expression Analysis: Apply established bulk differential expression tools (e.g., edgeR, limma-voom) to the pseudobulk counts.

| Tool Name | Statistical Approach | Key Feature / Use Case |
|---|---|---|
| muscat [18] | Mixed-effects model or Pseudobulk | Detects subpopulation-specific state transitions from multi-sample, multi-condition data. |
| NEBULA [18] | Mixed-effects model | A fast negative binomial mixed model for large-scale multi-subject data. |
| MAST [18] | Mixed-effects model | Accounts for the high number of zero counts; supports random effects. |
| scran (pseudobulkDGE) [18] | Pseudobulk | Wraps bulk tools edgeR and limma-voom for easy use with single-cell data. |
| distinct [18] | Differential distribution test | Tests for differences in the entire expression distribution, not just the mean. |
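The pseudobulk aggregation step behind these tools amounts to summing counts per (sample, cell type) group; a minimal sketch with made-up labels, not tied to any particular package:

```python
import numpy as np

def pseudobulk(counts, samples, cell_types):
    """Sum raw counts over all cells sharing a (sample, cell_type) label,
    so samples (not individual cells) become the experimental units."""
    groups = sorted(set(zip(samples, cell_types)))
    bulk = np.zeros((len(groups), counts.shape[1]), dtype=counts.dtype)
    for i, (s, ct) in enumerate(groups):
        mask = np.array([(a == s) and (b == ct)
                         for a, b in zip(samples, cell_types)])
        bulk[i] = counts[mask].sum(axis=0)
    return groups, bulk

counts = np.array([[1, 0], [2, 3], [0, 1], [4, 0]])   # 4 cells x 2 genes
samples = ["s1", "s1", "s2", "s2"]
cell_types = ["T", "T", "T", "B"]
groups, bulk = pseudobulk(counts, samples, cell_types)
print(groups)   # → [('s1', 'T'), ('s2', 'B'), ('s2', 'T')]
print(bulk)
```

The resulting sample-level count matrix can be passed directly to bulk DE frameworks.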
| Item | Function in Experiment |
|---|---|
| Unique Molecular Identifiers (UMIs) | Molecular barcodes added to each transcript during reverse transcription. They allow for accurate molecule counting by correcting for PCR amplification bias [20]. |
| Cell Barcodes | Short DNA sequences that uniquely label all mRNAs from a single cell, allowing samples to be multiplexed and computationally demultiplexed after sequencing [20]. |
| Fluidigm C1 System | A microfluidic-array platform for automated cell capture and library preparation, suitable for medium-throughput, full-length transcriptome analysis [20]. |
| 10x Chromium | A microfluidic-droplet platform for high-throughput, 3' or 5' tag-based library preparation. It is cost-effective for profiling tens of thousands of cells [20]. |
| SMART-seq2 | A plate-based, full-length RNA-seq protocol that provides uniform transcript coverage, enabling the study of splice variants and allele-specific expression [20]. |
Data sparsity, primarily caused by dropout events where genes are measured as unexpressed due to technical limitations, obscures the true biological signal in single-cell RNA sequencing (scRNA-seq) data. This high sparsity and high dimensionality create a "curse of dimensionality" problem where technical noise accumulates and masks subtle biological phenomena, including tumor-suppressor events in cancer and cell-type-specific transcription factor activities [21].
The core issue is that statistical properties of high-dimensional spaces differ dramatically from our intuitive understanding of two- or three-dimensional spaces. As dimensionality increases, the distance between data points becomes less meaningful, and technical noise dominates the data structure, making it difficult for foundation models to learn meaningful biological representations [21].
Solution: Implement comprehensive noise reduction before scFM training. The RECODE algorithm models technical noise arising from the entire data generation process as a general probability distribution and reduces it using eigenvalue modification theory rooted in high-dimensional statistics. This approach effectively mitigates technical noise while preserving biological signals [21].
Traditional approaches that simply combine technical noise reduction with batch correction often fail because conventional batch correction methods typically rely on dimensionality reduction techniques like PCA, which themselves are insufficient to overcome the curse of dimensionality [21].
Solution: Utilize integrated approaches like iRECODE (integrative RECODE), which synergizes high-dimensional statistical noise reduction with established batch correction methods. iRECODE integrates batch correction within an "essential space" after initial noise variance-stabilizing normalization, thereby minimizing accuracy degradation and computational costs associated with high-dimensional calculations [21].
Table 1: Performance Comparison of Noise Reduction Methods
| Method | Technical Noise Reduction | Batch Effect Correction | Relative Error in Mean Expression | Computational Efficiency |
|---|---|---|---|---|
| Raw Data | None | None | 11.1-14.3% | Baseline |
| RECODE Only | Excellent | Limited | Not Available | High |
| Traditional Batch Correction | Limited | Good | Not Available | Moderate |
| iRECODE | Excellent | Excellent | 2.4-2.5% | 10x more efficient than combined approaches |
Feature selection—specifically the identification of Highly Variable Genes (HVGs)—is critical for managing data sparsity in scFM training. The choice of feature selection method significantly affects downstream integration performance, query mapping, label transfer accuracy, and detection of unseen cell populations [6].
Benchmarking studies reveal that using highly variable genes generally leads to better integrations, but the specific feature selection strategy must be carefully chosen. Methods that leverage the relationship between gene average expression level and positive ratio (the proportion of cells where a gene is detected) can more robustly identify biologically informative features amidst technical noise [3].
Table 2: Feature Selection Method Performance Benchmarks
| Method | Adjusted Rand Index (ARI) | Normalized Mutual Information (NMI) | Silhouette Coefficient | Robustness to Dropout |
|---|---|---|---|---|
| GLP | Highest | Highest | Highest | Excellent |
| VST | High | High | High | Good |
| SCTransform | High | High | High | Good |
| M3Drop/NBDrop | Moderate | Moderate | Moderate | Excellent |
| Random Selection | Low | Low | Low | Poor |
Solution: Consider advanced feature selection methods like GLP (Genes identified through LOESS with Positive ratio), which uses optimized LOESS regression to capture the relationship between gene average expression level and positive ratio while minimizing overfitting. This approach has demonstrated consistent outperformance across multiple benchmark criteria compared to eight leading feature selection methods [3].
Current benchmarking reveals that while scFMs are robust and versatile tools for diverse applications, simpler machine learning models can be more adept at efficiently adapting to specific datasets, particularly under resource constraints. Notably, no single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, and computational resources [4].
The performance improvement of scFMs often arises from creating a "smoother landscape" in the pretrained latent space, which reduces the difficulty of training task-specific models. However, the high sparsity, high dimensionality, and low signal-to-noise ratio of transcriptome data continue to present challenges for all models [4].
Solution: Evaluate the specific requirements of your biological question before committing to scFM approaches. For well-defined tasks with limited data, traditional methods may provide more efficient solutions. For exploratory analyses across diverse cell types and conditions, scFMs may offer advantages in capturing broader biological patterns [4] [5].
Input Preparation: Format your scRNA-seq data as a standard gene expression matrix with cells as columns and genes as rows [21].
Noise Variance-Stabilizing Normalization (NVSN): Map gene expression data to an essential space using NVSN to stabilize technical variance across the expression range [21].
Singular Value Decomposition: Apply SVD to decompose the normalized matrix into orthogonal components representing the primary sources of variation [21].
Principal Component Variance Modification: Modify principal component variances using eigenvalue modification theory to reduce technical noise [21].
Integrated Batch Correction: Apply Harmony batch correction within the essential space to minimize batch effects while preserving biological variation [21].
Reconstruction: Reconstruct the denoised, batch-corrected expression matrix for downstream scFM training [21].
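The core idea of modifying singular (eigen-) values can be illustrated in a few lines. This is NOT the RECODE algorithm itself, just a generic sketch of variance shrinkage via SVD; the noise-floor heuristic and data sizes are made up.

```python
import numpy as np

def svd_denoise(X, noise_level):
    """Illustrative eigenvalue-modification denoising (not RECODE): shrink
    each singular value by an estimated noise floor and reconstruct,
    keeping the full dimensionality of the data."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    # Subtract estimated noise variance from each component's variance.
    s_mod = np.sqrt(np.clip(s**2 - noise_level**2, 0.0, None))
    return (U * s_mod) @ Vt + mean

rng = np.random.default_rng(0)
signal = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 50))   # rank-5 signal
noisy = signal + rng.normal(scale=0.5, size=signal.shape)
# Rough per-component noise floor: sigma * sqrt(n_cells) (heuristic).
denoised = svd_denoise(noisy, noise_level=0.5 * np.sqrt(100))
err_noisy = np.linalg.norm(noisy - signal)
err_denoised = np.linalg.norm(denoised - signal)
print(err_denoised < err_noisy)   # → True
```

Small, noise-dominated components are suppressed while the dominant (biological) components pass through nearly unchanged.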
Data Preprocessing: Filter out genes captured in fewer than 3 cells to ensure statistical reliability [3].
Parameter Calculation: For each gene, compute its average expression level and its positive ratio, i.e., the fraction of cells in which the gene is detected [3].
Bayesian Information Criterion Optimization: Use BIC to automatically determine the optimal LOESS smoothing parameter (α), balancing goodness of fit against model complexity [3].
Two-Step LOESS Regression:
Feature Selection: Select genes with expression levels significantly higher than expected based on the LOESS-predicted values from their positive ratios [3].
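The selection principle can be sketched as follows. This is a simplified illustration of the GLP idea, not the published method: a single polynomial fit replaces the BIC-tuned two-step LOESS, and the simulated data are made up.

```python
import numpy as np

def positive_ratio_hvgs(counts, n_top=50, degree=3, min_cells=3):
    """Sketch of the GLP idea: select genes whose mean expression sits
    furthest above the trend predicted from their positive ratio
    (simplified: one polynomial fit instead of BIC-tuned LOESS)."""
    detected = (counts > 0).sum(axis=0)
    keep = detected >= min_cells                    # drop genes seen in <3 cells
    pos_ratio = detected[keep] / counts.shape[0]    # fraction of expressing cells
    log_mean = np.log1p(counts[:, keep]).mean(axis=0)
    coeffs = np.polyfit(pos_ratio, log_mean, degree)
    residual = log_mean - np.polyval(coeffs, pos_ratio)
    kept_idx = np.flatnonzero(keep)
    return kept_idx[np.argsort(residual)[::-1][:n_top]]

rng = np.random.default_rng(2)
counts = rng.poisson(lam=rng.gamma(1.5, 1.0, size=400), size=(250, 400))
hvgs = positive_ratio_hvgs(counts)
print(hvgs.size)   # → 50
```

Genes with the largest positive residual express more highly than their detection rate predicts, which is robust to dropout-driven sparsity.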
Table 3: Essential Computational Tools for scFM Training
| Tool/Resource | Primary Function | Application Context | Key Advantage |
|---|---|---|---|
| RECODE/iRECODE | Technical noise and batch effect reduction | Preprocessing for scFM training | Preserves full-dimensional data; parameter-free |
| GLP | Feature selection based on positive ratio | HVG selection for sparse data | Optimized LOESS regression minimizes overfitting |
| Harmony | Batch correction | Multi-dataset integration | Compatible with iRECODE framework |
| Vitessce | Multimodal data visualization | Quality control and result interpretation | Integrates spatial and single-cell data |
| scGPT | Foundation model architecture | scFM training and fine-tuning | Supports multiple omics modalities |
| CZ CELLxGENE | Curated single-cell data | Pretraining data source | Standardized access to annotated datasets |
When assessing scFM performance beyond standard metrics, implement ontology-informed evaluation strategies:
scGraph-OntoRWR: Measures consistency between cell type relationships captured by scFMs and prior biological knowledge encoded in cell ontologies [4].
Lowest Common Ancestor Distance (LCAD): Quantifies ontological proximity between misclassified cell types to assess the severity of annotation errors [4].
Roughness Index (ROGI): Evaluates the smoothness of the cell-property landscape in the latent space, where smoother landscapes typically indicate better generalization capability [4].
The RECODE platform extends beyond transcriptomics to epigenomic and spatial data modalities. For single-cell Hi-C data, RECODE effectively mitigates sparsity to reveal cell-specific chromatin interactions and topologically associating domains that align with bulk Hi-C counterparts [21]. Similarly, for spatial transcriptomics, integrated visualization tools like Vitessce enable correlative analysis of spatial localization and gene expression patterns [22].
Given that no single scFM consistently outperforms others across all tasks, implement a decision framework based on dataset size, task complexity, and available computational resources [4].
1. How does the choice of Highly Variable Genes (HVGs) impact the input structure of a single-cell foundation model (scFM)?
The selection of Highly Variable Genes (HVGs) is a fundamental pre-processing step that directly determines the "vocabulary" and input sequence for a transformer-based scFM. Unlike words in a language, genes in a cell have no inherent sequential order, so models must impose one. A common strategy is to rank genes by their expression levels within each cell, feeding the ordered list of top genes as a "sentence" for the model to process [5]. The number of HVGs selected (e.g., 1,200 or 2,048) defines the sequence length for each cell [4]. Different models employ various gene ordering strategies, and the choice of HVG set can influence how effectively the model learns biological relationships.
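The rank-ordering idea can be sketched in a few lines. This is illustrative only, not any specific model's tokenizer; the gene names, values, and padding token are made up.

```python
def rank_tokenize(expression, gene_names, seq_len):
    """Order a cell's expressed genes by descending expression and pad to a
    fixed sequence length -- a sketch of rank-based 'gene sentence' input."""
    pairs = [(x, g) for g, x in zip(gene_names, expression) if x > 0]
    pairs.sort(key=lambda p: -p[0])               # highest expression first
    tokens = [g for _, g in pairs][:seq_len]
    return tokens + ["<pad>"] * (seq_len - len(tokens))

genes = ["CD3E", "MS4A1", "NKG7", "ACTB", "MT-CO1"]
expr = [5.0, 0.0, 2.0, 9.0, 7.5]                  # MS4A1 is a dropout (zero)
print(rank_tokenize(expr, genes, seq_len=4))      # → ['ACTB', 'MT-CO1', 'CD3E', 'NKG7']
```

The chosen `seq_len` is exactly the HVG budget discussed above: it fixes the length of every cell's "sentence".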
2. My scFM is not performing well on downstream tasks like cell type annotation. Could the HVG selection be a factor?
Yes, absolutely. The benchmark study by Li et al. (2025) found that no single scFM consistently outperforms others across all tasks, and simpler baseline methods can sometimes be more effective, particularly under resource constraints [4] [23]. If your model is underperforming, consider that the HVG set used during pre-training might not be optimal for your specific downstream dataset. The biological variation captured by a general-purpose HVG list may not align perfectly with the cell types or states in your target data. Evaluating the "biological relevance" of the embeddings using ontology-informed metrics can help diagnose this issue [4].
3. What is the relationship between a model's architecture and its need for value embeddings alongside gene token embeddings?
This is a key architectural consideration. Because scRNA-seq data provides an expression value for each gene, models must encode both the gene's identity (the "word") and its expression level (the "emphasis"). This is typically handled through a two-part input layer: a gene-token embedding that represents the gene's identity, and a value embedding that represents its expression level, for example via binning of the expression values [4] [23].
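The two-part input layer can be sketched as a sum of a gene-identity embedding and a binned-value embedding. Value binning is described for scGPT; everything else here (embedding sizes, bin count, random weights) is a made-up illustration, not a real model's input layer.

```python
import numpy as np

def embed_input(gene_ids, expr_values, gene_emb, value_emb, n_bins):
    """Sketch of a two-part scFM input layer: each position's vector is the
    gene-identity embedding plus an embedding of its binned expression."""
    # Discretize continuous expression into equal-width bins.
    bins = np.clip(
        (expr_values / (expr_values.max() + 1e-8) * n_bins).astype(int),
        0, n_bins - 1,
    )
    return gene_emb[gene_ids] + value_emb[bins]   # (seq_len, d_model)

rng = np.random.default_rng(0)
n_genes, n_bins, d_model = 1000, 10, 16
gene_emb = rng.normal(size=(n_genes, d_model))    # learned in a real model
value_emb = rng.normal(size=(n_bins, d_model))    # learned in a real model
gene_ids = np.array([12, 7, 404])
expr_values = np.array([0.5, 3.0, 1.2])
x = embed_input(gene_ids, expr_values, gene_emb, value_emb, n_bins)
print(x.shape)   # → (3, 16)
```

Summing (rather than concatenating) the two embeddings keeps the model dimension fixed regardless of how expression is encoded.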
4. Are there scFMs that avoid the HVG selection problem altogether?
Some models are designed to use the entire genome rather than a pre-selected HVG list. For example, the scFoundation model is pretrained on nearly all human protein-encoding genes (19,264 genes) [4]. While this avoids the potential bias introduced by HVG selection, it comes at a significant computational cost and may require more sophisticated architectures or training strategies to handle the high dimensionality and sparsity of the data effectively.
Symptoms: After using an scFM for dataset integration, biological cell types remain clustered by batch (e.g., by patient or sequencing platform) instead of mixing seamlessly.
Potential Causes and Solutions:
| Step | Potential Cause | Diagnostic Check | Solution |
|---|---|---|---|
| 1 | HVG Mismatch | The set of HVGs used in pre-training captures technical artifacts specific to the pre-training datasets. | Fine-tune the model on a small sample of your target data to adapt the gene representations. Alternatively, use a model like Nephrobase Cell+ that employs adversarial training to actively remove batch signals [24]. |
| 2 | Insufficient Model Pretraining | The model was not pre-trained on data with batch effects as diverse as yours. | Check the pre-training corpus of your scFM. Select a model pre-trained on massive, diverse datasets (e.g., >30 million cells) from multiple sources, as scale and diversity improve robustness [24]. |
| 3 | Suboptimal Embeddings | The zero-shot cell embeddings from the scFM are not batch-invariant. | Use the scFM embeddings as a starting point and apply a dedicated batch-integration tool like Harmony or Scanorama as a post-processing step [25]. |
Symptoms: Your scFM fails to accurately predict gene expression changes following single or double genetic perturbations, performing worse than simple additive baselines.
Potential Causes and Solutions:
| Step | Potential Cause | Diagnostic Check | Solution |
|---|---|---|---|
| 1 | Limited Perturbation Knowledge | The model's pre-training data may have lacked sufficient perturbation examples to learn causal relationships. | A recent benchmark found that simple linear models can outperform complex scFMs for this task [26]. Consider using a baseline model or a linear model enhanced with gene embeddings extracted from an scFM [26]. |
| 2 | Ineffective Gene Embeddings | The gene-token embeddings do not adequately capture functional gene-gene relationships. | Extract the gene embedding matrix (G) from the scFM and use it to train a simpler predictive model. Benchmarks show this can sometimes match or exceed the performance of the scFM's own decoder [26]. |
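Row 2's suggestion, training a simpler predictor on the gene-embedding matrix G extracted from an scFM, can be sketched with ridge regression on synthetic data. Everything below (the linear-response setup, dimensions, names) is illustrative, not taken from the benchmark [26].

```python
import numpy as np

def fit_ridge(X, Y, lam=1.0):
    """Closed-form ridge regression: W = (X'X + lam*I)^-1 X'Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

rng = np.random.default_rng(1)
n_genes, emb_dim = 200, 16
G = rng.normal(size=(n_genes, emb_dim))        # stand-in for scFM gene embeddings
W_true = rng.normal(size=(emb_dim, n_genes))   # hypothetical linear response map
# Rows of Y: expression-change vector caused by perturbing each gene
Y = G @ W_true + 0.01 * rng.normal(size=(n_genes, n_genes))

train, test = np.arange(150), np.arange(150, 200)
W = fit_ridge(G[train], Y[train], lam=0.1)
pred = G[test] @ W                             # predict unseen perturbations
r = np.corrcoef(pred.ravel(), Y[test].ravel())[0, 1]
print(r > 0.9)
```

The point of the exercise: if a linear map from gene embeddings already predicts held-out perturbation responses well, the embeddings, not the scFM's decoder, carry the useful signal.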
This protocol is adapted from the comprehensive benchmark study by Li et al. (2025) [4] [23].
Objective: To evaluate the quality of cell embeddings generated by different scFMs for tasks like batch integration and cell type annotation.
Materials:
Methodology:
Expected Output: A holistic ranking of scFMs, identifying the strengths and limitations of each for different biological applications. The study revealed that while scFMs are robust and versatile, simpler models can be more efficient for specific datasets [4].
Objective: To determine if the gene embeddings learned by an scFM capture meaningful biological relationships.
Materials:
Methodology:
Expected Output: Quantification of how well the scFM's intrinsic gene embeddings align with established biological knowledge, providing insight into the functional insights the model has learned during pre-training [23].
The diagram below illustrates how a single-cell foundation model transforms a cell's gene expression profile into a latent representation, highlighting the critical role of HVG selection and tokenization.
HVG Processing in scFM Architecture
The following table details key computational tools and resources essential for working with single-cell foundation models and Highly Variable Genes.
| Resource Name | Type | Primary Function | Relevance to HVGs & Architecture |
|---|---|---|---|
| Geneformer [4] | Pre-trained scFM | A transformer model for cell and gene representation learning. | Uses a ranked list of 2,048 genes as input, demonstrating a specific HVG-based architecture. |
| scGPT [4] [5] | Pre-trained scFM | A generative transformer for single-cell biology. | Employs 1,200 HVGs and uses value binning for expression levels, illustrating an alternative input strategy. |
| scFoundation [4] [26] | Pre-trained scFM | A large model for gene expression and perturbation prediction. | Uses all ~19k protein-encoding genes, showcasing an architecture that bypasses HVG selection. |
| Nephrobase Cell+ [24] | Organ-Specific scFM | A kidney-focused foundation model. | Pretrained on ~40M cells; its success suggests that specialized models can outperform general ones, which has implications for HVG relevance in specific tissues. |
| CellxGene [5] | Data Platform | Provides unified access to annotated single-cell datasets. | A primary source for obtaining diverse, high-quality data for model pre-training or benchmarking, which is crucial for defining robust HVG sets. |
| Seurat [25] | Analysis Toolkit | A comprehensive R package for single-cell genomics. | Provides standard pipelines for HVG selection and serves as a common baseline for benchmarking scFMs. |
| Harmony [4] [25] | Integration Algorithm | A tool for dataset integration. | Used as a post-processing step for scFM embeddings or as a baseline to compare against the integration performance of scFMs. |
Q1: What is the core purpose of selecting Highly Variable Genes (HVGs) in single-cell RNA-seq analysis?
The primary purpose of HVG selection is to overcome the "curse of dimensionality" in single-cell RNA sequencing data by identifying a subset of genes that are most informative for distinguishing cell types or states. This process filters out genes that represent technical or biological noise, thereby enhancing the signal for downstream analyses such as clustering, dimensionality reduction, and cell type identification. Typically, only 3,000–5,000 of the tens of thousands of sequenced genes relate to cell-type-specific expression patterns, making HVG selection a critical pre-processing step to improve analytical resolution and accuracy [27].
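To make this feature-selection step concrete, here is a minimal numpy sketch of dispersion-based HVG ranking in the spirit of the Seurat/Scanpy approach: dispersions are z-scored within bins of similar mean expression, so highly expressed genes do not dominate. This is a simplified illustration, not either tool's exact algorithm.

```python
import numpy as np

def select_hvgs_by_dispersion(counts, n_top=50, n_bins=10):
    """Rank genes of a cells x genes matrix by dispersion (var/mean),
    z-scored within mean-expression bins to remove the mean trend."""
    mean = counts.mean(axis=0)
    disp = counts.var(axis=0) / np.maximum(mean, 1e-12)
    cuts = np.quantile(mean, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(mean, cuts)
    z = np.zeros_like(disp)
    for b in np.unique(bins):
        m = bins == b
        z[m] = (disp[m] - disp[m].mean()) / (disp[m].std() + 1e-12)
    return np.argsort(z)[::-1][:n_top]

# Toy data: gene 0 varies with a hidden cell type; the rest are noise
rng = np.random.default_rng(2)
counts = rng.poisson(2.0, size=(500, 100)).astype(float)
cell_type = rng.integers(0, 2, size=500)
counts[:, 0] += 10.0 * cell_type            # strong biological variation
hvgs = select_hvgs_by_dispersion(counts, n_top=10)
print(0 in hvgs)  # the biologically variable gene is recovered
```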
Q2: For a multi-sample experiment, what is the recommended strategy to select HVGs that are robust across batches?
The recommended strategy for multi-sample experiments involves performing HVG selection on a per-batch basis and then identifying the consensus genes. This ensures the selected feature space is shared across samples. The methodology is as follows:
1. Use the batch_key parameter in your HVG selection function to compute HVGs independently for each batch.
2. Identify the consensus genes that are variable across most batches (e.g., via highly_variable_nbatches).

Q3: Can I use the same set of HVGs for different analysis tasks, such as clustering and integration?
While a single set of HVGs can be used for multiple tasks, the optimal strategy may vary. For integration, the consensus method described above is highly recommended. For clustering within a single, well-controlled dataset, standard HVG selection on the entire dataset might be sufficient. However, it's important to note that no single method is universally best. For instance, SCHS excels in reproducibility but favors highly expressed genes, while other methods like M3Drop select more lowly expressed genes, which can impact clustering results [11]. Researchers should align their HVG selection strategy with their primary analytical goal.
Q1: The tool I'm using (e.g., Seurat) is not returning the expected number of HVGs, even though I specified the nfeatures parameter. Why?
This is a documented issue that can occur in specific workflows. For example, in Seurat, this behavior has been observed when the RNA assay is split into multiple layers (e.g., by a batch key) before running FindVariableFeatures. The underlying cause may be related to how the function interacts with the split assay object. As a workaround, you can try running the HVG selection on an unsplit object first or ensure you are using the latest version of the software, as this may be a resolved bug. Always check the number of variable features stored in the output object to confirm the function's behavior [29].
Q2: My downstream clustering results are poor or do not resolve known cell populations. Could the HVG selection be the cause?
Yes, the choice of HVG selection method can significantly impact clustering resolution and accuracy. Different methods have biases; for example, some may overlook lowly expressed but biologically critical genes. If clustering performance is unsatisfactory, consider these steps:
Q3: I am using an integrated object for clustering. Should I re-select HVGs after integration?
No, it is generally not sensible to re-select HVGs based on the integrated or corrected data. Highly Variable Gene detection methods are designed and calibrated for raw (or normalized) count data, which contains the technical and biological variation they are meant to discern. Integration methods like scVI explicitly remove unwanted technical variation (e.g., batch effects) to create a corrected expression matrix. Applying standard HVG selection on this "cleaned" data will not capture the intended sources of variation and is not part of standard analytical workflows [28].
The table below summarizes the performance characteristics of various HVG selection methods based on an evaluation using scRNA-seq data from hematopoietic stem/progenitor cells and mature blood cells.
Table 1: Characteristics and Performance of HVG Selection Methods
| Method | Reproducibility | Key Strengths | Key Limitations | Bias in Gene Expression Level |
|---|---|---|---|---|
| SCHS | High | High reproducibility and accuracy [11] | Prefers selection of highly expressed genes [11] | Prefers highly expressed genes [11] |
| Seurat (VST, SCT, DISP) | Medium | Good distinguishing capability for similar cell types [11] | Moderate reproducibility [11] | Selects a mix, including ~25% lowly expressed genes [11] |
| Scran | Low to Medium | Good distinguishing capability [11] | Lower reproducibility; lower cluster purity [11] | Selects almost no lowly expressed genes [11] |
| M3Drop | Low | Can identify lowly expressed variable genes [11] | Lowest distinguishing capability and classification accuracy [11] | Selects a mix, including ~25% lowly expressed genes [11] |
| ROGUE | Low to Medium | - | Lower reproducibility; lower cluster purity [11] | Selects almost no lowly expressed genes [11] |
| Scmap | Low to Medium | - | Lower reproducibility; lower cluster purity [11] | Prefers highly expressed genes [11] |
| SIEVE | High (by design) | High robustness; improves cell classification accuracy; recovers lowly expressed variable genes [11] | Computationally intensive due to multiple rounds of sampling [11] | Mitigates bias, recovers genes across expression levels [11] |
Table 2: Impact on Downstream Analysis (Based on HSPC and Mature Blood Cell Data)
| Method | Cluster Purity | Classification Accuracy (HSPCs) | Classification Accuracy (Mature Cells) |
|---|---|---|---|
| SCHS | >90% | ~85-90% | >90% |
| Seurat | >90% | ~85-90% | >90% |
| Scran | ~90% (slightly inferior) | ~85-90% | >90% |
| M3Drop | >90% | Lowest | Lowest |
| ROGUE | ~90% (slightly inferior) | ~85-90% | >90% |
| Scmap | ~90% (slightly inferior) | ~85-90% | >90% |
| SIEVE | >90% | Substantially improved | Substantially improved |
This protocol describes a standard workflow for identifying HVGs on a single-cell dataset using Seurat.
1. Normalize the data with NormalizeData. This typically involves log-normalization.
2. Run the FindVariableFeatures function. You must specify the following:
   - nfeatures: The number of genes to select (e.g., 3000).
   - selection.method: The specific algorithm to use (e.g., "vst", "sctransform", or "dispersion").
3. Access the selected genes via VariableFeatures(object) and visualize the selection using VariableFeaturePlot.

This protocol is essential for datasets comprising multiple batches or samples and is a critical precursor to data integration.
1. Run the sc.pp.highly_variable_genes(adata, batch_key='batch') function in Scanpy. This calculates HVGs within each batch independently and stores a count of how many batches each gene was variable in (highly_variable_nbatches).
2. Subset the AnnData object to these consensus HVGs before proceeding with integration or joint analysis [28].

SIEVE is a meta-strategy that can be applied to existing HVG methods to improve their robustness.
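The SIEVE-style resampling idea can be sketched as a thin wrapper around any base HVG selector: run the selector on repeated random subsamples of cells and keep genes selected in most rounds. This is an illustration of the meta-strategy in numpy, not the published SIEVE implementation.

```python
import numpy as np

def sieve_like_hvgs(counts, base_selector, n_rounds=20, frac=0.8,
                    min_freq=0.9, seed=0):
    """Keep genes selected by `base_selector` in at least `min_freq`
    of `n_rounds` random cell subsamples."""
    rng = np.random.default_rng(seed)
    n_cells, n_genes = counts.shape
    hits = np.zeros(n_genes)
    for _ in range(n_rounds):
        idx = rng.choice(n_cells, size=int(frac * n_cells), replace=False)
        hits[base_selector(counts[idx])] += 1
    return np.where(hits >= min_freq * n_rounds)[0]

def top_dispersion(counts, n_top=10):
    mean = counts.mean(axis=0)
    return np.argsort(counts.var(axis=0) / np.maximum(mean, 1e-12))[::-1][:n_top]

rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(400, 50)).astype(float)
counts[:, 5] += 8.0 * rng.integers(0, 2, size=400)   # one truly variable gene
robust = sieve_like_hvgs(counts, top_dispersion)
print(5 in robust)
```

Genes that are only variable by chance in one subsample fail the frequency threshold, which is the source of SIEVE's robustness (at the cost of running the base method many times).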
Table 3: Key Computational Tools for HVG Selection and Evaluation
| Tool / Resource | Function in HVG Research | Key Application |
|---|---|---|
| Seurat | A comprehensive toolkit for single-cell analysis. Provides multiple embedded HVG selection methods (VST, SCT, DISP). | Standardized preprocessing and HVG selection for clustering and trajectory inference [11]. |
| Scanpy | A Python-based toolkit for analyzing single-cell gene expression data. Mirrors the functionality of Seurat. | HVG selection, especially in multi-batch scenarios, and integration with other Python-based ML tools [30] [28]. |
| SCHS | A method for identifying HVGs based on the spatial distribution of cells. | Selecting a reproducible set of variable genes, particularly useful when consistency across subsamples is a priority [11]. |
| SIEVE | A strategy, not a single algorithm, that uses multiple rounds of random sampling to identify robust HVGs. | Improving the robustness and accuracy of any base HVG method, leading to better single-cell classification [11]. |
| scran | A package for low-level analyses of single-cell RNA-seq data. Provides its own method for HVG selection. | An alternative approach to HVG selection, often used in comparative benchmarks [11]. |
| Human Phenotype Ontology (HPO) | A standardized vocabulary of phenotypic abnormalities. | While not directly for HVG selection, it is crucial for phenotype-based prioritization in diagnostic variant discovery following single-cell analysis [31]. |
1. What is batch-aware feature selection and why is it critical for single-cell foundation model (scFM) training? Batch-aware feature selection is a computational strategy that identifies informative genes (features) for downstream analysis while explicitly accounting for non-biological technical differences between datasets, known as "batch effects." In the context of scFM training, which uses vast amounts of single-cell RNA sequencing (scRNA-seq) data, this is crucial because technical variation can confound true biological signals [6] [32]. Selecting features without considering batch effects can lead to a model that learns technical artifacts rather than underlying biology, compromising its performance on tasks like cell type annotation, data integration, and query mapping [6]. Proper batch-aware feature selection ensures the scFM learns robust, generalizable biological principles.
2. My integrated dataset shows good batch mixing but poor separation of known cell types. What might be the cause? This is a common challenge indicating that the integration or feature selection process may have been too aggressive, removing biological variation along with technical noise [32]. Specifically:
3. How does the number of selected features impact integration and downstream mapping tasks? The number of features selected is a critical parameter. Benchmarks show that the performance of integration and mapping is sensitive to this number [6].
Symptoms:
Investigation & Resolution Flowchart
Diagnostic Steps:
Verify Input Data Quality:
Assess Feature Selection Method:
Evaluate the Integration Algorithm:
Verify Downstream Analysis Parameters:
Symptoms:
Diagnosis and Solutions:
Table 1: Common Causes and Corrective Actions for Low Library Yield
| Category | Root Cause | Corrective Action |
|---|---|---|
| Sample Input / Quality | Degraded RNA or contaminants (phenol, salts) inhibiting enzymes. | Re-purify input sample; use fluorometric quantification (Qubit) over absorbance; ensure high purity (260/230 > 1.8) [34]. |
| Fragmentation & Ligation | Inefficient ligation due to poor enzyme activity or incorrect adapter-to-insert ratio. | Titrate adapter:insert ratios; ensure fresh ligase/buffer; optimize fragmentation parameters [34]. |
| Amplification / PCR | Too few PCR cycles or enzyme inhibitors in the reaction. | Re-amplify from leftover ligation product; avoid over-cycling which causes duplicates and bias [34]. |
| Purification & Cleanup | Overly aggressive size selection or bead cleanup leading to sample loss. | Use correct bead-to-sample ratio; avoid over-drying beads; ensure adequate washing without excessive sample loss [34] [35]. |
Proactive Prevention:
This protocol is adapted from large-scale benchmarking studies [6] to evaluate the impact of feature selection on scRNA-seq data integration and query mapping.
1. Data Preprocessing:
2. Feature Selection:
3. Data Integration:
4. Performance Evaluation:
Table 2: Key Metrics for Evaluating Integration and Mapping Performance
| Category | Metric | Description | What a Good Score Indicates |
|---|---|---|---|
| Batch Correction | iLISI (Integration LISI) | Measures diversity of batches in a cell's neighborhood [32]. | High score: Batches are well-mixed. |
| | Batch PCR (Batch Principal Component Regression) | Quantifies the variance explained by batch in the latent space [6]. | Low score: Less technical variation. |
| Biology Preservation | cLISI (Cell-type LISI) | Measures diversity of cell labels in a cell's neighborhood [6]. | High score: Cell types are distinct. |
| | bNMI (Batch-balanced NMI) | Compares clustering similarity to cell labels, balanced across batches [6]. | High score: Biological groups are conserved. |
| Query Mapping | Cell Distance | Average distance between query cells and their nearest reference neighbors [6]. | Low score: Query cells map precisely to reference. |
| | mLISI (Mapping LISI) | Assesses mixing of query and reference cells in local neighborhoods [6]. | High score: Query and reference are well-integrated. |
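The LISI-family metrics in the table share one core computation: a diversity score over the labels in each cell's k-nearest-neighbor neighborhood. The numpy sketch below uses a plain inverse Simpson index over hard kNN sets; it is illustrative only and omits the perplexity-based neighborhood weighting of the real LISI/scib implementations.

```python
import numpy as np

def inverse_simpson_mixing(emb, labels, k=15):
    """Mean inverse Simpson index of labels within each cell's kNN
    neighborhood: 1 = one label per neighborhood, up to the number of
    distinct labels for perfect mixing."""
    emb = np.asarray(emb, dtype=float)
    labels = np.asarray(labels)
    uniq = np.unique(labels)
    d2 = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)                 # exclude self
    knn = np.argsort(d2, axis=1)[:, :k]
    scores = []
    for nbrs in knn:
        p = np.array([(labels[nbrs] == u).mean() for u in uniq])
        scores.append(1.0 / (p ** 2).sum())
    return float(np.mean(scores))

rng = np.random.default_rng(4)
mixed = rng.normal(size=(100, 2))                     # batches overlap
separated = np.vstack([mixed[:50], mixed[50:] + 20])  # batches far apart
batch = np.array(["a"] * 50 + ["b"] * 50)
print(inverse_simpson_mixing(mixed, batch) > inverse_simpson_mixing(separated, batch))
```

With batch labels this behaves like iLISI (higher is better mixing); with cell-type labels the same quantity read in reverse corresponds to cLISI's intent of keeping neighborhoods pure.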
This protocol outlines the use of sysVI, a method designed for challenging integrations [32] [33].
1. Installation and Setup:
- Install the scvi-tools Python package [32].

2. Model Configuration:
3. Execution:
4. Validation:
Table 3: Essential Computational Tools & Resources for scFM Research
| Resource Name | Type | Primary Function | Relevance to Batch-Aware Analysis |
|---|---|---|---|
| scanpy [6] | Python Package | Scalable single-cell analysis. | Provides implementations for standard HVG selection and preprocessing. |
| scvi-tools [32] | Python Package | Probabilistic models for scRNA-seq. | Hosts scalable integration methods like scVI and sysVI for substantial batch effects. |
| batchelor [36] | R/Bioconductor Package | Methods for correcting batch effects. | Implements fast and efficient batch correction algorithms like MNN. |
| Seurat [37] | R Package | Single-cell genomics analysis. | Offers a comprehensive integration workflow, including anchor-based integration. |
| CZ CELLxGENE [5] | Data Platform | Curated collection of single-cell datasets. | Provides a unified source of high-quality, annotated data essential for scFM pretraining and benchmarking. |
| Harmony [37] | Algorithm / Package | Data integration method. | A popular and efficient method for integrating datasets across technical batches. |
Q1: What are the main advantages of using foundation models over traditional methods for single-cell data analysis? Single-cell foundation models (scFMs) are robust and versatile tools that learn universal biological knowledge from massive datasets during pretraining. This endows them with emergent abilities for zero-shot learning and efficient adaptation to various downstream tasks, such as cell type annotation, batch integration, and drug sensitivity prediction. However, for specific tasks with limited data or resources, simpler machine learning models can sometimes be more efficient and effective [4] [5].
Q2: How can I select highly variable genes (HVGs) effectively for my scFM training? Traditional HVG selection methods can be challenged by the high sparsity and dropout noise of scRNA-seq data. The GLP (LOESS with positive ratio) method provides a robust alternative by identifying biologically informative genes through the relationship between a gene's positive ratio (the fraction of cells where it is detected) and its average expression level. Genes with expression levels significantly higher than expected for their positive ratio are selected, which helps preserve key biological signals for downstream analysis [3].
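A toy numpy sketch of the GLP idea follows: flag genes whose mean log expression sits far above the trend expected for their positive ratio. This is illustrative only; the published method fits a BIC-tuned LOESS curve, for which binned robust statistics stand in here.

```python
import numpy as np

def glp_like_selection(counts, n_bins=5, z_cut=2.0):
    """GLP-inspired sketch: compute each gene's positive ratio f and
    mean log expression lam, then flag genes whose lam deviates far
    above the binned trend of lam given f."""
    f = (counts > 0).mean(axis=0)                 # positive ratio per gene
    lam = np.log1p(counts).mean(axis=0)           # mean log expression
    bins = np.clip((f * n_bins).astype(int), 0, n_bins - 1)
    resid = np.zeros_like(lam)
    for b in np.unique(bins):
        m = bins == b
        med = np.median(lam[m])
        mad = np.median(np.abs(lam[m] - med)) + 1e-12
        resid[m] = (lam[m] - med) / mad
    return np.where(resid > z_cut)[0]

rng = np.random.default_rng(5)
rates = rng.uniform(0.1, 3.0, size=80)            # background Poisson genes
counts = rng.poisson(rates, size=(600, 80)).astype(float)
# Gene 3: detected in few cells (low f) but strongly expressed when on
on = rng.random(600) < 0.3
counts[:, 3] = 0.0
counts[on, 3] = rng.poisson(20.0, size=on.sum())
print(3 in glp_like_selection(counts))
```

Note why this helps with sparsity: a variance-based criterion can conflate dropout with biology, whereas conditioning on the positive ratio isolates genes that are unusually strongly expressed for how rarely they are detected.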
Q3: Why is my model failing to identify rare cell types or subtle biological signals? This is a common challenge, often stemming from how features are selected. Standard HVG methods may overrepresent highly abundant cell types and miss less abundant ones. The performance is closely tied to dataset size; with larger and more diverse pilot datasets, the proportions of cells in each cluster become more similar to the ground-truth data. Using feature selection methods specifically designed to capture nuanced biological information, like GLP, can improve the detection of rare cell types [38] [3].
Q4: Can I incorporate prior biological knowledge, like gene networks, to improve my model's performance? Yes, integrating known biological networks can significantly increase the power to identify biologically relevant signals. Methods like Markov Random Field (MRF) models appropriately accommodate gene network information as well as dependencies among cell types. This allows the model to borrow information across related genes and cell types, leading to more statistically powerful and biologically insightful identification of features like cell-type-specific differentially expressed genes [39].
Symptoms:
Solutions:
Symptoms:
Solutions:
Symptoms:
Solutions:
- Model the relationship between f (independent variable) and λ (dependent variable) using LOESS regression, with an optimized bandwidth selected by the Bayesian Information Criterion (BIC) to prevent overfitting.

| Metric | Description | Application Context |
|---|---|---|
| scGraph-OntoRWR [4] | Measures consistency of cell type relationships captured by the model with prior biological knowledge from ontologies. | Evaluating biological relevance of scFM embeddings. |
| Lowest Common Ancestor Distance (LCAD) [4] | Measures ontological proximity between misclassified cell types; a smaller distance indicates a less severe error. | Benchmarking cell type annotation accuracy. |
| Adjusted Rand Index (ARI) [38] [3] | Measures the similarity between two data clusterings (e.g., from synthetic vs. real data). | Evaluating clustering performance in downstream analysis. |
| Silhouette Coefficient [3] | Measures how similar a cell is to its own cluster compared to other clusters. | Assessing the quality of clustering outcomes. |
| Roughness Index (ROGI) [4] | Quantifies the smoothness of the cell-property landscape in the latent space; a smoother landscape is easier for downstream modeling. | Serving as a proxy for model performance on a specific dataset. |
| Reagent / Resource | Function in Analysis | Key Reference/Source |
|---|---|---|
| CZ CELLxGENE [5] | A unified platform providing access to over 100 million curated single-cell datasets for model pretraining and benchmarking. | https://cellxgene.cziscience.com/ |
| GLP Algorithm [3] | A robust feature selection method to identify highly variable genes by modeling the relationship between positive ratio and average expression. | https://github.com/WangyuchenCS/GLP |
| MRFscRNAseq R Package [39] | Implements a Markov Random Field model to identify cell-type-specific differentially expressed genes by incorporating gene network information. | Available on GitHub |
| PEREGGRN Benchmarking Platform [40] | A software platform for fairly evaluating expression forecasting methods on a collection of perturbation transcriptomics datasets. | Associated with Genome Biology (2025) |
| Problem Category | Specific Issue | Possible Causes | Solution | Related Analysis Step |
|---|---|---|---|---|
| Data Quality | High dropout rates in scRNA-seq data | Low RNA input, inefficient cDNA amplification [41] | Use Unique Molecular Identifiers (UMIs) and spike-in controls; employ computational imputation [41] | HVG Selection, Clustering |
| | Batch effects between sequenced and spatial data | Technical variation from different experimental batches [41] | Apply batch correction algorithms (e.g., Combat, Harmony, Scanorama) [41] | Data Integration |
| Integration | Weak linkage between modalities (e.g., protein & RNA) | Few correlated features, low signal-to-noise ratio [42] | Use iterative integration methods (e.g., MaxFuse) that use all features for co-embedding [42] | Cross-Modal Integration |
| | Incorrect cell type matching | Poor initial alignment, over-reliance on highly variable genes [42] | Implement fuzzy smoothing on linked features and use linear assignment for matching [42] | Cell Type Annotation |
| Feature Selection | HVG list contains technical noise | High sparsity and dropout events masking biological variation [3] | Use GLP method modeling positive ratio vs. expression level with optimized LOESS [3] | HVG Selection for scFM Training |
| | Selected genes fail to capture key biology | Assumptions of mean-variance trend do not hold [2] [3] | Quantify biological component of variation using modelGeneVar() or spike-in trends [2] | Downstream Analysis |
| Computational | scFM predictions have low positive predictive value | "Open-loop" model not refined with experimental data [17] | Fine-tune foundation model with perturbation data ("closed-loop" ISP) [17] | In Silico Perturbation |
Purpose: To significantly improve the positive predictive value (PPV) of a single-cell foundation model (scFM) like Geneformer by incorporating experimental data [17].
Procedure:
Purpose: To accurately integrate data from two weakly linked modalities, such as targeted spatial proteomics and whole-transcriptome scRNA-seq [42].
Procedure:
| Item | Function in Experiment | Application Context |
|---|---|---|
| Unique Molecular Identifiers (UMIs) | Tags individual mRNA molecules to correct for amplification bias and quantify absolute transcript counts [41]. | scRNA-seq Library Prep |
| Spike-In Controls (e.g., ERCC) | Exogenous RNA transcripts added to samples to monitor technical noise and help model gene variation [2]. | scRNA-seq Quality Control |
| 10X Genomics Chromium / Visium | Platform for droplet-based single-cell RNA sequencing and spatially resolved transcriptomics [43] [41]. | scRNA-seq & SRT Library Generation |
| BD Rhapsody Single-Cell Analysis System | Another platform for whole transcriptome analysis at single-cell resolution, used in spaceflight studies [43]. | scRNA-seq |
| CRISPRa/i Perturb-seq Library | Enables large-scale genetic perturbation screens coupled with single-cell RNA readout, providing data for scFM fine-tuning [17]. | Closed-Loop ISP Validation |
| CITE-seq Antibody Panel | Allows for simultaneous measurement of surface proteins and transcriptome in single cells, creating a linked dataset [42]. | Multi-Modal Integration |
| Cell Hashing Oligonucleotides | Labels cells from different samples with unique barcodes, allowing for sample multiplexing and identification of cell doublets [41]. | Sample Multiplexing & QC |
| CODEX Multiplexed Antibody Panel | Enables highly multiplexed spatial proteomics imaging, which can be integrated with transcriptomic data [42]. | Spatial Proteomics |
| Method | Core Principle | Key Metric(s) | Key Considerations for scFM Training |
|---|---|---|---|
| GLP (Genes by LOESS & Positive Ratio) [3] | Identifies genes whose average expression is significantly higher than expected based on their positive ratio (fraction of cells expressing the gene). | Deviation from optimized LOESS curve of λ vs. f [3] | Directly models dropout rate, which is a more precise population parameter than variance. Helps select biologically informative genes in sparse data [3]. |
| modelGeneVar (scran) [2] | Fits a mean-variance trend to log-normalized expression values across all genes. The biological component is total variance minus the technical component. | Biological component of variation [2] | Assumes most genes are driven by uninteresting noise. Can be inflated if many genes at an abundance are biologically variable [2]. |
| modelGeneVar with Spike-Ins [2] | Fits a mean-dependent trend to the variance of spike-in transcripts to better estimate the technical component. | Biological component of variation [2] | Provides a cleaner estimate of technical noise, but requires spike-in data and assumes they mimic technical variation of endogenous genes [2]. |
| VST (Seurat) [3] | Uses a variance stabilizing transformation based on a generalized linear model of the mean-variance relationship. | Standardized variance [3] | A widely used and robust method that is a standard benchmark in the field [3]. |
MaxFuse Cross-Modal Integration Workflow
Closed-Loop scFM Fine-Tuning for ISP
The selection of Highly Variable Genes (HVGs) is a critical preprocessing step in single-cell RNA sequencing (scRNA-seq) analysis, directly influencing the performance of downstream tasks such as clustering, data integration, and the training of single-cell foundation models (scFMs) [16] [6]. This guide addresses common challenges and provides practical solutions for integrating robust HVG selection into scFM training workflows, framed within the context of advanced research in gene selection methodologies.
Single-cell foundation models require high-quality, informative input features to learn meaningful biological representations. HVGs—genes that exhibit significant cell-to-cell variation—are prioritized because they are most likely to represent interesting biological heterogeneity rather than technical noise [16]. Selecting HVGs:
Table 1: Essential Tools and Resources for HVG Selection and scFM Training
| Category | Tool/Resource | Primary Function | Key Consideration |
|---|---|---|---|
| HVG Selection Methods | scanpy (Seurat-like), scran, BASiCS [1] | Identifies genes with high biological variability | No single best method; consider hybrid approaches [44] |
| Integration & scFM Training | scVI [45], scANVI [46], scGPT [4] | Deep learning models for integration and foundation model training | Performance depends on quality of input features [4] |
| Benchmarking & Evaluation | scIB [6], scGraph-OntoRWR [4] | Metrics for integration quality and biological relevance | Evaluate both batch correction and biological conservation [6] |
| Data Resources | CellxGene [4], PanglaoDB [46] | Curated cell type markers and reference datasets | Crucial for annotation and validation |
The following diagram illustrates a robust workflow for integrating HVG selection into scFM training, designed to handle complex, multi-batch datasets.
Problem: Datasets from different technologies, species, or laboratories show strong batch effects, and standard HVG selection fails, leading to poor integration.
Solution: Implement a batch-aware consensus HVG selection strategy.
1. Use the batch_key parameter in scanpy.pp.highly_variable_genes() to compute HVGs separately for each batch or system (e.g., species) [45].
2. Retain the consensus genes that are variable across most batches (highly_variable_nbatches) [28].

Problem: After standard HVG selection, batches remain separated in the integrated embedding.
Troubleshooting Steps:
- In scVI, increasing the model complexity (e.g., using 2 layers instead of 1) can sometimes help capture more complex batch effects [47].
- Consider methods such as scVI or sysVI that are explicitly designed to handle batch effects as a covariate [45] [46].

Problem: The choice of the number of HVGs and the selection method seems arbitrary, and performance varies.
Evidence-Based Guidance:
- Hybrid approaches that combine multiple HVG selection methods (e.g., mixHVG) demonstrate more robust performance [44].

Problem: It is unclear how to quantitatively assess the quality of the integrated embedding generated by the scFM.
Comprehensive Evaluation Metrics: A robust evaluation should assess both batch effect removal and conservation of biological variation [6] [48]. The table below summarizes key metrics.
Table 2: Key Metrics for Evaluating scFM Output After HVG Selection
| Evaluation Category | Metric | What It Measures | Ideal Outcome |
|---|---|---|---|
| Batch Effect Removal | Batch ASW [6] | How well mixed batches are within cell neighborhoods. | Higher Score |
| | iLISI (Integration LISI) [6] | Likelihood of a cell's neighbors coming from multiple batches. | Higher Score |
| Biological Conservation | cLISI (Cell-type LISI) [6] | Likelihood of a cell's neighbors being of the same cell type. | Higher Score |
| | Isolated Label F1 [6] | How well rare cell types are preserved after integration. | Higher Score |
| Biological Insight (for scFMs) | scGraph-OntoRWR [4] | Consistency of cell-type relationships in the embedding with known biology (e.g., cell ontology). | Higher Score |
Problem: Foundation models typically require large amounts of data, but my dataset is limited.
Solutions and Considerations:
For exceptionally challenging integrations, such as across species (mouse/human) or different technologies (scRNA-seq vs. snRNA-seq), a stricter HVG protocol is required. The sysVI model recommends this workflow [45]:
- Using the batch_key, identify HVGs for each system independently.
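The per-system selection followed by intersection can be sketched as below (illustrative helper names, not the sysVI code): only genes that are independently variable in every system survive.

```python
import numpy as np

def cross_system_hvgs(system_matrices, selector, n_top):
    """Select HVGs independently within each system (cells x genes
    matrices) and keep only the intersection."""
    hvg_sets = [set(selector(x, n_top).tolist()) for x in system_matrices]
    return sorted(set.intersection(*hvg_sets))

def top_var(counts, n_top):
    # Simple stand-in selector: top genes by variance
    return np.argsort(counts.var(axis=0))[::-1][:n_top]

rng = np.random.default_rng(6)
sys_a = rng.normal(size=(200, 30)); sys_a[:, [0, 1, 2]] *= 10.0
sys_b = rng.normal(size=(200, 30)); sys_b[:, [0, 1]] *= 10.0
sys_b[:, 2] *= 0.1                      # gene 2 is variable only in system A
shared = cross_system_hvgs([sys_a, sys_b], top_var, n_top=3)
print(shared)  # [0, 1] — gene 2 is dropped
```

The intersection is deliberately strict: it shrinks the feature set, but every retained gene carries signal in both systems, which is what hard cross-species or cross-technology integrations need.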
This section addresses frequently asked questions about the role of tissue atlases in single-cell research, with a focus on selecting highly variable genes (HVGs) for training single-cell foundation models (scFMs).
Q1: How can tissue atlases improve the selection of highly variable genes for scFM training? Tissue atlases provide a foundational reference for understanding gene expression patterns across diverse tissues and cell types. When selecting HVGs, researchers can use atlas data to prioritize genes that show biologically meaningful variation, such as those with high tissue specificity, rather than technical noise. For instance, the miRNATissueAtlas uses a Tissue Specificity Index (TSI) to classify RNAs, a concept that can be directly applied to gene selection for scFMs [49] [50]. By integrating TSI values, you can filter your gene list to include those with documented biological variability, thereby improving the signal captured by your scFM.
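One widely used formulation of such a tissue specificity index is tau; the sketch below uses that formula as an illustration (the exact miRNATissueAtlas TSI definition may differ in detail).

```python
import numpy as np

def tissue_specificity_index(expr):
    """Tau-style tissue specificity index:
    tau = sum(1 - x_i / max(x)) / (N - 1) over N tissues.
    0 = uniform (housekeeping-like), 1 = expressed in a single tissue."""
    x = np.asarray(expr, dtype=float)
    if x.max() == 0:
        return 0.0
    xn = x / x.max()
    return float((1 - xn).sum() / (len(x) - 1))

print(tissue_specificity_index([5, 5, 5, 5]))   # 0.0 -> uniform expression
print(tissue_specificity_index([10, 0, 0, 0]))  # 1.0 -> fully tissue-specific
```

Filtering candidate HVGs by such an index (e.g., keeping genes above a tau threshold in atlas data) is one concrete way to prioritize biologically variable genes over technical noise.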
Q2: What are the consequences of poor HVG selection on scFM performance? Benchmarking studies reveal that the choice of input features significantly impacts scFM performance on downstream tasks [4]. Poor HVG selection can lead to:
Q3: Are complex scFMs always better than simpler models for tasks based on tissue atlas data? No, a key finding from recent benchmarks is that no single scFM consistently outperforms others across all tasks [4]. The decision to use a complex scFM versus a simpler machine learning model depends on:
Q4: How can I validate that my scFM has learned biologically relevant features from tissue atlas data? Beyond standard performance metrics, you can use ontology-informed metrics to assess biological relevance:
This guide helps diagnose and resolve common issues encountered when utilizing tissue atlases or building upon their data.
Problem: Inability to Replicate Tissue-Specific Findings from an Atlas
Problem: scFM Fails to Generalize to a New Perturbation or Disease Dataset
Problem: Low Accuracy in Predicting Gene Expression Changes from Perturbations
Case Study 1: Constructing a Tissue-Specific Protein-Protein Interaction Atlas [51] [52]
The workflow for this protein association analysis is summarized in the diagram below:
Case Study 2: Benchmarking Single-Cell Foundation Models [4]
Case Study 3: Large-Scale Lung Disease Perturbation Screening [53]
The conceptual pipeline for this drug perturbation atlas is as follows:
Table 1: Key Features of Recent Tissue and Interaction Atlases
| Atlas Name | Data Type | Scale | Key Application | Reference / Access |
|---|---|---|---|---|
| miRNATissueAtlas 2025 [49] [50] | 9 sncRNA classes | 61,593 samples (Human & Mouse); 224 human tissues | Tissue specificity index (TSI) calculation; cross-species comparison | https://web.ccb.uni-saarland.de/mirnatissueatlas_2025/ |
| Protein-Protein Interaction Atlas [51] [52] | Protein coabundance | 7,811 samples; 11 human tissues; 116M protein pairs | Prioritizing candidate disease genes in a tissue-specific context | www.ppiatlas.com |
| Human Protein Atlas v25 [54] | Protein expression & localization | All protein-coding genes; 10M+ images; 34 scRNA-seq tissues | Spatial proteomics; disease blood protein profiling; interaction networks | https://www.proteinatlas.org/ |
| Lung Disease Perturbation Atlas [53] | scRNA-seq post-perturbation | 900 pharmacological interventions on human lung tissue | Identifying therapeutic targets and regenerative circuits | In development (Helmholtz Munich) |
Table 2: scFM Performance on Key Tasks (Synthesized from Benchmarking Studies) [4]
| Model Task | Performance Insight | Key Metric(s) | Recommendation for HVG Selection |
|---|---|---|---|
| Cell Type Annotation | Performance varies; scFMs do not always beat baselines. Error severity can be assessed. | Accuracy, LCAD | Select HVGs with known cell-type specificity from atlases to improve accuracy. |
| Batch Integration | scFMs are generally robust, but simpler methods can be competitive. | Local Inverse Simpson's Index (LISI) | Ensure HVGs are not driven by batch-specific technical artifacts. |
| Biological Relevance | Pretrained scFM embeddings capture meaningful biological relationships. | scGraph-OntoRWR | Prioritize HVGs that are central in gene regulatory networks. |
| Drug Sensitivity Prediction | A clinically relevant task where scFM generalization can be tested. | AUPRC, MSE | Incorporate pathway-specific genes from disease atlases into the feature set. |
Table 3: Key Research Reagent Solutions for Atlas Construction and scFM Training
| Item / Resource | Function | Example Use Case |
|---|---|---|
| CORUM Database [51] | A curated database of experimentally characterized protein complexes. | Used as a ground-truth reference for training and validating protein-protein association predictions [51]. |
| Cell Ontology | A structured, controlled vocabulary for cell types. | Enables the use of metrics like LCAD and scGraph-OntoRWR to evaluate the biological plausibility of scFM outputs [4]. |
| Parse Biosciences Evercode / GigaLab [53] | A scalable single-cell RNA sequencing platform based on combinatorial barcoding. | Used for generating massive perturbation datasets, such as the lung disease atlas, with reduced batch effects [53]. |
| Olink & SomaScan Assays [54] | High-throughput proteomics platforms for measuring protein levels in biofluids. | Used in the Human Protein Atlas to build the Human Disease Blood resource, profiling 71 diseases [54]. |
| AlphaFold3 [54] | A deep learning model for highly accurate protein structure prediction. | Used to predict structures for thousands of protein-protein interactions within the Human Protein Atlas [54]. |
| PEREGGRN Benchmarking Platform [40] | A software platform for fairly evaluating expression forecasting methods on unseen genetic perturbations. | Prevents data leakage and provides a standardized way to compare new forecasting methods against simple baselines [40]. |
Q: What are Highly Variable Genes (HVGs) and why is their selection a critical step in scRNA-seq analysis?
A: Highly Variable Genes (HVGs) are those that show considerable variation in expression across the single cells in your dataset. Selecting them is a pivotal step because these genes are often the main drivers of meaningful biological heterogeneity, such as differences between cell types or states. Focusing on HVGs helps to reduce the data dimensionality, decrease computational noise, and enhance the signal for downstream analyses like clustering and trajectory inference [16] [2].
Q: How do I determine the optimal number of Highly Variable Genes to use for my analysis?
A: There is no universal "correct" number of HVGs; the optimal number is dataset-dependent and involves a trade-off between retaining biological signal and introducing noise. A common heuristic is to select between 2,000 and 5,000 HVGs [16]. The best practice is to use a data-driven approach by ranking genes based on a measure of their biological variability and then selecting a cut-off where the ranking starts to be dominated by technical noise rather than biological signal. Many analysis workflows, such as the one in Seurat, use a default of 3,000 HVGs [13]. Performance can be evaluated using downstream metrics like silhouette width or the accuracy of known cell type separation [55].
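The data-driven cut-off idea can be sketched as follows. The simulated counts, the dispersion statistic, and the 1.5 threshold are all illustrative assumptions, not recommended defaults:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy counts: 200 cells x 100 genes; the first 10 genes receive extra
# per-cell biological variability on top of Poisson technical noise.
base = rng.poisson(5, size=(200, 100)).astype(float)
base[:, :10] *= rng.gamma(shape=2.0, scale=0.5, size=(200, 1))

mean = base.mean(axis=0)
var = base.var(axis=0)
dispersion = var / np.maximum(mean, 1e-8)  # pure Poisson noise gives ~1

# Rank genes by dispersion and cut where it approaches the technical
# (Poisson) expectation of ~1; here a crude fixed threshold stands in
# for a proper trend fit.
order = np.argsort(dispersion)[::-1]
n_hvg = int((dispersion > 1.5).sum())
hvgs = order[:n_hvg]
print(n_hvg)
```

On this toy data the cut-off recovers roughly the ten truly variable genes; on real data the threshold should come from a fitted mean-variance trend rather than a fixed constant.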
Q: What are the consequences of selecting too many or too few HVGs?
A: The number of HVGs selected has a direct impact on your results.
Q: My downstream clustering seems driven by technical artifacts like cell cycle phase. Did I choose the wrong number of HVGs?
A: Not necessarily. While an improper HVG count can exacerbate this, the issue often lies in the data normalization step. Technical variation from sources like cell cycle, mitochondrial read percentage, or sequencing depth can confound biological differences. It is recommended to check and, if necessary, regress out these nuisance variables during the normalization and HVG selection process using methods like SCTransform in Seurat [13]. This ensures that the selected HVGs reflect interesting biological variation.
Description: The cell clusters identified change significantly when you increase or decrease the number of HVGs used, leading to instability in your biological interpretation.
Solution:
Description: A cell type that you expect to be present based on prior knowledge or marker genes does not form a distinct cluster in your analysis.
Solution:
The following table summarizes the characteristics of different statistical models used to quantify per-gene variation and select HVGs. The choice of model influences which genes are prioritized.
| Method | Underlying Model | Key Feature | Best Suited For |
|---|---|---|---|
| ModelGeneVar [2] | Fits a trend to the variance of log-normalized values across all genes. | Separates total variance into technical (uninteresting) and biological (interesting) components. | General purpose analysis where most genes are not differentially expressed. |
| ModelGeneVarWithSpikes [2] | Fits a trend to the variance of spike-in transcripts. | Uses spike-ins to directly model technical noise without biological contamination. | Datasets with reliably added spike-in controls. |
| ModelGeneVarByPoisson [2] | Assumes UMI counts exhibit near-Poisson technical noise. | Constructs a technical trend based on a Poisson distribution assumption. | UMI-based datasets (e.g., 10x Genomics) without spike-in controls. |
| sctransform [13] | Regularized Negative Binomial regression. | Directly models and removes technical variation (e.g., sequencing depth), returning residuals as normalized data. | A modern, robust method recommended for UMI data that avoids overfitting. |
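The Poisson-based decomposition in the table can be sketched in a few lines. Note that scran's actual modelGeneVarByPoisson fits the technical trend on log-normalized values, so this raw-count version is only a conceptual illustration:

```python
import numpy as np

def decompose_variance(counts):
    """Split per-gene variance of UMI counts into a technical component
    (Poisson assumption: variance equals mean) and a biological
    remainder. A simplified sketch of the modelGeneVarByPoisson idea."""
    counts = np.asarray(counts, dtype=float)
    total = counts.var(axis=0)
    technical = counts.mean(axis=0)           # Poisson expectation
    biological = np.maximum(total - technical, 0.0)
    return total, technical, biological

rng = np.random.default_rng(1)
pois = rng.poisson(4, size=(500, 3)).astype(float)    # pure-noise genes
bio = rng.poisson(rng.gamma(4, 1, size=(500, 1)))     # overdispersed gene
counts = np.hstack([pois, bio])
_, tech, biol = decompose_variance(counts)
# The overdispersed gene should carry the bulk of biological variance.
print(biol.argmax())  # → 3
```

Genes are then ranked by the biological component, so HVG selection is driven by variation beyond what sampling noise alone would produce.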
This protocol outlines the steps for normalizing data and identifying HVGs using the SCTransform method within the popular Seurat package, which accounts for technical confounders.
1. Prerequisite: Quality Control
2. Normalization & HVG Selection with SCTransform
- The SCTransform function performs normalization, variance stabilization, and HVG selection in a single step.
- Regress out mitoRatio (mitochondrial gene percentage) and, if identified as a major source of variation, cell cycle scores [13].
- SCTransform ranks genes by residual variance and outputs the 3,000 most variable genes, which are stored in the "SCT" assay of the Seurat object [13].

3. (Optional) Cell Cycle Scoring
- Run NormalizeData, then score cells for S and G2/M phase using pre-defined gene lists with CellCycleScoring.
- Visualize cells by Phase; if the cells do not separate strongly by phase, cell cycle may not need to be regressed out [13].

4. Downstream Validation
The following diagram illustrates the logical process for selecting and validating the set of Highly Variable Genes for your analysis.
| Item | Function in HVG Analysis |
|---|---|
| Spike-in RNAs (e.g., ERCC) [55] [57] | Exogenous RNA controls of known concentration used to create a standard curve. They help to accurately model technical noise for improved HVG selection, especially in full-length sequencing protocols. |
| Unique Molecular Identifiers (UMIs) [55] [57] | Random barcodes that tag individual mRNA molecules before amplification. UMIs correct for PCR amplification bias, leading to more accurate gene expression counts and a more reliable quantification of gene variability. |
| 10x Genomics Chromium [56] | A widely used droplet-based single-cell platform that incorporates UMIs by default, generating data suitable for robust HVG detection methods like SCTransform and modelGeneVarByPoisson. |
| Seurat R Toolkit [13] | A comprehensive software package that provides multiple integrated functions for scRNA-seq analysis, including the SCTransform normalization/HVG method and standard FindVariableFeatures with several model options. |
| SingleCellExperiment (SCE) Object [58] [2] | A standard data structure in Bioconductor for storing single-cell data. It is used by various packages (e.g., scran) that offer alternative HVG selection methods like the deconvolution-based approach and modelGeneVar. |
Why does Highly Variable Gene (HVG) selection significantly impact the reproducibility of my single-cell Foundation Model (scFM) training? HVG selection directly influences which biological signals your model learns. Different HVG methods can select substantially different gene sets, leading to models that capture varying aspects of the data. A 2025 benchmark found that feature selection methods significantly affect integration performance and subsequent query mapping, with implications for model generalizability [6]. Selecting inconsistent HVGs across experiments will yield models that prioritize different biological features, directly harming reproducibility.
What are the primary sources of irreproducibility in HVG selection? The main sources are:
How can I determine if my HVG selection is capturing biological signal versus technical noise? Use spike-in controls when available to model technical noise separately from biological variation [2]. For data without spike-ins, leverage mean-variance trend modeling or Poisson-based noise models [2]. Additionally, evaluate your selected HVGs using batch-aware methods that can distinguish technical batches from biological variation [6].
Issue: Your scFM produces cell embeddings that lead to inconsistent cell type annotations when compared to reference atlases.
Solution:
Issue: Your scFM performs well on training data but fails to generalize to new datasets.
Solution:
Experimental Protocol: Assessing Generalization Capability
Issue: You're unsure which of the many HVG methods to implement for optimal reproducibility.
Solution:
Table 1: HVG Method Categories and Characteristics [1] [59]
| Method Category | Representative Methods | Key Characteristics | Reproducibility Considerations |
|---|---|---|---|
| Differential Expression Based | Wilcoxon rank-sum, t-test, logistic regression | Uses statistical testing between groups; most common approach | Simple methods show strong performance; less parameter tuning needed |
| Variance Modeling | Brennecke, scran, scVEGs | Models mean-variance relationship; decomposes technical and biological variation | Requires proper normalization; sensitive to distribution assumptions |
| Feature Selection | NSForest, SMaSH, RankCorr | Selects genes maximally informative for classification | May prioritize different genes than DE methods; evaluate with task-specific metrics |
| Bayesian Approaches | BASiCS | Uses hierarchical models to decompose variation sources | Computationally intensive but provides uncertainty quantification |
Table 2: Quantitative Performance of Common Methods Across Benchmarking Studies [6] [1] [59]
| Method | Integration Performance | Biological Conservation | Query Mapping | Computational Efficiency |
|---|---|---|---|---|
| Highly Variable (scanpy) | High | High | Moderate-High | High |
| Wilcoxon Test | Moderate | High | Moderate | High |
| Seurat VDM | Moderate-High | Moderate-High | Moderate | High |
| scran | Moderate | High | Moderate | Moderate |
| BASiCS | Moderate | Moderate | Moderate | Low |
Table 3: Key Experimental Materials for Reproducible HVG Selection
| Reagent/Resource | Function in HVG Selection | Implementation Considerations |
|---|---|---|
| Spike-in Controls (ERCC) | Enables technical noise modeling for variance decomposition | Use consistent concentrations across experiments; required for methods like BASiCS [2] |
| Batch-Aware Normalization | Removes technical artifacts while preserving biological variation | Choose methods appropriate for your technology (UMI vs. full-length) [1] |
| Reference Cell Atlases | Provides ground truth for biological conservation metrics | Use consistent mapping and annotation practices across studies [23] |
| Standardized Quality Metrics | Quantifies integration and mapping performance | Implement multiple metric types: batch correction, bio conservation, and query mapping [6] |
To ensure reproducible HVG selection for scFM training, implement this standardized evaluation protocol adapted from recent benchmarks [6] [23]:
Establish Baselines:
Comprehensive Metric Selection:
Scale and Aggregate Scores:
Dataset-Specific Validation:
This structured approach ensures that HVG selection is evaluated across multiple performance dimensions relevant to scFM training, significantly enhancing reproducibility across studies and research groups.
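The scale-and-aggregate step might look like the following in practice. The scores are hypothetical, and the 0.6/0.4 weighting mirrors the scIB convention but is an assumption here, not a prescription from the cited benchmarks:

```python
import numpy as np

# Hypothetical raw scores for three HVG methods on two metric families.
raw = {
    "batch_correction": np.array([0.70, 0.55, 0.80]),  # e.g., iLISI
    "bio_conservation": np.array([0.60, 0.75, 0.50]),  # e.g., cLISI
}

def minmax(x):
    """Scale each metric to [0, 1] across methods so metrics on
    different numeric ranges can be averaged fairly."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.ones_like(x)

scaled = {k: minmax(v) for k, v in raw.items()}
# scIB-style aggregate: weight biological conservation above batch
# removal (0.6 vs. 0.4); adjust to your evaluation priorities.
overall = 0.6 * scaled["bio_conservation"] + 0.4 * scaled["batch_correction"]
print(overall.round(2), overall.argmax())
```

The method with the highest aggregate (here, method index 1) would be preferred, subject to the dataset-specific validation step above.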
FAQ 1: Why is my single-cell RNA-seq data failing to identify known rare cell populations?
Your data may be affected by technical noise, including the "dropout effect," where genes are not detected even when expressed [60]. This is particularly detrimental for rare cells, where biological signals are already faint. Ensuring sufficient transcriptome coverage (number of genes detected per cell) is critical; below an empirical threshold, it becomes impossible to reliably separate true rare cell expression from technical artifacts [61]. Furthermore, standard clustering algorithms often fail to identify populations comprising less than 2% of the total cells, leading to rare cells being merged with abundant populations [62].
FAQ 2: How can I improve the sensitivity of my experiment for rare cell detection?
Sensitivity can be improved both experimentally and computationally.
FAQ 3: What is the trade-off between sequencing more cells versus sequencing them more deeply?
This trade-off depends on your biological question. Research has shown that when the number of genes required to answer the question is small, greater transcriptome coverage (i.e., deeper sequencing per cell) is more important than analyzing a massive number of cells. Deeper sequencing reduces subsampling noise, which is crucial for accurately resolving the expression distribution of individual genes, especially those expressed in rare cells [61]. However, for discovering extremely rare cell types, sequencing a large number of cells remains necessary, provided each cell has sufficient coverage.
FAQ 4: Which feature selection method should I use for datasets with fine-resolution cell types or minority populations?
Many standard Highly Variable Genes (HVG) selection methods struggle with fine-resolution datasets. A novel framework called Mcadet has been developed to address this. It integrates Multiple Correspondence Analysis (MCA) and graph-based community detection to more accurately select informative genes from complex datasets, including those with minority cell populations [64]. Performance comparisons on such datasets suggest Mcadet outperforms several other established feature selection methods [64].
Problem: A high proportion of zero counts in your data, known as the "dropout effect," is obscuring real biological signals, particularly for lowly expressed genes.
Solution: Implement a computational noise-reduction method.
The following diagram illustrates the functional principle of how iRECODE processes single-cell data to enhance biological signals.
Problem: Standard unsupervised clustering methods (e.g., Seurat, SC3) are unable to identify rare cell types that constitute less than 1-2% of your total cell population [62].
Solution: Employ a two-step clustering approach specifically designed for rare cell detection.
The workflow for this two-step clustering strategy is outlined below.
This protocol is adapted from a study that used smFISH as a gold standard to validate findings from single-cell RNA sequencing [61].
1. Objective: To quantitatively assess the tradeoffs in scRNA-seq data for detecting gene expression variability in rare cells.
2. Materials:
3. Methodology:
4. Expected Outcome: The smFISH data will provide a high-resolution, quantitative baseline of true gene expression distribution, against which the sensitivity and accuracy of scRNA-seq protocols can be rigorously evaluated. This allows for the establishment of empirical quality thresholds (e.g., minimum transcripts/cell or genes/cell) necessary for reliable rare cell analysis.
Table comparing different single-cell RNA sequencing methods based on their reported sensitivity, number of genes detected, and other key metrics relevant to rare cell detection.
| Method | Reported Sensitivity (Spike-in) | Key Improvements | Impact on Rare Cell Detection |
|---|---|---|---|
| CEL-Seq2 [63] | ~20% (from 5.8% in CEL-Seq) | Shorter primer, optimized RT enzymes, bead-based clean-up, ligation-free library prep. | Detects twice as many transcripts and 30% more genes per cell, improving the chance of capturing rare cell signatures. |
| DropSeq [61] | Information Not Specified | High-throughput, low cost per cell. | Wide range of transcriptome coverage per cell; requires careful thresholding to avoid false positives/negatives for rare genes. |
| Fluidigm C1 [61] | Information Not Specified | More even transcriptome distribution, higher reads/cell. | Lower number of cells sequenced, but higher data quality per cell can be beneficial. |
Table summarizing key computational tools designed to address challenges in rare cell population identification.
| Tool | Function | Key Advantage | Reference |
|---|---|---|---|
| CellSIUS | Rare cell population identification | Identifies rare cell types and their functional transcriptomic signatures from complex data. | [62] |
| Mcadet | Feature Selection (HVG selection) | Superior performance on fine-resolution datasets and datasets with minority cell types. | [64] |
| iRECODE | Technical and Batch Noise Reduction | Comprehensive noise reduction across multiple data types (RNA-seq, spatial, scHi-C) with low computational cost. | [60] |
| Symphony | Reference Atlas Mapping | Efficiently maps query cells to a large, integrated reference to transfer annotations and identify cell states. | [65] |
Table of key reagents, technologies, and computational tools used in the field of rare cell analysis.
| Item | Function/Description | Application in Rare Cell Studies |
|---|---|---|
| Single Molecule RNA FISH | A gold standard method for quantitative, single-cell, single-molecule mRNA counting using fluorescent probes [61]. | Validating gene expression distributions and rare cell states identified by scRNA-seq [61]. |
| Fluidigm C1 System | An automated microfluidic system for capturing individual cells and performing single-cell RNA sequencing. | Provides high-sensitivity data with more uniform transcriptome coverage, useful for characterizing rare cells [61] [63]. |
| CEL-Seq2 Primers | Optimized primers with Unique Molecular Identifiers (UMIs) for highly multiplexed, sensitive scRNA-seq. | Increases transcript detection efficiency, improving the resolution of gene expression in all cells, including rare types [63]. |
| CellSIUS Software | A computational algorithm for Cell Subtype Identification from Upregulated gene Sets. | Detects rare cell populations and their signature genes from complex scRNA-seq data after coarse clustering [62]. |
| iRECODE Platform | A computational method for comprehensive noise reduction in single-cell data. | Reduces technical dropouts and batch effects, clarifying subtle biological signals from rare cells [60]. |
What are the most common sources of technical bias in scRNA-seq data for scFM training? Technical biases primarily arise from the sequencing platform (e.g., different 10x Genomics kit chemistries), library preparation protocols, and sample processing batches. For scFMs, which are trained on massive, aggregated datasets, these biases can obscure true biological variation. Key artifacts include batch effects, where technical differences mimic biological signals; ambient RNA, which is background noise from lysed cells; and variations in sequencing depth between samples [5] [32] [56].
Why is handling technical artifacts critical for selecting Highly Variable Genes (HVGs) in scFM research? HVG selection is a foundational step that identifies genes with high biological variance for downstream analysis. Technical artifacts can artificially inflate the variance of non-informative genes, leading to a biased HVG list. Training an scFM on such a list will cause the model to learn noise instead of underlying biology, reducing its performance and generalizability across diverse cell types and tissues [4] [66].
My dataset shows good clustering but poor integration with a public atlas. Is this a technical bias? Yes, this is a classic symptom of substantial batch effects, often described as "system-level" biases. This occurs when integrating across different biological systems (e.g., primary tissue vs. organoids) or technologies (e.g., single-cell vs. single-nuclei RNA-seq). Standard batch correction methods may fail or inadvertently remove biological signal in these scenarios, requiring more advanced integration strategies [32].
Problem: Suspected technical artifacts are confounding the biological signal in your dataset, leading to unreliable HVG selection.
Solution: Follow a systematic quality control (QC) and diagnostic protocol.
Experimental Protocol:
The relationship between QC metrics and data filtering is a sequential diagnostic process, summarized in the following workflow:
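A minimal version of the standard QC gates might look like this. The thresholds (detected genes, mitochondrial fraction) and the "MT-" prefix convention are illustrative assumptions; tune them per tissue and platform:

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=200, max_mito_frac=0.2):
    """Flag cells passing two common QC gates: a minimum number of
    detected genes and a maximum mitochondrial read fraction.
    Thresholds are illustrative, not universal defaults."""
    counts = np.asarray(counts, dtype=float)
    mito = np.char.startswith(np.asarray(gene_names, dtype=str), "MT-")
    n_genes = (counts > 0).sum(axis=1)
    mito_frac = counts[:, mito].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
    return (n_genes >= min_genes) & (mito_frac <= max_mito_frac)

genes = ["MT-CO1", "ACTB", "CD3E"]
cells = np.array([[90, 5, 5],    # 90% mitochondrial reads -> fail
                  [2, 50, 48]])  # low mito fraction -> pass
keep = qc_filter(cells, genes, min_genes=2, max_mito_frac=0.2)
print(keep)  # → [False  True]
```

Filtering before HVG selection keeps dying or lysed cells from inflating the apparent variance of stress and mitochondrial genes.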
Problem: Widespread, low-level expression of marker genes in unlikely cell types, suggesting contamination from ambient RNA.
Solution: Use computational tools to estimate and subtract the ambient RNA profile.
Experimental Protocol:
Problem: Batch effects are so strong that they prevent meaningful integration and consensus HVG selection across datasets.
Solution: Move beyond simple linear correction methods to more powerful deep learning models.
Experimental Protocol:
The following table summarizes the key reagents and computational tools essential for tackling technical artifacts.
| Tool / Reagent | Primary Function | Key Application in scFM Research |
|---|---|---|
| Cell Ranger [56] [66] | Raw data processing & alignment | Generates standardized gene-barcode matrices from platform-specific raw data (FASTQ); the foundational step for all analysis. |
| SoupX / CellBender [56] [66] | Ambient RNA removal | Removes technical noise from the count matrix, ensuring HVG selection is based on true cellular expression. |
| Harmony [66] | Batch effect correction | A fast and efficient method for integrating datasets from different batches or donors, often used in atlas-level projects. |
| scvi-tools [32] [66] | Deep generative modeling | Uses variational autoencoders (VAEs) for powerful, probabilistic batch correction and integration of complex datasets. |
| sysVI [32] | Integration of diverse systems | A cVAE-based method designed for substantial batch effects (e.g., cross-species), using VampPrior and cycle-consistency. |
For research specifically aimed at training single-cell foundation models, where data scale and quality are paramount, a more robust pipeline is recommended. The diagram below integrates multiple correction strategies to produce clean, integrated data for robust HVG selection.
This workflow emphasizes that handling technical artifacts is not a single step but a cascade of pre-processing decisions. By systematically addressing these biases, researchers can select HVGs that more accurately reflect biology, thereby building more robust and generalizable single-cell foundation models [4] [32].
Problem: You are using only Highly Variable Genes (HVGs) or only Spatially Variable Genes (SVGs) for clustering, which may be capturing an incomplete picture of the biological variation.
Solution: Combine HVG and SVG gene sets to improve clustering accuracy.
- Identify HVGs with standard methods (e.g., modelGeneVar in scran or FindVariableFeatures in Seurat) from the gene expression matrix [2].
- Identify SVGs with dedicated methods (e.g., nnSVG, SPARK-X, SpatialDE) that incorporate spatial coordinates [69] [70].

Problem: The choice of SVG detection method significantly impacts results, as different methods can yield highly dissimilar SVG lists [70].
Solution: Select a method based on your data type and the specific category of SVGs you wish to find.
HVGs are genes whose expression levels show high variance across individual cells, often identified from single-cell RNA-seq data without spatial context. The underlying assumption is that high biological variation is more interesting than technical noise [2]. SVGs are genes whose expression levels show a non-random, spatially autocorrelated pattern across the tissue [69]. In spatial transcriptomics data, these two gene sets are often distinct, suggesting they capture complementary biological information [68].
Using the union of HVGs and SVGs is more effective than using all genes. Analyses show that including all genes does not improve accuracy further and can sometimes decrease performance, likely due to the introduction of non-informative genes that add noise [68]. The combined set provides a curated, informative feature list that enhances downstream analysis efficiency and accuracy.
Some HVG detection methods can have low reproducibility. To address this, you can employ strategies like SIEVE (SIngle-cEll Variable gEnes), which uses multiple rounds of random sampling to identify a robust and stable set of variable genes, thereby improving downstream classification accuracy [11].
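The SIEVE-style resampling idea can be sketched as follows. The specific criterion here (top-dispersion genes kept if selected in at least 90% of rounds) is a simplification of the published method, shown on simulated data:

```python
import numpy as np

def stable_hvgs(counts, n_rounds=20, frac=0.8, n_top=10, min_freq=0.9, seed=0):
    """SIEVE-style sketch: select top-dispersion genes on repeated random
    cell subsamples and keep only genes chosen in >= min_freq of rounds.
    Illustrative only; SIEVE's actual selection criterion differs."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts, dtype=float)
    n_cells, n_genes = counts.shape
    hits = np.zeros(n_genes)
    for _ in range(n_rounds):
        idx = rng.choice(n_cells, int(frac * n_cells), replace=False)
        sub = counts[idx]
        disp = sub.var(axis=0) / np.maximum(sub.mean(axis=0), 1e-8)
        hits[np.argsort(disp)[::-1][:n_top]] += 1
    return np.where(hits / n_rounds >= min_freq)[0]

rng = np.random.default_rng(42)
X = rng.poisson(5, size=(300, 50)).astype(float)
X[:, :5] *= rng.gamma(2, 0.5, size=(300, 1))  # 5 truly variable genes
print(stable_hvgs(X, n_top=5))
```

Genes that survive repeated subsampling are robust to which cells happened to be captured, which is exactly the reproducibility property scFM training needs.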
Computational time and memory usage vary significantly between SVG methods [70]. For large datasets, favor scalable methods such as SPARK-X or nnSVG [70].
| Platform | Number of Datasets | Key Performance Improvement with Combined HVG+SVG |
|---|---|---|
| 10X Visium | Multiple | Significant increase in AMI and weighted F1 score; improved delineation of cancer cells, connective tissues, and immune cells. |
| 10X Xenium | Multiple (e.g., Kidney) | Improved separation of proximal tubule segments (PCT, PCT-TAL) and better classification of endothelial and mesangial cells. |
| Nanostring CosMx | Multiple (e.g., Patient 5-2, FOV 7) | More accurate identification of tumor cells and specific immune cell types (B-cells, neutrophils). |
| Vizgen merFISH | Multiple (e.g., Mouse Hypothalamus) | Enhanced classification of inhibitory neurons. |
This table compares popular SVG detection methods based on a systematic benchmark study [70].
| Method | Key Characteristics | Considerations |
|---|---|---|
| nnSVG | Nearest-neighbor Gaussian process; high correlation with Moran's I and MERINGUE. | Low to moderate dependency on gene expression level. |
| SPARK-X | Non-parametric model; computationally fast. | High dependency on gene expression level; can be biased towards highly expressed genes. |
| SpatialDE | Gaussian process regression. | Shows low concordance with other methods; results can be highly variable across datasets. |
| Moran's I | Measures spatial autocorrelation. | Moderate dependency on gene expression level. |
| SOMDE | Self-organizing map. | Often reports very few significant SVGs. |
This protocol is based on the workflow used to evaluate the clustering performance of combined gene sets on real spatial transcriptomics data [68].
Data Preprocessing:
Feature Selection:
nnSVG, SPARK-X) using both the gene expression matrix and spatial coordinates. Select genes based on a statistically significant adjusted p-value (e.g., FDR < 0.05) to form the SVG set [70].Dimensionality Reduction and Clustering:
Performance Evaluation:
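Clustering agreement with reference annotations can be quantified with metrics such as the Adjusted Rand Index. Below is a pure-numpy sketch (ARI rather than AMI, whose expected-mutual-information correction is more involved; sklearn's adjusted_rand_score is the production alternative):

```python
import numpy as np

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two clusterings, computed from the
    contingency table. 1.0 means identical partitions; ~0 means
    chance-level agreement."""
    a, b = np.asarray(a), np.asarray(b)
    ua, ub = np.unique(a), np.unique(b)
    table = np.array([[np.logical_and(a == i, b == j).sum() for j in ub]
                      for i in ua])
    comb = lambda x: x * (x - 1) / 2          # pairs within a group
    sum_ij = comb(table).sum()
    sum_a = comb(table.sum(axis=1)).sum()
    sum_b = comb(table.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb(len(a))
    max_index = (sum_a + sum_b) / 2
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions score 1.0; label names do not matter.
print(adjusted_rand_index([0, 0, 1, 1], ["x", "x", "y", "y"]))  # → 1.0
```

Comparing ARI (or AMI) between the HVG-only, SVG-only, and combined feature sets makes the benefit of the union quantitative.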
The table below lists key computational tools and their functions for analyzing variable genes in spatial transcriptomics.
| Tool / Resource | Function | Use Case |
|---|---|---|
| Seurat | R toolkit for single-cell and spatial genomics; includes HVG detection and integration of spatial coordinates. | Standard pipeline for preprocessing, HVG selection, and initial spatial analysis [70]. |
| Giotto | Suite for spatial transcriptomics data analysis; includes multiple built-in SVG detection methods. | Analyzing spatial patterns and identifying spatial domains [70]. |
| nnSVG | Scalable method for detecting SVGs using nearest neighbor Gaussian processes. | Robust and scalable SVG detection suitable for large datasets [70]. |
| SPARK-X | Non-parametric method for detecting SVGs; computationally efficient. | Rapid SVG detection on large-scale datasets [70]. |
| SIEVE | Strategy that uses multiple rounds of random sampling to identify robust HVGs. | Improving the reproducibility and accuracy of HVG selection in scRNA-seq data [11]. |
FAQ 1: Why can't I use standard HVG selection methods for multi-omic foundation model training? Standard highly variable gene (HVG) selection methods are designed for single-modal data (e.g., scRNA-seq alone) and quantify variation based on expression patterns within that single modality [71]. Multi-omic foundation models, such as scGPT, are trained on diverse data types including transcriptomic and epigenomic data (e.g., scATAC-seq) which have fundamentally different statistical characteristics and scales [72] [73]. Applying standard HVG selection directly fails to account for the integrated nature of multi-omic cellular representations, potentially selecting features that optimize for technical variance rather than shared biological meaning across modalities.
FAQ 2: How does data binarization help with multi-omic integration for foundation models? Binarizing scRNA-seq data (converting gene expression to "on"/"off" states) creates quantitative similarity with scATAC-seq data, which is inherently binary in nature [72]. This transformation enables direct vertical integration through concatenation of the two modalities, followed by application of scATAC-seq-optimized algorithms like TF-IDF and Latent Semantic Indexing (LSI) [72]. This approach avoids subjective conversion of scATAC-seq data to gene activity scores and enables direct investigation of how each data type contributes to cell identity resolution, which is crucial for foundation model pretraining.
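The binarize-concatenate-TF-IDF-LSI workflow can be sketched with numpy. The TF-IDF scaling below is one common variant and the toy matrices are assumptions; real pipelines (e.g., Signac's implementation) differ in scaling details:

```python
import numpy as np

def tfidf_lsi(binary_matrix, n_components=2):
    """TF-IDF weighting followed by truncated SVD (LSI) on a binarized
    cells-x-features matrix, as commonly applied to scATAC-seq data.
    One common scaling variant; tool implementations differ."""
    X = np.asarray(binary_matrix, dtype=float)
    tf = X / np.maximum(X.sum(axis=1, keepdims=True), 1)        # term freq.
    idf = np.log1p(X.shape[0] / np.maximum(X.sum(axis=0), 1))   # inverse doc freq.
    tfidf = tf * idf
    # Truncated SVD: keep the top components as the LSI embedding.
    U, S, _ = np.linalg.svd(tfidf, full_matrices=False)
    return U[:, :n_components] * S[:n_components]

rna_counts = np.array([[3, 0, 2], [0, 4, 0], [5, 0, 1]])
atac_peaks = np.array([[1, 0], [0, 1], [1, 0]])
rna_binary = (rna_counts > 0).astype(float)      # binarize RNA to match ATAC
combined = np.hstack([rna_binary, atac_peaks])   # cells x (genes + peaks)
emb = tfidf_lsi(combined)
print(emb.shape)  # → (3, 2)
```

The resulting low-dimensional embedding treats both modalities on equal footing, which is the property that makes direct vertical integration viable.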
FAQ 3: What are the key computational challenges in HVG selection for cross-species foundation models? Cross-species foundation models like scPlantFormer face significant challenges in HVG selection due to orthology mapping complexities and evolutionary divergence in gene regulatory networks [73]. The primary challenge involves identifying genes whose variability patterns conserve biological meaning across species boundaries while accounting for technical batch effects that can exceed biological variation. Successful models address this by integrating phylogenetic constraints into their attention mechanisms and employing batch correction algorithms like Harmony or Seurat's integration methods before HVG selection [74] [73].
Symptoms: Foundation model fails to learn unified representations; modality-specific clustering persists in latent space.
Solutions:
Verification: Check that cell-type separation improves in integrated UMAP visualizations and biological replicate alignment increases.
Symptoms: Technical variation dominates HVG selection; batches cluster separately despite biological similarity.
Solutions:
Verification: Compare pre- and post-integration visualizations; biological groups should cluster together across technical batches.
Symptoms: Computational bottlenecks during feature selection; memory overload with million-cell datasets.
Solutions:
Verification: Monitor computational resource usage and ensure selected HVGs maintain performance on downstream tasks.
Purpose: Create unified feature representations from scRNA-seq and scATAC-seq data for foundation model training.
Materials:
Procedure:
Binarization: set each expression value to 1 if raw count > 0, otherwise 0 [72]
Feature Selection:
Data Concatenation:
Concatenate [binary_RNA_data | ATAC_data] with cells as rows and the union of features as columns [72]
Normalization & Reduction:
Validation: Compare clustering resolution and cell-type discrimination against standard gene activity score approaches.
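Assuming both matrices are cells × features, the concatenation and LSI steps above can be sketched with plain numpy. Real pipelines apply TF-IDF weighting first and usually discard the depth-correlated first component; `lsi_embed` is a hypothetical helper name, not part of any package.

```python
import numpy as np

def lsi_embed(binary_rna, atac, n_components=2):
    """Concatenate binarized RNA with ATAC features (cells x features) and
    embed with truncated SVD (Latent Semantic Indexing). Minimal sketch:
    no TF-IDF weighting and no removal of the first component.
    """
    combined = np.hstack([binary_rna, atac]).astype(float)
    u, s, _ = np.linalg.svd(combined, full_matrices=False)
    return u[:, :n_components] * s[:n_components]       # scaled left singular vectors

# Toy data: 3 cells, 2 binarized RNA features + 3 ATAC peaks
rna_bin = np.array([[1, 0], [0, 1], [1, 1]])
atac = np.array([[1, 1, 0], [0, 0, 1], [1, 0, 1]])
embedding = lsi_embed(rna_bin, atac)
```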
Purpose: Identify HVGs with dynamic expression patterns across multiple time points for temporal foundation modeling.
Materials:
Procedure:
Data Integration:
Time-Course HVG Identification:
Pathway Enrichment Analysis:
Validation: Visualize dynamic expression patterns of selected HVGs across multiple cell types and time points.
| Method | Data Modality | Integration Approach | Clustering Accuracy | Computational Efficiency |
|---|---|---|---|---|
| Standard HVG Selection [71] | scRNA-seq only | Not applicable | 77-100% (cell-type matching) | High |
| Binarization + TF-IDF/LSI [72] | scRNA-seq + scATAC-seq | Direct concatenation | 86% mean accuracy (improved separation) | Medium |
| Foundation Model Embeddings [73] | Multi-omic | Cross-modal attention | 92% cross-species accuracy | Lower (pretraining required) |
| Time-Course HVG Framework [75] | Time-series scRNA-seq | Temporal integration | Captures dynamic patterns | Medium |
| Model | Training Scale | Multi-Omic Support | Key HVG-Related Features | Reported Performance |
|---|---|---|---|---|
| scGPT [73] | 33M+ cells | Transcriptomics + Epigenomics | Zero-shot cell annotation, perturbation prediction | Superior multi-omic integration |
| scPlantFormer [73] | 1M plant cells | Cross-species transcriptomics | Phylogenetic constraints in attention | 92% cross-species accuracy |
| Nicheformer [73] | 53M spatial cells | Spatial + Dissociated data | Spatial context prediction | Improved niche identification |
| PathOmCLIP [73] | Multi-tumor datasets | Histology + Spatial transcriptomics | Contrastive learning for cross-modal alignment | Enhanced gene expression prediction |
| Tool/Package | Primary Function | Application in HVG Selection | Reference |
|---|---|---|---|
| Seurat | Single-cell analysis | HVG identification, data integration, multi-omic processing | [74] [75] |
| Scanpy | Single-cell analysis | Binarization processing, TF-IDF normalization, clustering | [72] |
| Harmony | Batch correction | Removing technical variation before HVG selection | [74] [73] |
| SCTransform | Normalization | Regularized negative binomial regression for improved HVG detection | [75] |
| BioLLM | Foundation model benchmarking | Standardized evaluation of HVG selection approaches across models | [73] |
| DoubletFinder | Quality control | Doublet identification to improve HVG selection accuracy | [75] |
| SoupX | Ambient RNA correction | Background noise reduction for cleaner HVG signals | [75] |
| gProfiler2 | Functional enrichment | Biological interpretation of selected HVGs | [75] |
This is a common problem known as overcorrection, where batch correction methods remove both technical artifacts and genuine biological signals. Recent benchmarking studies reveal that many popular methods struggle with this balance.
Solutions:
Experimental Protocol:
Current benchmarks indicate that method performance varies significantly across different scenarios, and no single method consistently outperforms others across all tasks [4].
Table 1: Batch Correction Method Performance Summary
| Method | Strengths | Limitations | Recommended Use Cases |
|---|---|---|---|
| Harmony | Consistently performs well without creating artifacts [77] | Only outputs low-dimensional embeddings [76] | Standard batch effects within similar systems |
| sysVI (VAMP + CYC) | Handles substantial batch effects while preserving biology [33] | More complex implementation | Cross-species, organoid-tissue, different protocols |
| cVAE with adversarial learning | Strong batch mixing | Prone to mixing unrelated cell types [33] | Not recommended for datasets with unbalanced cell types |
| Seurat | Good overall performance in benchmarks [76] | Can overcorrect with too many neighbors [76] | Technical batches within same biological system |
Selection Framework:
Feature selection critically impacts integration quality and downstream query mapping. Recent registered report findings provide specific guidance:
Table 2: Feature Selection Impact on Integration Performance
| Feature Selection Method | Integration Quality | Query Mapping | Biological Conservation |
|---|---|---|---|
| Highly Variable Genes (HVG) | High | Moderate to High | Good |
| Batch-aware HVG | Highest | High | Good |
| Random Features | Poor | Variable | Poor |
| Stably Expressed Genes | Poor | Poor | Poor |
Key Findings:
Traditional benchmarking has overemphasized batch mixing while underestimating biological conservation. New frameworks address this limitation:
Recommended Metric Framework:
Evaluation Workflow for scRNA-seq Integration
Overcorrection occurs when batch correction removes genuine biological variation, leading to false biological conclusions.
Detection Methods:
Experimental Protocol for Overcorrection Detection:
Deep learning methods, particularly variational autoencoders, offer flexible frameworks for balancing batch correction with biological preservation.
Key Advantages:
Deep Learning Integration Strategies
Table 3: Essential Computational Tools for scRNA-seq Integration
| Tool/Resource | Function | Application Context |
|---|---|---|
| RBET Framework | Overcorrection-aware evaluation | Validating integration quality without biological knowledge degradation [76] |
| scIB Metrics | Comprehensive integration benchmarking | Standardized evaluation of batch correction and biological conservation [6] |
| sysVI | Handling substantial batch effects | Cross-system integration (species, technologies, organoid-tissue) [33] |
| Harmony | Robust standard batch correction | Technical batch effects within similar biological systems [77] |
| scVI/scANVI | Deep learning integration | Flexible integration with semi-supervised capabilities [78] |
| PEREGGRN | Expression forecasting benchmark | Evaluating perturbation prediction performance [40] |
| GGRN Software | Grammar of gene regulatory networks | Network-based expression forecasting [40] |
Q1: Why is the selection of Highly Variable Genes (HVGs) so critical for single-cell Foundation Model (scFM) training?
HVG selection is a fundamental preprocessing step that reduces the high dimensionality and sparsity inherent in single-cell RNA-seq data. Selecting a subset of informative genes helps to mitigate technical noise and computational burden, allowing the model to focus on genes that drive biological heterogeneity. The choice of HVG selection strategy can significantly influence the model's ability to learn meaningful biological representations, ultimately affecting performance on downstream tasks like cell type annotation and perturbation prediction [4] [3] [16].
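To make this feature-selection step concrete, here is a minimal dispersion-based HVG ranking in numpy. It is a deliberately bare sketch with an invented function name: production tools such as Scanpy's `pp.highly_variable_genes` or Seurat's `FindVariableFeatures` additionally bin genes by mean expression and normalize dispersions within bins.

```python
import numpy as np

def select_hvgs(log_counts, n_top=2):
    """Rank genes by dispersion (variance / mean) on a cells x genes matrix
    of log-normalized expression and return indices of the top n_top genes.
    Sketch only: no mean-expression binning or dispersion normalization.
    """
    mean = log_counts.mean(axis=0)
    var = log_counts.var(axis=0)
    # Guard against division by zero for genes with zero mean
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    return np.argsort(dispersion)[::-1][:n_top]

# Toy matrix: 4 cells x 3 genes; only gene 1 varies across cells
X = np.log1p(np.array([[1, 5, 1], [1, 0, 1], [1, 5, 1], [1, 0, 1]], dtype=float))
top = select_hvgs(X, n_top=1)   # gene 1 is the most dispersed
```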
Q2: I encountered a "reciprocal condition number" error when using Seurat V3's HVG selection with a batch_key in Scanpy. How can I resolve this?
This error often arises when one or more batches in your dataset contain genes with very low or zero counts, making the covariance matrix for the LOESS regression ill-conditioned [79]. You can try the following troubleshooting steps:
- Filter low-abundance genes first, e.g. with `sc.pp.filter_genes(adata, min_counts=)` (supplying a threshold appropriate to your data), so that every batch retains genes with sufficient counts.
- Switch to an HVG flavor (e.g., `flavor='cell_ranger'`) that does not use the same internal regression, or select HVGs without the `batch_key` argument. Note that selecting HVGs without batch correction may leave technical confounders in your data [79].

Q3: My scFM underperforms compared to simple baseline models on perturbation prediction tasks. Is this a known issue?
Yes, recent independent benchmarks have highlighted this challenge. Several studies have found that for specific tasks like predicting transcriptome changes after genetic perturbations, sophisticated scFMs (such as scGPT and scFoundation) can be outperformed by deliberately simple baselines, including a model that just predicts the mean expression from the training data or a linear model using Gene Ontology features [26] [80]. This suggests that the goal of building a generalizable model for predicting novel experimental outcomes is still an active area of research, and simpler models should be included as baselines in your workflow.
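The "train mean" baseline is trivial to reproduce, which is exactly why it belongs in every benchmark. The sketch below pairs it with a simplified Pearson-delta score (correlation of predicted vs. observed expression changes relative to a control profile); all arrays are toy data, not benchmark values.

```python
import numpy as np

def pearson_delta(pred_post, true_post, control):
    """Pearson correlation between predicted and observed expression
    *changes* relative to a control profile (simplified form of the
    Pearson-delta metric used in perturbation benchmarks)."""
    pred_change = pred_post - control
    true_change = true_post - control
    return float(np.corrcoef(pred_change, true_change)[0, 1])

# "Train mean" baseline: predict the mean post-perturbation profile
# observed in training (toy profiles over 3 genes).
train_profiles = np.array([[2.0, 1.0, 0.5], [2.2, 0.8, 0.7]])
prediction = train_profiles.mean(axis=0)          # per-gene training mean
control = np.array([1.0, 1.0, 1.0])
observed = np.array([2.0, 0.9, 0.4])
score = pearson_delta(prediction, observed, control)
```

If a foundation model cannot beat this score on held-out perturbations, its added complexity is not yet paying off for that task.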
Q4: How can I quantitatively evaluate if my chosen HVG strategy has improved my scFM's biological relevance?
Beyond standard performance metrics, you can employ novel, biology-driven evaluation metrics. For example, recent benchmarks have proposed:
The following tables summarize key findings from recent benchmark studies, comparing scFMs against traditional methods and simple baselines across various tasks.
Table 1: Performance Overview of scFMs vs. Baselines on Cell-Level Tasks
| Model Category | Example Models | Strengths | Limitations / Findings |
|---|---|---|---|
| Single-cell Foundation Models (scFMs) | Geneformer, scGPT, scFoundation [4] | Robust and versatile across diverse applications; effective at batch integration and cell type annotation [4]. | No single scFM consistently outperforms all others; performance is task- and dataset-dependent [4]. |
| Traditional Methods | Seurat, Harmony, scVI [4] | Established, efficient, and perform well with smaller datasets [4]. | May be outperformed by scFMs on complex integration tasks or when leveraging pretrained knowledge [4]. |
| Simple Baseline Models | "No change" predictor, Additive model, Linear Regression [26] | Highly efficient and can surprisingly outperform scFMs on specific tasks like perturbation prediction [26] [80]. | Incapable of representing complex biological interactions; their strong performance highlights scFM limitations [26]. |
Table 2: scFM Performance on Perturbation Prediction Benchmarks (Pearson Delta Correlation)
| Model | Adamson et al. Dataset | Norman et al. Dataset | Replogle (K562) Dataset | Replogle (RPE1) Dataset |
|---|---|---|---|---|
| Train Mean (Simple Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| Random Forest (GO Features) | 0.739 | 0.586 | 0.480 | 0.648 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |
Data adapted from a benchmark study that evaluated models on predicting differential expression after genetic perturbations [80].
Below is a detailed methodology for conducting a comparative benchmark of scFMs, incorporating different HVG selection strategies.
Protocol: A Biology-Oriented Benchmarking Pipeline for scFMs
1. Data Preparation and Curation
2. Application of HVG Selection Strategies For each dataset, apply several HVG selection methods to create different gene subsets for downstream model training and evaluation.
3. Model Training and Feature Extraction
4. Performance Evaluation and Biological Validation
The diagram below visualizes the key decision points and workflow of this benchmarking protocol.
Table 3: Key Computational Tools and Resources for scFM Benchmarking
| Item Name | Function / Application | Reference / Source |
|---|---|---|
| Scanpy / Seurat | Standardized scRNA-seq analysis workflows for QC, normalization, HVG selection, and clustering. | [81] |
| scGPT / Geneformer | Representative single-cell foundation models that can be fine-tuned or used for zero-shot embedding extraction. | [4] [26] |
| CELLxGENE / Cell Atlas | Curated data portals providing access to millions of standardized single-cell datasets for training and benchmarking. | [4] [82] |
| GLP Algorithm | A robust HVG selection method using optimized LOESS regression on the relationship between positive ratio and mean expression. | [3] |
| Gene Ontology (GO) | A knowledge base providing structured biological knowledge that can be used as features in baseline models or for validation. | [26] [80] |
Biological validation is crucial to determine if your scFM has learned meaningful biological principles rather than just technical artifacts or dataset-specific noise. For models trained on highly variable genes (HVGs), this ensures that the selected features capture genuine biological variation rather than amplifying technical noise. Key performance metrics assessed during validation are detailed in the table below.
Table 1: Key Metrics for scFM Biological Validation
| Metric Category | Specific Metric | What It Measures | Interpretation |
|---|---|---|---|
| Cell-level Task Performance | Cell Type Annotation Accuracy | Model's ability to correctly assign cell identity labels [4] [5] [83] | High accuracy confirms the model captures defining transcriptional states. |
| Cell-level Task Performance | Batch Integration Quality | Ability to remove technical artifacts while preserving biological variation [4] [5] | Good integration enables analysis across diverse datasets. |
| Gene-level Task Performance | Expression Forecasting Accuracy | Prediction of gene expression changes after perturbation [40] | Tests the model's understanding of causal regulatory relationships. |
| Knowledge-based Validation | scGraph-OntoRWR | Consistency of model-derived cell relationships with established biological knowledge (e.g., cell ontology) [4] | Measures if the model recapitulates known biology. |
| Knowledge-based Validation | Lowest Common Ancestor Distance (LCAD) | Ontological proximity of misclassified cell types [4] | A smaller distance indicates a semantically reasonable error. |
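The LCAD idea can be illustrated on a toy ontology: error severity is the number of edges between the predicted and true labels through their lowest common ancestor. The mini tree and helper functions below are our own construction for illustration, not the actual Cell Ontology or the benchmark's implementation.

```python
# Hypothetical mini cell ontology as a child -> parent map
parent = {
    "CD4+ T cell": "T cell", "CD8+ T cell": "T cell",
    "T cell": "lymphocyte", "B cell": "lymphocyte",
    "lymphocyte": "cell", "neuron": "cell",
}

def ancestors(label):
    """Path from a label up to the ontology root, inclusive."""
    path = [label]
    while label in parent:
        label = parent[label]
        path.append(label)
    return path

def lcad(pred, true):
    """Edge distance between two labels through their lowest common ancestor."""
    pa, ta = ancestors(pred), ancestors(true)
    lca = next(a for a in pa if a in ta)
    return pa.index(lca) + ta.index(lca)

# Confusing CD4+ with CD8+ T cells is a mild error; T cell vs. neuron is severe.
mild = lcad("CD4+ T cell", "CD8+ T cell")   # 2 edges via "T cell"
severe = lcad("T cell", "neuron")           # 3 edges via the root "cell"
```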
Protocol 1: Validating scFMs on Cell-level Tasks Using Benchmarking Platforms
Purpose: To objectively evaluate an scFM's performance on standardized, biologically relevant tasks like cell type annotation and batch integration [4]. Methodology:
Protocol 2: Biological Knowledge Alignment with scGraph-OntoRWR
Purpose: To validate that the relationships between cells learned by the scFM are consistent with prior biological knowledge [4]. Methodology:
Protocol 3: Gene Regulatory Insight Validation via Expression Forecasting
Purpose: To test the model's capacity to predict the downstream effects of genetic perturbations, a key sign of understanding regulatory networks [40]. Methodology:
Recent comprehensive benchmarks reveal the strengths and limitations of current scFMs. The table below summarizes the performance of leading models across critical biological and clinical tasks.
Table 2: Performance of scFMs on Key Validation Tasks (Adapted from [4])
| Model Name | Cell Type Annotation | Batch Integration | Drug Sensitivity Prediction | Key Biological Strength |
|---|---|---|---|---|
| Geneformer | Good | Good | Variable | Captures dynamic gene interactions during cell state transitions [84] [5]. |
| scGPT | Good | Good | Variable | Versatile across multiple omics modalities [4] [5]. |
| scFoundation | Good | Good | Good | Robust performance on large-scale clinical tasks [4]. |
| UCE | Good | Good | Variable | Incorporates protein sequence information via protein language models [4]. |
| LangCell | Good | Good | Variable | Integrates text descriptions with gene expression data [4]. |
| scCello | Good | Good | Variable | Infers cell-specific gene regulatory networks [4]. |
Key Benchmarking Insight: No single scFM consistently outperforms all others across every task and dataset. Model selection should be guided by the specific biological question and data characteristics [4].
This table lists essential computational tools and data resources for the biological validation of scFMs.
Table 3: Essential Reagents and Resources for scFM Validation
| Item Name | Function / Purpose | Relevance to scFM Validation |
|---|---|---|
| CZ CELLxGENE [4] [5] | A unified platform providing access to over 100 million curated single-cell datasets. | Serves as a primary source of high-quality, annotated data for benchmarking and testing model generalizability. |
| PEREGGRN & GGRN [40] | A benchmarking platform and software for evaluating expression forecasting methods. | Provides a standardized environment to test your scFM's ability to predict genetic perturbation outcomes. |
| Cell Ontology [4] | A controlled, structured vocabulary for cell types. | Used as the ground-truth knowledge base for metrics like scGraph-OntoRWR and LCAD. |
| SCAVENGE [85] | An algorithm that uses network propagation to map causal genetic variants to relevant cellular contexts at single-cell resolution. | Can be used to generate trait-relevant cellular hypotheses for validating a model's functional insights. |
| Weighted Gene Correlation Network Analysis (WGCNA) [86] | A method to identify clusters (modules) of highly correlated genes. | Useful for validating if the model's latent space preserves known co-expression modules and biological processes. |
The following diagram illustrates the logical flow and key decision points in a comprehensive biological validation pipeline for a single-cell foundation model.
Q1: My single-cell foundation model (scFM) underperforms in cell type annotation on a new dataset. Could the initial selection of Highly Variable Genes (HVGs) be the cause?
Yes, this is a common issue. The HVGs selected for your scFM's pretraining define the feature space the model learns from. If the biological variation in your new query dataset is driven by genes not included in the original HVG set, the model will lack the necessary information for accurate annotation [4] [5]. This is particularly problematic when mapping data from different tissues, species, or disease states not well-represented in the pretraining corpus.
Q2: After integrating multiple datasets using our scFM, we observe strong batch effect removal but a loss of subtle biological signal. How can we improve the balance?
This indicates that the integration process may be over-correcting. The goal of integration is to align shared cell states across batches while preserving unique biological conditions. To troubleshoot:
Q3: When mapping a query dataset to a reference atlas, the model fails to identify a known rare cell population. What steps should we take?
The failure to identify a rare cell type often stems from two issues related to feature selection:
This guide addresses low annotation accuracy after transferring labels from a reference to a query dataset.
| Step | Action & Purpose | Key Parameters & Tools to Check |
|---|---|---|
| 1 | Check Feature Overlap: Confirm the genes used by the reference model are present and reliably measured in your query data. A small overlap will lead to poor performance. | Tool: Seurat's FindTransferAnchors [88]. Parameter: dims (should use the same dimensions as the reference). |
| 2 | Validate HVG Selection: Compare the HVGs from your query dataset to those used in the reference model. If the biological context is different, you may need to recompute HVGs specific to your query before mapping. | Method: FindVariableFeatures (Seurat) [89] or pp.highly_variable_genes (Scanpy) [90]. |
| 3 | Assess Prediction Scores: Examine the prediction scores from the label transfer. Low scores for a particular cell type indicate uncertain annotations, which may require manual curation or a different reference. | Tool: Seurat's TransferData [88]. Output: prediction.score.max column in metadata. |
| 4 | Use Biological Metrics: Evaluate errors using biology-informed metrics like LCAD. A misannotation between closely related cell types (e.g., CD4+ T cell subsets) is less severe than between different lineages (e.g., T cell vs. neuron) [4]. | Metric: Lowest Common Ancestor Distance (LCAD). |
This guide helps when multiple datasets fail to align properly, or when integration removes biological variation.
| Step | Action & Purpose | Key Parameters & Tools to Check |
|---|---|---|
| 1 | Preprocess Independently: Normalize and identify HVGs on each dataset individually before integration. This ensures that technical differences between batches do not confound the selection of biologically relevant features. | Method: Standard pre-processing workflow (NormalizeData > FindVariableFeatures) applied per dataset [89] [91]. |
| 2 | Select an Appropriate Integration Method: Choose a method based on your data size and goal. For large datasets (>1M cells), consider scalable methods like Harmony [90] or scArches [87]. | Tools: IntegrateLayers (Seurat) [91], harmony_integrate (Scanpy) [92], scArches [87]. |
| 3 | Evaluate Integration Quality: Use a combination of metrics to ensure both batch removal and biological conservation. Don't rely on a single metric. | Metrics: Batch Mixing: PCA regression, Entropy of Batch Mixing. Biology Conservation: ARI, NMI, cell-type ASW [87]. |
| 4 | Iterate and Refine: If biological signal is lost, adjust the integration strength or the number of HVGs used. Fine-tuning these parameters is often necessary for optimal results. | Parameter: vars.to.regress in ScaleData (Seurat) for known confounders like mitochondrial percentage [89]. |
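The "Entropy of Batch Mixing" metric referenced in Step 3 can be sketched in a few lines of standard-library Python: for each cell, compute the Shannon entropy of batch labels among its neighbors and normalize to [0, 1]. This is a simplified form for intuition, not a specific package's definition.

```python
import math
from collections import Counter

def batch_mixing_entropy(neighbor_batches):
    """Normalized Shannon entropy of batch labels in one cell's neighborhood.
    1.0 = batches perfectly mixed locally; 0.0 = neighborhood from one batch.
    Averaging over all cells gives a dataset-level mixing score (sketch)."""
    counts = Counter(neighbor_batches)
    n = len(neighbor_batches)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    max_h = math.log(len(counts)) if len(counts) > 1 else 1.0
    return h / max_h

well_mixed = batch_mixing_entropy(["A", "B", "A", "B"])
unmixed = batch_mixing_entropy(["A", "A", "A", "A"])
```

High mixing entropy alone is not sufficient: pair it with a biology-conservation metric (ARI, NMI, cell-type ASW) as the table above recommends, since over-correction also drives mixing entropy up.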
The table below summarizes quantitative benchmarks from a 2025 study comparing single-cell foundation models (scFMs) against established baseline methods across key downstream tasks. Performance is a composite score based on multiple metrics, with higher scores being better. No single method outperforms all others in every task, highlighting the need for task-specific selection [4].
Table 1: Benchmarking Scores for Downstream Tasks (General Performance)
| Method | Category | Cell Type Annotation | Data Integration | Query Mapping | Key Strengths |
|---|---|---|---|---|---|
| Seurat (CCA) | Baseline (Anchor-based) | 0.89 | 0.85 | 0.91 | High accuracy in cross-species mapping, well-established [88] [91] |
| Harmony | Baseline (Clustering-based) | 0.85 | 0.88 | 0.82 | Fast, efficient for large datasets, good batch mixing [92] [90] |
| scVI | Baseline (Generative) | 0.87 | 0.90 | 0.84 | Robust probabilistic model, handles complex batch effects [4] [87] |
| scArches | Transfer Learning | 0.91 | 0.92 | 0.95 | Excellent for iterative mapping, preserves unseen cell types [87] |
| scGPT | Foundation Model | 0.90 | 0.87 | 0.89 | Versatile, good zero-shot performance, multimodal potential [4] [5] |
| Geneformer | Foundation Model | 0.88 | 0.83 | 0.86 | Strong on gene-level tasks, good biological interpretability [4] |
Table 2: Performance on Specific Annotation Challenges
This table shows how methods handle specific annotation difficulties, using metrics like scGraph-OntoRWR (measures consistency with known biology) and LCAD (measures severity of misclassification) [4].
| Method | scGraph-OntoRWR (Higher is Better) | LCAD for Rare Cell Types (Lower is Better) | Notes |
|---|---|---|---|
| Seurat | 0.82 | 4.1 | Reliable; errors are often biologically plausible [88] [4] |
| Harmony | 0.79 | 4.5 | [4] |
| scArches | 0.85 | 3.8 | Excels at placing novel cell types correctly [87] |
| scGPT | 0.88 | 3.5 | Captures rich biological relationships from pretraining [4] [5] |
This is a detailed, step-by-step protocol for mapping a query dataset to an integrated reference, a common task for annotating new data [88].
Diagram: Workflow for Reference-Based Query Mapping
Procedure:
1. Integrate the reference: `reference <- IntegrateLayers(object = pancreas.ref, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca")`
2. Compute the reference UMAP, saving the model: `reference <- RunUMAP(reference, dims = 1:30, reduction = "integrated.cca", return.model = TRUE)` (critical to save the UMAP model [88]).
3. Normalize the query: `query <- NormalizeData(query)`
4. Find transfer anchors: `anchors <- FindTransferAnchors(reference = reference, query = query, dims = 1:30, reference.reduction = "pca")`
5. Transfer labels and attach them: `predictions <- TransferData(anchorset = anchors, refdata = reference$celltype, dims = 1:30)` followed by `query <- AddMetaData(query, metadata = predictions)`
6. Project the query onto the reference UMAP: `query <- MapQuery(anchorset = anchors, reference = reference, query = query, refdata = list(celltype = "celltype"), reference.reduction = "pca", reduction.model = "umap")` [88].
Procedure:
1. Filter cells and genes: `sc.pp.filter_cells(adata, min_counts=100)`, `sc.pp.filter_genes(adata, min_cells=5)`
2. Apply QC thresholds: `adata = adata[(adata.obs.pct_counts_mt < 25) & (adata.obs.n_genes_by_counts < 5000) & (adata.obs.total_counts < 25000), :]`
3. Normalize and log-transform: `sc.pp.normalize_total(adata, target_sum=1e4)`, `sc.pp.log1p(adata)`
4. Select HVGs: `sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.25)`
5. Reduce, embed, and cluster: `sc.tl.pca(adata, svd_solver='arpack')`, `sc.pp.neighbors(adata, n_neighbors=10, n_pcs=50)`, `sc.tl.umap(adata)`, `sc.tl.leiden(adata, resolution=0.5)`

Table 3: Essential Computational Tools for scFM Downstream Analysis
| Tool / Resource | Function | Relevance to HVGs & scFM Training |
|---|---|---|
| Seurat [88] [89] [91] | A comprehensive R toolkit for single-cell genomics. | Provides robust functions for HVG selection (FindVariableFeatures) and serves as a primary platform for benchmarking anchor-based integration and mapping methods against scFMs. |
| Scanpy [92] [90] | A scalable Python-based single-cell analysis suite. | Enables preprocessing and analysis of very large-scale datasets (millions of cells), with an external API that integrates methods like Harmony, facilitating direct comparison with scFMs. |
| Harmony [92] [90] | Fast, robust integration algorithm. | A top-performing baseline method for data integration. Its performance is a key benchmark for evaluating whether a new scFM provides a significant advantage over established, simpler tools [4]. |
| scArches [87] | Transfer learning method for single-cell data. | Represents a hybrid approach, using deep learning not for foundation training but for efficient, decentralized reference mapping. It is crucial for testing scFM performance in iterative query mapping tasks. |
| CellxGene / CZ CELLxGENE [4] [5] | Curated repository of single-cell datasets. | The primary source for high-quality, annotated data used for both pretraining scFMs and for creating standardized benchmarks to evaluate their performance on downstream tasks like annotation and integration. |
Within the broader thesis on selecting highly variable genes (HVGs) for single-cell foundation model (scFM) training, robust validation is paramount. Traditional metrics, while useful, often fail to capture the biological plausibility of the identified variable genes and cell states. This guide introduces advanced validation approaches that leverage curated biological knowledge from cell ontologies and established pathways to assess whether computational results reflect true biology, ensuring that your scFM training is built on a solid foundation.
Problem: Your analysis identifies a set of highly variable genes, but these genes do not align with known cell-type markers or biological pathways, making the results difficult to interpret.
Solution:
Problem: Your single-cell foundation model, trained on a specific set of tissues, performs poorly when tasked with representing or reconstructing data from a previously unseen cell type [93].
Solution:
Problem: After integrating multiple datasets to train your scFM, you suspect that batch correction has been too aggressive, removing genuine biological variation along with technical noise.
Solution:
- Batch mixing: use iLISI (Integration Local Inverse Simpson's Index) or Batch PCR (Batch Principal Component Regression) to confirm that batches are well mixed [6].
- Biological conservation: use cLISI (Cell-type LISI), isolated label F1, or graph connectivity to ensure distinct cell types remain separable [6].

FAQ 1: Why should I use cell ontology-informed metrics instead of standard clustering metrics like silhouette score?
Standard clustering metrics evaluate compactness and separation but are agnostic to biology. You could have a statistically perfect cluster that groups biologically unrelated cells. Cell ontology-informed metrics, such as ontology enrichment scores or semantic similarity between cluster marker genes and known cell types, directly quantify the biological coherence of your results, ensuring they are not just statistically sound but also biologically meaningful.
FAQ 2: My data is from a rare disease with no established reference atlas. How can I perform knowledge-based validation?
In the absence of a perfect reference, you can still use knowledge-based approaches.
FAQ 3: How does the choice of error model (e.g., Poisson vs. Negative Binomial) in preprocessing affect my downstream HVG selection and validation?
The choice of error model is critical for accurate variance estimation [94].
- Use a regularized negative binomial model (e.g., sctransform [95]) for robust normalization and variance stabilization before HVG selection.

FAQ 4: What is the minimum recommended number of HVGs for building a robust scFM?
There is no universal minimum, as it depends on biological complexity. However, benchmarks for data integration—a task related to scFM training—suggest that using around 2,000 highly variable features is an effective common practice that often leads to high-quality results [6]. The key is to use this as a starting point and validate that the selected number of genes captures the necessary biological variation without introducing excessive noise.
Objective: To quantitatively assess if a list of highly variable genes (HVGs) is significantly enriched for markers of specific cell types as defined by the Cell Ontology.
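The enrichment step in such a protocol is typically a one-sided hypergeometric test: how surprising is the observed overlap between the HVG list and a cell type's known markers? Below is a standard-library sketch; tools like gProfiler2 additionally apply multiple-testing correction, and the function name is our own.

```python
from math import comb

def hypergeom_enrichment_p(n_genes, n_markers, n_hvgs, n_overlap):
    """P(X >= n_overlap) when drawing n_hvgs genes without replacement from
    n_genes total, of which n_markers are known markers for the cell type.
    One-sided hypergeometric tail; no multiple-testing correction."""
    total = comb(n_genes, n_hvgs)
    tail = sum(comb(n_markers, k) * comb(n_genes - n_markers, n_hvgs - k)
               for k in range(n_overlap, min(n_markers, n_hvgs) + 1))
    return tail / total

# Example: 20 of 50 selected HVGs are markers, out of 100 markers
# among 20,000 genes -- far beyond the ~0.25 expected by chance.
p_value = hypergeom_enrichment_p(20000, 100, 50, 20)
```

A very small p-value for a given cell type's markers is evidence that the HVG list captures that population's biology rather than technical noise.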
Objective: To validate that a computationally inferred pseudotemporal trajectory aligns with known biological stages of development.
The following table details key computational tools and resources essential for implementing the novel validation approaches described in this guide.
| Item Name | Type | Function in Validation |
|---|---|---|
| Cell Ontology (CL) | Database | Provides a structured, controlled vocabulary for cell types, used as a source of known marker genes for enrichment tests [96]. |
| scran | Software Package | A highly variable gene selection method that demonstrated strong all-round performance in benchmarking studies, suitable for generating a robust HVG list for initial validation [97]. |
| scIB | Benchmarking Pipeline / Metrics | Provides a suite of metrics (e.g., iLISI, cLISI, graph connectivity) for evaluating data integration, useful for assessing biological conservation after batch correction [6]. |
| sctransform | Software Package | A normalization method using regularized negative binomial regression that effectively removes technical confounders like sequencing depth, providing a reliable foundation for HVG selection [95] [94]. |
| Single-cell Variational Inference (scVI) | Software Package / Model | A deep generative model for scRNA-seq data that can be used for integration and representation learning; performance is impacted by the feature selection method used [6]. |
This technical support center provides troubleshooting guides and FAQs to help researchers address specific issues encountered when applying single-cell foundation models (scFMs) in clinical and biomedical research.
A: Yes, when mapping is performed correctly. A key study used a deep learning strategy called scArches (single-cell architectural surgery) to map query datasets from patients with COVID-19 onto a healthy reference atlas. The method successfully preserved the disease-specific variation, allowing for the discovery of cell states unique to COVID-19 without the need to retrain the entire model from scratch [87]. This demonstrates that scFMs can be contextually extended to pathological conditions.
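For intuition only, here is a drastically simplified query-to-reference mapping: fit a PCA projection and per-type centroids on the reference atlas, then project query cells into the same space and label them by nearest centroid. This is not how scArches works (scArches fine-tunes conditional weights of a trained model rather than using centroids), and all names and data here are synthetic:

```python
import numpy as np

def fit_reference(ref_X, ref_labels, n_pcs=2):
    """Learn a PCA projection and per-label centroids from a reference atlas."""
    mean = ref_X.mean(axis=0)
    _, _, vt = np.linalg.svd(ref_X - mean, full_matrices=False)
    comps = vt[:n_pcs]
    z = (ref_X - mean) @ comps.T
    centroids = {lab: z[ref_labels == lab].mean(axis=0)
                 for lab in np.unique(ref_labels)}
    return mean, comps, centroids

def map_query(query_X, mean, comps, centroids):
    """Project query cells with the reference's PCA; label by nearest centroid."""
    z = (query_X - mean) @ comps.T
    labs = list(centroids)
    dists = np.stack([np.linalg.norm(z - centroids[l], axis=1) for l in labs])
    return [labs[j] for j in dists.argmin(axis=0)]

# Synthetic reference with two well-separated cell types.
rng = np.random.default_rng(0)
ref_X = np.vstack([rng.normal(0.0, 0.3, (50, 5)),
                   rng.normal(3.0, 0.3, (50, 5))])
ref_labels = np.array(["typeA"] * 50 + ["typeB"] * 50)
mean, comps, centroids = fit_reference(ref_X, ref_labels)

query_X = rng.normal(3.0, 0.3, (10, 5))  # query cells resembling typeB
query_labels = map_query(query_X, mean, comps, centroids)
```

The key property both the toy and the real method share is that the reference model's coordinate system is frozen; only the query is adapted into it, which is what allows disease-specific variation to surface as cells that sit away from all reference states.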
A: This is a common challenge. A comprehensive 2025 benchmark study indicates that no single scFM consistently outperforms all others across every task [4]. A performance drop on a new cancer type could be due to several factors, such as a mismatch between the model's pre-training data and your new samples.
A: The choice depends on your specific constraints and goals, such as the downstream task, dataset size, and available compute [4].
Symptom: When integrating your new clinical dataset with a public reference atlas, batch effects are not adequately removed, or fine biological variations (e.g., subtle disease states) are being erased.
Investigation & Resolution:
Diagram: Workflow for Mapping Query Data to a Reference Atlas
Symptom: Your model fails to identify or has very low accuracy in classifying rare cell populations in a heterogeneous sample (e.g., circulating tumor cells).
Investigation & Resolution:
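A quick way to see why variance-based feature selection can be part of the problem: a gene that strongly marks only 2% of cells can still rank near the bottom by global variance, so it never enters the model's feature set. The simulation below (all values synthetic) demonstrates this:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 500
X = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)

# A hypothetical marker of a rare population: silent everywhere except
# in 2% of cells, where it is strongly expressed.
rare_gene = n_genes - 1
X[:, rare_gene] = 0.0
rare_cells = rng.choice(n_cells, size=20, replace=False)
X[rare_cells, rare_gene] = rng.poisson(8.0, size=20)

# Rank genes by global variance (descending) and locate the rare marker.
variance_rank = np.argsort(X.var(axis=0))[::-1]
rank_of_rare_marker = int(np.where(variance_rank == rare_gene)[0][0])
```

Mitigations worth investigating include appending curated markers of the expected rare population to the feature set regardless of variance rank, performing feature selection within clusters rather than globally, or rebalancing the training data before fine-tuning.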
A 2025 benchmark study evaluated six scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, scCello) against established baselines (e.g., Seurat, Harmony, scVI) on realistic clinical tasks [4].
Objective: To provide a holistic performance ranking and guide model selection for biomedical applications.
Methodology Summary:
Key Quantitative Results:
Table 1: Overall Model Ranking Across Diverse Tasks (Based on Non-Dominated Sorting) [4]
| Model | Overall Ranking | Notable Strengths |
|---|---|---|
| scGPT | Top Tier | Versatile across tasks, handles multimodal data [4] |
| Geneformer | Top Tier | Robust performance on gene-level tasks [4] |
| scFoundation | Competitive | Strong on large-scale data integration [4] |
| UCE | Competitive | Leverages protein sequence information [4] |
| LangCell | Competitive | Incorporates text-cell pairs [4] |
| scCello | Competitive | Specialized for cell state transitions [4] |
| Baseline (scVI) | Contextual | Can be more efficient for specific, small-scale tasks [4] |
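Non-dominated sorting, the basis of the overall ranking above, can be sketched in a few lines: a model lands in the top front if no other model beats it on every task. The models and scores below are hypothetical, not the benchmark's actual results:

```python
def non_dominated_fronts(scores):
    """Sort models into Pareto fronts; higher scores are better on every task.

    scores: dict mapping model name -> tuple of per-task scores. A model is
    dominated if another is >= on all tasks and > on at least one. Front 0
    holds the models nobody dominates.
    """
    def dominates(a, b):
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    remaining = dict(scores)
    fronts = []
    while remaining:
        front = [m for m, s in remaining.items()
                 if not any(dominates(t, s)
                            for n, t in remaining.items() if n != m)]
        fronts.append(sorted(front))
        for m in front:
            del remaining[m]
    return fronts

# Hypothetical scores on (annotation, integration, drug-response) tasks.
scores = {
    "modelA": (0.9, 0.8, 0.7),
    "modelB": (0.8, 0.9, 0.7),  # trades off with modelA -> same front
    "modelC": (0.7, 0.7, 0.6),  # dominated by both A and B
}
fronts = non_dominated_fronts(scores)
```

This is why "Top Tier" in Table 1 can contain several models at once: members of the same front are incomparable trade-offs rather than strictly ordered.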
Table 2: Model Performance on Specific Clinical Tasks (Generalized Findings) [4]
| Task | Key Finding | Recommendation |
|---|---|---|
| Cancer Cell Identification | Performance varies significantly by cancer type. | Use task-specific rankings; no single model is universally best. |
| Drug Sensitivity Prediction | scFMs provide robust embeddings for prediction models. | scFMs act as effective plug-and-play feature extractors for this task. |
| Cell Type Annotation | scFMs capture biological knowledge, leading to more semantically meaningful errors (e.g., misclassifying closely related types). | Use LCAD metric to assess if misclassifications are biologically plausible. |
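The notion of a "semantically meaningful error" can be quantified as distance in a cell-type hierarchy: misclassifying a CD4 T cell as a CD8 T cell is a shorter hop than calling it a monocyte. The toy tree below illustrates the intuition behind ontology-aware metrics such as LCAD; it is not the benchmark's actual ontology or formula:

```python
# Toy cell-type hierarchy as a child -> parent map (hypothetical labels).
PARENT = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "immune cell",
    "monocyte": "immune cell",
}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def tree_distance(a, b):
    """Number of edges from a to b through their lowest common ancestor."""
    pa, pb = ancestors(a), ancestors(b)
    for i, node in enumerate(pa):
        if node in pb:
            return i + pb.index(node)
    raise ValueError("no common ancestor")
```

Averaging such distances over a model's misclassifications gives a single number: lower means the errors stay within biologically related lineages.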
This protocol is useful for improving cell type identification, especially in complex disease datasets [99].
Data Pre-processing:
Feature Extraction: Generate four distinct feature matrices from the pre-processed data.
Feature Fusion: Integrate the four feature matrices using one of six fusion strategies: weighted sum, Hadamard product, attention mechanism, mixture-of-experts, residual fusion, or Transformer-based fusion.
Classification: Feed the final fused representation into a classifier (e.g., SVM, LightGBM) for cell type identification.
Diagram: Logical Workflow of the scMFF Framework
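Two of the simpler fusion strategies from the Feature Fusion step can be sketched directly; the learned strategies (attention, mixture-of-experts, residual, Transformer-based) require trainable parameters and are omitted. Shapes and values below are illustrative, not the scMFF implementation:

```python
import numpy as np

def fuse(features, strategy="weighted_sum", weights=None):
    """Combine per-cell feature matrices of equal shape (n_cells, d).

    Toy versions of two fusion strategies: a (normalized) weighted sum
    and an element-wise (Hadamard) product across feature views.
    """
    stacked = np.stack(features)  # (n_views, n_cells, d)
    if strategy == "weighted_sum":
        w = np.ones(len(features)) if weights is None else np.asarray(weights, float)
        w = w / w.sum()
        return np.tensordot(w, stacked, axes=1)
    if strategy == "hadamard":
        out = stacked[0].copy()
        for f in stacked[1:]:
            out *= f
        return out
    raise ValueError(f"unknown strategy: {strategy}")

# Two toy feature views for 3 cells x 4 dimensions.
views = [np.full((3, 4), 2.0), np.full((3, 4), 4.0)]
```

The weighted sum preserves scale and is robust when views are correlated; the Hadamard product emphasizes dimensions on which the views agree, at the cost of sensitivity to zeros.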
Table 3: Key Research Reagent Solutions for scFM Training and Application
| Item / Resource | Function / Description | Relevance to scFM Research |
|---|---|---|
| CZ CELLxGENE [4] [5] | A unified platform providing access to over 100 million curated and standardized single-cell datasets. | Primary data source for pre-training scFMs and for finding reference atlases for mapping. |
| Highly Variable Genes (HVGs) [99] [98] | A statistical feature set capturing genes with the highest expression variance across cells. | A foundational feature type for model input, crucial for initial dimensionality reduction and capturing cell-to-cell differences. |
| scArches (Algorithm) [87] | A transfer learning method for mapping new query datasets to existing reference atlases without sharing raw data. | Enables efficient, decentralized, and iterative updating of reference models, critical for clinical collaboration. |
| scGraph-OntoRWR (Metric) [4] | A novel evaluation metric that measures the consistency of cell type relationships captured by an scFM with prior biological knowledge from ontologies. | Moves beyond pure accuracy, assessing the biological relevance of the model's latent embeddings. |
| Roughness Index (ROGI) [4] | A metric that quantifies the "smoothness" of the cell-property landscape in a model's latent space. | Serves as a proxy for model selection; a smoother landscape often indicates easier training for downstream tasks. |
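The smoothness idea behind a roughness index can be illustrated with a toy metric: the mean absolute property difference across nearest-neighbor pairs in the latent space. This is not the published ROGI formula, only the underlying intuition:

```python
import numpy as np

def roughness(embedding, prop, k=5):
    """Mean absolute property difference over k-nearest-neighbor pairs.

    Small values mean the property varies smoothly across the latent
    space, which tends to make downstream prediction tasks easier.
    """
    d = np.linalg.norm(embedding[:, None, :] - embedding[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    diffs = []
    for i in range(len(embedding)):
        for j in np.argsort(d[i])[:k]:
            diffs.append(abs(prop[i] - prop[j]))
    return float(np.mean(diffs))

# Same property values, smooth vs. scrambled over a 2D latent space.
rng = np.random.default_rng(0)
z = rng.uniform(0, 1, size=(200, 2))
smooth_prop = z[:, 0]                      # varies smoothly with position
rough_prop = rng.permutation(smooth_prop)  # identical values, shuffled
```

Comparing candidate scFM embeddings by such a score before committing to fine-tuning is the model-selection use case the table describes.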
Effective selection of highly variable genes is not merely a preprocessing step but a fundamental determinant of single-cell foundation model success. By integrating robust HVG selection methods that account for batch effects, platform-specific biases, and hierarchical biological relationships, researchers can significantly enhance scFM performance across integration, classification, and knowledge extraction tasks. Future directions should focus on developing more biologically-informed selection criteria, creating standardized benchmarking frameworks, and advancing methods that seamlessly integrate HVG selection with foundation model training pipelines. As scFMs continue to transform biomedical research, optimized HVG selection will be crucial for unlocking deeper insights into cellular function, disease mechanisms, and therapeutic development, ultimately bridging the gap between computational innovation and clinical application.