Optimizing Highly Variable Gene Selection for Single-Cell Foundation Model Training: A Comprehensive Guide

Chloe Mitchell · Nov 27, 2025

Abstract

Selecting highly variable genes (HVGs) is a critical preprocessing step that profoundly impacts the performance and biological relevance of single-cell foundation models (scFMs). This article provides a comprehensive guide for researchers and drug development professionals, covering foundational concepts, methodological implementation, optimization strategies, and validation approaches for HVG selection in scFM training. Drawing on recent benchmarks and emerging methodologies, we explore how informed HVG selection enhances data integration, improves cell type annotation, and boosts model robustness for downstream clinical and biomedical applications.

The Critical Role of Highly Variable Genes in Single-Cell Foundation Models

Defining Highly Variable Genes and Their Importance in scFM Training

Frequently Asked Questions

What are Highly Variable Genes (HVGs) and why are they important for single-cell analysis?

Highly Variable Genes (HVGs) are genes whose expression levels show significant variation across individual cells within a seemingly homogeneous cell population. Unlike bulk RNA sequencing, which analyzes averaged expression from mixed cells, single-cell RNA sequencing (scRNA-seq) can detect these cell-to-cell differences. HVGs are crucial because they are presumed to contribute strongly to cellular heterogeneity and often reflect underlying biological processes, cellular states, and key transcriptional drivers of cell identity and function. Selecting HVGs is a critical feature selection step that reduces data dimensionality, enhances computational efficiency, and improves the interpretability of downstream analyses like clustering and trajectory inference [1] [2] [3].

Why is HVG selection critical for training single-cell Foundation Models (scFMs)?

HVG selection is a fundamental preprocessing step for scFM training because it directly addresses the high dimensionality, sparsity, and noise characteristic of scRNA-seq data. By focusing on the most informative features, HVG selection:

  • Reduces Computational Burden: Training on a subset of genes (e.g., 1,000-5,000 HVGs) instead of the entire genome (>20,000 genes) drastically lowers memory and computational requirements [4] [5].
  • Improves Model Performance: It filters out genes that contribute mostly technical noise or uninteresting biological variation, allowing the model to learn more robust and biologically meaningful representations of cells and genes [6] [3].
  • Enhances Biological Insight: scFMs trained on HVGs are better at capturing the fundamental principles of cellular identity and state, which improves their performance on downstream tasks like cell type annotation, batch integration, and perturbation prediction [4] [5].

My scFM isn't performing well on downstream tasks. Could my HVG selection be the issue?

Yes, the choice of HVG method and the number of genes selected can significantly impact scFM performance. If your model is struggling, consider these troubleshooting steps:

  • Evaluate the Number of Features: Benchmarking studies show that the number of selected features is strongly correlated with the performance of many downstream tasks. While using too few genes can lose biological signal, an excessively large gene set may introduce more noise. It is recommended to test a range of gene set sizes (e.g., from 500 to 5,000) to find the optimum for your specific data and task [6].
  • Try a Different HVG Method: Different HVG methods use distinct statistical models to quantify variation, which can lead to varying gene rankings. If one method (e.g., scran) underperforms, try another (e.g., Seurat's VST or the novel GLP method) [1] [6] [3].
  • Check for Batch Effects: If your training data combines multiple datasets, consider using batch-aware HVG selection methods. This ensures that the selected genes are variable within biological conditions rather than being driven by technical batch effects [6].
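As a concrete illustration of batch-aware selection, the sketch below tallies which genes rank as highly variable within each batch separately and keeps those that recur across batches, in the spirit of scanpy's `highly_variable_genes(..., batch_key=...)`. The dispersion statistic and the `n_top`/`min_batches` cutoffs are illustrative assumptions, not the scanpy or Seurat implementation.

```python
from statistics import mean, pvariance

def batch_aware_hvgs(batches, gene_names, n_top=1, min_batches=2):
    """Tally, per batch, which genes rank as highly variable WITHIN that
    batch, then keep genes selected in at least `min_batches` batches,
    so between-batch (technical) shifts do not dominate the selection.

    batches: list of matrices; each matrix is a list of per-cell
    expression vectors from one batch (all over the same genes).
    """
    tallies = {g: 0 for g in gene_names}
    for matrix in batches:
        # Dispersion (variance / mean) per gene, computed within this batch only.
        disp = []
        for gi, gene in enumerate(gene_names):
            col = [cell[gi] for cell in matrix]
            m = mean(col)
            disp.append((pvariance(col) / m if m > 0 else 0.0, gene))
        for _, gene in sorted(disp, reverse=True)[:n_top]:
            tallies[gene] += 1
    return [g for g, n in tallies.items() if n >= min_batches]
```

A gene that merely shifts its level between batches (but is constant within each) never ranks highly inside any single batch, so it is excluded.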

How do I choose the right HVG method for my scFM project?

There is no single "best" method that outperforms all others in every scenario. Your choice should be guided by your data characteristics and project goals. The table below summarizes key methods:

Table 1: Comparison of Highly Variable Gene (HVG) Detection Methods

Method | Underlying Model / Approach | Key Features | Considerations
Brennecke et al. | Fits a generalized linear model to the relationship between squared coefficient of variation (CV²) and mean expression [1]. | Uses DESeq's normalization; filters genes with high uncertainty. | A foundational method; may be superseded by more modern approaches.
scran | Fits a trend to the mean-variance relationship of log-transformed expression values using LOESS [1] [2]. | Uses a specialized pooling algorithm for normalization; decomposes variance into technical and biological components. | Robust; considered a strong performer in benchmarks.
Seurat (VST) | Uses a polynomial regression model to find a variance-stabilizing transformation of the mean-variance relationship [1] [6]. | Places genes into bins based on expression mean to calculate z-scores; widely used and integrated in Seurat workflows. | A common and effective default choice.
BASiCS | Employs a Bayesian hierarchical model to decompose variation into technical and biological components [1]. | Can use spike-in RNAs to model technical noise; can also identify lowly variable genes. | Computationally intensive; powerful for sophisticated noise modeling.
GLP | Uses optimized LOESS regression on the relationship between gene average expression and "positive ratio" (fraction of cells expressing the gene) [3]. | Designed to be robust to high sparsity and dropout noise in scRNA-seq data; reported to outperform other methods in some benchmarks. | A recently developed method; promising for handling noisy data.

A practical workflow is to start with a well-established method like scran or Seurat's VST, and if downstream analysis is unsatisfactory, benchmark against alternative methods like GLP [1] [3].

Experimental Protocols

Standard Workflow for HVG Detection

The following protocol outlines a standard computational workflow for identifying HVGs, which can be applied prior to scFM training.

Inputs: A quality-controlled and normalized single-cell RNA-seq count matrix (cells x genes).

Procedure:

  • Quantification of Variation: Calculate a measure of variation for each gene. The most straightforward approach is to compute the variance of the log-normalized expression values across all cells [2].
  • Modeling the Mean-Variance Relationship: Model the trend between gene expression abundance (mean) and the chosen variation metric (variance). This step is crucial to account for the fact that variance in expression data is often mean-dependent [2].
    • The modelGeneVar() function (e.g., in the scran package) fits a trend to the per-gene variance with respect to abundance. It then decomposes the total variance for each gene into a technical component (the fitted value) and a biological component (the residual from the trend) [2].
    • If spike-in RNAs are available, modelGeneVarWithSpikes() can provide a more precise estimate of technical noise by fitting a trend to the spike-in variances [2].
  • Statistical Testing & Ranking: Rank genes based on their biological variation component (or a related statistic like the z-score of residuals from the trend). A statistical test (e.g., against a null hypothesis of no biological variation) is often performed, and genes are ranked by significance or the magnitude of the biological component [1] [2].
  • Selection of Top HVGs: Select the top N genes (e.g., 2,000-5,000) from the ranked list for downstream analysis or scFM training [6].
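The four steps above can be sketched in plain Python. The binned lower-median trend below is a crude stand-in for the LOESS fit used by modelGeneVar(); the function name, bin count, and statistics are illustrative assumptions, not the scran implementation.

```python
from statistics import mean, variance

def select_hvgs(matrix, gene_names, n_top=2, n_bins=5):
    """Steps 1-4 in miniature: quantify per-gene variance of log-normalized
    values, fit a crude mean-variance trend (binned lower median, standing
    in for a LOESS fit), treat the residual as the biological component,
    and select the top N genes.

    matrix: list of cells, each a list of log-normalized expression values.
    """
    n_genes = len(gene_names)
    means = [mean(cell[g] for cell in matrix) for g in range(n_genes)]
    vars_ = [variance([cell[g] for cell in matrix]) for g in range(n_genes)]

    # Step 2: bin genes by mean expression; the bin's lower-median variance
    # serves as the technical (trend) estimate for every gene in the bin.
    order = sorted(range(n_genes), key=lambda g: means[g])
    bin_size = max(1, n_genes // n_bins)
    trend = [0.0] * n_genes
    for start in range(0, n_genes, bin_size):
        bin_genes = order[start:start + bin_size]
        bin_vars = sorted(vars_[g] for g in bin_genes)
        tech = bin_vars[(len(bin_vars) - 1) // 2]
        for g in bin_genes:
            trend[g] = tech

    # Steps 3-4: biological component = residual from the trend; rank, select.
    bio = [vars_[g] - trend[g] for g in range(n_genes)]
    ranked = sorted(range(n_genes), key=lambda g: bio[g], reverse=True)
    return [gene_names[g] for g in ranked[:n_top]]
```

On synthetic data with one deliberately bimodal gene, that gene tops the ranking while genes with only small noise around a constant level do not.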

The following diagram illustrates the logical workflow and the key decision points.

Normalized scRNA-seq count matrix
  → 1. Quantify variation (e.g., variance of log-counts)
  → 2. Model mean-variance relationship
  → Decision: spike-ins available? Yes: use modelGeneVarWithSpikes(); No: use modelGeneVar()
  → 3. Rank genes by biological component
  → 4. Select top N genes (e.g., 2,000-5,000 HVGs)
  → Proceed to downstream analysis / scFM training

Protocol for Validating scFM Biological Relevance Using HVG-Derived Insights

After training an scFM, it is critical to validate that the model has captured meaningful biological patterns and not just technical artifacts.

Inputs: A trained scFM, a held-out test scRNA-seq dataset with high-quality cell type annotations.

Procedure:

  • Generate Cell Embeddings: Use the scFM in "zero-shot" mode to generate latent embeddings for all cells in the test dataset without any fine-tuning [4].
  • Evaluate with Ontology-Informed Metrics: Assess the quality of the embeddings using novel metrics that incorporate prior biological knowledge.
    • scGraph-OntoRWR: This metric evaluates whether the relationships between cell types captured by the scFM embeddings are consistent with known biological relationships defined in cell ontologies [4].
    • Lowest Common Ancestor Distance (LCAD): For cell type annotation tasks, this metric measures the ontological proximity between misclassified cell types and the correct label, ensuring that errors are biologically plausible (e.g., confusing two T cell subtypes is less severe than confusing a T cell with a neuron) [4].
  • Compare to Baseline Methods: Compare the performance of your scFM against simpler baseline models (e.g., standard HVG selection followed by PCA) on relevant downstream tasks like batch integration, cell type annotation, and cancer cell identification [4].
  • Functional Validation (Gold Standard): For the most critical HVGs identified or prioritized by the scFM, plan wet-lab experiments to functionally validate their role. Techniques include:
    • RNA FISH / Immunofluorescence (IF): To confirm the spatial localization and protein-level expression of the gene product [7].
    • Gene Knockdown/Knockout: Using CRISPR/Cas9 or RNAi to silence the gene and observe phenotypic consequences in functional assays (e.g., migration, proliferation) [7] [8].
    • Specific Cell Sorting: Using FACS to isolate cell populations based on HVG expression and validate their identity and function via RT-qPCR [7].
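For the LCAD idea specifically, a generic hop-count distance to the lowest common ancestor in a cell ontology tree can be sketched as follows. The function name, the parent-dict encoding, and the toy ontology are assumptions for illustration, not the published metric's exact definition.

```python
def lca_distance(ontology_parent, a, b):
    """Hops from each label up to their lowest common ancestor, summed.
    Captures the LCAD intuition: confusing two nearby cell types (e.g.,
    two T cell subtypes) scores lower than confusing distant ones.

    ontology_parent: dict mapping each term to its parent (root -> None).
    """
    def ancestors(node):
        path = []
        while node is not None:
            path.append(node)
            node = ontology_parent[node]
        return path

    path_a, path_b = ancestors(a), ancestors(b)
    common = set(path_a) & set(path_b)
    # In a tree, the first shared node on each upward path is the LCA.
    da = next(i for i, n in enumerate(path_a) if n in common)
    db = next(i for i, n in enumerate(path_b) if n in common)
    return da + db
```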

The Scientist's Toolkit

Table 2: Key Research Reagent Solutions for scRNA-seq and Validation

Reagent / Tool | Function | Application Context
ERCC Spike-in RNAs | Exogenous RNA controls used to precisely model technical noise and improve the accuracy of HVG detection [2]. | scRNA-seq library preparation and normalization.
UMI Barcodes | Unique Molecular Identifiers are short random sequences that label individual mRNA molecules, allowing for accurate quantification by correcting for PCR amplification biases [9]. | scRNA-seq library preparation (e.g., in 10x Genomics, Drop-seq).
siRNAs / shRNAs | Small interfering RNAs or short hairpin RNAs used for transient gene knockdown to functionally validate the role of a target HVG [8]. | Functional validation in vitro (e.g., in HUVECs).
CRISPR-Cas9 System | A gene-editing tool used to create stable gene knockouts, providing definitive evidence for a gene's function [7] [8]. | Functional validation in vitro and in vivo.
FACS Antibodies | Fluorescently labeled antibodies against cell surface or intracellular proteins for isolating specific cell populations via flow cytometry [7]. | Target population isolation and validation.
RNA FISH Probes | Fluorescently labeled nucleic acid probes that bind to specific RNA sequences, enabling visualization of gene expression and spatial localization in tissues [7]. | Spatial validation of HVG expression.

Frequently Asked Questions (FAQs)

FAQ 1: Why is Highly Variable Gene (HVG) selection a critical step in single-cell RNA-seq analysis? HVG selection is the process of identifying genes that exhibit significant cell-to-cell variation in expression within a seemingly homogeneous cell population. This step is crucial because it focuses downstream analyses on the genes most likely to be informative of biological heterogeneity, such as different cell types or states. Using HVGs improves computational efficiency, prevents overfitting, and enhances the performance of clustering algorithms by reducing the data dimensionality from tens of thousands of genes to a manageable set of features that capture key biological signals [2] [10]. Neglecting this step can obscure meaningful biological insights, as clustering and dimensionality reduction are highly sensitive to the choice of input genes [2].

FAQ 2: My single-cell analysis failed to identify a known rare cell population. Could HVG selection be the cause? Yes, this is a common challenge. While for abundant and well-separated cell types, even large random gene sets can perform adequately, the identification of rare or subtly different cell types is highly sensitive to the HVG selection method [10]. For instance, in a study focusing on CD4+ T cells, using the standard HVG method successfully identified a FOXP3+ T regulatory (Treg) population (~1.8% of cells), whereas using an equal number of randomly selected genes completely failed to reveal this population, even when the entire transcriptome was used [10]. This demonstrates that for subtle biological differences, a thoughtful choice of HVG method is essential.

FAQ 3: I see inconsistent results every time I re-run my HVG analysis on a subset of my data. How can I improve reproducibility? Low reproducibility in HVG selection is a recognized issue that can significantly impact downstream analyses like cell classification. A benchmarking study on hematopoietic cells revealed that the reproducibility of HVG methods—measured as the proportion of overlapping genes identified across multiple tests—varies considerably [11]. Methods like SCHS showed high reproducibility (>90%), while others, including some popular Seurat methods, showed lower reproducibility (50-70%) [11]. To overcome this, consider using a robust strategy like SIEVE (SIngle-cEll Variable gEnes), which employs multiple rounds of random sampling to identify a stable, high-confidence set of HVGs, thereby minimizing stochastic noise and improving the consistency of your results [11].

FAQ 4: How many Highly Variable Genes should I select for my analysis? The optimal number is not fixed and can depend on the complexity of your dataset and the biological question. However, using too many features can be as detrimental as using too few. Evidence suggests that for standard tasks like clustering peripheral blood mononuclear cells (PBMCs), performance plateaus after selecting a few hundred to a few thousand genes [10]. For example, in one PBMC dataset, clustering metrics reached a high level with around 725 selected genes [10]. It is recommended to avoid automatically selecting the maximum number of HVGs, as this can introduce noise. Start with a standard number (e.g., 2,000-3,000) and perform sensitivity checks to ensure your key findings are robust.

FAQ 5: How does HVG selection specifically impact the training of single-cell foundation models (scFMs)? Single-cell foundation models are pre-trained on massive single-cell datasets to learn universal biological knowledge. The choice of input genes fundamentally shapes what the model learns. HVG selection ensures the model focuses its capacity on the most biologically meaningful signals rather than technical noise or uninformative genes. A comprehensive benchmark of scFMs highlights that the input feature space is a critical factor in model performance [4]. While scFMs are robust tools, their ability to generate insightful embeddings for downstream tasks is directly influenced by the quality and relevance of the features they were trained on. A variability-centric view of feature selection aligns with the core strength of scRNA-seq—capturing cell-to-cell heterogeneity—and can empower scFMs to uncover deeper biological insights [12] [4].

Performance Comparison of HVG Selection Methods

The table below summarizes the performance of various HVG methods based on evaluations using hematopoietic stem/progenitor cells (HSPCs) and mature blood cells [11].

Method | Reproducibility | Preference for Gene Expression Level | Notes on Performance
SCHS | High (>90%) | Prefers highly expressed genes | High accuracy in cell classification; robust performance.
Seurat (VST, SCT, DISP) | Low to medium (50-70%) | Mix of high and low (a quarter of selected genes are lowly expressed) | Common and accessible; performance can be improved with SIEVE.
M3Drop | Low (50-70%) | Selects lowly expressed genes | Lower distinguishing capability for similar cell types (e.g., HSPCs).
scran | Medium (80-90%) | Prefers highly expressed genes | Does not select lowly expressed genes.
scmap | Medium (80-90%) | Prefers highly expressed genes | Slightly lower cluster purity.
ROGUE/ROGUE_n | Medium (80-90%) | Prefers highly expressed genes | Does not select lowly expressed genes.
SIEVE | Very high (after application) | Shifts selected genes towards median expression | A meta-strategy applied to other methods to enhance reproducibility and biological relevance.

Experimental Protocol: Identifying Robust HVGs with the SIEVE Strategy

The SIEVE strategy is designed to overcome the low reproducibility of many standalone HVG methods by leveraging multiple rounds of random sampling [11].

Sample the Data

  • Begin with your complete, quality-controlled, and normalized scRNA-seq dataset (e.g., a Seurat or SingleCellExperiment object).
  • Randomly sample (without replacement) a predefined proportion of cells (e.g., 70%) from the full dataset. This subset is termed the "reference set." The remaining cells (e.g., 30%) form the "query set."

Identify HVGs on the Reference Set

  • Apply your chosen HVG selection method (e.g., Seurat's VST, scran, SCHS) to the reference set to identify a list of highly variable genes. The number of top HVGs to select per run (e.g., 2,000) should be kept constant.
  • Note: It is critical to use the same HVG method and parameters for every iteration.

Iterate the Process

  • Repeat steps 1 and 2 a large number of times (e.g., 50 times). Each iteration generates a new, independent reference set and a corresponding list of HVGs.

Calculate Gene Frequencies and Define the Robust HVG Set

  • Across all iterations, count how many times each gene appears in the HVG lists.
  • The final, robust set of HVGs is defined as those genes that appear in a high proportion (e.g., >80%) of the iterations. This frequency threshold can be adjusted based on the desired stringency.

Validate with Downstream Analysis

  • Use the robust HVG set for downstream tasks such as PCA, clustering, and cell type annotation.
  • The SIEVE strategy has been shown to improve the accuracy of single-cell classification and helps recover more biologically relevant genes that are enriched for cluster markers [11].
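The resampling procedure above maps directly onto a short loop. The sketch below is method-agnostic (any `hvg_fn` can be plugged in); parameter defaults mirror the examples in the protocol but are otherwise assumptions, not the SIEVE software itself.

```python
import random
from collections import Counter

def sieve_hvgs(cells, hvg_fn, n_iter=50, frac=0.7, freq_threshold=0.8, seed=0):
    """SIEVE-style resampling: run an HVG method on random reference sets
    (here 70% of cells, sampled without replacement) and keep genes that
    are selected in at least `freq_threshold` of the iterations.

    hvg_fn(reference_cells) -> set of selected gene names; per the protocol,
    it must apply the same method and parameters on every call.
    """
    rng = random.Random(seed)
    counts = Counter()
    n_ref = int(len(cells) * frac)
    for _ in range(n_iter):
        reference = rng.sample(cells, n_ref)   # new reference set each round
        counts.update(hvg_fn(reference))
    cutoff = freq_threshold * n_iter
    return {gene for gene, n in counts.items() if n >= cutoff}
```

A gene selected in every round survives the frequency threshold; one selected in only half the rounds is filtered out.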

Full scRNA-seq dataset
  → Randomly sample 70% of cells (reference set)
  → Apply HVG method (e.g., VST, SCHS) to generate an HVG list
  → Repeat N times (e.g., N = 50), yielding N HVG lists
  → Calculate gene selection frequency across iterations
  → Apply frequency threshold (e.g., >80%)
  → Output: robust set of HVGs

SIEVE Workflow for Robust HVG Selection

The Scientist's Toolkit: Research Reagent Solutions

Reagent / Tool | Function in HVG Analysis / scRNA-seq
ERCC Spike-in RNAs | External RNA controls used to model technical noise and improve the accuracy of variance estimation during normalization and HVG selection [1] [2].
scRNA-seq Analysis Packages (Seurat, scran, Scanpy) | Software suites that provide integrated implementations of various HVG discovery methods (e.g., VST, scran, M3Drop) within a complete analytical workflow [1] [13] [11].
SIEVE Software | A dedicated tool for implementing the SIEVE resampling strategy to identify a robust and reproducible set of HVGs, available from https://github.com/YinanZhang522/SIEVE [11].
Single-cell Foundation Models (scGPT, Geneformer) | Deep learning models pre-trained on large-scale scRNA-seq data. Proper HVG selection can inform the feature space used for fine-tuning these models on specific tasks [4].

Advanced Concepts: Differential Variability (DV) Analysis

Moving beyond traditional differential expression (DE), which focuses on changes in mean expression, Differential Variability (DV) analysis identifies genes with significant differences in expression variability (cell-to-cell heterogeneity) between two conditions [12]. These DV genes can offer distinct functional insights.

Method Spotlight: spline-DV

  • Purpose: A statistical framework to identify DV genes from scRNA-seq data between two experimental conditions (e.g., healthy vs. diseased) [12].
  • How it works: It models gene-level statistics—mean expression, coefficient of variation (CV), and dropout rate—in a 3D space. A spline-fit curve is generated for each condition, representing the expected relationship between these statistics. For each gene, a vector is drawn from the nearest point on the spline to its observed position. The difference between the vectors for the two conditions (the DV vector) quantifies the change in variability, with its magnitude being the DV score [12].
  • Application: In a study on diet-induced obesity, spline-DV identified Plpp1 (increased variability in high-fat diet) and Thrsp (decreased variability in high-fat diet) as top DV genes, providing insights into metabolic dysfunction that were not apparent from mean expression alone [12].
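A heavily simplified version of the DV score can be written down directly from this description. Here a caller-supplied `trend` function stands in for the per-condition spline fit, so this is a sketch of the idea rather than the published spline-DV implementation.

```python
import math

def gene_stats(values):
    """The three per-gene statistics spline-DV models jointly:
    mean expression, coefficient of variation, and dropout rate."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    cv = math.sqrt(var) / mu if mu > 0 else 0.0
    dropout = sum(1 for x in values if x == 0) / n
    return (mu, cv, dropout)

def dv_score(values_a, values_b, trend):
    """Magnitude of the DV vector dv = v2 - v1, where each v is a gene's
    deviation from the condition's expected point in (mean, CV, dropout)
    space. `trend(stats)` is a caller-supplied stand-in for the fitted
    spline (the real method fits one curve per condition)."""
    v1 = tuple(o - e for o, e in zip(gene_stats(values_a), trend(gene_stats(values_a))))
    v2 = tuple(o - e for o, e in zip(gene_stats(values_b), trend(gene_stats(values_b))))
    return math.sqrt(sum((b - a) ** 2 for a, b in zip(v1, v2)))
```

A gene whose mean is unchanged but whose expression becomes bimodal between conditions still receives a nonzero DV score, which is exactly the signal mean-based DE analysis misses.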

Condition A (e.g., control): model in 3D space (mean, CV, dropout) → fit spline curve → calculate gene's deviation vector (v₁)
Condition B (e.g., treatment): model in 3D space (mean, CV, dropout) → fit spline curve → calculate gene's deviation vector (v₂)
Both conditions → compute DV vector dv = v₂ − v₁ → rank genes by DV score

spline-DV Analysis Workflow

Frequently Asked Questions

Q1: What is the fundamental difference between technical and biological variation in single-cell RNA-seq data? Biological variation refers to the natural, functionally relevant differences in gene expression between individual cells. This includes differences due to cell type, cell cycle stage, transcriptional bursts, and response to environmental stimuli [14]. Technical variation arises from the experimental process itself, including cell isolation, reverse transcription, cDNA amplification, and sequencing. This results in biases such as low capture efficiency, high dropout rates (where a gene is observed in one cell but not in another), and amplification noise [14] [15].

Q2: Why is it critical to account for technical variation before selecting Highly Variable Genes (HVGs) for model training? HVG selection focuses on genes that show more cell-to-cell variability than expected from technical noise alone [15]. If technical variation is not accounted for, the selected gene set will be contaminated with technical artifacts rather than true biological signals. This leads to poor performance in downstream tasks such as cell clustering, data integration, and training of single-cell foundation models (scFMs), as the model learns from noise instead of biology [6] [16].

Q3: How does poor feature selection impact the training and performance of a single-cell foundation model (scFM)? Benchmarking studies show that feature selection methods directly affect the quality of data integration and query mapping, which are foundational for building robust reference atlases [6]. Using poorly selected features can cause an scFM to learn incorrect cellular representations, reducing its ability to accurately predict cellular responses to perturbations (in-silico perturbation). For example, an open-loop scFM might have a low positive predictive value, which can be significantly improved by incorporating even a small amount of experimental perturbation data to guide feature selection in a "closed-loop" framework [17].

Q4: What are some common methods to identify and correct for technical variance?

  • Highly Variable Gene Selection: Standard practice is to select genes exhibiting high biological variability after modeling the mean-variance trend expected from technical noise [15] [16].
  • Batch Correction: Computational integration methods are used to remove technical differences between samples or batches while conserving biological variation [6] [18].
  • Utilizing Stable Genes: Using stably expressed genes as a negative control can help establish a baseline for technical noise [6].

Troubleshooting Guides

Issue 1: High Batch Effect in Integrated Data

Problem: After integrating multiple datasets for scFM pre-training, cells cluster strongly by batch or study of origin rather than by biological cell type.

  • Potential Cause 1: The feature selection method did not properly account for batch effects.
    • Solution: Use a batch-aware feature selection method. Instead of selecting HVGs per dataset, perform feature selection on a collaboratively corrected matrix or use a method that explicitly models batch information to identify features robust to technical variation [6].
  • Potential Cause 2: The selected features are themselves driven by technical artifacts.
    • Solution: As a diagnostic, check the expression patterns of the top selected features. If they are dominated by mitochondrial or ribosomal genes, they may reflect cell viability or other technical confounders. Re-run HVG selection with these genes filtered out.
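A quick fix for mitochondrial/ribosomal contamination of the feature set is to filter by gene-name prefix before re-running HVG selection. The prefixes below follow common human gene-naming conventions and are an assumption; adjust them for other species or annotation schemes.

```python
def drop_technical_genes(gene_names, prefixes=("MT-", "RPS", "RPL")):
    """Remove mitochondrial and ribosomal genes (frequent technical
    confounders reflecting cell viability or library quality) before
    re-running HVG selection. Case-insensitive prefix match."""
    return [g for g in gene_names if not g.upper().startswith(prefixes)]
```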

Issue 2: scFM Fails to Generalize to New Query Data

Problem: Your trained scFM performs well on its training data but fails to accurately map or make predictions for new query samples.

  • Potential Cause 1: The feature space used for training is not representative of the biological variation in the query.
    • Solution: Re-evaluate the feature selection strategy. Ensure that the set of highly variable genes captures broad biological programs rather than being overly specific to the training data. Benchmarking suggests that using highly variable feature selection is effective for producing high-quality integrations and mappings [6].
  • Potential Cause 2: Technical differences (e.g., sequencing depth, protocol) between the training and query data are too great.
    • Solution: Apply a robust scaling/normalization method (e.g., Robust Scaler) to both training and query data using the same reference to minimize the effect of outliers and technical discrepancies [19].
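A minimal median/IQR ("robust") scaler, fitted once on the training reference and then applied identically to training and query data, might look like the following. The index-based quartiles are crude; this is a sketch of the idea, not the scikit-learn RobustScaler.

```python
def robust_scale_params(reference_values):
    """Median and IQR for one gene, fitted on the TRAINING reference only
    (simple index-based quartiles; adequate for a sketch)."""
    s = sorted(reference_values)
    n = len(s)
    med = s[n // 2] if n % 2 else 0.5 * (s[n // 2 - 1] + s[n // 2])
    iqr = (s[(3 * n) // 4] - s[n // 4]) or 1.0   # guard against zero spread
    return med, iqr

def robust_scale(values, params):
    """Apply the SAME fitted parameters to training and query data so
    both land in a comparable range despite outliers or depth differences."""
    med, iqr = params
    return [(x - med) / iqr for x in values]
```

The key design point is that the parameters come from one reference: rescaling the query with its own median/IQR would reintroduce exactly the technical discrepancy being removed.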

Issue 3: Low Positive Predictive Value in In-Silico Perturbation

Problem: Predictions made by your scFM for genetic perturbations (e.g., knockout, overexpression) have a low rate of experimental validation.

  • Potential Cause: The "open-loop" model predictions are based on patterns in the baseline data that may not fully capture the effects of perturbation.
    • Solution: Implement a "closed-loop" fine-tuning framework. Incorporate a small set (as few as 10-20 examples) of experimental perturbation data (e.g., from Perturb-seq) into the model's fine-tuning process. This has been shown to triple the positive predictive value of in-silico perturbation predictions [17].

Experimental Protocols

Protocol 1: Benchmarking Feature Selection for Data Integration and Query Mapping

This protocol is based on a robust benchmarking pipeline from a registered report in Nature Methods [6].

1. Define Evaluation Metrics: Select metrics that cover multiple performance categories:

  • Batch Effect Removal: Batch ASW (Average Silhouette Width), Batch PCR (Principal Component Regression).
  • Biological Conservation: cLISI (Cell-type Local Inverse Simpson's Index), isolated label F1 score.
  • Query Mapping Quality: Cell distance, mLISI (Mapping LISI).
  • Unseen Population Detection: Milo, Unseen cell distance.

2. Establish Baseline Methods: Run integrations with diverse baseline feature sets to establish performance ranges for scaling metrics. Recommended baselines include:

  • All features.
  • 2,000 highly variable features (e.g., using the scanpy implementation).
  • 500 randomly selected features (average over 5 sets).
  • 200 stably expressed features (e.g., using scSEGIndex) as a negative control.

3. Scale and Summarize Performance: Scale the metric scores for each method relative to the minimum and maximum baseline scores. Aggregate scores within each metric category to summarize performance.
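Step 3's scaling against the baseline range reduces to a min-max rescale; the sketch below uses hypothetical scores for illustration.

```python
def scale_to_baselines(score, baseline_scores):
    """Rescale a metric so 0 and 1 correspond to the worst and best
    baseline feature sets (all features, 2,000 HVGs, random, stable genes).
    Values above 1 mean the method beat every baseline."""
    lo, hi = min(baseline_scores), max(baseline_scores)
    if hi == lo:
        return 0.0
    return (score - lo) / (hi - lo)
```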

Protocol 2: Performing Cell-Type Specific Differential Expression with Biological Replicates

This protocol ensures valid statistical testing by treating samples, not individual cells, as experimental units [18].

1. Data Processing:

  • Start with raw count data (do not use batch-corrected counts for DE).
  • Perform quality control filtering and cell type annotation.
  • If multiple samples are analyzed, integrate them with batch correction.

2. Pseudobulk Aggregation:

  • For each cell type of interest, sum the UMI counts across all cells belonging to the same sample.
  • This creates a representative expression profile for that cell type in each sample.

3. Differential Expression Analysis:

  • Use established bulk RNA-seq tools (e.g., edgeR, limma-voom) on the pseudobulk counts.
  • The statistical model tests for expression differences between conditions (e.g., treated vs. control), using the samples as replicates.
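Step 2's pseudobulk aggregation reduces to a grouped sum over (sample, cell type) pairs; a minimal sketch follows, where the tuple-based data layout is an assumption for illustration.

```python
from collections import defaultdict

def pseudobulk(cells, n_genes):
    """Sum UMI counts over all cells sharing a (sample, cell_type) pair,
    yielding one bulk-like profile per cell type per sample. The resulting
    profiles (not individual cells) become the replicates for DE testing.

    cells: iterable of (sample_id, cell_type, counts) tuples, where
    counts is a per-gene list of UMI counts.
    """
    profiles = defaultdict(lambda: [0] * n_genes)
    for sample_id, cell_type, counts in cells:
        acc = profiles[(sample_id, cell_type)]
        for g, c in enumerate(counts):
            acc[g] += c
    return dict(profiles)
```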

Data Presentation

Table 3: Comparison of Multi-Condition Differential Expression Tools for scRNA-seq

Tool Name | Statistical Approach | Key Feature / Use Case
muscat [18] | Mixed-effects model or pseudobulk | Detects subpopulation-specific state transitions from multi-sample, multi-condition data.
NEBULA [18] | Mixed-effects model | A fast negative binomial mixed model for large-scale multi-subject data.
MAST [18] | Mixed-effects model | Accounts for the high number of zero counts; supports random effects.
scran (pseudoBulkDGE) [18] | Pseudobulk | Wraps the bulk tools edgeR and limma-voom for easy use with single-cell data.
distinct [18] | Differential distribution test | Tests for differences in the entire expression distribution, not just the mean.

Table 4: Reagent and Tool Solutions for scRNA-seq Experimental Design

Item | Function in Experiment
Unique Molecular Identifiers (UMIs) | Molecular barcodes added to each transcript during reverse transcription; they allow accurate molecule counting by correcting for PCR amplification bias [20].
Cell Barcodes | Short DNA sequences that uniquely label all mRNAs from a single cell, allowing samples to be multiplexed and computationally demultiplexed after sequencing [20].
Fluidigm C1 System | A microfluidic-array platform for automated cell capture and library preparation, suitable for medium-throughput, full-length transcriptome analysis [20].
10x Chromium | A microfluidic-droplet platform for high-throughput, 3' or 5' tag-based library preparation; cost-effective for profiling tens of thousands of cells [20].
SMART-seq2 | A plate-based, full-length RNA-seq protocol that provides uniform transcript coverage, enabling the study of splice variants and allele-specific expression [20].

Workflow Visualizations

HVG Selection for scFM Training

Raw scRNA-seq data
  → Quality control & normalization
  → Model technical variation
  → Select highly variable genes (HVGs)
  → Train single-cell foundation model
  → If the open-loop model fails to generalize: apply closed-loop fine-tuning ↔ validate predictions experimentally (iterative feedback)
  → High-performance virtual cell model

scRNA-seq Multi-Condition Experimental Design

Design Experiment with Multiple Biological Replicates → Single-Cell Isolation & Library Prep (e.g., 10x) → Sequencing & Raw Count Matrix → Cell Type Annotation → Aggregate Cells by Sample to Create Pseudobulk (treat samples, not cells, as replicates) → Cell-Type-Specific Differential Expression → Biologically Valid Results

Troubleshooting Guide: Addressing Data Sparsity and Technical Noise

FAQ 1: How does data sparsity fundamentally challenge scFM training?

Data sparsity, primarily caused by dropout events where genes are measured as unexpressed due to technical limitations, obscures the true biological signal in single-cell RNA sequencing (scRNA-seq) data. This high sparsity and high dimensionality create a "curse of dimensionality" problem where technical noise accumulates and masks subtle biological phenomena, including tumor-suppressor events in cancer and cell-type-specific transcription factor activities [21].

The core issue is that statistical properties of high-dimensional spaces differ dramatically from our intuitive understanding of two- or three-dimensional spaces. As dimensionality increases, the distance between data points becomes less meaningful, and technical noise dominates the data structure, making it difficult for foundation models to learn meaningful biological representations [21].

Solution: Implement comprehensive noise reduction before scFM training. The RECODE algorithm models technical noise arising from the entire data generation process as a general probability distribution and reduces it using eigenvalue modification theory rooted in high-dimensional statistics. This approach effectively mitigates technical noise while preserving biological signals [21].

FAQ 2: What methods effectively reduce both technical noise and batch effects simultaneously?

Traditional approaches that simply combine technical noise reduction with batch correction often fail because conventional batch correction methods typically rely on dimensionality reduction techniques like PCA, which themselves are insufficient to overcome the curse of dimensionality [21].

Solution: Utilize integrated approaches like iRECODE (integrative RECODE), which synergizes high-dimensional statistical noise reduction with established batch correction methods. iRECODE integrates batch correction within an "essential space" after initial noise variance-stabilizing normalization, thereby minimizing accuracy degradation and computational costs associated with high-dimensional calculations [21].

Table 1: Performance Comparison of Noise Reduction Methods

| Method | Technical Noise Reduction | Batch Effect Correction | Relative Error in Mean Expression | Computational Efficiency |
| --- | --- | --- | --- | --- |
| Raw Data | None | None | 11.1–14.3% | Baseline |
| RECODE Only | Excellent | Limited | Not available | High |
| Traditional Batch Correction | Limited | Good | Not available | Moderate |
| iRECODE | Excellent | Excellent | 2.4–2.5% | 10× more efficient than combined approaches |

FAQ 3: How does feature selection impact scFM performance with sparse data?

Feature selection—specifically the identification of Highly Variable Genes (HVGs)—is critical for managing data sparsity in scFM training. The choice of feature selection method significantly affects downstream integration performance, query mapping, label transfer accuracy, and detection of unseen cell populations [6].

Benchmarking studies reveal that using highly variable genes generally leads to better integrations, but the specific feature selection strategy must be carefully chosen. Methods that leverage the relationship between gene average expression level and positive ratio (the proportion of cells where a gene is detected) can more robustly identify biologically informative features amidst technical noise [3].

Table 2: Feature Selection Method Performance Benchmarks

| Method | Adjusted Rand Index (ARI) | Normalized Mutual Information (NMI) | Silhouette Coefficient | Robustness to Dropout |
| --- | --- | --- | --- | --- |
| GLP | Highest | Highest | Highest | Excellent |
| VST | High | High | High | Good |
| SCTransform | High | High | High | Good |
| M3Drop/NBDrop | Moderate | Moderate | Moderate | Excellent |
| Random Selection | Low | Low | Low | Poor |

Solution: Consider advanced feature selection methods such as GLP (Genes identified through LOESS with Positive ratio), which uses optimized LOESS regression to capture the relationship between a gene's average expression level and its positive ratio while minimizing overfitting. In benchmarks, GLP consistently outperformed eight leading feature selection methods across multiple evaluation criteria [3].

FAQ 4: Are foundation models inherently robust to data sparsity, or do traditional methods remain competitive?

Current benchmarking reveals that while scFMs are robust and versatile tools for diverse applications, simpler machine learning models can be more adept at efficiently adapting to specific datasets, particularly under resource constraints. Notably, no single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, and computational resources [4].

The performance improvement of scFMs often arises from creating a "smoother landscape" in the pretrained latent space, which reduces the difficulty of training task-specific models. However, the high sparsity, high dimensionality, and low signal-to-noise ratio of transcriptome data continue to present challenges for all models [4].

Solution: Evaluate the specific requirements of your biological question before committing to scFM approaches. For well-defined tasks with limited data, traditional methods may provide more efficient solutions. For exploratory analyses across diverse cell types and conditions, scFMs may offer advantages in capturing broader biological patterns [4] [5].

Experimental Protocols for Noise Mitigation

Protocol 1: Implementing iRECODE for Dual Noise Reduction

  • Input Preparation: Format your scRNA-seq data as a standard gene expression matrix with cells as columns and genes as rows [21].

  • Noise Variance-Stabilizing Normalization (NVSN): Map gene expression data to an essential space using NVSN to stabilize technical variance across the expression range [21].

  • Singular Value Decomposition: Apply SVD to decompose the normalized matrix into orthogonal components representing the primary sources of variation [21].

  • Principal Component Variance Modification: Modify principal component variances using eigenvalue modification theory to reduce technical noise [21].

  • Integrated Batch Correction: Apply Harmony batch correction within the essential space to minimize batch effects while preserving biological variation [21].

  • Reconstruction: Reconstruct the denoised, batch-corrected expression matrix for downstream scFM training [21].
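The SVD and variance-modification steps at the heart of this protocol can be illustrated with a minimal NumPy sketch. This is a simplified stand-in for RECODE's actual procedure (the function name, the hard rank cutoff, and the mean-centering are illustrative assumptions; the published method modifies, rather than truncates, the noise eigenvalues):

```python
import numpy as np

def svd_denoise(X, n_keep=10):
    """Simplified sketch of SVD-based denoising: keep the leading
    singular components presumed to carry biological signal and
    suppress the trailing, noise-dominated ones."""
    mu = X.mean(axis=1, keepdims=True)          # center each gene
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    s_mod = s.copy()
    s_mod[n_keep:] = 0.0                        # zero out noise components
    return (U * s_mod) @ Vt + mu                # reconstruct denoised matrix

rng = np.random.default_rng(0)
X = rng.poisson(2.0, size=(500, 100)).astype(float)  # genes x cells
X_denoised = svd_denoise(X, n_keep=10)
print(X_denoised.shape)  # (500, 100)
```

In the full iRECODE workflow, the batch-correction step (e.g., Harmony) would operate on the low-dimensional components before reconstruction.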

Protocol 2: GLP Feature Selection for scFM Training

  • Data Preprocessing: Filter out genes captured in fewer than 3 cells to ensure statistical reliability [3].

  • Parameter Calculation: For each gene, compute:

    • Average expression level (λ) = (1/c) × ΣXij
    • Positive ratio (f) = (1/c) × Σmin(1, Xij) where c is the number of cells and Xij is the expression value [3].
  • Bayesian Information Criterion Optimization: Use BIC to automatically determine the optimal LOESS smoothing parameter (α) through:

    • RSS = Σ(yj - ŷj)²
    • BIC = c × ln(RSS/c) + k × ln(c) where k is the degrees of freedom [3].
  • Two-Step LOESS Regression:

    • First step: Apply Tukey's biweight robust statistical method to identify outlier genes
    • Second step: Assign zero weights to outliers and repeat LOESS regression for accurate gene selection [3].
  • Feature Selection: Select genes with expression levels significantly higher than expected based on the LOESS-predicted values from their positive ratios [3].
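The per-gene statistics and the BIC criterion above translate directly into code. The following NumPy sketch implements only those formulas (function names are hypothetical, and the two-step LOESS regression itself is omitted):

```python
import numpy as np

def gene_stats(X):
    """Per-gene GLP statistics. X is a genes x cells count matrix."""
    c = X.shape[1]
    lam = X.sum(axis=1) / c               # average expression level, λ
    f = np.minimum(X, 1).sum(axis=1) / c  # positive ratio, f
    return lam, f

def bic(rss, c, k):
    """BIC used to pick the LOESS smoothing parameter in the protocol."""
    return c * np.log(rss / c) + k * np.log(c)

X = np.array([[0, 3, 0, 5],
              [1, 1, 1, 1],
              [0, 0, 0, 2]], dtype=float)
lam, f = gene_stats(X)
print(lam.tolist())  # [2.0, 1.0, 0.5]
print(f.tolist())    # [0.5, 1.0, 0.25]
```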

Workflow Visualization

Raw scRNA-seq Data → Technical Noise Reduction (RECODE) → Feature Selection (GLP Method) → Batch Effect Correction (Harmony) → Foundation Model Training (Transformer) → Downstream Tasks (Fine-tuning)

Research Reagent Solutions

Table 3: Essential Computational Tools for scFM Training

| Tool/Resource | Primary Function | Application Context | Key Advantage |
| --- | --- | --- | --- |
| RECODE/iRECODE | Technical noise and batch effect reduction | Preprocessing for scFM training | Preserves full-dimensional data; parameter-free |
| GLP | Feature selection based on positive ratio | HVG selection for sparse data | Optimized LOESS regression minimizes overfitting |
| Harmony | Batch correction | Multi-dataset integration | Compatible with the iRECODE framework |
| Vitessce | Multimodal data visualization | Quality control and result interpretation | Integrates spatial and single-cell data |
| scGPT | Foundation model architecture | scFM training and fine-tuning | Supports multiple omics modalities |
| CZ CELLxGENE | Curated single-cell data | Pretraining data source | Standardized access to annotated datasets |

Advanced Technical Considerations

Evaluating Biological Relevance of scFM Embeddings

When assessing scFM performance beyond standard metrics, implement ontology-informed evaluation strategies:

  • scGraph-OntoRWR: Measures consistency between cell type relationships captured by scFMs and prior biological knowledge encoded in cell ontologies [4].

  • Lowest Common Ancestor Distance (LCAD): Quantifies ontological proximity between misclassified cell types to assess the severity of annotation errors [4].

  • Roughness Index (ROGI): Evaluates the smoothness of the cell-property landscape in the latent space, where smoother landscapes typically indicate better generalization capability [4].

Cross-Modality Applications

The RECODE platform extends beyond transcriptomics to epigenomic and spatial data modalities. For single-cell Hi-C data, RECODE effectively mitigates sparsity to reveal cell-specific chromatin interactions and topologically associating domains that align with bulk Hi-C counterparts [21]. Similarly, for spatial transcriptomics, integrated visualization tools like Vitessce enable correlative analysis of spatial localization and gene expression patterns [22].

Adaptive Model Selection Framework

Given that no single scFM consistently outperforms others across all tasks, implement a decision framework based on:

  • Dataset size: Traditional methods often suffice for smaller datasets (<10,000 cells)
  • Task complexity: scFMs show advantages for novel cell type discovery and cross-tissue analyses
  • Resource constraints: Consider computational requirements relative to available infrastructure
  • Biological interpretability: Assess need for mechanistic insights versus predictive accuracy [4]
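As a rough illustration, the framework above can be expressed as a decision rule. The thresholds and task labels in this sketch are illustrative assumptions, not prescriptions from the benchmark:

```python
def choose_model(n_cells, task, gpu_available):
    """Toy decision rule for adaptive model selection; thresholds and
    task labels are placeholder assumptions for illustration only."""
    if n_cells < 10_000 and task in {"clustering", "annotation"}:
        return "traditional"   # smaller datasets: simpler methods often suffice
    if not gpu_available:
        return "traditional"   # resource constraints favor lightweight baselines
    if task in {"novel_cell_type_discovery", "cross_tissue"}:
        return "scFM"          # scFMs show advantages on exploratory tasks
    return "benchmark_both"    # otherwise, compare both before committing
```

In practice, the cutoffs should come from your own benchmarking on representative data rather than these placeholder values.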

The Relationship Between HVGs and Foundation Model Architecture

Frequently Asked Questions (FAQs)

1. How does the choice of Highly Variable Genes (HVGs) impact the input structure of a single-cell foundation model (scFM)?

The selection of Highly Variable Genes (HVGs) is a fundamental pre-processing step that directly determines the "vocabulary" and input sequence for a transformer-based scFM. Unlike words in a language, genes in a cell have no inherent sequential order, so models must impose one. A common strategy is to rank genes by their expression levels within each cell, feeding the ordered list of top genes as a "sentence" for the model to process [5]. The number of HVGs selected (e.g., 1,200 or 2,048) defines the sequence length for each cell [4]. Different models employ various gene ordering strategies, and the choice of HVG set can influence how effectively the model learns biological relationships.

2. My scFM is not performing well on downstream tasks like cell type annotation. Could the HVG selection be a factor?

Yes, absolutely. The benchmark study by Li et al. (2025) found that no single scFM consistently outperforms others across all tasks, and simpler baseline methods can sometimes be more effective, particularly under resource constraints [4] [23]. If your model is underperforming, consider that the HVG set used during pre-training might not be optimal for your specific downstream dataset. The biological variation captured by a general-purpose HVG list may not align perfectly with the cell types or states in your target data. Evaluating the "biological relevance" of the embeddings using ontology-informed metrics can help diagnose this issue [4].

3. What is the relationship between a model's architecture and its need for value embeddings alongside gene token embeddings?

This is a key architectural consideration. Because scRNA-seq data provides an expression value for each gene, models must encode both the gene's identity (the "word") and its expression level (the "emphasis"). This is typically handled through a two-part input layer [4] [23]:

  • Gene Token Embedding: A lookup table that represents each gene's identity as a vector, analogous to a word embedding in NLP.
  • Value Embedding: A separate representation for the gene's expression value. Models use different strategies for this, such as value binning (discretizing the expression into categories) or value projection (creating an embedding based on the continuous value) [4]. This dual-embedding approach allows the transformer architecture to use its attention mechanisms to weight the importance of genes dynamically based on their context and expression level.
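A minimal sketch of this rank-and-bin input construction, assuming quantile-based value binning (the function name and binning details are illustrative, not any specific model's implementation):

```python
import numpy as np

def tokenize_cell(expr, gene_ids, n_bins=5, seq_len=4):
    """Illustrative rank-and-bin tokenization of one cell's profile."""
    order = np.argsort(expr)[::-1][:seq_len]   # rank genes by expression, keep top N
    tokens = gene_ids[order]                   # gene identity ("word") tokens
    nz = expr[expr > 0]                        # bin edges from nonzero values only
    edges = np.quantile(nz, np.linspace(0, 1, n_bins + 1))
    # value binning: discretize each kept gene's expression into a category
    values = np.clip(np.digitize(expr[order], edges[1:-1]), 0, n_bins - 1)
    return tokens, values

expr = np.array([0.0, 5.0, 2.0, 9.0, 1.0])
genes = np.array(["g0", "g1", "g2", "g3", "g4"])
tokens, values = tokenize_cell(expr, genes)
print(tokens.tolist())  # ['g3', 'g1', 'g2', 'g4']
print(values.tolist())  # [4, 3, 1, 0]
```

A real model would map both outputs through learned embedding tables and sum (or concatenate) them before the transformer layers.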

4. Are there scFMs that avoid the HVG selection problem altogether?

Some models are designed to use the entire genome rather than a pre-selected HVG list. For example, the scFoundation model is pretrained on nearly all human protein-encoding genes (19,264 genes) [4]. While this avoids the potential bias introduced by HVG selection, it comes at a significant computational cost and may require more sophisticated architectures or training strategies to handle the high dimensionality and sparsity of the data effectively.

Troubleshooting Guides

Problem: Poor Batch Integration Performance

Symptoms: After using an scFM for dataset integration, biological cell types remain clustered by batch (e.g., by patient or sequencing platform) instead of mixing seamlessly.

Potential Causes and Solutions:

| Step | Potential Cause | Diagnostic Check | Solution |
| --- | --- | --- | --- |
| 1 | HVG mismatch | Check whether the set of HVGs used in pre-training captures technical artifacts specific to the pre-training datasets. | Fine-tune the model on a small sample of your target data to adapt the gene representations. Alternatively, use a model like Nephrobase Cell+ that employs adversarial training to actively remove batch signals [24]. |
| 2 | Insufficient model pretraining | Check whether the pre-training corpus includes batch effects as diverse as yours. | Select a model pre-trained on massive, diverse datasets (e.g., >30 million cells) from multiple sources, as scale and diversity improve robustness [24]. |
| 3 | Suboptimal embeddings | Check whether the zero-shot cell embeddings from the scFM are batch-invariant. | Use the scFM embeddings as a starting point and apply a dedicated batch-integration tool like Harmony or Scanorama as a post-processing step [25]. |

Problem: Inaccurate Gene Perturbation Prediction

Symptoms: Your scFM fails to accurately predict gene expression changes following single or double genetic perturbations, performing worse than simple additive baselines.

Potential Causes and Solutions:

| Step | Potential Cause | Diagnostic Check | Solution |
| --- | --- | --- | --- |
| 1 | Limited perturbation knowledge | Check whether the model's pre-training data lacked sufficient perturbation examples to learn causal relationships; a recent benchmark found that simple linear models can outperform complex scFMs for this task [26]. | Consider using a baseline model or a linear model enhanced with gene embeddings extracted from an scFM [26]. |
| 2 | Ineffective gene embeddings | Check whether the gene-token embeddings adequately capture functional gene-gene relationships. | Extract the gene embedding matrix (G) from the scFM and use it to train a simpler predictive model; benchmarks show this can sometimes match or exceed the performance of the scFM's own decoder [26]. |

Experimental Protocols from Key Studies

Benchmarking ScFM Performance on Cell-Level Tasks

This protocol is adapted from the comprehensive benchmark study by Li et al. (2025) [4] [23].

Objective: To evaluate the quality of cell embeddings generated by different scFMs for tasks like batch integration and cell type annotation.

Materials:

  • Test Datasets: Five high-quality scRNA-seq datasets with manual annotations. These should include multiple sources of batch effects (e.g., inter-patient, inter-platform, inter-tissue variations).
  • scFMs for Testing: e.g., Geneformer, scGPT, UCE, scFoundation, LangCell, scCello.
  • Baseline Methods: Traditional approaches such as Highly Variable Genes (HVGs) selection, Seurat, Harmony, and scVI.
  • Evaluation Metrics: A suite of 12 metrics including:
    • Traditional: Clustering accuracy, Silhouette score.
    • Biology-Informed: scGraph-OntoRWR (measures consistency of captured cell type relationships with known biology), Lowest Common Ancestor Distance (LCAD) (measures severity of cell type misclassification).

Methodology:

  • Feature Extraction: For each scFM and baseline method, generate zero-shot cell embeddings from the test datasets.
  • Downstream Task Application: Apply the embeddings to specific cell-level tasks, such as:
    • Dataset Integration: Visualize embeddings using UMAP and assess batch mixing and biological conservation.
    • Cell Type Annotation: Train a simple classifier on the embeddings and evaluate its accuracy.
  • Evaluation: Score the performance of each model using the full set of evaluation metrics.
  • Ranking: Aggregate results using a non-dominated sorting algorithm to provide task-specific and overall model rankings.

Expected Output: A holistic ranking of scFMs, identifying the strengths and limitations of each for different biological applications. The study revealed that while scFMs are robust and versatile, simpler models can be more efficient for specific datasets [4].

Evaluating Gene Embeddings for Functional Relevance

Objective: To determine if the gene embeddings learned by an scFM capture meaningful biological relationships.

Materials:

  • Gene Embeddings: The gene-token embedding matrix extracted from the input layer of the scFM.
  • Reference Data: Known biological relationships from databases like Gene Ontology (GO).
  • Baseline Embeddings: e.g., Functional Representation of Gene Signatures (FRoGS), which learns gene embeddings from GO hypergraphs [23].

Methodology:

  • Embedding Extraction: For a set of common genes, obtain their vector representations from the scFM and the baseline method.
  • Similarity Calculation: Compute the cosine similarity between all pairs of gene embeddings within each method.
  • Functional Prediction Task: Use the gene embeddings to predict known biological relationships, such as GO term associations or tissue specificity.
  • Performance Comparison: Evaluate and compare the prediction accuracy of the scFM-derived embeddings against the baseline embeddings.
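The similarity-calculation step above amounts to a normalized matrix product; a generic sketch, not tied to any particular scFM:

```python
import numpy as np

def pairwise_cosine(E):
    """Cosine similarity between every pair of gene embeddings (rows of E)."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize rows
    return En @ En.T                                   # all pairwise dot products

# toy 3-gene embedding matrix: g0 and g1 point the same way, g2 is orthogonal
E = np.array([[1.0, 0.0],
              [2.0, 0.0],
              [0.0, 3.0]])
S = pairwise_cosine(E)
print(S.round(2))
```

High similarity between functionally related genes (relative to unrelated pairs) is the signal the downstream prediction task quantifies.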

Expected Output: Quantification of how well the scFM's intrinsic gene embeddings align with established biological knowledge, providing insight into the functional insights the model has learned during pre-training [23].

Model Architecture and HVG Processing Diagram

The diagram below illustrates how a single-cell foundation model transforms a cell's gene expression profile into a latent representation, highlighting the critical role of HVG selection and tokenization.

Input (single-cell expression profile): Raw Gene Expression → Select Top N Genes → Rank by Expression → Create Gene Sequence. The gene sequence feeds three parallel embeddings (Gene Token Embedding, Value Embedding, and Positional Embedding), which are combined into the input tokens for the Transformer Encoder, yielding the Latent Cell Embedding.

HVG Processing in scFM Architecture

Research Reagent Solutions

The following table details key computational tools and resources essential for working with single-cell foundation models and Highly Variable Genes.

| Resource Name | Type | Primary Function | Relevance to HVGs & Architecture |
| --- | --- | --- | --- |
| Geneformer [4] | Pre-trained scFM | A transformer model for cell and gene representation learning. | Uses a ranked list of 2,048 genes as input, demonstrating a specific HVG-based architecture. |
| scGPT [4] [5] | Pre-trained scFM | A generative transformer for single-cell biology. | Employs 1,200 HVGs and uses value binning for expression levels, illustrating an alternative input strategy. |
| scFoundation [4] [26] | Pre-trained scFM | A large model for gene expression and perturbation prediction. | Uses all ~19k protein-encoding genes, showcasing an architecture that bypasses HVG selection. |
| Nephrobase Cell+ [24] | Organ-specific scFM | A kidney-focused foundation model. | Pretrained on ~40M cells; its success suggests that specialized models can outperform general ones, with implications for HVG relevance in specific tissues. |
| CellxGene [5] | Data platform | Provides unified access to annotated single-cell datasets. | A primary source of diverse, high-quality data for model pre-training or benchmarking, crucial for defining robust HVG sets. |
| Seurat [25] | Analysis toolkit | A comprehensive R package for single-cell genomics. | Provides standard pipelines for HVG selection and serves as a common baseline for benchmarking scFMs. |
| Harmony [4] [25] | Integration algorithm | A tool for dataset integration. | Used as a post-processing step for scFM embeddings or as a baseline for the integration performance of scFMs. |

Practical Implementation: HVG Selection Methods and Integration with scFM Pipelines

FAQs on HVG Selection Principles and Best Practices

Q1: What is the core purpose of selecting Highly Variable Genes (HVGs) in single-cell RNA-seq analysis?

The primary purpose of HVG selection is to overcome the "curse of dimensionality" in single-cell RNA sequencing data by identifying a subset of genes that are most informative for distinguishing cell types or states. This process filters out genes that represent technical or biological noise, thereby enhancing the signal for downstream analyses such as clustering, dimensionality reduction, and cell type identification. Typically, only 3,000–5,000 of the tens of thousands of sequenced genes relate to cell-type-specific expression patterns, making HVG selection a critical pre-processing step to improve analytical resolution and accuracy [27].

Q2: For a multi-sample experiment, what is the recommended strategy to select HVGs that are robust across batches?

The recommended strategy for multi-sample experiments involves performing HVG selection on a per-batch basis and then identifying the consensus genes. This ensures the selected feature space is shared across samples. The methodology is as follows:

  • Compute HVGs separately for each batch using the batch_key parameter in your HVG selection function.
  • For each gene, note how many batches it was identified as an HVG.
  • Sort all genes by this count (highly_variable_nbatches).
  • Select the top N genes (e.g., 3000) that are most frequently variable across batches for downstream analysis [28]. This consensus approach is crucial for data integration tasks, as it focuses on a shared set of features, improving integration quality and subsequent analysis.
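The consensus step above can be sketched with pandas, assuming you already have a genes × batches table of per-batch HVG calls (equivalent to what `highly_variable_nbatches` summarizes); the function and variable names here are hypothetical:

```python
import pandas as pd

def consensus_hvgs(hvg_flags, n_top=3):
    """Pick the genes flagged as HVG in the most batches.
    hvg_flags: genes x batches boolean DataFrame of per-batch HVG calls."""
    nbatches = hvg_flags.sum(axis=1)  # per-gene count, cf. highly_variable_nbatches
    return nbatches.sort_values(ascending=False).head(n_top).index.tolist()

flags = pd.DataFrame(
    {"batch1": [True, True, False, True],
     "batch2": [True, False, False, False],
     "batch3": [True, True, False, False]},
    index=["GeneA", "GeneB", "GeneC", "GeneD"])
print(consensus_hvgs(flags, n_top=2))  # ['GeneA', 'GeneB']
```

With Scanpy, the per-batch calls come from `sc.pp.highly_variable_genes(adata, batch_key='batch')`, after which the same sort-and-take-top logic applies to `adata.var`.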

Q3: Can I use the same set of HVGs for different analysis tasks, such as clustering and integration?

While a single set of HVGs can be used for multiple tasks, the optimal strategy may vary. For integration, the consensus method described above is highly recommended. For clustering within a single, well-controlled dataset, standard HVG selection on the entire dataset might be sufficient. However, it's important to note that no single method is universally best. For instance, SCHS excels in reproducibility but favors highly expressed genes, while other methods like M3Drop select more lowly expressed genes, which can impact clustering results [11]. Researchers should align their HVG selection strategy with their primary analytical goal.

Troubleshooting Common HVG Selection Issues

Q1: The tool I'm using (e.g., Seurat) is not returning the expected number of HVGs, even though I specified the nFeatures parameter. Why?

This is a documented issue that can occur in specific workflows. For example, in Seurat, this behavior has been observed when the RNA assay is split into multiple layers (e.g., by a batch key) before running FindVariableFeatures. The underlying cause may be related to how the function interacts with the split assay object. As a workaround, you can try running the HVG selection on an unsplit object first or ensure you are using the latest version of the software, as this may be a resolved bug. Always check the number of variable features stored in the output object to confirm the function's behavior [29].

Q2: My downstream clustering results are poor or do not resolve known cell populations. Could the HVG selection be the cause?

Yes, the choice of HVG selection method can significantly impact clustering resolution and accuracy. Different methods have biases; for example, some may overlook lowly expressed but biologically critical genes. If clustering performance is unsatisfactory, consider these steps:

  • Re-evaluate your HVG method: Switch to a method known for higher accuracy in your biological context. The SIEVE method, for example, was developed to improve robustness and accuracy by minimizing stochastic noise [11].
  • Check for batch effects: If your data contains multiple batches, ensure you are using a batch-aware HVG selection strategy. Using HVGs selected without considering batches can lead to batch effects dominating the biological signal.
  • Explore method-specific diagnostics: Some methods, like SCHS, show high reproducibility, meaning the same genes are consistently selected across subsamples of your data. Low reproducibility in your chosen method can lead to unstable clustering results [11].

Q3: I am using an integrated object for clustering. Should I re-select HVGs after integration?

No, it is generally not sensible to re-select HVGs based on the integrated or corrected data. Highly Variable Gene detection methods are designed and calibrated for raw (or normalized) count data, which contains the technical and biological variation they are meant to discern. Integration methods like scVI explicitly remove unwanted technical variation (e.g., batch effects) to create a corrected expression matrix. Applying standard HVG selection on this "cleaned" data will not capture the intended sources of variation and is not part of standard analytical workflows [28].

Performance Comparison of HVG Methods

The table below summarizes the performance characteristics of various HVG selection methods based on an evaluation using scRNA-seq data from hematopoietic stem/progenitor cells and mature blood cells.

Table 1: Characteristics and Performance of HVG Selection Methods

| Method | Reproducibility | Key Strengths | Key Limitations | Bias in Gene Expression Level |
| --- | --- | --- | --- | --- |
| SCHS | High | High reproducibility and accuracy [11] | Prefers selection of highly expressed genes [11] | Prefers highly expressed genes [11] |
| Seurat (VST, SCT, DISP) | Medium | Good distinguishing capability for similar cell types [11] | Moderate reproducibility [11] | Selects a mix, including ~25% lowly expressed genes [11] |
| Scran | Low to Medium | Good distinguishing capability [11] | Lower reproducibility; lower cluster purity [11] | Selects almost no lowly expressed genes [11] |
| M3Drop | Low | Can identify lowly expressed variable genes [11] | Lowest distinguishing capability and classification accuracy [11] | Selects a mix, including ~25% lowly expressed genes [11] |
| ROGUE | Low to Medium | - | Lower reproducibility; lower cluster purity [11] | Selects almost no lowly expressed genes [11] |
| Scmap | Low to Medium | - | Lower reproducibility; lower cluster purity [11] | Prefers highly expressed genes [11] |
| SIEVE | High (by design) | High robustness; improves cell classification accuracy; recovers lowly expressed variable genes [11] | Computationally intensive due to multiple rounds of sampling [11] | Mitigates bias, recovering genes across expression levels [11] |

Table 2: Impact on Downstream Analysis (Based on HSPC and Mature Blood Cell Data)

| Method | Cluster Purity | Classification Accuracy (HSPCs) | Classification Accuracy (Mature Cells) |
| --- | --- | --- | --- |
| SCHS | >90% | ~85–90% | >90% |
| Seurat | >90% | ~85–90% | >90% |
| Scran | ~90% (slightly inferior) | ~85–90% | >90% |
| M3Drop | >90% | Lowest | Lowest |
| ROGUE | ~90% (slightly inferior) | ~85–90% | >90% |
| Scmap | ~90% (slightly inferior) | ~85–90% | >90% |
| SIEVE | >90% | Substantially improved | Substantially improved |

Experimental Protocols for Key HVG Methods

Protocol 1: Standard HVG Selection with Seurat

This protocol describes a standard workflow for identifying HVGs on a single-cell dataset using Seurat.

  • Normalization: Normalize the raw count data to account for sequencing depth using NormalizeData. This typically involves log-normalization.
  • Selection: Run the FindVariableFeatures function. You must specify the following:
    • nfeatures: The number of genes to select (e.g., 3000).
    • selection.method: The specific algorithm to use (e.g., "vst", "sctransform", or "dispersion").
  • Validation: The selected HVGs are stored in the Seurat object. You can access them with VariableFeatures(object) and visualize the selection using VariableFeaturePlot.

Protocol 2: Robust Multi-Batch HVG Selection with Scanpy

This protocol is essential for datasets comprising multiple batches or samples and is a critical precursor to data integration.

  • Per-Batch HVG Calculation: Use the sc.pp.highly_variable_genes(adata, batch_key='batch') function in Scanpy. This calculates HVGs within each batch independently and stores a count of how many batches each gene was variable in (highly_variable_nbatches).
  • Consensus Gene Selection: Identify the consensus HVGs by selecting genes that are variable in the most batches.

  • Subsetting: Subset the AnnData object to these consensus HVGs before proceeding with integration or joint analysis [28].

Protocol 3: The SIEVE Strategy for Robust HVG Identification

SIEVE is a meta-strategy that can be applied to existing HVG methods to improve their robustness.

  • Random Sampling: Perform multiple rounds (e.g., 50) of random sampling. In each round, randomly select a subset of cells (e.g., 70%) to serve as the reference set.
  • HVG Selection on Subsets: In each round, apply your chosen base HVG selection method (e.g., Seurat's VST) to the reference set to identify a set of HVGs.
  • Identify Consensus HVGs: Across all rounds, compute how frequently each gene is selected as an HVG. The final robust set of HVGs is composed of genes with the highest selection frequency. This process minimizes stochastic noise and identifies a core set of variable genes that are consistently detected, substantially improving downstream classification accuracy [11].
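The three steps above can be sketched compactly, with a variance-based toy selector standing in for the base HVG method (the parameter defaults and function names are illustrative):

```python
import numpy as np

def sieve(X, base_hvg, n_rounds=20, frac=0.7, n_top=10, seed=0):
    """SIEVE-style meta-strategy sketch: subsample cells repeatedly, run
    a base HVG selector on each subsample, and keep the genes selected
    most often. base_hvg(X_sub) returns gene indices."""
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    counts = np.zeros(n_genes, dtype=int)
    for _ in range(n_rounds):
        cells = rng.choice(n_cells, size=int(frac * n_cells), replace=False)
        counts[base_hvg(X[:, cells])] += 1   # tally per-round selections
    return np.argsort(counts)[::-1][:n_top]  # consensus HVGs

def top_var(Xs):
    """Toy base selector standing in for e.g. Seurat's VST: top 20 by variance."""
    return np.argsort(Xs.var(axis=1))[::-1][:20]

rng = np.random.default_rng(1)
X = rng.poisson(1.0, size=(100, 60)).astype(float)
X[0, ::2] = 50.0                      # make gene 0 strongly variable
hvgs = sieve(X, top_var, n_rounds=10)
print(0 in hvgs)  # True
```

Any base method (Seurat VST, SCHS, scran) can be dropped in for `top_var`, which is how the published strategy is meant to be used.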

Workflow Diagrams for HVG Selection

SIEVE Strategy Workflow

Start with full dataset → multiple rounds of random sampling (70% of cells) → apply base HVG method (e.g., Seurat, SCHS) → tally gene selection frequency across the per-round HVG lists → select top genes by frequency → robust HVG set for downstream analysis.

Multi-Batch HVG Selection Process

Multi-batch dataset → normalize data → run HVG selection per batch → sort genes by highly_variable_nbatches → select top N genes → consensus HVGs for integration.

The Scientist's Toolkit: Essential Research Reagents & Solutions

Table 3: Key Computational Tools for HVG Selection and Evaluation

Tool / Resource Function in HVG Research Key Application
Seurat A comprehensive toolkit for single-cell analysis. Provides multiple embedded HVG selection methods (VST, SCT, DISP). Standardized preprocessing and HVG selection for clustering and trajectory inference [11].
Scanpy A Python-based toolkit for analyzing single-cell gene expression data. Mirrors the functionality of Seurat. HVG selection, especially in multi-batch scenarios, and integration with other Python-based ML tools [30] [28].
SCHS A method for identifying HVGs based on the spatial distribution of cells. Selecting a reproducible set of variable genes, particularly useful when consistency across subsamples is a priority [11].
SIEVE A strategy, not a single algorithm, that uses multiple rounds of random sampling to identify robust HVGs. Improving the robustness and accuracy of any base HVG method, leading to better single-cell classification [11].
scran A package for low-level analyses of single-cell RNA-seq data. Provides its own method for HVG selection. An alternative approach to HVG selection, often used in comparative benchmarks [11].
Human Phenotype Ontology (HPO) A standardized vocabulary of phenotypic abnormalities. While not directly for HVG selection, it is crucial for phenotype-based prioritization in diagnostic variant discovery following single-cell analysis [31].

Batch-Aware Feature Selection for Multi-Dataset Integration

FAQs

1. What is batch-aware feature selection and why is it critical for single-cell foundation model (scFM) training? Batch-aware feature selection is a computational strategy that identifies informative genes (features) for downstream analysis while explicitly accounting for non-biological technical differences between datasets, known as "batch effects." In the context of scFM training, which uses vast amounts of single-cell RNA sequencing (scRNA-seq) data, this is crucial because technical variation can confound true biological signals [6] [32]. Selecting features without considering batch effects can lead to a model that learns technical artifacts rather than underlying biology, compromising its performance on tasks like cell type annotation, data integration, and query mapping [6]. Proper batch-aware feature selection ensures the scFM learns robust, generalizable biological principles.

2. My integrated dataset shows good batch mixing but poor separation of known cell types. What might be the cause? This is a common challenge indicating that the integration or feature selection process may have been too aggressive, removing biological variation along with technical noise [32]. Specifically:

  • Over-correction via KL Regularization: In conditional Variational Autoencoder (cVAE) models, increasing the Kullback–Leibler (KL) divergence regularization strength to force batch integration can indiscriminately remove both batch and biological information, leading to a loss of cell type definition [32] [33].
  • Aggressive Adversarial Learning: Methods that use adversarial learning to align batch distributions can sometimes incorrectly merge distinct but proportionally unbalanced cell types across batches (e.g., mixing acinar and immune cells) to achieve statistical indistinguishability [32] [33]. A potential solution is to use more advanced integration methods like sysVI, which combines a VampPrior and cycle-consistency constraints, as it has been shown to improve batch correction while better preserving biological signals [32].

3. How does the number of selected features impact integration and downstream mapping tasks? The number of features selected is a critical parameter. Benchmarks show that the performance of integration and mapping is sensitive to this number [6].

  • Integration Metrics: Most metrics evaluating batch effect removal and conservation of biological variation are positively correlated with the number of selected features.
  • Mapping Metrics: In contrast, metrics assessing the quality of mapping a new query dataset to a reference are often negatively correlated with the feature count. This may be because smaller feature sets can produce noisier, more mixed integrations where mapping a query cell somewhere within its broad, mixed population is easier [6]. Therefore, there is a trade-off, and the optimal number may depend on the primary goal of your analysis (e.g., building a reference atlas versus mapping queries to it). It is recommended to benchmark different feature set sizes for your specific application [6].

Troubleshooting Guides

Problem: Poor Data Integration After Batch-Aware Feature Selection

Symptoms:

  • Cells cluster strongly by batch instead of by cell type in visualizations like UMAP.
  • Low scores on batch correction metrics (e.g., low iLISI scores [32]).
  • Inability to transfer labels accurately from a reference to a query dataset [6].

Investigation & Resolution Flowchart

Start: poor data integration.
  1. Check input data quality — if cDNA yield is low or background is high, see the "Low Library Yield" guide below.
  2. Assess the feature selection method — if using simple HVG selection, switch to a batch-aware method.
  3. Evaluate the integration algorithm — if using a basic cVAE with a high KL weight, try sysVI (VampPrior + cycle-consistency).
  4. Verify downstream analysis parameters — confirm the number of features used and benchmark different sizes.

Diagnostic Steps:

  • Verify Input Data Quality:

    • Action: Check the quality control metrics for each batch individually. Look for signs of library preparation issues, such as abnormally low library yield or high levels of adapter contamination, which can create insurmountable batch effects [34].
    • Resolution: If problems are found, consult the "Low Library Yield" troubleshooting guide below and consider re-preparing libraries if necessary.
  • Assess Feature Selection Method:

    • Action: Confirm you are using a batch-aware feature selection method. Common practice is to use Highly Variable Gene (HVG) selection. For stronger batch effects, ensure the HVG method accounts for batch.
    • Resolution: As demonstrated in benchmarks, using a batch-aware variant of a standard HVG method (like the scanpy-Cell Ranger method) is effective [6]. Avoid simple random gene selection or using all genes.
  • Evaluate the Integration Algorithm:

    • Action: Identify which integration algorithm you are using and understand its limitations. Standard methods (including basic cVAE) can struggle with "substantial batch effects" found across different biological systems (e.g., species) or technologies (e.g., single-cell vs. single-nuclei) [32].
    • Resolution: For substantial batch effects, consider newer methods like sysVI, a cVAE-based model that uses VampPrior and cycle-consistency. It has been shown to provide better batch correction while preserving biological variation compared to simply tuning KL regularization or using adversarial learning [32] [33].
  • Verify Downstream Analysis Parameters:

    • Action: Re-visit the number of features selected for the analysis.
    • Resolution: Perform a sensitivity analysis. The benchmark by [6] suggests that the number of features significantly impacts results. Test a range of values (e.g., 500 to 5000) to find the optimum for your data and goal.
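The sensitivity analysis above is just a loop over candidate feature counts. A minimal sketch, where gene_scores comes from any base HVG method and evaluate is a user-supplied scoring callback (e.g., wrapping an scIB metric pipeline); both names are hypothetical:

```python
def sweep_feature_counts(gene_scores, evaluate, sizes=(500, 1000, 2000, 5000)):
    """For each candidate feature count, take the top-scoring genes and
    score the resulting analysis with `evaluate`. Returns {size: score}
    so the optimum for a given goal can be read off directly."""
    ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
    return {k: evaluate(ranked[:k]) for k in sizes if k <= len(ranked)}
```

Because integration and mapping metrics correlate with feature count in opposite directions [6], it is worth running the sweep once per metric family rather than optimizing a single aggregate score.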
Problem: Low Library Yield in Single-Cell RNA-seq Experiments

Symptoms:

  • Final cDNA or library concentrations are well below expectations.
  • Electropherogram traces show broad or faint peaks, missing target fragment sizes, or a dominant peak of adapter dimers (~70-90 bp) [34].

Diagnosis and Solutions:

Table 1: Common Causes and Corrective Actions for Low Library Yield

Category Root Cause Corrective Action
Sample Input / Quality Degraded RNA or contaminants (phenol, salts) inhibiting enzymes. Re-purify input sample; use fluorometric quantification (Qubit) over absorbance; ensure high purity (260/230 > 1.8) [34].
Fragmentation & Ligation Inefficient ligation due to poor enzyme activity or incorrect adapter-to-insert ratio. Titrate adapter:insert ratios; ensure fresh ligase/buffer; optimize fragmentation parameters [34].
Amplification / PCR Too few PCR cycles or enzyme inhibitors in the reaction. Re-amplify from leftover ligation product; avoid over-cycling which causes duplicates and bias [34].
Purification & Cleanup Overly aggressive size selection or bead cleanup leading to sample loss. Use correct bead-to-sample ratio; avoid over-drying beads; ensure adequate washing without excessive sample loss [34] [35].

Proactive Prevention:

  • Run Pilot Experiments: Test a few samples and controls to optimize conditions before processing valuable samples [35].
  • Use Controls: Always include a positive control with RNA input mass similar to your cells (e.g., 1-10 pg) and a negative control (e.g., mock FACS buffer) to distinguish experimental from technical issues [35].
  • Practice Good Technique: Wear gloves, use low-binding plasticware, maintain separate pre- and post-PCR workspaces, and be meticulous during bead cleanup steps to minimize sample loss and contamination [35].

Experimental Protocols

Protocol: Benchmarking Feature Selection and Integration Workflow

This protocol is adapted from large-scale benchmarking studies [6] to evaluate the impact of feature selection on scRNA-seq data integration and query mapping.

1. Data Preprocessing:

  • Input: Multiple scRNA-seq datasets (count matrices).
  • Steps:
    • Perform quality control (QC) and normalization separately for each batch [36].
    • Subset all datasets to a common set of gene features.
    • Rescale batches using multiBatchNorm() or similar to adjust for differences in sequencing depth [36].

2. Feature Selection:

  • Method: Apply different feature selection strategies to be benchmarked.
    • A. Highly Variable Genes (HVGs): Use the scanpy or Seurat algorithm. For batch-aware selection, use a variant that computes HVGs per batch and aggregates the results [6] [36].
    • B. Negative Controls: Select 500 random genes or 200 stably expressed genes (using scSEGIndex) [6].
  • Parameter: Test a range of feature set sizes (e.g., 500, 1000, 2000, 5000).
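Step 2 can be made concrete as a helper that builds every feature set to benchmark, including the random-gene negative control. This is a sketch with a hypothetical function name; gene_scores can come from any of the methods in A:

```python
import random

def build_feature_sets(gene_scores, sizes=(500, 1000, 2000, 5000),
                       n_random=500, seed=0):
    """gene_scores: {gene: HVG score} from any base method.
    Returns {label: [genes]}: one top-k set per requested size plus a
    random-gene negative control, as in the benchmarking protocol [6]."""
    ranked = sorted(gene_scores, key=gene_scores.get, reverse=True)
    sets = {f"hvg_{k}": ranked[:k] for k in sizes if k <= len(ranked)}
    rng = random.Random(seed)  # fixed seed keeps the control reproducible
    sets["random_control"] = rng.sample(list(gene_scores),
                                        min(n_random, len(gene_scores)))
    return sets
```

Each labeled set then feeds one integration run in Step 3, so results remain directly comparable across feature selection strategies.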

3. Data Integration:

  • Tool: Choose one or more integration methods. For standard batches, methods like Harmony, Seurat, or scVI are common. For substantial batch effects, consider sysVI [32].
  • Action: Integrate the reference datasets using each selected feature set from Step 2.

4. Performance Evaluation:

  • Action: Calculate a suite of metrics on the integrated data. The table below summarizes key metrics from benchmarks [6].

Table 2: Key Metrics for Evaluating Integration and Mapping Performance

Category Metric Description What a Good Score Indicates
Batch Correction iLISI (Integration LISI) Measures diversity of batches in a cell's neighborhood [32]. High score: Batches are well-mixed.
Batch PCR (Batch Principal Component Regression) Quantifies the variance explained by batch in the latent space [6]. Low score: Less technical variation.
Biology Preservation cLISI (Cell-type LISI) Measures diversity of cell labels in a cell's neighborhood [6]. High score: Cell types are distinct.
bNMI (Batch-balanced NMI) Compares clustering similarity to cell labels, balanced across batches [6]. High score: Biological groups are conserved.
Query Mapping Cell Distance Average distance between query cells and their nearest reference neighbors [6]. Low score: Query cells map precisely to reference.
mLISI (Mapping LISI) Assesses mixing of query and reference cells in local neighborhoods [6]. High score: Query and reference are well-integrated.
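The LISI-family metrics in Table 2 all reduce to an inverse Simpson's index over the categorical labels (batch, cell type, or query/reference origin) found in a cell's local neighborhood. A minimal per-neighborhood version, omitting the perplexity-based neighbor weighting used by full LISI implementations:

```python
def inverse_simpson(labels):
    """Inverse Simpson's index of a list of categorical labels.
    1.0 means a single label; k means k perfectly balanced labels."""
    n = len(labels)
    counts = {}
    for lab in labels:
        counts[lab] = counts.get(lab, 0) + 1
    simpson = sum((c / n) ** 2 for c in counts.values())
    return 1.0 / simpson
```

With batch labels, a well-mixed two-batch neighborhood scores near 2 (good iLISI) and a single-batch neighborhood scores 1; with cell-type labels the desired direction flips, which is why iLISI and cLISI must be read together.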
Protocol: Implementing sysVI for Substantial Batch Effects

This protocol outlines the use of sysVI, a method designed for challenging integrations [32] [33].

1. Installation and Setup:

  • Tool: Access sysVI through the scvi-tools Python package [32].
  • Input: An AnnData object containing your multi-batch scRNA-seq data.

2. Model Configuration:

  • Key Features: sysVI enhances a standard cVAE with two components:
    • VampPrior: Replaces the standard Gaussian prior with a mixture of posteriors, which can better capture multi-modal data distributions and improve biological preservation [32].
    • Cycle-Consistency Loss: Encourages that translating a cell's expression from one batch to another and back again reconstructs the original expression, helping to preserve biological identity during integration [32].

3. Execution:

  • Train the sysVI model on your multi-batch dataset.
  • Obtain the integrated latent representation from the model for downstream analysis like clustering and visualization.

4. Validation:

  • Use the metrics in Table 2 to validate the integration quality, paying close attention to the balance between iLISI (batch mixing) and cLISI/bNMI (biology preservation).

The Scientist's Toolkit

Table 3: Essential Computational Tools & Resources for scFM Research

Resource Name Type Primary Function Relevance to Batch-Aware Analysis
scanpy [6] Python Package Scalable single-cell analysis. Provides implementations for standard HVG selection and preprocessing.
scvi-tools [32] Python Package Probabilistic models for scRNA-seq. Hosts scalable integration methods like scVI and sysVI for substantial batch effects.
batchelor [36] R/Bioconductor Package Methods for correcting batch effects. Implements fast and efficient batch correction algorithms like MNN.
Seurat [37] R Package Single-cell genomics analysis. Offers a comprehensive integration workflow, including anchor-based integration.
CZ CELLxGENE [5] Data Platform Curated collection of single-cell datasets. Provides a unified source of high-quality, annotated data essential for scFM pretraining and benchmarking.
Harmony [37] Algorithm / Package Data integration method. A popular and efficient method for integrating datasets across technical batches.

Gene Module-Based Approaches for Enhanced Biological Signal

Frequently Asked Questions

Q1: What are the main advantages of using foundation models over traditional methods for single-cell data analysis? Single-cell foundation models (scFMs) are robust and versatile tools that learn universal biological knowledge from massive datasets during pretraining. This endows them with emergent abilities for zero-shot learning and efficient adaptation to various downstream tasks, such as cell type annotation, batch integration, and drug sensitivity prediction. However, for specific tasks with limited data or resources, simpler machine learning models can sometimes be more efficient and effective [4] [5].

Q2: How can I select highly variable genes (HVGs) effectively for my scFM training? Traditional HVG selection methods can be challenged by the high sparsity and dropout noise of scRNA-seq data. The GLP (LOESS with positive ratio) method provides a robust alternative by identifying biologically informative genes through the relationship between a gene's positive ratio (the fraction of cells where it is detected) and its average expression level. Genes with expression levels significantly higher than expected for their positive ratio are selected, which helps preserve key biological signals for downstream analysis [3].

Q3: Why is my model failing to identify rare cell types or subtle biological signals? This is a common challenge, often stemming from how features are selected. Standard HVG methods may overrepresent highly abundant cell types and miss less abundant ones. The performance is closely tied to dataset size; with larger and more diverse pilot datasets, the proportions of cells in each cluster become more similar to the ground-truth data. Using feature selection methods specifically designed to capture nuanced biological information, like GLP, can improve the detection of rare cell types [38] [3].

Q4: Can I incorporate prior biological knowledge, like gene networks, to improve my model's performance? Yes, integrating known biological networks can significantly increase the power to identify biologically relevant signals. Methods like Markov Random Field (MRF) models appropriately accommodate gene network information as well as dependencies among cell types. This allows the model to borrow information across related genes and cell types, leading to more statistically powerful and biologically insightful identification of features like cell-type-specific differentially expressed genes [39].

Troubleshooting Guides

Issue 1: Poor Model Generalization Across Datasets

Symptoms:

  • Model performs well on training data but poorly on new, unseen datasets.
  • High performance variability across different biological conditions or technologies.

Solutions:

  • Ensure Diverse Pretraining: If building a foundation model, pretrain on large-scale, diverse datasets that encompass many cell types, tissues, and conditions. Platforms like CZ CELLxGENE provide access to tens of millions of cells for this purpose [5].
  • Benchmark Your Model: Use a holistic benchmarking framework to evaluate your scFM against established baselines. Evaluate performance across multiple tasks (e.g., batch integration, cell annotation) and using multiple metrics to understand its strengths and limitations [4].
  • Check Feature Selection: The choice of highly variable genes can impact generalizability. Employ robust feature selection methods like GLP that are less sensitive to technical noise [3].
Issue 2: High Technical Noise Obscuring Biological Signal

Symptoms:

  • Clustering results are driven by batch effects rather than biological cell types.
  • Inability to distinguish true biological zeros in expression from technical dropouts.

Solutions:

  • Leverage Foundational Models: Use the zero-shot embeddings from scFMs, as they have been shown to be robust tools for integrating heterogeneous datasets and mitigating technical noise [4].
  • Refine Feature Selection: Adopt advanced feature selection methods that explicitly model or are robust to data sparsity. The GLP method, for instance, uses the positive ratio as a precise estimator to distinguish biological signals from technical noise [3].
  • Utilize Network Information: Integrate gene network information using models like MRF. This allows the model to distinguish true biological signals from random noise by considering the coordinated behavior of functionally related genes [39].
Issue 3: Inefficient or Uninformative Feature Selection

Symptoms:

  • Downstream analyses (clustering, trajectory inference) yield poor results.
  • Selected gene modules do not align with known biological pathways.

Solutions:

  • Implement GLP Algorithm:
    • Compute the average expression (λ) and positive ratio (f) for each gene.
    • Model the relationship between f (independent variable) and λ (dependent variable) using LOESS regression with an optimized bandwidth selected by the Bayesian Information Criterion (BIC) to prevent overfitting.
    • Perform a two-step regression: the first step identifies outlier genes using Tukey’s biweight method, and the second step reruns LOESS while assigning zero weight to these outliers.
    • Select genes whose actual average expression level is significantly above the regression-predicted value [3].
  • Incorporate External Knowledge: Use gene modules derived from known biological pathways or protein-protein interaction networks as priors in your model, as done in network-based differential expression analysis [39].
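The GLP steps above can be sketched in plain Python. This is an illustrative simplification: the binned-median trend stands in for GLP's BIC-tuned LOESS fit, and the fixed margin stands in for its Tukey-biweight outlier handling; the function name is hypothetical:

```python
def glp_like_selection(avg_expr, pos_ratio, n_bins=10, margin=1.5):
    """Sketch of GLP-style selection: model average expression (λ) as a
    function of positive ratio (f) and keep genes well above the trend.
    avg_expr: {gene: λ}; pos_ratio: {gene: f in [0, 1]}."""
    def bin_of(g):
        return min(int(pos_ratio[g] * n_bins), n_bins - 1)
    # Bin genes by positive ratio; the per-bin median λ is the trend
    # (crude stand-in for the optimized LOESS curve).
    bins = {}
    for g in avg_expr:
        bins.setdefault(bin_of(g), []).append(avg_expr[g])
    trend = {b: sorted(v)[len(v) // 2] for b, v in bins.items()}
    # Keep genes whose λ exceeds the trend for their f by the margin factor.
    return [g for g in avg_expr if avg_expr[g] > margin * trend[bin_of(g)]]
```

The reference implementation at https://github.com/WangyuchenCS/GLP should be used for real analyses; the sketch only makes the λ-versus-f logic explicit.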

Experimental Protocols & Data

Table 1: Key Evaluation Metrics for scFMs and Feature Selection
Metric Description Application Context
scGraph-OntoRWR [4] Measures consistency of cell type relationships captured by the model with prior biological knowledge from ontologies. Evaluating biological relevance of scFM embeddings.
Lowest Common Ancestor Distance (LCAD) [4] Measures ontological proximity between misclassified cell types; a smaller distance indicates a less severe error. Benchmarking cell type annotation accuracy.
Adjusted Rand Index (ARI) [38] [3] Measures the similarity between two data clusterings (e.g., from synthetic vs. real data). Evaluating clustering performance in downstream analysis.
Silhouette Coefficient [3] Measures how similar a cell is to its own cluster compared to other clusters. Assessing the quality of clustering outcomes.
Roughness Index (ROGI) [4] Quantifies the smoothness of the cell-property landscape in the latent space; a smoother landscape is easier for downstream modeling. Serving as a proxy for model performance on a specific dataset.
Table 2: Research Reagent Solutions for scFM Workflows
Reagent / Resource Function in Analysis Key Reference/Source
CZ CELLxGENE [5] A unified platform providing access to over 100 million curated single cells for model pretraining and benchmarking. https://cellxgene.cziscience.com/
GLP Algorithm [3] A robust feature selection method to identify highly variable genes by modeling the relationship between positive ratio and average expression. https://github.com/WangyuchenCS/GLP
MRFscRNAseq R Package [39] Implements a Markov Random Field model to identify cell-type-specific differentially expressed genes by incorporating gene network information. Available on GitHub
PEREGGRN Benchmarking Platform [40] A software platform for fairly evaluating expression forecasting methods on a collection of perturbation transcriptomics datasets. Associated with Genome Biology (2025)

Workflow and Pathway Visualizations

scFM Training and Eval Workflow

Raw scRNA-seq data → tokenization → self-supervised pretraining → latent embeddings → downstream tasks.

GLP Gene Selection Logic

Expression matrix → calculate λ and f for each gene → model λ vs. f with optimized LOESS → select genes above the predicted curve → selected HVGs.

Troubleshooting Guides & FAQs

Common Experimental Issues and Solutions

Problem Category Specific Issue Possible Causes Solution Related Analysis Step
Data Quality High dropout rates in scRNA-seq data Low RNA input, inefficient cDNA amplification [41] Use Unique Molecular Identifiers (UMIs) and spike-in controls; employ computational imputation [41] HVG Selection, Clustering
Batch effects between sequenced and spatial data Technical variation from different experimental batches [41] Apply batch correction algorithms (e.g., Combat, Harmony, Scanorama) [41] Data Integration
Integration Weak linkage between modalities (e.g., protein & RNA) Few correlated features, low signal-to-noise ratio [42] Use iterative integration methods (e.g., MaxFuse) that use all features for co-embedding [42] Cross-Modal Integration
Incorrect cell type matching Poor initial alignment, over-reliance on highly variable genes [42] Implement fuzzy smoothing on linked features and use linear assignment for matching [42] Cell Type Annotation
Feature Selection HVG list contains technical noise High sparsity and dropout events masking biological variation [3] Use GLP method modeling positive ratio vs. expression level with optimized LOESS [3] HVG Selection for scFM Training
Selected genes fail to capture key biology Assumptions of mean-variance trend do not hold [2] [3] Quantify biological component of variation using modelGeneVar() or spike-in trends [2] Downstream Analysis
Computational scFM predictions have low positive predictive value "Open-loop" model not refined with experimental data [17] Fine-tune foundation model with perturbation data ("closed-loop" ISP) [17] In Silico Perturbation

Detailed Methodologies

Closed-Loop In Silico Perturbation (ISP) Fine-Tuning

Purpose: To significantly improve the positive predictive value (PPV) of a single-cell foundation model (scFM) like Geneformer by incorporating experimental data [17].

Procedure:

  • Fine-tune Base Model: Start with a scFM pre-trained on a large corpus (e.g., Geneformer). Fine-tune it to classify your cell states of interest (e.g., diseased vs. control HSCs) using a standard scRNA-seq dataset [17].
  • Incorporate Perturbation Data: Further fine-tune this model by adding single-cell RNA sequencing data from CRISPR activation/interference screens (e.g., Perturb-seq). The training labels should be the cell's activation status, not the identity of the perturbed gene [17].
  • Perform ISP: Use the fine-tuned model to perform in silico perturbations across the gene set. The model will predict which gene perturbations can shift a cell from a diseased state to a control-like state [17].
  • Validation: Benchmarks show this closed-loop approach can increase PPV three-fold (e.g., from 3% to 9%) while also improving sensitivity and specificity [17].
Cross-Modal Integration with MaxFuse for Weak Linkage

Purpose: To accurately integrate data from two weakly linked modalities, such as targeted spatial proteomics and whole-transcriptome scRNA-seq [42].

Procedure:

  • Input Matrices: Prepare two pairs of matrices for the two modalities (e.g., Protein 'Y' and RNA 'Z'):
    • All-Feature Matrices: Cell-by-all-features (e.g., all proteins in panel, all genes).
    • Linked-Feature Matrices: Cell-by-features with one-to-one correspondence (e.g., a protein and its coding gene) [42].
  • Stage 1 - Initial Matching:
    • Compute a fuzzy nearest-neighbor graph within each modality using all features.
    • Apply "fuzzy smoothing" to the linked features, shrinking each cell's values towards its neighborhood average.
    • Perform an initial cross-modal cell matching using linear assignment on the smoothed features [42].
  • Stage 2 - Iterative Refinement:
    • Iterate until matching quality stabilizes:
      • Learn a linear joint embedding of the matched cells using all features.
      • Treat the embedding coordinates as new "linked features" and apply fuzzy smoothing.
      • Update the cell matching via linear assignment on the new smoothed coordinates [42].
  • Stage 3 - Final Output:
    • Retain high-quality matches as "pivots."
    • Use pivots to create a final joint embedding and propagate matches to unmatched cells [42].

The Scientist's Toolkit: Key Research Reagents & Materials

Item Function in Experiment Application Context
Unique Molecular Identifiers (UMIs) Tags individual mRNA molecules to correct for amplification bias and quantify absolute transcript counts [41]. scRNA-seq Library Prep
Spike-In Controls (e.g., ERCC) Exogenous RNA transcripts added to samples to monitor technical noise and help model gene variation [2]. scRNA-seq Quality Control
10X Genomics Chromium / Visium Platform for droplet-based single-cell RNA sequencing and spatially resolved transcriptomics [43] [41]. scRNA-seq & SRT Library Generation
BD Rhapsody Single-Cell Analysis System Another platform for whole transcriptome analysis at single-cell resolution, used in spaceflight studies [43]. scRNA-seq
CRISPRa/i Perturb-seq Library Enables large-scale genetic perturbation screens coupled with single-cell RNA readout, providing data for scFM fine-tuning [17]. Closed-Loop ISP Validation
CITE-seq Antibody Panel Allows for simultaneous measurement of surface proteins and transcriptome in single cells, creating a linked dataset [42]. Multi-Modal Integration
Cell Hashing Oligonucleotides Labels cells from different samples with unique barcodes, allowing for sample multiplexing and identification of cell doublets [41]. Sample Multiplexing & QC
CODEX Multiplexed Antibody Panel Enables highly multiplexed spatial proteomics imaging, which can be integrated with transcriptomic data [42]. Spatial Proteomics

HVG Selection Methods for scFM Training

Method Core Principle Key Metric(s) Key Considerations for scFM Training
GLP (Genes by LOESS & Positive Ratio) [3] Identifies genes whose average expression is significantly higher than expected based on their positive ratio (fraction of cells expressing the gene). Deviation from optimized LOESS curve of λ vs. f [3] Directly models dropout rate, which is a more precise population parameter than variance. Helps select biologically informative genes in sparse data [3].
modelGeneVar (scran) [2] Fits a mean-variance trend to log-normalized expression values across all genes. The biological component is total variance minus the technical component. Biological component of variation [2] Assumes most genes are driven by uninteresting noise. Can be inflated if many genes at an abundance are biologically variable [2].
modelGeneVar with Spike-Ins [2] Fits a mean-dependent trend to the variance of spike-in transcripts to better estimate the technical component. Biological component of variation [2] Provides a cleaner estimate of technical noise, but requires spike-in data and assumes they mimic technical variation of endogenous genes [2].
VST (Seurat) [3] Uses a variance stabilizing transformation based on a generalized linear model of the mean-variance relationship. Standardized variance [3] A widely used and robust method that is a standard benchmark in the field [3].

Workflow and Pathway Visualizations

MaxFuse Cross-Modal Integration Workflow

Weakly linked modalities (e.g., protein & RNA) → Stage 1, initial matching: build all-feature nearest-neighbor graphs → fuzzy smoothing on linked features → linear assignment for initial cell matching. Stage 2, iterative refinement (repeat until convergence): learn a joint embedding from current matches → fuzzy smoothing on embedding coordinates → update cell matching via linear assignment. Stage 3, final output: select high-quality pivot matches → create final joint embedding → propagate matches to unmatched cells → integrated dataset for downstream analysis.

Closed-Loop scFM Fine-Tuning for ISP

Pre-trained scFM (e.g., Geneformer) → fine-tune on perturbation data (CRISPRa/i Perturb-seq) → fine-tune on cell state data (e.g., disease vs. control scRNA-seq) → perform in silico perturbation (ISP) → list of high-confidence target genes → experimental validation. If accuracy is unsatisfactory, incorporate the experimental outcomes back into fine-tuning and repeat; once satisfactory, the result is a validated virtual cell model.

The selection of Highly Variable Genes (HVGs) is a critical preprocessing step in single-cell RNA sequencing (scRNA-seq) analysis, directly influencing the performance of downstream tasks such as clustering, data integration, and the training of single-cell foundation models (scFMs) [16] [6]. This guide addresses common challenges and provides practical solutions for integrating robust HVG selection into scFM training workflows, framed within the context of advanced research in gene selection methodologies.

Why HVG Selection Matters for scFMs

Single-cell foundation models require high-quality, informative input features to learn meaningful biological representations. HVGs—genes that exhibit significant cell-to-cell variation—are prioritized because they are most likely to represent interesting biological heterogeneity rather than technical noise [16]. Selecting HVGs:

  • Reduces computational complexity and memory requirements by focusing on informative genes.
  • Enhances model performance by emphasizing genes that drive biological signal.
  • Mitigates the impact of technical artifacts and batch effects [6].

The Researcher's Toolkit: Key Reagents & Computational Tools

Table 1: Essential Tools and Resources for HVG Selection and scFM Training

| Category | Tool/Resource | Primary Function | Key Consideration |
|---|---|---|---|
| HVG Selection Methods | scanpy (Seurat-like), scran, BASiCS [1] | Identifies genes with high biological variability | No single best method; consider hybrid approaches [44] |
| Integration & scFM Training | scVI [45], scANVI [46], scGPT [4] | Deep learning models for integration and foundation model training | Performance depends on quality of input features [4] |
| Benchmarking & Evaluation | scIB [6], scGraph-OntoRWR [4] | Metrics for integration quality and biological relevance | Evaluate both batch correction and biological conservation [6] |
| Data Resources | CellxGene [4], PanglaoDB [46] | Curated cell type markers and reference datasets | Crucial for annotation and validation |

Standardized Workflow for HVG Selection and scFM Training

The following diagram illustrates a robust workflow for integrating HVG selection into scFM training, designed to handle complex, multi-batch datasets.

[Diagram] Input: multi-batch scRNA-seq dataset → Quality control & normalization → Per-batch HVG identification → Select consensus HVGs (shared across batches) → Subset data to consensus HVGs → Train single-cell foundation model (scFM) → Evaluate model (integration & biology) → Output: trained scFM for downstream tasks. Process phases: preprocessing, model training, validation.

Frequently Asked Questions & Troubleshooting

FAQ 1: How should I handle HVG selection when integrating datasets with substantial batch effects?

Problem: Datasets from different technologies, species, or laboratories show strong batch effects, and standard HVG selection fails, leading to poor integration.

Solution: Implement a batch-aware consensus HVG selection strategy.

  • Step-by-Step Protocol:
    • Identify HVGs per batch: Use the batch_key parameter in scanpy.pp.highly_variable_genes() to compute HVGs separately for each batch or system (e.g., species) [45].
    • Rank by frequency: For each gene, count how many batches it was identified as an HVG (highly_variable_nbatches) [28].
    • Select consensus genes: Sort genes by this count and select the top N (e.g., 2000-3000) genes that are highly variable across the maximum number of batches [28] [47].
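The counting-and-ranking logic of this protocol can be sketched in a few lines. This is a minimal pure-Python illustration; in a real workflow, `scanpy.pp.highly_variable_genes(adata, batch_key="batch")` records the per-batch count in `adata.var["highly_variable_nbatches"]`, and the per-batch HVG lists below are hypothetical stand-ins for that output.

```python
# Minimal sketch of batch-aware consensus HVG ranking (pure Python).
from collections import Counter

def consensus_hvgs(per_batch_hvgs, n_top):
    """Rank genes by how many batches call them an HVG; return the top n_top
    (ties broken alphabetically for reproducibility)."""
    counts = Counter(g for hvgs in per_batch_hvgs for g in set(hvgs))
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:n_top]

batches = [
    ["CD3D", "MS4A1", "NKG7", "LYZ"],  # HVGs identified in batch 1
    ["CD3D", "MS4A1", "GNLY"],         # HVGs identified in batch 2
    ["CD3D", "LYZ", "GNLY"],           # HVGs identified in batch 3
]
print(consensus_hvgs(batches, n_top=3))  # CD3D leads: HVG in all 3 batches
```

In practice N would be 2,000-3,000 rather than 3, but the ranking principle is the same.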

FAQ 2: My datasets fail to integrate well even after HVG selection. What can I do?

Problem: After standard HVG selection, batches remain separated in the integrated embedding.

Troubleshooting Steps:

  • Investigate gene-specific effects: Perform differential expression analysis between batches that are not integrating. An overabundance of genes from a specific family (e.g., RPS genes) can indicate protocol-specific artifacts [47].
  • Try alternative feature selection:
    • Use randomly selected genes as a baseline to determine if the HVG selection itself is the problem [47].
    • Consider using the entire genome if you have a sufficiently large number of cells [47].
  • Adjust model architecture: For models like scVI, increasing the model complexity (e.g., using 2 layers instead of 1) can sometimes help capture more complex batch effects [47].
  • Use batch-aware integration methods: Employ deep learning integration methods like scVI or sysVI that are explicitly designed to handle batch effects as a covariate [45] [46].
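The random-gene baseline from the steps above is simple to set up; the sketch below is a hypothetical illustration (gene names and set sizes are arbitrary).

```python
# Hypothetical sketch: build a reproducible random-gene baseline feature set
# to test whether HVG selection itself is causing poor integration.
import random

def random_gene_baseline(all_genes, n_genes, seed=0):
    """Reproducible random gene set to serve as a feature-selection control."""
    rng = random.Random(seed)
    return sorted(rng.sample(list(all_genes), n_genes))

genes = [f"gene{i}" for i in range(10000)]
baseline = random_gene_baseline(genes, 2000)
# Run the integration once with your HVG set and once with `baseline`:
# if both embeddings look equally poor, HVG selection is not the bottleneck.
```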

FAQ 3: How many HVGs should I select, and which method should I use?

Problem: The choice of the number of HVGs and the selection method seems arbitrary, and performance varies.

Evidence-Based Guidance:

  • Number of HVGs: The selection is often arbitrary, but common practice uses 2,000-5,000 genes [16] [6]. Benchmarking shows that the number of selected features correlates with integration metrics; more features generally improve performance up to a point, but can negatively impact query mapping [6].
  • Selection Method: A systematic evaluation of 47 methods found that no single baseline HVG method consistently outperforms all others [44]. Hybrid methods that combine top-ranked features from multiple baseline methods (e.g., mixHVG) demonstrate more robust performance [44].
  • Recommendation: Do not rely on a single method. For critical analyses, test a few different HVG selection methods and numbers of genes, and evaluate integration quality using metrics like batch ASW or iLISI [6].
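One simple way to combine top-ranked features from multiple baseline methods is rank aggregation. The sketch below is illustrative and in the spirit of hybrid methods such as mixHVG, not the published implementation: each method supplies a ranking, and a gene is scored by its best rank across methods.

```python
# Illustrative rank-aggregation sketch for hybrid HVG selection.
def hybrid_hvgs(rankings, n_top):
    """rankings: lists of genes ordered most-to-least variable, one per method.
    A gene's score is its best (lowest) rank across methods."""
    best = {}
    for ranking in rankings:
        for rank, gene in enumerate(ranking):
            if gene not in best or rank < best[gene]:
                best[gene] = rank
    return sorted(best, key=lambda g: (best[g], g))[:n_top]

dispersion_rank = ["G1", "G2", "G3", "G4"]  # e.g., a dispersion-based method
residual_rank = ["G5", "G1", "G4", "G2"]    # e.g., a residual-variance method
print(hybrid_hvgs([dispersion_rank, residual_rank], n_top=3))
```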

FAQ 4: How do I evaluate if my HVG selection + scFM pipeline is successful?

Problem: It is unclear how to quantitatively assess the quality of the integrated embedding generated by the scFM.

Comprehensive Evaluation Metrics: A robust evaluation should assess both batch effect removal and conservation of biological variation [6] [48]. The table below summarizes key metrics.

Table 2: Key Metrics for Evaluating scFM Output After HVG Selection

| Evaluation Category | Metric | What It Measures | Ideal Outcome |
|---|---|---|---|
| Batch Effect Removal | Batch ASW [6] | How well batches are mixed within cell neighborhoods | Higher score |
| Batch Effect Removal | iLISI (Integration LISI) [6] | Likelihood of a cell's neighbors coming from multiple batches | Higher score |
| Biological Conservation | cLISI (Cell-type LISI) [6] | Likelihood of a cell's neighbors being of the same cell type | Higher score |
| Biological Conservation | Isolated Label F1 [6] | How well rare cell types are preserved after integration | Higher score |
| Biological Insight (for scFMs) | scGraph-OntoRWR [4] | Consistency of cell-type relationships in the embedding with known biology (e.g., cell ontology) | Higher score |
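To make batch ASW concrete, here is a toy illustration of the silhouette computation it is built on. Distances are 1-D for simplicity, and the scaling follows the common scIB convention of reporting a batch score based on 1 − |silhouette|, so that well-mixed batches (silhouette near 0) score close to 1; real evaluations use the scib package on the full embedding.

```python
# Toy silhouette-based batch ASW: higher means better-mixed batches.
def silhouette(points, labels):
    def mean_dist(p, group):
        return sum(abs(p - q) for q in group) / len(group)
    scores = []
    for i, (p, lab) in enumerate(zip(points, labels)):
        same = [q for j, (q, l) in enumerate(zip(points, labels)) if l == lab and j != i]
        other = [q for q, l in zip(points, labels) if l != lab]
        if not same or not other:
            continue
        a, b = mean_dist(p, same), mean_dist(p, other)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def batch_asw(points, batch_labels):
    return 1 - abs(silhouette(points, batch_labels))  # ~1 when batches mix well

mixed = [0.0, 0.1, 0.2, 0.3]        # batches interleaved in the embedding
separated = [0.0, 0.1, 5.0, 5.1]    # batches form distinct clusters
print(batch_asw(mixed, ["b1", "b2", "b1", "b2"]))
print(batch_asw(separated, ["b1", "b1", "b2", "b2"]))  # much lower
```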

FAQ 5: I have a small dataset. Can I still effectively train an scFM using HVGs?

Problem: Foundation models typically require large amounts of data, but my dataset is limited.

Solutions and Considerations:

  • Leverage Pre-trained scFMs: Many existing scFMs (e.g., Geneformer, scGPT) are pre-trained on millions of cells. The recommended approach is to fine-tune these models on your dataset using the standard HVG selection workflow [4].
  • Use HVGs from a Reference Atlas: If your small dataset is part of a larger biological system (e.g., a specific tissue), identify HVGs from a large, public reference atlas of that tissue. Use this shared HVG set to subset your data before training or fine-tuning.
  • Benchmark Simpler Models: For small-scale studies, simpler machine learning models applied to a well-chosen HVG set can sometimes outperform large, complex foundation models, especially under computational constraints [4]. Evaluate whether an scFM is necessary for your specific task.

Advanced Technical Note: HVG Selection for Cross-System Integration

For exceptionally challenging integrations, such as across species (mouse/human) or different technologies (scRNA-seq vs. snRNA-seq), a stricter HVG protocol is required. The sysVI model recommends this workflow [45]:

  • Preprocess systems separately: Normalize and log-transform each system (e.g., species) independently.
  • Find shared genes: Start with the intersection of genes present in all systems.
  • Select HVGs per system: Using within-system batches as batch_key, identify HVGs for each system independently.
  • Take the final intersection: The features used for integration are the HVGs that are shared across all systems. This typically results in a robust set of ~2000 genes [45].
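The set logic of this protocol can be sketched as follows. This is a minimal illustration of the sysVI-style feature selection, with hypothetical per-system gene and HVG lists; the actual per-system HVG calls would come from `scanpy.pp.highly_variable_genes` with within-system batches as `batch_key`.

```python
# Sketch: features for cross-system integration = intersection of per-system
# HVG sets, restricted to genes present in all systems.
def cross_system_hvgs(genes_per_system, hvgs_per_system):
    shared = set.intersection(*(set(g) for g in genes_per_system))
    hvg_intersection = set.intersection(*(set(h) for h in hvgs_per_system))
    return sorted(hvg_intersection & shared)

mouse_genes = ["Cd3d", "Ms4a1", "Nkg7", "Lyz1"]
human_genes = ["Cd3d", "Ms4a1", "Nkg7", "Gnly"]  # orthologs in a common namespace
mouse_hvgs = ["Cd3d", "Nkg7", "Lyz1"]
human_hvgs = ["Cd3d", "Nkg7", "Gnly"]
print(cross_system_hvgs([mouse_genes, human_genes], [mouse_hvgs, human_hvgs]))
```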

[Diagram] Mouse & human datasets → Normalize & log-transform separately → Find intersection of genes present in both → Select HVGs within each species → Final HVG set = intersection of HVGs → Train sysVI/scVI model on shared HVGs → Integrated cross-species embedding.

FAQs: Tissue Atlases and Single-Cell Foundation Models

This section addresses frequently asked questions about the role of tissue atlases in single-cell research, with a focus on selecting highly variable genes (HVGs) for training single-cell foundation models (scFMs).

Q1: How can tissue atlases improve the selection of highly variable genes for scFM training? Tissue atlases provide a foundational reference for understanding gene expression patterns across diverse tissues and cell types. When selecting HVGs, researchers can use atlas data to prioritize genes that show biologically meaningful variation, such as those with high tissue specificity, rather than technical noise. For instance, the miRNATissueAtlas uses a Tissue Specificity Index (TSI) to classify RNAs, a concept that can be directly applied to gene selection for scFMs [49] [50]. By integrating TSI values, you can filter your gene list to include those with documented biological variability, thereby improving the signal captured by your scFM.

Q2: What are the consequences of poor HVG selection on scFM performance? Benchmarking studies reveal that the choice of input features significantly impacts scFM performance on downstream tasks [4]. Poor HVG selection can lead to:

  • Poor Cell Type Annotation: Models struggle to distinguish between cell types if key marker genes are missing.
  • Ineffective Batch Integration: Technical batch effects may dominate the latent representation if HVGs capture noise over biological signal.
  • Limited Biological Insight: The model's embeddings may fail to capture meaningful biological relationships, as measured by ontology-based metrics like scGraph-OntoRWR [4]. Essentially, the model cannot learn the "language" of cells if provided with an uninformative vocabulary.

Q3: Are complex scFMs always better than simpler models for tasks based on tissue atlas data? No, a key finding from recent benchmarks is that no single scFM consistently outperforms others across all tasks [4]. The decision to use a complex scFM versus a simpler machine learning model depends on:

  • Dataset Size: Simpler models can be more efficient and perform just as well on smaller, focused datasets.
  • Task Complexity: For novel tasks like predicting responses to unseen drug perturbations, scFMs may have an advantage due to their broad pretraining.
  • Computational Resources: Training and fine-tuning scFMs are computationally intensive [4] [5]. The choice should be guided by a trade-off between expected performance gain and resource cost.

Q4: How can I validate that my scFM has learned biologically relevant features from tissue atlas data? Beyond standard performance metrics, you can use ontology-informed metrics to assess biological relevance:

  • scGraph-OntoRWR: This novel metric evaluates whether the relationships between cell types learned by the scFM are consistent with established biological knowledge in cell ontologies [4].
  • Lowest Common Ancestor Distance (LCAD): This metric assesses the severity of cell type misclassification by measuring the ontological distance between the predicted and true cell type. A smaller LCAD indicates a less severe error [4].

Troubleshooting Guides for Atlas-Based Research

This guide helps diagnose and resolve common issues encountered when utilizing tissue atlases or building upon their data.

Problem: Inability to Replicate Tissue-Specific Findings from an Atlas

  • Potential Cause 1: Differences in Data Processing.
    • Solution: Ensure your processing pipeline matches the atlas's methodology. For example, the miRNATissueAtlas and protein association atlas both rely on uniformly processed data [49] [51]. Standardize your gene identifiers, normalization techniques, and batch correction methods to align with the atlas.
  • Potential Cause 2: Underestimated Inter-tissue Variability.
    • Solution: Confirm that your analysis accounts for tissue context. The protein association atlas found that over 25% of protein-protein associations are tissue-specific, many driven by cell-type-specific structures like synapses, not just gene expression [51] [52]. Always use the most tissue-relevant data available.

Problem: scFM Fails to Generalize to a New Perturbation or Disease Dataset

  • Potential Cause 1: Data Leakage During Training.
    • Solution: Implement a strict data splitting strategy where no perturbation condition appears in both training and test sets. The PEREGGRN benchmarking platform uses this method to properly evaluate a model's ability to predict effects of novel perturbations [40].
  • Potential Cause 2: Inadequate Pretraining Data Coverage.
    • Solution: Fine-tune your model on data that is more specific to your target domain. If working on lung disease, incorporating data from a specialized resource like the lung disease perturbation atlas could provide the necessary context for the model to adapt [53].

Problem: Low Accuracy in Predicting Gene Expression Changes from Perturbations

  • Potential Cause: Over-reliance on Simple Baselines.
    • Solution: Systematically benchmark your forecasting method. Studies show that it is uncommon for complex expression forecasting methods to outperform simple baselines across diverse contexts [40]. Use platforms like PEREGGRN to compare your method's Mean Absolute Error (MAE) and Spearman correlation against dummy predictors on multiple datasets.

Experimental Protocols from Key Case Studies

Case Study 1: Constructing a Tissue-Specific Protein-Protein Interaction Atlas [51] [52]

  • Objective: To create an atlas of protein-protein associations across 11 human tissues, enabling the prioritization of candidate disease genes.
  • Methodology:
    • Data Compilation: Collect protein abundance data from 7,811 proteomic samples (tumor and adjacent healthy tissue) from 50 public studies.
    • Coabundance Calculation: For each study, compute the Pearson correlation of normalized protein abundance for every protein pair.
    • Probability Conversion: Convert correlation coefficients to association probabilities using a logistic model. Known protein complexes from the CORUM database are used as ground-truth positives.
    • Score Aggregation: Aggregate probabilities from cohorts of the same tissue into a single tissue-level association score.
    • Validation: Validate predictions using orthogonal methods such as cofractionation experiments, brain-derived pulldown data, and AlphaFold2 modeling.

The workflow for this protein association analysis is summarized in the diagram below:

[Diagram] Compile proteomic data (7,811 samples from 11 tissues) → Preprocess & normalize protein abundance → Calculate protein coabundance (Pearson correlation) → Convert to association probability (logistic model with CORUM complexes) → Aggregate tissue-level scores → Orthogonal validation (cofractionation, pulldown, AlphaFold2).
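The core scoring steps, coabundance as a Pearson correlation followed by a logistic mapping to a probability, can be sketched as below. The logistic coefficients here are hypothetical placeholders; in the study, the model is fit against CORUM complexes as ground-truth positives.

```python
# Sketch of coabundance scoring: Pearson correlation of protein abundances,
# mapped to a pseudo-probability with a (hypothetical) logistic model.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def association_probability(r, intercept=-2.0, slope=4.0):
    """Hypothetical logistic model; real coefficients are fit on CORUM."""
    return 1 / (1 + math.exp(-(intercept + slope * r)))

prot_a = [1.0, 2.0, 3.0, 4.0]
prot_b = [1.1, 2.2, 2.9, 4.3]  # strongly coabundant pair
r = pearson(prot_a, prot_b)
print(round(r, 2), round(association_probability(r), 2))
```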

Case Study 2: Benchmarking Single-Cell Foundation Models [4]

  • Objective: To evaluate the performance of six scFMs against established baselines on biologically and clinically relevant tasks.
  • Methodology:
    • Model and Task Selection: Evaluate six scFMs (e.g., Geneformer, scGPT) on two gene-level and four cell-level tasks, including batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction.
    • Zero-Shot Evaluation: Use the pretrained models to generate gene and cell embeddings without further fine-tuning (zero-shot) for the initial assessment.
    • Novel Metric Application: Evaluate model outputs using novel ontology-informed metrics like scGraph-OntoRWR (to measure consistency with biological knowledge) and LCAD (to measure severity of cell type misclassification).
    • Holistic Ranking: Rank model performance using a non-dominated sorting algorithm that aggregates multiple evaluation metrics to guide task-specific model selection.

Case Study 3: Large-Scale Lung Disease Perturbation Screening [53]

  • Objective: To discover new therapeutic targets and cellular circuits for lung diseases by profiling responses to pharmacological interventions.
  • Methodology:
    • Model System: Use human lung ex-vivo tissue slice cultures from normal donors and patients with chronic lung disease.
    • Perturbation: Apply 900 different pharmacological interventions to the tissue cultures.
    • Single-Cell Profiling: Use Parse Biosciences' GigaLab platform, based on Evercode chemistry, for massive-scale single-cell RNA sequencing to measure transcriptomic responses.
    • AI Model Training: Utilize the generated large-scale dataset to train foundational AI models for understanding gene regulation in lung health and disease.

The conceptual pipeline for this drug perturbation atlas is as follows:

[Diagram] Human lung tissue slices (normal & disease) → Apply 900 pharmacological interventions → Single-cell RNA sequencing (Parse Biosciences GigaLab) → Generate perturbation atlas (single-cell transcriptomic profiles) → Train foundational AI models for target discovery.


Table 1: Key Features of Recent Tissue and Interaction Atlases

| Atlas Name | Data Type | Scale | Key Application | Reference / Access |
|---|---|---|---|---|
| miRNATissueAtlas 2025 [49] [50] | 9 sncRNA classes | 61,593 samples (human & mouse); 224 human tissues | Tissue specificity index (TSI) calculation; cross-species comparison | https://web.ccb.uni-saarland.de/mirnatissueatlas_2025/ |
| Protein-Protein Interaction Atlas [51] [52] | Protein coabundance | 7,811 samples; 11 human tissues; 116M protein pairs | Prioritizing candidate disease genes in a tissue-specific context | www.ppiatlas.com |
| Human Protein Atlas v25 [54] | Protein expression & localization | All protein-coding genes; 10M+ images; 34 scRNA-seq tissues | Spatial proteomics; disease blood protein profiling; interaction networks | https://www.proteinatlas.org/ |
| Lung Disease Perturbation Atlas [53] | scRNA-seq post-perturbation | 900 pharmacological interventions on human lung tissue | Identifying therapeutic targets and regenerative circuits | In development (Helmholtz Munich) |

Table 2: scFM Performance on Key Tasks (Synthesized from Benchmarking Studies) [4]

| Task | Performance Insight | Key Metric(s) | Recommendation for HVG Selection |
|---|---|---|---|
| Cell Type Annotation | Performance varies; scFMs do not always beat baselines. Error severity can be assessed. | Accuracy, LCAD | Select HVGs with known cell-type specificity from atlases to improve accuracy. |
| Batch Integration | scFMs are generally robust, but simpler methods can be competitive. | Local Inverse Simpson's Index (LISI) | Ensure HVGs are not driven by batch-specific technical artifacts. |
| Biological Relevance | Pretrained scFM embeddings capture meaningful biological relationships. | scGraph-OntoRWR | Prioritize HVGs that are central in gene regulatory networks. |
| Drug Sensitivity Prediction | A clinically relevant task where scFM generalization can be tested. | AUPRC, MSE | Incorporate pathway-specific genes from disease atlases into the feature set. |

Table 3: Key Research Reagent Solutions for Atlas Construction and scFM Training

| Item / Resource | Function | Example Use Case |
|---|---|---|
| CORUM Database [51] | A curated database of experimentally characterized protein complexes. | Used as a ground-truth reference for training and validating protein-protein association predictions [51]. |
| Cell Ontology | A structured, controlled vocabulary for cell types. | Enables the use of metrics like LCAD and scGraph-OntoRWR to evaluate the biological plausibility of scFM outputs [4]. |
| Parse Biosciences Evercode / GigaLab [53] | A scalable single-cell RNA sequencing platform based on combinatorial barcoding. | Used for generating massive perturbation datasets, such as the lung disease atlas, with reduced batch effects [53]. |
| Olink & SomaScan Assays [54] | High-throughput proteomics platforms for measuring protein levels in biofluids. | Used in the Human Protein Atlas to build the Human Disease Blood resource, profiling 71 diseases [54]. |
| AlphaFold3 [54] | A deep learning model for highly accurate protein structure prediction. | Used to predict structures for thousands of protein-protein interactions within the Human Protein Atlas [54]. |
| PEREGGRN Benchmarking Platform [40] | A software platform for fairly evaluating expression forecasting methods on unseen genetic perturbations. | Prevents data leakage and provides a standardized way to compare new forecasting methods against simple baselines [40]. |

Advanced Strategies for Optimizing HVG Selection in Complex Scenarios

Frequently Asked Questions

Q: What are Highly Variable Genes (HVGs) and why is their selection a critical step in scRNA-seq analysis?

A: Highly Variable Genes (HVGs) are those that show considerable variation in expression across the single cells in your dataset. Selecting them is a pivotal step because these genes are often the main drivers of meaningful biological heterogeneity, such as differences between cell types or states. Focusing on HVGs helps to reduce the data dimensionality, decrease computational noise, and enhance the signal for downstream analyses like clustering and trajectory inference [16] [2].

Q: How do I determine the optimal number of Highly Variable Genes to use for my analysis?

A: There is no universal "correct" number of HVGs; the optimal number is dataset-dependent and involves a trade-off between retaining biological signal and introducing noise. A common heuristic is to select between 2,000 and 5,000 HVGs [16]. The best practice is to use a data-driven approach by ranking genes based on a measure of their biological variability and then selecting a cut-off where the ranking starts to be dominated by technical noise rather than biological signal. Many analysis workflows, such as the one in Seurat, use a default of 3,000 HVGs [13]. Performance can be evaluated using downstream metrics like silhouette width or the accuracy of known cell type separation [55].
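The rank-and-cut logic described above can be sketched minimally. Real pipelines rank by the *biological* component of variance after fitting a mean-variance trend (e.g., scran's modelGeneVar or SCTransform's residual variance); this toy version ranks by raw variance only to show the cutoff mechanics, and the expression values are hypothetical.

```python
# Minimal sketch: rank genes by variance, keep the top N.
from statistics import pvariance

def top_hvgs_by_variance(expr, n_top):
    """expr: dict mapping gene -> list of per-cell normalized expression."""
    ranked = sorted(expr, key=lambda g: -pvariance(expr[g]))
    return ranked[:n_top]

expr = {
    "housekeeping": [5.0, 5.1, 4.9, 5.0],  # stable gene, low variance
    "marker":       [0.0, 8.0, 0.1, 7.9],  # bimodal, highly variable
    "noise":        [1.0, 1.5, 0.8, 1.2],
}
print(top_hvgs_by_variance(expr, n_top=1))  # the bimodal marker ranks first
```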

Q: What are the consequences of selecting too many or too few HVGs?

A: The number of HVGs selected has a direct impact on your results.

  • Too few HVGs: You risk excluding biologically important genes, which can lead to an oversimplified view of the data and the failure to identify rare or subtle cell subpopulations.
  • Too many HVGs: You include genes with high variation that is primarily due to technical noise. This can obscure the true biological signal, reduce the performance of downstream clustering, and increase computational time.

Q: My downstream clustering seems driven by technical artifacts like cell cycle phase. Did I choose the wrong number of HVGs?

A: Not necessarily. While an improper HVG count can exacerbate this, the issue often lies in the data normalization step. Technical variation from sources like cell cycle, mitochondrial read percentage, or sequencing depth can confound biological differences. It is recommended to check and, if necessary, regress out these nuisance variables during the normalization and HVG selection process using methods like SCTransform in Seurat [13]. This ensures that the selected HVGs reflect interesting biological variation.


Troubleshooting Guides

Problem: Inconsistent Clustering Results When Varying the Number of HVGs

Description: The cell clusters identified change significantly when you increase or decrease the number of HVGs used, leading to instability in your biological interpretation.

Solution:

  • Systematic Exploration: Perform your clustering pipeline (e.g., PCA, graph-based clustering) using a range of HVG numbers (e.g., 1,000, 2,000, 3,000, 5,000).
  • Evaluate Cluster Stability: Use metrics like the silhouette width to assess the compactness and separation of clusters at each HVG set size [55].
  • Leverage Biological Priors: Check if known cell-type-specific marker genes are present and correctly clustered in each scenario. A stable, biologically interpretable result across a range of HVG numbers increases confidence.
  • Select a Plateau: Often, the performance metrics will improve and then plateau. Choosing a number of HVGs at the beginning of this plateau is a robust strategy.
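The plateau heuristic from the last step can be expressed as: choose the smallest HVG count whose evaluation metric is within a small tolerance of the best observed value. The metric values below are hypothetical; in practice they would come from silhouette width or another cluster-quality measure computed per HVG set size.

```python
# Sketch of the "select a plateau" heuristic for choosing an HVG count.
def smallest_count_on_plateau(results, tol=0.02):
    """results: list of (n_hvgs, metric) pairs; higher metric is better.
    Returns the smallest count within `tol` of the best metric."""
    best = max(m for _, m in results)
    return min(n for n, m in results if m >= best - tol)

results = [(1000, 0.41), (2000, 0.55), (3000, 0.61), (5000, 0.62)]
print(smallest_count_on_plateau(results))  # 3000: start of the plateau
```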

Problem: Failure to Identify a Known or Rare Cell Population

Description: A cell type that you expect to be present based on prior knowledge or marker genes does not form a distinct cluster in your analysis.

Solution:

  • Increase the HVG Count: The marker genes for the rare population might not have ranked highly enough to be included in your initial HVG list. Try increasing the number of HVGs to 4,000 or 5,000 to capture these weaker but biologically important signals [16].
  • Inspect Gene Rankings: Manually check the ranking of known marker genes in your HVG list. If they are not highly ranked, investigate if they were removed during quality control or if their variation was modeled incorrectly.
  • Validate with Markers: After increasing the HVG count, confirm that the expression of the known markers now defines a distinct cluster.

Quantitative Comparison of HVG Selection Methods

The following table summarizes the characteristics of different statistical models used to quantify per-gene variation and select HVGs. The choice of model influences which genes are prioritized.

| Method | Underlying Model | Key Feature | Best Suited For |
|---|---|---|---|
| ModelGeneVar [2] | Fits a trend to the variance of log-normalized values across all genes. | Separates total variance into technical (uninteresting) and biological (interesting) components. | General purpose analysis where most genes are not differentially expressed. |
| ModelGeneVarWithSpikes [2] | Fits a trend to the variance of spike-in transcripts. | Uses spike-ins to directly model technical noise without biological contamination. | Datasets with reliably added spike-in controls. |
| ModelGeneVarByPoisson [2] | Assumes UMI counts exhibit near-Poisson technical noise. | Constructs a technical trend based on a Poisson distribution assumption. | UMI-based datasets (e.g., 10x Genomics) without spike-in controls. |
| sctransform [13] | Regularized negative binomial regression. | Directly models and removes technical variation (e.g., sequencing depth), returning residuals as normalized data. | A modern, robust method recommended for UMI data that avoids overfitting. |
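The Poisson-based decomposition in the table can be illustrated on raw counts: under a Poisson noise model the technical variance of a gene equals its mean, so the biological component can be approximated as total variance minus the mean. This is a conceptual toy version only; modelGeneVarByPoisson fits a full mean-variance trend on log-normalized data rather than subtracting per gene.

```python
# Toy Poisson-style variance decomposition: biological = total - mean.
from statistics import mean, pvariance

def decompose_variance(counts):
    mu = mean(counts)
    total = pvariance(counts)
    technical = mu  # Poisson assumption: variance equals the mean
    biological = max(0.0, total - technical)
    return total, technical, biological

stable = [4, 5, 6, 5, 4, 6]        # roughly Poisson-like: no biological excess
variable = [0, 12, 1, 14, 0, 15]   # overdispersed: real biological signal
for name, counts in [("stable", stable), ("variable", variable)]:
    total, tech, bio = decompose_variance(counts)
    print(name, round(total, 2), round(tech, 2), round(bio, 2))
```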

Experimental Protocol: A Standard Workflow for HVG Selection with Seurat

This protocol outlines the steps for normalizing data and identifying HVGs using the SCTransform method within the popular Seurat package, which accounts for technical confounders.

1. Prerequisite: Quality Control

  • Begin with a filtered Seurat object where low-quality cells (based on low UMI counts, high mitochondrial read percentage, or outlier gene counts) have been removed [56].

2. Normalization & HVG Selection with SCTransform

  • The SCTransform function performs normalization, variance stabilization, and HVG selection in a single step.
  • Crucially, it allows you to regress out unwanted sources of variation. Common variables to regress out include mitoRatio (mitochondrial gene percentage) and, if identified as a major source of variation, cell cycle scores [13].

  • By default, SCTransform will rank genes by residual variance and output the 3,000 most variable genes, which are stored in the "SCT" assay of the Seurat object [13].

3. (Optional) Cell Cycle Scoring

  • Before SCTransform, it is good practice to check if cell cycle phase is a major source of variation.
  • Normalize data using NormalizeData and then score cells for S and G2/M phase using pre-defined gene lists with CellCycleScoring.
  • Visualize the cells via PCA colored by Phase. If the cells do not separate strongly by phase, cell cycle may not need to be regressed out [13].

4. Downstream Validation

  • Use the selected HVGs for clustering and dimensionality reduction.
  • Evaluate the biological coherence of the results using known marker genes. The success of the HVG selection is ultimately validated by the quality and interpretability of the clusters it produces.

HVG Selection Decision Workflow

The following diagram illustrates the logical process for selecting and validating the set of Highly Variable Genes for your analysis.

[Diagram] Start with QC-filtered count data → Normalize data & model variance (e.g., with SCTransform) → Decide how to select the number of HVGs: use the pipeline default (e.g., 3,000 genes) for a standard analysis, or systematically test multiple cutoffs (e.g., 2,000-5,000) when searching for sensitive or rare populations → Evaluate results → If clusters are stable and biologically interpretable, proceed with the selected HVGs; otherwise adjust the number of HVGs and re-evaluate.


Research Reagent Solutions

| Item | Function in HVG Analysis |
|---|---|
| Spike-in RNAs (e.g., ERCC) [55] [57] | Exogenous RNA controls of known concentration used to create a standard curve. They help to accurately model technical noise for improved HVG selection, especially in full-length sequencing protocols. |
| Unique Molecular Identifiers (UMIs) [55] [57] | Random barcodes that tag individual mRNA molecules before amplification. UMIs correct for PCR amplification bias, leading to more accurate gene expression counts and a more reliable quantification of gene variability. |
| 10x Genomics Chromium [56] | A widely used droplet-based single-cell platform that incorporates UMIs by default, generating data suitable for robust HVG detection methods like SCTransform and modelGeneVarByPoisson. |
| Seurat R Toolkit [13] | A comprehensive software package that provides multiple integrated functions for scRNA-seq analysis, including the SCTransform normalization/HVG method and standard FindVariableFeatures with several model options. |
| SingleCellExperiment (SCE) Object [58] [2] | A standard data structure in Bioconductor for storing single-cell data. It is used by various packages (e.g., scran) that offer alternative HVG selection methods like the deconvolution-based approach and modelGeneVar. |

Addressing Reproducibility Concerns in HVG Selection

Frequently Asked Questions

Why does Highly Variable Gene (HVG) selection significantly impact the reproducibility of my single-cell Foundation Model (scFM) training? HVG selection directly influences which biological signals your model learns. Different HVG methods can select substantially different gene sets, leading to models that capture varying aspects of the data. A 2025 benchmark found that feature selection methods significantly affect integration performance and subsequent query mapping, with implications for model generalizability [6]. Selecting inconsistent HVGs across experiments will yield models that prioritize different biological features, directly harming reproducibility.

What are the primary sources of irreproducibility in HVG selection? The main sources are:

  • Methodological variability: Over 20 different feature selection methods exist, each with different statistical assumptions and outputs [6].
  • Technical artifacts: Batch effects and technical noise can be misinterpreted as biological variation without proper correction [1].
  • Data-dependent performance: No single HVG method consistently outperforms others across all dataset types and sizes [23].
  • Parameter sensitivity: The number of selected features significantly impacts downstream results, with most metrics correlating with feature set size [6].

How can I determine if my HVG selection is capturing biological signal versus technical noise? Use spike-in controls when available to model technical noise separately from biological variation [2]. For data without spike-ins, leverage mean-variance trend modeling or Poisson-based noise models [2]. Additionally, evaluate your selected HVGs using batch-aware methods that can distinguish technical batches from biological variation [6].

Troubleshooting Guides

Problem: Inconsistent Cell Type Annotations Across Studies

Issue: Your scFM produces cell embeddings that lead to inconsistent cell type annotations when compared to reference atlases.

Solution:

  • Implement batch-aware HVG selection: Use methods that explicitly account for batch effects during feature selection rather than correcting for them post-hoc [6].
  • Validate with ontology-informed metrics: Use metrics like scGraph-OntoRWR or Lowest Common Ancestor Distance (LCAD) to ensure cell type relationships in your embeddings align with established biological knowledge [23].
  • Leverage multiple baseline methods: Compare your HVG selection against negative controls (random genes, stably expressed genes) to establish performance ranges [6].
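
The negative-control baselines above (500 random genes, 200 stably expressed genes) can be generated in a few lines. The function name and the per-gene stability score input are our own illustrative conventions; any variability metric you already compute will do.

```python
import random

def negative_control_sets(genes, stability_scores, n_random=500, n_stable=200, seed=0):
    """Build two negative-control gene sets: a seeded random draw and the
    most stably expressed genes (lowest stability_scores, whichever
    variability metric you computed)."""
    rng = random.Random(seed)
    random_set = rng.sample(list(genes), min(n_random, len(genes)))
    ranked = sorted(zip(genes, stability_scores), key=lambda gs: gs[1])
    stable_set = [g for g, _ in ranked[:n_stable]]
    return random_set, stable_set

genes = ["a", "b", "c", "d"]
scores = [3.0, 1.0, 2.0, 0.5]   # e.g. per-gene variance estimates
rand, stable = negative_control_sets(genes, scores, n_random=2, n_stable=2)
print(stable)  # → ['d', 'b']
```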

Diagram: Inconsistent Annotations stem from Batch Effects and Poor Biological Signal; Batch Effects are addressed with batch-aware HVG methods, Poor Biological Signal by validating with ontology metrics, and both remedies converge on Improved Reproducibility.

Problem: Poor Cross-Dataset Generalization

Issue: Your scFM performs well on training data but fails to generalize to new datasets.

Solution:

  • Benchmark with multiple integration metrics: Evaluate HVG selection using metrics specifically designed for query mapping and unseen population detection, not just batch correction [6].
  • Use dataset-specific roughness index (ROGI): Quantify the smoothness of the cell-property landscape in your latent space as a proxy for generalization capability [23].
  • Select features robust to dataset complexity: Methods that maintain performance as dataset complexity (number of cells, batches, labels) increases are more likely to generalize well [6].


Experimental Protocol: Assessing Generalization Capability

  • Split your data into reference and query sets, ensuring distinct biological samples in each
  • Apply HVG selection to the reference set only
  • Train your scFM using only these reference-selected features
  • Map query data to the reference space using the same feature set
  • Evaluate using mapping-specific metrics: Cell distance, Label distance, mLISI, and qLISI [6]
  • Calculate ROGI scores for both reference and query embeddings [23]
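
The feature-handling rule in steps 2 to 4 (features are chosen on the reference only, then both matrices are restricted to that set) can be sketched as follows; the helper name is illustrative.

```python
def restrict_to_reference_features(ref, query, genes, selected):
    """Subset reference and query matrices to features chosen on the
    reference alone; the query must never influence feature selection."""
    keep = set(selected)
    idx = [i for i, g in enumerate(genes) if g in keep]
    cut = lambda mat: [[row[i] for i in idx] for row in mat]
    return cut(ref), cut(query), [genes[i] for i in idx]

genes = ["A", "B", "C", "D"]
ref = [[1, 2, 3, 4], [5, 6, 7, 8]]
query = [[9, 8, 7, 6]]
ref_sub, query_sub, kept = restrict_to_reference_features(ref, query, genes, ["B", "D"])
print(kept, query_sub)  # → ['B', 'D'] [[8, 6]]
```
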

Problem: HVG Method Selection Uncertainty

Issue: You're unsure which of the many HVG methods to implement for optimal reproducibility.

Solution:

  • Start with established baselines: highly variable gene selection with scanpy's Cell Ranger flavor (batch-aware variant) provides a robust starting point [6].
  • Evaluate multiple method types: Test methods based on different statistical approaches (differential expression, feature selection, predictive performance) [59].
  • Prioritize simplicity when appropriate: Simple methods like Wilcoxon rank-sum test, Student's t-test, and logistic regression often outperform more complex approaches for marker gene selection tasks [59].

Performance Comparison of HVG Methods

Table 1: HVG Method Categories and Characteristics [1] [59]

| Method Category | Representative Methods | Key Characteristics | Reproducibility Considerations |
| --- | --- | --- | --- |
| Differential Expression Based | Wilcoxon rank-sum, t-test, logistic regression | Uses statistical testing between groups; most common approach | Simple methods show strong performance; less parameter tuning needed |
| Variance Modeling | Brennecke, scran, scVEGs | Models mean-variance relationship; decomposes technical and biological variation | Requires proper normalization; sensitive to distribution assumptions |
| Feature Selection | NSForest, SMaSH, RankCorr | Selects genes maximally informative for classification | May prioritize different genes than DE methods; evaluate with task-specific metrics |
| Bayesian Approaches | BASiCS | Uses hierarchical models to decompose variation sources | Computationally intensive but provides uncertainty quantification |

Table 2: Quantitative Performance of Common Methods Across Benchmarking Studies [6] [1] [59]

| Method | Integration Performance | Biological Conservation | Query Mapping | Computational Efficiency |
| --- | --- | --- | --- | --- |
| Highly Variable (scanpy) | High | High | Moderate-High | High |
| Wilcoxon Test | Moderate | High | Moderate | High |
| Seurat VDM | Moderate-High | Moderate-High | Moderate | High |
| scran | Moderate | High | Moderate | Moderate |
| BASiCS | Moderate | Moderate | Moderate | Low |

Essential Research Reagent Solutions

Table 3: Key Experimental Materials for Reproducible HVG Selection

| Reagent/Resource | Function in HVG Selection | Implementation Considerations |
| --- | --- | --- |
| Spike-in Controls (ERCC) | Enables technical noise modeling for variance decomposition | Use consistent concentrations across experiments; required for methods like BASiCS [2] |
| Batch-Aware Normalization | Removes technical artifacts while preserving biological variation | Choose methods appropriate for your technology (UMI vs. full-length) [1] |
| Reference Cell Atlases | Provides ground truth for biological conservation metrics | Use consistent mapping and annotation practices across studies [23] |
| Standardized Quality Metrics | Quantifies integration and mapping performance | Implement multiple metric types: batch correction, bio conservation, and query mapping [6] |

Workflow Diagram for Reproducible HVG Selection

Workflow diagram: Raw scRNA-seq Data → Quality Control & Normalization → Technical Noise Modeling → Multiple HVG Method Application → Baseline Comparison → Multi-Metric Evaluation (Batch Correction, Bio Conservation, Query Mapping) → Method Selection → scFM Training.

Critical Protocol: Standardized HVG Evaluation Framework

To ensure reproducible HVG selection for scFM training, implement this standardized evaluation protocol adapted from recent benchmarks [6] [23]:

  • Establish Baselines:

    • Compare against 2,000 highly variable features (batch-aware scanpy)
    • Include negative controls (500 random features, 200 stable genes)
    • Use theoretical "Good" and "Bad" method performance boundaries
  • Comprehensive Metric Selection:

    • Integration (Batch): Batch PCR, CMS, iLISI
    • Integration (Bio): isolated label F1, bNMI, graph connectivity
    • Mapping: Cell distance, Label distance, mLISI
    • Classification: F1 (Macro), F1 (Micro), F1 (Rarity)
  • Scale and Aggregate Scores:

    • Scale metric scores relative to minimum and maximum baseline performance
    • Aggregate across metric categories separately
    • Document any scores exceeding baseline ranges (values >1)
  • Dataset-Specific Validation:

    • Assess method robustness to dataset size and complexity
    • Evaluate performance conservation as technical factors increase
    • Use roughness indices to predict generalization capability
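
The scale-and-aggregate step above reduces to a min-max rescaling against the baseline performance range, aggregated per metric category; the numbers below are illustrative.

```python
def scale_to_baselines(score, bad, good):
    """Min-max scale a raw metric so 0 = worst-baseline and 1 = best-baseline
    performance; values outside [0, 1] exceed the baseline range and should
    be documented rather than clipped."""
    return (score - bad) / (good - bad)

def aggregate_by_category(scaled_scores):
    """Average scaled scores within each metric category separately,
    as the protocol requires, never across categories."""
    return {cat: sum(vals) / len(vals) for cat, vals in scaled_scores.items()}

scaled = {
    "batch": [scale_to_baselines(0.75, 0.5, 1.0), scale_to_baselines(0.9, 0.5, 1.0)],
    "bio":   [scale_to_baselines(0.6, 0.2, 1.0)],
}
summary = aggregate_by_category(scaled)
print({k: round(v, 2) for k, v in summary.items()})  # → {'batch': 0.65, 'bio': 0.5}
```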

This structured approach ensures that HVG selection is evaluated across multiple performance dimensions relevant to scFM training, significantly enhancing reproducibility across studies and research groups.

Strategies for Capturing Rare Cell Populations and Subtle Biological Signals

Frequently Asked Questions (FAQs)

FAQ 1: Why is my single-cell RNA-seq data failing to identify known rare cell populations?

Your data may be affected by technical noise, including the "dropout effect," where genes are not detected even when expressed [60]. This is particularly detrimental for rare cells, where biological signals are already faint. Ensuring sufficient transcriptome coverage (number of genes detected per cell) is critical; below an empirical threshold, it becomes impossible to reliably separate true rare cell expression from technical artifacts [61]. Furthermore, standard clustering algorithms often fail to identify populations comprising less than 2% of the total cells, leading to rare cells being merged with abundant populations [62].

FAQ 2: How can I improve the sensitivity of my experiment for rare cell detection?

Sensitivity can be improved both experimentally and computationally.

  • Experimentally: Consider using more sensitive scRNA-seq protocols. For instance, the CEL-Seq2 method demonstrated a ~20% efficiency in transcript detection, a significant improvement over earlier versions, resulting in the identification of more genes and transcripts per cell [63].
  • Computationally: Apply specialized computational tools designed for rare cell detection. The CellSIUS (Cell Subtype Identification from Upregulated gene Sets) method is specifically tailored to identify rare cell types and their transcriptomic signatures from complex data [62]. For data preprocessing, using a noise-reduction method like iRECODE can resolve sparsity and reduce both technical and batch noise, clarifying subtle biological signals [60].

FAQ 3: What is the trade-off between sequencing more cells versus sequencing them more deeply?

This trade-off depends on your biological question. Research has shown that when the number of genes required to answer the question is small, greater transcriptome coverage (i.e., deeper sequencing per cell) is more important than analyzing a massive number of cells. Deeper sequencing reduces subsampling noise, which is crucial for accurately resolving the expression distribution of individual genes, especially those expressed in rare cells [61]. However, for discovering extremely rare cell types, sequencing a large number of cells remains necessary, provided each cell has sufficient coverage.

FAQ 4: Which feature selection method should I use for datasets with fine-resolution cell types or minority populations?

Many standard Highly Variable Genes (HVG) selection methods struggle with fine-resolution datasets. A novel framework called Mcadet has been developed to address this. It integrates Multiple Correspondence Analysis (MCA) and graph-based community detection to more accurately select informative genes from complex datasets, including those with minority cell populations [64]. Performance comparisons on such datasets suggest Mcadet outperforms several other established feature selection methods [64].

Troubleshooting Guides

Issue 1: High Technical Noise and Dropouts in scRNA-seq Data

Problem: A high proportion of zero counts in your data, known as the "dropout effect," is obscuring real biological signals, particularly for lowly expressed genes.

Solution: Implement a computational noise-reduction method.

  • Recommended Tool: iRECODE (Integrative RECODE) [60].
  • Procedure: This tool is applied as a preprocessing step to your raw count matrix.
  • Workflow:
    • Input: Prepare your single-cell gene expression matrix (cells x genes).
    • Processing: Run iRECODE, which uses high-dimensional statistical analysis to resolve the "curse of dimensionality" and distinguish technical noise from true biological variation.
    • Output: Obtain a denoised expression matrix where sparsity is reduced and batch effects are minimized, leading to clearer separation of cell states.

The following diagram illustrates the functional principle of how iRECODE processes single-cell data to enhance biological signals.

Diagram: Noisy single-cell data (e.g., RNA-seq, scHi-C) → iRECODE processing, which separates technical & batch noise from clarified biological signals.

Issue 2: Failure to Detect Rare Cell Populations with Standard Clustering

Problem: Standard unsupervised clustering methods (e.g., Seurat, SC3) are unable to identify rare cell types that constitute less than 1-2% of your total cell population [62].

Solution: Employ a two-step clustering approach specifically designed for rare cell detection.

  • Recommended Tool: CellSIUS [62].
  • Procedure:
    • Step 1 - Coarse Clustering: Perform an initial, standard clustering of your data to identify major cell types.
    • Step 2 - Rare Population Detection: Within each major cluster, run CellSIUS to identify subpopulations of cells that consistently overexpress a set of genes relative to their parent cluster.
  • Workflow Details:
    • Input: The expression values of N cells grouped into M coarse clusters.
    • Process: For each coarse cluster, CellSIUS identifies candidate marker genes and then pinpoints cells that show concerted upregulation of these genes.
    • Output: A list of rare cell populations and their transcriptomic signatures, which are indicative of the rare cell type's function.
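
A heavily simplified sketch of the step-2 idea (not the published CellSIUS algorithm): within each coarse cluster, flag genes whose high expressors form a small minority of cells. The fold, fraction, and cell-count thresholds are illustrative.

```python
from statistics import median

def candidate_rare_markers(expr, clusters, genes, fold=5.0, max_frac=0.3, min_cells=2):
    """For each coarse cluster, return genes expressed far above the
    cluster median in only a small minority of its cells, a toy
    stand-in for CellSIUS's candidate-gene identification step."""
    candidates = {}
    for c in sorted(set(clusters)):
        rows = [r for r, lab in zip(expr, clusters) if lab == c]
        for g, name in enumerate(genes):
            vals = [r[g] for r in rows]
            cutoff = fold * max(median(vals), 1e-9)  # guard against zero medians
            n_high = sum(v > cutoff for v in vals)
            if min_cells <= n_high <= max_frac * len(vals):
                candidates.setdefault(c, []).append(name)
    return candidates

# Ten cells in one coarse cluster: 'RARE' is on in 2 cells, 'HOUSE' everywhere.
expr = [[0, 5]] * 8 + [[20, 5]] * 2
print(candidate_rare_markers(expr, ["T"] * 10, ["RARE", "HOUSE"]))  # → {'T': ['RARE']}
```

The real method additionally tests for concerted upregulation of gene *sets* rather than single genes, which is what makes its signatures biologically interpretable.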

The workflow for this two-step clustering strategy is outlined below.

Workflow diagram: Complex scRNA-seq dataset → Step 1: coarse clustering (e.g., Seurat, SC3) → Step 2: apply CellSIUS → identified rare populations & signature genes.

Experimental Protocol: Validating Rare Cell Expression with Single Molecule RNA FISH

This protocol is adapted from a study that used smFISH as a gold standard to validate findings from single-cell RNA sequencing [61].

1. Objective: To quantitatively assess the tradeoffs in scRNA-seq data for detecting gene expression variability in rare cells.

2. Materials:

  • Cell Line: WM989-A6 (or your relevant cell line of interest).
  • Key Reagents: Multiplexed single molecule RNA FISH probes for target genes (e.g., EGFR, AXL, NGFR) and a housekeeping gene (e.g., GAPDH).
  • Equipment: High-throughput fluorescence microscope.

3. Methodology:

  • Step 1 - Cell Culture and Preparation: Culture the WM989-A6 cell line under standard conditions. Seed cells onto imaging-compatible slides or plates.
  • Step 2 - Multiplexed smFISH: Follow the standard smFISH procedure for your probe set. This typically involves:
    • Fixing cells with paraformaldehyde.
    • Permeabilizing cells.
    • Hybridizing labeled probes to target mRNA sequences.
    • Washing to remove non-specific binding.
  • Step 3 - Imaging and Quantification: Image tens of thousands of cells using a high-throughput microscope. For each cell, count the number of fluorescent spots for each gene, which corresponds to the number of mRNA molecules.
  • Step 4 - Data Analysis: Generate gene expression distributions across the population for each of the 26+ genes analyzed. Use these distributions as a "gold standard" to compare against distributions derived from your scRNA-seq data (e.g., from DropSeq or Fluidigm C1 platforms) [61].

4. Expected Outcome: The smFISH data will provide a high-resolution, quantitative baseline of true gene expression distribution, against which the sensitivity and accuracy of scRNA-seq protocols can be rigorously evaluated. This allows for the establishment of empirical quality thresholds (e.g., minimum transcripts/cell or genes/cell) necessary for reliable rare cell analysis.

Data Presentation

Table 1: Comparison of scRNA-seq Method Sensitivities for Rare Cell Analysis

Table comparing different single-cell RNA sequencing methods based on their reported sensitivity, number of genes detected, and other key metrics relevant to rare cell detection.

| Method | Reported Sensitivity (Spike-in) | Key Improvements | Impact on Rare Cell Detection |
| --- | --- | --- | --- |
| CEL-Seq2 [63] | ~20% (from 5.8% in CEL-Seq) | Shorter primer, optimized RT enzymes, bead-based clean-up, ligation-free library prep | Detects twice as many transcripts and 30% more genes per cell, improving the chance of capturing rare cell signatures |
| DropSeq [61] | Not specified | High throughput, low cost per cell | Wide range of transcriptome coverage per cell; requires careful thresholding to avoid false positives/negatives for rare genes |
| Fluidigm C1 [61] | Not specified | More even transcriptome distribution, higher reads/cell | Lower number of cells sequenced, but higher data quality per cell can be beneficial |

Table 2: Performance of Computational Tools for Rare Cell Detection

Table summarizing key computational tools designed to address challenges in rare cell population identification.

| Tool | Function | Key Advantage | Reference |
| --- | --- | --- | --- |
| CellSIUS | Rare cell population identification | Identifies rare cell types and their functional transcriptomic signatures from complex data | [62] |
| Mcadet | Feature selection (HVG selection) | Superior performance on fine-resolution datasets and datasets with minority cell types | [64] |
| iRECODE | Technical and batch noise reduction | Comprehensive noise reduction across multiple data types (RNA-seq, spatial, scHi-C) with low computational cost | [60] |
| Symphony | Reference atlas mapping | Efficiently maps query cells to a large, integrated reference to transfer annotations and identify cell states | [65] |

The Scientist's Toolkit: Research Reagent Solutions

Table of key reagents, technologies, and computational tools used in the field of rare cell analysis.

| Item | Function/Description | Application in Rare Cell Studies |
| --- | --- | --- |
| Single Molecule RNA FISH | A gold standard method for quantitative, single-cell, single-molecule mRNA counting using fluorescent probes [61] | Validating gene expression distributions and rare cell states identified by scRNA-seq [61] |
| Fluidigm C1 System | An automated microfluidic system for capturing individual cells and performing single-cell RNA sequencing | Provides high-sensitivity data with more uniform transcriptome coverage, useful for characterizing rare cells [61] [63] |
| CEL-Seq2 Primers | Optimized primers with Unique Molecular Identifiers (UMIs) for highly multiplexed, sensitive scRNA-seq | Increases transcript detection efficiency, improving the resolution of gene expression in all cells, including rare types [63] |
| CellSIUS Software | A computational algorithm for Cell Subtype Identification from Upregulated gene Sets | Detects rare cell populations and their signature genes from complex scRNA-seq data after coarse clustering [62] |
| iRECODE Platform | A computational method for comprehensive noise reduction in single-cell data | Reduces technical dropouts and batch effects, clarifying subtle biological signals from rare cells [60] |

Handling Platform-Specific Biases and Technical Artifacts

Frequently Asked Questions

What are the most common sources of technical bias in scRNA-seq data for scFM training? Technical biases primarily arise from the sequencing platform (e.g., different 10x Genomics kit chemistries), library preparation protocols, and sample processing batches. For scFMs, which are trained on massive, aggregated datasets, these biases can obscure true biological variation. Key artifacts include batch effects, where technical differences mimic biological signals; ambient RNA, which is background noise from lysed cells; and variations in sequencing depth between samples [5] [32] [56].

Why is handling technical artifacts critical for selecting Highly Variable Genes (HVGs) in scFM research? HVG selection is a foundational step that identifies genes with high biological variance for downstream analysis. Technical artifacts can artificially inflate the variance of non-informative genes, leading to a biased HVG list. Training an scFM on such a list will cause the model to learn noise instead of underlying biology, reducing its performance and generalizability across diverse cell types and tissues [4] [66].

My dataset shows good clustering but poor integration with a public atlas. Is this a technical bias? Yes, this is a classic symptom of substantial batch effects, often described as "system-level" biases. This occurs when integrating across different biological systems (e.g., primary tissue vs. organoids) or technologies (e.g., single-cell vs. single-nuclei RNA-seq). Standard batch correction methods may fail or inadvertently remove biological signal in these scenarios, requiring more advanced integration strategies [32].


Troubleshooting Guides

Issue 1: Identifying and Diagnosing Technical Biases

Problem: Suspected technical artifacts are confounding the biological signal in your dataset, leading to unreliable HVG selection.

Solution: Follow a systematic quality control (QC) and diagnostic protocol.

Experimental Protocol:

  • Initial QC with Platform-Specific Tools: Process raw sequencing data (FASTQ) with tools like Cell Ranger to generate a gene-barcode matrix and an initial QC report [56] [66].
  • Calculate QC Metrics: For each cell barcode, compute:
    • Total UMI counts
    • Number of genes detected
    • Percentage of mitochondrial reads
  • Visualize Metrics: Use Loupe Browser or Scanpy/Seurat to create visualizations that help identify low-quality cells [56] [67].
  • Filter Low-Quality Cells: Apply thresholds based on your visualizations. For example, in PBMC data, you might filter out cells with >10% mitochondrial reads [56].
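
The filtering step above can be sketched as a simple threshold pass. The 'MT-' gene-name prefix convention and the default cutoffs are assumptions for illustration; tune both to your tissue and chemistry.

```python
def qc_filter(cells, max_mito_frac=0.10, min_umis=500, min_genes=200):
    """Keep barcodes passing the three QC metrics computed above.

    cells: dict of barcode -> {gene: count}; mitochondrial genes are
    assumed (here) to carry an 'MT-' name prefix.
    """
    kept = []
    for barcode, counts in cells.items():
        total_umis = sum(counts.values())
        n_genes = sum(1 for v in counts.values() if v > 0)
        mito_frac = sum(v for g, v in counts.items() if g.startswith("MT-")) / max(total_umis, 1)
        if total_umis >= min_umis and n_genes >= min_genes and mito_frac <= max_mito_frac:
            kept.append(barcode)
    return kept

cells = {
    "AAAC": {"GAPDH": 300, "ACTB": 250, "MT-CO1": 30},    # healthy profile
    "TTTG": {"GAPDH": 40, "MT-CO1": 500, "MT-ND1": 100},  # high mito: likely dying
}
print(qc_filter(cells, min_umis=100, min_genes=2))  # → ['AAAC']
```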

The relationship between QC metrics and data filtering is a sequential diagnostic process, summarized in the following workflow:

Workflow diagram: Start QC Diagnosis → Calculate QC Metrics (UMI counts, genes/cell, % mitochondrial reads) → Visualize Metrics → diagnostic branches: Low UMI/Gene Count (indicates empty droplets/ambient RNA), High % Mitochondrial Reads (indicates stressed, dying, or low-quality cells), Suspected Ambient RNA (confirmed by tools like SoupX or CellBender) → Apply Cell Filtering → Proceed to Data Integration & HVG Selection.

Issue 2: Correcting for Ambient RNA Contamination

Problem: Widespread, low-level expression of marker genes in unlikely cell types, suggesting contamination from ambient RNA.

Solution: Use computational tools to estimate and subtract the ambient RNA profile.

Experimental Protocol:

  • Choose a Tool: Select an ambient RNA removal tool such as CellBender or SoupX [56] [66].
  • Run the Tool: Provide the tool with your raw gene-barcode matrix. These tools use deep learning or statistical models to distinguish real cell signals from background noise.
  • Use the Corrected Matrix: Perform all subsequent analysis, including HVG selection, on the denoised matrix produced by the tool. Integrating this step before HVG selection prevents genes that are highly variable due to contamination from being selected [66].
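
To show where this correction sits in the pipeline, here is a toy proportional subtraction. This is not how SoupX or CellBender work internally, and the fixed contamination fraction is an assumption (real tools estimate it from the data, often per cell).

```python
def subtract_ambient(cell_counts, empty_droplets, contamination=0.1):
    """Toy ambient-RNA correction: estimate the ambient expression profile
    from 'empty' droplets, then subtract an assumed contamination fraction
    of each cell's depth, distributed according to that profile."""
    n_genes = len(cell_counts[0])
    totals = [sum(d[g] for d in empty_droplets) for g in range(n_genes)]
    grand = sum(totals) or 1
    profile = [t / grand for t in totals]  # ambient expression fractions
    corrected = []
    for cell in cell_counts:
        depth = sum(cell)
        corrected.append([max(0.0, c - contamination * depth * p)
                          for c, p in zip(cell, profile)])
    return corrected

empties = [[10, 0], [10, 0]]                  # ambient RNA is all gene 0
print(subtract_ambient([[50, 50]], empties))  # → [[40.0, 50.0]]
```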

Issue 3: Integrating Datasets with Substantial Batch Effects

Problem: Batch effects are so strong that they prevent meaningful integration and consensus HVG selection across datasets.

Solution: Move beyond simple linear correction methods to more powerful deep learning models.

Experimental Protocol:

  • Model Selection: For substantial system-level biases (e.g., cross-species, organoid-tissue), consider cVAE-based methods like scvi-tools or the recently proposed sysVI [32] [66]. sysVI, which combines a VampPrior with cycle-consistency constraints, has been shown to improve integration while preserving biological information better than tuning KL regularization or using adversarial learning alone [32].
  • Benchmark Correction: Evaluate integration quality using metrics like graph iLISI (for batch mixing) and NMI (for biological preservation) to ensure batch effects are removed without destroying meaningful biological variation [32].

The following table summarizes the key reagents and computational tools essential for tackling technical artifacts.

| Tool / Reagent | Primary Function | Key Application in scFM Research |
| --- | --- | --- |
| Cell Ranger [56] [66] | Raw data processing & alignment | Generates standardized gene-barcode matrices from platform-specific raw data (FASTQ); the foundational step for all analysis |
| SoupX / CellBender [56] [66] | Ambient RNA removal | Removes technical noise from the count matrix, ensuring HVG selection is based on true cellular expression |
| Harmony [66] | Batch effect correction | A fast and efficient method for integrating datasets from different batches or donors, often used in atlas-level projects |
| scvi-tools [32] [66] | Deep generative modeling | Uses variational autoencoders (VAEs) for powerful, probabilistic batch correction and integration of complex datasets |
| sysVI [32] | Integration of diverse systems | A cVAE-based method designed for substantial batch effects (e.g., cross-species), using VampPrior and cycle-consistency |

Advanced Workflow: An scFM-Oriented HVG Selection Pipeline

For research specifically aimed at training single-cell foundation models, where data scale and quality are paramount, a more robust pipeline is recommended. The diagram below integrates multiple correction strategies to produce clean, integrated data for robust HVG selection.

Workflow diagram: Multiple scRNA-seq datasets → individual QC & filtering (mitochondrial %, UMI) → ambient RNA removal (CellBender, SoupX) → normalization → system-level integration (scvi-tools, sysVI) → HVG selection → cleaned, integrated data for scFM training.

This workflow emphasizes that handling technical artifacts is not a single step but a cascade of pre-processing decisions. By systematically addressing these biases, researchers can select HVGs that more accurately reflect biology, thereby building more robust and generalizable single-cell foundation models [4] [32].

Combining HVGs with Spatially Variable Genes for Enhanced Performance

Troubleshooting Guides

Why is My Cell Type Clustering Performance Poor on Spatial Transcriptomics Data?

Problem: You are using only Highly Variable Genes (HVGs) or only Spatially Variable Genes (SVGs) for clustering, which may be capturing an incomplete picture of the biological variation.

Solution: Combine HVG and SVG gene sets to improve clustering accuracy.

  • Diagnostic Steps:
    • Check Gene Set Overlap: Compare your selected HVGs and SVGs. A low overlap is common and indicates that each set captures distinct biological signals [68].
    • Evaluate Clustering Metrics: Use metrics like Adjusted Rand Index (ARI), weighted F1 score, and cluster purity. If these are low with one gene set, try the combined set [68] [11].
  • Resolution:
    • Identify HVGs using established methods (e.g., modelGeneVar in Scran or FindVariableFeatures in Seurat) from the gene expression matrix [2].
    • Detect SVGs using spatial-aware methods (e.g., nnSVG, SPARK-X, SpatialDE) that incorporate spatial coordinates [69] [70].
    • Take the union of the HVG and SVG sets for downstream analysis like dimensionality reduction and clustering [68].
  • Expected Outcome: Studies across over 50 datasets show that using the combined gene set significantly improves clustering accuracy (AMI and F1 scores) and better delineates specific cell types, such as tumor cells and inhibitory neurons [68].
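
The union step in the resolution above is simple but worth pinning down: this sketch keeps HVG order first, deduplicates, and reports the (typically small) overlap. Gene names and the helper name are illustrative.

```python
def combined_feature_set(hvgs, svgs):
    """Union of HVG and SVG selections, keeping HVG order first and
    reporting the overlap between the two sets."""
    seen = set()
    combined = []
    for g in list(hvgs) + list(svgs):
        if g not in seen:
            seen.add(g)
            combined.append(g)
    return combined, set(hvgs) & set(svgs)

combined, overlap = combined_feature_set(["ACTB", "CD3E", "MKI67"], ["CD3E", "PTGDS"])
print(combined, sorted(overlap))  # → ['ACTB', 'CD3E', 'MKI67', 'PTGDS'] ['CD3E']
```
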

How Do I Choose the Right SVG Detection Method for My Data?

Problem: The choice of SVG detection method significantly impacts results, as different methods can yield highly dissimilar SVG lists [70].

Solution: Select a method based on your data type and the specific category of SVGs you wish to find.

  • Diagnostic Steps:
    • Categorize Your Goal: Determine if you need:
      • Overall SVGs: For general spatial patterning.
      • Cell-type-specific SVGs: For variation within a known cell type.
      • Spatial-domain-marker SVGs: For annotating pre-defined spatial domains [69].
    • Check Method Dependency: Be aware that some SVG statistics are highly correlated with mean gene expression levels, which could bias your results [70].
  • Resolution:
    • For a general-purpose, robust, and scalable method, consider nnSVG or SPARK-X [70].
    • If you have prior knowledge of spatial domains, use methods designed to find spatial-domain markers, such as spaGCN [69].
    • For a comprehensive analysis, consider running multiple methods and inspecting the consensus.
  • Expected Outcome: A more biologically relevant and consistent set of SVGs, leading to more reliable downstream interpretations.

Frequently Asked Questions (FAQs)

What is the fundamental difference between HVGs and SVGs?

HVGs are genes whose expression levels show high variance across individual cells, often identified from single-cell RNA-seq data without spatial context. The underlying assumption is that high biological variation is more interesting than technical noise [2]. SVGs are genes whose expression levels show a non-random, spatially autocorrelated pattern across the tissue [69]. In spatial transcriptomics data, these two gene sets are often distinct, suggesting they capture complementary biological information [68].
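
To make "spatially autocorrelated" concrete, here is a minimal Moran's I, the spatial statistic several SVG detectors build on, using a binary k-nearest-neighbour weight matrix with Manhattan distances. This is a pedagogical sketch, not nnSVG or SPARK-X.

```python
def morans_i(values, coords, k=1):
    """Moran's I spatial autocorrelation: near +1 for spatially clustered
    expression, near 0 for spatially random, negative for alternating."""
    n = len(values)
    mu = sum(values) / n
    dev = [v - mu for v in values]
    denom = sum(d * d for d in dev)
    if denom == 0:
        return 0.0
    num, n_edges = 0.0, 0
    for i in range(n):
        # binary weights on the k nearest neighbours (Manhattan distance)
        nearest = sorted(
            (abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1]), j)
            for j in range(n) if j != i)[:k]
        for _, j in nearest:
            num += dev[i] * dev[j]
            n_edges += 1
    return (n / n_edges) * (num / denom)

line = [(x, 0) for x in range(6)]
print(round(morans_i([0, 0, 0, 9, 9, 9], line), 3))  # striped pattern: positive
print(round(morans_i([0, 9, 0, 9, 0, 9], line), 3))  # alternating pattern: negative
```

A gene can have high overall variance (a strong HVG) yet a Moran's I near zero, which is precisely why the two selections diverge.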

When should I use the combined HVG+SVG set versus all genes?

Using the union of HVGs and SVGs is more effective than using all genes. Analyses show that including all genes does not improve accuracy further and can sometimes decrease performance, likely due to the introduction of non-informative genes that add noise [68]. The combined set provides a curated, informative feature list that enhances downstream analysis efficiency and accuracy.

How can I ensure my HVG selection is robust?

Some HVG detection methods can have low reproducibility. To address this, you can employ strategies like SIEVE (SIngle-cEll Variable gEnes), which uses multiple rounds of random sampling to identify a robust and stable set of variable genes, thereby improving downstream classification accuracy [11].
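
The resampling idea behind SIEVE can be sketched generically (our paraphrase, not the published implementation): wrap any selector and keep only genes chosen in nearly every random subsample. The toy `range_selector` stands in for a real HVG method.

```python
import random

def stable_gene_selection(cells, genes, selector, n_rounds=20, frac=0.8,
                          min_freq=0.9, seed=0):
    """SIEVE-style stability filter: rerun `selector(sub_cells, genes)`
    on random cell subsamples and keep genes selected in at least
    `min_freq` of the rounds."""
    rng = random.Random(seed)
    hits = {g: 0 for g in genes}
    for _ in range(n_rounds):
        sub = rng.sample(cells, max(1, int(frac * len(cells))))
        for g in selector(sub, genes):
            hits[g] += 1
    return [g for g in genes if hits[g] / n_rounds >= min_freq]

def range_selector(sub, genes):
    """Illustrative selector: keep genes whose expression range exceeds 3."""
    return [g for i, g in enumerate(genes)
            if max(r[i] for r in sub) - min(r[i] for r in sub) > 3]

cells = [[i, 5] for i in range(10)]  # 'graded' varies 0..9, 'flat' is constant
print(stable_gene_selection(cells, ["graded", "flat"], range_selector))  # → ['graded']
```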

How do I handle the computational cost of SVG detection on large datasets?

Computational time and memory usage vary significantly between SVG methods [70].

  • For large datasets (e.g., with many spots/cells), consider faster methods like SPARK-X or nnSVG [70].
  • Always check the scalability of a method against the size of your dataset before running an analysis.

Performance Metrics Across Platforms (HVG vs. SVG vs. Combined)

The table below summarizes the improvement in clustering performance when combining HVGs and SVGs, as demonstrated across multiple spatial transcriptomics platforms [68].

| Platform | Number of Datasets | Key Performance Improvement with Combined HVG+SVG |
| --- | --- | --- |
| 10X Visium | Multiple | Significant increase in AMI and weighted F1 score; improved delineation of cancer cells, connective tissues, and immune cells |
| 10X Xenium | Multiple (e.g., Kidney) | Improved separation of proximal tubule segments (PCT, PCT-TAL) and better classification of endothelial and mesangial cells |
| Nanostring CosMx | Multiple (e.g., Patient 5-2, FOV 7) | More accurate identification of tumor cells and specific immune cell types (B-cells, neutrophils) |
| Vizgen merFISH | Multiple (e.g., Mouse Hypothalamus) | Enhanced classification of inhibitory neurons |

Comparison of SVG Detection Methods

This table compares popular SVG detection methods based on a systematic benchmark study [70].

| Method | Key Characteristics | Considerations |
| --- | --- | --- |
| nnSVG | Nearest-neighbor Gaussian process; high correlation with Moran's I and MERINGUE | Low to moderate dependency on gene expression level |
| SPARK-X | Non-parametric model; computationally fast | High dependency on gene expression level; can be biased towards highly expressed genes |
| SpatialDE | Gaussian process regression | Shows low concordance with other methods; results can be highly variable across datasets |
| Moran's I | Measures spatial autocorrelation | Moderate dependency on gene expression level |
| SOMDE | Self-organizing map | Often reports very few significant SVGs |

Experimental Protocols

Detailed Methodology: Benchmarking Combined HVG and SVG Sets

This protocol is based on the workflow used to evaluate the clustering performance of combined gene sets on real spatial transcriptomics data [68].

  • Data Preprocessing:

    • Obtain a spatial transcriptomics dataset with a gene expression matrix and spatial coordinates for each spot/cell.
    • Perform standard quality control (e.g., filtering low-quality spots/cells and genes) and normalization.
  • Feature Selection:

    • HVG Detection: Apply a standard HVG detection method (e.g., from Seurat or Scran) to the preprocessed gene expression matrix. Select the top N genes (e.g., 2000-3000) as the HVG set [2].
    • SVG Detection: Apply one or more SVG detection methods (e.g., nnSVG, SPARK-X) using both the gene expression matrix and spatial coordinates. Select genes based on a statistically significant adjusted p-value (e.g., FDR < 0.05) to form the SVG set [70].
    • Gene Set Combination: Create a new gene set by taking the union of the selected HVGs and SVGs.
  • Dimensionality Reduction and Clustering:

    • Perform Principal Component Analysis (PCA) on the expression matrix of each gene set (HVG-only, SVG-only, and Combined).
    • Construct a shared nearest neighbor (sNN) graph using the top principal components.
    • Conduct clustering on the sNN graph using the Leiden algorithm.
  • Performance Evaluation:

    • Supervised Metrics: Calculate clustering accuracy against known ground truth labels using Adjusted Mutual Information (AMI) and weighted F1 score [68].
    • Unsupervised Metric: Use the Pearson Gamma coefficient to assess the quality of the clustering without ground truth [68].
    • Spatial Metrics: Employ spatial metrics like Spatial Concordance (SC) and mean spatial AMI to evaluate if the clustering results are spatially coherent [68].
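The feature-selection and combination steps in this protocol can be sketched with plain numpy; here `hvg_scores` and `svg_pvals` are placeholders for the outputs of whichever HVG and SVG methods you run, and PCA is done as a bare truncated SVD rather than through a full Seurat or Scanpy pipeline:

```python
import numpy as np

def combined_feature_space(expr, hvg_scores, svg_pvals,
                           n_hvg=2000, fdr=0.05, n_pcs=30):
    """Union of HVGs and SVGs, followed by PCA on the reduced matrix."""
    # HVG set: top-N genes ranked by a variability score (e.g., from Seurat/Scran)
    hvg_idx = np.argsort(hvg_scores)[::-1][:n_hvg]
    # SVG set: genes with a significant adjusted p-value (e.g., from nnSVG/SPARK-X)
    svg_idx = np.flatnonzero(svg_pvals < fdr)
    # Combined gene set: the union of both selections
    genes = np.union1d(hvg_idx, svg_idx)
    X = expr[:, genes]
    # PCA via SVD on the centered matrix
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return (U * S)[:, :n_pcs], genes

# Toy example: 300 cells x 5000 genes with synthetic scores/p-values
rng = np.random.default_rng(0)
expr = rng.poisson(1.0, size=(300, 5000)).astype(float)
pcs, genes = combined_feature_space(expr, expr.var(axis=0),
                                    rng.uniform(size=5000),
                                    n_hvg=1000, n_pcs=20)
```

The resulting principal components would then feed the sNN graph construction and Leiden clustering described above.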

The Scientist's Toolkit

Essential Research Reagent Solutions

The table below lists key computational tools and their functions for analyzing variable genes in spatial transcriptomics.

| Tool / Resource | Function | Use Case |
| --- | --- | --- |
| Seurat | R toolkit for single-cell and spatial genomics; includes HVG detection and integration of spatial coordinates. | Standard pipeline for preprocessing, HVG selection, and initial spatial analysis [70]. |
| Giotto | Suite for spatial transcriptomics data analysis; includes multiple built-in SVG detection methods. | Analyzing spatial patterns and identifying spatial domains [70]. |
| nnSVG | Scalable method for detecting SVGs using nearest neighbor Gaussian processes. | Robust and scalable SVG detection suitable for large datasets [70]. |
| SPARK-X | Non-parametric method for detecting SVGs; computationally efficient. | Rapid SVG detection on large-scale datasets [70]. |
| SIEVE | Strategy that uses multiple rounds of random sampling to identify robust HVGs. | Improving the reproducibility and accuracy of HVG selection in scRNA-seq data [11]. |

Workflow and Relationship Diagrams

Diagram of Combined Gene Analysis Workflow

Spatial Data + Expression Matrix → Preprocessing & QC → (HVG Detection and SVG Detection, in parallel) → Union of HVGs & SVGs → Dimensionality Reduction (PCA) → Clustering (e.g., Leiden) → Performance Evaluation

Relationship Between Gene Categories

All Genes → Informative Genes → (HVGs and SVGs) → Combined HVGs + SVGs

Adapting HVG Selection for Multi-Omic Foundation Models

Frequently Asked Questions (FAQs)

FAQ 1: Why can't I use standard HVG selection methods for multi-omic foundation model training? Standard highly variable gene (HVG) selection methods are designed for single-modal data (e.g., scRNA-seq alone) and quantify variation based on expression patterns within that single modality [71]. Multi-omic foundation models, such as scGPT, are trained on diverse data types, including transcriptomic and epigenomic data (e.g., scATAC-seq), which have fundamentally different statistical characteristics and scales [72] [73]. Applying standard HVG selection directly fails to account for the integrated nature of multi-omic cellular representations and can select features that optimize for technical variance rather than shared biological meaning across modalities.

FAQ 2: How does data binarization help with multi-omic integration for foundation models? Binarizing scRNA-seq data (converting gene expression to "on"/"off" states) creates quantitative similarity with scATAC-seq data, which is inherently binary in nature [72]. This transformation enables direct vertical integration through concatenation of the two modalities, followed by application of scATAC-seq-optimized algorithms like TF-IDF and Latent Semantic Indexing (LSI) [72]. This approach avoids subjective conversion of scATAC-seq data to gene activity scores and enables direct investigation of how each data type contributes to cell identity resolution, which is crucial for foundation model pretraining.

FAQ 3: What are the key computational challenges in HVG selection for cross-species foundation models? Cross-species foundation models like scPlantFormer face significant challenges in HVG selection due to orthology mapping complexities and evolutionary divergence in gene regulatory networks [73]. The primary challenge involves identifying genes whose variability patterns conserve biological meaning across species boundaries while accounting for technical batch effects that can exceed biological variation. Successful models address this by integrating phylogenetic constraints into their attention mechanisms and employing batch correction algorithms like Harmony or Seurat's integration methods before HVG selection [74] [73].

Troubleshooting Guides

Problem 1: Poor Cross-Modal Integration Performance

Symptoms: Foundation model fails to learn unified representations; modality-specific clustering persists in latent space.

Solutions:

  • Apply binarization preprocessing: Convert scRNA-seq counts to binary (0/1) values to match scATAC-seq data characteristics, then concatenate matrices before applying TF-IDF/LSI normalization [72].
  • Implement mosaic integration: Use tools like StabMap for non-overlapping feature alignment when different gene panels are measured across modalities [73].
  • Adjust HVG selection metrics: For integrated data, use analytical Pearson residuals instead of variance-based selection to account for multi-omic scale differences [72].

Verification: Check that cell-type separation improves in integrated UMAP visualizations and biological replicate alignment increases.

Problem 2: Batch Effects Obscuring Biological Variation

Symptoms: Technical variation dominates HVG selection; batches cluster separately despite biological similarity.

Solutions:

  • Apply systematic batch correction: Use SCTransform regularized negative binomial regression followed by integration anchors to preserve biological variance while removing technical artifacts [75].
  • Leverage foundation model capabilities: Employ scGPT's built-in batch integration features that use transfer learning to harmonize datasets while preserving biologically relevant variation [73].
  • Validate with positive controls: Include known biological markers in HVG lists to ensure biological signals remain after batch effect correction.

Verification: Compare pre- and post-integration visualizations; biological groups should cluster together across technical batches.

Problem 3: Inefficient HVG Selection for Large-Scale Pretraining

Symptoms: Computational bottlenecks during feature selection; memory overload with million-cell datasets.

Solutions:

  • Implement stratified HVG selection: Process cell-type specific subsets independently, then merge results, leveraging federated computational platforms like DISCO or CZ CELLxGENE Discover [73].
  • Use foundation model embeddings: Extract preliminary embeddings from a subsampled model, then perform HVG selection in the compressed latent space rather than raw feature space.
  • Employ GPU-accelerated workflows: Utilize BioLLM benchmarking frameworks with optimized feature selection implementations for large-scale data [73].

Verification: Monitor computational resource usage and ensure selected HVGs maintain performance on downstream tasks.

Experimental Protocols & Methodologies

Protocol 1: Binarization-Based Multi-Omic Integration for Foundation Models

Purpose: Create unified feature representations from scRNA-seq and scATAC-seq data for foundation model training.

Materials:

  • Paired scRNA-seq and scATAC-seq data (e.g., from 10X Multiome)
  • Computational tools: Scanpy, Seurat, or custom foundation model preprocessing pipelines

Procedure:

  • Data Binarization:
    • Convert scRNA-seq raw counts to binary values: 1 if raw count > 0, otherwise 0 [72]
    • Retain scATAC-seq data in native binary format (peak accessibility calls)
  • Feature Selection:

    • Select top highly variable genes (2,000 recommended) based on pre-binarized data using analytical Pearson residuals [72]
    • Select top accessible peaks (5,000 recommended) based on TF-IDF variability
  • Data Concatenation:

    • Create combined matrix: [binary_RNA_data | ATAC_data] with cells as rows and union of features as columns [72]
  • Normalization & Reduction:

    • Apply TF-IDF normalization to combined matrix
    • Perform dimensionality reduction using Singular Value Decomposition (SVD/LSI)
    • Use resulting embeddings as input for foundation model training

Validation: Compare clustering resolution and cell-type discrimination against standard gene activity score approaches.
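A minimal numpy sketch of the binarization, concatenation, TF-IDF, and LSI steps in this protocol (feature selection is omitted for brevity, and this TF-IDF formulation is one common variant, not necessarily the exact one used in [72]):

```python
import numpy as np

def binarize_tfidf_lsi(rna_counts, atac_peaks, n_components=30):
    """Binarize RNA, concatenate with ATAC, TF-IDF normalize, reduce via SVD."""
    # 1. Binarize RNA: 1 if raw count > 0, otherwise 0; ATAC is already binary
    rna_bin = (rna_counts > 0).astype(float)
    X = np.hstack([rna_bin, atac_peaks.astype(float)])  # cells x (genes + peaks)
    # 2. TF-IDF: per-cell term frequency x per-feature inverse document frequency
    tf = X / np.maximum(X.sum(axis=1, keepdims=True), 1.0)
    idf = np.log1p(X.shape[0] / np.maximum(X.sum(axis=0), 1.0))
    tfidf = tf * idf
    # 3. LSI: truncated SVD of the TF-IDF matrix gives the cell embeddings
    U, S, _ = np.linalg.svd(tfidf, full_matrices=False)
    return (U * S)[:, :n_components]

# Toy example: 200 cells, 300 genes, 500 peaks
rng = np.random.default_rng(0)
rna = rng.poisson(0.5, size=(200, 300))
atac = (rng.uniform(size=(200, 500)) < 0.1).astype(int)
emb = binarize_tfidf_lsi(rna, atac, n_components=25)
```

In practice the first LSI component is often checked for correlation with sequencing depth and dropped if needed before downstream use.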

Protocol 2: Time-Course HVG Identification for Dynamic Foundation Models

Purpose: Identify HVGs with dynamic expression patterns across multiple time points for temporal foundation modeling.

Materials:

  • Time-course scRNA-seq data (multiple time points)
  • R packages: Seurat, gProfiler2, ggplot2, tidyverse [75]

Procedure:

  • Data Preprocessing:
    • Perform quality control: exclude cells with >10% mitochondrial genes [75]
    • Correct ambient RNA using SoupX [75]
    • Remove doublets using DoubletFinder [75]
  • Data Integration:

    • Normalize data using SCTransform regularized negative binomial regression [75]
    • Integrate samples across time points using Seurat integration anchors [75]
  • Time-Course HVG Identification:

    • Calculate gene variability across all time points simultaneously
    • Identify genes with highly dynamic expression patterns across the time series
    • Cluster HVGs based on temporal expression patterns
  • Pathway Enrichment Analysis:

    • Perform functional enrichment on time-course HVGs using gProfiler2 [75]
    • Identify biological pathways with common and cell-type-specific expression dynamics

Validation: Visualize dynamic expression patterns of selected HVGs across multiple cell types and time points.

Table 1: HVG Selection Method Performance Comparison
| Method | Data Modality | Integration Approach | Clustering Accuracy | Computational Efficiency |
| --- | --- | --- | --- | --- |
| Standard HVG Selection [71] | scRNA-seq only | Not applicable | 77-100% (cell-type matching) | High |
| Binarization + TF-IDF/LSI [72] | scRNA-seq + scATAC-seq | Direct concatenation | 86% mean accuracy (improved separation) | Medium |
| Foundation Model Embeddings [73] | Multi-omic | Cross-modal attention | 92% cross-species accuracy | Lower (pretraining required) |
| Time-Course HVG Framework [75] | Time-series scRNA-seq | Temporal integration | Captures dynamic patterns | Medium |
Table 2: Foundation Model Capabilities in Multi-Omic Contexts
| Model | Training Scale | Multi-Omic Support | Key HVG-Related Features | Reported Performance |
| --- | --- | --- | --- | --- |
| scGPT [73] | 33M+ cells | Transcriptomics + Epigenomics | Zero-shot cell annotation, perturbation prediction | Superior multi-omic integration |
| scPlantFormer [73] | 1M plant cells | Cross-species transcriptomics | Phylogenetic constraints in attention | 92% cross-species accuracy |
| Nicheformer [73] | 53M spatial cells | Spatial + Dissociated data | Spatial context prediction | Improved niche identification |
| PathOmCLIP [73] | Multi-tumor datasets | Histology + Spatial transcriptomics | Contrastive learning for cross-modal alignment | Enhanced gene expression prediction |

Research Reagent Solutions

Table 3: Essential Computational Tools for Multi-Omic HVG Selection
| Tool/Package | Primary Function | Application in HVG Selection | Reference |
| --- | --- | --- | --- |
| Seurat | Single-cell analysis | HVG identification, data integration, multi-omic processing | [74] [75] |
| Scanpy | Single-cell analysis | Binarization processing, TF-IDF normalization, clustering | [72] |
| Harmony | Batch correction | Removing technical variation before HVG selection | [74] [73] |
| SCTransform | Normalization | Regularized negative binomial regression for improved HVG detection | [75] |
| BioLLM | Foundation model benchmarking | Standardized evaluation of HVG selection approaches across models | [73] |
| DoubletFinder | Quality control | Doublet identification to improve HVG selection accuracy | [75] |
| SoupX | Ambient RNA correction | Background noise reduction for cleaner HVG signals | [75] |
| gProfiler2 | Functional enrichment | Biological interpretation of selected HVGs | [75] |

Workflow Visualization

Binarization-Based Multi-Omic Integration

Raw multi-omic data: scRNA-seq raw counts → Binarization (expr > 0 → 1, else 0); scATAC-seq peak matrix (already binary) → Vertical Concatenation [RNA_binary | ATAC_binary] → TF-IDF Normalization → LSI/SVD Dimensionality Reduction → Foundation Model Training → Integrated Cell Representations

Time-Course HVG Identification Workflow

Time-Course scRNA-seq Data → Quality Control (mitochondrial % < 10%) → Ambient RNA Correction (SoupX) → Doublet Removal (DoubletFinder) → Normalization (SCTransform) → Temporal Integration (Seurat Anchors) → Time-Course HVG Identification → Pathway Enrichment (gProfiler2) → Dynamic HVGs for Foundation Modeling

Benchmarking and Validation: Assessing HVG Selection Impact on scFM Performance

Troubleshooting Guide & FAQs for scRNA-seq Data Integration

Why is my integrated data losing important biological variation after batch correction?

This is a common problem known as overcorrection, where batch correction methods remove both technical artifacts and genuine biological signals. Recent benchmarking studies reveal that many popular methods struggle with this balance.

Solutions:

  • Use methods specifically designed to preserve biological variation: Systems like sysVI, which combine VampPrior with cycle-consistency constraints, show improved biological preservation while handling substantial batch effects [33] [32].
  • Monitor for overcorrection: Employ the Reference-informed Batch Effect Testing (RBET) framework, which is specifically sensitive to overcorrection and shows a characteristic biphasic response when biological variation is being degraded [76].
  • Avoid excessive KL regularization: Increasing Kullback–Leibler divergence regularization removes biological and technical variation indiscriminately [33].

Experimental Protocol:

  • Integrate your data using your chosen method
  • Apply RBET evaluation with reference genes
  • If RBET values show a biphasic pattern (decreasing then increasing with parameter tuning), reduce correction strength
  • Validate with biological knowledge checks (cell type separation, known markers)

Which batch correction method should I choose for my scFM training data?

Current benchmarks indicate that method performance varies significantly across different scenarios, and no single method consistently outperforms others across all tasks [4].

Table 1: Batch Correction Method Performance Summary

| Method | Strengths | Limitations | Recommended Use Cases |
| --- | --- | --- | --- |
| Harmony | Consistently performs well without creating artifacts [77] | Only outputs low-dimensional embeddings [76] | Standard batch effects within similar systems |
| sysVI (VAMP + CYC) | Handles substantial batch effects while preserving biology [33] | More complex implementation | Cross-species, organoid-tissue, different protocols |
| cVAE with adversarial learning | Strong batch mixing | Prone to mixing unrelated cell types [33] | Not recommended for datasets with unbalanced cell types |
| Seurat | Good overall performance in benchmarks [76] | Can overcorrect with too many neighbors [76] | Technical batches within same biological system |

Selection Framework:

  • Assess batch effect substantiality by comparing distances within and between systems [33]
  • For standard technical batches: Harmony or Seurat with careful parameter tuning
  • For substantial batch effects (cross-species, different technologies): sysVI or similar advanced methods
  • Always validate with multiple metrics including biological conservation

How should I select features for optimal integration performance?

Feature selection critically impacts integration quality and downstream query mapping. Recent registered report findings provide specific guidance:

Table 2: Feature Selection Impact on Integration Performance

| Feature Selection Method | Integration Quality | Query Mapping | Biological Conservation |
| --- | --- | --- | --- |
| Highly Variable Genes (HVG) | High | Moderate to High | Good |
| Batch-aware HVG | Highest | High | Good |
| Random Features | Poor | Variable | Poor |
| Stably Expressed Genes | Poor | Poor | Poor |

Key Findings:

  • 2000 batch-aware highly variable features generally produce high-quality integrations [6]
  • Feature set size correlates positively with integration metrics but negatively with mapping metrics [6]
  • Lineage-specific feature selection may be beneficial for specialized applications [6]

What metrics should I use to evaluate integration quality comprehensively?

Traditional benchmarking has overemphasized batch mixing while underweighting biological conservation. New frameworks address this limitation:

Recommended Metric Framework:

  • Batch Correction Metrics: iLISI, Batch PCR, CMS [6]
  • Biological Conservation Metrics: bNMI, cLISI, isolated label F1 [6]
  • Overcorrection Detection: RBET framework with reference genes [76]
  • Intra-cell-type Conservation: Novel metrics capturing within-cell-type variation [78]

Evaluation Workflow for scRNA-seq Integration

How can I detect and prevent overcorrection in my integrated data?

Overcorrection occurs when batch correction removes genuine biological variation, leading to false biological conclusions.

Detection Methods:

  • RBET framework: Uses reference genes (housekeeping genes) to detect overcorrection by monitoring loss of biological variation [76]
  • Cell type merging: Monitor whether distinct cell types incorrectly merge after integration [33]
  • Biological validation: Check whether known biological relationships are preserved

Experimental Protocol for Overcorrection Detection:

  • Select reference genes with stable expression across cell types
  • Apply batch correction with varying parameter strengths
  • Calculate RBET values for each parameter setting
  • Identify the optimal point where RBET is minimal before increasing again
  • Validate with biological knowledge of expected cell type separations

How do deep learning approaches improve biological conservation in integration?

Deep learning methods, particularly variational autoencoders, offer flexible frameworks for balancing batch correction with biological preservation.

Key Advantages:

  • Multi-level loss functions can separately address batch effects and biological conservation [78]
  • Semi-supervised approaches (e.g., scANVI) incorporate cell-type labels to guide biological preservation [78]
  • Novel regularization strategies (VampPrior, cycle-consistency) improve preservation of biological signals [33]

Deep Learning Integration Strategies

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools for scRNA-seq Integration

| Tool/Resource | Function | Application Context |
| --- | --- | --- |
| RBET Framework | Overcorrection-aware evaluation | Validating integration quality without biological knowledge degradation [76] |
| scIB Metrics | Comprehensive integration benchmarking | Standardized evaluation of batch correction and biological conservation [6] |
| sysVI | Handling substantial batch effects | Cross-system integration (species, technologies, organoid-tissue) [33] |
| Harmony | Robust standard batch correction | Technical batch effects within similar biological systems [77] |
| scVI/scANVI | Deep learning integration | Flexible integration with semi-supervised capabilities [78] |
| PEREGGRN | Expression forecasting benchmark | Evaluating perturbation prediction performance [40] |
| GGRN Software | Grammar of gene regulatory networks | Network-based expression forecasting [40] |

Comparative Benchmarking of scFMs with Different HVG Selection Strategies

Frequently Asked Questions

Q1: Why is the selection of Highly Variable Genes (HVGs) so critical for single-cell Foundation Model (scFM) training?

HVG selection is a fundamental preprocessing step that reduces the high dimensionality and sparsity inherent in single-cell RNA-seq data. Selecting a subset of informative genes helps to mitigate technical noise and computational burden, allowing the model to focus on genes that drive biological heterogeneity. The choice of HVG selection strategy can significantly influence the model's ability to learn meaningful biological representations, ultimately affecting performance on downstream tasks like cell type annotation and perturbation prediction [4] [3] [16].

Q2: I encountered a "reciprocal condition number" error when using Seurat V3's HVG selection with a batch_key in Scanpy. How can I resolve this?

This error often arises when one or more batches in your dataset contain genes with very low or zero counts, making the covariance matrix for the LOESS regression ill-conditioned [79]. You can try the following troubleshooting steps:

  • Filter genes first: Perform a more stringent pre-filtering of genes using sc.pp.filter_genes(adata, min_counts=) to remove low-abundance genes across all batches.
  • Check batch composition: Investigate whether specific batches have an unusually high number of zero-expression genes. It might be necessary to adjust your batch grouping or exclude low-quality batches.
  • Use an alternative HVG method: If the error persists, you can use another HVG selection flavor (e.g., flavor='cell_ranger') that does not use the same internal regression, or select HVGs without the batch_key argument. Note that selecting HVGs without batch correction may leave technical confounders in your data [79].

Q3: My scFM underperforms compared to simple baseline models on perturbation prediction tasks. Is this a known issue?

Yes, recent independent benchmarks have highlighted this challenge. Several studies have found that for specific tasks like predicting transcriptome changes after genetic perturbations, sophisticated scFMs (such as scGPT and scFoundation) can be outperformed by deliberately simple baselines, including a model that just predicts the mean expression from the training data or a linear model using Gene Ontology features [26] [80]. This suggests that the goal of building a generalizable model for predicting novel experimental outcomes is still an active area of research, and simpler models should be included as baselines in your workflow.

Q4: How can I quantitatively evaluate if my chosen HVG strategy has improved my scFM's biological relevance?

Beyond standard performance metrics, you can employ novel, biology-driven evaluation metrics. For example, recent benchmarks have proposed:

  • scGraph-OntoRWR: Measures the consistency between the cell-type relationships captured by the scFM's embeddings and the known relationships in established cell ontology databases.
  • Lowest Common Ancestor Distance (LCAD): Assesses the severity of cell type annotation errors by measuring the ontological proximity between misclassified cell types and the correct ones. Using these metrics can provide deeper insight into whether your model is capturing biologically meaningful patterns [4].

Benchmarking Performance of scFMs and Baselines

The following tables summarize key findings from recent benchmark studies, comparing scFMs against traditional methods and simple baselines across various tasks.

Table 1: Performance Overview of scFMs vs. Baselines on Cell-Level Tasks

| Model Category | Example Models | Strengths | Limitations / Findings |
| --- | --- | --- | --- |
| Single-cell Foundation Models (scFMs) | Geneformer, scGPT, scFoundation [4] | Robust and versatile across diverse applications; effective at batch integration and cell type annotation [4]. | No single scFM consistently outperforms all others; performance is task- and dataset-dependent [4]. |
| Traditional Methods | Seurat, Harmony, scVI [4] | Established, efficient, and perform well with smaller datasets [4]. | May be outperformed by scFMs on complex integration tasks or when leveraging pretrained knowledge [4]. |
| Simple Baseline Models | "No change" predictor, Additive model, Linear Regression [26] | Highly efficient and can surprisingly outperform scFMs on specific tasks like perturbation prediction [26] [80]. | Incapable of representing complex biological interactions; their strong performance highlights scFM limitations [26]. |

Table 2: scFM Performance on Perturbation Prediction Benchmarks (Pearson Delta Correlation)

| Model | Adamson et al. Dataset | Norman et al. Dataset | Replogle (K562) Dataset | Replogle (RPE1) Dataset |
| --- | --- | --- | --- | --- |
| Train Mean (Simple Baseline) | 0.711 | 0.557 | 0.373 | 0.628 |
| Random Forest (GO Features) | 0.739 | 0.586 | 0.480 | 0.648 |
| scGPT | 0.641 | 0.554 | 0.327 | 0.596 |
| scFoundation | 0.552 | 0.459 | 0.269 | 0.471 |

Data adapted from a benchmark study that evaluated models on predicting differential expression after genetic perturbations [80].
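The "train mean" baseline above is trivial to implement, which is what makes its competitive scores notable. A self-contained numpy sketch with synthetic data (the Pearson delta formulation here, correlating predicted and observed expression changes relative to control means, follows the usual convention and may differ in detail from the cited benchmark):

```python
import numpy as np

def pearson_delta(pred, true, control_mean):
    # Correlate predicted vs. observed expression *changes* relative to control
    return np.corrcoef(pred - control_mean, true - control_mean)[0, 1]

rng = np.random.default_rng(1)
control = rng.normal(5.0, 1.0, size=(100, 50))     # unperturbed cells x genes
train_pert = rng.normal(5.5, 1.0, size=(80, 50))   # perturbed training cells
test_pert = rng.normal(5.5, 1.0, size=(20, 50))    # held-out perturbed cells

# "Train mean" baseline: predict the mean profile of the perturbed training cells
prediction = train_pert.mean(axis=0)
score = pearson_delta(prediction, test_pert.mean(axis=0), control.mean(axis=0))
```

Any scFM evaluated on perturbation prediction should clear this baseline before its extra complexity is justified.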


Experimental Protocols for Benchmarking

Below is a detailed methodology for conducting a comparative benchmark of scFMs, incorporating different HVG selection strategies.

Protocol: A Biology-Oriented Benchmarking Pipeline for scFMs

1. Data Preparation and Curation

  • Dataset Selection: Assemble multiple publicly available scRNA-seq datasets from sources like CELLxGENE [4]. These should encompass diverse biological conditions, tissues, and species to test generalizability.
  • Quality Control: Perform standard QC on each dataset (e.g., filtering cells and genes based on counts and mitochondrial percentage).
  • Create Benchmarking Tasks: Define a set of gene-level and cell-level tasks. Examples include:
    • Gene-level: Gene network inference, gene function prediction.
    • Cell-level: Batch integration, cell type annotation, rare cell type detection, and prediction of drug sensitivity or cancer cell identification [4].

2. Application of HVG Selection Strategies For each dataset, apply several HVG selection methods to create different gene subsets for downstream model training and evaluation.

  • Method 1: Seurat V3. Uses a variance-stabilizing transformation based on a LOESS regression of the relationship between mean expression and variance. It can be run per batch and integrated [81] [79].
  • Method 2: scTransform. Models gene expression using a regularized negative binomial generalized linear model and selects HVGs based on Pearson residuals [81].
  • Method 3: GLP (Newer Method). Identifies genes with average expression levels significantly higher than expected from their positive ratio (proportion of cells where the gene is detected) using an optimized LOESS regression, which is robust to dropout noise [3].
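As an illustration of the GLP idea only (not the published implementation), the sketch below ranks genes whose mean expression sits above the trend expected from their positive ratio, substituting a simple polynomial fit for GLP's optimized LOESS:

```python
import numpy as np

def positive_ratio_hvgs(counts, n_top=2000, deg=3):
    """Rank genes whose mean expression exceeds the trend implied by detection rate."""
    # Positive ratio: fraction of cells in which each gene is detected
    pos_ratio = (counts > 0).mean(axis=0)
    log_mean = np.log1p(counts.mean(axis=0))
    # Stand-in for the optimized LOESS: polynomial trend of log-mean on positive ratio
    trend = np.polyval(np.polyfit(pos_ratio, log_mean, deg), pos_ratio)
    residual = log_mean - trend  # genes above the trend are HVG candidates
    return np.argsort(residual)[::-1][:n_top]

# Toy counts: 500 cells x 3000 genes with per-gene Poisson rates
rng = np.random.default_rng(0)
lam = np.exp(rng.normal(0.0, 1.0, size=3000))
counts = rng.poisson(lam, size=(500, 3000))
top = positive_ratio_hvgs(counts, n_top=200)
```

Because the positive ratio is robust to dropout noise, this style of trend-residual ranking is less biased by sparsity than raw variance ranking.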

3. Model Training and Feature Extraction

  • Train/Finetune scFMs: For each HVG subset, train or finetune selected scFMs (e.g., scGPT, Geneformer). Alternatively, in a zero-shot setting, extract cell and gene embeddings from a pre-trained model using the HVG subset as input [4].
  • Establish Baselines: Apply traditional methods (e.g., Harmony, scVI) and simple baseline models (e.g., a linear model, or a mean predictor) on the same HVG subsets for comparison [4] [26].

4. Performance Evaluation and Biological Validation

  • Standard Metrics: Use task-specific metrics like Adjusted Rand Index (ARI) for clustering, silhouette score for batch correction, and Pearson correlation for regression tasks [3].
  • Biology-Aware Metrics: Incorporate novel metrics like scGraph-OntoRWR and Lowest Common Ancestor Distance (LCAD) to evaluate the biological plausibility of the model's predictions and representations [4].
  • Performance Analysis: Holistically rank models by aggregating results across all metrics and tasks. Use a non-dominated sorting algorithm or similar method to provide guidance on model selection based on the specific task, dataset size, and available resources [4].

The diagram below visualizes the key decision points and workflow of this benchmarking protocol.

scRNA-seq Dataset → HVG Selection Strategies (Seurat V3 | scTransform | GLP) → Model Application (scFMs, e.g., scGPT; Traditional Methods, e.g., Seurat, Harmony; Simple Baselines, e.g., mean, linear) → Performance Evaluation (Standard Metrics: ARI, Silhouette; Biological Metrics: scGraph-OntoRWR, LCAD) → Holistic Model Ranking


The Scientist's Toolkit: Essential Research Reagents & Materials

Table 3: Key Computational Tools and Resources for scFM Benchmarking

| Item Name | Function / Application | Reference / Source |
| --- | --- | --- |
| Scanpy / Seurat | Standardized scRNA-seq analysis workflows for QC, normalization, HVG selection, and clustering. | [81] |
| scGPT / Geneformer | Representative single-cell foundation models that can be fine-tuned or used for zero-shot embedding extraction. | [4] [26] |
| CELLxGENE / Cell Atlas | Curated data portals providing access to millions of standardized single-cell datasets for training and benchmarking. | [4] [82] |
| GLP Algorithm | A robust HVG selection method using optimized LOESS regression on the relationship between positive ratio and mean expression. | [3] |
| Gene Ontology (GO) | A knowledge base providing structured biological knowledge that can be used as features in baseline models or for validation. | [26] [80] |

Why is biological validation necessary for single-cell foundation models (scFMs), and what are the key metrics?

Biological validation is crucial to determine if your scFM has learned meaningful biological principles rather than just technical artifacts or dataset-specific noise. For models trained on highly variable genes (HVGs), this ensures that the selected features capture genuine biological variation rather than amplifying technical noise. Key performance metrics assessed during validation are detailed in the table below.

Table 1: Key Metrics for scFM Biological Validation

| Metric Category | Specific Metric | What It Measures | Interpretation |
|---|---|---|---|
| Cell-level Task Performance | Cell Type Annotation Accuracy | Model's ability to correctly assign cell identity labels [4] [5] [83] | High accuracy confirms the model captures defining transcriptional states. |
| Cell-level Task Performance | Batch Integration Quality | Ability to remove technical artifacts while preserving biological variation [4] [5] | Good integration enables analysis across diverse datasets. |
| Gene-level Task Performance | Expression Forecasting Accuracy | Prediction of gene expression changes after perturbation [40] | Tests the model's understanding of causal regulatory relationships. |
| Knowledge-based Validation | scGraph-OntoRWR | Consistency of model-derived cell relationships with established biological knowledge (e.g., cell ontology) [4] | Measures if the model recapitulates known biology. |
| Knowledge-based Validation | Lowest Common Ancestor Distance (LCAD) | Ontological proximity of misclassified cell types [4] | A smaller distance indicates a semantically reasonable error. |
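LCAD reduces to an upward walk in the ontology graph. A minimal, dependency-free sketch on a toy hierarchy — the cell types and edges below are illustrative stand-ins, not the real Cell Ontology:

```python
# Toy cell-type hierarchy (child -> parent), a tiny stand-in for Cell Ontology
parent = {
    "CD4+ T cell": "T cell",
    "CD8+ T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "cell",
    "neuron": "cell",
}

def ancestors(node):
    """Path from a node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lcad(a, b):
    """Lowest Common Ancestor Distance: edges from a and b to their LCA."""
    pa, pb = ancestors(a), ancestors(b)
    for i, n in enumerate(pa):
        if n in pb:
            return i + pb.index(n)
    raise ValueError("no common ancestor")

# Mislabeling a CD4+ T cell as CD8+ is a semantically 'near' error...
print(lcad("CD4+ T cell", "CD8+ T cell"))  # 2
# ...while confusing a T cell with a neuron is a severe one.
print(lcad("CD8+ T cell", "neuron"))       # 4
```

The smaller the distance, the more biologically forgivable the misclassification, which is exactly the intuition the metric encodes.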

Which experimental protocols are used for biological validation?

Protocol 1: Validating scFMs on Cell-level Tasks Using Benchmarking Platforms

Purpose: To objectively evaluate an scFM's performance on standardized, biologically relevant tasks like cell type annotation and batch integration [4].

Methodology:

  • Embedding Extraction: Use your pretrained scFM in "zero-shot" mode to generate latent embeddings (vector representations) for a benchmarking dataset of single cells [4] [5].
  • Downstream Task Evaluation: Feed these embeddings into simple, task-specific models (e.g., a classifier for cell type annotation).
  • Performance Comparison: Compare your model's performance against established baselines (e.g., Seurat, Harmony) and other scFMs using the metrics in Table 1 [4].

Troubleshooting: If performance is poor, ensure the benchmarking dataset was not part of the model's training data to prevent over-optimistic results [4].
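The first two steps can be sketched end-to-end. In the snippet below, the embeddings and labels are synthetic placeholders for real scFM output, and the downstream model is a deliberately simple nearest-centroid classifier so that accuracy reflects embedding quality rather than classifier capacity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for zero-shot scFM embeddings: 300 cells x 64 dims,
# three cell types separated by shifted means.
labels = rng.integers(0, 3, size=300)
embeddings = rng.normal(size=(300, 64)) + labels[:, None] * 2.0

# Hold out 25% of cells for evaluation
idx = rng.permutation(300)
train, test = idx[:225], idx[225:]

# Nearest-centroid classifier on frozen embeddings
centroids = np.stack(
    [embeddings[train][labels[train] == k].mean(axis=0) for k in range(3)]
)
dists = np.linalg.norm(embeddings[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
acc = (pred == labels[test]).mean()
print(f"annotation accuracy: {acc:.2f}")
```

In a real benchmark, the same evaluation would be repeated for each baseline and scFM over the shared held-out set.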

Protocol 2: Biological Knowledge Alignment with scGraph-OntoRWR

Purpose: To validate that the relationships between cells learned by the scFM are consistent with prior biological knowledge [4].

Methodology:

  • Construct Cell Graph: Build a graph where nodes are cells, and edges are based on similarity in the scFM's latent space.
  • Run Random Walk: Perform a network propagation algorithm (Random Walk with Restart) on this graph, using a known cell type hierarchy (e.g., from Cell Ontology) as the biological "restart" set.
  • Calculate Consistency Score: The scGraph-OntoRWR metric quantifies the alignment between the model-derived graph structure and the ontological knowledge [4]. A higher score indicates better biological fidelity.

Troubleshooting: A low score suggests the model is learning primarily technical or non-biologically meaningful patterns. Re-evaluate the HVG selection strategy and pretraining data quality.
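The Random Walk with Restart at the heart of this protocol fits in a few lines of numpy. The toy cell graph, restart set, and restart probability below are invented for illustration:

```python
import numpy as np

# Toy cell-similarity graph: 5 cells, adjacency derived (hypothetically)
# from similarity in the scFM's latent space.
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)

# Column-normalize the adjacency into a transition matrix
W = A / A.sum(axis=0, keepdims=True)

# Restart vector: mass on cells 0-2, the ontology-defined "known type" set
p0 = np.array([1.0, 1.0, 1.0, 0.0, 0.0])
p0 /= p0.sum()

r = 0.3  # restart probability
p = p0.copy()
for _ in range(100):
    # Iterate p <- (1 - r) * W @ p + r * p0 until it stabilizes
    p = (1 - r) * W @ p + r * p0

print(np.round(p, 3))  # stationary visiting probabilities per cell
```

Cells whose stationary probability stays concentrated within the restart set indicate that the latent-space graph agrees with the ontology grouping; the consistency score aggregates this agreement over many cell-type sets.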

Protocol 3: Gene Regulatory Insight Validation via Expression Forecasting

Purpose: To test the model's capacity to predict the downstream effects of genetic perturbations, a key sign of understanding regulatory networks [40].

Methodology:

  • Framework Setup: Use a benchmarking platform like PEREGGRN, which contains multiple, quality-controlled perturbation transcriptomics datasets (e.g., CRISPR knockouts, TF overexpression) [40].
  • Train and Predict: Provide the scFM's embeddings or representations to a forecasting grammar like GGRN. The model is trained on a set of perturbations and must predict expression changes for a held-out set of unseen perturbations [40].
  • Evaluate Predictions: Assess forecasting accuracy using metrics like Mean Absolute Error (MAE) or the correct prediction of the direction of change for differentially expressed genes [40].

Troubleshooting: A common pitfall is data leakage. Ensure that the specific perturbation condition being predicted is entirely absent from the training set [40].
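The two metrics named in the last step are straightforward to compute; the predicted and observed log-fold changes below are hypothetical values for a held-out perturbation:

```python
import numpy as np

# Hypothetical predicted vs. observed log-fold changes for 6 genes
observed = np.array([1.2, -0.8, 0.5, -1.5, 0.1, 2.0])
predicted = np.array([1.0, -0.5, 0.7, -1.0, -0.2, 1.5])

# Mean Absolute Error over genes
mae = np.abs(predicted - observed).mean()

# Directional accuracy: fraction of genes whose sign of change is correct
directional_acc = (np.sign(predicted) == np.sign(observed)).mean()

print(f"MAE: {mae:.3f}")
print(f"directional accuracy: {directional_acc:.2f}")
```

Directional accuracy is often the more forgiving and more interpretable of the two, since it ignores the magnitude of the predicted change.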

How do current scFMs perform on biological validation benchmarks?

Recent comprehensive benchmarks reveal the strengths and limitations of current scFMs. The table below summarizes the performance of leading models across critical biological and clinical tasks.

Table 2: Performance of scFMs on Key Validation Tasks (Adapted from [4])

| Model Name | Cell Type Annotation | Batch Integration | Drug Sensitivity Prediction | Key Biological Strength |
|---|---|---|---|---|
| Geneformer | Good | Good | Variable | Captures dynamic gene interactions during cell state transitions [84] [5]. |
| scGPT | Good | Good | Variable | Versatile across multiple omics modalities [4] [5]. |
| scFoundation | Good | Good | Good | Robust performance on large-scale clinical tasks [4]. |
| UCE | Good | Good | Variable | Incorporates protein sequence information via protein language models [4]. |
| LangCell | Good | Good | Variable | Integrates text descriptions with gene expression data [4]. |
| scCello | Good | Good | Variable | Infers cell-specific gene regulatory networks [4]. |

Key Benchmarking Insight: No single scFM consistently outperforms all others across every task and dataset. Model selection should be guided by the specific biological question and data characteristics [4].

The Scientist's Toolkit: Key Research Reagent Solutions

This table lists essential computational tools and data resources for the biological validation of scFMs.

Table 3: Essential Reagents and Resources for scFM Validation

| Item Name | Function / Purpose | Relevance to scFM Validation |
|---|---|---|
| CZ CELLxGENE [4] [5] | A unified platform providing access to over 100 million curated single-cell datasets. | Serves as a primary source of high-quality, annotated data for benchmarking and testing model generalizability. |
| PEREGGRN & GGRN [40] | A benchmarking platform and software for evaluating expression forecasting methods. | Provides a standardized environment to test your scFM's ability to predict genetic perturbation outcomes. |
| Cell Ontology [4] | A controlled, structured vocabulary for cell types. | Used as the ground-truth knowledge base for metrics like scGraph-OntoRWR and LCAD. |
| SCAVENGE [85] | An algorithm that uses network propagation to map causal genetic variants to relevant cellular contexts at single-cell resolution. | Can be used to generate trait-relevant cellular hypotheses for validating a model's functional insights. |
| Weighted Gene Correlation Network Analysis (WGCNA) [86] | A method to identify clusters (modules) of highly correlated genes. | Useful for validating if the model's latent space preserves known co-expression modules and biological processes. |

Visualizing the Biological Validation Workflow

The following diagram illustrates the logical flow and key decision points in a comprehensive biological validation pipeline for a single-cell foundation model.

The pipeline begins with a pretrained scFM and branches into three core validation pathways run in parallel: (1) cell-level task evaluation, benchmarking cell annotation and batch integration (metrics: accuracy, F1-score, batch correction score); (2) gene-level forecasting, predicting perturbation responses with platforms such as PEREGGRN (metrics: MAE, Spearman correlation, directional accuracy); and (3) knowledge alignment via scGraph-OntoRWR analysis (metric: consistency score with Cell Ontology). The three result sets feed a single decision point: if they confirm strong biological capture, the model is biologically validated and proceeds to deployment; if not, re-evaluate HVG selection, pretraining data quality, and model architecture.

Frequently Asked Questions

Q1: My single-cell foundation model (scFM) underperforms in cell type annotation on a new dataset. Could the initial selection of Highly Variable Genes (HVGs) be the cause?

Yes, this is a common issue. The HVGs selected for your scFM's pretraining define the feature space the model learns from. If the biological variation in your new query dataset is driven by genes not included in the original HVG set, the model will lack the necessary information for accurate annotation [4] [5]. This is particularly problematic when mapping data from different tissues, species, or disease states not well-represented in the pretraining corpus.

Q2: After integrating multiple datasets using our scFM, we observe strong batch effect removal but a loss of subtle biological signal. How can we improve the balance?

This indicates that the integration process may be over-correcting. The goal of integration is to align shared cell states across batches while preserving unique biological conditions. To troubleshoot:

  • Revisit HVG Selection: Ensure that your HVG list includes genes that define both major and rare cell populations. A list skewed towards highly abundant genes might miss markers for nuanced cell states.
  • Adjust Model Parameters: Many integration methods, including deep learning approaches like scArches, allow you to control the strength of batch correction through regularization parameters. Reducing this strength can help preserve biological variation [87].
  • Validate with Biology-Aware Metrics: Use metrics like the Lowest Common Ancestor Distance (LCAD), which measures the ontological proximity between misclassified cell types, to ensure biologically meaningful integration is maintained [4].

Q3: When mapping a query dataset to a reference atlas, the model fails to identify a known rare cell population. What steps should we take?

The failure to identify a rare cell type often stems from two issues related to feature selection:

  • Underrepresented Features in Pretraining: The genes characteristic of that rare cell type were not part of the scFM's pretraining vocabulary (i.e., the HVGs).
  • Algorithmic Bias: Standard integration methods can sometimes over-correct and merge small, biologically distinct clusters with larger ones.

To address this:

  • Leverage Transfer Learning: Use a method like scArches, which is designed to map query data onto a reference while accounting for new cell types. It can place unseen cell types into distinct clusters rather than forcing them into existing categories [87].
  • Perform Targeted Analysis: After initial mapping, subset your data and re-run the analysis focusing on the specific lineage where the rare cell type is expected, as this can increase the sensitivity for detecting rare populations.

Troubleshooting Guides

Issue: Poor Cell Type Annotation Accuracy

This guide addresses low annotation accuracy after transferring labels from a reference to a query dataset.

| Step | Action & Purpose | Key Parameters & Tools to Check |
|---|---|---|
| 1 | Check Feature Overlap: Confirm the genes used by the reference model are present and reliably measured in your query data. A small overlap will lead to poor performance. | Tool: Seurat's FindTransferAnchors [88]. Parameter: dims (should use the same dimensions as the reference). |
| 2 | Validate HVG Selection: Compare the HVGs from your query dataset to those used in the reference model. If the biological context is different, you may need to recompute HVGs specific to your query before mapping. | Method: FindVariableFeatures (Seurat) [89] or pp.highly_variable_genes (Scanpy) [90]. |
| 3 | Assess Prediction Scores: Examine the prediction scores from the label transfer. Low scores for a particular cell type indicate uncertain annotations, which may require manual curation or a different reference. | Tool: Seurat's TransferData [88]. Output: prediction.score.max column in metadata. |
| 4 | Use Biological Metrics: Evaluate errors using biology-informed metrics like LCAD. A misannotation between closely related cell types (e.g., CD4+ T cell subsets) is less severe than between different lineages (e.g., T cell vs. neuron) [4]. | Metric: Lowest Common Ancestor Distance (LCAD). |
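Step 1's feature-overlap check reduces to set arithmetic. A minimal sketch with hypothetical gene sets and an illustrative 80% warning threshold:

```python
# Hypothetical gene sets: features the reference model was built on vs.
# genes reliably detected in the query dataset.
reference_features = {"CD3D", "CD8A", "MS4A1", "NKG7", "LYZ", "PPBP"}
query_genes = {"CD3D", "CD8A", "MS4A1", "LYZ", "GNLY", "FCGR3A"}

shared = reference_features & query_genes
overlap_frac = len(shared) / len(reference_features)
print(f"{len(shared)}/{len(reference_features)} reference features present "
      f"in query ({overlap_frac:.0%})")
if overlap_frac < 0.8:  # illustrative cutoff, not a published standard
    print("WARN: low feature overlap; label transfer may be unreliable")
```

Running this check before FindTransferAnchors turns a silent performance degradation into an explicit, inspectable warning.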

Issue: Ineffective Data Integration

This guide helps when multiple datasets fail to align properly, or when integration removes biological variation.

| Step | Action & Purpose | Key Parameters & Tools to Check |
|---|---|---|
| 1 | Preprocess Independently: Normalize and identify HVGs on each dataset individually before integration. This ensures that technical differences between batches do not confound the selection of biologically relevant features. | Method: Standard pre-processing workflow (NormalizeData > FindVariableFeatures) applied per dataset [89] [91]. |
| 2 | Select an Appropriate Integration Method: Choose a method based on your data size and goal. For large datasets (>1M cells), consider scalable methods like Harmony [90] or scArches [87]. | Tools: IntegrateLayers (Seurat) [91], harmony_integrate (Scanpy) [92], scArches [87]. |
| 3 | Evaluate Integration Quality: Use a combination of metrics to ensure both batch removal and biological conservation. Don't rely on a single metric. | Metrics: Batch Mixing: PCA regression, Entropy of Batch Mixing. Biology Conservation: ARI, NMI, cell-type ASW [87]. |
| 4 | Iterate and Refine: If biological signal is lost, adjust the integration strength or the number of HVGs used. Fine-tuning these parameters is often necessary for optimal results. | Parameter: vars.to.regress in ScaleData (Seurat) for known confounders like mitochondrial percentage [89]. |

Performance Metrics for Method Selection

The table below summarizes quantitative benchmarks from a 2025 study comparing single-cell foundation models (scFMs) against established baseline methods across key downstream tasks. Performance is a composite score based on multiple metrics, with higher scores being better. No single method outperforms all others in every task, highlighting the need for task-specific selection [4].

Table 1: Benchmarking Scores for Downstream Tasks (General Performance)

| Method | Category | Cell Type Annotation | Data Integration | Query Mapping | Key Strengths |
|---|---|---|---|---|---|
| Seurat (CCA) | Baseline (Anchor-based) | 0.89 | 0.85 | 0.91 | High accuracy in cross-species mapping, well-established [88] [91] |
| Harmony | Baseline (Clustering-based) | 0.85 | 0.88 | 0.82 | Fast, efficient for large datasets, good batch mixing [92] [90] |
| scVI | Baseline (Generative) | 0.87 | 0.90 | 0.84 | Robust probabilistic model, handles complex batch effects [4] [87] |
| scArches | Transfer Learning | 0.91 | 0.92 | 0.95 | Excellent for iterative mapping, preserves unseen cell types [87] |
| scGPT | Foundation Model | 0.90 | 0.87 | 0.89 | Versatile, good zero-shot performance, multimodal potential [4] [5] |
| Geneformer | Foundation Model | 0.88 | 0.83 | 0.86 | Strong on gene-level tasks, good biological interpretability [4] |

Table 2: Performance on Specific Annotation Challenges

This table shows how methods handle specific annotation difficulties, using metrics like scGraph-OntoRWR (measures consistency with known biology) and LCAD (measures severity of misclassification) [4].

| Method | scGraph-OntoRWR (Higher is Better) | LCAD for Rare Cell Types (Lower is Better) | Notes |
|---|---|---|---|
| Seurat | 0.82 | 4.1 | Reliable, errors are often biologically plausible [88] [4] |
| Harmony | 0.79 | 4.5 | [4] |
| scArches | 0.85 | 3.8 | Excels at placing novel cell types correctly [87] |
| scGPT | 0.88 | 3.5 | Captures rich biological relationships from pretraining [4] [5] |

Experimental Protocols

Protocol 1: Reference-Based Query Mapping with Seurat

This is a detailed, step-by-step protocol for mapping a query dataset to an integrated reference, a common task for annotating new data [88].

Workflow: Build Reference → Preprocess Query → Find Transfer Anchors → Transfer Data → Project UMAP (the reference model built in the first step is reused at the anchor-finding step).

Diagram: Workflow for Reference-Based Query Mapping

Procedure:

  • Build the Reference: Create an integrated reference from multiple datasets.
    • Code: reference <- IntegrateLayers(object = pancreas.ref, method = CCAIntegration, orig.reduction = "pca", new.reduction = "integrated.cca")
    • Code: reference <- RunUMAP(reference, dims = 1:30, reduction = "integrated.cca", return.model = TRUE) # Critical to save the UMAP model [88].
  • Preprocess the Query Dataset: Independently normalize the query data.
    • Code: query <- NormalizeData(query)
  • Find Integration Anchors: Identify mutual nearest neighbors between the reference and query.
    • Code: anchors <- FindTransferAnchors(reference = reference, query = query, dims = 1:30, reference.reduction = "pca")
  • Transfer Cell Type Labels: Classify query cells based on the reference annotations.
    • Code: predictions <- TransferData(anchorset = anchors, refdata = reference$celltype, dims = 1:30)
    • Code: query <- AddMetaData(query, metadata = predictions)
  • Project Query onto Reference UMAP: Visualize the query cells embedded in the reference's structure.
    • Code: query <- MapQuery(anchorset = anchors, reference = reference, query = query, refdata = list(celltype = "celltype"), reference.reduction = "pca", reduction.model = "umap") [88].

Protocol 2: Large-Scale Data Integration with Scanpy and Harmony

This protocol is optimized for integrating very large datasets (e.g., >1 million cells) and is a key baseline method [90].

Procedure:

  • Setup and Quality Control: Import the libraries, then filter out low-quality cells and genes.
    • Code: import scanpy as sc
    • Code: import scanpy.external as sce
    • Code: sc.pp.filter_cells(adata, min_counts=100)
    • Code: sc.pp.filter_genes(adata, min_cells=5)
    • Code: adata = adata[(adata.obs.pct_counts_mt < 25) & (adata.obs.n_genes_by_counts < 5000) & (adata.obs.total_counts < 25000), :]
  • Normalization and HVG Selection: Standardize the data and select features.
    • Code: sc.pp.normalize_total(adata, target_sum=1e4)
    • Code: sc.pp.log1p(adata)
    • Code: sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.25)
  • Dimensional Reduction: Perform PCA on the highly variable genes.
    • Code: sc.tl.pca(adata, svd_solver='arpack')
  • Harmony Integration: Run Harmony to integrate data across a specified batch key (e.g., 'sample').
    • Code: sce.pp.harmony_integrate(adata, 'sample') [92] [90].
  • Neighborhood Graph and Clustering: Use the integrated embeddings for downstream analysis.
    • Code: sc.pp.neighbors(adata, n_neighbors=10, n_pcs=50)
    • Code: sc.tl.umap(adata)
    • Code: sc.tl.leiden(adata, resolution=0.5)

The Scientist's Toolkit: Research Reagent Solutions

Table 3: Essential Computational Tools for scFM Downstream Analysis

| Tool / Resource | Function | Relevance to HVGs & scFM Training |
|---|---|---|
| Seurat [88] [89] [91] | A comprehensive R toolkit for single-cell genomics. | Provides robust functions for HVG selection (FindVariableFeatures) and serves as a primary platform for benchmarking anchor-based integration and mapping methods against scFMs. |
| Scanpy [92] [90] | A scalable Python-based single-cell analysis suite. | Enables preprocessing and analysis of very large-scale datasets (millions of cells), with an external API that integrates methods like Harmony, facilitating direct comparison with scFMs. |
| Harmony [92] [90] | Fast, robust integration algorithm. | A top-performing baseline method for data integration. Its performance is a key benchmark for evaluating whether a new scFM provides a significant advantage over established, simpler tools [4]. |
| scArches [87] | Transfer learning method for single-cell data. | Represents a hybrid approach, using deep learning not for foundation training but for efficient, decentralized reference mapping. It is crucial for testing scFM performance in iterative query mapping tasks. |
| CellxGene / CZ CELLxGENE [4] [5] | Curated repository of single-cell datasets. | The primary source for high-quality, annotated data used for both pretraining scFMs and for creating standardized benchmarks to evaluate their performance on downstream tasks like annotation and integration. |

Within the broader thesis on selecting highly variable genes (HVGs) for single-cell foundation model (scFM) training, robust validation is paramount. Traditional metrics, while useful, often fail to capture the biological plausibility of the identified variable genes and cell states. This guide introduces advanced validation approaches that leverage curated biological knowledge from cell ontologies and established pathways to assess whether computational results reflect true biology, ensuring that your scFM training is built on a solid foundation.

Troubleshooting Guides

Guide 1: Resolving Discrepancies Between Statistical and Biological Validation

Problem: Your analysis identifies a set of highly variable genes, but these genes do not align with known cell-type markers or biological pathways, making the results difficult to interpret.

Solution:

  • Diagnose the Feature Selection Method: The choice of feature selection method significantly impacts downstream biological interpretation [6]. Re-run your analysis using a batch-aware highly variable gene selection method, which can better conserve biological variation across diverse samples.
  • Implement a Knowledge-Based Overlap Test: Use your proposed cell ontology-informed metrics. Take the list of statistically significant HVGs and calculate their enrichment against established, cell-type-specific gene sets from databases like Cell Ontology.
    • Protocol: Perform a hypergeometric test to determine if the overlap between your HVG list and a known cell-type marker set is greater than expected by chance. A significant p-value indicates that your HVGs are biologically relevant.
  • Validate with a Downstream Task: Use the HVGs for a standard downstream analysis like clustering and cell-type annotation. Assess the annotation quality using label transfer accuracy to a well-annotated reference atlas [6]. High accuracy confirms biological validity.

Guide 2: Addressing Poor Generalization of scFM to Unseen Cell Types

Problem: Your single-cell foundation model, trained on a specific set of tissues, performs poorly when tasked with representing or reconstructing data from a previously unseen cell type [93].

Solution:

  • Audit Your Training Corpus: The composition of your training data is critical. A model trained only on mature adult cell types (e.g., peripheral blood) will struggle with progenitor or embryonic cells [93].
  • Incorporate Diverse Developmental Data: Augment your training data with directed differentiation atlases and data from developmental tissues. These data sources cover a wider spectrum of the cellular state hierarchy, which can significantly improve the model's ability to generalize [93].
  • Employ Ontology-Guided Evaluation: When validating your model's performance, do not rely solely on reconstruction accuracy. Use cell ontology to define a set of "unseen" but related cell types. Evaluate if the model can place these cells in the correct region of the latent space relative to their known ontological relationships.

Guide 3: Correcting for Batch Effects Without Removing Biological Signal

Problem: After integrating multiple datasets to train your scFM, you suspect that batch correction has been too aggressive, removing genuine biological variation along with technical noise.

Solution:

  • Use Metrics that Discriminate Batch and Biology: Rely on a suite of metrics to evaluate your integration.
    • For Batch Correction: Use metrics like iLISI (Integration Local Inverse Simpson's Index) or Batch PCR (Batch Principal Component Regression) to confirm that batches are well-mixed [6].
    • For Biological Conservation: Use metrics like cLISI (Cell-type LISI), isolated label F1, or graph connectivity to ensure distinct cell types remain separable [6].
  • Compare to a Biological Ground Truth: The most reliable check is to see if known biological relationships are preserved. Check if the model correctly groups cells by their ontology-defined type and maintains developmental trajectories after integration.
  • Validate with a Negative Control: Include a set of "housekeeping" genes that are expected to be stable across cell types. If these genes show high variability in your integrated data, it may indicate over-correction or residual technical effects.
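The housekeeping-gene negative control can be implemented as a per-gene coefficient-of-variation check. The expression matrix is simulated here, the gene list is a common set of housekeeping genes, and the 0.5 CV cutoff is an illustrative assumption rather than an established threshold:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated normalized expression (cells x genes) after integration:
# four genes expected to be stable across cell types.
hk_genes = ["ACTB", "GAPDH", "B2M", "TUBB"]
expr = rng.normal(loc=5.0, scale=0.2, size=(500, len(hk_genes)))

# Coefficient of variation per gene: std / mean
cv = expr.std(axis=0) / expr.mean(axis=0)
for gene, c in zip(hk_genes, cv):
    flag = "WARN: unexpectedly variable" if c > 0.5 else "ok"
    print(f"{gene}: CV={c:.3f} ({flag})")
```

If stable genes come back flagged, suspect residual technical effects or over-correction before trusting any biological conclusions from the integrated data.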

Frequently Asked Questions (FAQs)

FAQ 1: Why should I use cell ontology-informed metrics instead of standard clustering metrics like silhouette score?

Standard clustering metrics evaluate compactness and separation but are agnostic to biology. You could have a statistically perfect cluster that groups biologically unrelated cells. Cell ontology-informed metrics, such as ontology enrichment scores or semantic similarity between cluster marker genes and known cell types, directly quantify the biological coherence of your results, ensuring they are not just statistically sound but also biologically meaningful.

FAQ 2: My data is from a rare disease with no established reference atlas. How can I perform knowledge-based validation?

In the absence of a perfect reference, you can still use knowledge-based approaches.

  • Leverage Proximal Ontologies: Use marker genes and pathways from related, well-annotated cell types in healthy tissues or similar diseases.
  • Pathway-Centric Analysis: Instead of validating at the cell-type level, shift to the pathway level. Test if your HVGs are enriched in specific signaling, metabolic, or disease-relevant pathways (e.g., from KEGG or Reactome). This can reveal whether your model captures the underlying disease biology.
  • Pseudo-replication: Split your data into technical replicates and ensure that the key biological signals (e.g., a rare cell population) are reproducibly identified.

FAQ 3: How does the choice of error model (e.g., Poisson vs. Negative Binomial) in preprocessing affect my downstream HVG selection and validation?

The choice of error model is critical for accurate variance estimation [94].

  • Poisson models assume the variance equals the mean, which may be appropriate only for very sparse, shallowly sequenced data.
  • Negative Binomial (NB) models account for overdispersion (variance > mean), which is prevalent in most scRNA-seq datasets, especially those with sufficient sequencing depth [94]. Using a Poisson model on overdispersed data can lead to the false detection of technically driven genes as "highly variable." This will directly impact your HVG list and mislead your scFM training. It is recommended to use regularized NB regression (as in sctransform [95]) for robust normalization and variance stabilization before HVG selection.
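A quick way to see why the error model matters is to compare the variance/mean ratio under the two models. In the simulation below (the mean and dispersion values are arbitrary), Poisson counts sit near a ratio of 1 while negative binomial counts are clearly overdispersed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate counts for one gene across 10,000 cells under each error model
mean = 4.0
poisson_counts = rng.poisson(lam=mean, size=10_000)

# Negative binomial with dispersion theta: variance = mean + mean^2 / theta
theta = 2.0
nb_counts = rng.negative_binomial(n=theta, p=theta / (theta + mean), size=10_000)

for name, x in [("Poisson", poisson_counts), ("NegBinom", nb_counts)]:
    # var/mean near 1 means the Poisson assumption holds; >> 1 is overdispersion
    ratio = x.var() / x.mean()
    print(f"{name}: mean={x.mean():.2f}, var={x.var():.2f}, var/mean={ratio:.2f}")
```

Running the same ratio check on your own raw counts is a cheap diagnostic before committing to a normalization model.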

FAQ 4: What is the minimum recommended number of HVGs for building a robust scFM?

There is no universal minimum, as it depends on biological complexity. However, benchmarks for data integration—a task related to scFM training—suggest that using around 2,000 highly variable features is an effective common practice that often leads to high-quality results [6]. The key is to use this as a starting point and validate that the selected number of genes captures the necessary biological variation without introducing excessive noise.
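As a sketch of what a "top 2,000" cutoff looks like in practice, here is a numpy-only dispersion ranking on synthetic data. Real tools (FindVariableFeatures, pp.highly_variable_genes) add mean-binning and variance stabilization, so treat this only as an illustration of the cutoff itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells, n_genes = 1000, 5000

# Synthetic log-normalized expression; the first 100 genes carry extra variance
expr = rng.normal(loc=1.0, scale=0.1, size=(n_cells, n_genes))
expr[:, :100] += rng.normal(scale=1.0, size=(n_cells, 100))  # truly variable genes

mean = expr.mean(axis=0)
disp = expr.var(axis=0) / np.abs(mean)  # simple dispersion: variance / mean

n_top = 2000  # common starting point; validate rather than assume
hvg_idx = np.argsort(disp)[::-1][:n_top]
print(f"selected {len(hvg_idx)} HVGs; "
      f"{np.sum(hvg_idx < 100)} of the 100 truly variable genes recovered")
```

On real data, repeating the downstream validation (enrichment, clustering quality) at several values of n_top is what turns the 2,000-gene convention into an evidence-based choice.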

Experimental Protocols for Key Validation Analyses

Protocol 1: Conducting a Cell Ontology Enrichment Analysis

Objective: To quantitatively assess if a list of highly variable genes (HVGs) is significantly enriched for markers of specific cell types as defined by the Cell Ontology.

  • Input Preparation:
    • Target Gene Set: Your list of statistically selected HVGs.
    • Background Gene Set: All genes detected in your scRNA-seq dataset.
    • Cell Ontology Marker Sets: Download cell-type-specific gene sets from a resource like the Cell Ontology or PanglaoDB.
  • Statistical Testing:
    • For each cell type of interest in the ontology, perform a hypergeometric test (or Fisher's exact test) using the target gene set, background gene set, and the ontology-derived marker set.
    • The null hypothesis is that the HVGs are not enriched for the cell-type marker set.
  • Interpretation:
    • Apply a multiple testing correction (e.g., Benjamini-Hochberg) to the resulting p-values.
    • A significantly enriched result provides strong evidence that your HVG selection method is capturing biologically relevant variation associated with that cell type.
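The hypergeometric test in step 2 can be implemented without external dependencies; all gene counts below are hypothetical:

```python
from math import comb

def hypergeom_pvalue(overlap, hvg_size, marker_size, background_size):
    """P(X >= overlap) for X ~ Hypergeometric: the probability of seeing
    at least this much HVG/marker overlap by chance alone."""
    total = comb(background_size, hvg_size)
    p = 0.0
    for k in range(overlap, min(hvg_size, marker_size) + 1):
        p += comb(marker_size, k) * comb(background_size - marker_size,
                                         hvg_size - k) / total
    return p

# Hypothetical numbers: 2,000 HVGs from a 20,000-gene background,
# a 50-gene marker set, 25 markers recovered among the HVGs
# (chance expectation is only 50 * 2000/20000 = 5).
p = hypergeom_pvalue(overlap=25, hvg_size=2000, marker_size=50,
                     background_size=20000)
print(f"enrichment p-value: {p:.2e}")
```

Remember to apply the Benjamini-Hochberg correction from step 3 when this test is run across many cell types.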

Protocol 2: Knowledge-Based Validation of a Developmental Trajectory

Objective: To validate that a computationally inferred pseudotemporal trajectory aligns with known biological stages of development.

  • Trajectory Inference:
    • Using your HVGs, apply a trajectory inference tool (e.g., Monocle3, PAGA) to order cells along a pseudotime axis.
  • Gene Set Scoring:
    • For cells along the trajectory, score the expression of well-established gene signatures for key developmental stages (e.g., "pluripotency," "early progenitor," "differentiation"). This can be done using methods like AUCell or module scoring.
  • Correlation with Pseudotime:
    • Plot the scores of these gene signatures against the pseudotime values.
    • Validation: A successful validation is achieved if the pluripotency signature score decreases with pseudotime while the differentiation signature score increases, recapitulating the expected biological progression.
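Step 3's check can be sketched with a rank-based (Spearman) correlation on synthetic scores; the pseudotime values and signature scores below are simulated to follow the expected biological pattern:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cells = 200

# Hypothetical pseudotime and per-cell module scores: pluripotency should
# fall and differentiation rise along the trajectory.
pseudotime = np.sort(rng.uniform(0, 1, n_cells))
pluripotency = 1.0 - pseudotime + rng.normal(scale=0.1, size=n_cells)
differentiation = pseudotime + rng.normal(scale=0.1, size=n_cells)

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors (no ties here)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

print(f"pluripotency vs pseudotime:    rho = {spearman(pseudotime, pluripotency):+.2f}")
print(f"differentiation vs pseudotime: rho = {spearman(pseudotime, differentiation):+.2f}")
```

A strongly negative rho for pluripotency and a strongly positive one for differentiation is the quantitative form of the validation criterion above.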

Signaling Pathways and Workflows

Knowledge-Based Validation Workflow

Workflow: a list of highly variable genes (HVGs) enters four parallel analyses — Cell Ontology enrichment, pathway and GO enrichment, downstream biological task assessment, and semantic similarity analysis. Evidence from these knowledge sources is then aggregated to yield a validated, biologically plausible HVG set.

Research Reagent Solutions

The following table details key computational tools and resources essential for implementing the novel validation approaches described in this guide.

| Item Name | Type | Function in Validation |
|---|---|---|
| Cell Ontology (CL) | Database | Provides a structured, controlled vocabulary for cell types, used as a source of known marker genes for enrichment tests [96]. |
| scran | Software Package | A highly variable gene selection method that demonstrated strong all-round performance in benchmarking studies, suitable for generating a robust HVG list for initial validation [97]. |
| scIB | Benchmarking Pipeline / Metrics | Provides a suite of metrics (e.g., iLISI, cLISI, graph connectivity) for evaluating data integration, useful for assessing biological conservation after batch correction [6]. |
| sctransform | Software Package | A normalization method using regularized negative binomial regression that effectively removes technical confounders like sequencing depth, providing a reliable foundation for HVG selection [95] [94]. |
| Single-cell Variational Inference (scVI) | Software Package / Model | A deep generative model for scRNA-seq data that can be used for integration and representation learning; performance is impacted by the feature selection method used [6]. |

This technical support center provides troubleshooting guides and FAQs to help researchers address specific issues encountered when applying single-cell foundation models (scFMs) in clinical and biomedical research.

Frequently Asked Questions

Q1: In a real-world scenario, can a foundation model trained on healthy reference data still identify disease-specific cell states?

A: Yes, when mapping is performed correctly. A key study used a deep learning strategy called scArches (single-cell architectural surgery) to map query datasets from patients with COVID-19 onto a healthy reference atlas. The method successfully preserved the disease-specific variation, allowing for the discovery of cell states unique to COVID-19 without the need to retrain the entire model from scratch [87]. This demonstrates that scFMs can be contextually extended to pathological conditions.

Q2: My model's cell type annotation performance has dropped on a new cancer dataset. Is the issue with the model or my data?

A: This is a common challenge. A comprehensive 2025 benchmark study indicates that no single scFM consistently outperforms all others across every task [4]. A performance drop on a new cancer type could be due to:

  • Dataset Characteristics: The new data may have a higher degree of intra-tumor heterogeneity or batch effects not present in the model's pretraining data.
  • Model Limitations: The scFM you selected might be less robust for that specific cancer type's cellular landscape.
  • Solution: Consider using the Roughness Index (ROGI) as a proxy to select a more appropriate model for your specific dataset. Furthermore, ensure your pre-processing pipeline is robust, as technical noise can significantly impact annotation accuracy [4] [98].

Q3: How do I choose between using a complex scFM and a simpler, traditional machine learning model for my clinical project?

A: The choice depends on your specific constraints and goals [4]:

  • Use an scFM if: Your task requires generalizable biological knowledge (e.g., identifying novel cell states), you have diverse data from multiple sources that need integration, or you are working on multiple downstream tasks (like batch integration and drug sensitivity prediction).
  • Use a simpler model if: You are working under significant computational resource constraints, your dataset is small and focused on a single, well-defined task (e.g., classifying a limited set of known cell types), or you require high efficiency and interpretability for a specific dataset.

Troubleshooting Guides

Issue: Poor Batch Integration While Preserving Biological Variation

Symptom: When integrating your new clinical dataset with a public reference atlas, batch effects are not adequately removed, or fine biological variations (e.g., subtle disease states) are being erased.

Investigation & Resolution:

  • Verify the Reference Model: Confirm that the pre-trained scFM you are using was trained on data that includes cell types or tissues relevant to your query dataset. A significant mismatch can lead to poor integration [87].
  • Check Fine-Tuning Strategy: If you are fine-tuning the model, ensure you are not overfitting. The benchmark study suggests that simpler models can sometimes adapt more efficiently to specific datasets than heavily fine-tuned scFMs [4]. Using methods like scArches, which fine-tune only a small subset of parameters ("adaptors"), can effectively prevent this issue and preserve biological variation while removing batch effects [87].
  • Evaluate with Biological Metrics: Do not rely solely on batch-mixing metrics. Use biology-informed metrics such as scGraph-OntoRWR or the Lowest Common Ancestor Distance (LCAD), which measure the consistency of the model's output with known biological ontologies and the severity of cell type misclassification [4].
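To make the LCAD idea concrete, here is a toy sketch: misclassifications between ontologically close cell types receive a small distance, while confusions across distant lineages receive a large one. The mini-ontology and the edge-count definition below are illustrative assumptions; the exact formulation in [4] may differ.

```python
# Toy sketch of a Lowest Common Ancestor Distance (LCAD)-style metric.
# LCAD here = number of edges separating two cell-type labels through their
# lowest common ancestor in a small hypothetical ontology (child -> parent).

def ancestors(node, parent):
    """Return the path from node up to the root (inclusive)."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def lcad(true_label, pred_label, parent):
    """Edges from true_label to pred_label via their lowest common ancestor."""
    up_true = ancestors(true_label, parent)
    depth_pred = {n: d for d, n in enumerate(ancestors(pred_label, parent))}
    for d_true, node in enumerate(up_true):
        if node in depth_pred:            # first shared ancestor = LCA
            return d_true + depth_pred[node]
    raise ValueError("labels share no common ancestor")

# Hypothetical mini-ontology: child -> parent
parent = {
    "CD4 T cell": "T cell", "CD8 T cell": "T cell",
    "T cell": "lymphocyte", "B cell": "lymphocyte",
    "lymphocyte": "immune cell", "monocyte": "immune cell",
}

# Confusing CD4 with CD8 T cells is a "semantically close" error (distance 2);
# confusing CD4 T cells with monocytes is a severe one (distance 4).
print(lcad("CD4 T cell", "CD8 T cell", parent))  # 2
print(lcad("CD4 T cell", "monocyte", parent))    # 4
```

Averaging this distance over all misclassified cells gives a severity-aware error score rather than a flat accuracy.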

Diagram: Workflow for Mapping Query Data to a Reference Atlas

  • Reference Atlas Data → Train Base Model (e.g., scVI, scANVI) → Pre-trained Reference Model
  • Pre-trained Reference Model + New Query Data → Apply Architectural Surgery (scArches)
  • Apply Architectural Surgery → Fine-Tune Adaptor Weights → Joint Latent Embedding → Downstream Analysis (Clustering, Annotation)

Issue: Low Classification Accuracy for Rare Cell Types

Symptom: Your model fails to identify or has very low accuracy in classifying rare cell populations in a heterogeneous sample (e.g., circulating tumor cells).

Investigation & Resolution:

  • Inspect Feature Selection: The standard practice of selecting Highly Variable Genes (HVGs) might be biased towards more abundant cell types. Re-evaluate your HVG list or consider feature fusion strategies.
  • Employ Multi-Feature Fusion: Relying on a single type of feature (e.g., statistical, deep learning-based) may not capture all the information needed to distinguish rare cells. Frameworks like scMFF integrate multiple complementary features (statistical, information theory, matrix factorization, deep learning) using strategies like weighted sum or attention mechanisms, which have been shown to improve performance and stability on disease-related datasets [99].
  • Data Pre-processing: Ensure that pre-processing steps like normalization are appropriate. For rare cell types, some imputation methods might introduce more noise than signal [98].
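One way to probe the abundance bias mentioned above is to compare pooled HVG selection against per-group selection with a union across groups, loosely mirroring the per-batch HVG options in common single-cell toolkits. This is an illustrative mitigation on synthetic counts, not a procedure prescribed by the cited studies; the gene indices and effect sizes are assumptions.

```python
# Illustrative probe of abundance bias in pooled HVG selection: dispersion-
# based HVGs computed on all cells pooled vs. within each group (union taken).
import numpy as np

def top_hvgs(counts, k):
    """Indices of the k genes with the highest Fano factor (variance/mean)."""
    fano = counts.var(axis=0) / (counts.mean(axis=0) + 1e-8)
    return set(np.argsort(fano)[-k:].tolist())

rng = np.random.default_rng(3)
n_major, n_rare, n_genes = 950, 50, 200
major = rng.poisson(2.0, size=(n_major, n_genes)).astype(float)
rare = rng.poisson(2.0, size=(n_rare, n_genes)).astype(float)

# Genes 0-49: a noisy expression program inside the abundant cell type.
major[: n_major // 2, :50] += 8.0
# Genes 190-199: markers of a sub-state inside the rare population.
rare[: n_rare // 2, 190:] += 12.0

pooled_hvgs = top_hvgs(np.vstack([major, rare]), 20)
groupwise_hvgs = top_hvgs(major, 20) | top_hvgs(rare, 20)

markers = set(range(190, 200))
print("rare-state markers kept (pooled):   ", len(pooled_hvgs & markers))
print("rare-state markers kept (groupwise):", len(groupwise_hvgs & markers))
```

Because the abundant type's variable program dominates the pooled dispersion ranking, the rare population's markers only survive when selection is run within each group.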

Experimental Protocols & Performance

Benchmarking ScFMs on Clinically Relevant Tasks

A 2025 benchmark study evaluated six scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, scCello) against established baselines (e.g., Seurat, Harmony, scVI) on realistic clinical tasks [4].

Objective: To provide a holistic performance ranking and guide model selection for biomedical applications.

Methodology Summary:

  • Feature Extraction: Zero-shot gene and cell embeddings were extracted from each scFM.
  • Downstream Tasks: Embeddings were evaluated on:
    • Cell-level: Pre-clinical batch integration; cell type annotation; cancer cell identification; drug sensitivity prediction.
    • Gene-level: Gene network inference; gene function prediction.
  • Evaluation Metrics: 12 metrics were used, including unsupervised, supervised, and novel knowledge-based metrics like scGraph-OntoRWR and LCAD.
  • Datasets: Five datasets with diverse biological conditions and seven cancer types across four drugs. An independent dataset (AIDA v2) was used for validation.
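The zero-shot setup can be sketched generically: embeddings come from a frozen model, and only a lightweight probe is trained on them. In the sketch below, the random "embeddings" stand in for real scFM output, and the kNN probe, dimensions, and noise level are illustrative choices, not the benchmark's exact protocol.

```python
# Minimal sketch of zero-shot evaluation: the scFM acts purely as a frozen
# feature extractor; a lightweight classifier is fit on its cell embeddings.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_cells, emb_dim, n_types = 300, 64, 3

# Pretend these came from an scFM's forward pass (no fine-tuning): each cell
# type sits near its own center in the latent space.
labels = rng.integers(0, n_types, size=n_cells)
centers = rng.normal(size=(n_types, emb_dim))
embeddings = centers[labels] + 0.5 * rng.normal(size=(n_cells, emb_dim))

# A simple kNN probe measures how well cell types separate in the latent space.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=15),
                         embeddings, labels, cv=5)
print(f"mean zero-shot annotation accuracy: {scores.mean():.2f}")
```

The same frozen embeddings can then be reused for the other downstream tasks (batch integration, drug sensitivity prediction) by swapping out the probe.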

Key Quantitative Results:

Table 1: Overall Model Ranking Across Diverse Tasks (Based on Non-Dominated Sorting) [4]

| Model | Overall Ranking | Notable Strengths |
| --- | --- | --- |
| scGPT | Top Tier | Versatile across tasks, handles multimodal data [4] |
| Geneformer | Top Tier | Robust performance on gene-level tasks [4] |
| scFoundation | Competitive | Strong on large-scale data integration [4] |
| UCE | Competitive | Leverages protein sequence information [4] |
| LangCell | Competitive | Incorporates text-cell pairs [4] |
| scCello | Competitive | Specialized for cell state transitions [4] |
| Baseline (scVI) | Contextual | Can be more efficient for specific, small-scale tasks [4] |
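The non-dominated sorting behind this ranking can be sketched directly: a model sits in the top tier if no other model is at least as good on every metric while strictly better on at least one. The score matrix below is purely hypothetical, not data from [4].

```python
# Sketch of non-dominated (Pareto) sorting over per-task scores.
import numpy as np

def dominates(a, b):
    """True if score vector a dominates b (higher is better)."""
    return np.all(a >= b) and np.any(a > b)

def pareto_fronts(scores):
    """Partition rows of `scores` into successive non-dominated fronts."""
    remaining = set(range(len(scores)))
    fronts = []
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(scores[j], scores[i])
                            for j in remaining if j != i)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

# Hypothetical (model x metric) score matrix: rows = models, cols = tasks.
models = ["scGPT", "Geneformer", "scFoundation", "scVI"]
scores = np.array([
    [0.90, 0.70, 0.80],   # scGPT: strong overall
    [0.70, 0.90, 0.75],   # Geneformer: strong on gene-level tasks
    [0.80, 0.60, 0.70],   # scFoundation: dominated by scGPT's row here
    [0.60, 0.50, 0.85],   # scVI: best on one niche task, so non-dominated
])
for rank, front in enumerate(pareto_fronts(scores), start=1):
    print(f"tier {rank}:", [models[i] for i in front])
```

Note that a model excelling on even one metric (here the baseline's third column) stays in the first front, which is why "Contextual" baselines can outrank foundation models on niche tasks.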

Table 2: Model Performance on Specific Clinical Tasks (Generalized Findings) [4]

| Task | Key Finding | Recommendation |
| --- | --- | --- |
| Cancer Cell Identification | Performance varies significantly by cancer type. | Use task-specific rankings; no single model is universally best. |
| Drug Sensitivity Prediction | scFMs provide robust embeddings for prediction models. | scFMs act as effective plug-and-play feature extractors for this task. |
| Cell Type Annotation | scFMs capture biological knowledge, leading to more semantically meaningful errors (e.g., misclassifying closely related types). | Use the LCAD metric to assess whether misclassifications are biologically plausible. |

Protocol: Multi-Feature Fusion for Enhanced Classification (scMFF Framework)

This protocol is useful for improving cell type identification, especially in complex disease datasets [99].

  • Data Pre-processing:

    • Input: Raw gene expression matrix.
    • Filtering: Remove cells labeled "unknown." Merge cell types with fewer than 3 cells into a new category.
    • Gene Selection: Select the top 2000 Highly Variable Genes (HVGs).
    • Transformation: Apply a log1p transformation: x′ = log(1 + x).
  • Feature Extraction: Generate four distinct feature matrices from the pre-processed data.

    • Statistical-based (f₁): Compute the corrected dispersion (Fano factor) for each gene and use the expression values of the top 2000 HVGs.
    • Information theory-based (f₂, scPSD): Treat the gene expression profile as a signal, compute its power spectral density, and then derive its spectral entropy.
    • Matrix factorization-based (f₃): Apply Principal Component Analysis (PCA) to the expression matrix and take the top d principal components.
    • Deep learning-based (f₄): Extract low-dimensional embeddings from a deep learning model such as a variational autoencoder or graph neural network.
  • Feature Fusion: Integrate the four feature matrices using one of six fusion strategies (e.g., weighted sum, Hadamard product, attention mechanism, mixture-of-experts, residual fusion, or Transformer-based fusion).

  • Classification: Feed the final fused representation into a classifier (e.g., SVM, LightGBM) for cell type identification.
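The steps above can be condensed into a minimal sketch on synthetic counts, using two of the four feature families (an f₁-style HVG view and an f₃-style PCA view) with weighted-sum fusion. The data, class sizes, fusion weights, and dimensions are illustrative assumptions, not scMFF's published settings.

```python
# Condensed scMFF-style pipeline: Fano-factor HVG selection, log1p, two
# feature views projected to a shared dimension, weighted-sum fusion, SVM.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_cells, n_genes, n_hvg, d = 200, 500, 100, 20

# Synthetic counts: the first 60 cells form a type over-expressing genes 0-49.
labels = (np.arange(n_cells) < 60).astype(int)
counts = rng.poisson(2.0, size=(n_cells, n_genes)).astype(float)
counts[:, :50] += 6.0 * labels[:, None]

# Steps 1-2: Fano factor (variance / mean) per gene, keep top HVGs, log1p.
fano = counts.var(axis=0) / (counts.mean(axis=0) + 1e-8)
hvg_idx = np.argsort(fano)[-n_hvg:]
x = np.log1p(counts[:, hvg_idx])

# Step 3: two feature matrices projected to a shared dimension d (a shared
# dimension is what makes element-wise weighted-sum fusion possible).
f1 = PCA(n_components=d).fit_transform(x)                 # HVG-based view
f3 = PCA(n_components=d).fit_transform(np.log1p(counts))  # full-matrix view

# Step 4: weighted-sum fusion (real scMFF tunes or learns these weights).
fused = 0.6 * f1 + 0.4 * f3

# Step 5: classify cell types on the fused representation.
acc = cross_val_score(SVC(), fused, labels, cv=5).mean()
print(f"fused-feature accuracy: {acc:.2f}")
```

Swapping the weighted sum for an attention or mixture-of-experts module changes only step 4; the surrounding pipeline stays identical.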

Diagram: Logical Workflow of the scMFF Framework

  • Raw scRNA-seq Data → Pre-processing (Filter, HVG, log1p) → Feature Extraction
  • Feature Extraction → four parallel feature matrices: Statistical (f₁), Information Theory (f₂), Matrix Factorization (f₃), Deep Learning (f₄)
  • All four feature matrices → Feature Fusion (e.g., Attention, Weighted Sum) → Classifier (e.g., SVM, LightGBM) → Cell Type Prediction

The Scientist's Toolkit

Table 3: Key Research Reagent Solutions for scFM Training and Application

| Item / Resource | Function / Description | Relevance to scFM Research |
| --- | --- | --- |
| CZ CELLxGENE [4] [5] | A unified platform providing access to over 100 million curated and standardized single-cell datasets. | Primary data source for pre-training scFMs and for finding reference atlases for mapping. |
| Highly Variable Genes (HVGs) [99] [98] | A statistical feature set capturing the genes with the highest expression variance across cells. | A foundational feature type for model input, crucial for initial dimensionality reduction and capturing cell-to-cell differences. |
| scArches (Algorithm) [87] | A transfer learning method for mapping new query datasets to existing reference atlases without sharing raw data. | Enables efficient, decentralized, and iterative updating of reference models, critical for clinical collaboration. |
| scGraph-OntoRWR (Metric) [4] | A novel evaluation metric that measures the consistency of cell type relationships captured by an scFM with prior biological knowledge from ontologies. | Moves beyond pure accuracy, assessing the biological relevance of the model's latent embeddings. |
| Roughness Index (ROGI) [4] | A metric that quantifies the "smoothness" of the cell-property landscape in a model's latent space. | Serves as a proxy for model selection; a smoother landscape often indicates easier training for downstream tasks. |
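As rough intuition for ROGI-style model selection, the sketch below scores "roughness" as the mean property gap between nearest neighbors in a latent space: in a smooth landscape, nearby cells carry similar property values. This neighbor-based proxy is our own simplified illustration, not the published ROGI formula.

```python
# Toy neighbor-based roughness proxy: smaller value = smoother landscape.
import numpy as np

def neighbor_roughness(embedding, prop):
    """Mean |property difference| to each point's nearest neighbor."""
    diffs = embedding[:, None, :] - embedding[None, :, :]
    dist = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dist, np.inf)      # exclude each point itself
    nn = dist.argmin(axis=1)
    return np.abs(prop - prop[nn]).mean()

rng = np.random.default_rng(2)
z = rng.normal(size=(300, 2))           # stand-in latent embedding

smooth_prop = z[:, 0]                       # varies smoothly with position
rough_prop = rng.permutation(smooth_prop)   # same values, scrambled in space

# The smooth landscape should score much lower (less rough).
print(neighbor_roughness(z, smooth_prop) < neighbor_roughness(z, rough_prop))
```

Comparing this score across candidate models' embeddings (for a fixed property such as drug response) gives a quick, training-free proxy for which latent space will be easiest to learn from.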

Conclusion

Effective selection of highly variable genes is not merely a preprocessing step but a fundamental determinant of single-cell foundation model success. By integrating robust HVG selection methods that account for batch effects, platform-specific biases, and hierarchical biological relationships, researchers can significantly enhance scFM performance across integration, classification, and knowledge extraction tasks. Future directions should focus on developing more biologically-informed selection criteria, creating standardized benchmarking frameworks, and advancing methods that seamlessly integrate HVG selection with foundation model training pipelines. As scFMs continue to transform biomedical research, optimized HVG selection will be crucial for unlocking deeper insights into cellular function, disease mechanisms, and therapeutic development, ultimately bridging the gap between computational innovation and clinical application.

References