Evaluating Zero-Shot Single-Cell Foundation Model Embeddings: A Comprehensive Guide for Biomedical Researchers

Logan Murphy, Nov 27, 2025


Abstract

Single-cell Foundation Models (scFMs) promise to revolutionize biological discovery by providing powerful, pre-trained representations of single-cell RNA sequencing data. However, their real-world utility, particularly in zero-shot settings where no task-specific fine-tuning is possible, remains a critical and debated topic. This article provides a comprehensive framework for researchers and drug development professionals to evaluate the quality of zero-shot scFM embeddings. We synthesize the latest benchmarking studies to explore the foundational principles of zero-shot evaluation, detail methodological approaches for applying these embeddings to key tasks like cell type annotation and batch integration, offer troubleshooting strategies for overcoming common performance limitations, and present a rigorous framework for the comparative validation of different models against established baselines. Our goal is to equip scientists with the knowledge to effectively leverage scFMs for exploratory biological research and clinical translation.

The Promise and Reality of Zero-Shot Learning in Single-Cell Biology

Zero-shot evaluation, the process of assessing a foundation model's performance on downstream tasks without any task-specific fine-tuning, has emerged as a critical methodology for validating the true biological understanding and utility of single-cell foundation models (scFMs). In exploratory research settings where predefined labels are absent and biological composition is unknown, the ability to leverage pretrained knowledge zero-shot is paramount. Recent benchmarking studies reveal that while scFMs show promise, their zero-shot performance often fails to consistently outperform simpler, established methods across key tasks like cell type annotation and batch integration. This guide provides an objective comparison of current scFM performance, details essential experimental protocols for rigorous evaluation, and equips researchers with the tools needed to critically assess embedding quality for their exploratory biological research.

In single-cell biology, many research scenarios are fundamentally exploratory, where researchers aim to discover novel cell states, identify rare populations, or characterize previously unannotated tissues. In these contexts, predefined classification labels are unavailable, making supervised fine-tuning of specialized models impossible. Zero-shot evaluation directly addresses this challenge by testing whether models have acquired a general, transferable understanding of biology during pretraining that can be applied to novel datasets and problems without additional training [1].

The significance of this evaluation paradigm is profound. It moves beyond simply measuring performance on standardized benchmarks to assessing whether foundation models can genuinely accelerate discovery in realistic research settings. As noted in recent critical assessments, "Understanding the performance of models in zero-shot settings is critical to applications that exclude the ability to fine-tune, such as discovery settings where labels are unknown" [2] [1]. This makes zero-shot evaluation not merely an academic exercise, but an essential validation step for deploying scFMs in practical research workflows, particularly in drug development and clinical applications where biological ground truth may be partially unknown.

Comparative Performance of Single-Cell Foundation Models

Zero-Shot Capabilities Across Biological Tasks

Comprehensive benchmarking studies have evaluated multiple scFMs against traditional baselines using standardized metrics. The performance landscape reveals significant variation across models and tasks, with no single model consistently dominating across all applications. The following table summarizes key findings from large-scale benchmarks evaluating six prominent scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) against established baselines like Highly Variable Genes (HVG), Seurat, Harmony, and scVI [3].

Table 1: Zero-Shot Performance Comparison Across Cell-Level Tasks

| Model | Cell Type Annotation (AvgBIO Score) | Batch Integration (iLISI Score) | Cancer Cell Identification (F1 Score) | Drug Sensitivity Prediction (R²) |
|---|---|---|---|---|
| scGPT | 0.71 | 0.65 | 0.68 | 0.42 |
| Geneformer | 0.64 | 0.52 | 0.61 | 0.38 |
| scFoundation | 0.69 | 0.61 | 0.65 | 0.45 |
| HVG Baseline | 0.73 | 0.72 | 0.70 | 0.47 |
| Harmony Baseline | 0.70 | 0.68 | 0.67 | 0.43 |
| scVI Baseline | 0.72 | 0.70 | 0.69 | 0.46 |

Recent focused evaluations specifically testing scGPT and Geneformer in zero-shot settings found that "both models perform worse than selecting highly variable genes (HVG) and using more established methods such as Harmony and scVI in cell type clustering, as measured by average BIO (AvgBio) score" [1]. This underperformance relative to simpler methods raises important questions about the current state of foundation models in single-cell biology.
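To make this kind of comparison concrete, the minimal sketch below shows how a bio-conservation score can be computed for any cell embedding with scanpy and scikit-learn. The dataset path and the `X_scFM` embedding key are hypothetical placeholders, and the averaged silhouette/ARI/NMI score only approximates the AvgBIO metric defined in the cited benchmarks.

```python
import scanpy as sc
import numpy as np
from sklearn.metrics import silhouette_score, adjusted_rand_score, normalized_mutual_info_score

def avg_bio_like_score(adata, embedding_key, label_key="cell_type", resolution=1.0):
    """Approximate a bio-conservation score for a cell embedding.

    embedding_key: key in adata.obsm holding the embedding (e.g. "X_scFM" or "X_pca").
    label_key: column in adata.obs with reference cell-type labels.
    """
    # Build a neighborhood graph on the embedding and cluster it (Leiden).
    sc.pp.neighbors(adata, use_rep=embedding_key)
    sc.tl.leiden(adata, resolution=resolution, key_added="pred_cluster")

    labels = adata.obs[label_key].values
    clusters = adata.obs["pred_cluster"].values
    X = adata.obsm[embedding_key]

    # Silhouette rescaled to [0, 1], plus label-agreement metrics, averaged.
    asw = (silhouette_score(X, labels) + 1) / 2
    ari = adjusted_rand_score(labels, clusters)
    nmi = normalized_mutual_info_score(labels, clusters)
    return float(np.mean([asw, ari, nmi]))

# Example usage: compare a hypothetical scFM embedding with an HVG + PCA baseline.
# adata = sc.read_h5ad("pancreas.h5ad")                        # annotated dataset (placeholder path)
# sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# sc.pp.pca(adata, use_highly_variable=True)                   # baseline embedding in adata.obsm["X_pca"]
# print("HVG/PCA baseline:", avg_bio_like_score(adata, "X_pca"))
# print("scFM zero-shot  :", avg_bio_like_score(adata, "X_scFM"))  # X_scFM precomputed elsewhere
```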

Performance in Perturbation Prediction

Prediction of cellular responses to genetic perturbations represents a particularly challenging task that tests models' causal understanding of biology. A recent benchmark termed PertEval-scFM systematically evaluated scFMs for perturbation effect prediction against deliberately simple baselines [4]. The results were striking: "Our results show that scFM embeddings offer limited improvement over simple baseline models in the zero-shot setting, particularly under distribution shift" [4].

Table 2: Perturbation Effect Prediction Performance (L2 Distance)

| Model | Double Perturbation Prediction | Unseen Perturbation Generalization | Genetic Interaction Detection (AUC) |
|---|---|---|---|
| scGPT | 0.89 | 0.92 | 0.61 |
| scFoundation | 0.85 | 0.89 | 0.65 |
| Geneformer* | 0.95 | 0.98 | 0.58 |
| Additive Baseline | 0.82 | 0.85 | N/A |
| No Change Baseline | 0.84 | 0.86 | 0.59 |

*Note: Models marked with * were repurposed with linear decoders as they weren't specifically designed for this task [5]. The L2 distance metric measures the difference between predicted and observed expression values, with lower values indicating better performance.

Perhaps most notably, a comprehensive study published in Nature Methods concluded that "deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines" [5]. This finding challenges claims that scFMs have internalized sufficient biological knowledge to accurately predict outcomes of novel experiments.
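The simple baselines used in these comparisons are easy to state explicitly. The sketch below, assuming mean log-normalized expression profiles per condition, illustrates the additive and no-change baselines together with the L2 evaluation metric; it is a simplified reconstruction for intuition, not the PertEval-scFM implementation.

```python
import numpy as np

def additive_baseline(ctrl, pert_a, pert_b):
    """Predict a double perturbation as control plus the sum of single-perturbation shifts.

    ctrl, pert_a, pert_b: 1-D arrays of mean (log-normalized) expression per gene.
    """
    return ctrl + (pert_a - ctrl) + (pert_b - ctrl)

def no_change_baseline(ctrl):
    """Predict that the perturbation has no effect at all."""
    return ctrl

def l2_distance(predicted, observed):
    """Euclidean distance between predicted and observed expression profiles (lower is better)."""
    return float(np.linalg.norm(predicted - observed))

# Toy example with 5 genes (illustrative values only).
ctrl   = np.array([1.0, 2.0, 0.5, 3.0, 1.5])
pert_a = np.array([1.2, 1.8, 0.5, 3.5, 1.5])   # mean profile after perturbing gene A
pert_b = np.array([1.0, 2.0, 0.9, 3.0, 1.1])   # mean profile after perturbing gene B
obs_ab = np.array([1.3, 1.8, 0.9, 3.4, 1.2])   # observed double perturbation

print("additive baseline L2 :", l2_distance(additive_baseline(ctrl, pert_a, pert_b), obs_ab))
print("no-change baseline L2:", l2_distance(no_change_baseline(ctrl), obs_ab))
```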

Experimental Protocols for Zero-Shot Evaluation

Standardized Evaluation Framework

Rigorous zero-shot evaluation requires standardized protocols to ensure fair comparison across models. The BioLLM framework has emerged as a valuable tool for this purpose, providing "standardized APIs and comprehensive documentation" along with support for "zero-shot and fine-tuning support for benchmarking tasks" [6]. The typical evaluation workflow involves several critical stages, as illustrated below:

[Workflow diagram (Zero-Shot Evaluation Protocol): Raw Data → Preprocessing → Preprocessed Data → Embedding Extraction → Cell Embeddings → Task Definition → Metric Calculation → Task Performance → Performance Benchmarking → Comparative Results]

Key Evaluation Metrics and Methodologies

Benchmarking studies employ multiple metrics to evaluate different aspects of model performance. A comprehensive benchmark evaluating six scFMs against established baselines utilized "12 metrics spanning unsupervised, supervised, and knowledge-based approaches" [3]. Key methodologies include:

  • Cell Type Annotation: Evaluated using metrics like Average BIO Score (AvgBIO) and Average Silhouette Width (ASW) to measure clustering quality and separation of known cell types without using label information during embedding generation [3] [1].

  • Batch Integration: Assessed using batch mixing scores (iLISI) and principal component regression (PCR) to quantify the removal of technical artifacts while preserving biological variation [1]. Studies have found that "Geneformer's embeddings across all datasets show a higher proportion of variance explained by batch effects compared to the original data, indicating inadequate batch mixing" [1].

  • Biological Consistency: Novel metrics like scGraph-OntoRWR measure "the consistency of cell type relationships captured by scFMs with prior biological knowledge" by leveraging cell ontology information [3].

  • Perturbation Prediction: Evaluated using L2 distance between predicted and observed expression values, Pearson delta correlation, and genetic interaction detection capability [5].

The evaluation typically follows a zero-shot protocol: pretrained scFM embeddings are extracted without any fine-tuning and used directly for downstream tasks, testing whether they "indeed capture biological insights into the relational structure of genes and cells" [3].
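As an illustration of the batch-mixing metrics discussed above, the sketch below computes a simplified LISI-style score: the inverse Simpson's index of batch labels over each cell's k nearest neighbors, normalized by the number of batches so that values near 1 indicate well-mixed data. The benchmarks use the scib implementation of iLISI, which differs in detail (graph-based, perplexity-weighted); this approximation is for intuition only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def simple_ilisi(embedding, batch_labels, k=30):
    """Simplified iLISI: mean inverse Simpson's index of batch labels over kNN neighborhoods.

    embedding: (n_cells, n_dims) array of cell embeddings.
    batch_labels: length-n_cells array of batch identifiers.
    Returns a score in (0, 1]; higher means better batch mixing.
    """
    batch_labels = np.asarray(batch_labels)
    n_batches = len(np.unique(batch_labels))
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
    _, idx = nn.kneighbors(embedding)            # column 0 is the cell itself
    scores = []
    for neighbors in idx[:, 1:]:
        # Proportion of each batch among the neighbors.
        _, counts = np.unique(batch_labels[neighbors], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))      # inverse Simpson's index
    return float(np.mean(scores) / n_batches)    # normalize so perfect mixing is ~1

# Toy example: two batches drawn from the same distribution should mix well.
rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 10))
batches = np.array(["batch1"] * 100 + ["batch2"] * 100)
print("simplified iLISI:", simple_ilisi(emb, batches))
```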

Implementing rigorous zero-shot evaluation requires specific computational tools and resources. The following table details key solutions available to researchers:

Table 3: Essential Research Tools for Zero-Shot Evaluation

| Tool/Resource | Function | Key Features | Access |
|---|---|---|---|
| BioLLM Framework | Unified interface for scFM evaluation | Standardized APIs, model switching, consistent benchmarking | Open Source [6] |
| PertEval-scFM | Specialized perturbation prediction benchmark | Standardized framework, multiple baseline comparisons | GitHub [4] |
| CELLxGENE Census | Curated single-cell data repository | High-quality datasets, standardized processing | Registry [3] [7] |
| scGraph-OntoRWR | Biological consistency metric | Cell ontology-informed evaluation | Implementation Code [3] |
| AIDA v2 Dataset | Independent evaluation dataset | Asian immune diversity, unbiased validation | CellxGene [3] |

These resources collectively enable researchers to implement comprehensive zero-shot evaluation protocols, leveraging standardized metrics and datasets to ensure comparable results across different studies and models.

Interpretation of Zero-Shot Evaluation Results

Current Limitations and Challenges

The consistent underperformance of scFMs relative to simpler baselines in zero-shot settings points to fundamental challenges in current approaches. The masked language model pretraining framework used by many scFMs may not be optimally suited for learning biologically meaningful representations that transfer zero-shot to diverse tasks [1]. Additionally, there appears to be "an unclear relationship between the pretraining objective and cell type clustering" [1], suggesting that current pretraining tasks may not adequately capture the biological knowledge needed for exploratory research.

The performance variability across different dataset types and tissue contexts further complicates model selection. As noted in benchmarks, "no single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, biological interpretability, and computational resources" [3].

Pathways for Improvement

Recent research suggests several promising directions for improving zero-shot capabilities in scFMs. The introduction of biological knowledge through metrics like LCAD (Lowest Common Ancestor Distance), which "measures the ontological proximity between misclassified cell types," provides a more nuanced evaluation framework [3]. Additionally, multimodal approaches that integrate transcriptomic data with textual annotations, such as CellWhisperer, show potential for enhancing model interpretability and biological grounding [7].

The relationship between pretraining data composition and zero-shot performance also merits further investigation. Evidence suggests that while pretraining provides clear benefits over randomly initialized models, "beyond a certain limit, larger and more diverse datasets may no longer confer additional benefits" [1]. More targeted pretraining strategies focusing on data quality rather than quantity may yield better zero-shot performance.

Zero-shot evaluation represents an essential methodology for assessing the true capabilities of single-cell foundation models in biologically realistic, exploratory research scenarios. Current evidence demonstrates that while scFMs show promising performance across various tasks, their zero-shot capabilities often fail to exceed simpler, established methods. This underscores the importance of rigorous benchmarking using standardized protocols before deploying these models in critical research applications, particularly in drug development and clinical settings where biological discovery is the primary goal. As the field advances, continued focus on biologically meaningful evaluation metrics and specialized model architectures will be essential for developing foundation models that genuinely enhance our ability to explore and understand cellular biology.

Single-cell foundation models (scFMs) represent a revolutionary approach in computational biology, applying the transformer architecture—previously successful in natural language processing—to single-cell transcriptomics data [8]. These models are pretrained on millions of single cells, treating each cell as a "sentence" and genes as "words" to learn fundamental biological principles that can be generalized across diverse downstream tasks [9]. The promise of scFMs lies in their potential to overcome the significant challenges of single-cell RNA sequencing data, including high sparsity, high dimensionality, and low signal-to-noise ratio [3]. By leveraging self-supervised learning on massive datasets, scFMs aim to capture universal biological knowledge during pretraining, endowing them with emergent abilities for zero-shot learning and efficient adaptation to various applications without task-specific training [3].

The evaluation of zero-shot scFM embedding quality has become a critical research focus, as this capability is particularly valuable for exploratory biological research where labeled data may be scarce or unavailable [1]. Unlike fine-tuned applications where models are further trained on specific tasks, zero-shot evaluation tests the model's inherent biological understanding captured during pretraining, providing insights into the true knowledge representation capacity of these foundation models [1]. This overview examines the current scFM landscape, focusing on performance comparisons, methodological approaches, and practical considerations for researchers seeking to leverage these tools in biological and clinical investigations.

Model Architectures & Technical Approaches

Single-cell foundation models share a common conceptual foundation but diverge significantly in their architectural implementations and technical strategies. Most scFMs utilize transformer-based architectures, but with variations tailored to the unique characteristics of single-cell data [8]. The fundamental challenge these models address is the non-sequential nature of gene expression data—unlike words in a sentence, genes lack inherent ordering, requiring innovative tokenization and positional encoding strategies [8] [9].

Tokenization Strategies

Tokenization converts raw gene expression data into discrete units processable by transformer models. A common approach ranks genes within each cell by expression levels, treating the ordered list of top genes as the input sequence [8] [9]. Alternative methods include binning genes by expression values or using normalized counts directly [9]. Gene identifiers are typically represented through embedding layers, while expression values may be handled through value embeddings, value binning, or value projection approaches [3]. Special tokens may be added to represent cell identity, metadata, or multimodal information, enriching the contextual understanding of each cell [9].
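A minimal sketch of the rank-based tokenization described above (Geneformer-style) is shown below. The real models add further normalization, such as gene-level median scaling, and map genes to integer vocabulary ids, so treat this as a conceptual illustration only.

```python
import numpy as np

def rank_tokenize(expression, gene_names, max_len=2048):
    """Convert one cell's expression vector into an ordered list of gene tokens.

    expression: 1-D array of (normalized) counts for one cell.
    gene_names: list of gene identifiers aligned with `expression`.
    Returns gene names ordered from highest to lowest expression,
    keeping only expressed genes and truncating to max_len tokens.
    """
    expression = np.asarray(expression)
    expressed = np.where(expression > 0)[0]                 # drop unexpressed genes
    order = expressed[np.argsort(-expression[expressed])]   # rank by expression, descending
    return [gene_names[i] for i in order[:max_len]]

# Toy example with five genes in one cell.
genes = ["CD3E", "MS4A1", "LYZ", "NKG7", "GAPDH"]
cell = np.array([0.0, 5.0, 1.0, 0.0, 12.0])
print(rank_tokenize(cell, genes))   # -> ['GAPDH', 'MS4A1', 'LYZ']
```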

Architectural Variations

The transformer architectures employed by scFMs generally fall into three categories: encoder-based, decoder-based, and encoder-decoder designs [8]. Encoder-based models like Geneformer use bidirectional attention mechanisms that consider all genes simultaneously, making them well-suited for classification and embedding tasks [3] [8]. Decoder-based models like scGPT employ unidirectional masked self-attention, iteratively predicting masked genes conditioned on known genes, which aligns well with generative tasks [8]. Hybrid encoder-decoder architectures attempt to balance both capabilities [3]. Currently, no single architecture has emerged as definitively superior, with each demonstrating strengths in different applications [8].

Table 1: Architectural Comparison of Prominent Single-Cell Foundation Models

| Model | Architecture Type | Parameters | Pretraining Dataset Size | Input Genes | Output Dimension | Value Embedding | Positional Embedding |
|---|---|---|---|---|---|---|---|
| Geneformer | Encoder | 40M | 30M cells | 2048 ranked genes | 256/512 | – | Ordering |
| scGPT | Decoder | 50M | 33M cells | 1200 HVGs | 512 | Value binning | × |
| scFoundation | Encoder-Decoder | 100M | 50M cells | 19,264 genes | 3072 | Value projection | × |
| UCE | Encoder | 650M | 36M cells | 1024 sampled genes | 1280 | / | – |

Pretraining Objectives

Most scFMs employ masked gene modeling (MGM) as their primary pretraining objective, where random subsets of genes are masked and the model must predict them based on remaining context [3] [8]. However, implementations vary—Geneformer uses categorical cross-entropy loss for gene identity prediction, scGPT employs mean squared error loss for expression value prediction, while UCE utilizes binary cross-entropy to predict whether genes are expressed [3]. These differing objectives shape what knowledge each model captures during pretraining and influence performance across downstream tasks.
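As a rough sketch of the masked gene modeling objective, and not the implementation of any specific model, the snippet below masks a fraction of gene tokens and computes cross-entropy only over the masked positions, following the gene-identity-prediction variant. The toy model and token ids are placeholders.

```python
import torch
import torch.nn.functional as F

def masked_gene_modeling_loss(model, token_ids, mask_token_id, mask_prob=0.15):
    """Cross-entropy over masked positions only (gene-identity prediction variant).

    model: maps (batch, seq_len) token ids to (batch, seq_len, vocab_size) logits.
    token_ids: LongTensor of shape (batch, seq_len) with gene tokens per cell.
    """
    mask = torch.rand(token_ids.shape) < mask_prob           # choose positions to mask
    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id                           # replace with [MASK]

    logits = model(corrupted)                                 # (batch, seq_len, vocab)
    targets = token_ids.clone()
    targets[~mask] = -100                                     # ignore unmasked positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1), ignore_index=-100)

# Toy usage with a trivial "model" (an embedding followed by a linear layer).
vocab_size, mask_id = 100, 0
toy_model = torch.nn.Sequential(torch.nn.Embedding(vocab_size, 32), torch.nn.Linear(32, vocab_size))
tokens = torch.randint(1, vocab_size, (4, 16))                # 4 cells, 16 gene tokens each
loss = masked_gene_modeling_loss(toy_model, tokens, mask_id)
loss.backward()                                               # gradients flow to the toy model
```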

[Diagram: Raw single-cell data (millions of cells) → tokenization strategy → gene, value, and positional embeddings → combined input embedding → transformer layers (attention mechanism) → pretraining task (masked gene modeling) → cell and gene embeddings → downstream applications (cell type annotation, batch integration, etc.)]

Diagram 1: Generalized Architecture of Single-Cell Foundation Models

Comprehensive Performance Benchmarking

Rigorous benchmarking studies have provided critical insights into the practical performance of scFMs across diverse biological tasks. A comprehensive 2025 benchmark evaluated six scFMs against established baselines using 12 metrics spanning unsupervised, supervised, and knowledge-based approaches [3]. The evaluation encompassed two gene-level and four cell-level tasks across datasets with diverse biological conditions, including clinically relevant applications across seven cancer types and four drugs [3].

Zero-Shot Embedding Quality

Zero-shot evaluation, which assesses model performance without task-specific fine-tuning, has revealed significant limitations in current scFMs. Studies demonstrate that in zero-shot settings, popular models like Geneformer and scGPT frequently underperform simpler methods such as Highly Variable Genes (HVG) selection and established integration tools like Harmony and scVI for cell type clustering [1]. Quantitative analysis shows that both Geneformer and scGPT produce cell embeddings with poorer separation of known cell types compared to these baselines across multiple datasets, as measured by average BIO (AvgBio) score and average silhouette width (ASW) metrics [1].

Table 2: Zero-Shot Performance Comparison Across Cell Type Clustering Tasks

| Model | Pancreas Dataset (AvgBIO) | PBMC Dataset (AvgBIO) | Immune Dataset (AvgBIO) | Tabula Sapiens (AvgBIO) | Overall Ranking |
|---|---|---|---|---|---|
| HVG Selection | 0.72 | 0.75 | 0.71 | 0.74 | 1 |
| scVI | 0.68 | 0.73 | 0.69 | 0.70 | 2 |
| Harmony | 0.65 | 0.70 | 0.72 | 0.69 | 3 |
| scGPT | 0.58 | 0.74 | 0.64 | 0.66 | 4 |
| Geneformer | 0.52 | 0.61 | 0.58 | 0.59 | 5 |

Batch Integration Capabilities

Batch integration, which aims to remove technical artifacts while preserving biological variation, presents another challenge for scFMs. Qualitative assessment of the Pancreas benchmark dataset reveals that while Geneformer and scGPT can integrate data from experiments using the same technology, they generally fail to correct for batch effects between different techniques [1]. Geneformer's embeddings particularly struggle, often showing clustering primarily driven by batch effects rather than biological meaningfulness [1]. Quantitative evaluation with batch integration metrics confirms that Geneformer consistently underperforms relative to scGPT, Harmony, scVI, and HVG across most datasets [1].

Perturbation Prediction Performance

Perhaps the most striking benchmark results come from perturbation prediction tasks, where scFMs are evaluated on their ability to predict transcriptome changes after genetic perturbations. A 2025 study in Nature Methods compared five foundation models and two other deep learning models against deliberately simple baselines for predicting effects of single or double gene perturbations [5]. Surprisingly, none of the complex models outperformed simple additive baselines that predict the sum of individual logarithmic fold changes [5]. Furthermore, in predicting genetic interactions—where the phenotype of simultaneous perturbations differs from additive expectations—no model performed better than the "no change" baseline that always predicts control condition expression [5].

Experimental Protocols & Evaluation Methodologies

Robust evaluation of scFMs requires standardized protocols that reflect real-world biological applications. Benchmarking studies have developed sophisticated methodologies to assess model capabilities across diverse tasks and datasets.

Benchmarking Framework Design

The most comprehensive benchmarks employ a multi-faceted approach evaluating both gene-level and cell-level tasks under realistic conditions [3]. Gene-level tasks typically focus on gene function prediction and gene-gene relationship inference, while cell-level tasks include batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction [3]. To ensure biological relevance, benchmarks incorporate large and diverse datasets with high-quality labels and introduce independent, unbiased validation datasets like the Asian Immune Diversity Atlas (AIDA) v2 from CellxGene to mitigate data leakage concerns [3].

Novel Evaluation Metrics

Beyond standard performance metrics, researchers have developed innovative evaluation approaches specifically designed to assess the biological relevance of scFM embeddings. The scGraph-OntoRWR metric measures the consistency of cell type relationships captured by scFMs with prior biological knowledge encoded in cell ontologies [3]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric assesses the ontological proximity between misclassified cell types, providing a biologically-informed measure of annotation error severity [3]. The Roughness Index (ROGI) quantitatively estimates how model performance correlates with cell-property landscape roughness in the pretrained latent space, verifying that performance improvements arise from smoother landscapes that reduce training difficulty for task-specific models [3].
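The LCAD idea can be conveyed with a toy ontology: represent is-a relationships as a directed graph, locate the lowest common ancestor of the true and predicted labels, and count the edges separating them. The sketch below uses networkx on a made-up mini-ontology; the benchmark's actual metric operates on the full Cell Ontology and may weight distances differently.

```python
import networkx as nx

# Toy "is-a" ontology: edges point from child to parent.
edges = [
    ("CD4 T cell", "T cell"), ("CD8 T cell", "T cell"),
    ("T cell", "lymphocyte"), ("B cell", "lymphocyte"),
    ("lymphocyte", "immune cell"), ("monocyte", "immune cell"),
]
onto = nx.DiGraph(edges)

def lcad(true_label, predicted_label, ontology):
    """Lowest-common-ancestor distance between two cell-type labels.

    Returns the number of edges from each label up to their lowest common
    ancestor, summed; 0 when the prediction is exactly right.
    """
    if true_label == predicted_label:
        return 0
    # Ancestors reachable along child -> parent edges, plus the node itself.
    anc_true = nx.descendants(ontology, true_label) | {true_label}
    anc_pred = nx.descendants(ontology, predicted_label) | {predicted_label}
    common = anc_true & anc_pred
    # Lowest common ancestor = the shared ancestor closest to both labels.
    return min(
        nx.shortest_path_length(ontology, true_label, a)
        + nx.shortest_path_length(ontology, predicted_label, a)
        for a in common
    )

print(lcad("CD4 T cell", "CD8 T cell", onto))   # 2: sibling cell types, mild error
print(lcad("CD4 T cell", "monocyte", onto))     # 4: distant mistake, penalized more
```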

[Diagram: scFM models (Geneformer, scGPT, etc.) and benchmark datasets (Pancreas, PBMC, etc.) feed cell-level tasks (batch integration, cell type annotation, drug sensitivity prediction), gene-level tasks (gene function prediction, gene-gene relationship inference), and perturbation prediction (single and double gene perturbations); these are scored with standard metrics (ASW, AvgBIO), biological metrics (scGraph-OntoRWR, LCAD), and performance metrics (ROGI, L2 distance) to produce model ranking and selection guidance]

Diagram 2: Comprehensive Evaluation Framework for scFM Benchmarking

Zero-Shot Evaluation Protocol

Zero-shot evaluation follows a specific protocol where pretrained model embeddings are used directly without any task-specific fine-tuning [1]. For cell type clustering tasks, embeddings are extracted and evaluated using clustering metrics like average silhouette width and BIO scores [1]. For batch integration, metrics assess both batch mixing effectiveness and biological conservation [1]. This approach tests the fundamental biological knowledge encoded during pretraining, separate from any benefits of transfer learning through fine-tuning [1].

Implementing and evaluating scFMs requires specific computational resources and research reagents. The table below details key components necessary for working with single-cell foundation models.

Table 3: Essential Research Reagents and Computational Resources for scFM Research

| Category | Item | Specification/Description | Function/Purpose |
|---|---|---|---|
| Data Resources | CELLxGENE | >100 million unique standardized cells [8] | Primary data source for pretraining and benchmarking |
| | Human Cell Atlas | Multiorgan atlases with broad cell type coverage [9] | Reference data for model evaluation |
| | GEO/SRA Repositories | Thousands of single-cell studies [9] | Supplementary data sources |
| Benchmark Datasets | Pancreas Dataset | Data from five different sources [1] | Evaluation of batch integration capabilities |
| | PBMC 12k Dataset | Peripheral blood mononuclear cells [1] | Immune cell profiling benchmarks |
| | Tabula Sapiens | Multiple tissues from the same donors [1] | Cross-tissue integration evaluation |
| | Norman et al. Perturbation Data | 100 single + 124 double gene perturbations [5] | Perturbation prediction benchmarking |
| Computational Resources | GPU Memory | 16GB+ recommended for model fine-tuning | Handling large transformer models |
| | System RAM | 32GB+ for processing large datasets | Efficient data loading and processing |
| | Storage | TB-scale for pretraining datasets | Storing millions of single-cell profiles |

Critical Analysis & Practical Recommendations

The benchmarking data reveals a nuanced landscape for single-cell foundation models. While scFMs show promise as versatile tools for diverse applications, they currently face significant limitations that researchers must consider when selecting computational approaches.

Context-Dependent Model Performance

A critical finding across multiple studies is that no single scFM consistently outperforms others across all tasks [3]. Performance is highly dependent on the specific application, dataset characteristics, and available computational resources [3]. For example, while scGPT may excel in certain batch integration scenarios, particularly with complex biological batch effects, it may underperform simpler baselines in perturbation prediction [5] [1]. This context-dependence underscores the importance of task-specific model selection rather than seeking a universally superior solution.

When to Choose scFMs vs. Simpler Alternatives

Current evidence suggests that simpler machine learning models often outperform complex foundation models, particularly under resource constraints or for specific tasks like perturbation prediction [3] [5]. The decision to use scFMs should be guided by several factors: dataset size, task complexity, need for biological interpretability, and computational resources [3]. For large, diverse datasets where capturing complex biological relationships is paramount, scFMs may provide benefits. For more focused tasks with limited data, traditional methods may be preferable [3] [1].

Limitations and Future Directions

Significant challenges remain in the development and application of scFMs. The non-sequential nature of omics data continues to present architectural challenges, and current tokenization strategies remain somewhat arbitrary [8]. Data quality inconsistencies across studies and the computational intensity of training and fine-tuning pose practical barriers to widespread adoption [8]. Most importantly, interpreting the biological relevance of latent embeddings and model representations remains nontrivial [8]. Future directions likely include developing more biologically-grounded architectures, incorporating multimodal data, and improving evaluation methodologies to better assess true biological understanding rather than technical benchmarking performance [3] [8].

For researchers navigating this evolving landscape, practical recommendations include: (1) always benchmarking scFMs against simpler baselines for specific tasks of interest, (2) carefully considering whether zero-shot or fine-tuned approaches align with research goals, and (3) utilizing the roughness index (ROGI) as a proxy for model selection in dataset-dependent applications [3]. As the field matures, continued rigorous benchmarking and biological validation will be essential to realizing the potential of foundation models in single-cell genomics.

Masked Language Modeling (MLM) has emerged as a powerful self-supervised learning paradigm for biological sequences, enabling models to learn rich, contextual representations of DNA, RNA, proteins, and single-cell data without extensive labeled datasets. Originally developed for natural language processing, the MLM framework treats biological elements—nucleotides, amino acids, or genes—as tokens in a biological "language." During pre-training, the model learns to predict randomly masked elements based on their context, forcing it to internalize complex structural and functional relationships within biological sequences. This approach has proven particularly valuable in biology, where labeled experimental data is often scarce and expensive to produce, while unlabeled sequence data is abundantly available. The representations learned through MLM create a foundational understanding of biological grammar that can be efficiently adapted to diverse downstream predictive tasks through fine-tuning or zero-shot transfer, establishing MLM as a cornerstone of modern computational biology.

Core Mechanism of Masked Language Modeling

Fundamental Principles and Biological Adaptation

The MLM process operates on a simple but powerful premise: corrupt input sequences by masking random tokens and train the model to reconstruct the original sequence. In biological implementations, this involves several key steps:

  • Tokenization: Biological sequences are divided into meaningful units—single nucleotides or k-mers for DNA/RNA, amino acids for proteins, or gene identifiers for single-cell data. For example, DNABERT2 uses byte-pair encoding to build its vocabulary, while Nucleotide Transformer employs non-overlapping k-mers [10].

  • Masking Strategy: Typically, 15-20% of input tokens are randomly selected for masking. Most are replaced with a special [MASK] token, some with random tokens, and others left unchanged to encourage robust representation learning [10].

  • Contextual Prediction: The model processes the entire corrupted sequence using transformer architectures, learning to predict original tokens based on bidirectional context rather than just preceding elements [11] [10].

This self-supervised objective forces the model to develop a sophisticated understanding of biological grammar—including structural constraints, evolutionary patterns, and functional motifs—without explicit human labeling. The model essentially learns the "language of life" by filling in biological blanks, developing representations that encode fundamental biological principles through exposure to millions of sequences.
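The masking recipe described above can be written down in a few lines. The sketch below applies the standard BERT-style 80/10/10 corruption to a batch of token ids; the ids and the special mask token are arbitrary placeholders rather than any model's actual vocabulary.

```python
import torch

def bert_style_mask(token_ids, vocab_size, mask_token_id, mask_prob=0.15):
    """Corrupt a batch of token ids with the 80/10/10 BERT masking scheme.

    Returns (corrupted_ids, target_mask) where target_mask marks positions
    the model should be trained to reconstruct.
    """
    corrupted = token_ids.clone()
    target_mask = torch.rand(token_ids.shape) < mask_prob                        # ~15% of positions

    decision = torch.rand(token_ids.shape)
    replace_with_mask = target_mask & (decision < 0.8)                           # 80% -> [MASK]
    replace_with_random = target_mask & (decision >= 0.8) & (decision < 0.9)     # 10% -> random token
    # The remaining 10% of selected positions are left unchanged.

    corrupted[replace_with_mask] = mask_token_id
    corrupted[replace_with_random] = torch.randint(vocab_size, (int(replace_with_random.sum()),))
    return corrupted, target_mask

# Toy usage: a batch of 2 sequences of 10 tokens from a 50-token vocabulary.
tokens = torch.randint(1, 50, (2, 10))
corrupted, targets = bert_style_mask(tokens, vocab_size=50, mask_token_id=0)
print(tokens)
print(corrupted)
```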

Architectural Implementations Across Biological Domains

While the core MLM principle remains consistent, its architectural implementation varies across biological domains:

  • RNA Language Models: ERNIE-RNA modifies the standard BERT architecture with base-pairing-informed attention bias, enabling it to capture structural constraints during pre-training without relying on potentially inaccurate predicted structures [11].

  • Single-Cell Foundation Models: Models like scBERT and scGPT adapt the MLM framework to gene expression data, masking highly-variable genes and predicting their expression levels based on the cellular context [3].

  • Genomic Language Models: DNABERT2 implements efficient transformer variants with flash attention to handle extremely long genomic sequences, while HyenaDNA uses selective state-space models as an alternative to traditional attention mechanisms [10].

  • Chemical Language Models: BARTSmiles applies a BART-like denoising approach to SMILES strings, learning rich molecular representations that capture chemical properties and substructure relationships [12].

The following diagram illustrates the core MLM workflow for biological sequences:

[Diagram: Original biological sequence → tokenization → masked sequence with [MASK] tokens → transformer encoder → mask token prediction and contextual representations]

Comparative Performance Across Biological Domains

RNA Structure and Function Prediction

MLM-based RNA models demonstrate remarkable capability in capturing structural information directly from sequence data. ERNIE-RNA exemplifies this advancement, having been pre-trained on 20.4 million RNA sequences from RNAcentral and achieving state-of-the-art performance across multiple benchmarks [11].

Table 1: Performance Comparison of RNA Language Models on Secondary Structure Prediction

| Model | Pre-training Data Size | Architecture | Zero-shot F1-score | Fine-tuned Performance |
|---|---|---|---|---|
| ERNIE-RNA | 20.4M sequences | Structure-aware BERT | 0.55 (zero-shot) | SOTA across multiple tasks |
| RNA-FM | 23M sequences | Standard Transformer | N/A | Competitive performance |
| UNI-RNA | 1B sequences | Scaled Transformer | N/A | Strong generalist |
| UTR-LM | mRNA UTRs only | Structure-informed BERT | N/A | Specialized for UTRs |

ERNIE-RNA's distinctive innovation lies in its base-pairing-informed attention mechanism, which assigns attention biases according to canonical base-pairing rules (AU=2, CG=3, GU=0.8) [11]. This structural prior enables the model to develop attention maps that directly capture RNA secondary structure through zero-shot prediction, outperforming conventional thermodynamic methods like RNAfold. The model's 12-layer transformer architecture with 12 attention heads and ~86 million parameters effectively captures both local and global RNA features in its attention maps (L×L×156) and token embeddings (12×768×L) [11].
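To illustrate the base-pairing-informed bias, the sketch below builds a pairwise bias matrix from a raw RNA sequence using the weights quoted above (AU=2, CG=3, GU=0.8). In ERNIE-RNA this matrix is injected into the attention computation of the pretrained model; only the matrix construction is shown here, and the minimum-loop constraint is an illustrative assumption.

```python
import numpy as np

# Bias assigned to canonical and wobble base pairs (values quoted from the ERNIE-RNA description).
PAIR_BIAS = {frozenset("AU"): 2.0, frozenset("CG"): 3.0, frozenset("GU"): 0.8}

def base_pair_bias(sequence, min_loop=3):
    """Build an L x L attention-bias matrix for an RNA sequence.

    Positions that could form a canonical or wobble pair, and that are far
    enough apart to allow a hairpin loop, receive a positive bias; all other
    position pairs receive 0.
    """
    L = len(sequence)
    bias = np.zeros((L, L))
    for i in range(L):
        for j in range(i + min_loop + 1, L):
            pair = frozenset((sequence[i], sequence[j]))
            if pair in PAIR_BIAS:
                bias[i, j] = bias[j, i] = PAIR_BIAS[pair]
    return bias

print(base_pair_bias("GGGAAACCC"))   # strong CG bias between the stem-forming positions
```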

Single-Cell Data Representation

In single-cell transcriptomics, MLM-based foundation models (scFMs) face the unique challenge of representing unordered, high-dimensional gene expression data rather than sequential data. The benchmark evaluation of six prominent scFMs reveals nuanced performance across different task types [3].

Table 2: Single-Cell Foundation Model Performance Across Task Types

| Model | Cell Type Annotation | Batch Integration | Perturbation Response | Clinical Prediction | Biological Interpretability |
|---|---|---|---|---|---|
| Geneformer | Strong | Moderate | Variable | Moderate | High |
| scGPT | Strong | Strong | Strong | Strong | Medium |
| UCE | Moderate | Moderate | Moderate | Moderate | High |
| scFoundation | Strong | Strong | Variable | Strong | Medium |
| LangCell | Moderate | Strong | N/A | N/A | High |
| scCello | Variable | Moderate | N/A | N/A | Medium |

Notably, the benchmark introduced novel ontology-informed evaluation metrics including scGraph-OntoRWR (measuring consistency of captured cell type relationships with biological knowledge) and Lowest Common Ancestor Distance (measuring ontological proximity between misclassified cell types) [3]. A key finding was that no single scFM consistently outperformed others across all tasks, emphasizing that model selection must be tailored to specific applications based on dataset size, task complexity, and computational resources [3].

Genomic Sequence Modeling

For genomic DNA sequences, the representational power of MLM-based gLMs shows more mixed results. A comprehensive evaluation of pre-trained gLMs including Nucleotide Transformer, DNABERT2, and HyenaDNA assessed their performance on six regulatory genomics tasks without fine-tuning [10].

Table 3: Genomic Language Model Performance on Regulatory Prediction Tasks

| Model | Architecture | Tokenization | Pre-training Data | Performance vs. One-hot Baseline |
|---|---|---|---|---|
| Nucleotide Transformer | BERT-style | k-mers | 850 species + human genomes | Comparable or slightly better |
| DNABERT2 | Efficient Transformer | Byte-pair | 850 species | Comparable |
| HyenaDNA | Selective State-Space | Single nucleotide | Human genome | Comparable |
| GPN | Dilated Convolution | Single nucleotide | A. thaliana + related species | Comparable |

Surprisingly, representations from pre-trained gLMs failed to provide substantial advantages over conventional supervised models using one-hot encoded sequences across tasks including predicting cell-type-specific regulatory activity from lentiMPRA data, chromatin profile prediction, and transcription factor binding prediction [10]. This suggests current gLMs may not adequately capture cell-type-specific functional elements during pre-training, highlighting a significant limitation in their application to regulatory genomics.

Experimental Protocols for Evaluating MLM Representations

Zero-shot RNA Secondary Structure Prediction

ERNIE-RNA's structural capabilities were evaluated through a rigorous zero-shot prediction protocol [11]:

  • Attention Map Extraction: Extract attention maps from the pretrained transformer (an L×L×156 tensor for a sequence of length L, aggregated across the model's 12 layers of attention heads).

  • Contact Map Derivation: Process raw attention weights through a ResNet-based post-processing network to convert them into base-pairing probabilities.

  • Performance Benchmarking: Compare predicted structures against experimentally derived structures using precision, recall, and F1-score metrics.

  • Comparative Methods: Evaluate against thermodynamic methods (RNAfold, RNAstructure) and other RNA language models without structural priors.

This protocol demonstrated that ERNIE-RNA's attention maps naturally capture RNA architecture without explicit structural supervision during training, achieving an F1-score up to 0.55 in zero-shot prediction [11].
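Scoring a predicted secondary structure against a reference reduces to comparing two sets of base pairs. The following generic sketch computes precision, recall, and F1 from predicted and true pair sets; it is only the evaluation arithmetic, not ERNIE-RNA's post-processing network.

```python
def structure_f1(predicted_pairs, true_pairs):
    """Precision, recall, and F1 for predicted base pairs against a reference structure.

    Each argument is a set of (i, j) index tuples with i < j.
    """
    predicted_pairs, true_pairs = set(predicted_pairs), set(true_pairs)
    tp = len(predicted_pairs & true_pairs)
    precision = tp / len(predicted_pairs) if predicted_pairs else 0.0
    recall = tp / len(true_pairs) if true_pairs else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example: a small hairpin with one mispredicted pair.
true_structure = {(0, 8), (1, 7), (2, 6)}
predicted_structure = {(0, 8), (1, 7), (3, 6)}
print(structure_f1(predicted_structure, true_structure))   # (0.667, 0.667, 0.667)
```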

Single-Cell Foundation Model Benchmarking

The evaluation framework for scFMs employed a comprehensive approach to assess biological relevance and practical utility [3]:

  • Task Selection: Two gene-level and four cell-level tasks including batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction.

  • Dataset Curation: Five datasets with diverse biological conditions and seven cancer types with four drugs for clinical relevance assessment.

  • Evaluation Metrics: Twelve metrics spanning unsupervised, supervised, and novel knowledge-based approaches including scGraph-OntoRWR and LCAD.

  • Baseline Comparison: Compare against traditional methods including HVG selection, Seurat, Harmony, and scVI.

  • Zero-shot Protocol: Evaluate pretrained embeddings without fine-tuning to assess inherent biological knowledge.

This multifaceted protocol revealed that while scFMs capture meaningful biological relationships, simpler models sometimes outperform them on specific tasks, particularly under resource constraints [3].

Visualization of MLM Applications in Biology

The application of MLM across biological domains involves specialized architectural adaptations. The following diagram illustrates three major implementations in RNA, single-cell, and genomic modeling:

[Diagram: Core MLM framework adapted across domains: RNA language modeling (ERNIE-RNA) with structural priors (base-pairing attention bias); single-cell foundation models (scGPT, Geneformer) with expression modeling (gene ranking + value embedding); genomic language models (Nucleotide Transformer) with k-mer tokenization (evolutionary context); chemical language models (BARTSmiles) with SMILES tokenization (chemical grammar)]

Emerging Architectures and Future Directions

Beyond Transformer Architectures

While transformer-based architectures currently dominate biological MLM, emerging architectures show significant promise:

  • xLSTM Variants: The Bio-xLSTM suite adapts the recently proposed xLSTM architecture for biological sequences, offering linear runtime dependency on sequence length compared to transformers' quadratic scaling [13]. This enables handling of longer genomic contexts and more efficient in-context learning for protein and chemical sequences.

  • State-Space Models: Models like HyenaDNA and Mamba provide alternative sequence modeling approaches with improved computational efficiency for very long sequences [10] [13].

  • Hybrid Approaches: Architectures combining convolutional networks with recurrent components (CGRN) have demonstrated strong performance in resource-constrained settings, achieving 73.1% F1-score on secondary structure prediction and 84% on intrinsically disordered region prediction with fewer parameters [14].

Specialized Biological Priors

The most successful biological MLM implementations incorporate domain-specific knowledge:

  • Structural Priors: ERNIE-RNA demonstrates that incorporating base-pairing rules as attention biases significantly enhances structural awareness without compromising generalizability [11].

  • Evolutionary Context: Models like Nucleotide Transformer pre-trained on multiple species capture evolutionary constraints that improve functional prediction [10].

  • Multi-scale Modeling: Effective biological MLMs capture both local motifs and global sequence organization, often through hierarchical architectures or multi-head attention mechanisms that specialize in different granularities.

Essential Research Reagents for MLM in Biology

Table 4: Key Computational Tools and Resources for Biological MLM Research

| Resource Type | Specific Tools/Databases | Primary Function | Access |
|---|---|---|---|
| Sequence Databases | RNAcentral, UniProt, NCBI Taxonomy | Pre-training data sources | Public |
| Benchmark Datasets | lentiMPRA, AIDA v2, CellxGene | Evaluation and validation | Public |
| Tokenization Tools | SentencePiece, Byte-pair encoding | Sequence preprocessing | Open source |
| Model Architectures | Transformer variants, xLSTM, SSMs | Model implementation | Open source |
| Evaluation Metrics | scGraph-OntoRWR, LCAD, F1-score | Performance assessment | Custom |
| Visualization Tools | Attention map visualization, UMAP | Interpretation and analysis | Mixed |

Masked Language Modeling has fundamentally transformed computational biology by enabling models to learn rich biological representations from unlabeled sequence data. The core MLM principle remains consistent across domains, but successful implementation requires careful adaptation to biological specifics—structural priors for RNA, expression patterns for single-cell data, and evolutionary context for genomics. Current evidence suggests that MLM-based models excel when they incorporate domain knowledge, as demonstrated by ERNIE-RNA's structural awareness and scFMs' capture of cell type relationships. However, limitations remain, particularly in genomic applications where simple one-hot encoding sometimes competes with sophisticated pre-trained representations. The emerging landscape of biological MLM is increasingly diverse, with transformer architectures being complemented by efficient alternatives like xLSTM and state-space models. Future progress will likely depend on developing better biological priors, more sophisticated evaluation methodologies, and architectures that can scale to handle the full complexity of biological systems while remaining computationally feasible for research laboratories.

Single-cell foundation models (scFMs) represent a revolutionary approach in computational biology, designed to learn universal patterns from massive single-cell transcriptomics data. These models, including prominent examples like Geneformer, scGPT, and scFoundation, leverage transformer architectures pretrained on millions of cells with the goal of creating foundational biological knowledge that can be adapted to various downstream tasks [8]. The core premise is that through exposure to vast and diverse datasets, scFMs can learn the fundamental "language" of cells, where individual cells are treated analogously to sentences and genes or genomic features as words or tokens [8]. This approach promises to overcome the challenges of high sparsity, dimensionality, and noise inherent in single-cell RNA sequencing (scRNA-seq) data [3].

However, a critical gap has emerged between the intended capabilities of these models during their design and pretraining phase and their actual performance on real-world biological tasks. While scFMs are theorized to excel in zero-shot settings—where pretrained embeddings are used directly without task-specific fine-tuning—rigorous benchmarking has revealed significant limitations in this paradigm [1]. Understanding this gap is essential for researchers, scientists, and drug development professionals who seek to leverage these powerful tools for biological discovery and clinical applications, particularly in scenarios where labeled data for fine-tuning is unavailable or impractical to obtain.

Quantitative Performance Comparison Across Task Types

Cell-Type Clustering and Annotation

Cell-type identification represents a fundamental application in single-cell analysis where scFMs were expected to demonstrate strong performance. However, comprehensive benchmarking reveals that in zero-shot settings, foundation models frequently underperform simpler established methods. The following table summarizes performance comparisons across multiple datasets, measured by Average BIO (AvgBIO) score, where higher values indicate better performance:

| Model/Dataset | Pancreas | PBMC (12k) | Tabula Sapiens | Immune |
|---|---|---|---|---|
| HVG | 0.71 | 0.69 | 0.65 | 0.67 |
| Harmony | 0.68 | 0.66 | 0.62 | 0.64 |
| scVI | 0.70 | 0.67 | 0.64 | 0.65 |
| scGPT | 0.63 | 0.70 | 0.61 | 0.59 |
| Geneformer | 0.55 | 0.57 | 0.52 | 0.51 |

Table 1: Cell-type clustering performance (AvgBIO score) across models and datasets. HVG (Highly Variable Genes) and established methods like Harmony and scVI consistently outperform scFMs in zero-shot settings [1].

Notably, the simple approach of selecting highly variable genes (HVG) outperformed both scGPT and Geneformer across all metrics and datasets [1]. While pretraining provides some benefit—with scGPT showing improvement over randomly initialized models—the performance gains are inconsistent and do not reliably surpass simpler alternatives, even when evaluation datasets were included in the pretraining corpus [1].

Batch Integration Performance

Batch integration, which aims to remove technical artifacts while preserving biological signal, is another critical task for single-cell analysis. The performance of scFMs in zero-shot batch integration has been systematically evaluated using metrics that balance batch mixing (iLISI) and biological conservation (cLISI):

| Model | Pancreas (iLISI) | PBMC (cLISI) | Tabula Sapiens (iLISI) | Immune (cLISI) |
|---|---|---|---|---|
| HVG | 0.85 | 0.88 | 0.82 | 0.85 |
| Harmony | 0.82 | 0.85 | 0.75 | 0.87 |
| scVI | 0.84 | 0.86 | 0.80 | 0.83 |
| scGPT | 0.76 | 0.82 | 0.78 | 0.80 |
| Geneformer | 0.58 | 0.61 | 0.55 | 0.59 |

Table 2: Batch integration performance across models and datasets. Geneformer consistently underperforms, while scGPT shows variable results depending on dataset characteristics [1].

Qualitative assessment reveals that while Geneformer's embeddings primarily cluster by batch effects rather than cell type, scGPT offers some separation of cell types but still exhibits batch-driven structure in dimensionality reductions [1]. The superior performance of HVG highlights that complex foundation models do not necessarily capture more biologically meaningful representations than carefully selected gene subsets.

Perturbation Effect Prediction

Predicting cellular responses to genetic perturbations represents a particularly challenging task where scFMs were expected to demonstrate emergent capabilities. However, benchmarking studies reveal significant limitations:

| Model | Double Perturbation L2 Distance | Seen Perturbation Performance | Unseen Perturbation Performance |
|---|---|---|---|
| Additive Baseline | 1.02 | 1.15 | 1.24 |
| No Change Baseline | 1.31 | 1.42 | 1.48 |
| scGPT | 1.85 | 1.91 | 2.02 |
| Geneformer* | 2.12 | 2.24 | 2.31 |
| scFoundation | 1.79 | 1.84 | 1.95 |
| UCE* | 1.96 | 2.08 | 2.17 |

Table 3: Perturbation effect prediction performance (lower L2 values indicate better performance). Models marked with * were repurposed with linear decoders as they weren't specifically designed for this task. Simple baselines outperform foundation models [15].

In predicting genetic interactions, none of the deep learning models outperformed the "no change" baseline, and all models predominantly predicted buffering interactions while rarely correctly identifying synergistic interactions [15]. Furthermore, a linear model using pretrained embeddings from scFoundation and scGPT performed as well as or better than the full foundation models with their built-in decoders, suggesting that the learned representations provide limited additional value for this task [15].
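The linear-probe idea behind this comparison is straightforward to sketch: freeze the foundation model, treat its embeddings as fixed features, and fit a ridge regression to predict expression changes. The snippet below uses synthetic arrays standing in for precomputed perturbation embeddings and observed log fold changes, so it is illustrative only and not tied to any particular model's API.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical inputs: one row per perturbation.
#   pert_embeddings: frozen scFM embedding of the perturbed gene(s), shape (n_perts, emb_dim)
#   observed_lfc:    observed log fold change per gene, shape (n_perts, n_genes)
n_perts, emb_dim, n_genes = 120, 64, 500
pert_embeddings = rng.normal(size=(n_perts, emb_dim))
observed_lfc = (pert_embeddings @ rng.normal(size=(emb_dim, n_genes))) * 0.1 \
    + rng.normal(scale=0.05, size=(n_perts, n_genes))

X_train, X_test, y_train, y_test = train_test_split(
    pert_embeddings, observed_lfc, test_size=0.2, random_state=0
)

# Linear probe: ridge regression on frozen embeddings, no fine-tuning of the scFM itself.
probe = Ridge(alpha=1.0).fit(X_train, y_train)
pred = probe.predict(X_test)

# Evaluate with mean L2 distance per held-out perturbation (lower is better).
mean_l2 = np.mean(np.linalg.norm(pred - y_test, axis=1))
print(f"mean L2 distance of the linear probe on held-out perturbations: {mean_l2:.3f}")
```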

Experimental Protocols and Evaluation Methodologies

Zero-Shot Evaluation Framework

The critical evaluation of scFMs requires rigorous experimental protocols that assess their performance without fine-tuning, as this most accurately reflects many real-world discovery scenarios. The standard zero-shot evaluation protocol involves:

  • Embedding Extraction: Precomputed cell embeddings are generated from the frozen pretrained models without any parameter updates or fine-tuning [1]. This assesses the intrinsic quality of representations learned during pretraining.

  • Task-Specific Evaluation: The embeddings are used directly for downstream tasks including:

    • Cell-type clustering: Evaluating separation of known cell types using metrics like Average BIO score and Average Silhouette Width (ASW) [1].
    • Batch integration: Assessing removal of technical artifacts while preserving biological variation using metrics like iLISI and cLISI [1].
    • Perturbation prediction: Measuring ability to predict transcriptomic changes after genetic perturbations using L2 distance between predicted and observed expression [15].
  • Baseline Comparison: Performance is compared against established methods including HVG selection, Harmony, and scVI, as well as simple mathematical baselines for perturbation prediction [1] [15].

This evaluation strategy has proven essential for revealing limitations not apparent in fine-tuned scenarios, providing a more realistic assessment of model capabilities for exploratory research where labels are unavailable [1].

Benchmarking Datasets and Quality Control

Robust benchmarking requires diverse, high-quality datasets with validated annotations. Key datasets used in scFM evaluation include:

  • Pancreas Dataset: Combines data from five different sources with known batch effects [1].
  • Tabula Sapiens: A comprehensive reference atlas with carefully annotated cell types across tissues [1].
  • Immune Datasets: Including PBMC (Peripheral Blood Mononuclear Cell) datasets that capture immune cell diversity [1].
  • Perturbation Datasets: CRISPR-based perturbation data from Norman et al. and Replogle et al. for evaluating perturbation prediction [15].

To mitigate the risk of data leakage and ensure fair evaluation, independent unbiased datasets like the Asian Immune Diversity Atlas (AIDA) v2 from CellxGene are increasingly used [3]. Additionally, the Pretraining Data Influence analysis examines performance on datasets that were either included or excluded from model pretraining to assess generalization versus memorization [1].

[Diagram: Single-cell data → tokenization and input encoding → transformer architecture → pretraining objectives → model embeddings → cell-type clustering, batch integration, perturbation prediction, and biological interpretation; downstream performance is compared against simple baselines (HVG, scVI, Harmony), exposing the performance gap]

Diagram 1: ScFM evaluation workflow showing the performance gap between intended and actual use.

Architectural Foundations and Technical Implementation

Model Architectures and Pretraining Strategies

Current scFMs predominantly leverage transformer architectures, but with significant modifications to accommodate the unique characteristics of single-cell data:

| Model | Architecture Type | Pretraining Data Scale | Tokenization Strategy | Positional Encoding |
|---|---|---|---|---|
| Geneformer | Encoder | 30 million cells | Rank-based gene ordering | Yes |
| scGPT | Decoder | 33 million cells | Value binning + HVGs | No |
| UCE | Encoder | 36 million cells | Genomic position ordering | Yes |
| scFoundation | Encoder-Decoder | 50 million cells | Full gene set | No |

Table 4: Architectural variations across prominent single-cell foundation models [3].

Unlike natural language, gene expression data lacks inherent sequential ordering, necessitating specialized tokenization approaches. Common strategies include ranking genes by expression levels within each cell, binning expression values, or using normalized counts directly [8]. These approaches represent fundamental engineering decisions that significantly impact model performance and biological relevance.
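Value binning, the scGPT-style alternative to rank ordering, discretizes each cell's nonzero expression values into a fixed number of bins so that expression magnitude becomes a categorical token. A minimal sketch assuming per-cell quantile binning is shown below; the exact number of bins and the binning scheme vary by model.

```python
import numpy as np

def bin_expression(expression, n_bins=51):
    """Discretize one cell's expression vector into integer bin tokens.

    Zeros stay in bin 0; nonzero values are assigned to quantile bins 1..n_bins-1
    computed within the cell, so the encoding is robust to per-cell depth differences.
    """
    expression = np.asarray(expression, dtype=float)
    bins = np.zeros(expression.shape, dtype=int)
    nonzero = expression > 0
    if nonzero.any():
        # Quantile edges over the nonzero values of this cell.
        edges = np.quantile(expression[nonzero], np.linspace(0, 1, n_bins))
        bins[nonzero] = np.digitize(expression[nonzero], edges[1:-1]) + 1
    return bins

# Toy example: five genes in one cell.
cell = np.array([0.0, 1.0, 3.0, 0.0, 10.0])
print(bin_expression(cell, n_bins=5))   # zeros -> 0, nonzero values spread across bins 1..4
```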

Attention Mechanisms and Biological Interpretation

The attention mechanisms in transformer architectures theoretically enable scFMs to learn gene-gene interactions and regulatory relationships. However, interpreting these attention weights as direct biological pathways remains challenging [8]. The discrepancy between theoretical capability and practical interpretability represents a significant hurdle in bridging the intended-actual use gap.

Specialized evaluation metrics like scGraph-OntoRWR have been developed to quantitatively assess whether the relational structure of cell types captured by scFM embeddings aligns with established biological knowledge from cell ontologies [3]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric measures the ontological proximity between misclassified cell types, providing a biologically grounded assessment of error severity [3].

[Architecture schematic: Input Gene Expression → Gene Token Embedding + Value Embedding + Positional Encoding → Combined Input Representation → Transformer Layers with iterative Self-Attention → Cell Embedding and Gene Embedding; cell embeddings feed masked language model pretraining, cell type annotation, and batch integration, while gene embeddings feed perturbation prediction.]

Diagram 2: Architectural components and information flow in single-cell foundation models.

Resource Category Specific Examples Function/Role in scFM Research
Data Repositories CELLxGENE, Human Cell Atlas, PanglaoDB, GEO/SRA Provide standardized, annotated single-cell datasets for model pretraining and evaluation [8]
Benchmarking Frameworks PertEval-scFM, scGraph-OntoRWR, LCAD metrics Enable standardized evaluation of model performance across diverse biological tasks [3] [16]
Baseline Methods HVG selection, Harmony, scVI, additive perturbation models Provide critical performance comparisons to assess actual value of complex scFMs [1] [15]
Biological Ontologies Cell Ontology, Gene Ontology, Protein-protein interaction networks Offer prior biological knowledge for designing biologically meaningful evaluation metrics [3]
Computational Infrastructure GPU clusters (A100), high-performance computing environments Enable training and evaluation of large-scale models with millions of parameters and training examples [3]

Table 5: Essential research resources for scFM development and evaluation.

The comprehensive benchmarking of single-cell foundation models reveals a significant gap between their intended capabilities during pretraining and their actual performance on real-world biological tasks. While scFMs represent a theoretically promising approach for learning universal biological representations, their zero-shot performance frequently fails to surpass simpler, established methods across critical tasks including cell-type annotation, batch integration, and perturbation prediction [1] [15].

This performance gap underscores several critical considerations for researchers and drug development professionals:

  • Task-Specific Model Selection: No single scFM consistently outperforms others across all tasks, emphasizing the need for tailored model selection based on dataset size, task complexity, and available computational resources [3].

  • Practical Recommendations: For standard analyses like cell-type clustering and batch correction, established methods like Harmony and scVI currently provide more reliable performance. For perturbation prediction, simple linear baselines remain surprisingly competitive [1] [15].

  • Future Development Directions: Bridging the intention-reality gap will require innovations in pretraining objectives that better align with biological reasoning, improved evaluation metrics that capture biological relevance, and architectural refinements that more effectively model gene regulatory networks [3] [8].

As the field matures, rigorous zero-shot evaluation must become standard practice to accurately assess the true capabilities and limitations of these powerful models. By maintaining a critical perspective and grounding expectations in empirical evidence, the research community can progressively narrow the gap between the intended and actual utility of single-cell foundation models in biological discovery and therapeutic development.

Single-cell foundation models (scFMs) are large-scale deep learning models, typically based on transformer architectures, pretrained on millions of single-cell transcriptomes to learn the fundamental 'language' of cells [9] [8]. A core promise of these models is their potential for zero-shot learning—the ability to extract biologically meaningful insights from new data using only their pretrained representations, without any task-specific fine-tuning [3]. This capability is crucial for applications where labeled data is scarce, such as in the study of rare cell types or early-stage drug discovery. However, as the field rapidly advances, a critical question remains: do the latent embeddings generated by scFMs in a zero-shot setting truly capture meaningful biology, or are they merely sophisticated technical artifacts? This guide objectively compares the zero-shot performance of leading scFMs against traditional methods and simpler baselines, synthesizing evidence from recent comprehensive benchmarks to address this pivotal question.

Quantitative Benchmarking of Zero-Shot Performance

Independent benchmarking studies have systematically evaluated the zero-shot embeddings of scFMs across a range of biologically relevant tasks. The table below summarizes the performance of several prominent models against established baseline methods.

Table 1: Performance of scFMs and Baselines on Cell-Level Tasks (Summarized from [3])

Model / Baseline Batch Integration Cell Type Annotation Cancer Cell Identification Drug Sensitivity Prediction
scGPT Robust High Accuracy Strong Strong
Geneformer Moderate Moderate Variable Moderate
scFoundation Moderate Moderate Strong Strong (Gene-level)
UCE Variable Variable Not Top Performer Not Top Performer
LangCell Variable Variable Not Top Performer Not Top Performer
scCello Variable Variable Not Top Performer Not Top Performer
Seurat (Baseline) Effective N/A N/A N/A
Harmony (Baseline) Effective N/A N/A N/A
scVI (Baseline) Effective N/A N/A N/A

A key finding across benchmarks is that no single scFM consistently outperforms all others across every task [3]. Model performance is highly dependent on the specific application, dataset size, and biological context. While top-performing scFMs like scGPT demonstrate robust and versatile capabilities, simpler machine learning models can be more efficient and equally effective for specific datasets, particularly under computational constraints [3].

Performance on Gene-Level and Perturbation Tasks

Benchmarks evaluating the prediction of gene-gene relationships and perturbation responses reveal critical limitations of current scFMs.

Table 2: Performance on Gene-Level and In Silico Perturbation Tasks

Task Top Performing Models Key Finding Study
Double Perturbation Effect Prediction Simple 'Additive' Baseline (Sum of LFCs) No deep learning model outperformed the simple additive baseline. [5]
Genetic Interaction Prediction Simple 'No Change' Baseline No model improved upon the 'no change' baseline for predicting genetic interactions. [5]
Unseen Single Perturbation Prediction Linear Model with Pretrained Embeddings A simple linear model using embeddings from perturbation data outperformed foundation models. [5]
Gene Function & Relationship Analysis Geneformer, scFoundation Effective for gene-level tasks, benefiting from pretraining. [3] [6]

Notably, a landmark study in Nature Methods concluded that for predicting transcriptome changes after genetic perturbations, "current foundation models did not perform better than deliberately simplistic linear prediction models," despite requiring significantly more computational resources for fine-tuning [5].
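
The additive baseline referenced above is deliberately simple; the sketch below illustrates it with synthetic log fold changes (LFCs). The variable names and toy data are placeholders, not values from the cited study.

```python
# Minimal sketch of the 'additive' baseline: predict the transcriptome shift of a
# double perturbation as the sum of the two single-perturbation log fold changes.
import numpy as np

rng = np.random.default_rng(0)
n_genes = 2000
lfc_gene_a = rng.normal(0, 0.5, n_genes)   # LFC of perturbing gene A alone
lfc_gene_b = rng.normal(0, 0.5, n_genes)   # LFC of perturbing gene B alone

additive_prediction = lfc_gene_a + lfc_gene_b   # predicted double-perturbation LFC
no_change_prediction = np.zeros(n_genes)        # 'no change' baseline for genetic interactions

# In a benchmark, both vectors would be compared against the measured double-perturbation
# LFC, e.g. with Pearson correlation or mean squared error.
```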

Evaluating Biological Relevance: Methodologies and Metrics

To determine if scFMs capture meaningful biology, researchers have moved beyond simple accuracy metrics to develop novel evaluation protocols that directly probe the biological coherence of model embeddings.

Core Experimental Protocols for Zero-Shot Evaluation

The following workflow outlines a standardized protocol for evaluating the biological relevance of zero-shot scFM embeddings, as employed in recent benchmarks [3].

[Protocol schematic: Input raw scRNA-seq dataset → Step 1: Feature Extraction (zero-shot cell/gene embeddings) → Step 2: Apply to Downstream Tasks (cell type annotation, batch integration, cancer cell identification, perturbation response) → Step 3: Quantitative Evaluation with standard metrics → Step 4: Biological Evaluation with ontology-based metrics (scGraph-OntoRWR, Lowest Common Ancestor Distance, Roughness Index) → Output: biological relevance score.]

Workflow for Zero-Shot Biological Evaluation

  • Feature Extraction: The scFM processes a held-out single-cell dataset to generate latent embeddings for each cell and/or gene without any fine-tuning. The model's internal knowledge is thus probed in a zero-shot manner [3].
  • Application to Downstream Tasks: These frozen embeddings are directly used for various cell-level and gene-level tasks, such as:
    • Cell Type Annotation: Clustering the embeddings and assessing if known cell types group together.
    • Batch Integration: Evaluating how well the embeddings mix cells from different technical batches while preserving biological separation.
    • Clinically Relevant Predictions: Testing performance on tasks like cancer cell identification or drug sensitivity prediction [3].
  • Quantitative Evaluation with Novel Metrics:
    • scGraph-OntoRWR: This metric measures the consistency between the relational structure of cell types captured by the scFM embeddings and the known relationships in established biological ontologies (like the Cell Ontology). A high score indicates the model has learned biologically meaningful relationships between cell types [3].
    • Lowest Common Ancestor Distance (LCAD): For cell type annotation errors, this metric assesses the severity of the error by measuring the ontological proximity between the misclassified cell type and the correct one. A smaller LCAD indicates a less severe, more biologically plausible error [3].
    • Roughness Index (ROGI): This index quantifies the smoothness of the cell-property landscape in the latent space. A smoother landscape (lower roughness) suggests that the embeddings capture continuous biological transitions, making it easier for downstream models to learn and generalize [3].

Key Findings from Biological Evaluation

The application of these sophisticated metrics has yielded critical insights:

  • Captured Biological Relationships: Zero-shot scFM embeddings demonstrably capture biologically meaningful insights into the relational structure of genes and cells. The scGraph-OntoRWR metric confirms that the relationships between cell types in the embedding space align with prior biological knowledge [3].
  • Smoother Latent Landscapes: The performance improvement of scFMs in downstream tasks is linked to a smoother cell-property landscape in the latent space. The ROGI metric indicates that this smoothness reduces the difficulty of training effective task-specific models [3].
  • Limitations in Complex Prediction: Despite capturing biological relationships, scFMs struggle with the complex, nonlinear task of predicting genetic perturbation effects, often being outperformed by simple additive models of logarithmic fold changes [5].

The Scientist's Toolkit: Essential Research Reagents

The following table details key resources and computational tools essential for conducting rigorous evaluations of scFM biological relevance.

Table 3: Essential Resources for scFM Evaluation

Resource / Reagent Function in Evaluation Specific Examples / Notes
Annotated Single-Cell Atlases Provide high-quality, biologically diverse benchmark datasets with ground-truth labels. CZ CELLxGENE Census [9] [7], Human Cell Atlas [9], Asian Immune Diversity Atlas (AIDA) v2 [3]
Perturbation Datasets Enable benchmarking of in silico perturbation predictions against experimental data. CRISPRa/i screens (e.g., from Replogle et al.) [5], Perturb-seq data [17]
Standardized Benchmarking Frameworks Offer unified APIs and protocols for fair model comparison, mitigating effects of coding heterogeneity. BioLLM framework [6]
Biological Ontologies Provide the formal, hierarchical knowledge of gene/cell relationships required for novel metrics. Cell Ontology, Gene Ontology [3] [5]
Visualization & Analysis Suites Allow researchers to interactively explore embeddings and model predictions. CELLxGENE Explorer [7], Integrated chat-based tools like CellWhisperer [7]

The evidence from comprehensive benchmarks indicates that the answer to the critical question is nuanced. Yes, zero-shot scFMs do capture meaningful biology, as evidenced by their ability to encode biologically consistent cell-type relationships and create latent spaces that facilitate various downstream tasks [3]. However, this capability is neither universal nor superior in all contexts. Specifically, their performance in predicting genetic perturbation effects currently lags and can be matched or exceeded by simpler, non-foundation-model approaches [5].

The field is moving toward more biologically grounded evaluation. The introduction of ontology-informed metrics like scGraph-OntoRWR and LCAD represents a significant advance beyond purely technical benchmarks. For researchers and drug development professionals, selecting an scFM requires careful consideration of the specific biological question, dataset size, and available resources. Tools like the BioLLM framework and the ROGI index can provide practical guidance for this model selection process [3] [6]. Future progress will likely depend on richer pretraining data that incorporates multimodal information and the development of more sophisticated model architectures designed to better capture causal biological mechanisms.

A Practical Framework for Applying Zero-Shot scFM Embeddings

In single-cell genomics, foundation models (scFMs) are trained on millions of cells to learn universal biological principles, treating individual cells as sentences and genes or genomic features as words or tokens [9] [8]. The latent representations, or embeddings, generated by these models serve as the foundational layer for a wide range of downstream analytical tasks. The evaluation of these zero-shot embeddings—those derived directly from pretrained models without task-specific fine-tuning—is crucial for assessing the intrinsic biological knowledge a model has captured. This guide provides a systematic framework for extracting and benchmarking embeddings from prominent single-cell foundation models, enabling researchers to quantitatively evaluate embedding quality within the broader context of zero-shot scFM research.

The field of single-cell foundation models has seen the rapid development of several architecturally distinct models. The table below summarizes the key characteristics of six prominent scFMs, which represent the current state-of-the-art and are the primary subjects of this embedding extraction guide.

Table 1: Key Single-Cell Foundation Models for Embedding Extraction

Model Name Omics Modalities Model Parameters Pretraining Dataset Scale # Input Genes # Output Dimension Primary Architecture
Geneformer [18] [3] scRNA-seq 40 M 30 M cells 2048 (ranked) 256 / 512 Transformer Encoder
scGPT [18] [3] scRNA-seq, scATAC-seq, CITE-seq, Spatial 50 M 33 M cells 1200 HVGs 512 Transformer Encoder with attention mask
UCE [18] [3] scRNA-seq 650 M 36 M cells 1024 (sampled & ordered) 1280 Transformer Encoder
scFoundation [18] [3] scRNA-seq 100 M 50 M cells ~19,000 3072 Asymmetric Encoder-Decoder
LangCell [3] scRNA-seq 40 M 27.5 M scRNA-text pairs 2048 (ranked) 256 Transformer Encoder
scCello [18] scRNA-seq Information Missing Information Missing Information Missing Information Missing Information Missing

Core Architectural and Tokenization Concepts

A fundamental challenge in applying transformer architectures to single-cell data is the non-sequential nature of gene expression data. Unlike words in a sentence, genes in a cell have no inherent ordering [9] [8]. To overcome this, scFMs employ various tokenization strategies that define how a cell's genetic information is converted into a sequence of model inputs:

  • Ranking by Expression: A common strategy is to rank genes within each cell by their expression levels and feed the ordered list of top genes as the 'sentence' (e.g., Geneformer, LangCell) [9] [8].
  • Value Binning/Projection: Other models, like scGPT, partition gene expression values into bins or use projections to create the input tokens [3].
  • Genomic Position: The UCE model uniquely orders sampled genes by their genomic positions [3].

The input layers of these scFMs typically combine gene embeddings (analogous to word embeddings), value embeddings (for expression levels), and sometimes positional embeddings to represent the imposed gene order [18] [3].
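
The sketch below shows, in PyTorch, how such an input layer might sum gene-identity, value, and positional embeddings into a single token representation. Dimensions, vocabulary sizes, and the class name are illustrative assumptions rather than any specific model's implementation.

```python
# Hedged sketch of an scFM-style input layer combining gene, value, and positional
# embeddings. All sizes are toy values; no real model's code is reproduced here.
import torch
import torch.nn as nn

class CellInputEmbedding(nn.Module):
    def __init__(self, n_genes=20000, n_value_bins=51, max_len=2048, d_model=256):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes, d_model)        # analogous to word embeddings
        self.value_emb = nn.Embedding(n_value_bins, d_model)  # binned expression levels
        self.pos_emb = nn.Embedding(max_len, d_model)         # imposed gene order (if used)

    def forward(self, gene_tokens, value_tokens):
        positions = torch.arange(gene_tokens.size(1), device=gene_tokens.device)
        return (self.gene_emb(gene_tokens)
                + self.value_emb(value_tokens)
                + self.pos_emb(positions))                    # (batch, seq_len, d_model)

emb = CellInputEmbedding()
genes = torch.randint(0, 20000, (4, 2048))    # toy batch of ranked gene tokens
values = torch.randint(0, 51, (4, 2048))      # toy batch of binned expression tokens
print(emb(genes, values).shape)               # torch.Size([4, 2048, 256])
```

Summing the three embedding types keeps the sequence length unchanged while letting the transformer attend jointly to gene identity and expression level.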

Quantitative Benchmarking of Zero-Shot Embedding Performance

A comprehensive benchmark evaluating the zero-shot embedding performance of the six scFMs against established baseline methods provides critical quantitative data for model selection [18] [3]. The evaluation encompassed two gene-level and four cell-level tasks, assessed using 12 metrics.

Performance on Gene-Level and Cell-Level Tasks

Table 2: Benchmarking Results of scFM Zero-Shot Embeddings Across Key Tasks

Model Gene Function Prediction (GO Terms) Tissue Specificity Prediction Batch Integration (Pre-clinical) Cell Type Annotation Cancer Cell Identification Drug Sensitivity Prediction
Geneformer Moderate Moderate Good Good Information Missing Information Missing
scGPT Good Good Good Good Information Missing Information Missing
UCE Information Missing Information Missing Information Missing Information Missing Information Missing Information Missing
scFoundation Information Missing Information Missing Information Missing Information Missing Information Missing Information Missing
LangCell Information Missing Information Missing Information Missing Information Missing Information Missing Information Missing
scCello Information Missing Information Missing Information Missing Information Missing Information Missing Information Missing
Traditional Baseline (e.g., scVI, Seurat) Simpler models can be more efficient and adaptive for specific, resource-constrained tasks.

Key Benchmarking Findings: The study revealed that no single scFM consistently outperformed all others across every task, emphasizing that model selection must be tailored to the specific application [18] [3]. Pretrained zero-shot scFM embeddings were confirmed to capture meaningful biological insights into the relational structure of genes and cells, which benefits downstream tasks. The performance improvement was linked to a smoother cell-property landscape in the pretrained latent space, which reduces the difficulty of training task-specific models [18] [3].

Novel Biological Evaluation Metrics

To move beyond technical metrics, the benchmark introduced novel cell ontology-informed metrics to provide a biologically grounded perspective on embedding quality [18] [3]:

  • scGraph-OntoRWR: Measures the consistency of cell type relationships captured by scFM embeddings with prior biological knowledge encoded in cell ontologies.
  • Lowest Common Ancestor Distance (LCAD): Assesses the severity of errors in cell type annotation by measuring the ontological proximity between misclassified cell types and their true labels. A smaller LCAD indicates a less severe error (e.g., confusing two types of T cells vs. confusing a T cell with a neuron).

Step-by-Step Embedding Extraction Protocols

This section details the practical methodology for extracting zero-shot cell and gene embeddings from the featured scFMs, following a standardized workflow.

[Workflow schematic: Input raw scRNA-seq count matrix → Data Preprocessing → Tokenization & Input Sequencing (Geneformer & LangCell: rank top 2048 genes by expression; scGPT: 1200 highly variable genes with value binning; UCE: sample 1024 genes by expression and order by genomic position; scFoundation: all ~19,000 protein-encoding genes) → Model Inference (zero-shot forward pass) → Output: cell and gene embedding vectors.]

Diagram 1: Workflow for Extracting Embeddings from scFMs

Data Preprocessing and Tokenization

The first step involves preparing your raw single-cell RNA sequencing count matrix for model input. This typically includes normalization and log-transformation to make the data suitable for the model. Following preprocessing, the critical step of tokenization begins, where models differ significantly in their approach [18] [3]:

  • For Geneformer/LangCell: For each cell, genes are ranked by their expression value. The top 2048 genes (or another specified number) are selected and input to the model in this rank order. The expression value is often incorporated via an ordering mechanism or a separate value embedding [18] [3] [9].
  • For scGPT: The 1200 most highly variable genes (HVGs) across the dataset are selected. The expression values for these genes are then often transformed using a binning strategy before being fed into the model [3].
  • For UCE: A unique two-step process is used: 1) 1024 non-unique genes are sampled with probability proportional to their expression, and 2) these genes are then ordered by their genomic positions to create the input sequence [3].
  • For scFoundation: The model is designed to accept a comprehensive input of all ~19,000 human protein-encoding genes, using a value projection to handle the expression levels [3].

Model Inference and Embedding Extraction

After tokenization, a zero-shot forward pass is performed through the pretrained model. The extraction point for the embeddings varies by model architecture and intended use (a generic extraction sketch follows this list):

  • Cell Embedding Extraction: Most models produce a dedicated embedding vector for the entire cell. This is often located at a special [CLS]-type token prepended to the gene sequence, or is derived from the pooled (e.g., mean) output of all gene token embeddings for that cell [18] [9] [8].
  • Gene Embedding Extraction: Gene embeddings can typically be extracted from the input embedding layer of the model or from the output of the first transformer layer. These embeddings represent the model's learned representation of individual genes, contextualized by the massive and diverse pretraining data [18].
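
A generic extraction routine under these conventions might look like the sketch below. The `model(gene_tokens, value_tokens)` call and the assumption that a [CLS]-style token sits at position 0 are hypothetical; consult each model's own API for the actual extraction points.

```python
# Generic, hedged sketch of zero-shot embedding extraction from a frozen pretrained
# transformer that returns per-token hidden states. The model API is assumed.
import torch

@torch.no_grad()
def extract_embeddings(model, gene_tokens, value_tokens):
    """Return cell- and gene-level embeddings without any fine-tuning."""
    hidden = model(gene_tokens, value_tokens)     # (batch, seq_len, d_model); assumed API
    return {
        "cell_cls": hidden[:, 0, :],              # option 1: prepended [CLS]-style token
        "cell_mean": hidden.mean(dim=1),          # option 2: mean-pool all token outputs
        "genes": hidden[:, 1:, :],                # contextual per-gene embeddings
    }
```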

The Researcher's Toolkit for scFM Embedding Analysis

Table 3: Essential Research Reagents and Computational Tools for scFM Embedding Evaluation

Tool / Resource Type Primary Function in Evaluation Key Consideration
Pretrained scFM Weights Software Provides the core model for generating zero-shot embeddings. Ensure model compatibility with your organism (e.g., human/mouse) and omics type.
CZ CELLxGENE / Cell Atlas [9] [8] Data Resource Provides standardized, high-quality datasets for benchmarking and external validation. Crucial for mitigating data leakage risks and testing generalizability.
scGraph-OntoRWR Metric [18] [3] Evaluation Metric Quantifies biological consistency of embeddings using cell ontology. Requires a well-structured reference ontology (e.g., Cell Ontology).
Lowest Common Ancestor Distance (LCAD) [18] [3] Evaluation Metric Measures semantic severity of cell type misannotation errors. Provides a more biologically meaningful error analysis than simple accuracy.
Roughness Index (ROGI) [18] Evaluation Metric Acts as a proxy for downstream task performance by measuring landscape smoothness. Can help with dataset-specific model selection without running full benchmarks.

The extraction and evaluation of embeddings are fundamental to understanding and leveraging the power of single-cell foundation models. This guide has outlined the methodologies for obtaining these embeddings from key scFMs and has presented a framework for their rigorous, biologically grounded assessment. The benchmark data confirms that while scFMs are robust and versatile tools that capture profound biological insights, the choice of model is context-dependent. Researchers are encouraged to use the provided protocols and metrics—particularly the novel ontology-based measures—to guide their model selection based on specific task requirements, dataset size, and the critical need for biological interpretability. As the field progresses, future developments will likely focus on standardizing these evaluation practices and improving the interpretability of the latent spaces these powerful models create.

Single-cell foundation models (scFMs) represent a paradigm shift in the analysis of single-cell RNA sequencing (scRNA-seq) data. Trained on millions of single-cell transcriptomes, these models promise to learn universal biological principles that can be applied to diverse downstream tasks without task-specific training, a capability known as zero-shot learning [8]. For researchers and drug development professionals, the potential to accurately identify cell types and states in new datasets without manual annotation or fine-tuning could dramatically accelerate discoveries in cellular biology and therapeutic development. This guide provides an objective comparison of current scFMs, focusing specifically on their zero-shot performance for the core tasks of cell type clustering and annotation, synthesizing evidence from recent rigorous benchmarking studies to inform model selection and application.

Performance Comparison of Single-Cell Foundation Models

Recent comprehensive benchmarks reveal a nuanced landscape where no single scFM consistently outperforms all others across every task and dataset. Performance is highly dependent on factors such as dataset size, biological complexity, and the specific evaluation metric employed [3].

Table 1: Zero-Shot Performance Comparison of scFMs and Baselines on Core Tasks

Model / Baseline Cell Type Clustering (AvgBIO Score) Batch Integration (iLISI Score) Computational Intensity Key Strengths
scGPT Variable (0.4-0.7) Moderate High Handles technical & biological batch effects [1]
Geneformer Below baseline (0.3-0.5) Poor Medium Gene network analysis [3] [1]
scFoundation Not consistently superior Not consistently superior Very High Large model capacity [3]
LangCell Moderate with refinement Information Limited Medium Zero-shot annotation via text alignment [19]
HVG (Baseline) Good (0.5-0.8) Good Low Simple, efficient, strong performance [1]
scVI (Baseline) Good (0.5-0.8) Good Medium Reliable batch correction [1]
Harmony (Baseline) Good (0.5-0.8) Moderate Low Fast integration [1]

A holistic ranking from a major 2025 benchmark study that evaluated six scFMs against established baselines using 12 different metrics concluded that while scFMs are robust and versatile tools, simpler machine learning models often adapt more efficiently to specific datasets, particularly under resource constraints [3]. Notably, the study found that no single scFM consistently outperformed others across all tasks, emphasizing the need for tailored model selection based on dataset size, task complexity, and available computational resources [3].

Zero-Shot Limitations in Real-World Applications

Despite their theoretical promise, scFMs face significant challenges in zero-shot settings. A 2025 evaluation of scGPT and Geneformer revealed that both models "underperform simpler methods" in zero-shot cell type clustering and batch integration, with Geneformer particularly struggling to retain cell type information while correcting for batch effects [1].

In perturbation prediction, a critical task for drug discovery, the performance gap is even more pronounced. A recent benchmark (PertEval-scFM) found that "scFM embeddings offer limited improvement over simple baseline models in the zero-shot setting, particularly under distribution shift" [4]. Similarly, a Nature Methods study concluded that for predicting genetic perturbation effects, "none [of the foundation models] outperformed the baselines," which included deliberately simple approaches like an additive model of single perturbation effects [5].

Experimental Protocols for Benchmarking

Standardized Evaluation Frameworks

Rigorous benchmarking of scFMs requires standardized protocols that reflect biologically meaningful scenarios. The most comprehensive benchmarks evaluate models across multiple task categories using high-quality datasets with minimal data leakage [3] [1].

Table 2: Key Experimental Protocols in scFM Benchmarking

Protocol Component Description Purpose
Dataset Curation Use of diverse, high-quality datasets (e.g., from CELLxGENE) with careful handling of duplicate cells to prevent data leakage [20]. Ensures fair evaluation and generalizable results.
Task Selection Evaluation across gene-level and cell-level tasks, including clustering, annotation, and batch integration [3]. Tests model versatility and biological relevance.
Metric Selection Combination of unsupervised (e.g., ASW), supervised (e.g., accuracy), and knowledge-based metrics (e.g., scGraph-OntoRWR) [3]. Provides comprehensive performance assessment.
Baseline Comparison Inclusion of established methods like HVG selection, Harmony, and scVI [1]. Contextualizes scFM performance against standard approaches.
Zero-Shot Protocol Direct use of pretrained model embeddings without task-specific fine-tuning [1]. Tests true generalization capability.

The benchmarking workflow typically begins with dataset preparation, followed by feature extraction using zero-shot scFM embeddings, and evaluation across multiple downstream tasks using biologically relevant metrics [3]. This pipeline ensures that models are tested under realistic conditions that mirror actual research applications.
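
A minimal version of this pipeline's clustering evaluation step is sketched below using Scanpy and scikit-learn, assuming the frozen scFM embeddings have already been stored in adata.obsm["X_scfm"] and ground-truth labels in adata.obs["cell_type"]; both keys are illustrative placeholders.

```python
# Hedged sketch of the zero-shot clustering evaluation step: build a k-NN graph on the
# frozen embeddings, cluster, and score agreement with ground-truth labels.
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_embedding(adata, emb_key="X_scfm", label_key="cell_type"):
    sc.pp.neighbors(adata, use_rep=emb_key)            # neighborhood graph on the embedding
    sc.tl.leiden(adata, key_added="pred_cluster")      # unsupervised clustering
    ari = adjusted_rand_score(adata.obs[label_key], adata.obs["pred_cluster"])
    nmi = normalized_mutual_info_score(adata.obs[label_key], adata.obs["pred_cluster"])
    return {"ARI": ari, "NMI": nmi}
```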

Novel Evaluation Metrics

Beyond traditional metrics, researchers have developed novel evaluation approaches that better capture biological meaningfulness. The scGraph-OntoRWR metric measures the consistency of cell type relationships captured by scFMs with prior biological knowledge from cell ontologies [3]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric assesses the ontological proximity between misclassified cell types, providing a more nuanced view of annotation errors than simple accuracy [3].

The Roughness Index (ROGI) serves as a proxy for model selection by quantitatively estimating how model performance correlates with cell-property landscape roughness in the pretrained latent space, verifying that performance improvement arises from a smoother landscape that reduces the difficulty of training task-specific models [3].
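
The sketch below is not the published ROGI formula, which involves a specific coarse-graining procedure; it is only a crude neighbor-disagreement proxy that conveys the same intuition: the smoother the cell-property landscape around each embedding, the lower the score.

```python
# Crude, hedged proxy for landscape roughness: how often a cell's nearest neighbors in
# the embedding carry a different label. NOT the published ROGI formula.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def neighbor_disagreement(embeddings: np.ndarray, labels: np.ndarray, k: int = 15) -> float:
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)                    # idx[:, 0] is the cell itself
    neighbor_labels = labels[idx[:, 1:]]
    return float(np.mean(neighbor_labels != labels[:, None]))   # lower = smoother landscape
```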

Benchmarking Workflow Visualization

The following diagram illustrates the standardized workflow for benchmarking single-cell foundation models in zero-shot mode, as implemented in recent comprehensive studies:

[Workflow schematic: scRNA-seq datasets → data preprocessing → foundation model embedding extraction → evaluations of cell type clustering, batch integration, cell type annotation, and perturbation effect prediction → performance metrics calculation → model ranking and selection.]

Zero-shot scFM Benchmarking Workflow

This workflow illustrates the comprehensive evaluation pipeline used in recent benchmarks, beginning with raw data processing and culminating in model ranking based on multiple performance metrics [3] [1].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Essential Research Reagents for scFM Benchmarking

Resource Type Function Access
CELLxGENE Census Data Repository Provides standardized, curated single-cell datasets for training and evaluation [20]. Public
scGPT Foundation Model Transformer-based model for single-cell analysis; can be used for clustering and perturbation prediction [1] [8]. Open-source
Geneformer Foundation Model Transformer model trained on gene ranks for network analysis and cell state characterization [1] [8]. Open-source
LangCell Foundation Model CLIP-style model aligning scRNA-seq profiles with text for zero-shot annotation [19]. Open-source
Harmony Integration Algorithm Fast integration method for batch correction; commonly used as baseline [1]. Open-source
scVI Probabilistic Model Deep generative model for scRNA-seq data; standard baseline for integration and imputation [1]. Open-source
PertEval-scFM Benchmark Framework Standardized framework for evaluating perturbation effect prediction [4]. Open-source

Implementation Considerations

When implementing these tools, researchers should consider that even the most advanced scFMs may not outperform simpler baselines for specific tasks. For cell type clustering, starting with HVG selection followed by standard integration methods like Harmony or scVI often provides strong performance with lower computational cost [1]. For annotation tasks, LangCell's zero-shot capabilities can be enhanced with graph-based refinement methods like GRIT, which improves accuracy by enforcing local consistency over PCA-based k-NN graphs [19].
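
A minimal version of that baseline pipeline (HVG selection, PCA, Harmony) is sketched below using Scanpy's external Harmony wrapper, which requires the harmonypy package. Parameter values such as n_hvg and the number of principal components are common defaults, not prescriptions from the cited benchmarks.

```python
# Hedged sketch of the simple HVG + Harmony baseline pipeline discussed above.
import scanpy as sc

def baseline_integration(adata, batch_key="batch", n_hvg=2000):
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=n_hvg, subset=True)
    sc.pp.scale(adata, max_value=10)
    sc.tl.pca(adata, n_comps=50)
    sc.external.pp.harmony_integrate(adata, key=batch_key)  # writes adata.obsm["X_pca_harmony"]
    return adata
```

The Harmony-corrected PCA coordinates can then be fed into the same clustering and scoring routines used for scFM embeddings, keeping the comparison fair.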

For perturbation prediction, where current scFMs show significant limitations, simple linear models or even the "no change" baseline may outperform complex foundation models, highlighting the importance of always including appropriate baselines in evaluations [5].

Current single-cell foundation models represent promising but imperfect tools for zero-shot cell type clustering and annotation. While they offer the potential for generalizable biological insights and reduced reliance on task-specific training, their performance does not consistently surpass simpler, established methods across critical benchmarking tasks. Researchers should carefully consider their specific analytical needs, dataset characteristics, and computational resources when selecting approaches for single-cell analysis. The rapid evolution of this field suggests that future model iterations may overcome current limitations, but rigorous benchmarking remains essential for tracking progress and guiding methodological development.

Batch effects represent one of the most significant technical challenges in single-cell RNA sequencing (scRNA-seq) analysis, introducing systematic variations that can obscure biological signals and lead to misleading scientific conclusions [21]. These technical variations arise from multiple sources including differences in experimental conditions, sequencing protocols, reagent batches, and laboratory personnel [22] [21]. In large-scale omics studies, batch effects can profoundly impact data quality, potentially leading to irreproducible findings and even incorrect clinical interpretations [21]. The pressing need to address these challenges has catalyzed the development of numerous computational integration methods, each employing distinct strategies to remove technical artifacts while preserving biological variability.

Single-cell foundation models (scFMs) have emerged as powerful tools leveraging massive datasets and self-supervised learning to capture universal biological knowledge [3]. These models, including Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello, utilize transformer architectures adapted to scRNA-seq data characteristics [3]. The "pre-train then fine-tune" paradigm promises efficient adaptation to various downstream tasks, including batch integration [3]. However, recent benchmarking studies reveal that scFMs do not consistently outperform traditional methods across all scenarios, raising questions about their practical utility for batch integration tasks [3] [16] [4].

This assessment provides a comprehensive evaluation of batch integration capabilities across method classes, focusing on performance metrics, methodological approaches, and practical considerations for researchers. By synthesizing evidence from recent benchmarks and methodological advances, we aim to guide scientists in selecting appropriate integration strategies for their specific research contexts, particularly within drug development and clinical applications where batch effects can significantly impact conclusions.

Methodological Landscape of Batch Integration

Classification of Integration Approaches

Batch integration methods for scRNA-seq data can be categorized into four distinct classes based on their underlying algorithms and correction strategies [23]. Global models, originating from bulk transcriptomics, treat batch effects as consistent additive and/or multiplicative artifacts across all cells. Methods like ComBat implement this approach, applying uniform corrections across entire datasets [23]. Linear embedding models, including pioneering methods like Mutual Nearest Neighbors (MNN), Seurat integration, Scanorama, FastMNN, and Harmony, utilize dimensionality reduction followed by local alignment of similar cells across batches [23]. These methods typically employ singular value decomposition variants and represent the most common integration approach.

Graph-based methods such as BBKNN (Batch-Balanced k-Nearest Neighbor) construct nearest-neighbor graphs within batches then enforce connections between batches while allowing compositional differences through edge pruning [23]. These approaches prioritize computational efficiency. Deep learning approaches, the most recent category, utilize autoencoder architectures including conditional variational autoencoders (CVAE) to learn batch-invariant representations. Prominent examples include scVI, scANVI, scGen, and DeepBID, which often require substantial data but demonstrate strong performance on complex integration tasks [24] [23].

Evolution of Single-Cell Foundation Models

scFMs represent a paradigm shift in single-cell analysis, adapting transformer architectures originally developed for natural language processing to scRNA-seq data [3]. These models employ different strategies for handling gene tokens, expression values, and positional information. For instance, Geneformer uses a ranked-gene approach with positional embeddings, while scGPT employs value binning without positional information [3]. UCE uniquely incorporates protein embeddings from ESM-2, connecting transcriptomic and proteomic information [3].

The fundamental innovation of scFMs lies in their pretraining on massive datasets (millions of cells) using self-supervised objectives like masked gene modeling, enabling them to learn universal biological principles [3]. This pretraining theoretically equips them with emergent capabilities for zero-shot learning and efficient adaptation to downstream tasks, including batch integration. However, recent benchmarks indicate that their practical performance for batch integration does not consistently exceed traditional methods, particularly in zero-shot settings [3] [16].

[Schematic: batch effects arise from study design, sample preparation, sequencing, and data processing, and are addressed by four method classes (global models, linear embedding, graph methods, deep learning) to yield integrated data.]

Figure 1: Batch effects originate from multiple experimental stages and are addressed by different computational approaches.

Experimental Frameworks for Benchmarking

Evaluation Metrics and Protocols

Rigorous benchmarking of batch integration methods requires comprehensive evaluation protocols and specialized metrics. Recent benchmarks employ 12+ metrics spanning unsupervised, supervised, and knowledge-based approaches to assess different aspects of integration quality [3]. The average silhouette width (ASW) quantifies separation quality based on both biological labels and batch origins, with values closer to 1 indicating better separation of biological groups and mixing of batches [25]. The k-nearest-neighbor Batch-Effect Test (kBET) evaluates batch mixing by testing whether local neighborhoods contain cells from all batches in expected proportions [23].
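
The sketch below computes silhouette-based biological and batch ASW scores with scikit-learn. Rescaling conventions vary between benchmarking suites (scIB, for example, computes batch ASW per cell-type label), so treat this as one simplified variant.

```python
# Hedged sketch of silhouette-based integration metrics: biological ASW on cell-type
# labels (higher is better) and a batch ASW where low separation by batch is desirable.
import numpy as np
from sklearn.metrics import silhouette_score

def asw_metrics(embeddings: np.ndarray, cell_types: np.ndarray, batches: np.ndarray):
    asw_bio = (silhouette_score(embeddings, cell_types) + 1) / 2   # rescaled to [0, 1]
    asw_batch = 1 - abs(silhouette_score(embeddings, batches))     # 1 = well-mixed batches
    return {"ASW_bio": asw_bio, "ASW_batch": asw_batch}
```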

Novel ontology-informed metrics provide biological relevance assessments. The scGraph-OntoRWR measures consistency between cell type relationships captured by scFMs and established biological knowledge in cell ontologies [3]. The Lowest Common Ancestor Distance (LCAD) evaluates the severity of cell type misclassification by measuring ontological proximity between predicted and actual cell types [3]. The roughness index (ROGI) serves as a proxy for dataset-specific model performance, quantifying the smoothness of cell-property landscapes in latent spaces [3].

Benchmarking studies typically employ two integration scenarios with varying complexity [23]. Batch correction addresses simpler cases where batches share similar cell type compositions and effects are quasi-linear. Data integration tackles more challenging scenarios with nested batch effects, different protocols, and potentially non-overlapping cell identities across datasets [23].

Benchmarking Datasets and Experimental Design

Comprehensive benchmarks draw on datasets spanning varied biological contexts and technical challenges. One benchmarking study [3] employs five datasets covering diverse biological conditions for preclinical batch integration and cell type annotation, plus seven cancer types and four drugs for clinically relevant tasks such as cancer cell identification and drug sensitivity prediction. To mitigate data leakage concerns, independent datasets like the Asian Immune Diversity Atlas (AIDA) v2 from CellxGene are incorporated [3] [7].

Critical benchmarking tasks include cross-tissue integration assessing models' ability to distinguish biologically distinct cell types while removing technical variations, and intra-tumor heterogeneity resolution evaluating preservation of subtle biological variations within complex populations [3]. The PertEval-scFM framework specifically tests perturbation effect prediction, measuring models' ability to predict transcriptional responses to genetic perturbations in zero-shot settings [16] [4].

[Schematic: input data → method application (batch correction, data integration, perturbation prediction) → metric calculation (biological ASW, batch ASW, kBET, scGraph-OntoRWR, LCAD) → performance ranking.]

Figure 2: Comprehensive benchmarking evaluates methods across multiple tasks using specialized metrics.

Performance Comparison Across Method Classes

Quantitative Benchmark Results

Table 1: Performance comparison of batch integration methods across key metrics

Method Class Representative Methods ASW Biological (↑) ASW Batch (↓) Runtime Efficiency Data Retention Complex Task Performance
Global Models ComBat, limma 0.45-0.65 0.15-0.35 High Moderate Poor
Linear Embedding Harmony, Seurat, Scanorama 0.55-0.75 0.08-0.25 Moderate High Moderate
Graph-Based BBKNN 0.50-0.70 0.10-0.20 Very High High Moderate
Deep Learning scVI, scANVI, DeepBID 0.65-0.85 0.05-0.15 Low High Excellent
Foundation Models scGPT, Geneformer 0.40-0.70 0.10-0.30 Very Low High Variable

Recent large-scale benchmarking reveals distinct performance patterns across method classes. Deep learning approaches, particularly scVI and scANVI, demonstrate superior performance on complex integration tasks, achieving the highest biological preservation (ASW biological: 0.65-0.85) and effective batch mixing (ASW batch: 0.05-0.15) [23]. Linear embedding methods like Harmony and Seurat show strong performance on less complex tasks with moderate computational requirements [23]. The recently introduced DeepBID method demonstrates exceptional capabilities by simultaneously performing batch correction, dimensionality reduction, and cell clustering through a negative binomial-based autoencoder with dual Kullback-Leibler divergence losses [24].

Batch-Effect Reduction Trees (BERT), a novel high-performance method for integrating incomplete omic profiles, demonstrates significant advantages in data retention and computational efficiency compared to existing approaches [25]. BERT retains up to five orders of magnitude more numeric values than HarmonizR (the only previous method handling arbitrarily incomplete omic data) and achieves 11× runtime improvement by leveraging multi-core and distributed-memory systems [25].

Single-Cell Foundation Models Performance

scFMs show variable performance in batch integration tasks. In comprehensive benchmarks evaluating six scFMs against established baselines, no single foundation model consistently outperformed others across all tasks [3]. While scFMs demonstrate robustness and versatility, simpler machine learning models often adapt more efficiently to specific datasets, particularly under resource constraints [3].

The BioLLM framework provides standardized evaluation of scFMs, revealing distinct architectural strengths [6]. scGPT demonstrates robust performance across diverse tasks including zero-shot and fine-tuning scenarios, while Geneformer and scFoundation excel in gene-level tasks due to effective pretraining strategies [6]. However, in perturbation prediction tasks assessed by PertEval-scFM, zero-shot scFM embeddings show limited improvement over simple baseline models, particularly under distribution shift [16] [4].

Table 2: Single-cell foundation model capabilities and limitations

Model Pretraining Data Architecture Strength Areas Batch Integration Performance
Geneformer 30M cells Transformer with ranked genes Gene-level tasks, network inference Moderate, requires fine-tuning
scGPT 33M cells Transformer with value binning Multi-task, cross-modal integration Strong with fine-tuning
scFoundation 50M cells Asymmetric encoder-decoder Scalability, gene expression modeling Moderate in zero-shot
UCE 36M cells Protein-informed transformer Cross-omics integration Variable across tasks
CellWhisperer 1M+ transcriptomes Multimodal contrastive learning Natural language query, annotation Not primarily designed for integration

Practical Applications and Case Studies

Biological Discovery Applications

Effective batch integration enables significant biological discoveries by allowing robust analysis of combined datasets. In Alzheimer's disease research, DeepBID demonstrated remarkable capabilities by improving cell clustering in integrated scRNA-seq datasets from multiple patients, enabling better annotation of unidentified cells and detection of cell-specific differentially expressed genes [24]. The method's coordinated optimization of dimensionality reduction, integration, and clustering facilitated identification of previously obscured cell populations and their potential roles in disease progression.

CellWhisperer represents an innovative approach connecting transcriptomes and natural language through multimodal AI [7]. While not primarily a batch integration method, it demonstrates how integrated datasets can enable novel exploration paradigms. By creating joint embeddings of over 1 million transcriptomes and their textual descriptions, CellWhisperer allows researchers to query scRNA-seq data using natural language, facilitating biological discovery through intuitive data interaction [7].

Clinical and Translational Applications

Batch integration methods show particular promise in clinical and translational settings where combining datasets across institutions, platforms, and timepoints is essential for robust biomarker discovery and validation. In cancer research, effective integration of multi-center datasets has enabled identification of consistent cell states across patient cohorts, improving cell type annotation and rare population detection [3] [23]. The clinical relevance of integration methods is further demonstrated through tasks like cancer cell identification and drug sensitivity prediction across multiple cancer types [3].

The profound consequences of inadequate batch management are highlighted by a clinical trial case where batch effects from an RNA-extraction solution change led to incorrect risk classification for 162 patients, 28 of whom received inappropriate chemotherapy regimens [21]. Such examples underscore the critical importance of robust batch integration in clinical applications where decisions directly impact patient care.

Implementation Considerations

The Researcher's Toolkit for Batch Integration

Table 3: Essential tools and resources for batch integration evaluation

Tool/Resource Function Application Context Key Features
scIB Integration metric calculation Performance benchmarking Comprehensive metric suite
batchbench Method comparison Pipeline evaluation Standardized benchmarking
BioLLM scFM unification Foundation model evaluation Unified API framework
PertEval-scFM Perturbation prediction assessment scFM capability testing Standardized perturbation metrics
CELLxGENE Data repository Real-world validation Diverse, annotated datasets

Successful batch integration requires careful consideration of multiple implementation factors. The choice of batch covariate significantly impacts integration outcomes, with finer resolutions removing more variation but potentially eliminating biologically meaningful signals [23]. Researchers must define whether integration should remove variation between samples, donors, datasets, or other groupings based on their specific biological questions.

Computational resource requirements vary substantially between method classes. Deep learning approaches like scVI and DeepBID typically demand greater computational resources and training time but excel with complex, large-scale integration tasks [24] [23]. Linear methods like Harmony and graph-based approaches like BBKNN offer faster processing for routine integration tasks [23]. The data format requirements also differ, with some methods outputting corrected gene expression matrices while others only produce integrated embeddings [23].

Method Selection Guidelines

Method selection should be guided by specific research contexts and data characteristics. For simple batch correction with consistent cell type compositions across a few batches, linear embedding methods (Harmony, Seurat) and global models (ComBat) provide efficient, effective solutions [23]. For complex data integration involving multiple protocols, nested effects, and potentially non-overlapping cell types, deep learning approaches (scVI, scANVI, DeepBID) and advanced linear methods (Scanorama) demonstrate superior performance [23].

When considering scFMs for integration tasks, researchers should evaluate whether the model's pretraining data aligns with their biological context, as this significantly impacts zero-shot performance [3]. Fine-tuning may be necessary to adapt scFMs to specific integration tasks, particularly for novel cell types or conditions not well-represented in pretraining data [3] [6]. The roughness index (ROGI) can serve as a useful proxy for predicting model performance on specific datasets without extensive benchmarking [3].

The batch integration landscape continues evolving with several promising directions. Multimodal integration approaches that simultaneously handle multiple data modalities (transcriptomics, proteomics, epigenetics) show tremendous potential for comprehensive biological characterization [7] [6]. Transfer learning frameworks that leverage knowledge from large-scale atlases to smaller studies address resource disparity challenges [3]. Automated method selection tools that recommend optimal integration strategies based on dataset characteristics would significantly enhance accessibility [23].

Substantial needs remain for standardized benchmarking frameworks enabling fair method comparison across diverse biological contexts [3] [6]. The BioLLM framework represents progress toward this goal by providing unified APIs for diverse scFMs [6]. Improved biological ground truthing through better ontology integration and functional validation would enhance method evaluation beyond technical metrics [3]. The development of specialized perturbation-aware models could address current limitations in predicting cellular responses to genetic and chemical perturbations [16] [4].

Batch integration remains a challenging but essential component of single-cell research, particularly as studies increase in scale and complexity. No single method universally outperforms others across all scenarios, necessitating careful selection based on data characteristics, research questions, and computational resources. While scFMs represent exciting developments with substantial potential, their current batch integration capabilities in zero-shot settings often fail to exceed traditional methods, particularly for specialized tasks like perturbation prediction.

Deep learning approaches consistently demonstrate superior performance on complex integration tasks but require substantial computational resources. Linear embedding methods offer practical solutions for routine batch correction, while emerging methods like BERT and DeepBID address specific challenges like incomplete data profiles and coordinated analysis optimization. As the field advances, increased standardization, biological validation, and multimodal integration will further enhance our ability to extract meaningful insights from integrated single-cell datasets, ultimately advancing biological discovery and therapeutic development.

Single-cell RNA sequencing (scRNA-seq) has revolutionized biology by providing an unprecedented granular view of transcriptional states at the individual cell level. While much attention has focused on cell-level analysis using single-cell foundation models (scFMs), a parallel and equally important advancement has occurred in the realm of gene-level embeddings. These embeddings—compact, numerical representations of genes learned from vast biological datasets—hold immense potential for extracting functional insights that extend beyond cellular classification. Gene-level embeddings aim to capture fundamental biological properties and relationships between genes, transforming our understanding of gene function, regulation, and interaction in both health and disease states.

The evaluation of these embeddings presents significant challenges, particularly within the context of zero-shot learning, where models must generalize to unseen data without task-specific training. As scFMs are increasingly applied to biological and clinical questions, assessing the quality of their gene-level representations becomes paramount for ensuring they capture biologically meaningful information rather than merely reflecting technical artifacts or dataset-specific biases. This comparison guide provides an objective assessment of current gene-level embedding approaches, their performance across diverse functional tasks, and the experimental frameworks needed to evaluate their biological relevance.

Comparative Performance of Gene Embedding Approaches

Table 1: Key Characteristics of Major Gene Embedding Approaches

| Model Category | Training Data Source | Key Strengths | Inherent Limitations | Representative Models |
|---|---|---|---|---|
| Text-Based Models | Biomedical literature, gene descriptions | Superior performance on genomic properties & regulatory functions [26] | Limited by current knowledge in literature | GenePT, Cell2Sentence, scInterpreter |
| Expression-Based Models | scRNA-seq data, multi-omics data | Excellence in localization tasks and disease-related predictions [26] [27] | May miss sequence-level properties | scGPT, Geneformer, scFoundation |
| Sequence-Based Models | Amino acid sequences, DNA sequences | Outstanding performance in functional and genetic interaction predictions [27] | Limited contextual biological knowledge | Protein Language Models, DNA Foundation Models |
| Network-Based Models | Protein-protein interaction networks | Effective capture of functional relationships and pathways [28] | Dependent on network completeness | node2vec, NetMF, Structure-Preserving Autoencoders |

Quantitative Performance Across Task Categories

Table 2: Performance Comparison Across Gene Property Task Families

| Model Category | Genomic Properties (7 tasks) | Regulatory Functions (6 tasks) | Localization (30 tasks) | Biological Processes (29 tasks) | Protein Properties |
|---|---|---|---|---|---|
| Text-Based Models | Strong performance | Strong performance | Moderate performance | Strong performance | Variable performance |
| Expression-Based Models | Moderate performance | Moderate performance | Superior performance [26] | Strong performance | Limited performance |
| Sequence-Based Models | Superior performance [27] | Superior performance [27] | Moderate performance | Moderate performance | Superior performance |
| Network-Based Models | Limited performance | Moderate performance | Limited performance | Strong performance [28] | Moderate performance |

Recent benchmarking efforts involving 38 classic and state-of-the-art gene embedding methods reveal that the type of training data has a greater influence on performance than the specific embedding construction method, with embedding dimensionality having only minimal impact [27]. Notably, no single model category dominates across all task types, highlighting the importance of selecting embeddings based on the specific biological question.

Experimental Protocols for Evaluating Gene Embeddings

Benchmarking Framework and Task Selection

A comprehensive benchmark for evaluating gene embeddings should encompass multiple task types and biological domains to provide a holistic assessment of model capabilities. The Gene Benchmark framework proposes evaluating embeddings across hundreds of tasks based on ground-truth gene properties collected from professionally curated bioinformatics databases [26]. These tasks are organized into five primary families that collectively capture diverse aspects of gene function:

  • Genomic Properties Tasks: Evaluate the ability to predict sequence-based characteristics, including genes that can be methylated or those that are dose-dependent.
  • Regulatory Functions Tasks: Assess how genes participate in cellular regulatory processes, including transcription factor identification and regulatory network connectivity.
  • Localization Tasks: Test identification of differential expression across tissues and sub-cellular localization patterns.
  • Biological Processes Tasks: Evaluate gene involvement in pathways, disease associations, and prognostic value.
  • Protein Properties Tasks: Focus on protein-level characteristics including functional domains and post-translational modifications.

For zero-shot evaluation, embeddings are extracted from pre-trained models without additional fine-tuning, and simple predictive models (e.g., linear classifiers) are trained on these representations to assess their inherent biological content [26].
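To make this protocol concrete, the minimal sketch below fits a linear probe on frozen gene embeddings for a single binary gene-property task. The variable names (gene_embeddings, is_transcription_factor) and the randomly generated inputs are illustrative placeholders, not part of any benchmark's actual API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Illustrative inputs: one frozen (zero-shot) embedding per gene, plus a binary
# ground-truth property taken from a curated database. Replace with real data.
rng = np.random.default_rng(0)
gene_embeddings = rng.normal(size=(2000, 256))            # (n_genes, embedding_dim)
is_transcription_factor = rng.integers(0, 2, size=2000)   # placeholder labels

# Linear probe: if the frozen embeddings already encode the property,
# even this simple classifier should recover it.
probe = LogisticRegression(max_iter=1000)
auroc = cross_val_score(probe, gene_embeddings, is_transcription_factor,
                        cv=5, scoring="roc_auc")
print(f"Mean zero-shot probe AUROC: {auroc.mean():.3f}")
```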

Gene Set Comparison Using ANDES Methodology

The Algorithm for Network Data Embedding and Similarity (ANDES) provides a novel approach for comparing gene sets in embedding spaces while accounting for functional diversity within sets [28]. Unlike centroid-based methods that collapse set information into a single average vector, ANDES identifies best-matching genes between sets reciprocally, calculating similarity based on embedding distances between these best matches.

The ANDES workflow comprises four key steps:

  • For each gene in Set A, identify the most similar gene in Set B based on embedding distance
  • Reciprocally, for each gene in Set B, identify the most similar gene in Set A
  • Calculate similarity scores based on these best-match distances
  • Estimate statistical significance through Monte Carlo sampling to account for gene set cardinalities

This approach has demonstrated superior performance in recovering functionally matched gene sets from different databases (KEGG pathways and GO biological processes) compared to mean embedding, mean score, and corrected t-score methods [28].

[Diagram: genes in Gene Set A (A1–A3) are matched to their closest genes in Gene Set B (B1–B3) and reciprocally (B1–B3 back to A1–A3); the resulting best-match distances (d1–d6) feed into the ANDES score calculation.]

Diagram 1: ANDES Gene Set Comparison Methodology. This workflow illustrates the reciprocal best-match approach used by ANDES to calculate similarity between gene sets while accounting for functional diversity.
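The reciprocal best-match idea in Diagram 1 can be sketched in a few lines of NumPy/SciPy. This is a simplified illustration of the approach described above, not the published ANDES implementation; in particular, the Monte Carlo standardization against random gene sets of matching sizes is only outlined.

```python
import numpy as np
from scipy.spatial.distance import cdist

def best_match_similarity(emb_a, emb_b):
    """Reciprocal best-match similarity between two gene sets.
    emb_a, emb_b: (n_genes, dim) embedding matrices for sets A and B."""
    sim = 1.0 - cdist(emb_a, emb_b, metric="cosine")   # pairwise cosine similarities
    a_to_b = sim.max(axis=1)                           # best match in B for each gene in A
    b_to_a = sim.max(axis=0)                           # best match in A for each gene in B
    return 0.5 * (a_to_b.mean() + b_to_a.mean())

def standardized_score(emb_a, emb_b, background_emb, n_null=1000, seed=0):
    """Z-score the observed similarity against Monte Carlo null sets of the same sizes."""
    rng = np.random.default_rng(seed)
    observed = best_match_similarity(emb_a, emb_b)
    null = np.array([
        best_match_similarity(
            background_emb[rng.choice(len(background_emb), len(emb_a), replace=False)],
            background_emb[rng.choice(len(background_emb), len(emb_b), replace=False)])
        for _ in range(n_null)])
    return (observed - null.mean()) / null.std()
```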

Evaluation Metrics for Biological Relevance

Beyond traditional performance metrics, specialized evaluation approaches are needed to assess the biological relevance of gene embeddings. Recent benchmarking efforts have introduced several novel metrics:

  • scGraph-OntoRWR: Measures consistency between cell type relationships captured by scFMs and prior biological knowledge from ontologies [3]
  • Lowest Common Ancestor Distance (LCAD): Assesses ontological proximity between misclassified cell types to evaluate the severity of annotation errors [3]
  • Roughness Index (ROGI): Quantifies landscape roughness in pretrained latent space, with smoother landscapes correlating with better downstream task performance [3]

These biologically-informed metrics provide crucial insights into whether models are capturing meaningful biological relationships rather than merely optimizing for technical benchmarks.

Research Reagent Solutions for Gene Embedding Evaluation

Table 3: Essential Research Resources for Gene Embedding Benchmarking

| Resource Category | Specific Resources | Primary Function | Key Features |
|---|---|---|---|
| Benchmarking Frameworks | BioLLM [6], Gene Benchmark [26] | Standardized evaluation pipelines | Unified APIs, support for zero-shot and fine-tuning, multi-modal model integration |
| Data Resources | Reactome [26], Human Protein Atlas [26], Open Targets [26] | Source of ground-truth gene properties | Manually curated, professionally maintained, regularly updated |
| Analysis Tools | ANDES [28], scGraph-OntoRWR [3] | Specialized analysis of embedding spaces | Gene set comparison, ontology-informed evaluation |
| Model Architectures | scGPT [3] [6], Geneformer [3] [6], scFoundation [3] | Pre-trained models for generating embeddings | Various pretraining strategies, different architectural approaches |

Interpretation Guidelines and Practical Recommendations

Model Selection Framework

Selecting the appropriate gene embedding approach requires careful consideration of the specific biological question and available resources. The following decision framework emerges from current benchmarking studies:

  • For tasks involving genomic properties and regulatory functions: Text-based models and sequence-based models generally outperform other approaches [26] [27]
  • For localization and disease-related tasks: Expression-based models demonstrate particular strength [26]
  • For pathway analysis and gene set enrichment: ANDES applied to network-based embeddings provides state-of-the-art performance [28]
  • Under resource constraints: Simpler machine learning models often adapt more efficiently to specific datasets than complex foundation models [3]

Notably, no single scFM consistently outperforms others across all tasks, emphasizing the need for task-specific model selection [3]. Factors such as dataset size, task complexity, need for biological interpretability, and computational resources should all inform model choice.

Future Directions in Gene Embedding Evaluation

As the field advances, several emerging approaches promise to enhance our ability to evaluate gene-level embeddings:

  • Integration of multi-omics data: Combining information from transcriptomics, proteomics, and epigenomics to create more comprehensive gene representations
  • Temporal and spatial context: Incorporating dynamic and spatial information to better capture gene function in development and disease
  • Cross-species knowledge transfer: Leveraging conserved biological principles to improve embeddings for less-studied organisms [28]
  • Causal inference capabilities: Moving beyond correlative relationships to embed causal biological knowledge

The rapid evolution of benchmarking frameworks like BioLLM [6] and the introduction of biologically-informed evaluation metrics [3] are paving the way for more rigorous and meaningful assessment of gene-level embeddings, ultimately accelerating their translation to biological discovery and therapeutic development.

The emergence of single-cell foundation models (scFMs) has revolutionized the analysis of single-cell RNA sequencing (scRNA-seq) data, offering the potential to learn universal biological representations from vast datasets. However, the field faces a significant challenge: the heterogeneous architectures and coding standards of various scFMs create substantial barriers to consistent application and rigorous evaluation [6] [29]. This fragmentation impedes researchers' ability to objectively compare model performance and select the optimal tool for specific biological questions.

To address this critical need, unified frameworks like BioLLM (Biological Large Language Model) have been developed. BioLLM provides standardized APIs and comprehensive documentation that streamline model access, enable seamless model switching, and support consistent benchmarking across diverse scFMs [6] [29]. This guide provides an objective comparison of scFM performance facilitated by such frameworks, contextualized within broader research on zero-shot scFM embedding quality, and delivers actionable insights for researchers, scientists, and drug development professionals.

The Need for Standardization in scFM Evaluation

Single-cell foundation models have demonstrated remarkable potential in learning meaningful representations of genes and cells from massive scRNA-seq datasets [8]. These models, typically built on transformer architectures, treat individual cells as "sentences" and genes or genomic features as "words" or "tokens" [8]. Despite this common conceptual foundation, implementation diversity creates substantial evaluation challenges:

  • Architectural Heterogeneity: scFMs employ different transformer variants (encoder-based, decoder-based, or hybrid designs) with customized modifications [8]
  • Pretraining Diversity: Models are trained on different datasets (ranging from 27.5M to 50M cells) with varying self-supervised objectives [18]
  • Input Representation: Tokenization strategies differ significantly, with models using gene ranking, value binning, or other approaches for input structuring [18]

Without standardized evaluation frameworks, comparing model performance across studies becomes problematic, with results heavily influenced by implementation details rather than fundamental capabilities [18] [5]. BioLLM addresses this by providing a unified interface that eliminates architectural and coding inconsistencies, enabling fair comparisons and more reliable assessments of model strengths and limitations [6] [29].

Experimental Frameworks for scFM Evaluation

Standardized Benchmarking Platforms

Rigorous evaluation of scFMs requires comprehensive benchmarking frameworks that assess model performance across diverse tasks and datasets. Two primary approaches have emerged:

  • Task-Oriented Benchmarks: These evaluate scFMs on specific biological applications. The benchmark by [18] assesses six scFMs across two gene-level and four cell-level tasks using 12 different metrics. Similarly, PertEval-scFM provides a standardized framework specifically designed for evaluating perturbation effect prediction [4].

  • Unified Software Frameworks: BioLLM provides a comprehensive system that integrates diverse scFMs through standardized APIs, supporting both zero-shot and fine-tuning evaluations [6] [29]. This approach enables streamlined model switching and consistent performance assessment across multiple tasks.

Key Evaluation Metrics and Methodologies

Table 1: Evaluation Metrics for scFM Assessment

| Metric Category | Specific Metrics | Purpose | Biological Interpretation |
|---|---|---|---|
| Gene-Level Tasks | Tissue specificity prediction, GO term prediction | Assess functional gene embeddings | Measures capture of biological relationships |
| Cell-Level Tasks | Batch integration, cell type annotation | Evaluate cellular representation quality | Tests preservation of biological variation |
| Knowledge-Based Metrics | scGraph-OntoRWR, LCAD | Quantify biological relevance | Compares with prior biological knowledge |
| Perturbation Prediction | L2 distance, Pearson delta, genetic interaction detection | Assess predictive capability | Tests model generalization to novel conditions |

Evaluation protocols typically employ a zero-shot setting to assess the intrinsic quality of learned representations without task-specific fine-tuning [18] [4]. This approach is particularly valuable for determining whether scFMs truly learn fundamental biological principles during pretraining rather than simply memorizing dataset-specific patterns.

For perturbation prediction tasks, special data splitting strategies are crucial, where no perturbation condition occurs in both training and test sets [30]. This prevents models from exploiting dataset-specific correlations and provides a more realistic assessment of generalization capability.
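A minimal sketch of such a perturbation-level split is shown below, assuming cell-level metadata with a "perturbation" column; the column name and the AnnData-style usage are illustrative assumptions, not a prescribed API.

```python
import numpy as np
import pandas as pd

def split_by_perturbation(obs: pd.DataFrame, test_fraction: float = 0.2, seed: int = 0):
    """Assign entire perturbation conditions to train or test so that
    no condition ever appears in both splits."""
    rng = np.random.default_rng(seed)
    conditions = obs["perturbation"].unique().copy()
    rng.shuffle(conditions)
    n_test = max(1, int(round(test_fraction * len(conditions))))
    test_conditions = set(conditions[:n_test])
    test_mask = obs["perturbation"].isin(test_conditions).to_numpy()
    return ~test_mask, test_mask  # boolean masks over cells

# Illustrative usage with an AnnData object `adata`:
# train_mask, test_mask = split_by_perturbation(adata.obs)
# assert not (set(adata.obs.loc[train_mask, "perturbation"])
#             & set(adata.obs.loc[test_mask, "perturbation"]))
```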

Comparative Performance Analysis of scFMs

Task-Specific Model Performance

Unified frameworks like BioLLM have enabled comprehensive comparisons across diverse scFMs, revealing distinct performance patterns across task types:

Table 2: scFM Performance Across Task Categories

| Model | Gene-Level Tasks | Cell-Level Tasks | Perturbation Prediction | Zero-Shot Capability |
|---|---|---|---|---|
| scGPT | Strong | Robust across tasks | Limited improvement over baselines | Strong |
| Geneformer | Strong | Variable | Limited improvement over baselines | Moderate |
| scFoundation | Strong | Moderate | Limited improvement over baselines | Moderate |
| UCE | Moderate | Moderate | Limited improvement over baselines | Moderate |
| scBERT | Weaker | Weaker | Limited improvement over baselines | Weaker |

BioLLM's evaluation revealed that scGPT demonstrates robust performance across all tasks, including both zero-shot and fine-tuning scenarios, while Geneformer and scFoundation show strong capabilities in gene-level tasks, benefiting from their effective pretraining strategies [6] [29]. In contrast, scBERT typically lags behind, likely due to its smaller model size and limited training data [29].

Notably, benchmark studies consistently show that no single scFM outperforms the others across all tasks and datasets [18]. This underscores the importance of task-specific model selection rather than the search for a universal "best" model.

Performance on Perturbation Prediction

Perturbation effect prediction represents a particularly challenging task for scFMs. Recent benchmarks reveal significant limitations in current approaches:

  • Double Perturbation Prediction: When predicting expression changes after double perturbations, no scFM outperformed a simple additive baseline that sums individual logarithmic fold changes [5]
  • Genetic Interaction Detection: In identifying genetic interactions (where double perturbation effects deviate from additive expectations), scFMs performed no better than a "no change" baseline that always predicts control condition expression [5]
  • Unseen Perturbation Prediction: For predicting effects of completely novel perturbations, linear models using pretrained embeddings from perturbation data outperformed scFMs pretrained on single-cell atlas data [5]

These results suggest that pretraining on large-scale single-cell atlas data provides only limited benefit for perturbation prediction tasks, while pretraining on perturbation data itself significantly increases predictive performance [5].

Zero-Shot Embedding Quality

The quality of zero-shot embeddings (representations generated without task-specific fine-tuning) varies considerably across models and tasks:

  • Biological Relevance: Models like scGPT and Geneformer produce embeddings that capture meaningful biological relationships, as measured by ontology-informed metrics like scGraph-OntoRWR [18]
  • Batch Integration: Several scFMs demonstrate robust performance in removing technical batch effects while preserving biological variation [18]
  • Smoothness of Latent Space: Performance improvements in downstream tasks correlate with smoother cell-property landscapes in the pretrained latent space, which reduces the difficulty of training task-specific models [18]

Experimental Protocols for scFM Assessment

Standardized Evaluation Workflow

The following diagram illustrates a comprehensive experimental workflow for scFM evaluation, integrating best practices from multiple benchmarking studies:

[Diagram: Start evaluation → Dataset selection (5+ datasets with diverse conditions) → Model loading via BioLLM unified API → Feature extraction (zero-shot embeddings) → Task evaluation (gene-level and cell-level tasks) → Metric calculation (12+ metrics, including novel ontology-based metrics) → Results analysis and model ranking.]

Key Experimental Components

Dataset Selection and Preparation

Benchmarking studies emphasize the importance of diverse, high-quality datasets for rigorous evaluation:

  • Dataset Diversity: Evaluation should include 5+ datasets with diverse biological conditions, batch effects, and tissue types [18]
  • Clinical Relevance: Inclusion of clinically relevant tasks such as cancer cell identification and drug sensitivity prediction across multiple cancer types [18]
  • Independent Validation: Use of completely independent datasets like AIDA v2 from CellxGene to mitigate data leakage risks [18]

For perturbation prediction benchmarks, datasets should include large-scale perturbation studies with both single and double perturbations, using non-standard data splits where no perturbation condition appears in both training and test sets [30] [5].

Evaluation Metrics and Protocols

Comprehensive evaluation requires multiple metric types:

  • Traditional Metrics: Standard metrics like MSE, MAE, and Spearman correlation provide baseline performance measures [30]
  • Biological Relevance Metrics: Novel ontology-informed metrics like scGraph-OntoRWR and LCAD assess consistency with prior biological knowledge [18]
  • Task-Specific Metrics: For perturbation prediction, genetic interaction detection capability and performance on top differentially expressed genes provide specialized assessment [30] [5]

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Research Reagents and Computational Tools for scFM Evaluation

| Tool/Resource | Type | Function | Relevance to Evaluation |
|---|---|---|---|
| BioLLM Framework | Software Framework | Unified API for scFM integration | Standardizes model access and switching |
| CellxGene Atlas | Data Resource | Curated single-cell datasets | Provides independent validation data |
| PertEval-scFM | Benchmarking Tool | Specialized perturbation evaluation | Assesses perturbation prediction capability |
| scGraph-OntoRWR | Evaluation Metric | Ontology-based consistency measure | Quantifies biological relevance of embeddings |
| PEREGGRN | Benchmarking Platform | Expression forecasting evaluation | Tests prediction of genetic perturbation effects |

Unified frameworks like BioLLM play a crucial role in standardizing the evaluation of single-cell foundation models, enabling fair comparisons and revealing distinct model strengths and limitations. Through comprehensive benchmarking, several key findings emerge:

First, current scFMs demonstrate robust performance on standard tasks like batch integration and cell type annotation, with scGPT emerging as a consistently strong performer across multiple tasks [6] [29]. Second, perturbation effect prediction remains a significant challenge, with simple baselines often outperforming sophisticated foundation models [4] [5]. Third, biological relevance metrics reveal that scFMs can capture meaningful biological relationships, particularly in their zero-shot embeddings [18].

For researchers and drug development professionals, these findings suggest a pragmatic approach to model selection: choose scGPT for general-purpose applications, Geneformer or scFoundation for gene-level tasks, and consider simple baselines or specialized models for perturbation prediction. As the field evolves, standardized frameworks will continue to be essential for tracking progress, identifying limitations, and guiding the development of more capable and biologically meaningful foundation models.

Diagnosing and Overcoming Limitations in Zero-Shot Performance

Single-cell foundation models (scFMs) represent a groundbreaking advance in computational biology, applying large-scale, self-supervised learning to massive single-cell transcriptomics datasets. Trained on millions of cells, these models aim to learn universal biological principles that can be adapted to various downstream tasks like cell type annotation, batch integration, and drug sensitivity prediction [8]. However, their practical utility in discovery settings often depends on zero-shot performance—using pretrained model embeddings without any task-specific fine-tuning. This evaluation is crucial for exploratory research where true labels are unknown and fine-tuning is impossible [1]. Recent rigorous benchmarking reveals that in numerous realistic scenarios, simpler baseline methods, particularly Highly Variable Genes (HVG) selection, consistently match or surpass sophisticated scFMs in zero-shot settings. This guide objectively examines the experimental evidence documenting these performance gaps, analyzes the underlying causes, and provides researchers with methodologies for selecting appropriate tools based on their specific analytical tasks.

Quantitative Performance Comparison: scFMs vs. Baselines

Independent benchmark studies have systematically evaluated scFMs against traditional methods across fundamental single-cell analysis tasks. The tables below summarize key quantitative findings.

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score)

| Dataset | HVG | scVI | Harmony | scGPT | Geneformer |
|---|---|---|---|---|---|
| Pancreas | 0.71 | 0.67 | 0.62 | 0.59 | 0.45 |
| Immune | 0.65 | 0.61 | 0.58 | 0.55 | 0.42 |
| Tabula Sapiens | 0.59 | 0.55 | 0.52 | 0.50 | 0.38 |
| PBMC (12k) | 0.62 | 0.60 | 0.59 | 0.63 | 0.46 |

Table 2: Batch Integration Performance (Batch Mixing Score)

| Dataset | HVG | scVI | Harmony | scGPT | Geneformer |
|---|---|---|---|---|---|
| Pancreas | 0.88 | 0.85 | 0.81 | 0.75 | 0.52 |
| PBMC | 0.85 | 0.82 | 0.79 | 0.78 | 0.55 |
| Tabula Sapiens | 0.80 | 0.75 | 0.65 | 0.77 | 0.48 |

Table 3: Challenging Task - Rare Cell Type Identification (ARI)

| Feature Selection Method | Number of Features | Adjusted Rand Index (ARI) |
|---|---|---|
| HVGs | 350 | 0.75 |
| Random Gene Selection | 350 | ~0.05 |
| Random Gene Selection | 16,985 (all genes) | ~0.05 |

The data consistently shows that HVG selection is a robust baseline, outperforming or matching scFMs and other integration methods in both clustering and batch integration [1]. The failure of random gene selection, even with the entire transcriptome, in identifying rare T-regulatory cells highlights that success in difficult tasks requires not just including predictive features but actively excluding non-informative ones [31] [32].
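For orientation, the snippet below sketches the HVG baseline behind these comparisons using standard scanpy calls: select ~2,000 highly variable genes, cluster with Leiden, and score the clustering against ground-truth labels. It assumes an AnnData object `adata` with raw counts and labels stored in `adata.obs["cell_type"]`; the exact preprocessing and parameters in the cited benchmarks may differ.

```python
import scanpy as sc
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Assumes `adata` holds raw counts with ground-truth labels in adata.obs["cell_type"].
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)  # the HVG baseline
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.leiden(adata, key_added="hvg_clusters")

ari = adjusted_rand_score(adata.obs["cell_type"], adata.obs["hvg_clusters"])
nmi = normalized_mutual_info_score(adata.obs["cell_type"], adata.obs["hvg_clusters"])
print(f"HVG baseline: ARI={ari:.2f}, NMI={nmi:.2f}")
```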

Experimental Protocols for Benchmarking

To ensure reproducibility and fair comparison, benchmarking studies follow rigorous standardized protocols.

Zero-Shot Embedding Evaluation Protocol

  • Embedding Extraction: Pre-trained scFMs (e.g., scGPT, Geneformer) process held-out test datasets without any fine-tuning. The cell-level embedding vectors are extracted from the model's output layer.
  • Baseline Generation: For the HVG baseline, expression data is subset to the top ~2,000 highly variable genes. Traditional methods such as scVI and Harmony generate embeddings using their standard algorithms.
  • Downstream Task Application: All embeddings are evaluated on the same downstream tasks:
    • Cell Type Clustering: Embeddings are used for unsupervised clustering (e.g., Leiden algorithm). Results are compared to ground-truth labels using metrics like Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI).
    • Batch Integration: The ability to mix cells from different technical batches while preserving biological separation is quantified using metrics like batch mixing score and principal component regression (PCR) [1].
  • Quantitative Ranking: Models are ranked across multiple datasets and metrics. Aggregate scores are computed to provide holistic performance assessments [3] [18].

Protocol for Evaluating Feature Selection on Subtle Tasks

  • Dataset Construction: A common dataset (e.g., 10x Genomics PBMC) is clustered using default HVGs to establish a ground-truth label for a major cell population (e.g., CD4+ T cells). This population is subsetted for a more challenging task (e.g., identifying T-regulatory cells) [32].
  • Feature Selection Comparison: Different feature selection methods (HVGs, random selection, BigSur) are applied, using an identical number of genes.
  • Clustering and Validation: The selected feature sets are used for clustering. Performance is measured by the ability to recover the known rare cell population, using metrics like ARI and NMI against the expert-annotated labels [31].

[Diagram: Start with a raw scRNA-seq dataset → subset a major population (e.g., CD4+ T cells) → apply a feature selection method → perform unsupervised clustering → validate against the known rare cell annotation → calculate ARI/NMI.]

Figure 1: Experimental workflow for evaluating feature selection methods on challenging tasks like rare cell type identification.
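A compact sketch of the comparison in Figure 1 is given below: cluster a pre-selected major population on different gene subsets of identical size and score recovery of the known rare population. The helper name, label key, and the commented usage lines are illustrative assumptions rather than the cited studies' code.

```python
import numpy as np
import scanpy as sc
from sklearn.metrics import adjusted_rand_score

def clustering_ari(adata, gene_mask, label_key="is_treg"):
    """Cluster on a chosen gene subset and score recovery of a known rare population."""
    sub = adata[:, gene_mask].copy()
    sc.pp.pca(sub, n_comps=30)
    sc.pp.neighbors(sub)
    sc.tl.leiden(sub)
    return adjusted_rand_score(sub.obs[label_key], sub.obs["leiden"])

# Illustrative usage on a log-normalized CD4+ T-cell subset `tcells` (AnnData):
# hvg = sc.pp.highly_variable_genes(tcells, n_top_genes=350, inplace=False)["highly_variable"].to_numpy()
# rnd = np.zeros(tcells.n_vars, dtype=bool)
# rnd[np.random.default_rng(0).choice(tcells.n_vars, 350, replace=False)] = True
# print("HVG ARI:", clustering_ari(tcells, hvg), "Random ARI:", clustering_ari(tcells, rnd))
```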

When and Why Simple Methods Excel

Experimental evidence points to specific failure modes of scFMs and corresponding strengths of simpler approaches.

Key Failure Modes of Single-Cell Foundation Models

  • Ineffective Zero-Shot Embeddings: The core pretraining objective of masked gene prediction does not guarantee the production of cell embeddings that are linearly separable by cell type. In zero-shot settings, the embeddings from models like Geneformer often retain strong batch effects and fail to capture biologically relevant cluster structure [1].
  • Task-Architecture Misalignment: The transformer architecture, designed for sequential data, is applied to non-sequential gene expression data. The need to impose an artificial gene order (e.g., by expression level) may not optimally represent biological reality, limiting the model's ability to learn generalizable cellular representations [3] [8].
  • Overwhelming Signal from Non-Predictive Features: Unlike curated feature selection, scFM embeddings incorporate information from all input genes. In complex tasks, signals from irrelevant genes can swamp subtle biological signals, reducing clustering accuracy—a phenomenon known as the "curse of dimensionality" [31] [32].

Inherent Strengths of Simple Baselines

  • Direct Dimensionality Reduction: HVG selection directly addresses the high-dimensionality and high-sparsity of scRNA-seq data by filtering out genes with low signal-to-noise ratio, creating a more manageable and informative feature space for clustering [32].
  • Computational Efficiency and Transparency: Methods like HVG selection are computationally cheap and their operation is transparent. They avoid the "black box" nature of large neural networks, making them easier to debug and interpret for researchers.
  • Optimal Performance on Common Tasks: For standard analyses involving abundant, well-separated cell types, the biological signal in the data is so strong that complex models offer little marginal benefit. Simple methods are sufficient and often more reliable [31].

A Researcher's Guide to Model Selection

Choosing between an scFM and a simpler baseline depends on the specific research context. The following diagram and table provide a practical guide.

[Decision diagram: Are you in a discovery context with no labeled data? If no, consider fine-tuning an scFM if labels exist. If yes, is the biological signal subtle (e.g., rare cells, fine-grained states)? If yes, test scFMs but validate against baselines. If no, are computational resources and time limited? If yes, use a simple baseline (e.g., HVG). If no, is the task standard (e.g., clustering major cell types)? If yes, use a simple baseline; if no, test scFMs but validate against baselines.]

Figure 2: A decision workflow to guide the choice between scFMs and simpler baseline methods.

Table 4: The Scientist's Toolkit - Key Reagents and Resources

| Resource Name | Type | Primary Function in Evaluation |
|---|---|---|
| CELLxGENE Database | Data Platform | Provides standardized, annotated single-cell datasets for model pretraining and benchmarking [3]. |
| Highly Variable Genes (HVG) | Computational Method | A robust baseline for feature selection, used to create a simplified feature space for clustering [1]. |
| scVI | Generative Model | An established baseline for batch integration and representation learning, using a probabilistic deep learning framework [1]. |
| Harmony | Integration Algorithm | A high-performing baseline for integrating datasets across different technical batches and conditions [3]. |
| Adjusted Rand Index (ARI) | Evaluation Metric | Measures the similarity between two clusterings (e.g., predicted vs. true labels), adjusted for chance [31]. |
| Normalized Mutual Information (NMI) | Evaluation Metric | Quantifies the mutual information between clusterings, normalized to a 0–1 scale [32]. |

The development of single-cell foundation models is a promising and rapidly evolving field. However, current evidence indicates that their zero-shot embeddings do not consistently provide a superior analytical foundation compared to well-established, simpler methods. Highly Variable Genes selection remains a surprisingly powerful and robust baseline for standard tasks like cell type clustering and batch integration.

Researchers should be aware of the specific failure modes of scFMs, particularly in zero-shot settings on complex or subtle tasks. The recommended practice is to rigorously validate the performance of any scFM against simple baselines within the context of a specific dataset and analytical goal. As the field progresses, future model generations that better align pretraining objectives with biological intuition and downstream use cases may close this performance gap, but for now, a measured and evidence-based approach to tool selection is paramount.

Foundation models for single-cell transcriptomics (scFMs) promise to revolutionize biological research by providing powerful, general-purpose models that can be adapted to various downstream tasks with minimal additional training [8]. These models, typically built on transformer architectures, are pretrained on massive collections of single-cell data with the goal of learning universal patterns of cellular biology [8]. The ultimate objective is to create models that generate high-quality cell embeddings capable of enabling accurate zero-shot performance—where the pretrained model is applied directly to new datasets without any task-specific fine-tuning [1].

However, emerging research reveals a critical challenge: the composition of pretraining data fundamentally shapes model performance, and mismatches between training and application contexts can severely limit generalization [1] [33]. This article examines the empirical evidence demonstrating how dataset composition impacts scFM generalization, with particular focus on zero-shot performance where these effects are most pronounced.

Experimental Evidence: Performance Gaps in Zero-Shot Evaluation

Comparative Performance on Cell Type Identification

Rigorous zero-shot evaluation exposes significant limitations in current scFMs. When tested on their ability to separate known cell types across multiple datasets without any fine-tuning, proposed foundation models frequently underperform simpler established methods [1].

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score)

| Model/Dataset | Tabula Sapiens | Pancreas | PBMC (12k) | Immune |
|---|---|---|---|---|
| scGPT | 0.41 | 0.35 | 0.62 | 0.38 |
| Geneformer | 0.32 | 0.28 | 0.45 | 0.31 |
| scVI | 0.58 | 0.52 | 0.55 | 0.51 |
| Harmony | 0.49 | 0.48 | 0.53 | 0.46 |
| HVG | 0.61 | 0.56 | 0.64 | 0.55 |

As shown in Table 1, selecting Highly Variable Genes (HVG) consistently outperforms both scGPT and Geneformer across all evaluated datasets, with established methods like scVI and Harmony also demonstrating superior performance [1]. This performance gap is particularly striking because HVG represents a relatively simple baseline approach without the complex architecture of foundation models.

Batch Integration Capabilities

Batch integration—correcting for technical variations while preserving biological signals—is another critical task where scFMs show inconsistent zero-shot performance.

Table 2: Batch Integration Performance (Batch Mixing Score)

| Model/Dataset | Pancreas | PBMC | Tabula Sapiens | Immune |
|---|---|---|---|---|
| scGPT | 0.46 | 0.52 | 0.61 | 0.58 |
| Geneformer | 0.32 | 0.38 | 0.42 | 0.39 |
| scVI | 0.72 | 0.68 | 0.59 | 0.52 |
| Harmony | 0.65 | 0.63 | 0.48 | 0.62 |
| HVG | 0.69 | 0.66 | 0.65 | 0.60 |

Quantitative evaluation reveals that Geneformer consistently underperforms across most datasets, while scGPT shows variable performance—excelling on some datasets but lagging on others [1]. Visual inspection of embeddings confirms these quantitative findings, with Geneformer's cell embedding space often failing to retain meaningful cell type information and instead clustering primarily by batch effects [1].

[Diagram: Zero-shot scFM evaluation workflow. A pretrained scFM (scGPT, Geneformer) and a new, unlabeled single-cell dataset feed into cell embedding generation with no fine-tuning; the embeddings are assessed on cell type clustering, batch integration, and expression reconstruction; performance metrics (ASW, AvgBIO, batch score, R²) are then compared against baseline methods.]

The Data Composition Effect: Systematic Investigations

Impact of Training Data Diversity on Reconstruction Accuracy

Beyond architectural considerations, research systematically investigating training data composition reveals how dramatically pretraining corpora affect fundamental model capabilities. Studies using linearly decoded variational autoencoders (LDVAE) trained on different data compositions show clear patterns in reconstruction accuracy across cell types [33].

Table 3: Impact of Training Data on Reconstruction Accuracy (R²)

| Training Corpus | Blood Cells | Bone Marrow | AML Cells | SCC Cells | Neurons |
|---|---|---|---|---|---|
| Blood (Baseline) | 0.69 | 0.38 | 0.33 | 0.24 | 0.00 |
| Bone Marrow | 0.45 | 0.62 | 0.41 | 0.29 | 0.00 |
| Blood + Bone Marrow | 0.67 | 0.59 | 0.49 | 0.31 | 0.00 |
| Blood + TF Atlas | 0.68 | 0.61 | 0.52 | 0.43 | 0.18 |

The data reveals three critical patterns: (1) Models generalize poorly to unseen cell types, showing dramatically reduced reconstruction accuracy; (2) Including cancer data in training doesn't necessarily improve performance on unseen cancer types; and (3) Incorporating directed differentiation atlases (TF Atlas) significantly enhances performance on out-of-distribution cell types [33].
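The per-cell-type reconstruction accuracies in Table 3 can be reproduced in spirit with the short helper below, which averages cell-wise R² within each cell type. The array names are illustrative assumptions, and the original study's exact computation may differ.

```python
import numpy as np
from sklearn.metrics import r2_score

def per_cell_type_r2(x_true, x_recon, cell_types):
    """Mean cell-wise reconstruction R², grouped by cell type.
    x_true, x_recon: (n_cells, n_genes) arrays of observed and reconstructed expression.
    cell_types: array of one label per cell."""
    scores = {}
    for ct in np.unique(cell_types):
        mask = np.asarray(cell_types) == ct
        cellwise = [r2_score(t, r) for t, r in zip(x_true[mask], x_recon[mask])]
        scores[ct] = float(np.mean(cellwise))
    return scores
```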

Tissue-Specific Pretraining and Scaling Effects

Investigations into scGPT variants pretrained on different datasets reveal nuanced relationships between data composition and model performance. While pretraining generally improves performance over randomly initialized models, the benefits are not uniform across datasets [1].

Surprisingly, scGPT pretrained exclusively on blood and bone marrow cells (scGPT-blood) sometimes outperforms the larger scGPT-human model trained on 33 million non-cancerous human cells, even for datasets involving tissue types beyond blood [1]. This suggests that simply scaling dataset size without strategic composition may not yield consistent improvements, challenging the prevailing "bigger is better" paradigm in foundation model development.

[Diagram: Developmental hierarchy data framework. The TF Atlas (directed differentiation) covers embryonic stem cells; bone marrow (developmental) data covers progenitor cells; peripheral blood (mature tissues) covers mature adult cells; the cancer atlas covers malignant (deranged) states; Perturb-seq data covers a perturbation atlas of microstates derived from mature cells.]

Experimental Protocols and Methodologies

Zero-Shot Evaluation Framework

The experimental protocol for evaluating zero-shot performance follows a standardized framework [1]:

  • Model Acquisition: Download pretrained scGPT and Geneformer models from official repositories
  • Embedding Generation: Process evaluation datasets through models without any fine-tuning to generate cell embeddings
  • Task Evaluation: Apply embeddings to specific tasks including:
    • Cell type clustering using Leiden algorithm on embeddings
    • Batch integration using multiple metrics (batch mixing score, PCR)
    • Visualization via UMAP projection
  • Metric Calculation: Compute quantitative metrics including:
    • Average BIO (AvgBIO) score for cell type separation
    • Average silhouette width (ASW) for cluster compactness (a simplified silhouette-based sketch of these scores follows this list)
    • Batch mixing scores for integration quality
    • Principal component regression (PCR) for batch effect quantification
  • Baseline Comparison: Compare against established methods (scVI, Harmony) and simple approaches (HVG selection)
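As referenced in the metric list above, simplified silhouette-based versions of the biological and batch scores can be sketched as follows. These are schematic stand-ins under stated assumptions; the exact AvgBIO, batch-mixing, and PCR definitions used in the cited studies are not reproduced here.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def cell_type_asw(embeddings, cell_types):
    """Silhouette width on cell-type labels, rescaled to [0, 1]; higher = better separation."""
    return (silhouette_score(embeddings, cell_types) + 1) / 2

def batch_mixing_asw(embeddings, batches):
    """Batch-mixing proxy: 1 minus the rescaled silhouette on batch labels; higher = better mixing."""
    return 1 - (silhouette_score(embeddings, batches) + 1) / 2

# Illustrative usage: emb is an (n_cells, dim) embedding matrix,
# labels/batches are per-cell arrays of cell types and batch IDs.
# print(cell_type_asw(emb, labels), batch_mixing_asw(emb, batches))
```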

Data Composition Experiments

The systematic investigation of training data effects employs this methodology [33]:

  • Corpus Construction: Create specialized training datasets representing different regions of developmental hierarchy:
    • Blood (Baseline): ~100k healthy peripheral blood cells
    • Bone Marrow: ~100k bone marrow cells
    • Blood Cancer: ~100k malignant hematological cells
    • TF Atlas: ~100k cells from directed differentiation experiments
  • Model Training: Train LDVAE models on individual and combined corpora
  • Evaluation Framework: Test reconstruction accuracy (R²) across five evaluation datasets representing different distances from training distribution
  • Statistical Robustness: Repeat all experiments across five random seeds for reliability

The Scientist's Toolkit: Essential Research Reagents and Solutions

Table 4: Key Research Reagents and Computational Tools

| Resource | Type | Primary Function | Application Context |
|---|---|---|---|
| scGPT | Foundation Model | Single-cell analysis via transformer architecture | Zero-shot evaluation, fine-tuning tasks, perturbation prediction |
| Geneformer | Foundation Model | Transcriptomics-focused transformer model | Cell embedding generation, transfer learning applications |
| scVI | Probabilistic Model | Variational inference for single-cell data | Baseline comparison, probabilistic modeling, batch correction |
| Harmony | Integration Algorithm | Principal component-based batch integration | Benchmarking integration performance, removing technical variation |
| CELLxGENE | Data Platform | Curated single-cell data repository | Data sourcing, pretraining corpus assembly, benchmark creation |
| LDVAE | Interpretable Model | Linearly decoded variational autoencoder | Training composition studies, interpretable latent spaces |
| HVG Selection | Method | Highly variable gene identification | Baseline method for feature selection, performance comparison |

The collective evidence demonstrates that pretraining data composition fundamentally constrains scFM generalization capabilities. The "pretraining data mismatch" problem manifests most clearly in zero-shot settings, where models cannot rely on task-specific fine-tuning to adapt to new data distributions [1] [33].

These findings have crucial implications for both developers and users of single-cell foundation models. For developers, strategic data curation focusing on developmental hierarchy coverage may be more important than simply maximizing dataset size [33]. For users, understanding the composition of a model's pretraining data is essential for predicting its performance on specific applications. Zero-shot evaluation must become a standard assessment practice, as fine-tuning alone can mask fundamental limitations in model generalization [1].

Future progress in scFMs will require more sophisticated approaches to training data design, moving beyond the current paradigm of indiscriminate data aggregation toward strategically constructed corpora that comprehensively represent cellular state spaces.

The emergence of single-cell foundation models (scFMs) has promised a revolution in biological discovery, offering tools to integrate heterogeneous datasets and explore biological systems at an unprecedented scale. However, as these models grow in complexity, a critical question remains: how can we effectively evaluate their ability to capture biologically meaningful patterns beyond technical benchmarks? Traditional computational metrics often fail to assess whether model outputs align with established biological knowledge, creating a significant gap between computational performance and biological relevance. This evaluation challenge is particularly pronounced in zero-shot settings, where models are applied without task-specific fine-tuning—a common scenario in exploratory biological research where predefined labels are unavailable [1].

The limitations of conventional evaluation approaches have become increasingly apparent. Embeddings that score highly on technical metrics may still distort fundamental biological relationships, potentially misleading downstream analysis [34]. To address this critical gap, researchers have introduced novel ontology-informed metrics specifically designed to evaluate whether scFMs capture biologically meaningful relationships. These metrics, including scGraph-OntoRWR and Lowest Common Ancestor Distance (LCAD), leverage formal biological ontologies to ground model evaluation in established biological knowledge, providing a crucial bridge between computational outputs and biological interpretation [3] [18].

Understanding the Metric Framework

The Foundation: Biological Ontologies

Biological ontologies provide the foundational framework that enables rigorous biological evaluation of scFMs. Unlike simple dictionaries or databases, ontologies are formal, explicit specifications of shared conceptualizations within the biological domain. They capture not just biological terms but the rich network of relationships between them, enabling both humans and computers to reason about biological concepts in sophisticated ways [35].

Two fundamental concept types form the bedrock of most biological ontologies. Continuants are entities that persist through time while maintaining their identity, such as molecules, cells, tissues, and organs. Occurrents represent time-dependent entities including processes, actions, and states—for example, biochemical reactions, cell division, or disease progression [35]. This distinction is crucial for proper biological modeling, as it helps avoid common errors such as confusing a physical structure with the processes it participates in.

Initiatives like the Open Biological and Biomedical Ontology (OBO) Foundry represent major community efforts to coordinate ontology development across biological sciences. The OBO Foundry establishes best practices for ontology creation and has developed standardized relationship sets (such as is_a, part_of, and participates_in) with clearly defined logical properties, ensuring consistent interpretation across different domains [35].

The Novel Evaluation Metrics

scGraph-OntoRWR: Evaluating Relational Consistency

The scGraph-OntoRWR metric is designed to measure the consistency of cell type relationships captured by scFMs with prior biological knowledge encoded in cell ontologies [3] [18]. The metric operates through a sophisticated computational workflow that compares model-derived relationships with established ontological relationships.

Methodological Protocol:

  • Graph Construction: Extract cell embeddings from a scFM and construct a cell-cell similarity graph based on distances in the latent space.
  • Ontological Reference: Access the relevant cell ontology to obtain the formal hierarchical relationships between cell types.
  • Random Walk with Restart: Perform Random Walk with Restart (RWR) algorithms on both the model-derived graph and the ontology-based graph. RWR simulates a random traversal that, at each step, either moves to a neighboring node or restarts from the origin node, effectively capturing local network structure and connectivity patterns.
  • Consistency Measurement: Compare the steady-state distributions of the RWR processes on both graphs to quantify how well the model's representation of cell type relationships aligns with established biological knowledge.
  • Score Calculation: Compute a final similarity score indicating the degree of alignment between the model-derived relationships and the ontological ground truth [3].

This metric effectively evaluates whether functionally related cell types (e.g., different types of T-cells) are positioned closer in the embedding space compared to biologically distant cell types (e.g., neurons vs. immune cells), as would be expected from biological principles.
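The random-walk step at the heart of this metric can be sketched as follows: compute the steady-state visiting distribution of a walk that restarts at a seed node, on both the model-derived and ontology graphs, and then compare the two profiles. This is a generic RWR sketch under those assumptions, not the published scGraph-OntoRWR code.

```python
import numpy as np

def random_walk_with_restart(adjacency, seed_idx, restart_prob=0.5,
                             tol=1e-8, max_iter=1000):
    """Steady-state visiting probabilities of a walk that restarts at seed_idx.
    adjacency: (n, n) non-negative weight matrix of the graph."""
    col_sums = adjacency.sum(axis=0, keepdims=True)
    W = adjacency / np.where(col_sums == 0, 1, col_sums)  # column-stochastic transitions

    restart = np.zeros(adjacency.shape[0])
    restart[seed_idx] = 1.0
    p = restart.copy()
    for _ in range(max_iter):
        p_next = (1 - restart_prob) * (W @ p) + restart_prob * restart
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p

# Consistency between the two graphs can then be quantified, e.g., by correlating
# the RWR profiles obtained from matching seed nodes on each graph:
# consistency = np.corrcoef(rwr_model_profile, rwr_ontology_profile)[0, 1]
```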

LCAD: Assessing Error Biological Significance

The Lowest Common Ancestor Distance (LCAD) metric introduces a crucial biological perspective to error analysis in cell type annotation tasks. Unlike simple accuracy metrics that treat all misclassifications equally, LCAD measures the ontological proximity between misclassified cell types and their correct labels, thereby assessing the biological severity of annotation errors [3] [18].

Methodological Protocol:

  • Error Identification: Identify misclassified cells from cell type annotation tasks.
  • Ontological Mapping: Map both the predicted cell type and the actual cell type to their corresponding positions in the cell ontology hierarchy.
  • Common Ancestor Detection: For each misclassification, identify the Lowest Common Ancestor (LCA) of the predicted and actual cell types within the ontological hierarchy.
  • Distance Calculation: Compute the ontological distance between the misclassified cell type and the LCA, and between the true cell type and the LCA. These distances are combined to reflect the biological severity of the error.
  • Severity Assessment: A smaller LCAD indicates that the misclassification occurred between biologically similar cell types (e.g., two subtypes of macrophages), while a larger LCAD indicates a more biologically severe error (e.g., confusing a blood cell with a neuron) [3].

This approach recognizes that not all errors are equally problematic from a biological perspective and provides a more nuanced evaluation of model performance.
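A toy version of an LCAD-style severity score is sketched below, assuming the cell ontology is available as a directed graph with edges pointing from parent to child terms. The toy ontology and the distance definition are illustrative; the benchmark's exact LCAD formula may differ.

```python
import networkx as nx

def lcad(ontology: nx.DiGraph, predicted: str, truth: str) -> int:
    """Error severity as the summed path length from the lowest common ancestor
    to the predicted and true cell types (edges point parent -> child)."""
    lca = nx.lowest_common_ancestor(ontology, predicted, truth)
    return (nx.shortest_path_length(ontology, lca, predicted)
            + nx.shortest_path_length(ontology, lca, truth))

# Toy ontology: confusing two T-cell subtypes is a milder error
# than confusing a T cell with a neuron.
onto = nx.DiGraph([
    ("cell", "immune cell"), ("cell", "neuron"),
    ("immune cell", "T cell"),
    ("T cell", "CD4+ T cell"), ("T cell", "T-regulatory cell"),
])
print(lcad(onto, "CD4+ T cell", "T-regulatory cell"))  # 2 -> biologically mild error
print(lcad(onto, "neuron", "T-regulatory cell"))       # 4 -> biologically severe error
```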

Comparative Performance Analysis

Experimental Framework and Benchmark Design

The evaluation of these novel metrics comes from a comprehensive benchmark study published in Genome Biology (2025) that assessed six prominent scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) against established baseline methods [3] [18]. The benchmarking was designed to reflect realistic research scenarios across both gene-level and cell-level tasks.

Experimental Protocol Overview:

  • Model Selection: Six scFMs representing different architectural approaches and pretraining strategies were selected [3].
  • Task Design: The evaluation encompassed two gene-level tasks (tissue specificity and Gene Ontology term prediction) and four cell-level tasks (batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction) [3] [18].
  • Dataset Curation: Multiple high-quality datasets with manual annotations were utilized, varying in size and diversity and containing multiple sources of batch effects (inter-patient, inter-platform, and inter-tissue variations) [3].
  • Evaluation Framework: Performance was assessed using 12 metrics spanning unsupervised, supervised, and knowledge-based approaches, including the novel ontology-informed metrics [3].

To ensure rigorous validation, the study adopted a zero-shot protocol and introduced an independent, unbiased dataset—the Asian Immune Diversity Atlas (AIDA) v2 from CellxGene—to mitigate the risk of data leakage and validate conclusions [3].

Performance Across Biological Tasks

The benchmark results demonstrated that ontology-informed metrics provided crucial insights that traditional computational metrics missed, revealing significant differences in how scFMs capture biological relationships.

Table 1: Performance Ranking of Single-Cell Foundation Models Across Biological Tasks

| Model | Batch Integration | Cell Type Annotation | Cancer Cell Identification | Drug Sensitivity Prediction | Overall Biological Ranking |
|---|---|---|---|---|---|
| Geneformer | 2 | 3 | 1 | 2 | 2 |
| scGPT | 3 | 2 | 3 | 3 | 3 |
| UCE | 1 | 4 | 4 | 4 | 4 |
| scFoundation | 4 | 1 | 2 | 1 | 1 |
| Traditional ML | 5 | 5 | 5 | 5 | 6 |
| HVG Selection | 6 | 6 | 6 | 6 | 5 |

Note: Rankings based on comprehensive evaluation including ontology-informed metrics (1=best performance). Adapted from benchmark study [35].

The evaluation revealed several key findings. First, no single scFM consistently outperformed others across all tasks, emphasizing that model selection must be tailored to specific applications and data characteristics [3] [18]. Second, foundation models demonstrated remarkable robustness and versatility across diverse applications, while simpler machine learning models sometimes adapted more efficiently to specific datasets, particularly under resource constraints [3]. Third, the pretrained zero-shot scFM embeddings captured meaningful biological insights into the relational structure of genes and cells, which proved beneficial for downstream tasks [3].

When assessed with traditional metrics alone, some models appeared to perform adequately. However, the ontology-informed metrics revealed significant differences in biological relevance. For instance, the benchmark study found that "the pretrained zero-shot scFM embeddings indeed capture biological insights into the relational structure of genes and cells, which is beneficial to downstream tasks" [3]. Additionally, the study verified that "performance improvement arises from a smoother landscape, which reduces the difficulty of training task-specific models" [3].

Table 2: Metric Comparison Between Traditional and Novel Evaluation Approaches

| Evaluation Aspect | Traditional Metrics | Ontology-Informed Metrics | Key Advantages |
|---|---|---|---|
| Error Assessment | Treats all misclassifications equally (e.g., accuracy) | LCAD quantifies biological severity of errors | Reflects biological plausibility of errors |
| Relationship Capture | Measures cluster compactness and separation | scGraph-OntoRWR evaluates consistency with known biology | Grounds evaluation in established knowledge |
| Interpretability | Provides quantitative scores without biological context | Links model performance to biological hierarchies | Enables meaningful biological interpretation |
| Generalization | May overfit to technical aspects of data | Assesses fundamental biological understanding | More robust across diverse biological contexts |

Implementation Workflows

Integrated Evaluation Pipeline

Implementing a comprehensive evaluation strategy for scFMs requires integrating both traditional and biology-aware metrics. The following workflow illustrates the complete experimental pipeline for evaluating scFMs, from embedding extraction to biological interpretation.

[Diagram: scRNA-seq data → single-cell foundation model → cell embeddings → traditional metrics and ontology-informed metrics → biological interpretation.]

Figure 1: Comprehensive Workflow for Evaluating Single-Cell Foundation Models

scGraph-OntoRWR Implementation Workflow

The implementation of scGraph-OntoRWR requires careful integration of computational and biological resources. The following workflow details the specific steps for calculating this metric.

[Diagram: cell embeddings → construct cell similarity graph → random walk on the model graph; cell ontology → random walk on the ontology graph; the steady-state distributions of the two walks are compared to yield a biological consistency score.]

Figure 2: scGraph-OntoRWR Calculation Workflow

The Scientist's Toolkit

Implementing rigorous biological evaluation of scFMs requires both computational tools and biological knowledge resources. The following table outlines essential components of the evaluation toolkit.

Table 3: Essential Research Reagents and Resources for Biological Evaluation

| Reagent/Resource | Function | Biological Significance |
|---|---|---|
| Gene Embeddings | Numerical representations of genes in latent space | Capture functional similarities between genes based on co-expression patterns across diverse cellular contexts [35] |
| Cell Ontologies | Structured vocabularies defining cell types and relationships | Provide ground truth for evaluating biological relevance of model outputs [35] |
| Attention Mechanisms | Model components that identify important relationships between inputs | Reveal gene-gene interactions and regulatory relationships learned from data [3] |
| Benchmark Datasets | Curated single-cell data with high-quality annotations | Enable standardized evaluation and comparison of different modeling approaches [3] [18] |
| GO Term Annotations | Gene Ontology functional classifications | Serve as biological prior knowledge for validating gene embeddings [35] |
| Ligand-Receptor Pairs | Known pairs of interacting molecules | Enable construction of cell-cell communication networks for validation [36] |

The introduction of ontology-informed metrics like scGraph-OntoRWR and LCAD represents a significant advancement in the evaluation of single-cell foundation models. These metrics provide a crucial bridge between computational performance and biological relevance, enabling researchers to assess whether sophisticated models truly capture meaningful biological relationships rather than merely optimizing technical benchmarks.

The comprehensive benchmark studies reveal that while scFMs show remarkable promise in capturing biological insights, their performance varies significantly across tasks and datasets. No single model consistently outperforms all others, emphasizing the need for careful model selection based on specific research goals and biological contexts [3] [18]. The novel biological metrics provide essential guidance for this selection process, revealing dimensions of model performance that traditional metrics cannot capture.

As single-cell technologies continue to evolve, generating increasingly complex multimodal datasets, the role of biologically grounded evaluation metrics will become even more critical. These metrics not only enable more rigorous model assessment but also drive the development of more biologically aware algorithms—ultimately accelerating the translation of computational advances into genuine biological insights and therapeutic breakthroughs.

Single-cell foundation models (scFMs) represent a transformative approach in computational biology, leveraging large-scale deep learning pretrained on vast datasets to interpret single-cell genomics data through self-supervised learning [8]. These models have emerged as powerful tools for integrating heterogeneous datasets and exploring biological systems, inspired by the remarkable success of foundation models in natural language processing and computer vision [8] [3]. The fundamental premise of scFMs is that by exposing a model to millions of cells encompassing diverse tissues and conditions, the model can learn universal principles of cellular biology that generalize to new datasets and downstream tasks without task-specific training [8] [3].

Despite their theoretical promise, practical application reveals significant complexities. A comprehensive 2025 benchmark study demonstrates that no single scFM consistently outperforms others across all tasks, emphasizing the critical need for tailored model selection based on specific dataset characteristics and research objectives [3]. The intricate relationship between single-cell sequencing data and underlying biological insights has created three fundamental challenges for researchers: assessing the biological relevance of scFM embeddings, choosing between complex foundation models and simpler alternatives, and systematically selecting models for specific tasks and datasets [3]. This guide addresses these challenges by providing an evidence-based framework for strategic model selection, grounded in recent benchmarking studies and experimental data.

scFM Architectures: Technical Foundations and Implementation

Core Architectural Components and Pretraining Strategies

Single-cell foundation models employ varied architectural approaches to process high-dimensional, sparse single-cell data. Most scFMs utilize transformer architectures, which employ attention mechanisms to model relationships between genes within individual cells [8]. These architectures can be broadly categorized into encoder-based models like Geneformer and scBERT, decoder-based models like scGPT, and hybrid encoder-decoder designs such as scFoundation [8] [3]. A critical differentiator among models lies in their input representation strategies, which must address the fundamental challenge that gene expression data lacks natural sequential ordering unlike text or speech.

The table below summarizes the architectural characteristics of prominent scFMs:

Table 1: Architectural Components of Major Single-Cell Foundation Models

Model Name Omics Modalities Model Parameters # Input Genes Value Embedding Positional Embedding Architecture Pretraining Tasks
Geneformer scRNA-seq 40 M 2048 ranked genes Ordering Encoder MGM with CE loss
scGPT scRNA-seq, scATAC-seq, CITE-seq, spatial 50 M 1200 HVGs Value binning × Encoder with attention mask Iterative MGM with MSE loss
UCE scRNA-Seq 650 M 1024 non-unique genes / Encoder Modified MGM
scFoundation scRNA-Seq 100 M 19,264 genes Value projection × Asymmetric encoder-decoder Read-depth-aware MGM
LangCell scRNA-Seq 40 M 2048 ranked genes Ordering Encoder MGM with cell type labels

Tokenization strategies represent a crucial architectural decision, with different models employing gene ranking by expression levels, value binning, or genomic position ordering [8]. Similarly, positional embedding approaches vary significantly, with some models implementing sophisticated ranking systems while others omit positional information entirely [3]. These architectural decisions fundamentally influence how models capture biological relationships and prioritize genetic information.

Input Representation and Tokenization Strategies

The process of converting raw single-cell data into model-interpretable tokens represents a critical challenge in scFM design. As gene expression data lacks inherent sequential structure, researchers have developed various tokenization strategies to create artificial sequences for transformer processing [8]. Common approaches include ranking genes within each cell by expression levels, partitioning genes into expression value bins, or using normalized counts directly [8].

Geneformer employs a deterministic ranking system that orders genes by expression magnitude, creating a consistent input structure across cells [3]. In contrast, scGPT utilizes value binning combined with highly variable gene selection to focus computational resources on biologically informative features [3]. scFoundation takes a more comprehensive approach, incorporating nearly all protein-encoding genes without ranking or binning [3]. Each strategy carries distinct advantages: ranking prioritizes highly expressed genes, binning reduces noise from low expression values, and comprehensive inclusion preserves potentially meaningful biological signals from lowly expressed genes.

Special tokens represent another important aspect of input representation, with some models incorporating cell identity tokens, modality indicators, and batch information to provide biological context [8]. The embedding process typically combines gene identifier embeddings with expression value representations, creating rich input vectors that capture both identity and quantitative information [8].
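
The sketch below contrasts two simplified tokenization schemes inspired by these strategies: expression-rank ordering (in the spirit of Geneformer) and expression-value binning (in the spirit of scGPT). The toy gene vocabulary, bin count, and context length are illustrative assumptions and do not reproduce either model's actual tokenizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 1000
gene_ids = np.arange(n_genes)                              # integer gene vocabulary
expression = rng.poisson(0.5, size=n_genes).astype(float)  # one cell's raw counts

# Strategy 1 - ranking by expression (Geneformer-style, simplified):
# expressed genes sorted from highest to lowest expression form the input
# "sentence"; sequence position itself carries the expression signal.
expressed = np.nonzero(expression)[0]
order = expressed[np.argsort(-expression[expressed])]
rank_tokens = gene_ids[order][:2048]                       # truncate to context length

# Strategy 2 - value binning (scGPT-style, simplified):
# each expressed gene keeps its identity token plus a discretised value token.
n_bins = 51
nonzero = expression[expressed]
bin_edges = np.quantile(nonzero, np.linspace(0, 1, n_bins + 1))
value_bins = np.digitize(nonzero, bin_edges[1:-1])
bin_tokens = list(zip(gene_ids[expressed], value_bins))

print("ranked tokens:", rank_tokens[:5])
print("(gene, bin) tokens:", bin_tokens[:5])
```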

Workflow: raw single-cell data are tokenized (ranking by expression, value binning, normalized counts, or genomic position ordering) and combined with embedding components (gene identity, expression value, positional, and special tokens) before entering the transformer architecture, which outputs latent cell and gene embeddings.

Figure 1: scFM Input Processing and Tokenization Workflow - This diagram illustrates the transformation of raw single-cell data into model-ready tokens through various tokenization strategies and embedding components.

Benchmarking Framework: Experimental Design and Evaluation Metrics

Comprehensive Evaluation Protocol for Zero-Shot Embeddings

The benchmarking methodology for evaluating scFMs employs a rigorous, multi-faceted approach designed to assess model performance under realistic biological scenarios. This framework evaluates zero-shot gene and cell embeddings learned during pretraining without task-specific fine-tuning, providing insights into the fundamental biological knowledge captured by each model [3]. The evaluation encompasses two gene-level tasks (gene function prediction and gene-gene interaction) and four cell-level tasks (batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction) across diverse datasets with high-quality labels [3].

A critical innovation in recent benchmarking efforts is the introduction of biologically-grounded evaluation metrics that move beyond technical performance to assess biological relevance. The scGraph-OntoRWR metric measures the consistency between cell type relationships captured by scFMs and established biological knowledge in cell ontologies [3]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric evaluates the severity of cell type misclassification errors by measuring ontological proximity between predicted and actual cell types [3]. These metrics address the fundamental question of whether scFMs capture meaningful biological relationships rather than merely optimizing technical benchmarks.
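
The LCAD idea can be prototyped with a directed acyclic graph over ontology terms: for a misclassified cell, walk from the true and predicted labels up to their lowest common ancestor and sum the two path lengths. The toy ontology fragment below is invented for illustration; a real evaluation would use the Cell Ontology and the benchmark's own distance definition.

```python
import networkx as nx

# Toy ontology fragment (edges point from child term to parent term)
onto = nx.DiGraph([
    ("cd4_t_cell", "t_cell"), ("cd8_t_cell", "t_cell"),
    ("t_cell", "lymphocyte"), ("b_cell", "lymphocyte"),
    ("lymphocyte", "immune_cell"), ("monocyte", "immune_cell"),
])

def lca_distance(graph, true_label, predicted_label):
    """Sum of path lengths from the true and predicted terms up to their
    lowest common ancestor; 0 when the prediction is exactly right."""
    if true_label == predicted_label:
        return 0
    # networkx expects edges oriented parent -> child for LCA queries,
    # so reverse the child -> parent ontology before querying.
    lca = nx.lowest_common_ancestor(graph.reverse(copy=True),
                                    true_label, predicted_label)
    d_true = nx.shortest_path_length(graph, true_label, lca)
    d_pred = nx.shortest_path_length(graph, predicted_label, lca)
    return d_true + d_pred

# A biologically "close" mistake versus a more severe one
print(lca_distance(onto, "cd4_t_cell", "cd8_t_cell"))   # 2 (LCA: t_cell)
print(lca_distance(onto, "cd4_t_cell", "monocyte"))     # 4 (LCA: immune_cell)
```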

To ensure robust evaluation, the benchmarking protocol incorporates multiple performance metrics spanning unsupervised, supervised, and knowledge-based approaches [3]. This multi-dimensional assessment strategy prevents over-reliance on any single metric and provides a holistic view of model capabilities. The protocol also introduces the Roughness Index (ROGI) as a proxy for model selection, quantifying the smoothness of cell-property landscapes in latent embeddings, which correlates with downstream task performance [3].
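
As rough intuition for what a roughness measure captures, the sketch below computes a simple neighbourhood-smoothness proxy: how much a cell-level property varies among each cell's nearest neighbours in the embedding. This is not the published ROGI formula, only an illustrative stand-in; the toy data and neighbour count are assumptions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n_cells, dim = 500, 32

# Toy embedding and a toy continuous cell property (e.g., a pathway score)
embedding = rng.normal(size=(n_cells, dim))
cell_property = embedding[:, 0] + 0.3 * rng.normal(size=n_cells)

# Mean absolute property difference between each cell and its k nearest
# neighbours; larger values indicate a rougher (less smooth) landscape.
k = 15
nn = NearestNeighbors(n_neighbors=k + 1).fit(embedding)
_, idx = nn.kneighbors(embedding)              # idx[:, 0] is the cell itself
neighbour_props = cell_property[idx[:, 1:]]    # shape (n_cells, k)
roughness_proxy = np.mean(np.abs(neighbour_props - cell_property[:, None]))
print(f"Neighbourhood roughness proxy: {roughness_proxy:.3f}")
```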

Table 2: Essential Research Reagents and Computational Resources for scFM Evaluation

Resource Category Specific Examples Function/Purpose Implementation Notes
Benchmark Datasets AIDA v2, CELLxGENE Census, PanglaoDB, Human Cell Atlas Provide standardized, biologically diverse data for model training and evaluation Ensure dataset diversity across tissues, species, and experimental conditions
Evaluation Metrics scGraph-OntoRWR, LCAD, ARI, ASW, Cell-type F1 Quantify performance across technical and biological dimensions Combine multiple metrics for holistic assessment
Baseline Methods HVG selection, Seurat, Harmony, scVI Establish performance benchmarks for traditional approaches Essential for contextualizing scFM performance gains
Computational Infrastructure GPU clusters, High-memory nodes, Distributed training frameworks Enable model training and inference at scale scFMs require substantial computational resources (e.g., 40M-650M parameters)
Biological Knowledge Bases Cell Ontology, Gene Ontology, Protein-protein interactions Provide ground truth for biological relevance assessment Critical for ontology-informed metrics

Comparative Performance Analysis: Quantitative Results Across Tasks

Task-Specific Model Performance and Rankings

Comprehensive benchmarking across six scFMs and multiple baseline methods reveals distinct performance patterns across different biological tasks. The evaluation demonstrates that model performance is highly task-dependent, with different architectures excelling in specific applications [3]. For cell type annotation, encoder-based models like Geneformer and scBERT generally outperform alternatives, particularly for novel cell type identification [3]. Conversely, for batch integration tasks, models with explicit batch correction mechanisms such as scGPT achieve superior results in preserving biological variation while removing technical artifacts [3].

The table below summarizes relative performance rankings across major task categories:

Table 3: Task-Specific Performance Rankings of Single-Cell Foundation Models

Model Cell Type Annotation Batch Integration Cancer Cell Identification Drug Sensitivity Prediction Novel Cell Type Detection Overall Ranking
Geneformer 1 3 2 4 1 2
scGPT 2 1 3 2 3 1
scFoundation 4 2 1 3 4 3
UCE 3 4 4 1 2 4
LangCell 5 5 5 5 5 6
scCello 6 6 6 6 6 5

For clinically relevant tasks such as cancer cell identification and drug sensitivity prediction, the benchmarking results reveal intriguing patterns. scFoundation demonstrates particular strength in cancer cell identification across seven cancer types, likely due to its comprehensive gene coverage [3]. For drug sensitivity prediction across four therapeutic agents, UCE achieves superior performance, potentially attributable to its protein-based embeddings that capture functional relationships [3]. These findings underscore the importance of matching model architecture to specific application requirements rather than seeking a universal solution.

Performance Relative to Traditional Methods

A critical finding from recent benchmarking efforts is that scFMs do not universally outperform traditional methods in all scenarios. For specific tasks on smaller datasets, simpler approaches like highly variable gene selection combined with standard machine learning classifiers can achieve competitive performance with significantly reduced computational requirements [3]. However, as dataset complexity and size increase, scFMs demonstrate clear advantages in capturing biological relationships and generalizing to novel cell types and conditions [3].

The performance advantage of scFMs becomes most pronounced in challenging biological scenarios including cross-tissue integration, novel cell type identification, and rare cell population detection [3]. In these contexts, the biological knowledge encoded during large-scale pretraining enables zero-shot identification of relationships that would require extensive manual curation in traditional pipelines. This capability is particularly valuable for clinical applications where comprehensive annotation may be unavailable.

Decision overview: dataset size, cell type diversity, batch effects, task complexity, and resource constraints jointly guide model selection. Small datasets or limited resources favor simple baselines (HVG plus standard machine learning), medium-sized datasets and novel cell types favor encoder models (Geneformer), severe batch effects favor multi-modal models (scGPT), and large datasets with high task complexity favor comprehensive models (scFoundation).

Figure 2: scFM Selection Framework Based on Dataset Characteristics and Task Requirements - This decision diagram illustrates how different dataset factors and task requirements should guide model selection.

Practical Implementation Guidelines: From Theory to Application

Strategic Model Selection Protocol

Implementing an effective scFM selection strategy requires systematic assessment of both dataset characteristics and research objectives. Based on comprehensive benchmarking results, we propose a four-stage protocol for optimal model selection:

  • Dataset Characterization: Quantify key dataset properties including cell count, gene coverage, cell type diversity, batch complexity, and presence of rare populations. Calculate the Roughness Index (ROGI) for candidate embeddings as an indicator of latent space quality [3].
  • Task Requirement Analysis: Classify primary tasks along dimensions of complexity, biological interpretability needs, and performance requirements. Distinguish between standard operations (cell type annotation) and challenging scenarios (novel cell type discovery) [3].
  • Resource Assessment: Evaluate available computational resources, including GPU memory, training time constraints, and inference latency requirements. Smaller models like Geneformer (40M parameters) offer practical advantages in resource-constrained environments [3].
  • Iterative Validation: Implement a validation pipeline with biological ground truth, using ontology-informed metrics like scGraph-OntoRWR to verify biological relevance beyond technical performance [3].

For projects with specific requirements, targeted recommendations emerge from benchmarking data. Cancer studies with focus on tumor microenvironment characterization benefit from scFoundation's comprehensive gene coverage [3]. Cross-species integration tasks achieve better performance with protein-based embedding approaches like UCE [3]. For multi-omic integration, scGPT's modality-aware architecture provides distinct advantages [8] [3].

The field of single-cell foundation models continues to evolve rapidly, with several promising directions emerging. Multi-modal integration represents a frontier area, with models increasingly incorporating ATAC-seq, proteomics, and spatial data to create more comprehensive cellular representations [8]. Interpretability enhancements through attention mechanism analysis provide pathways for biological insight discovery directly from model weights [8] [3]. Scalability improvements via efficient transformer architectures address computational barriers for widespread adoption [8].

For research groups implementing scFMs, we recommend establishing modular pipelines that facilitate model comparison and integration. The rapid pace of development necessitates flexible frameworks that can incorporate new architectures as they emerge. Additionally, investment in biological validation workflows remains essential—technical performance metrics must be complemented with experimental verification to ensure biological relevance and clinical utility.

The benchmarking evidence clearly indicates that the scFM landscape will remain diverse rather than converging on a single dominant architecture. This diversity reflects the multifaceted nature of biological questions and the varied characteristics of single-cell datasets. By adopting a strategic, task-driven approach to model selection, researchers can maximize insights from their single-cell genomics investments while positioning themselves to leverage continued advancements in foundation model technologies.

Single-cell foundation models (scFMs) represent a paradigm shift in the analysis of single-cell RNA sequencing (scRNA-seq) data, promising to unlock a universal understanding of cellular biology. These models, pre-trained on vast datasets comprising millions of cells, are designed to generate meaningful cell and gene embeddings that can be adapted to a wide array of downstream tasks with little to no additional training, an approach known as zero-shot application [9] [18]. The efficacy of these embeddings is paramount, particularly in discovery-driven research where predefined labels are unavailable, making fine-tuning infeasible [1]. The quality of these zero-shot representations is not accidental; it is meticulously engineered through three primary optimization levers: the model's architectural size, the scale and diversity of the pre-training corpus, and the strategies used for prompting or tokenizing the input data [3] [18]. This guide objectively compares the performance of current scFMs by synthesizing recent benchmarking studies, providing researchers and drug development professionals with a data-driven framework for model selection based on these critical levers.

Experimental Protocols for Benchmarking scFMs

To ensure a fair and rigorous comparison of scFMs, benchmarking studies have adopted comprehensive evaluation protocols. These protocols are designed to assess the quality of the latent representations produced by the models in a zero-shot setting, meaning the pre-trained models are used without any further task-specific fine-tuning [1] [18].

The general workflow involves extracting cell or gene embeddings from each model and using these embeddings to perform a variety of downstream tasks. Model performance is then quantified using a suite of metrics that capture both technical and biological validity. Common tasks include:

  • Cell-level tasks: These assess the model's ability to generate embeddings that meaningfully represent cellular states. Key tasks include:
    • Cell Type Annotation/Clustering: Evaluating how well the embeddings separate known cell types, measured by metrics like the Average BIO (AvgBIO) score and Average Silhouette Width (ASW) [1] [18]; a minimal computation sketch follows this list.
    • Batch Integration: Assessing the model's capacity to remove technical batch effects while preserving biological variation, using metrics such as batch mixing scores and principal component regression (PCR) [1] [18].
  • Gene-level tasks: These evaluate the biological relevance of the learned gene embeddings, for instance, by measuring their ability to predict Gene Ontology (GO) terms or tissue specificity [18].
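
The sketch below shows how such clustering metrics can be computed from zero-shot cell embeddings. It uses k-means as a stand-in for the graph-based (e.g., Leiden) clustering typically used in practice and scikit-learn's silhouette and agreement metrics; the toy data and cluster count are assumptions, and the cited studies may aggregate an AvgBIO-style score differently.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(3)
n_cells, dim, n_types = 600, 64, 5

# Toy zero-shot embeddings with 5 ground-truth cell types
true_labels = rng.integers(0, n_types, size=n_cells)
centers = rng.normal(scale=3.0, size=(n_types, dim))
embeddings = centers[true_labels] + rng.normal(size=(n_cells, dim))

# Cluster the embeddings (k-means as a simple stand-in for Leiden clustering)
pred_labels = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit_predict(embeddings)

# Biological-conservation metrics
asw = (silhouette_score(embeddings, true_labels) + 1) / 2   # rescaled to [0, 1]
ari = adjusted_rand_score(true_labels, pred_labels)
nmi = normalized_mutual_info_score(true_labels, pred_labels)
avg_bio_like = np.mean([asw, ari, nmi])
print(f"ASW={asw:.3f}  ARI={ari:.3f}  NMI={nmi:.3f}  AvgBIO-like={avg_bio_like:.3f}")
```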

A significant advancement in recent benchmarks is the introduction of biology-driven metrics. These include novel tools like scGraph-OntoRWR, which measures the consistency of cell-type relationships in the embedding space with established biological knowledge from cell ontologies, and the Lowest Common Ancestor Distance (LCAD), which evaluates the severity of cell type misannotation by considering their ontological proximity [3] [18]. These metrics provide a crucial link between computational outputs and biological plausibility.

The following diagram illustrates the typical workflow of a comprehensive scFM benchmarking study.

Workflow: a diverse pre-training corpus of scRNA-seq data is tokenized according to the model's prompting strategy and used to pretrain a transformer-based scFM; zero-shot embeddings are then extracted and evaluated on cell-level tasks (clustering, batch integration) and gene-level tasks (GO term prediction), scored with technical metrics (ASW, PCR, AvgBIO) and biological metrics (scGraph-OntoRWR, LCAD).

The Impact of Model Size and Architecture

Model size, often measured by the number of parameters, is a foundational lever for performance. Larger models theoretically possess a greater capacity to learn complex patterns from the pre-training data. However, benchmarking reveals a nuanced reality where no single model dominates, and architectural choices are equally critical [3] [18].

The table below summarizes key architectural features of several prominent scFMs.

Table 1: Architectural Overview of Selected Single-Cell Foundation Models

Model Name Model Parameters Pre-training Dataset Size Key Architectural & Tokenization Features
Geneformer [3] 40 Million 30 million cells Uses a ranked list of 2048 genes; includes positional embeddings.
scGPT [3] 50 Million 33 million cells Uses 1200 Highly Variable Genes (HVGs); employs value binning.
scFoundation [3] 100 Million 50 million cells Uses a vast gene vocabulary (~19k); asymmetric encoder-decoder.
UCE [3] 650 Million 36 million cells Incorporates protein sequence embeddings (from ESM-2).

A holistic ranking that aggregates performance across multiple tasks and metrics shows that while larger models like scFoundation can be powerful, they do not consistently outperform smaller models like Geneformer or scGPT in every scenario [18]. The optimal model is highly task-dependent. For instance, some models excel at gene-level functional prediction, while others are more robust for clinical tasks like drug sensitivity prediction [18]. This indicates that the quality of the architecture and the pre-training objective is as important as raw parameter count. A model with a well-designed tokenization strategy and pre-training task can often outperform a larger, less optimally configured model.
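
As an illustration of how such a holistic ranking can be assembled, the short sketch below averages per-task ranks across models with pandas. The scores are invented placeholders, not values from the cited benchmarks.

```python
import pandas as pd

# Placeholder per-task scores (higher is better); not real benchmark values
scores = pd.DataFrame(
    {
        "cell_type_annotation": [0.71, 0.74, 0.69, 0.66],
        "batch_integration":    [0.58, 0.63, 0.61, 0.55],
        "gene_function":        [0.80, 0.77, 0.79, 0.82],
        "drug_sensitivity":     [0.52, 0.55, 0.57, 0.60],
    },
    index=["Geneformer", "scGPT", "scFoundation", "UCE"],
)

# Rank models within each task (1 = best), then average ranks across tasks
per_task_ranks = scores.rank(ascending=False, axis=0)
holistic_rank = per_task_ranks.mean(axis=1).sort_values()
print(holistic_rank)
```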

The Critical Role of the Pre-training Corpus

The scale and diversity of the pre-training corpus are arguably as critical as the model architecture itself. The corpus serves as the model's "textbook," from which it learns the fundamental language of biology. Benchmarking studies have systematically investigated this by comparing variants of the same model pre-trained on datasets of different sizes and tissue compositions [1] [18].

Evidence indicates that increasing corpus size and diversity generally improves zero-shot performance, but with diminishing returns. For example, one study evaluated scGPT models pre-trained on a kidney-specific dataset (~814,000 cells), a blood and bone marrow dataset (~10.3 million cells), and a diverse human cell atlas (~33 million cells) [1]. The results showed that the blood and human models significantly outperformed the kidney-specific model, even on non-blood tissues, demonstrating the value of diversity [1]. Surprisingly, the more diverse human atlas model sometimes slightly underperformed compared to the blood-specific one, suggesting that beyond a certain point, simply adding more data may not confer additional benefits if not matched with an appropriate model capacity or pre-training objective [1].

The relationship between corpus composition and performance is also task-specific. Models pre-trained on broad atlases are generally more versatile. However, for specialized tasks focused on a specific tissue type, such as immune cell characterization, a model pre-trained on a large, relevant corpus (like blood and bone marrow) can be highly competitive [1]. This underscores the importance of aligning the pre-training corpus with the intended application.

Table 2: Impact of Pre-training Corpus on Zero-Shot Performance

Pre-training Corpus Scenario Impact on Model Performance Practical Implication for Selection
Small, Tissue-Specific Corpus Limited generalizability; poor performance on unseen cell types or tissues. Suitable only for tasks confined to that specific tissue.
Large, Diverse Corpus Robust and versatile performance across a wide range of tasks and tissues [18]. Recommended for general-purpose use and exploratory research.
Mismatched Corpus & Task Suboptimal performance; model may lack relevant biological context. Select a model pre-trained on data biologically relevant to your task.

Prompting and Tokenization Strategies

In natural language processing, "prompting" guides a model's behavior through carefully crafted input. In scFMs, the analogous process is tokenization—the method of converting a cell's gene expression profile into a sequence of discrete tokens that the transformer model can process [9]. This is a critical lever because, unlike words in a sentence, genes have no natural sequential order.

The chosen tokenization strategy directly impacts how the model perceives biological data. Common approaches include:

  • Ranking by Expression: Genes are ordered from highest to lowest expression within each cell, creating a deterministic "sentence" (used by Geneformer and LangCell) [3] [9].
  • Using Highly Variable Genes (HVGs): Selecting a subset of genes with the highest variance across the dataset, often ordered by expression magnitude (used by scGPT) [3].
  • Genomic Position Ordering: Ordering genes based on their physical location in the genome (explored by UCE) [3].
  • Incorporating Value and Positional Embeddings: Beyond the gene identity, models add information about expression levels (value embeddings) and the gene's position in the input sequence (positional embeddings) [3].

While a standardized benchmark comparing all tokenization strategies is not available, model-specific reports offer insights. For instance, some models find that complex ranking strategies offer no clear advantage over simpler approaches, while others are built upon the hypothesis that expression-based ranking is fundamental [9]. The integration of additional biological context, such as protein sequences or gene ontology information during tokenization (as in UCE), is an emerging technique to enrich the input and potentially lead to more biologically-grounded embeddings [3].

Comparative Performance Analysis

When placed in direct competition with established, simpler baseline methods, the zero-shot performance of scFMs is mixed. A core finding across multiple independent benchmarks is that no single scFM consistently outperforms all others across every task [3] [1] [18]. Furthermore, in many cases, simpler methods can be surprisingly strong competitors.

Table 3: Comparative Performance of scFMs vs. Baseline Methods on Common Tasks

Task Top Performing Methods Key Finding Supporting Data
Cell Type Clustering HVG, scVI, Harmony scFMs (Geneformer, scGPT) often underperform simpler baselines like HVG selection in separating known cell types. HVG outperformed Geneformer and scGPT on all metrics in one study [1].
Batch Integration HVG, scVI, Harmony scFMs struggle to correct for batch effects between different experimental techniques zero-shot. On the Pancreas dataset, Geneformer and scGPT failed to integrate across techniques, while scVI and Harmony succeeded [1].
Biologically-Relevant Embeddings scFMs (e.g., scGPT, Geneformer) scFM gene embeddings show promise in capturing functional gene relationships, as measured by ontology-based metrics. scFMs demonstrated an ability to learn biologically meaningful gene embeddings that align with GO terms [18].

This data suggests that the decision to use a scFM should be guided by the specific problem. For standard tasks like batch correction on well-studied tissues, traditional methods may be more reliable and computationally efficient. However, for tasks requiring deep biological insight, such as predicting gene functions or analyzing complex clinical outcomes, scFMs offer a unique advantage, especially when their embeddings are evaluated with biology-aware metrics [18].

The Scientist's Toolkit: Research Reagent Solutions

The following table details key computational tools and resources essential for working with and evaluating single-cell foundation models.

Table 4: Essential Research Reagents for scFM Evaluation

Tool / Resource Type Function in scFM Research
CELLxGENE [1] [18] Data Platform Provides access to standardized, annotated single-cell datasets essential for pre-training and independent benchmark validation.
Harmony [1] [18] Software Algorithm A strong baseline method for batch integration; used to benchmark the integration performance of scFMs.
scVI [1] [18] Software Algorithm A generative deep learning model for single-cell data; another strong baseline for clustering and batch correction tasks.
HVG Selection [1] Statistical Method The simple approach of selecting Highly Variable Genes; a surprisingly strong and computationally efficient baseline for many tasks.
scGraph-OntoRWR [3] [18] Evaluation Metric A novel ontology-based metric that quantifies the biological consistency of cell-type relationships in an embedding space.
Lowest Common Ancestor Distance (LCAD) [3] [18] Evaluation Metric A metric for cell annotation that measures the ontological distance of misclassifications, providing a biologically informed error score.

The optimization of zero-shot scFM embedding quality is a multi-faceted challenge governed by the interplay of model size, pre-training corpus, and tokenization strategies. Current evidence indicates that there is no one-size-fits-all solution. Researchers must make strategic choices based on their specific task, computational resources, and the need for biological interpretability. While larger models trained on diverse data generally perform better, the law of diminishing returns applies, and simpler models or even traditional methods can be superior for specific, well-defined tasks. The future of scFM development lies not merely in scaling up, but in smarter architectural design, more biologically meaningful pre-training objectives, and the development of standardized, biologically-grounded evaluation protocols. By critically understanding these optimization levers, scientists and drug developers can better harness the power of scFMs to advance cellular biology and therapeutic discovery.

Rigorous Benchmarking and Comparative Analysis of scFM Performance

Single-cell Foundation Models (scFMs) represent a significant advancement in the analysis of single-cell RNA sequencing (scRNA-seq) data, promising to learn universal biological knowledge from massive datasets. However, rigorous benchmarking, particularly in zero-shot settings where models are used without any further training, reveals a more nuanced reality. Current evidence indicates that while scFMs are robust and versatile tools, their zero-shot embeddings do not consistently outperform those generated by established, simpler methods like Highly Variable Genes (HVG) selection, scVI, and Harmony across key tasks such as cell type clustering and batch integration [3] [1]. This guide provides a structured comparison of these methods, summarizing quantitative performance data and detailing the experimental protocols used for evaluation, to help researchers make informed choices in their single-cell analysis workflows.

Performance Benchmarking Tables

The following tables consolidate quantitative findings from recent large-scale benchmarking studies, which evaluated methods on their ability to remove batch effects while preserving meaningful biological variation.

Table 1: Performance Ranking in Cell Type Clustering and Batch Integration

This table summarizes the performance rankings of different methods across common single-cell analysis tasks, based on metrics like Average BIO (AvgBio) score and batch integration metrics [1] [37].

Method Type Cell Type Clustering (AvgBio Score) Batch Integration (Pancreas Dataset) Overall Ranking (Complex Tasks)
HVG Selection Baseline Outperforms Geneformer & scGPT [1] Best scores across datasets [1] Varies
Harmony Traditional Better than scGPT on most datasets [1] Succeeds in qualitative/quantitative assessment [1] Excels on simpler tasks [37]
scVI Traditional (Deep Learning) Better than scGPT on most datasets [1] Succeeds in qualitative/quantitative assessment [1] Excels on complex tasks [37]
scGPT scFM Inconsistent; outperformed by baselines [1] Mixed; primary structure driven by batch effects [1] Robust across tasks but inconsistent zero-shot [1] [6]
Geneformer scFM Performs worse than HVG, Harmony, scVI [1] Fails to retain cell type info; ranks last [1] Underperforms in zero-shot [1]

Table 2: Key Evaluation Metrics for Batch Effect Removal and Biological Conservation

This table defines the key metrics used to evaluate method performance in the benchmarks, explaining what they measure and their ideal outcomes [37].

Metric Category Metric Name Description Ideal Value
Batch Effect Removal kBET (k-nearest-neighbor batch effect test) Measures batch mixing within local neighborhoods. Higher (0-1 scale)
ASW (Average Silhouette Width) Batch Measures how similar cells are to their batch versus others. Higher (0-1 scale)
Graph iLISI (Integration Local Inverse Simpson's Index) Measures diversity of batches in a cell's neighborhood. Higher
Biological Conservation ASW (Average Silhouette Width) Cell Type Measures how similar cells are to their cell type versus others. Higher (0-1 scale)
ARI (Adjusted Rand Index) Measures similarity between two clusterings (e.g., before/after integration). Higher (0-1 scale)
Isolated Label Score Assesses conservation of rare cell type annotations. Higher
Trajectory Conservation Evaluates preservation of developmental trajectories. Higher

Detailed Experimental Protocols

To ensure the reproducibility and proper contextualization of the benchmark results presented above, this section details the core experimental methodologies employed.

Benchmarking Pipeline for Zero-Shot Evaluation

The evaluation of zero-shot performance follows a structured pipeline to ensure a fair and rigorous comparison between scFMs and baseline methods [3] [1].

Pipeline: the input dataset undergoes data preprocessing (quality control, filtering); baseline methods (HVG selection, Seurat anchors, Harmony, scVI) and scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, scCello) each produce embeddings, which are applied to cell-level tasks (cell type annotation, batch integration, cancer cell identification, drug sensitivity) and gene-level tasks (gene function prediction, gene-gene relationships); performance is assessed with 12+ metrics and aggregated into a holistic ranking that yields model selection guidance.

Data Preprocessing and Curation

Benchmarking studies assemble large and diverse datasets from public resources like the CELLxGENE Discover census [3] [8]. These datasets encompass a wide range of tissues, conditions, and sequencing technologies to test model generalizability. A critical step is standardized preprocessing, which includes:

  • Quality Control: Filtering out low-quality cells and genes based on metrics like mitochondrial read percentage and total detected genes [38].
  • Gene Name Standardization: Converting gene identifiers to a standard nomenclature (e.g., according to HUGO Gene Nomenclature Committee guidelines) [38].
  • Format Unification: Converting all data to a consistent sparse matrix format for downstream analysis [38].
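
A minimal Scanpy sketch of these preprocessing steps is shown below; the thresholds, the mitochondrial-gene prefix, and the simplified gene-name handling are illustrative defaults rather than values prescribed by the cited benchmarks.

```python
import scanpy as sc
from scipy import sparse

adata = sc.datasets.pbmc3k()                      # small public example dataset

# Quality control: flag mitochondrial genes and filter low-quality cells/genes
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)
adata = adata[adata.obs["pct_counts_mt"] < 20].copy()
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)

# Gene name standardization: here simply enforce unique, upper-case symbols;
# full HGNC harmonisation would use a dedicated mapping resource.
adata.var_names = adata.var_names.str.upper()
adata.var_names_make_unique()

# Format unification: store counts as a sparse matrix
if not sparse.issparse(adata.X):
    adata.X = sparse.csr_matrix(adata.X)
```
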
Feature Extraction and Downstream Tasks

In the zero-shot protocol, embeddings from scFMs are generated without any task-specific fine-tuning. These are directly compared to the outputs of baseline methods [1].

  • scFMs: Pretrained models like Geneformer and scGPT take the preprocessed data and output cell (or gene) embeddings using their internal, fixed parameters [3] [1].
  • Baselines: Methods like HVG selection, Harmony, and scVI generate their own lower-dimensional representations or corrected feature spaces from the input data [1] [37] (a baseline pipeline sketch follows this list).

Both sets of representations are then evaluated on a suite of downstream tasks, primarily:

  • Cell-level tasks: Cell type annotation (clustering), batch integration, and clinically relevant tasks like cancer cell identification [3].
  • Gene-level tasks: Gene function prediction and inference of gene-gene relationships [3] [38].
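
For the baseline arm, a zero-shot-comparable representation can be produced with a standard Scanpy pipeline, as in the sketch below; the normalisation target, HVG count, and PCA dimensionality are common defaults chosen for illustration, not the exact settings of the cited studies.

```python
import scanpy as sc

adata = sc.datasets.pbmc3k()                     # stand-in for a benchmark dataset

# Standard HVG + PCA baseline representation
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000, flavor="seurat")
adata = adata[:, adata.var["highly_variable"]].copy()
sc.pp.scale(adata, max_value=10)
sc.tl.pca(adata, n_comps=50)

baseline_embedding = adata.obsm["X_pca"]         # (n_cells, 50) baseline "embedding"

# An scFM's zero-shot cell embeddings would be stored alongside it, for example
# in adata.obsm["X_scfm"], and both matrices scored with the same metrics.
```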

Novel Evaluation Metrics for Biological Insight

Beyond standard metrics, recent benchmarks have introduced novel, biology-informed metrics that evaluate the embeddings more deeply.

  • scGraph-OntoRWR: This metric evaluates whether the relationships between cell types captured in the model's embedding space are consistent with established biological knowledge from cell ontologies [3].
  • Lowest Common Ancestor Distance (LCAD): When a model misclassifies a cell type, this metric assesses the "severity" of the error by measuring the ontological proximity between the true and predicted cell type. A smaller distance indicates a biologically plausible mistake [3].
  • Roughness Index (ROGI): This index acts as a proxy for model performance by measuring the smoothness of the cell-property landscape in the latent space. A smoother landscape generally makes it easier to train accurate task-specific models [3].

The Scientist's Toolkit: Essential Research Reagents

This section outlines the key computational tools and resources that are foundational for conducting benchmarking studies and single-cell analyses.

Tool/Resource Name Category Primary Function Key Feature
CELLxGENE Discover [1] [8] Data Resource Provides unified access to millions of curated, annotated single-cell datasets. Serves as a critical source of diverse, high-quality training and benchmarking data.
Scanpy [39] Analysis Toolkit A Python-based framework for comprehensive single-cell data analysis. Offers scalable workflows for preprocessing, clustering, and visualization of millions of cells.
Seurat [39] [37] Analysis Toolkit An R-based toolkit for single-cell genomics. Known for its versatile data integration methods and support for multi-modal data.
scvi-tools [39] [37] Probabilistic Modeling Uses deep generative models (VAEs) for single-cell data analysis. Provides superior batch correction and imputation; excels on complex integration tasks.
BioLLM [6] Evaluation Framework A unified framework for integrating and benchmarking scFMs. Standardizes APIs and evaluation protocols, enabling fair model comparison.

Conceptual Framework for Model Selection

The choice between a complex scFM and a simpler baseline is not straightforward. The following diagram and summary outline the key decision factors identified through benchmarking.

Decision overview: task nature and label availability, dataset size and computational resources, the need for biological interpretability, and task complexity and data structure jointly determine the recommended application scope, pointing either to traditional baselines (HVG, Harmony, scVI), to scFMs with fine-tuning, or to scFMs used zero-shot or fine-tuned.

  • Task Nature and Label Availability: For exploratory analysis where cell type labels are unknown, making fine-tuning impossible, the zero-shot performance of scFMs is critical. Current evidence suggests that in this setting, traditional baselines may be more reliable [1]. If labels are available for fine-tuning, scFMs can show stronger performance [3].
  • Dataset Size and Computational Resources: Simpler machine learning models are often more efficient and can adapt more readily to specific, smaller datasets, especially under resource constraints [3]. scFMs, with their large number of parameters, require significant computational resources for both training and inference.
  • Need for Biological Interpretability: scFMs offer unique advantages for biological discovery. They can provide insights into gene-gene relationships and functional networks [38]. The novel metrics like scGraph-OntoRWR and LCAD also allow for a more biologically grounded interpretation of model outputs and errors [3].
  • Task Complexity and Data Structure: For standard tasks like batch correction on datasets with distinct batch structures, simpler methods may suffice. However, on more complex integration tasks—such as those involving nested batch effects or multiple technologies—deep learning models like scVI and scANVI have been shown to excel [37]. scFMs also hold promise for complex gene-level tasks like drug sensitivity and perturbation prediction [3].

In conclusion, rigorous benchmarking establishes that while single-cell Foundation Models are powerful and versatile tools, they are not a universal substitute for established baseline methods. The "pre-train then fine-tune" paradigm shows promise, but the zero-shot embeddings of current scFMs can be inconsistent and may be outperformed by simpler, more efficient approaches like HVG selection, scVI, and Harmony for core tasks like cell type clustering and batch integration [3] [1].

The field is rapidly advancing, with new, larger models like CellFM (trained on 100 million human cells) continuing to emerge [38]. The key to progress lies in the development of standardized benchmarking frameworks like BioLLM [6] and the adoption of more biologically meaningful evaluation metrics [3]. For researchers today, the optimal strategy is a pragmatic one: let the specific biological question, dataset characteristics, and available resources guide the choice of tool, leveraging the strengths of both traditional baselines and the increasingly sophisticated generation of foundation models.

The application of single-cell foundation models (scFMs) in clinical research represents a paradigm shift, moving beyond exploratory biology to direct impact on disease understanding and treatment strategies. These models, pretrained on millions of single-cell transcriptomes, promise to capture universal biological principles that can be adapted to various downstream tasks. However, their true utility in clinical contexts, particularly for critical applications like cancer cell identification and drug sensitivity prediction, must be rigorously validated. This evaluation is especially pertinent in zero-shot settings, where models are applied without task-specific fine-tuning, as this scenario mirrors real-world clinical challenges where labeled data for every new cancer type or drug is unavailable. This guide provides a structured, evidence-based comparison of leading scFMs against established traditional methods, offering researchers a clear framework for model selection in clinically relevant research.

Performance Comparison: scFMs vs. Baselines on Clinical Tasks

The performance of computational models on clinically oriented tasks is the ultimate measure of their utility. The following tables synthesize quantitative benchmarks from recent large-scale studies, comparing scFMs against traditional machine learning baselines. These evaluations focus on two pillars of clinical computational biology: accurately identifying cancer cells and predicting their response to therapeutic agents.

Table 1: Performance on Cancer Cell Identification (Cell-Level Task)

Model Category Specific Model Key Metric: AvgBIO Score (Higher is Better) Performance Summary & Key Strengths
Single-Cell Foundation Model (scFM) scGPT ~0.4 - 0.6 (varies by dataset) Robust and versatile; strong on datasets with biological batch effects (e.g., Tabula Sapiens) [3] [1].
Geneformer ~0.3 - 0.5 (varies by dataset) Struggles with batch effect correction; clustering often driven by technical variation [1].
Traditional Baseline HVGs (Highly Variable Genes) Consistently high (~0.7) Surprisingly powerful baseline; often outperforms scFMs in zero-shot cell type clustering [1].
scVI ~0.5 - 0.7 Excellent at correcting for technical batch effects (e.g., in Pancreas dataset) [1].
Harmony ~0.5 - 0.7 Strong performance, particularly in preserving biological variation after integration [3] [1].

Table 2: Performance on Drug Sensitivity Prediction (Gene-Level & Clinical Task)

Model / Approach Key Metric / Performance Context & Interpretability
scFMs in Zero-Shot Not consistently superior to baselines [3] Embeddings capture biological insights, but simpler models can adapt more efficiently to specific datasets [3].
XGBoost (Traditional ML) Pearson ρ = 0.88-0.89 on GDSC dataset [40] High performance for drug-specific models using gene expression; offers strong interpretability via SHAP [40].
CellHit Pipeline (XGBoost + Alignment) Accurately predicted best drugs for TCGA patients [40] Model interpretations converged with known drug Mechanisms of Action (MOA), validating biological relevance [40].
metaDRP (Few-Shot Learning) Mitigates out-of-distribution issues; accurate in low-sample settings [41] Incorporates prior knowledge via graph neural networks; provides interpretable insights into drug MOAs [41].

Experimental Protocols for Clinical Benchmarking

The comparative data presented above are derived from rigorous, standardized benchmarking studies. Understanding the underlying experimental methodologies is crucial for interpreting the results and designing future validation experiments.

Benchmarking Framework for scFMs

A comprehensive benchmark study evaluated six scFMs against established baselines under realistic conditions, encompassing both gene-level and cell-level tasks [3]. The protocol can be summarized as follows:

  • Task Selection: Pre-clinical batch integration and cell type annotation were evaluated across five datasets with diverse biological conditions. Clinically relevant tasks, including cancer cell identification and drug sensitivity prediction, were assessed across seven cancer types and four drugs [3].
  • Model Evaluation: Performance was quantified using 12 metrics spanning unsupervised, supervised, and knowledge-based approaches. A novel metric, scGraph-OntoRWR, was introduced to uncover the intrinsic biological knowledge encoded by scFMs by measuring the consistency of cell-type relationships captured by the models with prior biological knowledge from cell ontologies [3].
  • Zero-Shot Protocol: The evaluation focused on the zero-shot capabilities of scFMs, leveraging the embeddings generated by the models without any further task-specific fine-tuning. This provides a clear measure of the general biological knowledge acquired during pretraining [3] [1].

Interpretable Drug Sensitivity Modeling

The "CellHit" pipeline, which demonstrated high efficacy in predicting patient-specific drug responses, employed the following methodology [40]:

  • Data Processing: RNA-seq data from cancer cell lines (e.g., from GDSC) were processed and aligned to patient tumor data (e.g., from TCGA) using tools like Celligner. This step is critical for translating models trained on cell lines to clinical patient samples [40].
  • Model Training: For each drug, a separate model (e.g., XGBoost) was trained using cancer cell line transcriptomics data to predict IC50 values (a measure of drug sensitivity).
  • Biological Interpretation: The model's interpretability was achieved through two main approaches:
    • Feature Importance: Using game theory approaches (SHAP) and permutation importance to identify genes crucial for prediction [40].
    • MOA Pathway Validation: Leveraging Large Language Models (LLMs) to systematically curate associations between drugs and biological pathways (e.g., from Reactome). The genes identified as important by the model were then tested for enrichment in these known drug Mechanism of Action (MOA) pathways, validating that the model learns relevant biology [40].

Diagram 1: Drug sensitivity modeling workflow.
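
The drug-specific modelling and interpretation steps described above can be prototyped with XGBoost and SHAP, as in the sketch below on synthetic data; the feature count, hyperparameters, and the synthetic "IC50" target are placeholders, and the CellHit pipeline's actual configuration (including Celligner alignment) is not reproduced here.

```python
import numpy as np
import shap
from scipy.stats import pearsonr
from sklearn.model_selection import train_test_split
from xgboost import XGBRegressor

rng = np.random.default_rng(4)
n_lines, n_genes = 400, 300

# Synthetic stand-in for cell-line expression and log(IC50) for one drug
X = rng.normal(size=(n_lines, n_genes))
true_weights = np.zeros(n_genes)
true_weights[:10] = rng.normal(size=10)
y = X @ true_weights + 0.5 * rng.normal(size=n_lines)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One model per drug, trained on cell-line transcriptomics
model = XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)
model.fit(X_train, y_train)
rho, _ = pearsonr(model.predict(X_test), y_test)
print(f"Test Pearson correlation: {rho:.2f}")

# SHAP values highlight the genes driving each prediction; in a real pipeline
# these would be tested for enrichment in the drug's known MOA pathways.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
top_genes = np.argsort(np.abs(shap_values).mean(axis=0))[::-1][:10]
print("Top contributing gene indices:", top_genes)
```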

The Scientist's Toolkit: Essential Research Reagents & Frameworks

Successfully implementing and evaluating scFMs for clinical tasks requires a suite of computational tools, frameworks, and data resources.

Table 3: Key Resources for scFM and Clinical Validation Research

Resource Name Type Primary Function in Research
BioLLM Framework [42] Software Framework Provides a unified interface for diverse scFMs, standardizing APIs and evaluation protocols for consistent benchmarking.
CELLxGENE / CellxGene [3] [1] [8] Data Platform A curated corpus of millions of single-cell datasets, serving as a primary source for model pretraining and validation.
GDSC (Genomics of Drug Sensitivity in Cancer) [41] [40] Database A foundational resource containing drug sensitivity screens for hundreds of cancer cell lines, used for training drug response models.
TCGA (The Cancer Genome Atlas) [40] [43] Database Provides bulk RNA-seq and clinical data from patient tumors, used for validating the translatability of models trained on cell lines.
SHAP (SHapley Additive exPlanations) [40] [43] Analysis Tool A game theory-based method to interpret model predictions, identifying which input features (genes) drove a specific output.
Harmony & scVI [3] [1] Software Tool Established, high-performing baseline methods for key single-cell tasks like batch integration and data visualization.

The benchmarking data reveals a nuanced landscape for single-cell foundation models in clinical applications. While scFMs like scGPT demonstrate robustness and versatility, their zero-shot performance does not consistently surpass that of simpler, well-established methods like HVG selection, scVI, or Harmony on tasks such as cancer cell identification [3] [1]. For drug sensitivity prediction, traditional machine learning models like XGBoost can achieve high accuracy, particularly when coupled with robust biological interpretation pipelines [40]. The current evidence suggests that no single scFM dominates across all clinical tasks [3]. Model selection must therefore be guided by the specific application, dataset size, and available computational resources. The future of scFMs in clinical research is promising but hinges on overcoming key challenges, including improving zero-shot reliability, enhancing interpretability, and achieving smoother translation from cell line models to patient data.

The emergence of single-cell foundation models (scFMs) has revolutionized the analysis of single-cell RNA sequencing (scRNA-seq) data, offering unprecedented potential for uncovering novel biological insights. However, the rapid development of diverse scFMs, including Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello, has created an urgent need for standardized quantitative metrics to objectively evaluate their performance in real-world research scenarios [3]. This evaluation is particularly critical for zero-shot learning applications, where models are applied without task-specific fine-tuning—a common requirement in exploratory biological discovery where labels are unknown [1].

This guide provides a comprehensive comparison of scFM performance evaluation, focusing on the quantitative metrics that researchers are using to benchmark these models against traditional methods. We delve into the specific methodologies and experimental protocols employed in leading benchmarking studies to help researchers and drug development professionals make informed decisions about model selection for their specific applications, from cell atlas construction to tumor microenvironment studies and treatment decision-making [3].

Quantitative Metrics for scFM Evaluation: Definitions and Applications

Evaluating scFMs requires a multifaceted approach that employs different metrics to assess various aspects of model performance. The table below summarizes the key quantitative metrics used in scFM benchmarking studies.

Table 1: Key Quantitative Metrics for Evaluating Single-Cell Foundation Models

Metric Full Name What It Measures Interpretation Primary Application Context
ASW Average Silhouette Width How similar cells are to their own cluster compared to other clusters Higher values indicate better cell type separation and clustering quality Cell type annotation, clustering analysis [1]
AvgBIO Average BIO Score Biological conservation after integration Higher values indicate better preservation of biological variance Batch integration, data harmonization [1]
ROGI Roughness Index Smoothness of the cell-property landscape in latent space Lower values indicate smoother landscapes and easier model training Model selection, latent space evaluation [3]
PCR Principal Component Regression Proportion of variance explained by batch effects Lower values indicate better batch effect correction Batch integration quality [1]
LCAD Lowest Common Ancestor Distance Ontological proximity between misclassified cell types Lower values indicate less severe errors in cell type annotation Cell type annotation quality assessment [3]
scGraph-OntoRWR scGraph-Ontological Random Walk with Restart Consistency of captured cell type relationships with prior biological knowledge Higher values indicate better alignment with established biological hierarchies Biological relevance of learned representations [3]

These metrics collectively address three critical dimensions of scFM evaluation: technical performance (ASW, PCR), biological relevance (AvgBIO, scGraph-OntoRWR, LCAD), and practical utility (ROGI). The ROGI metric is a notable recent addition: it quantifies the roughness of the cell-property landscape in the pretrained latent space, and its correlation with downstream task performance supports the interpretation that smoother landscapes make task-specific models easier to train [3].
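
Among the technical metrics, principal component regression (PCR) can be approximated as follows: decompose the embedding with PCA and aggregate, with variance weights, how much of each component is explained by the batch label. This follows the general scIB-style definition but is a simplified re-implementation on toy data, not the metric code used in the cited studies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(5)
n_cells, dim = 800, 50

# Toy embedding with a mild batch shift along a few dimensions
batch = rng.integers(0, 2, size=n_cells)
embedding = rng.normal(size=(n_cells, dim))
embedding[:, :3] += batch[:, None] * 0.8

pca = PCA(n_components=20).fit(embedding)
pcs = pca.transform(embedding)
batch_design = batch.reshape(-1, 1).astype(float)   # single binary batch covariate

# R^2 of each principal component regressed on batch, weighted by its variance
r2_per_pc = np.array([
    LinearRegression().fit(batch_design, pcs[:, i]).score(batch_design, pcs[:, i])
    for i in range(pcs.shape[1])
])
pcr = float(np.average(r2_per_pc, weights=pca.explained_variance_))
print(f"Variance explained by batch (PCR proxy): {pcr:.3f}")
```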

Comparative Performance Analysis: scFMs vs. Traditional Methods

Comprehensive benchmarking studies have revealed that no single scFM consistently outperforms all others across diverse tasks, emphasizing the need for tailored model selection based on factors such as dataset size, task complexity, biological interpretability, and computational resources [3]. The following tables summarize performance comparisons across critical evaluation tasks.

Table 2: Cell Type Clustering Performance (AvgBIO Score) Comparison

Model/Method Pancreas Dataset PBMC (12k) Dataset Tabula Sapiens Immune Dataset
HVG (Baseline) 0.781 0.812 0.795 0.803
Harmony 0.792 0.805 0.784 0.798
scVI 0.801 0.819 0.812 0.811
scGPT 0.763 0.826 0.779 0.788
Geneformer 0.742 0.768 0.751 0.739

Table 3: Batch Integration Performance Comparison Across Multiple Datasets

Model/Method Technical Batch Correction Biological Batch Correction Overall Ranking
HVG 1st 1st 1st
scVI 2nd 3rd 2nd
Harmony 3rd 2nd 3rd
scGPT 4th 4th 4th
Geneformer 5th 5th 5th

Table 4: Task-Specific Model Performance Rankings

| Model | Cell Type Annotation | Batch Integration | Gene-Level Tasks | Perturbation Prediction |
| --- | --- | --- | --- | --- |
| scGPT | 1st | 2nd | 2nd | 3rd |
| Geneformer | 3rd | 4th | 1st | 4th |
| scFoundation | 2nd | 3rd | 3rd | 2nd |
| scBERT | 4th | 5th | 4th | 5th |

The data reveals several critical patterns. First, in zero-shot settings, scFMs frequently fail to outperform simpler baseline methods like Highly Variable Genes (HVG) selection [1]. For instance, in perturbation effect prediction, scFM embeddings offer limited improvement over simple baseline models, particularly under distribution shift [16]. Second, performance varies significantly by task type, with different models excelling in different domains—Geneformer and scFoundation demonstrate strong capabilities in gene-level tasks, while scGPT shows more robust performance across all tasks, including both zero-shot and fine-tuning scenarios [6]. Third, pretraining does provide measurable benefits, but these benefits appear to plateau, with larger and more diverse datasets not consistently conferring additional advantages beyond certain limits [1].

Experimental Protocols for scFM Benchmarking

Standardized Benchmarking Framework

Leading benchmarking studies employ rigorous methodologies to ensure fair and comprehensive evaluation of scFMs. The typical workflow encompasses multiple stages, from feature extraction through to performance evaluation across diverse tasks.

[Diagram: input scRNA-seq data → zero-shot feature extraction → downstream task evaluation, branching into gene-level tasks (gene function prediction, gene-gene interaction) and cell-level tasks (cell type annotation, batch integration, cancer cell identification, drug sensitivity prediction) → performance assessment → model ranking and selection guidelines]

Diagram 1: scFM Benchmarking Workflow

Key Experimental Considerations

Benchmarking studies are designed to evaluate scFMs under realistic conditions that reflect actual research scenarios. The protocols typically incorporate several critical elements:

  • Zero-shot Evaluation Protocol: Models are evaluated without any task-specific fine-tuning to assess the inherent quality of their pretrained representations, which is particularly important for discovery settings where labels are unknown [1] (a minimal code sketch of this protocol follows the list below).

  • Diverse Dataset Selection: Evaluations use multiple datasets with different biological conditions, including normal tissues, cancer samples, and perturbation experiments, to assess generalizability [3].

  • Comprehensive Baseline Comparison: scFMs are compared against well-established traditional methods, including HVG selection, anchor-based approaches (Seurat), clustering-based methods (Harmony), and generative models (scVI) [3] [1].

  • Multiple Evaluation Metrics: Studies employ a battery of metrics (ASW, AvgBIO, ROGI, etc.) to capture different aspects of model performance, from technical efficacy to biological relevance [3].

  • Data Leakage Prevention: Independent and unbiased datasets, such as the Asian Immune Diversity Atlas (AIDA) v2 from CellxGene, are introduced to mitigate the risk of data leakage and validate conclusions [3].
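A minimal sketch of how such a zero-shot protocol can be wired together is shown below. The `extract_embeddings` wrapper is hypothetical (scGPT and Geneformer each ship their own loading and embedding utilities), and KMeans stands in for the graph-based clustering used in the cited studies.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)

def evaluate_zero_shot(adata, embedding_key, label_key="cell_type"):
    """Score a zero-shot embedding against known labels, with no fine-tuning involved.
    embedding_key names the .obsm slot holding the model's cell embeddings."""
    X = adata.obsm[embedding_key]
    labels = adata.obs[label_key].to_numpy()
    # Unsupervised clustering in the embedding space, then agreement with annotations.
    clusters = KMeans(n_clusters=len(set(labels)), n_init=10, random_state=0).fit_predict(X)
    return {
        "ARI": adjusted_rand_score(labels, clusters),
        "NMI": normalized_mutual_info_score(labels, clusters),
        "ASW": silhouette_score(X, labels),
    }

# Hypothetical usage: compare an scFM embedding against a PCA-on-HVG baseline.
# adata.obsm["X_scFM"] = extract_embeddings("scgpt", adata)   # assumed wrapper function
# import scanpy as sc
# sc.pp.highly_variable_genes(adata, n_top_genes=2000, subset=True)
# sc.pp.pca(adata, n_comps=50)
# results = {key: evaluate_zero_shot(adata, key) for key in ["X_scFM", "X_pca"]}
```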

Successful scFM evaluation and application requires both computational tools and biological datasets. The table below catalogues key resources referenced in benchmarking studies.

Table 5: Essential Research Reagents and Computational Resources for scFM Evaluation

| Resource Name | Type | Primary Function | Relevance to scFM Research |
| --- | --- | --- | --- |
| CELLxGENE | Data Platform | Curated single-cell data collection | Provides standardized datasets for model training and evaluation [1] |
| AIDA v2 | Dataset | Asian Immune Diversity Atlas | Serves as independent validation dataset to prevent data leakage [3] |
| BioLLM | Software Framework | Unified interface for scFMs | Standardizes model access and evaluation across diverse architectures [6] |
| PertEval-scFM | Benchmarking Framework | Specialized perturbation evaluation | Systematically tests perturbation effect prediction capabilities [16] |
| Seurat | Analysis Tool | Single-cell data analysis | Established baseline method for comparison with scFMs [3] |
| Harmony | Integration Algorithm | Batch effect correction | Benchmark method for evaluating integration capabilities [3] [1] |
| scVI | Generative Model | Probabilistic modeling of scRNA-seq | Baseline for comparison of latent representation quality [3] [1] |

Interpretation Guidelines and Practical Recommendations

Relationship Between Metrics and Biological Insight

The quantitative metrics discussed provide distinct but complementary insights into scFM performance. ASW and AvgBIO scores primarily reflect technical performance in specific tasks like clustering and batch integration, while ROGI offers a more fundamental assessment of the latent space structure that can predict model usability across multiple applications [3]. The novel ontology-informed metrics (scGraph-OntoRWR and LCAD) provide crucial bridges between computational outputs and biological meaning by quantifying how well the models capture established biological relationships [3].

The relationship between these metrics and practical model utility is complex. A model might exhibit excellent ASW scores but perform poorly on scGraph-OntoRWR, indicating technically proficient but biologically misaligned representations. Similarly, strong batch integration metrics might sometimes come at the cost of biological signal preservation, particularly when dealing with complex biological batch effects [1].

Model Selection Framework

Based on comprehensive benchmarking results, researchers can employ the following framework for model selection:

  • For gene-level tasks: Geneformer and scFoundation demonstrate superior performance, benefiting from their effective pretraining strategies [6].

  • For cell-type annotation: scGPT generally performs best, particularly when fine-tuning is possible, though simpler methods may suffice for zero-shot applications [1] [6].

  • For batch integration with technical variation: Traditional methods (scVI, Harmony) typically outperform scFMs in zero-shot settings [1].

  • For perturbation prediction: Current scFMs show limitations, with specialized models often required for robust performance [16].

  • When computational resources are limited: Simpler machine learning models often provide more efficient adaptation to specific datasets, particularly under resource constraints [3].

The ROGI metric can serve as a practical proxy for model selection in a dataset-dependent manner, with lower roughness indices generally predicting better performance across diverse tasks [3].
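The exact ROGI formulation is given in the cited benchmarking study [3]; purely as an illustration of the underlying idea, the sketch below computes a simple kNN-based roughness proxy: the average disagreement between each cell's property value and the values of its nearest neighbors in the embedding space. The function and its parameters are assumptions for illustration, not the published metric.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_roughness(embeddings, property_values, k=15):
    """Roughness proxy: mean absolute gap between each cell's property value and the
    average over its k nearest neighbors in the embedding (lower = smoother landscape).
    NOT the published ROGI metric; a simplified illustration of the same intuition."""
    values = np.asarray(property_values, dtype=float)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)          # idx[:, 0] is the cell itself
    neighbor_means = values[idx[:, 1:]].mean(axis=1)
    return float(np.mean(np.abs(values - neighbor_means)))

# Hypothetical usage with a continuous cell property (column name is an assumption):
# prop = adata.obs["drug_sensitivity"].to_numpy()
# knn_roughness(adata.obsm["X_scFM"], prop)
# knn_roughness(adata.obsm["X_pca"], prop)
```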

The comprehensive evaluation of single-cell foundation models requires a multifaceted approach that moves beyond single-metric assessments. While quantitative metrics like ASW, BIO scores, and ROGI provide essential performance indicators, the integration of biological relevance metrics represents a critical advancement in the field. The current generation of scFMs shows promising capabilities but does not consistently outperform simpler traditional methods in zero-shot settings, highlighting the need for continued model development and refinement [3] [1].

For researchers and drug development professionals, the key to successful scFM application lies in task-aware model selection guided by systematic benchmarking. As the field evolves, standardized evaluation frameworks like BioLLM [6] and PertEval-scFM [16] will play increasingly important roles in ensuring fair comparison and methodological progress. By understanding the strengths and limitations of current evaluation metrics and the models they assess, the scientific community can better harness the power of scFMs to advance our understanding of complex biological systems and accelerate therapeutic development.

Single-cell Foundation Models (scFMs) like scGPT and Geneformer represent a significant advancement in the analysis of biological systems, promising to automate cell type identification and data integration. However, their application in discovery-driven research often necessitates reliable performance without further model training, a capability known as zero-shot learning. This guide presents a data-driven comparison of scFM performance against established, simpler methods in zero-shot settings. Synthesizing findings from recent rigorous evaluations, we demonstrate that no single model consistently outperforms others across all tasks. The optimal tool selection is highly dependent on the specific analytical goal, underscoring the critical need for task-specific benchmarking in single-cell research.

Foundation models are machine learning models pretrained on vast datasets with the goal of capturing universal patterns that can be applied to a wide range of downstream tasks [1]. In single-cell biology, proposed scFMs like Geneformer and scGPT aim to project noisy gene expression data into a biologically meaningful latent space. A model is used zero-shot when its internal representation of input data—an "embedding"—is used for analysis with no further task-specific training [1].

The reliability of zero-shot performance is particularly critical in exploratory biological research where predefined labels are unavailable, making fine-tuning infeasible. Despite the promise of scFMs, emerging evidence suggests their zero-shot capabilities may not yet consistently surpass those of simpler, established methods. This guide provides an objective, data-backed comparison to help researchers navigate this evolving landscape.

Performance Benchmarking: scFMs vs. Established Baselines

A 2025 independent evaluation rigorously assessed the zero-shot performance of scGPT and Geneformer against simpler baseline methods across multiple datasets and common analytical tasks [1]. The baselines, which can be reproduced with standard tooling (see the sketch after this list), included:

  • HVG Selection: Choosing a subset of genes with the highest variability across cells.
  • Harmony: An algorithm that iteratively corrects dataset-specific effects to enable integration.
  • scVI: A probabilistic graphical model designed for single-cell transcriptomic data.
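Under the assumption of a normalized AnnData object with a `batch` column, these three baselines can be reproduced as sketched below using scanpy, Harmony via scanpy.external, and scvi-tools; the parameter choices (2,000 HVGs, 50 principal components) are illustrative defaults, not the exact settings of the cited evaluation.

```python
import scanpy as sc
import scvi

# Assumes `adata` holds normalized, log-transformed expression with raw counts kept in
# adata.layers["counts"] and a batch annotation in adata.obs["batch"].

# Baseline 1: HVG selection followed by PCA.
sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch", subset=True)
sc.pp.pca(adata, n_comps=50)                            # -> adata.obsm["X_pca"]

# Baseline 2: Harmony, which iteratively corrects the PCA space for batch effects.
sc.external.pp.harmony_integrate(adata, key="batch")    # -> adata.obsm["X_pca_harmony"]

# Baseline 3: scVI, a probabilistic model trained per dataset on raw counts.
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="batch")
model = scvi.model.SCVI(adata)
model.train()
adata.obsm["X_scVI"] = model.get_latent_representation()
```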

The evaluation used standard metrics to quantify performance:

  • AvgBIO Score: Measures the biological integrity of clusters after integration (higher is better); a sketch of one common way to compute it appears after this list.
  • ASW (Average Silhouette Width): Assesses clustering quality (higher is better).
  • Batch Integration Metrics: Evaluate how well technical batch effects are removed while preserving biological variation.
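AvgBIO is commonly reported as the mean of three label-conservation scores (ARI, NMI, and cell-type ASW rescaled to [0, 1]); the short sketch below follows that convention and assumes cluster assignments have already been computed in the embedding space. Treat it as an approximation of the metric used in [1], not a reimplementation.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score, silhouette_score

def avg_bio(embeddings, cell_type_labels, cluster_assignments):
    """AvgBIO as the mean of ARI, NMI (clusters vs. labels) and cell-type ASW,
    with ASW rescaled from [-1, 1] to [0, 1] so all three terms share a range."""
    asw01 = (silhouette_score(embeddings, cell_type_labels) + 1) / 2
    return float(np.mean([
        adjusted_rand_score(cell_type_labels, cluster_assignments),
        normalized_mutual_info_score(cell_type_labels, cluster_assignments),
        asw01,
    ]))
```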

Cell Type Clustering Performance

Cell type clustering is a fundamental task where the goal is to group cells based on transcriptional similarity. The following table summarizes the performance of each method across several datasets, ranked by the critical AvgBIO score.

Table 1: Zero-shot performance in cell type clustering (AvgBIO Score). Data sourced from [1].

| Dataset | Best Performing Method | scGPT Performance | Geneformer Performance |
| --- | --- | --- | --- |
| PBMC (12k) | scGPT | Best | Underperformed HVG, Harmony, and scVI |
| Immune Cell Atlas | Harmony / scVI | Underperformed best baselines | Underperformed HVG, Harmony, and scVI |
| Tabula Sapiens | Harmony / scVI | Underperformed best baselines | Underperformed HVG, Harmony, and scVI |
| Pancreas | Harmony / scVI | Underperformed best baselines | Underperformed HVG, Harmony, and scVI |

Key Finding: The simple method of selecting Highly Variable Genes (HVG) outperformed both scGPT and Geneformer on average across all metrics and datasets [1]. While scGPT showed competitive performance on one dataset (PBMC 12k), its performance was inconsistent. Geneformer consistently underperformed relative to all other methods, including the simple HVG baseline.

Batch Integration Performance

Batch integration is crucial for combining data from multiple experiments. The evaluation assessed how well each method mixed cells from different batches without obscuring true biological differences.

Table 2: Zero-shot performance in batch integration. Data sourced from [1].

| Method | Overall Ranking | Key Strength | Key Limitation |
| --- | --- | --- | --- |
| HVG Selection | 1st | Best batch integration scores across all datasets | - |
| scVI | 2nd | Effective on datasets with technical variation (e.g., Pancreas) | Challenged by complex biological batch effects (e.g., Immune dataset) |
| Harmony | 3rd | Effective on complex biological batch effects (e.g., Immune dataset) | Ranks last for PCR score on Tabula Sapiens dataset |
| scGPT | 4th | Can integrate experiments using the same technique | Embedding space primarily structured by batch, not biology |
| Geneformer | 5th | - | Consistently ranked last; failed to retain cell type information |

Key Finding: HVG selection achieved the best batch integration scores across all datasets [1]. Qualitative analysis revealed that while scGPT's embeddings offered some separation by cell type, the primary structure was still driven by batch effects. Geneformer's embedding space failed to retain meaningful cell type information, with clustering being primarily driven by batch.
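This kind of qualitative check is straightforward to reproduce: embed the zero-shot representation with UMAP and color the result by batch and by cell type; structure that tracks the batch coloring indicates a batch-driven embedding. A minimal scanpy sketch, with `X_scFM` as an assumed .obsm key:

```python
import scanpy as sc

# Assumes adata.obsm["X_scFM"] holds zero-shot embeddings and .obs has "batch"/"cell_type".
sc.pp.neighbors(adata, use_rep="X_scFM")   # build the kNN graph on the scFM embedding
sc.tl.umap(adata)
# Side-by-side coloring: structure that follows "batch" rather than "cell_type"
# indicates batch-driven embeddings, as described for Geneformer above.
sc.pl.umap(adata, color=["batch", "cell_type"], wspace=0.4)
```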

Experimental Protocols for Evaluation

To ensure reproducibility and transparent benchmarking, the following section outlines the core methodologies used in the cited evaluations.

Zero-Shot Evaluation Workflow

The primary evaluation protocol [1] for assessing scFM embedding quality involved a standardized workflow, illustrated in the diagram below.

[Diagram: raw single-cell dataset → input dataset to foundation model (scGPT/Geneformer) → generate cell embeddings (zero-shot) → apply downstream task (e.g., clustering, integration) → quantify performance using metrics (AvgBIO, ASW) → compare against baseline methods]

Diagram 1: Zero-shot evaluation workflow for single-cell foundation models. This process assesses the quality of model-generated cell embeddings without any task-specific fine-tuning.

The key characteristic of this protocol is the absence of fine-tuning. The pre-trained models are used as-is to generate embeddings, which are then directly evaluated on downstream tasks. This tests the generalizable biological knowledge captured during the model's initial pre-training.
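Concretely, "no fine-tuning" means the pretrained weights stay frozen and only a forward pass is run. The sketch below shows the generic PyTorch pattern; `load_pretrained_scfm` and `tokenize_cells` are hypothetical stand-ins, since scGPT and Geneformer each expose their own loading and tokenization utilities.

```python
import torch

model = load_pretrained_scfm("scgpt")     # hypothetical loader; real APIs differ per model
model.eval()                              # inference mode; weights stay frozen

embeddings = []
with torch.no_grad():                     # no gradients, no parameter updates
    for batch in tokenize_cells(adata, batch_size=64):   # hypothetical tokenizer
        out = model(batch)                # forward pass only
        embeddings.append(out.cell_embeddings.cpu())      # assumed output attribute
adata.obsm["X_scFM"] = torch.cat(embeddings).numpy()
```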

Benchmarking Against Baselines

The comparative evaluation [1] employed a rigorous approach to ensure a fair and meaningful comparison between scFMs and established methods, as shown in the following workflow.

[Diagram: select benchmark datasets → apply all methods (scGPT, Geneformer, HVG, scVI, Harmony) → generate unified set of performance metrics → rank methods by task and dataset]

Diagram 2: Benchmarking workflow for comparative analysis of single-cell methods.

This protocol ensures that all methods are assessed on a level playing field. The use of multiple independent datasets guards against overfitting to a specific data type or technology, while multiple metrics provide a holistic view of performance across different aspects of each task.
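The final ranking step can be made explicit with a small aggregation routine. The sketch below assumes a hypothetical CSV of per-dataset scores (methods as rows, datasets as columns, higher is better) and averages within-dataset ranks across datasets:

```python
import pandas as pd

# Hypothetical input: a CSV of AvgBIO scores with methods as rows and datasets as columns.
scores = pd.read_csv("avgbio_scores.csv", index_col=0)   # assumed file name

# Rank methods within each dataset (1 = best, higher score = better),
# then average ranks across datasets for an overall ordering.
per_dataset_ranks = scores.rank(axis=0, ascending=False)
overall = per_dataset_ranks.mean(axis=1).sort_values()
print(overall)   # lower mean rank = more consistently strong method
```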

The Scientist's Toolkit: Key Research Reagents & Solutions

The following table details essential computational tools and resources used in the evaluation of single-cell methods, providing a quick reference for researchers seeking to implement these protocols.

Table 3: Key research reagents and computational tools for scFM evaluation.

| Tool / Resource | Type | Primary Function in Evaluation | Reference/Access |
| --- | --- | --- | --- |
| scGPT | Foundation Model | Generates cell embeddings from gene expression data via masked language model pretraining | [1] |
| Geneformer | Foundation Model | Generates cell embeddings from gene expression data via masked language model pretraining | [1] |
| scVI | Probabilistic Model | Generates cell embeddings and integrates datasets; used as a baseline for comparison | [1] |
| Harmony | Integration Algorithm | Iteratively corrects for batch effects in cell embeddings; used as a baseline for comparison | [1] |
| HVG Selection | Preprocessing Method | Selects a subset of informative genes to reduce dimensionality and noise | [1] |
| BIO Score / ASW | Performance Metric | Quantifies the biological integrity and quality of cell clusters | [1] |
| Pancreas Dataset | Benchmark Data | A composite dataset from five sources used to test batch integration performance | [1] |

The benchmark data clearly supports a "no-one-size-fits-all" approach. The performance of scFMs is highly task- and dataset-dependent, and simpler methods often provide more robust and reliable results in zero-shot settings.

Evidence-Based Selection Guidelines:

  • For Robust Cell Type Clustering: Rely on HVG selection or Harmony/scVI. The high performance and consistency of these methods make them preferable initial choices over current scFMs for this core task [1].
  • For Critical Batch Integration: HVG selection is the top-performing method. For specific cases, scVI (technical variation) or Harmony (biological variation) are strong alternatives. Current scFMs are not recommended as primary tools for this task [1].
  • For Exploratory Analysis with scFMs: Exercise caution. Always validate findings against results generated by simpler baseline methods to ensure that insights are driven by biology and not model artifacts [1].

The field of single-cell foundation models is evolving rapidly. Future model generations, trained with different architectures or objectives, may overcome these current limitations. Therefore, continuous and rigorous zero-shot benchmarking remains an essential practice for the scientific community.

Conclusion

The evaluation of zero-shot scFM embeddings reveals a nuanced landscape. Current benchmarks demonstrate that while these models are versatile and capture significant biological insights, their zero-shot performance is often inconsistent and can be outperformed by simpler, established methods on specific tasks. No single scFM consistently dominates across all applications, making careful, task-specific model selection paramount. The future of scFMs in biomedical and clinical research hinges on developing more biologically grounded pretraining objectives, creating standardized evaluation frameworks like BioLLM, and improving model interpretability. For researchers, this means that scFMs should be viewed as powerful but imperfect tools. Their successful application requires a rigorous, evidence-based approach to validation, ensuring that the insights they generate are robust and translatable to advancing our understanding of disease mechanisms and therapeutic development.

References