Zero-Shot Learning with Single-Cell Foundation Models: Current State, Challenges, and Future Directions

Connor Hughes · Nov 27, 2025

Abstract

Single-cell foundation models (scFMs), pretrained on millions of single-cell transcriptomes, promise to revolutionize biological discovery by enabling zero-shot learning—applying model knowledge to new data without task-specific training. This article provides a comprehensive overview for researchers and drug development professionals, exploring the foundational concepts of scFMs and zero-shot inference. We examine methodological approaches and applications, from cell type annotation to drug perturbation prediction, and critically address performance challenges revealed by recent rigorous evaluations. The content synthesizes troubleshooting strategies and optimization techniques, while presenting a framework for the validation and comparative benchmarking of these models against traditional methods. By integrating the latest research, this article serves as a guide for the effective application and future development of zero-shot scFMs in biomedical research.

Understanding Single-Cell Foundation Models and the Zero-Shot Paradigm

What Are Single-Cell Foundation Models? Defining the Core Concepts

Single-cell foundation models (scFMs) are large-scale deep learning models pretrained on vast datasets of single-cell genomics data, capable of being adapted to a wide range of downstream biological tasks [1]. Inspired by the success of large language models (LLMs) in natural language processing, these models aim to decipher the 'language' of cells by learning universal patterns from millions of single-cell transcriptomes [1] [2].

In these models, individual cells are treated analogously to sentences, and genes or other genomic features along with their expression values are treated as words or tokens [1]. The premise is that by exposing a model to millions of cells encompassing many tissues and conditions, it can learn fundamental, generalizable principles of cellular biology [1].

Core Architectural Concepts of scFMs

The development of a single-cell foundation model involves several key components, from data assembly to model architecture and pretraining.

  • Data Sources for Pretraining: A critical ingredient for any scFM is the compilation of large and diverse datasets. Platforms like CZ CELLxGENE, which provides access to over 100 million unique annotated cells, and public repositories like the Gene Expression Omnibus (GEO) are foundational to this effort [1]. Curated compendia such as the Human Cell Atlas, PanglaoDB, and the Human Ensemble Cell Atlas collate data from multiple sources to create extensive training corpora [1].
  • Tokenization: This process converts raw gene expression data into discrete units, or tokens, that the model can process. A key challenge is that gene expression data is not naturally sequential. Common strategies to impose order include:
    • Ranking genes within each cell by their expression levels, feeding the ordered list as a "sentence" [1].
    • Binning gene expression values into discrete categories and using those as tokens [1] [3].
    • In either case, gene tokens are typically combined with value embeddings (representing expression level) and positional embeddings (indicating the gene's rank or position) [4].
  • Model Architecture: Most scFMs are built on the transformer architecture, which uses attention mechanisms to weight the relationships between any pair of input tokens (genes) [1]. Popular variants include:
    • Encoder-based models (e.g., BERT-like), which use bidirectional attention to learn from all genes in a cell simultaneously [1].
    • Decoder-based models (e.g., GPT-like), which use a unidirectional masked self-attention mechanism to iteratively predict genes [1] [3].
  • Pretraining Strategy: These models are typically trained using self-supervised learning, most commonly masked gene expression prediction [5] [6]. In this task, the model is shown input data with a subset of genes withheld and must predict the expression of these masked genes based on the context of the remaining genes [6]. The logic is that successfully completing this task requires the model to learn the underlying regulatory and functional relationships between genes [6].
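As a concrete illustration, the rank-based tokenization strategy above can be sketched in a few lines. This is a simplified toy version; the gene names, counts, and `max_len` cutoff are illustrative, not any specific model's implementation:

```python
import numpy as np

def rank_tokenize(expr, gene_names, max_len=2048):
    """Order genes by descending expression and return the token 'sentence'.

    Non-expressed genes are dropped, mirroring the rank-based strategy
    described above (simplified illustration only).
    """
    expr = np.asarray(expr, dtype=float)
    nonzero = np.flatnonzero(expr)
    order = nonzero[np.argsort(-expr[nonzero], kind="stable")]
    return [gene_names[i] for i in order[:max_len]]

# Toy cell: 5 genes with raw counts
genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "ACTB"]
counts = [0, 7, 2, 0, 15]
print(rank_tokenize(counts, genes))  # highest-expressed gene first
```

The resulting ordered gene list is what gets mapped to token IDs and fed to the transformer, with rank position supplying the "sequence" structure that raw expression data lacks.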

The following diagram illustrates a typical workflow for building and applying a single-cell foundation model.

[Diagram: diverse single-cell data (e.g., CELLxGENE, GEO) → tokenization & embedding (1. rank or bin genes; 2. create gene, value & positional embeddings) → transformer pretraining via masked gene modeling (the model learns to predict randomly masked gene values from context) → model outputs → zero-shot application or fine-tuning]

Current Research Landscape & Zero-Shot Performance

The "zero-shot" capability of a foundation model—its performance on new, unseen data without any task-specific training—is critical for biological discovery settings where labels are unknown [5]. Recent rigorous evaluations have revealed both the promise and limitations of current scFMs in this regard.

Benchmarking Zero-Shot Performance

A key evaluation of two popular models, Geneformer and scGPT, examined their zero-shot performance on tasks like cell type clustering and batch integration across multiple datasets [5] [6]. The findings suggest that in their zero-shot configuration, these models can face significant reliability challenges.

  • Cell Type Clustering: In separating known cell types, both Geneformer and scGPT were found to perform worse than established methods like scVI and Harmony, and sometimes even underperformed the simple approach of selecting Highly Variable Genes (HVG) [5] [6].
  • Batch Integration: In the task of integrating data from multiple sources to remove technical "batch effects" while preserving biological variation, Geneformer consistently underperformed relative to scGPT, Harmony, scVI, and HVG across most datasets [5]. scGPT showed more competitive performance, particularly on complex datasets involving both technical and biological batch effects [5].

Table 1: Summary of Model Zero-Shot Performance in Key Tasks (Adapted from Genome Biology, 2025)

| Model | Cell Type Clustering | Batch Integration | Notable Strengths / Weaknesses |
| --- | --- | --- | --- |
| Geneformer | Underperformed baselines (HVG, scVI, Harmony) [5] | Consistently ranked last across metrics [5] | Embedding space often failed to retain cell type information; structure driven by batch effects [5] |
| scGPT | Inconsistent; outperformed on one dataset but worse on others [5] | Competitive on complex datasets with biological batch effects [5] | Performance may be influenced by overlap between evaluation and pretraining datasets [5] |
| scVI | Consistently strong performance [5] | Strong performer, especially on technical variation [5] | Established baseline; generative model [5] [4] |
| Harmony | Consistently strong performance [5] | Strong performer, but challenged on some datasets [5] | Established baseline; adjusts PC embeddings [5] [4] |
| HVG (baseline) | Outperformed Geneformer and scGPT across metrics [5] | Achieved best batch integration scores in some evaluations [5] | Simple feature selection method [5] |

Insights into Performance Limitations

Research points to two main hypotheses for these zero-shot limitations. First, the masked language model pretraining framework itself might not inherently produce useful cell embeddings for these tasks. Second, the models may have failed to learn the pretraining task effectively [5]. For instance, analysis of scGPT's gene expression prediction revealed that, even when using its cell embedding, its predictive ability was only slightly improved and largely limited to highly expressed "housekeeping" genes, questioning whether it learns deeper, context-dependent relationships between genes [6].

Emerging Models and Improvements

Despite these challenges, the field is evolving rapidly. Newer, larger models are being developed, such as CellFM, an 800-million-parameter model trained on 100 million human cells, which reports outperforming existing models in tasks like cell annotation and gene function prediction [3]. Furthermore, research into efficient fine-tuning techniques (training less than 1% of a model's parameters) shows promise in enabling robust zero-shot generalization to unseen cell lines and conditions, such as predicting responses to novel drugs [7].

Essential Protocols for scFM Evaluation

For researchers aiming to evaluate single-cell foundation models, particularly in zero-shot settings, the following protocols outline key methodological steps.

Protocol for Zero-Shot Cell Type Clustering

This protocol assesses the quality of a model's cell embeddings for distinguishing cell types without any further training [5] [4].

  • Objective: To evaluate whether a pretrained scFM's embeddings can separate known cell types in a new, unseen dataset.
  • Experimental Steps:
    • Dataset Selection: Acquire a well-annotated scRNA-seq dataset that was not part of the model's pretraining corpus. The dataset should have a known set of cell type labels. Examples include the Pancreas benchmark dataset or data from the Asian Immune Diversity Atlas (AIDA) [5] [4].
    • Embedding Extraction: Pass the gene expression matrix of the new dataset through the pretrained scFM without performing any fine-tuning to obtain a latent cell embedding for each cell.
    • Dimensionality Reduction: Apply a standard dimensionality reduction technique (e.g., UMAP, t-SNE) to the cell embeddings for visualization.
    • Clustering & Evaluation: Use a clustering algorithm (e.g., Leiden, Louvain) on the full embeddings and compare the resulting clusters to the ground-truth cell type labels.
  • Evaluation Metrics:
    • Average BIO (AvgBIO) score and Average Silhouette Width (ASW) to quantify cluster quality and separation [5].
    • Cell ontology-informed metrics, such as the Lowest Common Ancestor Distance (LCAD), to assess the biological plausibility of clustering errors [4].
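The silhouette-based evaluation in this protocol can be illustrated with a minimal, pure-NumPy implementation. This is a pedagogical sketch rather than the optimized library implementations typically used in published benchmarks:

```python
import numpy as np

def silhouette_width(X, labels):
    """Mean silhouette width over all points (pure-NumPy sketch).

    For each point: a = mean distance to its own cluster,
    b = lowest mean distance to any other cluster,
    s = (b - a) / max(a, b); values near 1 mean tight, separated clusters.
    """
    X = np.asarray(X, float)
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(len(X)):
        same = labels == labels[i]
        same[i] = False
        if not same.any():
            continue  # singleton cluster: silhouette undefined
        a = D[i, same].mean()
        b = min(D[i, labels == c].mean()
                for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy "cell types" give a silhouette near 1
X = np.array([[0, 0], [0, 1], [10, 10], [10, 11]], float)
print(silhouette_width(X, ["T", "T", "B", "B"]))
```

In practice the same computation is applied to the scFM's cell embeddings with ground-truth cell type labels, and averaged with other metrics into composite scores such as AvgBIO.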
Protocol for Zero-Shot Batch Integration

This protocol evaluates a model's ability to produce embeddings that mix cells from different technical batches while preserving biological variation [5].

  • Objective: To quantify the removal of batch effects and conservation of biological variance in a model's embeddings.
  • Experimental Steps:
    • Dataset Selection: Select a dataset with known, strong batch effects from multiple sources or experimental techniques, such as the Pancreas dataset with five different sources [5].
    • Embedding Extraction: Obtain cell embeddings for the dataset using the pretrained scFM in zero-shot mode.
    • Visualization: Create UMAP plots colored by both batch and cell type to qualitatively assess integration.
  • Evaluation Metrics:
    • Batch mixing metrics: Quantify how well cells from different batches are intermixed [5].
    • Principal Component Regression (PCR) score: Measures the proportion of variance in the embeddings explained by batch effects versus biological cell type [5]. A lower batch PCR score indicates better integration.
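The PCR idea can be sketched as a regression of the embedding on one-hot batch covariates. This is a simplified illustration of the concept; actual benchmark implementations may regress principal components and weight by explained variance:

```python
import numpy as np

def batch_pcr(embedding, batches):
    """Fraction of embedding variance explained by batch (PCR sketch).

    Regress every embedding dimension on one-hot batch covariates and
    report the pooled R^2; lower values indicate better integration.
    """
    X = np.asarray(embedding, float)
    batches = np.asarray(batches)
    cats = np.unique(batches)
    # One-hot design matrix with intercept (drop last category)
    B = np.column_stack([np.ones(len(batches))] +
                        [(batches == c).astype(float) for c in cats[:-1]])
    coef, *_ = np.linalg.lstsq(B, X, rcond=None)
    resid = X - B @ coef
    total = ((X - X.mean(0)) ** 2).sum()
    return float(1.0 - (resid ** 2).sum() / total)

rng = np.random.default_rng(0)
batch = np.repeat(["b1", "b2"], 50)
mixed = rng.normal(size=(100, 5))                 # batch-free embedding
shifted = mixed + (batch == "b2")[:, None] * 3.0  # strong batch shift
print(batch_pcr(mixed, batch), batch_pcr(shifted, batch))
```

A well-integrated embedding should behave like `mixed` (batch explains little variance), whereas an embedding dominated by technical effects behaves like `shifted`.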

Table 2: Essential Data, Models, and Frameworks for scFM Research

| Resource Name | Type | Primary Function |
| --- | --- | --- |
| CZ CELLxGENE [1] | Data Platform | Provides unified access to over 100 million annotated single cells; a primary source of pretraining data. |
| BioLLM Framework [8] | Software Tool | A unified interface that standardizes APIs for diverse scFMs, enabling seamless model switching and benchmarking. |
| Geneformer [5] [4] | Foundation Model | An encoder-based model pretrained on ~30 million cells using a gene ranking tokenization strategy. |
| scGPT [5] [4] | Foundation Model | A decoder-based model pretrained on ~33 million cells using gene value binning and attention masks. |
| Harmony [5] [4] | Algorithm | A robust baseline method for batch integration, often used for performance comparison. |
| scVI [5] [4] | Generative Model | A robust, probabilistic baseline model for cell embedding and batch correction. |

Single-cell foundation models represent a promising paradigm for analyzing cellular heterogeneity. While they are versatile tools, current evaluations indicate that their zero-shot performance can be inconsistent, and they may be outperformed by simpler, established methods in tasks like cell type clustering and batch integration [5] [4] [6]. This underscores the importance of rigorous zero-shot evaluation in their development and deployment. For researchers, the choice to use a complex scFM versus a simpler alternative should be guided by the specific task, dataset size, need for biological interpretability, and computational resources [4]. As the field matures with larger models like CellFM [3] and standardized evaluation frameworks like BioLLM [8], scFMs are poised to become more reliable and indispensable tools for unlocking deeper insights into cellular function and disease.

Single-cell foundation models (scFMs) represent a paradigm shift in computational biology, leveraging large-scale deep learning to interpret the complex language of cellular biology [1]. These models are trained on vast datasets containing tens of millions of single-cell transcriptomes, enabling a unified framework for analyzing cellular heterogeneity and regulatory networks [1]. The architecture of scFMs is predominantly built upon transformer-based neural networks, which process single-cell data through specialized tokenization methods and self-supervised pretraining objectives [1] [9]. Within the context of zero-shot learning—where models must perform tasks without any further training on the target data—the architectural choices of scFMs become critically important for enabling robust biological discovery [5].

Core Architectural Components of scFMs

Transformer Architectures: The Backbone of scFMs

The transformer architecture serves as the fundamental engine for most single-cell foundation models, providing the capacity to capture intricate, long-range relationships between genes within a cell [1]. These models primarily utilize two architectural variants:

  • Encoder-based architectures (e.g., scBERT): Employ bidirectional attention mechanisms that learn from all genes in a cell simultaneously, making them particularly effective for classification tasks and generating cell embeddings [1] [10].
  • Decoder-based architectures (e.g., scGPT, cell2sentence): Utilize unidirectional masked self-attention that iteratively predicts masked genes conditioned on known genes, excelling in generative tasks [1] [10].

The self-attention mechanism within transformers allows scFMs to learn which genes in a cell are most informative of the cell's identity or state, and how they co-vary across different cellular contexts [1]. This capability is essential for building models that can generalize to novel biological contexts without task-specific fine-tuning.
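The self-attention operation at the heart of these models reduces to a few matrix operations. The sketch below deliberately omits the learned query/key/value projection weights (so Q = K = V = X) to isolate the core computation:

```python
import numpy as np

def self_attention(X):
    """Single-head self-attention over gene tokens (projections omitted).

    Each token's output is an attention-weighted mixture of every
    token's representation; the weights come from a row-wise softmax
    of scaled dot-product similarity scores.
    """
    scores = X @ X.T / np.sqrt(X.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ X, weights

# 4 "gene tokens" with 8-dimensional embeddings
X = np.random.default_rng(0).normal(size=(4, 8))
out, attn = self_attention(X)
print(attn.round(2))  # each row is a probability distribution over tokens
```

In a real scFM the attention weights over gene tokens are what allow the model to learn which genes are most informative of a cell's state.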

Table 1: Comparison of Prominent scFM Architectures

| Model | Architecture Type | Tokenization Approach | Pretraining Data Scale | Key Applications |
| --- | --- | --- | --- | --- |
| scGPT [1] [9] | Decoder-based | Gene-level with expression binning | 33M+ human cells | Zero-shot annotation, multi-omic integration, perturbation prediction |
| Geneformer [1] [10] | Encoder-based | Gene-level with expression ranking | 27M+ human cells | Cell embedding, network inference |
| scBERT [1] | Encoder-based | Gene-level with expression binning | Millions of cells | Cell type annotation |
| cell2sentence (C2S) [10] | Decoder-based | Natural language tokenization | 57M+ human and mouse cells + biological texts | Cell type prediction, biological interpretation |
| Nicheformer [9] | Graph Transformer | Spatial context tokens | 53M+ spatially resolved cells | Spatial niche modeling, context prediction |

Tokenization Strategies: From Biology to Tokens

Tokenization converts raw gene expression data into discrete units that transformers can process, representing a critical adaptation of natural language processing techniques to biological data [1] [11]. Unlike words in a sentence, genes have no inherent sequential order, necessitating specialized approaches:

  • Gene ranking by expression: Genes are ordered within each cell by their expression levels, creating a deterministic sequence from highest to lowest expressed genes [1] [10]. This provides a consistent input structure but discards the absolute magnitudes of expression differences.
  • Expression value binning: Continuous expression values are discretized into a fixed number of bins, and the resulting bin indices serve as value tokens alongside gene identity tokens [1] [9].
  • Natural language tokenization: Some models like cell2sentence leverage existing language model tokenizers by representing gene sequences as natural language strings [10].

Each gene is typically represented as a token embedding that combines a gene identifier with its expression value [1]. Special tokens may be added to represent cell-level metadata, experimental batch information, or modality indicators (e.g., for multi-omics data) [1]. Positional encoding schemes are then adapted to represent the relative order or rank of each gene in the cell, providing necessary sequence context to the transformer [1].
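A minimal sketch of value binning follows. The quantile-based bin edges over a cell's nonzero values and the zero-handling convention are illustrative assumptions, not scGPT's exact scheme:

```python
import numpy as np

def bin_expression(expr, n_bins=51):
    """Map normalized expression values to discrete bin tokens.

    Bin edges are quantiles of the *nonzero* values within the cell, so
    tokens are comparable across cells with different sequencing depths.
    Bin 0 is reserved for unexpressed genes (an illustrative convention).
    """
    expr = np.asarray(expr, float)
    tokens = np.zeros(expr.shape, dtype=int)
    nz = expr > 0
    if nz.any():
        edges = np.quantile(expr[nz], np.linspace(0, 1, n_bins))
        tokens[nz] = np.digitize(expr[nz], edges[1:-1]) + 1
    return tokens

expr = np.array([0.0, 0.5, 1.2, 3.3, 0.0, 7.8])
print(bin_expression(expr, n_bins=5))
```

Each bin index would then be looked up in a learned value-embedding table and added to the corresponding gene identity embedding.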

Pretraining Objectives and Strategies

scFMs are pretrained using self-supervised learning on massive, diverse collections of single-cell data, enabling them to learn fundamental biological principles that generalize across tasks [1]. The primary pretraining objective is masked gene modeling, where:

  • The model is shown input data with a subset of genes withheld (masked) and must predict the expression of these masked genes based on the remaining genes in the cell [1] [6].
  • This task forces the model to learn the complex regulatory relationships and co-expression patterns between genes [1].
  • Through this process, the model develops a deep understanding of cellular "syntax" - how genes work together to define cell states and functions [1].

The pretraining data for scFMs is typically drawn from large-scale resources such as CZ CELLxGENE (containing over 100 million unique cells), the Human Cell Atlas, and other multiorgan atlases that provide broad coverage of cell types, states, and conditions [1]. Effective pretraining requires careful data selection, filtering of cells and genes, and balancing of dataset compositions to capture a wide spectrum of biological variation while mitigating technical noise and batch effects [1].
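The masked gene modeling objective can be expressed framework-agnostically. In this NumPy sketch the "model" is a trivial mean predictor standing in for a transformer, which makes the structure of the objective clear without any training machinery:

```python
import numpy as np

def masked_gene_loss(expr, predict_fn, mask_frac=0.15, rng=None):
    """Masked gene modeling objective (framework-agnostic sketch).

    Randomly hides `mask_frac` of genes, asks `predict_fn` to
    reconstruct the full profile from the corrupted input, and scores
    mean squared error on the hidden positions only.
    """
    rng = rng or np.random.default_rng()
    expr = np.asarray(expr, float)
    mask = rng.random(expr.shape) < mask_frac
    corrupted = np.where(mask, 0.0, expr)  # masked genes zeroed out
    pred = predict_fn(corrupted)
    return float(((pred[mask] - expr[mask]) ** 2).mean())

# A trivial "model" that predicts the batch-wide mean expression everywhere
mean_predictor = lambda x: np.full_like(x, x.mean())
cells = np.random.default_rng(1).poisson(2.0, size=(8, 100)).astype(float)
print(masked_gene_loss(cells, mean_predictor, rng=np.random.default_rng(2)))
```

A model that has learned real gene-gene dependencies should achieve a substantially lower masked loss than such a context-free baseline; as discussed earlier, failing to beat median/mean prediction is one warning sign reported for current scFMs [6].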

Experimental Protocols for scFM Development and Benchmarking

Protocol 1: Model Pretraining and Tokenization

Purpose: To train a foundational scFM from large-scale single-cell data using appropriate tokenization and self-supervised learning.

Materials and Reagents:

  • Computational Resources: High-performance computing cluster with multiple GPUs (e.g., NVIDIA A100 or H100)
  • Software: Python 3.8+, PyTorch or JAX, transformer libraries (Hugging Face Transformers, scGPT codebase)
  • Data: Curated single-cell datasets from CZ CELLxGENE, Human Cell Atlas, or GEO/SRA repositories

Procedure:

  • Data Curation and Quality Control
    • Download and harmonize single-cell datasets from public repositories [1]
    • Filter cells based on quality metrics (mitochondrial content, number of genes detected)
    • Filter genes to include those detected in a minimum number of cells (e.g., >0.1% of cells)
    • Perform basic normalization (e.g., library size normalization) without batch correction
  • Tokenization and Input Representation

    • For each cell, rank genes by expression levels from highest to lowest [1] [10]
    • Select top N genes (e.g., 2,000-5,000) based on this ranking as the "sentence" for that cell
    • Convert each gene to a token embedding combining:
      • Gene identifier embedding (learned)
      • Expression value embedding (via binning or continuous representation) [1]
    • Add special tokens for cell identity, batch information, or modality as needed [1]
    • Apply positional encodings based on the rank order of genes
  • Model Architecture Configuration

    • Initialize transformer architecture (encoder, decoder, or encoder-decoder)
    • Set model dimensions (hidden size, number of layers, attention heads) based on computational constraints
    • For decoder models: implement causal masking for autoregressive generation
    • For encoder models: implement bidirectional attention
  • Self-Supervised Pretraining

    • Implement masked gene modeling: randomly mask 15-20% of input tokens
    • Train model to predict expression values of masked genes using mean squared error or similar loss
    • Use AdamW optimizer with learning rate warmup and cosine decay
    • Train for sufficient iterations (typically 100,000+ steps) on multiple GPUs
  • Model Validation

    • Evaluate reconstruction accuracy on held-out validation cells
    • Assess whether model captures known biological relationships (e.g., gene-gene correlations)
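The warmup-plus-cosine-decay schedule from step 4 can be sketched as follows; `base_lr`, `warmup`, and `total` are illustrative hyperparameters, not recommendations:

```python
import math

def lr_schedule(step, base_lr=1e-4, warmup=1000, total=100_000):
    """Linear warmup followed by cosine decay to zero (sketch)."""
    if step < warmup:
        return base_lr * step / warmup          # ramp up linearly
    progress = (step - warmup) / (total - warmup)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))

print(lr_schedule(500), lr_schedule(1000), lr_schedule(100_000))
```

The schedule peaks at `base_lr` when warmup ends and decays smoothly to zero by the final step, which helps stabilize large-batch transformer pretraining.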

[Diagram: single-cell foundation model pretraining workflow: raw single-cell data → quality control & normalization → tokenization (gene ranking & embedding) → transformer architecture → self-supervised pretraining → pretrained scFM]

Protocol 2: Zero-Shot Performance Evaluation

Purpose: To assess the zero-shot capabilities of pretrained scFMs on downstream biological tasks without any fine-tuning.

Materials and Reagents:

  • Pretrained Models: scGPT, Geneformer, or custom pretrained scFMs
  • Benchmark Datasets: Curated evaluation datasets with known cell type labels (e.g., Tabula Sapiens, Pancreas datasets)
  • Baseline Methods: Traditional computational tools (scVI, Harmony) and simple feature selection (Highly Variable Genes)

Procedure:

  • Embedding Generation
    • Process held-out test datasets through the pretrained scFM without any parameter updates
    • Extract cell embeddings from the model's final layer or specialized cell token [5]
    • For comparison, generate embeddings using baseline methods (scVI, Harmony, HVG)
  • Cell Type Clustering Evaluation

    • Apply clustering algorithms (e.g., Louvain, Leiden) to all embedding types
    • Calculate clustering metrics comparing to ground truth cell type labels:
      • Average BIO score (AvgBio) [5]
      • Average silhouette width (ASW) [5]
      • Adjusted Rand Index (ARI)
    • Compare performance across methods and datasets
  • Batch Integration Assessment

    • Evaluate embeddings on datasets with known batch effects
    • Quantify batch mixing using:
      • Batch ASW (batchASW) - lower values indicate better integration [5]
      • Principal component regression (PCR) - proportion of variance explained by batch [5]
    • Visually inspect UMAP/t-SNE plots for batch mixing versus biological preservation
  • Gene Expression Prediction

    • Assess the model's ability to predict held-out gene expression values
    • Mask subsets of genes and compare predictions to true values
    • Calculate correlation coefficients between predicted and actual expression [6]
  • Statistical Analysis

    • Perform multiple runs with different random seeds
    • Use paired statistical tests to compare method performance
    • Report confidence intervals for performance metrics
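The paired comparison in step 5 can be done, for example, with a sign-flip permutation test on per-dataset metric differences. This is a dependency-free sketch; the ARI values below are made up purely for illustration:

```python
import numpy as np

def paired_permutation_test(scores_a, scores_b, n_perm=10_000, seed=0):
    """Paired permutation test on per-dataset metric differences.

    Randomly flips the sign of each paired difference to build a null
    distribution for the mean difference; the returned value is the
    two-sided p-value.
    """
    d = np.asarray(scores_a, float) - np.asarray(scores_b, float)
    rng = np.random.default_rng(seed)
    observed = abs(d.mean())
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = np.abs((flips * d).mean(axis=1))
    return float((null >= observed).mean())

# Hypothetical per-dataset ARI scores for two embedding methods
ari_a = [0.82, 0.79, 0.88, 0.75, 0.81, 0.84]
ari_b = [0.61, 0.70, 0.66, 0.64, 0.72, 0.68]
print(paired_permutation_test(ari_a, ari_b))  # small p-value
```

Pairing by dataset matters because benchmark datasets differ widely in difficulty; an unpaired test would conflate dataset variance with method differences.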

Table 2: Key Metrics for Zero-Shot Evaluation of scFMs

| Evaluation Dimension | Key Metrics | Ideal Outcome | Current scFM Performance |
| --- | --- | --- | --- |
| Cell Type Clustering [5] | AvgBIO, ASW, ARI | High scores (>0.8) indicating clear separation of cell types | Mixed: scGPT comparable to baselines on some datasets; Geneformer consistently underperforms |
| Batch Integration [5] | batchASW, PCR | Low batchASW, low PCR indicating minimal batch effects | Moderate: scGPT outperforms baselines on complex biological batches, underperforms on technical batches |
| Gene Expression Prediction [6] | Pearson correlation, MSE | High correlation, low error for context-specific genes | Limited: models often predict median expression; slight improvement for highly expressed "housekeeping" genes |
| Biological Conservation | Gene-gene correlation preservation | Maintenance of known biological relationships in embedding space | Varies by model and dataset |

[Diagram: zero-shot evaluation protocol: a pretrained scFM embeds an unseen test dataset; the resulting cell embeddings feed three parallel evaluations (cell type clustering, batch effect integration, gene expression prediction), which are combined into a performance report]

Table 3: Key Resources for scFM Research and Development

| Resource Category | Specific Tools/Databases | Function/Purpose | Access Information |
| --- | --- | --- | --- |
| Data Repositories [1] | CZ CELLxGENE Discover, DISCO, Human Cell Atlas | Provide standardized, curated single-cell datasets for model training and benchmarking | Publicly available web portals with API access |
| Pretrained Models [1] [9] [10] | scGPT, Geneformer, cell2sentence, scPlantFormer | Offer pretrained foundation models for transfer learning and zero-shot evaluation | Hugging Face, GitHub repositories, BioLLM framework |
| Computational Frameworks [9] | BioLLM, scGNN+, scVI, Harmony | Provide standardized benchmarking, automated workflows, and baseline comparisons | Open-source Python packages |
| Evaluation Benchmarks [5] | Pancreas dataset, Tabula Sapiens, Immune cell atlas | Curated datasets with known ground truth for systematic model evaluation | Publicly available with standardized preprocessing |
| Interpretability Tools [10] | Transcoders, sparse autoencoders, circuit analysis | Enable mechanistic interpretation of model decisions and biological insights | Custom implementations building on transcoder frameworks |

Critical Analysis and Future Directions

The architecture of current scFMs shows promise but faces significant challenges in zero-shot settings. Recent evaluations reveal that even prominent models like scGPT and Geneformer underperform simpler methods like Highly Variable Genes (HVG) selection or established tools like Harmony and scVI in cell type clustering and batch integration tasks [5] [6]. This performance gap suggests that the masked gene modeling pretraining objective may not be sufficient for developing robust cellular representations that transfer effectively to downstream tasks without fine-tuning [5].

A key limitation stems from the fundamental difference between natural language and biological systems. While language has inherent sequential structure, gene expression data lacks natural ordering, requiring artificial sequencing through gene ranking [1] [11]. This artificial structure may not optimally capture biological relationships. Additionally, current models struggle with polysemanticity in gene expression, where the same gene may play different roles in different cellular contexts [11].

Future architectural improvements should focus on:

  • Multimodal Integration: Incorporating additional data modalities (spatial context, epigenomics, proteomics) to provide richer biological context [1] [9].
  • Advanced Interpretability: Applying techniques like transcoders to extract biologically meaningful circuits from model weights [10].
  • Geometric Learning: Developing embedding spaces that better reflect the underlying biological manifold structure [11].
  • Specialized Pretraining: Designing pretraining objectives that more directly align with downstream zero-shot tasks.

For researchers focusing on zero-shot learning with scFMs, rigorous evaluation using the protocols outlined here is essential before deploying these models in discovery settings where labeled data is unavailable [5]. The field would benefit from standardized benchmarks and evaluation practices that properly assess true biological understanding rather than exploiting statistical artifacts [5] [6].

What is Zero-Shot Learning? The Promise of Annotation-Free Discovery

Zero-shot learning (ZSL) represents a paradigm shift in machine learning, enabling models to recognize and categorize objects or concepts without having seen any labeled examples of those specific categories during training [12]. This approach stands in contrast to traditional supervised learning, which requires vast amounts of annotated data for each class the model needs to identify.

In the context of single-cell biology, ZSL offers transformative potential for uncovering novel biological insights without the bottleneck of manual cell annotation [13]. As single-cell technologies generate increasingly massive datasets, the ability to perform annotation-free discovery becomes crucial for identifying novel cell types, rare disease-associated cells, and complex cellular states that may lack established reference data [1] [13]. This protocol explores the application of ZSL principles through single-cell foundation models (scFMs) to advance biological discovery while highlighting current limitations and evaluation benchmarks.

Theoretical Foundations of Zero-Shot Learning

Core Mechanism and Definitions

ZSL operates by leveraging auxiliary information to bridge the gap between classes seen during training and unseen classes encountered during inference [12] [14]. Instead of learning explicit decision boundaries for every possible class, ZSL models learn to map inputs into a semantic space where relationships between concepts can be measured through similarity metrics.

Key Definitions:

  • Seen Classes: Categories for which labeled examples are available during training.
  • Unseen Classes: Categories for which no labeled examples are available during training.
  • Auxiliary Information: Semantic representations that describe classes (e.g., textual descriptions, attributes, or embeddings) [14].
  • Generalized Zero-Shot Learning (GZSL): A more challenging setting where test samples may come from both seen and unseen classes [12] [14].
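Embedding-based ZSL reduces to a nearest-neighbor search in the shared semantic space. The toy sketch below assigns a sample to the class whose auxiliary embedding is most cosine-similar; all embeddings here are made-up 3-dimensional vectors for illustration:

```python
import numpy as np

def zero_shot_classify(sample_emb, class_embs, class_names):
    """Embedding-based ZSL: pick the class whose auxiliary embedding
    has the highest cosine similarity to the sample embedding."""
    sims = {
        name: float(np.dot(sample_emb, emb) /
                    (np.linalg.norm(sample_emb) * np.linalg.norm(emb)))
        for name, emb in zip(class_names, class_embs)
    }
    return max(sims, key=sims.get), sims

# Toy 3-d semantic space; "NK cell" has no labeled training examples,
# only an auxiliary (description-derived) embedding.
classes = ["T cell", "B cell", "NK cell"]
class_embs = np.array([[1, 0, 0], [0, 1, 0], [0.7, 0, 0.7]], float)
pred, sims = zero_shot_classify(np.array([0.6, 0.1, 0.8]), class_embs, classes)
print(pred)
```

The key property is that nothing about the unseen class appears in training; only its auxiliary representation is needed at inference time, which is exactly what makes annotation-free cell type assignment conceivable.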
Technical Approaches in ZSL

Table 1: Comparison of Zero-Shot Learning Technical Approaches

| Approach | Mechanism | Auxiliary Information | Common Applications |
| --- | --- | --- | --- |
| Attribute-Based | Learns to recognize class-descriptive attributes (e.g., "has wings," "is furry") and composes them to identify unseen classes [12] [15] | Manually defined attribute vectors | Computer vision, object recognition |
| Embedding-Based | Maps both input features and class labels into a shared semantic space where classification is determined by similarity [12] | Word embeddings (Word2Vec, GloVe, BERT), language model representations | Cross-modal retrieval, image captioning |
| Transfer Learning-Based | Leverages knowledge gained from pretraining on large datasets and adapts it to new tasks without additional labeled examples [12] [16] | Pretrained model parameters | Natural language processing, single-cell biology |

Zero-Shot Learning in Single-Cell Foundation Models

Architecture of Single-Cell Foundation Models

Single-cell foundation models (scFMs) are large-scale deep learning models pre-trained on massive single-cell datasets, typically using transformer architectures [1]. These models aim to learn universal biological principles that can be transferred to various downstream tasks with minimal or no additional training.

The typical scFM processing workflow involves:

  • Tokenization: Converting raw gene expression data into discrete tokens, often by ranking genes by expression levels or binning expression values [1]
  • Embedding: Transforming tokens into vector representations using gene embeddings, value embeddings, and positional embeddings
  • Transformer Processing: Applying self-attention mechanisms to model complex gene-gene interactions and capture biological patterns [1] [17]
  • Output Generation: Producing latent representations of cells and genes that encode biological knowledge
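The rank-based tokenization in the first step can be illustrated with a small sketch; the gene names, expression values, and sequence length here are invented for illustration, and real models use vocabularies of roughly 20,000 genes with much longer contexts.

```python
import numpy as np

def rank_value_tokenize(expr, gene_names, top_n=4):
    """Rank-based tokenization sketch: order genes by descending expression
    and keep the top_n nonzero genes as the token sequence."""
    order = np.argsort(expr)[::-1][:top_n]
    return [gene_names[i] for i in order if expr[i] > 0]

genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"]
cell = np.array([5.2, 0.0, 1.1, 7.8, 0.3])
tokens = rank_value_tokenize(cell, genes)
# Highest-expressed genes come first; zero-count genes are dropped.
```

Because rank order itself carries the expression information, models using this scheme (e.g., Geneformer) can omit a separate value embedding.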

[Diagram: Single-Cell Expression Matrix → Tokenization (Genes as Tokens) → Embedding Layer (Gene + Value + Position) → Transformer Encoder (Self-Attention Mechanism) → Latent Embeddings (Cell & Gene Representations)]

Diagram 1: Single-Cell Foundation Model Architecture

Zero-Shot Capabilities of scFMs

In theory, scFMs should enable zero-shot biological discovery by leveraging knowledge acquired during pre-training. Potential applications include:

  • Novel cell type identification without reference atlases
  • Cross-tissue and cross-species generalization
  • Rare disease cell detection without labeled examples
  • Cellular perturbation prediction for unseen compounds or genetic manipulations

However, recent evaluations of popular scFMs like Geneformer and scGPT reveal significant limitations in their zero-shot capabilities [5]. When assessed on tasks such as cell type clustering and batch integration without fine-tuning, these models frequently underperform simpler baseline methods like Highly Variable Genes (HVG) selection or established integration tools like Harmony and scVI [5].

Table 2: Zero-Shot Performance of Single-Cell Foundation Models on Benchmark Tasks

| Model | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Novel Cell Type Detection | Reference |
|---|---|---|---|---|
| scGPT | Variable performance across datasets; outperformed by baselines on most benchmarks | Moderate performance on technical batches; struggles with biological variation | Limited evaluation available | [5] |
| Geneformer | Consistently underperforms HVG selection and established baselines | Poor performance; embeddings often dominated by batch effects | Not rigorously evaluated | [5] |
| HVG Baseline | Superior performance across most benchmarking datasets | Best overall performance in full-dimensional metrics | Not applicable | [5] |
| scVI | Strong performance on most datasets | Excellent technical batch correction; challenges with biological variation | Not applicable | [5] |

Protocols for Annotation-Free Discovery in Single-Cell Biology

Mixture Modeling for Multiple-Instance Learning (MMIL)

For scenarios where only patient-level labels are available (e.g., disease status) but individual cell labels are unknown, the MMIL protocol provides a practical approach for annotation-free cell classification [13].

Experimental Workflow:

[Diagram: Input Data (healthy-donor cells, all baseline; patient cells, mixture of baseline and disease) → Initialization of cell label probabilities → E-Step (estimate cell labels using current classifier) → M-Step (train classifier using estimated cell labels) → Convergence check (loop back to E-Step until stable) → Output (trained classifier and cell probability estimates)]

Diagram 2: MMIL Algorithm Workflow

Step-by-Step Protocol:

  • Data Preparation

    • Collect single-cell data from healthy donors (all cells considered baseline)
    • Collect single-cell data from diseased patients (mixture of baseline and disease-associated cells)
    • Process data using standard normalization and feature selection techniques
  • Model Initialization

    • Initialize probabilities for each cell belonging to baseline or disease-associated class
    • Set parameters: ρ (proportion of baseline cells in patients) and ζ (fraction of patient-derived cells in prediction population)
  • Expectation-Maximization Iteration

    • E-Step: Estimate cell labels using current classifier probabilities
    • M-Step: Train classifier using estimated cell labels
    • Repeat until convergence of model parameters
  • Validation and Interpretation

    • Evaluate using cross-validation against any available gold-standard labels
    • Perform sensitivity analysis on parameter ρ
    • Interpret selected features for biological relevance
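A toy version of the expectation-maximization loop above, assuming a one-dimensional marker and a simple two-mean Gaussian classifier with equal variances; the real MMIL method uses richer classifiers and the ρ/ζ parameters, which this sketch omits.

```python
import numpy as np

def mmil_em(healthy, patient, n_iter=20):
    """EM loop in miniature: healthy cells are fixed baseline; patient cells
    get soft disease labels that are re-estimated on each iteration."""
    mu_b = healthy.mean()                  # baseline mean from healthy donors
    mu_d = patient.max()                   # crude initialization of disease mean
    resp = np.full(len(patient), 0.5)      # P(disease-associated) per patient cell
    for _ in range(n_iter):
        # E-step: responsibility of the disease component for each patient cell
        d_b = (patient - mu_b) ** 2
        d_d = (patient - mu_d) ** 2
        resp = np.exp(-d_d) / (np.exp(-d_d) + np.exp(-d_b))
        # M-step: refit means; baseline pools healthy cells and soft-baseline cells
        w_b = 1.0 - resp
        mu_b = (healthy.sum() + (w_b * patient).sum()) / (len(healthy) + w_b.sum())
        mu_d = (resp * patient).sum() / resp.sum()
    return resp, mu_b, mu_d

rng = np.random.default_rng(0)
healthy = rng.normal(0.0, 0.5, 200)                    # baseline marker level
patient = np.concatenate([rng.normal(0.0, 0.5, 150),   # baseline cells in patients
                          rng.normal(3.0, 0.5, 50)])   # disease-associated cells
resp, mu_b, mu_d = mmil_em(healthy, patient)
```

After a few iterations the disease-associated patient cells receive high responsibilities even though no cell-level labels were ever supplied, only the donor-level distinction between healthy and patient samples.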

Application Example: MMIL was successfully applied to detect leukemia cells in acute myeloid leukemia (AML) using mass cytometry data, achieving performance approaching that of a hematopathologist despite using only patient-level labels during training [13]. The method also demonstrated strong generalization across different tissues, treatment time points, and identification of minimal residual disease (MRD) cells.

Zero-Shot Evaluation Protocol for scFMs

To rigorously assess the zero-shot capabilities of single-cell foundation models, implement the following evaluation protocol:

  • Embedding Extraction

    • Obtain cell embeddings from pre-trained scFMs without any fine-tuning
    • Use default model settings and recommended preprocessing steps
  • Task-Specific Evaluation

    • Cell Type Clustering: Apply clustering algorithms to embeddings and compare to known cell type annotations using metrics like Average BIO Score (AvgBIO) and Average Silhouette Width (ASW)
    • Batch Integration: Assess ability to remove technical artifacts while preserving biological variation using metrics like iLISI and principal component regression (PCR)
    • Novel Cell Type Detection: Evaluate performance on intentionally held-out cell types
  • Baseline Comparison

    • Compare against simple baselines including HVG selection, scVI, and Harmony
    • Use consistent evaluation metrics across all methods
    • Perform statistical testing to determine significance of performance differences
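One of the metrics named above, Average Silhouette Width, can be computed directly from its definition: s(i) = (b_i − a_i) / max(a_i, b_i), where a_i is the mean distance to the cell's own cluster and b_i the mean distance to the nearest other cluster. A minimal sketch for small embedding matrices; production work would use an optimized implementation such as scikit-learn's.

```python
import numpy as np

def silhouette_width(X, labels):
    """Average silhouette width computed from its definition for a small
    (cells x dims) embedding matrix with integer cluster labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i, lab in enumerate(labels):
        same = labels == lab
        same[i] = False
        if not same.any():
            continue                       # skip singleton clusters
        a = dists[i, same].mean()
        b = min(dists[i, labels == other].mean()
                for other in set(labels.tolist()) if other != lab)
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two well-separated toy "cell type" clusters in a 2-D embedding space:
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])
asw = silhouette_width(X, labels)          # close to 1 for clean separation
```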

Essential Research Reagents and Computational Tools

Table 3: Research Reagent Solutions for Zero-Shot Learning in Single-Cell Biology

| Resource Category | Specific Tools/Platforms | Function and Application |
|---|---|---|
| Pre-trained Models | scGPT, Geneformer, UCE, scFoundation, LangCell, scCello [17] | Provide foundational biological knowledge for transfer to new tasks without extensive retraining |
| Benchmark Datasets | CZ CELLxGENE, Human Cell Atlas, PanglaoDB, Asian Immune Diversity Atlas (AIDA) v2 [1] [17] | Curated single-cell datasets with high-quality annotations for model evaluation and development |
| Evaluation Frameworks | scGraph-OntoRWR, Lowest Common Ancestor Distance (LCAD), AvgBIO Score, iLISI [17] | Specialized metrics for assessing biological relevance and technical performance of zero-shot methods |
| Baseline Methods | Highly Variable Genes (HVG) selection, Harmony, scVI, Seurat [5] [17] | Established computational methods for comparison against novel zero-shot approaches |

Discussion and Future Directions

The promise of annotation-free discovery in single-cell biology through zero-shot learning remains compelling, though current implementations face significant challenges. While methods like MMIL demonstrate practical pathways for cell classification without complete labels [13], the zero-shot performance of large foundation models requires substantial improvement to fulfill their theoretical potential [5].

Critical areas for future development include:

  • Improving pretraining objectives to yield more biologically meaningful representations
  • Developing better evaluation protocols that capture real-world discovery scenarios
  • Addressing domain shift between pretraining data and target applications
  • Creating more transparent and interpretable model architectures

As these challenges are addressed, zero-shot learning approaches are poised to transform single-cell research by enabling truly exploratory analysis unconstrained by pre-existing annotations, potentially accelerating discovery of novel cell types, disease mechanisms, and therapeutic targets.

Single-cell foundation models (scFMs) represent a transformative approach in computational biology, leveraging large-scale deep learning architectures pretrained on vast single-cell datasets to enable a wide range of downstream tasks [1]. These models typically employ transformer-based architectures that learn the fundamental "language" of cells by processing gene expression profiles as textual sequences, where individual genes serve as tokens and complete cell profiles form sentences [1]. The pretraining phase is critical for developing models that can generalize across diverse biological contexts and perform effectively in zero-shot learning scenarios—where models must make predictions on new data without task-specific fine-tuning [5].

The performance and generalizability of scFMs are fundamentally constrained by the quality, scale, and diversity of their pretraining data. Large, well-annotated, and standardized datasets allow models to capture the complex biological variation present across tissues, cell types, developmental stages, and disease states [1]. This application note provides a comprehensive overview of three pivotal public data resources—CELLxGENE, GEO, and the Human Cell Atlas—that collectively provide the foundational data infrastructure for developing robust scFMs capable of strong zero-shot performance.

Data Resource Comparative Analysis

Table 1: Key Characteristics of Public Data Sources for scFM Pretraining

| Resource | Primary Content | Data Scale | Access Method | Update Frequency | Embeddings/Models |
|---|---|---|---|---|---|
| CZ CELLxGENE Discover | Standardized single-cell transcriptomics data from healthy human and mouse tissues [18] | 33M+ unique cells; 436 datasets; 2.7K+ cell types [18] | Web portal; Census API (Python/R) [19] | Weekly (latest); Long-Term Supported (LTS) releases every 6 months [19] | scVI, scGPT, Geneformer, UCE embeddings [20] [19] |
| Human Cell Atlas (HCA) | Multimodal single-cell data from international consortium; tissue-specific biological networks [21] [22] | 30M+ cells (as of 2022); regular additions of new projects [22] | Data Portal; managed access for controlled data [22] | Regular monthly updates with new projects and tissues [22] | Spatial transcriptomics; emerging atlas-specific embeddings |
| NCBI GEO | Heterogeneous omics data from individual studies; microarray and sequencing data | Not centrally quantified | Web portal; programmatic access | Continuous submission | Limited standardized embeddings |

Qualitative Assessment of Resource Utility

Table 2: Strategic Application of Data Resources in scFM Development

| Resource | Strengths for scFM Pretraining | Limitations for scFM Pretraining | Optimal Use Cases |
|---|---|---|---|
| CELLxGENE | Standardized processing: uniform data curation and annotation enables seamless integration [18]. Dedicated embeddings: precomputed embeddings (scVI, scGPT) facilitate transfer learning [19]. Reproducible access: versioned Census releases ensure model reproducibility [19]. | Limited modality diversity: primarily focused on transcriptomics, with emerging multimodal support [18]. | Primary pretraining corpus: ideal for building generalizable foundational models. Benchmarking: standardized data enables fair model comparisons. |
| Human Cell Atlas | Spatial context: increasing spatial transcriptomics data provides architectural context [21]. Tissue networks: organized by biological systems (e.g., Lung Network, Heart Network) [22]. Diversity focus: explicit emphasis on population diversity in recent initiatives [21]. | Data heterogeneity: variable processing pipelines can introduce technical artifacts. Access complexity: managed access requirements for some datasets create barriers [22]. | Specialized scFMs: tissue-specific or spatially-aware foundation models. Diversity enhancement: augmenting training data with population variation. |
| NCBI GEO | Extensive repository: largest collection of diverse omics datasets. Methodological breadth: captures a wide range of experimental protocols and conditions. | Standardization challenges: heterogeneous processing requires significant preprocessing. Metadata inconsistency: variable annotation quality complicates data integration. | Data augmentation: supplementing primary training corpora with specialized datasets. Transfer learning evaluation: testing model generalization across heterogeneous data. |

Experimental Protocols for Data Utilization in scFM Research

Protocol 1: Constructing a Pretraining Corpus from CELLxGENE Census

Principle: Assemble a high-quality, diverse pretraining dataset from CELLxGENE Census that maximizes biological variation while minimizing technical artifacts [1] [19].

Procedure:

  • Census Access: Utilize the CELLxGENE Census API to access the most recent Long-Term Supported (LTS) release for reproducibility [19].

  • Quality Filtering: Apply uniform quality control metrics—retain cells with gene counts between 500-5000 and mitochondrial reads below 20% to remove low-quality cells and potential artifacts [1].

  • Gene Selection: Filter for protein-coding genes expressed in at least 0.1% of cells to focus on biologically relevant features and reduce noise [1].

  • Dataset Balancing: Strategically sample cells across tissues, donors, and conditions to prevent bias toward overrepresented populations (e.g., blood cells) [1].

  • Metadata Integration: Incorporate standardized metadata (tissue, cell type, development stage, disease status) as conditional inputs or for stratified sampling [18] [19].

  • Train-Validation Split: Partition data at the donor or study level to prevent data leakage and ensure realistic evaluation of model generalizability.

Technical Considerations:

  • Batch Effects: Preserve study identifiers as batch labels for potential correction during model training or evaluation [1].
  • Reproducibility: Record exact Census version (e.g., "2025-01-30") and all filtering parameters for experimental replication [19].
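The quality-filtering step (step 2) can be sketched as follows. The dense matrix layout, the human "MT-" prefix convention for identifying mitochondrial genes, and the tiny thresholds in the demo are assumptions of this illustration; real corpora use sparse matrices and the 500-5000 gene / 20% mitochondrial thresholds from the protocol.

```python
import numpy as np

def qc_filter(counts, gene_names, min_genes=500, max_genes=5000, max_mito=0.20):
    """Keep cells with min_genes-max_genes detected genes and a mitochondrial
    read fraction below max_mito. `counts` is a cells x genes raw count matrix."""
    genes_detected = (counts > 0).sum(axis=1)
    is_mito = np.array([g.startswith("MT-") for g in gene_names])
    mito_frac = counts[:, is_mito].sum(axis=1) / np.maximum(counts.sum(axis=1), 1)
    keep = (genes_detected >= min_genes) & (genes_detected <= max_genes) \
           & (mito_frac < max_mito)
    return counts[keep], keep

# Toy demo with deliberately small thresholds:
genes = ["MT-CO1", "ACTB", "GAPDH", "CD3D"]
counts = np.array([[1, 5, 4, 2],    # passes QC
                   [9, 1, 0, 0],    # too few genes, high mito fraction
                   [0, 0, 0, 1]])   # too few genes
filtered, keep = qc_filter(counts, genes, min_genes=3, max_genes=10, max_mito=0.2)
```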

Protocol 2: Zero-Shot Evaluation of scFMs for Cell Type Annotation

Principle: Evaluate scFM embeddings zero-shot for cell type annotation to assess inherent biological understanding without task-specific fine-tuning [5].

Procedure:

  • Benchmark Curation: Compile evaluation datasets encompassing diverse tissues and technologies not seen during pretraining (e.g., Tabula Sapiens, Pancreas, PBMC datasets) [5].
  • Embedding Generation: Pass held-out datasets through the pretrained scFM without updating model weights to generate cell embeddings in a zero-shot manner [5].

  • Baseline Comparison: Compare against established methods including:

    • Highly Variable Genes (HVG): Standardized feature selection followed by PCA.
    • Harmony: Batch integration method for removing technical variation.
    • scVI: Probabilistic generative model for single-cell data [5].
  • Quantitative Metrics: Calculate multiple complementary performance metrics:

    • Average BIO Score: Measures cell type clustering purity and separation.
    • Average Silhouette Width (ASW): Quantifies cluster compactness and distinction [5].
  • Qualitative Assessment: Visualize embeddings using UMAP or t-SNE to inspect cell type separation and batch integration.

Critical Interpretation:

  • Current scFMs (scGPT, Geneformer) may underperform simpler methods (HVG, scVI) in zero-shot cell type annotation, highlighting limitations in their pretrained biological representations [5].
  • Performance varies significantly across tissues and technologies, indicating context-dependent utility [5].

[Diagram: Load Pretrained scFM + Input Hold-Out Single-Cell Dataset → Generate Cell Embeddings (Zero-Shot, No Fine-Tuning) → Compare Against Baseline Methods → Calculate Performance Metrics (AvgBIO, ASW) → Evaluate Model Generalization]

Table 3: Critical Computational Tools for scFM Development and Evaluation

| Resource Category | Specific Tools/Platforms | Primary Function in scFM Research |
|---|---|---|
| Data Repositories | CZ CELLxGENE Census [18] [19] | Provides standardized, analysis-ready single-cell data for model pretraining. |
| Data Repositories | HCA Data Portal [22] | Supplies diverse, multi-tissue single-cell data with spatial context. |
| Model Architectures | scGPT [1] [5] | Transformer-based foundation model for single-cell biology using GPT architecture. |
| Model Architectures | Geneformer [5] | Transformer model trained on transcriptomic data for cellular network inference. |
| Evaluation Frameworks | Zero-shot benchmarking pipeline [5] | Standardized protocol for assessing scFM performance without fine-tuning. |
| Analysis Ecosystems | Scanpy, Seurat | Standard single-cell analysis toolkits for preprocessing and evaluation. |
| Analysis Ecosystems | TensorFlow, PyTorch | Deep learning frameworks for model implementation and training. |

Critical Analysis of Current Limitations and Future Directions

Despite their transformative potential, current scFMs face significant challenges in zero-shot learning scenarios. Recent evaluations reveal that proposed foundation models like scGPT and Geneformer may underperform simpler baseline methods (e.g., HVG selection, scVI, Harmony) on tasks including cell type clustering and batch integration when applied zero-shot [5]. This performance gap suggests potential limitations in how effectively these models learn transferable biological principles during pretraining.

Key limitations impacting zero-shot performance include:

  • Architectural Constraints: The masked language model pretraining objective may not optimally capture biological relationships essential for zero-shot generalization [5].

  • Data Quality Variation: Inconsistencies in data quality and processing across studies introduce confounding technical artifacts that models must disentangle [1].

  • Interpretability Challenges: Extracting biologically meaningful insights from the latent representations of scFMs remains nontrivial, complicating model debugging and improvement [1].

Future development should prioritize:

  • Improved Pretraining Objectives: Designing tasks that explicitly encourage learning of biological mechanisms rather than technical correlations.
  • Multimodal Integration: Incorporating simultaneous training on transcriptomic, epigenetic, proteomic, and spatial data to create more comprehensive cellular representations [1].
  • Standardized Evaluation: Establishing unified benchmarks for zero-shot performance assessment across diverse biological tasks [5].

[Diagram: Current limitations mapped to future directions — Architectural Constraints → Improved Pretraining Objectives; Data Quality Variation → Multimodal Integration; Interpretability Challenges → Standardized Evaluation]

The development of robust single-cell foundation models with strong zero-shot learning capabilities depends critically on strategic utilization of public data resources. CELLxGENE provides the most standardized and accessible pretraining corpus, while the Human Cell Atlas offers valuable spatial and tissue-specific data, and GEO supplies specialized datasets for augmentation. Researchers must carefully consider the tradeoffs between standardization, scale, and diversity when constructing pretraining corpora. Rigorous zero-shot evaluation remains essential for validating true biological understanding rather than dataset-specific memorization. As these data resources continue to expand and evolve, they will undoubtedly enable the next generation of scFMs capable of genuine biological discovery through zero-shot inference.

Masked Gene Modeling (MGM) has emerged as a predominant self-supervised pretraining task for single-cell foundation models (scFMs). Inspired by masked language modeling in natural language processing, MGM trains models to reconstruct randomly masked portions of a cell's gene expression profile. This task forces the model to learn the underlying biological principles and complex gene-gene relationships that define cellular states, enabling the development of general-purpose representations transferable to diverse downstream analyses in a zero-shot manner [1] [17].

The core premise is that by exposing a model to millions of cells encompassing myriad tissues and conditions, it can learn fundamental, transferable patterns of biology. During pretraining, models develop rich internal representations of cells and genes that can be applied to new datasets without additional task-specific training, which is crucial for exploratory biological research where labels are often unknown or costly to obtain [5] [1].

Key Architectural Components and Implementation

Tokenization and Input Representation

A critical step in adapting transformer architectures to single-cell RNA-seq (scRNA-seq) data is tokenization—converting raw gene expression values into discrete input units. Unlike words in a sentence, genes lack a natural sequential order, necessitating specific strategies to structure the model input.

Common Tokenization Strategies:

  • Gene Ranking by Expression: Genes are ordered based on their expression magnitude within each cell, creating a deterministic sequence of the top N genes [1] [17]. This approach is used by models like Geneformer and LangCell [17].
  • Expression Value Binning: Continuous expression values are discretized into bins or categories, and the binned values are used as inputs [1]. scGPT, for instance, employs value binning for its value embeddings [17].
  • Normalized Counts: Some models, such as scFoundation, forgo discretization and directly use normalized counts, projecting the continuous values into an embedding space [17].

Table 1: Input Representation in Selected Single-Cell Foundation Models

| Model Name | # Input Genes | Value Embedding | Gene Symbol Embedding | Positional Embedding |
|---|---|---|---|---|
| Geneformer [17] | 2048 ranked genes | Ordering | Lookup Table (512d) | |
| scGPT [17] | 1200 HVGs | Value binning | Lookup Table (512d) | × |
| scFoundation [17] | ~19,000 genes | Value projection | Lookup Table (768d) | × |
| UCE [17] | 1024 sampled genes | / | Protein Embedding (5120d) | |

After tokenization, all tokens are converted into embedding vectors, which are processed by the transformer layers. Special tokens, such as those representing cell identity or assay modality, may be prepended to provide additional context [1].
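The value-binning strategy listed above can be sketched as equal-frequency (quantile) binning of a cell's nonzero expression values, so each bin ID can be looked up in a value-embedding table. The bin count, the quantile scheme, and reserving bin 0 for zeros are choices of this sketch; real models (e.g., scGPT) define their own binning.

```python
import numpy as np

def bin_expression(values, n_bins=5):
    """Discretize a cell's nonzero expression values into equal-frequency
    (quantile) bins; bin 0 is reserved for zero counts."""
    tokens = np.zeros(len(values), dtype=int)
    nz = values > 0
    if nz.any():
        # Interior quantile cut points over the nonzero values only
        edges = np.quantile(values[nz], np.linspace(0, 1, n_bins + 1)[1:-1])
        tokens[nz] = np.digitize(values[nz], edges) + 1
    return tokens

cell = np.array([0.0, 0.2, 1.1, 3.5, 7.8, 0.0])
bins = bin_expression(cell, n_bins=4)
```

Quantile binning makes the token distribution roughly uniform within a cell, which keeps rare high-expression values from collapsing most genes into a single bin.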

Model Architectures and Pretraining Objectives

Most scFMs are built on the transformer architecture. Two primary variants are employed:

  • Encoder-only Models (BERT-like): These models use a bidirectional attention mechanism, meaning all genes in a cell can attend to all other genes simultaneously to reconstruct the masked ones. This architecture is well-suited for tasks focused on generating high-quality cell and gene embeddings for classification and analysis [1]. scBERT is an example of this approach [1].
  • Decoder-only Models (GPT-like): These models use a unidirectional or masked self-attention mechanism, where the model predicts the next or a masked gene conditioned only on the preceding, known genes. This architecture is often used for generative tasks [1]. scGPT is a prominent decoder-based model [1] [17].

The primary pretraining objective is the reconstruction of masked gene expression values. The model is trained to minimize the difference between the predicted and actual expression values for the masked genes, using losses such as Mean Squared Error (MSE) or Cross-Entropy (CE) [17].
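The masked-reconstruction objective can be written down in a few lines. Here a stand-in "model" (a lambda predicting the cell's mean) replaces the transformer, and masking with zeros, the 15% mask fraction, and the toy expression vector are all assumptions of the sketch rather than any specific model's recipe.

```python
import numpy as np

def mgm_mse_loss(expr, predict_fn, mask_frac=0.15, rng=None):
    """Masked gene modeling objective in miniature: hide a random fraction
    of the expression vector, reconstruct it, and score MSE on the masked
    positions only."""
    rng = rng if rng is not None else np.random.default_rng(0)
    mask = rng.random(len(expr)) < mask_frac
    masked = expr.copy()
    masked[mask] = 0.0                  # zero stands in for a [MASK] token
    pred = predict_fn(masked)
    return float(np.mean((pred[mask] - expr[mask]) ** 2)), mask

# Stand-in "model": predicts the cell's mean expression for every gene.
expr = np.array([5.2, 0.1, 3.1, 2.7, 1.5, 4.0, 0.0, 2.2])
loss, mask = mgm_mse_loss(expr, lambda x: np.full_like(x, x.mean()))
```

Only the masked positions contribute to the loss, which is what forces the model to infer hidden expression values from the visible gene context rather than copy its input.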

[Diagram: Input cell (Gene A: 5.2, Gene B: 0.1, [MASK], Gene D: 2.7, Gene E: 1.5) → Tokenizer & Embedding → Transformer Model → Predicted value for masked Gene C: 3.8]

Figure 1: Masked Gene Modeling Workflow. A portion of the input gene expression vector is masked, and the model is trained to predict the original values.

Quantitative Performance of MGM-Pretrained Models

Evaluating the zero-shot performance of scFMs—where pretrained models are applied directly to new tasks without fine-tuning—is critical for assessing the true generalizable knowledge acquired during pretraining. This is especially important in discovery settings where labels are unknown [5].

Performance on Cell Type Clustering

In zero-shot cell type clustering, embeddings from MGM-pretrained models are used directly for clustering, and the results are compared to known cell type labels.

Table 2: Zero-shot Cell Type Clustering Performance (AvgBIO Score) [5]

| Model / Method | PBMC (12k) | Pancreas | Immune | Tabula Sapiens |
|---|---|---|---|---|
| HVG (Baseline) | 0.65 | 0.62 | 0.69 | 0.66 |
| scVI (Baseline) | 0.63 | 0.65 | 0.66 | 0.64 |
| Harmony (Baseline) | 0.64 | 0.63 | 0.65 | 0.63 |
| scGPT | 0.67 | 0.59 | 0.60 | 0.61 |
| Geneformer | 0.55 | 0.52 | 0.55 | 0.54 |

As shown in Table 2, established baselines like Highly Variable Genes (HVG), scVI, and Harmony often outperform or match the performance of foundation models like scGPT and Geneformer in this zero-shot setting. This suggests that MGM pretraining does not automatically guarantee superior cell type separation without fine-tuning [5].

Performance on Batch Integration

Batch integration aims to remove technical variations between datasets while preserving biological differences. Performance is measured by how well batch effects are mixed (Batch Mixing Score) and how much biological information is retained (Cell-type ASW).

Table 3: Zero-shot Batch Integration Performance [5]

| Model / Method | Batch Mixing Score (↑) | Cell-type ASW (↑) | PCR Batch (↓) |
|---|---|---|---|
| HVG (Baseline) | 0.72 | 0.63 | 0.41 |
| scVI (Baseline) | 0.68 | 0.65 | 0.38 |
| Harmony (Baseline) | 0.65 | 0.64 | 0.45 |
| scGPT | 0.63 | 0.59 | 0.49 |
| Geneformer | 0.51 | 0.53 | 0.68 |

In batch integration, HVG selection again demonstrates strong performance. Geneformer's embeddings, in particular, were found to have a higher proportion of variance explained by batch effects than the original data, indicating inadequate batch mixing in a zero-shot context [5].
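The variance-explained-by-batch idea behind the PCR metric can be approximated by regressing top principal components on a one-hot batch design and variance-weighting the resulting R² values. This is a minimal reading of the metric, not the exact benchmarked implementation.

```python
import numpy as np

def pcr_batch(X, batch, n_pcs=2):
    """Regress each top principal component on a one-hot batch design and
    return the variance-weighted R^2: higher means the embedding's leading
    axes are dominated by batch effects."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    pcs = U[:, :n_pcs] * S[:n_pcs]                  # PC scores
    weights = (S[:n_pcs] ** 2) / (S ** 2).sum()     # variance share per PC
    onehot = np.eye(len(set(batch.tolist())))[batch]
    r2 = []
    for j in range(n_pcs):
        y = pcs[:, j]
        coef, *_ = np.linalg.lstsq(onehot, y, rcond=None)
        resid = y - onehot @ coef
        r2.append(1.0 - resid.var() / y.var())
    return float((weights * np.array(r2)).sum() / weights.sum())

rng = np.random.default_rng(1)
batch = np.repeat([0, 1], 50)
X = rng.normal(0.0, 0.1, (100, 5))
X = X + 3.0 * batch[:, None]     # strong batch shift along every dimension
score = pcr_batch(X, batch)      # near 1: batch dominates the leading PCs
```

An embedding that had removed the batch shift would score near zero on the same data.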

Experimental Protocol for Zero-Shot Evaluation

This protocol outlines the steps to evaluate the zero-shot capabilities of an MGM-pretrained model on a new target dataset for cell type clustering and batch integration.

Materials and Software Requirements

  • Computing Environment: A machine with a GPU (e.g., NVIDIA V100 or A100) is recommended for faster inference, though CPU is feasible.
  • Software: Python (>=3.8), PyTorch or TensorFlow, and the specific model's library (e.g., scGPT, Geneformer).
  • Target Dataset: A preprocessed scRNA-seq dataset (e.g., in Anndata or Seurat format) with held-out cell type labels and batch information for evaluation.

Step-by-Step Procedure

  • Model Acquisition and Loading:

    • Download the pretrained model weights for the scFM (e.g., from a GitHub repository or model hub).
    • Load the model into memory using the corresponding library, ensuring it is in evaluation/inference mode.
  • Target Data Preprocessing:

    • Quality Control: Filter out low-quality cells based on metrics like UMI counts, number of genes detected, and mitochondrial read percentage [23]. Filter out lowly expressed genes.
    • Gene Alignment: Map the genes in the target dataset to the gene vocabulary used during the model's pretraining. This may require filtering for a common set of genes or handling missing genes as defined by the model's authors.
    • Normalization: Apply the normalization method (e.g., log1p, library size normalization) that is compatible with the loaded scFM.
  • Zero-Shot Embedding Generation:

    • Pass the preprocessed target dataset through the pretrained model without performing any further training or fine-tuning.
    • Extract the cell embeddings from the model's output layer. This is often a [CLS] token embedding or a mean-pooled representation of all gene embeddings for a cell [1] [17].
  • Downstream Task Application:

    • Cell Type Clustering:
      • Perform dimensionality reduction (e.g., UMAP, t-SNE) on the extracted cell embeddings.
      • Cluster the cells using a standard algorithm like Leiden or Louvain clustering.
      • Compare the clusters to the held-out ground truth cell type labels using metrics like Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), or Average BIO Score (AvgBIO) [5].
    • Batch Integration Assessment:
      • Visualize the embeddings, coloring cells by batch and by cell type.
      • Quantitatively evaluate using metrics like:
        • Batch Mixing Score: Measures the degree of mixing between different batches.
        • Average Silhouette Width (ASW) for Cell-type: Assesses the preservation of biological variation.
        • Principal Component Regression (PCR) Batch: Quantifies the amount of variance explained by batch effects [5].
  • Benchmarking:

    • Compare the performance of the scFM embeddings against established baseline methods, such as using Highly Variable Genes (HVG) directly, or embeddings from scVI and Harmony [5].
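Step 3's mean-pooled cell embedding can be sketched as a masked average over per-gene token embeddings; the tiny token matrix and the boolean padding convention here are illustrative, not any specific model's output format.

```python
import numpy as np

def cell_embedding(gene_embeddings, pad_mask=None):
    """Collapse a cell's per-gene token embeddings (tokens x dim) into one
    cell vector by mean pooling, ignoring padding positions. Stands in for
    extracting a [CLS] token in models that provide one."""
    if pad_mask is None:
        pad_mask = np.ones(len(gene_embeddings), dtype=bool)
    return gene_embeddings[pad_mask].mean(axis=0)

tok = np.array([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]])   # last row is padding
emb = cell_embedding(tok, pad_mask=np.array([True, True, False]))
```

Excluding padding positions matters: averaging over padded rows would pull every short cell's embedding toward the zero vector.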

[Diagram: Target scRNA-seq Dataset → Preprocessing (QC, Gene Alignment, Normalization) → Generate Cell Embeddings (Zero-shot, using loaded pretrained foundation model) → Evaluation & Benchmarking: Cell Type Clustering (ARI, NMI, AvgBIO) and Batch Integration (Batch Score, ASW, PCR) → Compare vs. Baseline Methods]

Figure 2: Zero-shot Evaluation Protocol. Workflow for assessing a pretrained model on new data without fine-tuning.

The Scientist's Toolkit: Essential Research Reagents

Table 4: Key Reagents and Resources for MGM Pretraining and Evaluation

| Category | Item | Description and Function |
|---|---|---|
| Data Resources | CZ CELLxGENE Census [5] [1] | A unified resource providing access to millions of curated and standardized single-cell datasets, serving as a primary source for pretraining data. |
| Data Resources | Human Cell Atlas [1] | A reference map of all human cells, providing comprehensive data on cell types and states across tissues. |
| Data Resources | Gene Expression Omnibus (GEO) [1] | A public functional genomics data repository that hosts a vast number of submitted single-cell sequencing studies. |
| Software & Models | scGPT [5] [17] | A transformer-based foundation model pretrained on 33 million human cells using MGM. Supports multiple omics modalities. |
| Software & Models | Geneformer [5] [17] | A transformer model pretrained on 30 million cells, using a ranked-genes approach for tokenization and MGM. |
| Software & Models | Seurat / Scanpy [23] | Standard toolkits for single-cell data analysis, used for preprocessing, visualization, and benchmarking. |
| Evaluation Metrics | AvgBIO Score [5] | A composite metric for evaluating cell type clustering quality, combining multiple clustering benchmarks. |
| Evaluation Metrics | Batch Mixing Score [5] | Quantifies how well batches are integrated in the latent space. |
| Evaluation Metrics | Cell-type ASW [5] | Average Silhouette Width; measures the preservation of cell type separation after integration. |

Masked Gene Modeling is a powerful self-supervised paradigm for learning generalizable representations of single-cell biology. However, rigorous zero-shot evaluation reveals that current MGM-pretrained models do not consistently outperform simpler baseline methods on tasks like cell type clustering and batch integration, highlighting a significant challenge for the field [5].

Future work should focus on improving the pretraining objectives and model architectures to learn more transferable and biologically meaningful representations. The development of benchmarks that more directly assess a model's capacity for zero-shot biological discovery, beyond just technical tasks, will be crucial. As models scale and training datasets become larger and more diverse, the promise of scFMs to serve as robust, plug-and-play tools for zero-shot learning in biomedical research remains a central and achievable goal [1] [17].

Practical Applications and Methodological Approaches in Zero-Shot Analysis

Zero-Shot Cell Type Annotation and Novel Cell Type Discovery

Zero-shot learning represents a paradigm shift in the analysis of single-cell RNA sequencing (scRNA-seq) data. In contrast to supervised methods that require extensive labeled datasets for training, zero-shot approaches leverage pre-existing knowledge to annotate cell types and discover novel cellular states without task-specific fine-tuning [5]. This capability is critically important for exploratory biological research where comprehensive cell type labels are unknown or incomplete. The emergence of single-cell foundation models (scFMs), pretrained on millions of cells, promises to unlock this potential by learning universal biological representations transferable to diverse downstream tasks [24] [9].

The zero-shot paradigm is particularly valuable for discovering novel cell types and states that fall outside existing classification schemas. In clinical and drug development contexts, this enables researchers to identify previously uncharacterized cell populations in disease microenvironments or in response to treatment, potentially revealing new therapeutic targets [4]. However, rigorous benchmarking studies have revealed significant limitations in current scFMs, which sometimes underperform simpler methods in zero-shot settings [5] [4]. This application note synthesizes current methodologies, performance benchmarks, and experimental protocols to establish robust practices for zero-shot cell type annotation and novel cell discovery.

Performance Benchmarking of scFMs in Zero-Shot Tasks

Comprehensive evaluations of scFM performance reveal a complex landscape where no single model consistently outperforms others across all tasks. The table below summarizes key findings from recent large-scale benchmarking studies.

Table 1: Zero-Shot Performance of Single-Cell Foundation Models for Cell Type Annotation

| Model | Pretraining Corpus | Key Strengths | Performance Notes | Limitations |
| --- | --- | --- | --- | --- |
| scGPT | 33 million human cells [9] | Cross-species annotation, multi-omic integration [9] | Inconsistent zero-shot clustering; outperformed by HVGs on some datasets [5] | Embeddings sometimes retain batch effects; variable performance across tissues [5] |
| Geneformer | 27 million human cells [4] | Gene network inference, developmental trajectories [4] | Underperforms HVG, scVI, and Harmony in clustering (AvgBIO score) [5] | Poor batch integration; embeddings often cluster by batch rather than cell type [5] |
| scPlantFormer | 1 million plant cells (Arabidopsis thaliana) [9] | Cross-species annotation (92% accuracy) [9] | Specialized for plant systems; limited evaluation in human contexts | Domain-specific applicability |
| LangCell | Not specified | Gene embedding quality | Competitive on gene-level tasks [4] | Cell-level performance varies [4] |

Table 2: Comparison of Zero-Shot Performance Against Established Baselines

| Method | Category | Cell Type Clustering | Batch Integration | Novelty Detection |
| --- | --- | --- | --- | --- |
| scGPT (zero-shot) | Foundation Model | Variable across datasets [5] | Moderate (better on complex biological batches) [5] | Limited published evidence |
| Geneformer (zero-shot) | Foundation Model | Consistently outperformed by baselines [5] | Poor (high batch effect retention) [5] | Limited published evidence |
| HVG Selection | Traditional | Robust performance across datasets [5] | Excellent quantitative scores [5] | Limited capability |
| scVI | Generative Model | Strong performance [5] | Excellent for technical variation [5] | Established capability |
| Harmony | Integration Algorithm | Strong performance [5] | Excellent for technical batches [5] | Limited capability |

Notably, a zero-shot evaluation of scGPT and Geneformer revealed that both models were outperformed by simpler methods like highly variable gene (HVG) selection and established integration algorithms such as Harmony and scVI on cell type clustering tasks, as measured by average BIO (AvgBIO) scores [5]. This performance gap highlights the critical need for rigorous benchmarking before deploying scFMs in research pipelines.

Experimental Protocols for Zero-Shot Annotation

Protocol 1: Zero-Shot Cell Type Annotation Using Precomputed Embeddings

Purpose: To annotate cell types in a target scRNA-seq dataset using pre-trained foundation models without fine-tuning.

Materials:

  • Target scRNA-seq dataset (count matrix)
  • Pretrained scFM (e.g., scGPT, Geneformer)
  • Reference cell type markers (e.g., from Cell Ontology)
  • Computational environment with appropriate libraries (Python, PyTorch)

Procedure:

  • Data Preprocessing:
    • Normalize the target dataset using standard scRNA-seq workflows (e.g., SCTransform)
    • Filter low-quality cells and genes
    • Log-transform expression values
  • Embedding Generation:

    • Load the pretrained model weights
    • Pass the preprocessed count matrix through the model to extract cell embeddings
    • Reduce dimensionality using UMAP or t-SNE for visualization
  • Cell Type Prediction:

    • Calculate similarity scores between query cells and reference cell type signatures
    • Assign cell type labels based on maximum similarity
    • Set confidence thresholds to flag low-probability assignments
  • Validation:

    • Assess clustering quality using metrics like average silhouette width (ASW)
    • Manually inspect marker gene expression for assigned types
    • Identify populations with ambiguous assignments for further investigation
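The similarity scoring and confidence thresholding in the Cell Type Prediction step can be sketched in a few lines. This assumes cell embeddings and reference signature vectors already live in the same space; the function name, toy vectors, and 0.5 cutoff are illustrative, not prescribed by any model:

```python
import numpy as np

def annotate_zero_shot(cell_emb, ref_emb, ref_names, min_conf=0.5):
    """Assign each cell the reference type with highest cosine similarity.

    Cells whose best similarity falls below min_conf are flagged 'unassigned'
    so they can be inspected as candidate novel populations.
    """
    c = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = c @ r.T                      # (n_cells, n_ref) cosine similarities
    best = sim.argmax(axis=1)
    conf = sim.max(axis=1)
    labels = np.where(conf >= min_conf,
                      np.asarray(ref_names)[best], "unassigned")
    return labels, conf

# Illustrative reference signatures and query cells (not real embeddings)
ref = np.array([[1.0, 0.0], [0.0, 1.0]])
cells = np.array([[0.9, 0.1], [0.1, 0.95], [-0.7, -0.7]])
labels, conf = annotate_zero_shot(cells, ref, ["T cell", "B cell"])
print(labels)  # ['T cell' 'B cell' 'unassigned']
```

The confidence vector doubles as input to the validation step: low-confidence cells are exactly the populations to inspect for ambiguous or novel identity.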

Protocol 2: Novel Cell Type Discovery Using Multimodal Embeddings

Purpose: To identify novel cell populations that lack strong similarity to known reference types.

Materials:

  • CellWhisperer framework or similar multimodal tool [25]
  • Integrated transcriptome-text embedding model
  • Reference atlas with comprehensive coverage (e.g., Human Cell Atlas)

Procedure:

  • Multimodal Embedding:
    • Generate joint embeddings of transcriptomes and textual descriptions using contrastive learning
    • Project both query cells and reference populations into shared latent space
  • Similarity Assessment:

    • Compute distances between query cells and all reference cell types
    • Identify outlier populations with low similarity to any reference type
    • Perform hierarchical clustering to confirm distinctness of putative novel population
  • Characterization:

    • Extract differentially expressed genes for the novel population
    • Use natural language queries to explore potential functions (e.g., "cells with high metabolic activity")
    • Compare with developmental trajectories or disease states for context
  • Biological Validation:

    • Design experimental validation using protein markers or spatial transcriptomics
    • Contextualize findings within relevant biological processes or disease mechanisms
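The outlier-identification step in the Similarity Assessment above can be made concrete as a cluster-level novelty check. This minimal sketch assumes per-cell maximum similarity to the reference atlas and cluster assignments have already been computed; the 0.4 cutoff and all values are illustrative:

```python
import numpy as np

def flag_novel_clusters(max_sim, cluster_ids, sim_cutoff=0.4):
    """Flag clusters whose median best-similarity to any reference type is low.

    max_sim: per-cell maximum similarity to the reference atlas;
    cluster_ids: per-cell cluster assignment from unsupervised clustering.
    Returns the set of cluster ids to treat as putative novel populations.
    """
    novel = set()
    for cid in np.unique(cluster_ids):
        if np.median(max_sim[cluster_ids == cid]) < sim_cutoff:
            novel.add(int(cid))
    return novel

# Toy values: cluster 0 matches the reference well, cluster 1 does not
max_sim = np.array([0.9, 0.85, 0.8, 0.2, 0.15, 0.25])
cluster_ids = np.array([0, 0, 0, 1, 1, 1])
print(flag_novel_clusters(max_sim, cluster_ids))  # {1}
```

Using the cluster-level median rather than per-cell thresholds makes the call robust to individual noisy cells; flagged clusters then proceed to the characterization and validation steps.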

Visualization Workflows for Annotation Results

Effective visualization is essential for interpreting zero-shot annotation results and communicating findings. The following workflows integrate established tools with novel multimodal approaches.

scRNA-seq Count Matrix + Pretrained Foundation Model → Generate Cell Embeddings → Calculate Similarity to Reference Types (using Reference Cell Markers) → Identify Low-Confidence Assignments → Cluster Analysis for Novel Population Detection. Visualization outputs: UMAP/t-SNE with annotation overlay, CellWhisperer chat interface, confidence score heatmap, and differential expression volcano plots.

Diagram 1: Zero-Shot Annotation Visualization Workflow

Advanced tools like Vitessce enable integrative visualization of multimodal single-cell data across multiple coordinated views [26]. This framework supports simultaneous exploration of transcriptomics, cell-type annotations, spatially resolved transcripts, and imaging data, facilitating the interpretation of novel cell populations in their biological context.

Table 3: Essential Computational Tools for Zero-Shot Cell Type Annotation

| Tool/Resource | Type | Function | Access |
| --- | --- | --- | --- |
| CELLxGENE Census | Data Platform | Curated single-cell data for reference and benchmarking [25] | https://cellxgene.cziscience.com/ |
| CellWhisperer | Multimodal AI | Natural language query of transcriptomic data [25] | https://cellwhisperer.bocklab.org |
| Vitessce | Visualization Framework | Interactive visualization of multimodal single-cell data [26] | http://vitessce.io |
| scBubbletree | Visualization Package | Quantitative visualization of scRNA-seq cluster relationships [27] | Bioconductor R package |
| BioLLM | Benchmarking Framework | Standardized interface for evaluating foundation models [9] | Open source |
| Human Cell Atlas | Reference Data | Comprehensive map of human cell types [25] | https://www.humancellatlas.org/ |

Integrated Protocol for Validation and Biological Interpretation

Zero-Shot Annotation Results → Spatial Context Validation (map to spatial data) → Multimodal Integration (correlate with imaging features) → Functional Enrichment Analysis (pathway & GO analysis) → Experimental Validation (hypothesis-driven) → Confirmed Novel Cell Population.

Diagram 2: Multimodal Validation Protocol

Purpose: To validate and biologically contextualize putative novel cell types identified through zero-shot annotation.

Materials:

  • Spatial transcriptomics data (e.g., 10x Visium, MERFISH)
  • Protein expression data (CITE-seq, CODEX)
  • Functional annotation databases (GO, KEGG)
  • CellWhisperer or similar multimodal interpretation tool

Procedure:

  • Spatial Validation:
    • Map putative novel cell populations to spatial coordinates
    • Assess spatial clustering patterns and neighborhood contexts
    • Correlate with histological features in matched tissue sections
  • Multimodal Correlation:

    • Integrate transcriptomic findings with protein expression patterns
    • Confirm uniqueness at multiple molecular layers
    • Identify surface markers for experimental isolation
  • Functional Annotation:

    • Perform pathway enrichment analysis on marker genes
    • Use CellWhisperer's natural language capability to generate biological hypotheses
    • Contextualize within relevant disease mechanisms or developmental processes
  • Expert Integration:

    • Combine computational evidence with domain knowledge
    • Design targeted experiments to validate functional characteristics
    • Propose formal naming and classification through appropriate ontologies

Zero-shot cell type annotation and novel cell discovery represent frontier capabilities in single-cell genomics with significant potential for biological discovery and therapeutic development. While current foundation models show promise, their performance varies considerably across biological contexts and dataset characteristics. The protocols and benchmarks presented here provide a framework for rigorous application of these methods while acknowledging current limitations. As the field evolves, continued development of multimodal approaches and biologically-informed evaluation metrics will be essential for realizing the full potential of zero-shot learning in cellular taxonomy and discovery.

A fundamental challenge in single-cell genomics is the integration of datasets from different studies, technologies, or laboratories to extract meaningful biological insights. Batch effects—non-biological variations introduced by technical differences—can obscure true biological signals and hinder cross-study comparisons. While traditional computational methods often require dataset-specific fine-tuning, single-cell foundation models (scFMs) offer a promising alternative through their emergent zero-shot capabilities. This Application Note examines current scFMs and their application in overcoming batch effects without fine-tuning, providing researchers with practical protocols for evaluating and implementing these approaches.

The Batch Effect Challenge in Single-Cell Biology

Batch effects represent a significant obstacle in single-cell research, particularly when integrating data across different experimental conditions, technologies, or donor populations. These technical variations can:

  • Obscure true biological differences between cell states and conditions
  • Limit statistical power by reducing effective sample sizes
  • Introduce false positives in differential expression analysis
  • Hinder reproducibility across studies and platforms

The problem is particularly acute in exploratory research where comprehensive labels for supervised fine-tuning are unavailable. In these contexts, models must generate robust representations without task-specific training, making zero-shot performance a critical evaluation metric [5].

Zero-Shot Performance of Single-Cell Foundation Models

Evaluation Framework

Rigorous evaluation of scFMs in zero-shot settings reveals important limitations and strengths. Performance should be assessed using multiple complementary metrics:

  • Cell type clustering: Ability to separate known cell types without fine-tuning
  • Batch integration: Effectiveness in removing technical variations while preserving biological signals
  • Biological relevance: Concordance with established biological knowledge

Benchmarking studies typically compare scFMs against established baselines including Highly Variable Genes (HVG) selection, Harmony, and scVI [5] [17].

Performance Comparison

Recent evaluations demonstrate variable performance across models and datasets:

Table 1: Zero-shot performance comparison across integration methods

| Method | Cell Type Clustering (AvgBIO Score) | Batch Integration (Batch Mixing Score) | Biological Relevance (scGraph-OntoRWR) |
| --- | --- | --- | --- |
| HVG Selection | 0.74 | 0.89 | 0.68 |
| Harmony | 0.71 | 0.76 | 0.72 |
| scVI | 0.73 | 0.82 | 0.75 |
| Geneformer | 0.62 | 0.61 | 0.65 |
| scGPT | 0.68 | 0.79 | 0.71 |
| scShift | 0.76 | 0.85 | 0.78 |

Data compiled from multiple benchmarking studies [5] [28] [17].

Notably, simpler methods like HVG selection can outperform foundation models in some zero-shot scenarios, particularly for batch integration tasks [5]. However, specialized models like scShift demonstrate exceptional capabilities in disentangling batch-dependent and independent variations when pretrained on compendiums of scRNA-seq atlases [28].
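To make the baseline concrete, the HVG-plus-PCA pipeline that foundation models are benchmarked against can be sketched as follows. Note that published baselines typically use dispersion-based HVG selection (e.g., scanpy's highly_variable_genes); this dependency-free sketch ranks genes by plain variance instead:

```python
import numpy as np

def hvg_pca_baseline(X, n_hvg=2000, n_pcs=50):
    """Baseline embedding: top-variance gene selection followed by PCA.

    X: (cells, genes) log-normalized expression matrix. Plain variance
    ranking stands in for dispersion-normalized HVG selection here.
    """
    X = np.asarray(X, dtype=float)
    n_hvg = min(n_hvg, X.shape[1])
    hvg_idx = np.argsort(X.var(axis=0))[::-1][:n_hvg]   # top-variance genes
    Xh = X[:, hvg_idx] - X[:, hvg_idx].mean(axis=0)     # center before PCA
    # PCA via SVD of the centered matrix
    U, S, _ = np.linalg.svd(Xh, full_matrices=False)
    k = min(n_pcs, S.size)
    return U[:, :k] * S[:k]                             # PC coordinates

# Synthetic counts standing in for a real dataset
rng = np.random.default_rng(1)
X = rng.poisson(1.0, (100, 500)).astype(float)
emb = hvg_pca_baseline(np.log1p(X), n_hvg=200, n_pcs=20)
print(emb.shape)  # (100, 20)
```

That this few-line pipeline can rival transformer embeddings on clustering metrics is precisely the performance gap the benchmarking studies highlight.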

Experimental Protocols for Zero-Shot Evaluation

Protocol 1: Evaluating Batch Integration Performance

Objective: Assess model performance in removing batch effects while preserving biological variation.

Materials:

  • Processed single-cell dataset with known batch labels and cell type annotations
  • Pretrained foundation model (Geneformer, scGPT, scShift, or alternatives)
  • Baseline methods (HVG, Harmony, scVI) for comparison
  • Computing environment with appropriate libraries (Python, R)

Procedure:

  • Data Preparation:
    • Standardize input data to match model requirements (gene ranking for Geneformer, HVG selection for scGPT)
    • Ensure batch labels and cell type annotations are available for evaluation
    • Split data by batch origin if evaluating cross-dataset performance
  • Embedding Generation:

    • Generate cell embeddings using the foundation model in zero-shot mode
    • No fine-tuning or parameter optimization should be performed
    • Extract embeddings at the appropriate layer (cell-level embeddings)
  • Quantitative Assessment:

    • Calculate batch mixing metrics (e.g., PCR score, LISI)
    • Evaluate cell type separation (e.g., ASW, AvgBIO score)
    • Compute biological relevance metrics (e.g., scGraph-OntoRWR)
  • Visualization:

    • Generate 2D visualizations (UMAP, t-SNE) of embeddings
    • Color by batch labels to assess integration
    • Color by cell type to assess biological preservation

Expected Outcomes: Foundation models should demonstrate competitive batch mixing while maintaining or improving biological signal preservation compared to baselines [5] [17].
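One simple way to quantify the batch mixing called for in step 3 is the normalized entropy of batch labels within each cell's k-nearest-neighbor neighborhood. This is a stand-in for published metrics such as LISI or kBET, and all data here are synthetic:

```python
import numpy as np

def batch_mixing_entropy(emb, batches, k=15):
    """Mean normalized entropy of batch labels in each cell's k-NN neighborhood.

    1.0 = neighborhoods mirror the global batch mix (well integrated);
    0.0 = neighborhoods are single-batch (strong residual batch effects).
    """
    emb = np.asarray(emb, dtype=float)
    batches = np.asarray(batches)
    uniq = np.unique(batches)
    d = np.linalg.norm(emb[:, None, :] - emb[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)            # exclude the cell itself
    nn = np.argsort(d, axis=1)[:, :k]      # indices of the k nearest neighbors
    ents = []
    for row in nn:
        p = np.array([(batches[row] == b).mean() for b in uniq])
        p = p[p > 0]
        ents.append(-(p * np.log(p)).sum() / np.log(len(uniq)))
    return float(np.mean(ents))

# Randomly interleaved batches -> high entropy (near 1)
rng = np.random.default_rng(0)
emb = rng.normal(size=(60, 5))
batches = np.array(["batch1", "batch2"] * 30)
print(round(batch_mixing_entropy(emb, batches), 2))
```

A well-integrated embedding should score high on this metric while keeping cell-type ASW high; scoring high on mixing alone may simply mean the biology was erased.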

Protocol 2: Cross-Dataset Biological State Transfer

Objective: Evaluate model capability to identify consistent biological states across independent datasets.

Materials:

  • Multiple datasets with shared biological conditions (e.g., disease states)
  • Pretrained scShift model or equivalent
  • Reference biological annotations for validation

Procedure:

  • Model Setup:
    • Utilize scShift's dual-encoder architecture for batch-dependent and batch-independent variations
    • Configure sparsity regularization (l0) for dataset label encoding
    • Apply independence regularization between centralized latent variables and dataset labels
  • Embedding Extraction:

    • Extract biological embeddings (batch-dependent components)
    • Extract unperturbed embeddings (batch-independent components)
    • Process new datasets without additional training
  • Cross-Dataset Comparison:

    • Project biological states from different datasets into shared space
    • Identify conserved biological patterns across batches
    • Validate with known biological ground truth
  • Downstream Analysis:

    • Construct classifiers for biological states using embeddings
    • Identify state-specific gene expression patterns
    • Characterize cellular interactions and potential therapeutic targets

Expected Outcomes: Successful models will identify consistent biological states (e.g., disease signatures) across technically diverse datasets without fine-tuning [28].
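The cross-dataset projection in step 3 can be illustrated with a nearest-centroid label transfer. This sketch assumes both datasets have already been embedded into a shared, batch-independent latent space; all embeddings and state names are toy placeholders:

```python
import numpy as np

def transfer_state_labels(src_emb, src_states, tgt_emb):
    """Nearest-centroid transfer of biological state labels across datasets.

    Centroids are computed per state in the source dataset's embedding;
    target cells from an unseen dataset are assigned the closest centroid,
    with no additional training.
    """
    src_emb, tgt_emb = np.asarray(src_emb, float), np.asarray(tgt_emb, float)
    src_states = np.asarray(src_states)
    states = np.unique(src_states)
    centroids = np.stack([src_emb[src_states == s].mean(axis=0) for s in states])
    d = np.linalg.norm(tgt_emb[:, None, :] - centroids[None, :, :], axis=-1)
    return states[d.argmin(axis=1)]

# Toy embeddings: 'healthy' near 0, 'disease' near 2, in a shared latent space
rng = np.random.default_rng(2)
src = np.vstack([rng.normal(0, 0.1, (20, 3)), rng.normal(2, 0.1, (20, 3))])
src_states = np.array(["healthy"] * 20 + ["disease"] * 20)
tgt = np.array([[0.05, -0.02, 0.1], [1.9, 2.1, 2.0]])
print(transfer_state_labels(src, src_states, tgt))  # ['healthy' 'disease']
```

The transfer is only as good as the disentanglement: if the latent space still encodes batch identity, centroids will drift toward dataset-specific artifacts rather than biological states.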

Research Reagent Solutions

Table 2: Essential computational tools for zero-shot batch integration

| Tool Name | Type | Primary Function | Implementation Requirements |
| --- | --- | --- | --- |
| Geneformer | Foundation Model | Cell embedding via transformer architecture | Python, 40M parameters, 30M pretraining cells [17] |
| scGPT | Foundation Model | Multi-task learning on single-cell data | Python, 50M parameters, 33M pretraining cells [5] [17] |
| scShift | Specialized Framework | Disentangling batch and biological variations | Python, variational inference framework [28] |
| Harmony | Integration Algorithm | Batch effect correction | R/Python, linear integration approach [5] |
| scVI | Generative Model | Probabilistic modeling of scRNA-seq | Python, deep generative modeling [5] |
| CELLxGENE | Data Platform | Curated single-cell data repository | Web access or local installation [1] |

Implementation Workflows

Workflow 1: Zero-Shot Dataset Integration

Raw Single-Cell Datasets → Data Preprocessing (Gene Filtering, Normalization) → Foundation Model (Zero-Shot Embedding) → Integration Quality Assessment → Biological Discovery.

Workflow 2: Biological State Identification Across Batches

Multiple Batch Datasets → scShift Framework → Biological State Embedding + Cell Type Embedding → Cross-Dataset State Alignment.

Zero-shot integration of single-cell datasets represents a significant advancement in computational biology, enabling researchers to overcome batch effects without extensive fine-tuning. While current foundation models show promise, their performance varies considerably across tasks and datasets. scShift demonstrates particularly strong capabilities in disentangling biological and technical variations through its identifiable architecture. Researchers should select integration methods based on their specific data characteristics and analytical needs, considering that simpler approaches sometimes outperform complex foundation models. As the field evolves, improved model architectures and training strategies will likely enhance zero-shot performance, ultimately enabling more robust and reproducible single-cell research.

Predicting Cellular Responses to Drugs and Molecular Perturbations

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity within complex biological systems and tumor microenvironments [4] [29]. This technology provides an unprecedented granular view of transcriptomics at the resolution of individual cells, enabling researchers to investigate diverse cellular responses to therapeutic interventions [30]. However, the high sparsity, dimensionality, and noise characteristic of scRNA-seq data present significant computational challenges for analyzing cellular drug responses [4].

Single-cell foundation models (scFMs) pretrained on massive datasets have emerged as powerful tools to address these challenges [4]. These models, including scGPT, Geneformer, scFoundation, and UCE, leverage self-supervised learning to capture universal biological patterns, which can then be applied to downstream tasks with minimal additional training [4] [31]. The zero-shot learning capabilities of these models are particularly valuable for predicting cellular responses to drugs and perturbations in discovery settings where labeled data are scarce or unavailable [5].

This application note provides a comprehensive framework for leveraging scFMs in zero-shot settings to predict cellular drug responses. We present benchmark performance data across multiple models, detailed experimental protocols for implementation, visualization of key workflows, and a curated toolkit of research reagents to facilitate adoption of these methods in basic research and drug development pipelines.

Benchmarking Single-Cell Foundation Models for Drug Response Prediction

Performance Evaluation of scFMs Across Tasks

Recent benchmarking studies have revealed distinct strengths and limitations of various scFMs across different biological tasks. The evaluation encompasses gene-level tasks (e.g., gene function prediction, tissue specificity) and cell-level tasks (e.g., cell type annotation, batch integration, drug response prediction) [4].

Table 1: Performance comparison of single-cell foundation models across key tasks

| Foundation Model | Zero-shot Cell Embedding Quality (ASW) | Batch Integration | Drug Response Prediction (F1 Score) | Computational Efficiency |
| --- | --- | --- | --- | --- |
| scGPT | 0.75-0.92 | Moderate | 0.858 (zero-shot) | High |
| Geneformer | 0.65-0.85 | Poor | 0.65-0.80 | High |
| scFoundation | 0.70-0.88 | Moderate | 0.947 (fine-tuned) | Moderate |
| UCE | 0.68-0.82 | Moderate | 0.774 (fine-tuned) | Moderate |
| scBERT | 0.55-0.70 | Poor | 0.60-0.75 | Low |

Data compiled from multiple benchmarking studies [4] [5] [32]. Performance ranges represent variation across different datasets and evaluation metrics. ASW = Average Silhouette Width, measuring cluster separation quality.

Notably, evaluations reveal that no single scFM consistently outperforms others across all tasks, emphasizing the need for task-specific model selection [4]. In zero-shot settings for drug response prediction, scGPT has demonstrated superior performance with a mean F1 score of 0.858, while scFoundation excels in fine-tuned scenarios [32] [33].

Specialized Models for Drug Response Prediction

Beyond general-purpose scFMs, specialized architectures have been developed specifically for pharmacological applications:

scGSDR (Single-cell Gene Semantics for Drug Response prediction) incorporates biological knowledge through dual computational pipelines focusing on cellular states and signaling pathways [29]. This model employs a transformer-based graph fusion framework to integrate multi-source cellular features, enhancing prediction accuracy and providing interpretable insights into resistance mechanisms.

ATSDP-NET (Attention-based Transfer Learning for Enhanced Single-cell Drug Response Prediction) combines bulk and single-cell RNA-seq data using transfer learning and multi-head attention mechanisms [30]. This approach identifies critical gene expression patterns linked to drug reactions, achieving high correlation between predicted sensitivity gene scores and actual values (R = 0.888, p < 0.001).

ZeroBind utilizes a protein-specific meta-learning framework with subgraph matching for drug-target interaction prediction, achieving an AUROC of 0.9521 (±0.0034) on transductive test sets and demonstrating strong zero-shot capabilities for novel proteins [34].

Experimental Protocols for Zero-Shot Drug Response Prediction

Protocol 1: Zero-Shot Evaluation of scFMs for Cellular Drug Response

This protocol outlines the procedure for assessing pre-trained scFMs without additional fine-tuning, particularly valuable when labeled drug response data are limited.

Materials:

  • Pre-trained foundation model (scGPT, Geneformer, scFoundation, or UCE)
  • Target scRNA-seq dataset (pre-treatment)
  • Computational environment with GPU acceleration
  • BioLLM framework or scDrugMap platform [31] [32]

Procedure:

  • Data Preprocessing: Implement rigorous quality control including mitochondrial gene filtering, doublet detection, and normalization using the decision-tree-based preprocessing interface in BioLLM [31].
  • Feature Extraction: Generate zero-shot cell embeddings using the foundation model's forward pass without gradient updates.
  • Dimensionality Reduction: Apply UMAP or t-SNE to embeddings for visualization of cellular states.
  • Clustering Analysis: Perform Leiden clustering on embeddings to identify distinct cellular subpopulations.
  • Response Prediction: Utilize the model's inherent capabilities or simple classifiers (e.g., k-NN) on embeddings to predict drug-sensitive and resistant populations.
  • Validation: Compare predictions with experimental measurements when available, using metrics including AUROC, AUPRC, and F1 score.
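The simple-classifier option in the Response Prediction step can be sketched as a k-NN majority vote over reference cells with measured responses; the embeddings and labels below are toy placeholders, not real drug-response data:

```python
import numpy as np

def knn_predict_response(query_emb, ref_emb, ref_labels, k=5):
    """Predict 'sensitive'/'resistant' by majority vote among the k nearest
    reference cells in embedding space.

    ref_labels come from a reference dataset with measured drug response;
    no gradient updates to the foundation model are involved.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    ref_emb = np.asarray(ref_emb, dtype=float)
    ref_labels = np.asarray(ref_labels)
    preds = []
    for q in query_emb:
        d = np.linalg.norm(ref_emb - q, axis=1)
        nn = ref_labels[np.argsort(d)[:k]]          # labels of k nearest cells
        vals, counts = np.unique(nn, return_counts=True)
        preds.append(vals[counts.argmax()])         # majority vote
    return np.array(preds)

# Toy reference: sensitive cells near the origin, resistant cells far away
rng = np.random.default_rng(0)
ref = np.vstack([rng.normal(0, 0.2, (25, 4)), rng.normal(3, 0.2, (25, 4))])
lab = np.array(["sensitive"] * 25 + ["resistant"] * 25)
query = np.array([[0.1, 0.0, 0.1, 0.0], [3.1, 2.9, 3.0, 3.1]])
print(knn_predict_response(query, ref, lab))  # ['sensitive' 'resistant']
```

Because the classifier is trained only on the reference labels, this stays within the zero-shot regime for the foundation model itself; its accuracy is then assessed with the AUROC/AUPRC/F1 metrics of the validation step.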

Troubleshooting Tips:

  • If embeddings show poor separation, ensure input data preprocessing matches the model's original training specifications.
  • For inconsistent results across batches, apply the model's built-in batch correction methods or post-hoc integration tools.

Protocol 2: Integration of Biological Knowledge with scGSDR

This protocol details the incorporation of gene semantic information to enhance drug response prediction accuracy.

Materials:

  • scGSDR framework
  • Gene ontology databases
  • Signaling pathway resources (KEGG, Reactome)
  • Cellular state marker gene lists

Procedure:

  • Cellular State Pipeline: Filter genes using marker genes from 14 different cellular states and map to embedding space using a transformer module.
  • Pathway Pipeline: Automatically learn attention matrices defining association between each cell and various pathways; construct cell-cell graphs.
  • Multi-Graph Fusion: Input learned graphs with gene expression profiles into multi-graph fusion module to generate pathway-informed embeddings.
  • Feature Fusion: Integrate cellular state and pathway embeddings through feature fusion.
  • Domain Adaptation: Apply domain adaptation learning to mitigate discrepancies between reference and query datasets.
  • Imbalance Correction: Implement specialized loss functions (Inverse, Deviation, Hinge) to address data imbalance between drug-resistant and sensitive cells [29].

Workflow Visualization

Zero-Shot Drug Response Prediction Workflow

scRNA-seq Data (Pre-treatment) → Quality Control & Normalization → Single-Cell Foundation Model (scGPT, Geneformer, etc.) → Zero-Shot Cell Embeddings → Downstream Analysis (Cell Clustering & Subpopulation Identification; Drug Response Prediction) → Experimental Validation.

Zero-Shot Drug Response Prediction Pipeline - This workflow illustrates the sequential process from raw single-cell data to validated predictions, highlighting the central role of foundation models in generating biological insights without task-specific training.

Model Comparison Framework

Foundation Models (scGPT, Geneformer, scFoundation, UCE) → BioLLM Unified Framework → Evaluation Tasks: Embedding Quality (ASW Metric), Batch Effect Correction, Drug Response Prediction (F1 Score).

Foundation Model Evaluation Framework - This diagram visualizes the standardized evaluation of multiple scFMs through unified frameworks like BioLLM, enabling systematic comparison across diverse tasks including drug response prediction.

The Scientist's Toolkit: Research Reagent Solutions

Table 2: Essential computational tools and resources for zero-shot drug response prediction

| Tool/Resource | Type | Primary Function | Application Context |
| --- | --- | --- | --- |
| BioLLM Framework | Software Framework | Unified interface for diverse scFMs | Standardized model evaluation and deployment [31] |
| scDrugMap | Platform | Drug response prediction benchmark | Evaluating foundation models on pharmacological tasks [32] [33] |
| CELLxGENE Database | Data Resource | Curated single-cell datasets | Model pretraining and validation [5] |
| GDSC/CCLE Databases | Data Resource | Drug sensitivity data | Ground truth for model training and validation [30] [29] |
| scGSDR | Specialized Model | Gene semantics integration | Pathway-informed drug response prediction [29] |
| ATSDP-NET | Specialized Model | Attention mechanism & transfer learning | Bulk-to-single cell knowledge transfer [30] |
| ZeroBind | Specialized Model | Drug-target interaction prediction | Zero-shot prediction for novel proteins [34] |

Zero-shot learning with single-cell foundation models represents a transformative approach for predicting cellular responses to drugs and perturbations. The benchmarking data presented herein demonstrates that while current models show promising capabilities, their performance varies significantly across tasks and datasets. Researchers should select models based on specific application requirements, considering factors such as dataset size, biological interpretability needs, and computational resources.

The experimental protocols and visualization workflows provide practical guidance for implementation, while the curated toolkit of research reagents facilitates adoption across diverse research environments. As the field evolves, continued benchmarking efforts and standardized evaluation frameworks will be essential for realizing the full potential of scFMs in pharmacological research and therapeutic development.

The interpretation of single-cell RNA sequencing (scRNA-seq) data presents a significant challenge in computational biology, as researchers must navigate complex gene expression matrices containing thousands of cells and tens of thousands of genes to extract meaningful biological insights [25]. The emergence of single-cell foundation models (scFMs) has promised to revolutionize this analysis by providing pretrained models that can be adapted to various downstream tasks. However, recent rigorous evaluations have revealed critical limitations in these models, particularly in zero-shot settings where they are applied without further training to new data with unknown labels [5]. This performance gap is especially problematic for discovery-driven science where cellular composition may not be known in advance.

In response to these challenges, a new paradigm has emerged: multimodal artificial intelligence that connects transcriptomic data with natural language. CellWhisperer represents a pioneering approach in this domain, bridging the gap between numerical gene expression values and textual biological descriptions through contrastive learning [25] [35]. By establishing a joint embedding space for transcriptomes and text, this framework enables researchers to interrogate their data using intuitive natural-language queries rather than complex computational code, making sophisticated analysis accessible to non-computational biologists [36].

The integration of chat-based exploration within single-cell analysis tools addresses a fundamental need in biological research: connecting computational outputs with biological context. Where traditional scFMs have struggled with reliability in zero-shot applications [5] [6], multimodal approaches like CellWhisperer leverage the inherent knowledge captured in large language models (LLMs) to provide context-aware interpretations of gene expression patterns. This application note examines the principles, protocols, and applications of this transformative technology, with particular emphasis on its performance in zero-shot learning scenarios relevant to drug discovery and biomedical research.

Technical Foundations of Multimodal Integration

Core Architecture and Training Methodology

CellWhisperer employs a sophisticated multimodal architecture based on the contrastive language-image pretraining (CLIP) framework, adapted for biological data [25]. The system consists of two interconnected artificial intelligence models that work in tandem to enable natural language interaction with transcriptomic data:

  • Embedding Model: This component creates a joint multimodal embedding space through contrastive learning on 1,082,413 pairs of human RNA-seq profiles and matched textual annotations [25]. The model processes transcriptomes using the Geneformer model for gene expression and textual annotations using BioBERT for biomedical text [25]. These processed inputs are then mapped into a shared 2,048-dimensional embedding space using conventional feed-forward neural network layers.

  • Chat Model: This component adapts the Mistral 7B open-weights large language model to incorporate CellWhisperer transcriptome embeddings alongside text queries [25]. The model was fine-tuned on a dataset of 106,610 conversations, including both rule-based question-answer pairs and complex LLM-generated dialogues about transcriptomes and cells [25].
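The contrastive objective underlying the embedding model can be illustrated with a minimal numerical sketch. This is not CellWhisperer's actual code; it assumes generic paired embeddings and uses the standard symmetric InfoNCE loss from the CLIP framework, with an illustrative temperature value:

```python
import numpy as np

def clip_contrastive_loss(cell_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched (transcriptome, text)
    embedding pairs: row i of each array is assumed to be a matched pair.
    A sketch of the CLIP-style objective, not CellWhisperer's exact code."""
    # Normalize so that dot products become cosine similarities
    c = cell_emb / np.linalg.norm(cell_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (c @ t.T) / temperature  # (batch, batch) similarity matrix

    def log_softmax(x):
        x = x - x.max(axis=1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

    # Matched pairs sit on the diagonal; penalize both retrieval directions
    diag = np.arange(len(logits))
    loss_c2t = -log_softmax(logits)[diag, diag].mean()
    loss_t2c = -log_softmax(logits.T)[diag, diag].mean()
    return (loss_c2t + loss_t2c) / 2.0
```

Minimizing this loss pulls matched transcriptome and text embeddings together in the shared space while pushing mismatched pairs apart, which is what enables the semantic retrieval described below.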

The training data for CellWhisperer was assembled through LLM-assisted curation from two major repositories: the Gene Expression Omnibus (GEO) and CELLxGENE Census [25]. This process yielded 705,430 human transcriptomes from GEO with standardized textual annotations and 376,983 pseudo-bulk transcriptomes derived from scRNA-seq datasets in CELLxGENE Census [25].

Zero-Shot Capabilities and Performance

A critical advantage of the multimodal approach is its inherent zero-shot capability, allowing the model to recognize patterns in new datasets without additional training [35]. CellWhisperer demonstrates robust performance in zero-shot prediction of cell types and other biological annotations [25] [37]. The system achieves this through its multimodal embedding space, which enables semantic similarity search across both transcriptomic and textual domains.

When benchmarked against traditional single-cell foundation models like Geneformer and scGPT, which have shown limitations in zero-shot settings [5], CellWhisperer's multimodal approach appears to address several key shortcomings. The model's ability to leverage both the structural patterns in gene expression data and the semantic context from biological text descriptions enhances its generalization capabilities to unseen data and cell types.

Table 1: Comparative Performance of Single-Cell Analysis Methods in Zero-Shot Settings

| Method | Architecture Type | Key Strength | Zero-Shot Limitation | Reported Zero-Shot Performance |
| --- | --- | --- | --- | --- |
| CellWhisperer | Multimodal transformer | Natural language query interpretation | Limited benchmarking across diverse tissues | 0.927 AUROC on retrieval tasks [25] |
| Geneformer | Foundation model | Gene regulatory inference | Poor batch integration and cell type separation [5] | Underperforms HVG baseline (AvgBIO) [5] |
| scGPT | Foundation model | Scalability to large datasets | Inconsistent across tissue types [5] | Variable; outperformed by Harmony and scVI [5] |
| Harmony | Conventional ML | Batch effect correction | Requires predefined cell identities | Outperforms foundation models [5] |
| scVI | Probabilistic model | Probabilistic modeling of expression | Requires model fitting for new data | Outperforms foundation models [5] |
| HVG (Highly Variable Genes) | Statistical baseline | Computational simplicity | Limited biological context | Outperforms Geneformer and scGPT [5] |

Experimental Protocols for Chat-Based Single-Cell Exploration

Implementation and Data Preparation

To utilize CellWhisperer for single-cell data exploration, researchers must follow a structured protocol for data preparation and system implementation:

Data Preparation Protocol:

  • Format Conversion: Prepare scRNA-seq data as an h5ad file containing the gene expression matrix and associated cell metadata [36].
  • Quality Control: Implement standard scRNA-seq quality control metrics including mitochondrial gene percentage thresholds, minimum gene counts per cell, and cell viability markers.
  • Normalization: Apply library size normalization and log transformation following standard single-cell analysis practices.
  • Metadata Annotation: Include comprehensive sample metadata such as experimental conditions, donor information, and processing batches to enhance contextual understanding.
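The normalization step above can be sketched in a few lines of NumPy; in a typical scanpy workflow the equivalent calls are `sc.pp.normalize_total` and `sc.pp.log1p`. The function name and target sum below are illustrative, not prescribed by CellWhisperer:

```python
import numpy as np

def normalize_log1p(counts, target_sum=1e4):
    """Library-size normalization followed by log1p.

    counts: (cells, genes) raw count matrix.
    Each cell is scaled so its total count equals target_sum, then
    log-transformed, the standard scRNA-seq preprocessing convention.
    """
    lib_size = counts.sum(axis=1, keepdims=True)
    scaled = counts / lib_size * target_sum
    return np.log1p(scaled)
```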

System Implementation:

  • Local Deployment: Download the CellWhisperer package from the official GitHub repository and install dependencies following the provided documentation [36].
  • Data Processing: Run the CellWhisperer data processing script on the prepared h5ad file to generate compatible embeddings.
  • Server Launch: Initiate the local CellWhisperer instance and configure it to access the processed dataset [36].
  • Integration with CELLxGENE: For enhanced visualization, connect CellWhisperer with the CELLxGENE Explorer browser to enable combined graphical and chat-based interaction [25].

Querying and Interaction Workflow

The experimental workflow for chat-based exploration follows an iterative process of question formulation, response generation, and biological validation:

Workflow summary: load the scRNA-seq dataset; preprocess the data (normalization, QC, embedding); formulate a natural-language query; CellWhisperer processes the query and generates a response; interpret the biological insights; validate findings via traditional methods; then refine the query and iterate until results are satisfactory, ending in biological discovery.

Figure 1: Workflow for interactive exploration of single-cell data using natural language queries with CellWhisperer.

Protocol for Biological Querying:

  • Initial Cluster Exploration: Begin with broad queries about cell population identities, such as "What cell types are present in this dataset?" or "Show me all immune cells in this tissue."
  • Marker Gene Identification: Request specific gene expression patterns with queries like "What genes define this cluster?" or "Which genes are differentially expressed between these two cell populations?"
  • Functional Annotation: Investigate biological processes with questions such as "Which pathways are active in these cells?" or "What is the functional role of this cell population?"
  • Comparative Analysis: Examine differences between conditions using queries like "How do these cells differ between treatment and control?" or "What changes occur in these cells during development?"

Validation and Interpretation Framework

To ensure biological relevance and technical accuracy, the following validation protocol should be implemented:

Analytical Validation Steps:

  • Cross-Reference with Known Markers: Verify CellWhisperer's cell type predictions by checking expression of established marker genes through conventional visualization methods.
  • Comparison with Standard Methods: Confirm differential expression findings using traditional statistical tests (e.g., Wilcoxon rank-sum test) and multiple testing correction.
  • Pathway Enrichment Correlation: Validate pathway activity predictions through gene set enrichment analysis using established databases like MSigDB.
  • Iterative Refinement: Use initial findings to formulate more specific follow-up questions, leveraging the chat-based interface for hypothesis generation.
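The comparison with standard statistical tests described above can be sketched without external dependencies: a normal-approximation rank-sum test (the quantity `scipy.stats.ranksums` computes, here without tie correction) followed by Benjamini-Hochberg adjustment. Function names are illustrative:

```python
import numpy as np
from math import erfc, sqrt

def rank_sum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum test via the normal approximation
    (a sketch without tie correction; use scipy.stats.ranksums in practice)."""
    n, m = len(x), len(y)
    combined = np.concatenate([x, y])
    order = combined.argsort()
    ranks = np.empty(n + m)
    ranks[order] = np.arange(1, n + m + 1)
    w = ranks[:n].sum()                       # rank sum of sample x
    mu = n * (n + m + 1) / 2.0
    sigma = sqrt(n * m * (n + m + 1) / 12.0)
    z = (w - mu) / sigma
    return erfc(abs(z) / sqrt(2.0))           # two-sided p-value

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg FDR adjustment of a vector of p-values."""
    p = np.asarray(pvals, dtype=float)
    order = p.argsort()
    scaled = p[order] * len(p) / np.arange(1, len(p) + 1)
    # enforce monotonicity from the largest p-value downward
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty_like(p)
    adjusted[order] = np.minimum(scaled, 1.0)
    return adjusted
```

Genes whose adjusted p-values clear a chosen FDR threshold can then be cross-checked against CellWhisperer's marker-gene answers.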

Table 2: Research Reagent Solutions for Multimodal Single-Cell Analysis

| Reagent/Resource | Function | Implementation in CellWhisperer |
| --- | --- | --- |
| CELLxGENE Census | Standardized single-cell data repository | Source of 376,983 pseudo-bulk transcriptomes for training [25] |
| ARCHS4 | Uniformly processed GEO RNA-seq data | Source of 705,430 human transcriptomes for training [25] |
| BioBERT Embeddings | Biomedical text representation | Processes textual annotations for joint embedding space [25] |
| Geneformer Model | Gene expression representation | Processes transcriptomic data for joint embedding space [25] |
| Mistral 7B LLM | Natural language understanding | Base model for chat functionality, fine-tuned on biological conversations [25] |
| CELLxGENE Explorer | Single-cell data visualization | Integrated platform for combined graphical and chat-based exploration [25] |

Applications in Drug Discovery and Development

Target Identification and Validation

The integration of multimodal chat-based exploration offers significant advantages for drug target identification and validation. By enabling researchers to intuitively interrogate single-cell datasets from diverse tissues and conditions, CellWhisperer facilitates:

  • Context-Specific Target Discovery: Identification of genes and pathways that show cell-type-specific expression patterns in disease-relevant contexts through queries such as "Which receptors are specifically expressed in diseased cells but not healthy counterparts?" [38]
  • Toxicity Prediction: Exploration of target expression in off-target tissues to assess potential adverse effects using natural language queries about gene expression across multiple organ systems.
  • Cell-Type-Specific Drug Action: Understanding how pharmaceutical compounds affect specific cell populations through perturbation analysis, as demonstrated in large-scale datasets like Tahoe-100M which includes 100 million individual cells exposed to over 1,100 drug compounds [38].

Functional Analysis of Therapeutic Response

Single-cell functional analysis provides critical insights into therapeutic mechanisms, particularly in complex systems like immuno-oncology. Multimodal integration enhances this analysis by:

  • Resolving Polyfunctional Cellular States: Identifying cells that simultaneously perform multiple functions, such as cytokine secretion and cytotoxic activity, through temporal analysis of live cell behaviors [39].
  • Predicting Therapy Resistance: Discovering transcriptional programs associated with treatment resistance by comparing pre- and post-treatment samples from clinical trials or model systems.
  • Optimizing Biologics Design: Guiding the selection of therapeutic antibodies and CAR-T cells by linking functional potency metrics with transcriptional signatures [39].

Pipeline summary: a therapeutic challenge and single-cell perturbation data (input data sources) are combined through multimodal integration in the CellWhisperer framework, which exposes a natural language query interface; the resulting mechanistic insights feed into therapeutic applications (drug discovery outputs).

Figure 2: Integration of multimodal single-cell analysis in the drug discovery pipeline, from data generation to therapeutic application.

Critical Analysis and Future Directions

Limitations of Current Approaches

While multimodal integration represents a significant advance in single-cell analysis, several limitations must be acknowledged:

  • Dependence on Training Data Quality: The performance of systems like CellWhisperer is intrinsically linked to the quality and diversity of their training data, which may contain biases from original study designs and metadata annotation practices [25].
  • Interpretability Challenges: The "black box" nature of large language models can make it difficult to trace the reasoning behind specific responses, potentially limiting trust in critical drug discovery applications.
  • Computational Resource Requirements: The infrastructure needed to run these models may present barriers to widespread adoption, particularly for academic laboratories with limited computing resources.
  • Validation Gap: As with all AI-based discovery tools, findings generated through chat-based exploration require rigorous experimental validation, which may delay implementation in high-stakes drug development pipelines.

Emerging Opportunities and Development Trajectories

The field of multimodal single-cell analysis is rapidly evolving, with several promising directions for future development:

  • Integration with Real-Time Functional Data: Combining transcriptomic profiles with dynamic functional measurements, such as cytokine secretion kinetics and cell-cell interaction dynamics [39], will create more comprehensive cellular models.
  • Expansion to Multi-Omic Modalities: Incorporating additional data types including epigenomics, proteomics, and spatial information will enable more holistic characterization of cellular states and their regulatory mechanisms [40].
  • Personalized Medicine Applications: Leveraging these tools to analyze patient-specific cellular responses to therapies, potentially guiding treatment selection based on individual molecular profiles.
  • Automated Discovery Pipelines: Developing fully integrated systems that combine hypothesis generation through natural language interaction with automated experimental design and validation.

As the technology matures, multimodal approaches like CellWhisperer have the potential to fundamentally transform how researchers interact with complex biological data, making sophisticated analysis accessible to a broader range of scientists and accelerating the translation of basic research into therapeutic applications [35] [36]. However, this promise must be balanced with rigorous validation and critical assessment of model outputs, particularly when applied to decision-making in drug development pipelines.

A fundamental challenge in single-cell RNA-sequencing (scRNA-seq) analysis is the persistent issue of batch effects—technical variations introduced from different experiments, labs, or technologies that are unrelated to the biological signals of interest. These effects hinder meaningful comparisons across datasets and can obscure true biological differences, such as those between disease and normal states [41]. While numerous batch-correction algorithms exist, many struggle to disentangle complex technical variations from nuanced biological states, particularly in a zero-shot setting where models are applied to new data without retraining or fine-tuning [5].

The emergence of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, promising to learn universal patterns from massive datasets that generalize across diverse tasks [1]. However, rigorous evaluation has revealed that many proposed foundation models exhibit significant limitations in zero-shot performance, sometimes being outperformed by simpler methods on tasks like cell type clustering and batch integration [5].

Within this context, scShift stands out as a novel deep identifiable model that specifically addresses the challenge of disentangling batch-dependent and batch-independent variations through a theoretically grounded variational inference framework. By leveraging large-scale scRNA-seq compendiums, scShift demonstrates remarkable zero-shot capabilities in characterizing biological states while overcoming batch effects, representing an important advance toward next-generation computational models for single-cell analysis [42] [28].

Theoretical Foundation and Model Architecture

The Identifiability Challenge in Single-Cell Data

The core innovation of scShift addresses a fundamental non-identifiability problem in statistics, where batch effects and biological variations become arbitrarily entangled in most nonlinear models. This conceptual barrier cannot be overcome merely through enhanced architectures or larger datasets, but requires a novel mathematical framework [28]. scShift approaches this by treating dataset labels as supervision signals to identify batch-dependent variations, which comprise both biological states and technical artifacts. Within individual datasets, these variations represent the biological differences of interest, enabling cross-dataset comparison under appropriate assumptions [28].

scShift Architectural Framework

The scShift model architecture employs a dual-encoder design that decomposes gene expression variations into two distinct latent representations:

  • Batch-independent variations (z_i): Represent intrinsic cellular properties (e.g., cell types) shared across datasets
  • Batch-dependent variations (s_i): Encode both biological states and batch effects, differing across datasets [28]

This approach differs fundamentally from previous methods that typically concatenate rather than sum these representations. The model consists of two encoders for centralized latent variables and dataset labels, whose outputs are combined to reconstruct gene expression distributions. Key regularization techniques include:

  • Probabilistic L0 regularization to enforce sparsity in dataset label encoding
  • Kernel Maximum Mean Discrepancy (MMD) to enforce independence between centralized latent variables and dataset label encoding
  • Random gene permutation (25% within mini-batches) to enhance model generalization [28]

After training, scShift decomposes the full centralized state into a biological embedding (non-zero dataset label components) and an unperturbed embedding (zero dataset label components), both extractable from new datasets in a zero-shot manner without additional training [28].
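The kernel MMD regularizer mentioned above can be illustrated with a minimal RBF-kernel estimator. This is a generic (biased) MMD² sketch, not scShift's exact estimator; `gamma` and the function name are illustrative:

```python
import numpy as np

def rbf_mmd2(x, y, gamma=1.0):
    """Squared kernel Maximum Mean Discrepancy between samples x and y
    with an RBF kernel. A vanishing MMD indicates the two sample sets are
    drawn from indistinguishable distributions, which is the independence
    property the regularizer pushes toward; a sketch, not scShift's code."""
    def k(a, b):
        # pairwise squared Euclidean distances, then RBF kernel
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()
```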

Workflow Diagram: scShift Architecture and Application

The following diagram illustrates scShift's core architecture and its application workflow for biological state characterization:

Architecture summary: input single-cell data is processed by two encoders (a batch label encoder and a centralized latent variable encoder) whose outputs are combined to reconstruct gene expression. The zero entries of the dataset label encoding form the unperturbed (batch-independent) embedding, used for cell type analysis; the non-zero entries form the biological (batch-dependent) embedding, used for biological state characterization.

Diagram: scShift Architecture and Workflow

Key Advantages of scShift

Comparative Advantages Over Alternative Approaches

Table 1: Comparison of scShift with Other Single-Cell Analysis Methods

| Method | Zero-Shot Capability | Disentanglement Approach | Batch Effect Handling | Biological State Characterization |
| --- | --- | --- | --- | --- |
| scShift | High (emergent with scaling) | Identifiable variational framework | Theoretical identifiability of batch-dependent variations | Explicit modeling via biological embeddings |
| scGPT | Variable (inconsistent performance) | Masked language model pretraining | Limited zero-shot batch integration | Not directly addressed |
| Geneformer | Low (underperforms baselines) | Attention-based representations | Poor zero-shot batch mixing | Not directly addressed |
| Harmony | N/A (requires dataset integration) | Linear integration | Effective for technical variation | Not directly addressed |
| scVI | N/A (requires dataset integration) | Probabilistic modeling | Effective for technical variation | Not directly addressed |
| HVG Selection | High (simple baseline) | Feature selection | Surprisingly effective in benchmarks | Limited to highly variable genes |

Emergent Zero-Shot Capabilities and Scaling Laws

A systematic evaluation of over 200 scShift models revealed two critical phenomena:

  • Emergent zero-shot capabilities with respect to donor numbers in the training set
  • A scaling law beyond a transition threshold with respect to donor and cell numbers [28]

This scaling behavior distinguishes scShift from other foundation models like scGPT and Geneformer, which have demonstrated inconsistent zero-shot performance and sometimes underperform simpler methods like highly variable genes (HVG) selection [5].

Experimental Protocols and Applications

Protocol 1: Training scShift Models on Single-Cell Compendiums

Purpose: To train scShift models capable of zero-shot biological state characterization across diverse tissues and conditions.

Input Data Requirements:

  • Assembled scRNA-seq compendium with multiple datasets (e.g., CellXGene census)
  • Minimum of 1,000,000 cells from multiple studies and donors recommended
  • Dataset labels for all cells
  • Gene expression counts matrix

Methodology:

  • Data Preprocessing:

    • Standard quality control and filtering
    • Normalization of gene expression values
    • Annotation with dataset labels
  • Model Configuration:

    • Dual encoder architecture with stochastic gates
    • L0 regularization for sparsity in dataset label encoding
    • MMD regularization for independence enforcement
    • 25% random gene permutation within mini-batches
  • Training Procedure:

    • Optimize evidence lower bound (ELBO) objective
    • Monitor reconstruction loss and regularization terms
    • Validate on held-out datasets
  • Model Outputs:

    • Trained scShift model parameters
    • Unperturbed embeddings (batch-independent)
    • Biological embeddings (batch-dependent) [28]
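The ELBO objective referenced in the training procedure can be written schematically. This is a notational sketch consistent with the description above (reconstruction term with summed latents, KL term, plus the L0 and MMD regularizers), not the exact published objective; the λ weights are illustrative:

```latex
\mathcal{L}(\theta,\phi) \;=\;
\mathbb{E}_{q_\phi(z_i, s_i \mid x_i)}\!\big[\log p_\theta(x_i \mid z_i + s_i)\big]
\;-\; \mathrm{KL}\!\big(q_\phi(z_i, s_i \mid x_i)\,\big\|\,p(z, s)\big)
\;-\; \lambda_{L_0}\,\Omega_{L_0}(s_i)
\;-\; \lambda_{\mathrm{MMD}}\,\widehat{\mathrm{MMD}}^2(z, s)
```

Here z_i is the batch-independent latent, s_i the batch-dependent encoding driven by the dataset label, Ω the sparsity penalty from probabilistic L0 regularization, and MMD² the independence penalty between the centralized latents and the dataset label encoding.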

Protocol 2: Zero-Shot Characterization of Lung Fibrosis States

Purpose: To apply a pretrained scShift model to characterize lung fibrosis states across different datasets, tissues, and experimental systems without additional training.

Input Data:

  • Pretrained scShift model (blood and lung tissues)
  • Query datasets including idiopathic pulmonary fibrosis (IPF), bleomycin-induced fibrosis, and COVID-19 fibrosis
  • Corresponding gene expression matrices

Methodology:

  • Embedding Extraction:

    • Process query datasets through pretrained scShift model
    • Extract biological embeddings representing disease states
    • Extract unperturbed embeddings representing cell types
  • Cross-Dataset Comparison:

    • Compare biological embeddings across different fibrosis models
    • Identify conserved and specific disease signatures
    • Project previously unseen conditions (e.g., post-COVID-19 fibrosis)
  • Biological State Characterization:

    • Identify universal myeloid-fibrosis signatures
    • Predict potential drug repurposing targets
    • Characterize fibrosis-associated cell interactions [42] [28]
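The cross-dataset comparison above can be as simple as comparing condition centroids in the biological embedding space. The helper below is hypothetical, intended to illustrate the idea rather than reproduce scShift's published analysis:

```python
import numpy as np

def centroid_similarity(emb_a, emb_b):
    """Cosine similarity between the mean biological embeddings of two
    conditions (e.g., IPF vs. bleomycin-induced fibrosis). High similarity
    suggests a conserved disease signature across experimental systems."""
    ca, cb = emb_a.mean(axis=0), emb_b.mean(axis=0)
    return float(ca @ cb / (np.linalg.norm(ca) * np.linalg.norm(cb)))
```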

Validation:

  • Compare with scRNA-seq measurements from chronic COVID-19 humanized mouse models
  • Assess conservation of identified signatures across experimental systems

Table 2: Key Research Reagents and Computational Resources for scShift Applications

| Resource | Type | Function | Availability |
| --- | --- | --- | --- |
| CZ CELLxGENE Census | Data resource | Standardized single-cell datasets for model training | https://cellxgene.cziscience.com/ |
| scShift GitHub Repository | Software | Implementation of scShift model framework | https://github.com/MingzeDong/scShift |
| Human Cell Atlas Data | Data resource | Reference data for model training and validation | https://www.humancellatlas.org/ |
| Tabula Sapiens | Data resource | Multi-organ single-cell transcriptomic atlas | Publicly available |
| scvi-tools | Software library | Deep probabilistic models for single-cell data | https://scvi-tools.org/ |
| biolord | Software | Alternative disentanglement method for comparison | https://github.com/nitzanlab/biolord |

Results and Validation

Performance in Biological State Disentanglement

When trained on a human blood scRNA-seq compendium comprising 1,000,000 cells from 30 studies and 2,538 donors, plus 240,090 cells from 144 drug perturbations, scShift demonstrated:

  • Effective preservation of cell type information in unperturbed embeddings while eliminating batch effects
  • Revelation of disease or perturbation-specific clusters when combining biological and unperturbed embeddings
  • Capacity for in silico modeling of biological states across datasets [28]

Notably, scShift does not necessarily outperform alternative methods like Harmony, scVI, scANVI, or scPoli in standard atlas integration benchmarks, as these tasks do not specifically require correct specification of biological differences or zero-shot capabilities [28].

Comparison with Other Disentanglement Methods

The biolord method represents another approach for disentangling single-cell data, specializing in decoupling known attributes (cell type, age, perturbation) from unknown attributes. While biolord has demonstrated strong performance in predicting cellular responses to unseen drugs and genetic perturbations, scShift offers distinct advantages for zero-shot characterization of biological states without requiring prior annotation of those states [43].

Implementation Considerations

Computational Requirements

Table 3: Computational Considerations for scShift Implementation

| Aspect | Requirements | Considerations |
| --- | --- | --- |
| Training Data Scale | Minimum 1,000,000 cells recommended | Scaling laws observed beyond transition threshold |
| Model Architecture | Deep variational inference framework with dual encoders | Requires specialized implementation |
| Training Time | Varies with dataset size and model complexity | Emergent zero-shot capabilities require sufficient training |
| Inference | Efficient embedding extraction for new datasets | Enables zero-shot application to query datasets |

Limitations and Future Directions

While scShift represents a significant advance in disentangling biological states from batch effects, several limitations and future directions deserve consideration:

  • Theoretical requirements: Successful disentanglement requires sufficient biologically distinct datasets to span possible biological variations
  • Model complexity: Implementation requires careful attention to identifiability constraints and regularization
  • Validation challenges: Ground truth for biological states is often unavailable, requiring indirect validation approaches

Future work may focus on extending the scShift framework to multi-omic data, incorporating spatial information, and improving scalability for even larger single-cell compendiums.

scShift represents a paradigm shift in computational single-cell analysis by addressing the fundamental identifiability challenge in distinguishing batch effects from true biological states. Through its theoretically grounded variational inference framework and demonstrated zero-shot capabilities, scShift enables researchers to characterize disease states, identify conserved signatures, and predict therapeutic targets across diverse datasets and experimental systems. As single-cell technologies continue to generate increasingly massive datasets, approaches like scShift that leverage scaling laws and emergent zero-shot capabilities will be essential for unlocking the full potential of single-cell genomics in biomedical research and therapeutic development.

Addressing Performance Challenges and Optimizing Zero-Shot Capabilities

In the rapidly evolving field of single-cell biology, foundation models (scFMs) such as scGPT and Geneformer promise a new paradigm for biological discovery. Their ability to perform zero-shot inference—making predictions on new, unseen data without explicit training—is particularly alluring for tasks like novel cell type identification or in silico perturbation prediction [6]. In principle, this capability could accelerate the understanding of complex cellular data and reveal previously unknown biology. However, a growing body of evidence indicates that the zero-shot deployment of these models is fraught with specific, systematic failure modes that can mislead research and discovery if not properly understood and mitigated [6] [44]. This application note details these common failure modes, provides quantitative evidence of their impact, and outlines standardized protocols for their rigorous evaluation.

A core challenge lies in the disconnect between the models' architectural potential and their practical performance. For instance, in scientific machine learning more broadly, machine-learned operators (MLOs) were designed to perform inference at arbitrary resolution, yet they comprehensively fail at "zero-shot super-resolution"—inference on higher-resolution data than they were trained on [45]. This brittleness, a result of being susceptible to aliasing and an inability to extrapolate to varying frequency information, underscores that architectural innovation alone is insufficient for robust zero-shot performance. This pattern of overestimation is acutely present in single-cell biology, where foundational models are increasingly integrated into critical analysis pipelines despite significant limitations [6].

Quantitative Analysis of Zero-Shot Underperformance

A systematic, zero-shot evaluation of popular single-cell foundation models reveals a significant performance gap compared to traditional methods. This underperformance is consistent across diverse datasets and tasks, challenging the presumption that these models have internalized general, transferable biological concepts.

Table 1: Zero-Shot Clustering Performance Comparison (Representative Data)

| Model/Method | Dataset A (mAcc) | Dataset B (mAcc) | Notes |
| --- | --- | --- | --- |
| Geneformer (6L) | 0.42 | 0.38 | Pre-trained on millions of cells [6] |
| scGPT | 0.45 | 0.41 | Pre-trained on CellxGene dataset [6] |
| scVI (traditional ML) | 0.68 | 0.72 | Probabilistic graphical model [6] |
| Harmony (traditional) | 0.71 | 0.69 | Integration algorithm [6] [44] |
| HVG Baseline | 0.58 | 0.55 | Simple feature selection (top 2,000 genes) [6] |
| Random Weights | 0.32 | 0.29 | Untrained model baseline [6] |

The data in Table 1, synthesized from a Microsoft Research study, shows that scFMs can perform worse than simpler, established statistical algorithms and even a basic feature selection strategy (Highly Variable Genes - HVG). In some cases, their performance approaches that of an untrained model, indicating a fundamental failure to learn transferable, robust representations during pre-training [6].

This failure is further exemplified in the models' core pre-training task: masked gene expression prediction. The logic is that by predicting withheld genes, the model will learn the deeper relationships between genes. However, evaluation shows that scGPT has a limited ability to predict held-out gene expression. Without conditioning on its internal cell embedding, it often predicts the median expression value for every gene, regardless of its true value. When using the cell embedding, performance only slightly improves, and primarily for highly expressed "housekeeping" genes that are less informative for distinguishing cell types [6]. This suggests the models are not learning the nuanced, context-dependent gene relationships essential for true biological understanding.
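The failure described here, a model that defaults to per-gene median predictions, can be made concrete with a simple baseline check. The helper below is illustrative, not code from the cited evaluation:

```python
import numpy as np

def mse_vs_median_baseline(expr_true, expr_pred):
    """Return (model_mse, baseline_mse) on held-out genes, where the
    baseline predicts each gene's median expression across cells,
    the trivial strategy described in the text."""
    gene_medians = np.median(expr_true, axis=0)
    baseline = np.broadcast_to(gene_medians, expr_true.shape)
    model_mse = float(((expr_pred - expr_true) ** 2).mean())
    baseline_mse = float(((baseline - expr_true) ** 2).mean())
    return model_mse, baseline_mse
```

A foundation model that has learned context-dependent gene relationships should achieve a model MSE well below the baseline MSE; parity with the baseline is exactly the failure mode described above.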

Common Failure Modes and Their Root Causes

The underperformance of zero-shot scFMs can be attributed to several interconnected failure modes.

Failure to Overcome Technical Confounders

A primary failure mode is the inability to cluster cells by biological function (e.g., cell type) in the presence of technical confounders or "batch effects." Input data for the same cell type can look different depending on the experiment, donor, or sequencing platform. A robust model must identify biological similarities despite these technical variations. Current scFMs often fail at this, as their embeddings inadvertently capture the technical aspects of the experiment rather than the underlying biology, leading to poor downstream clustering [6] [44].

Inadequate Cross-Modal and Spatial Generalization

While models are increasingly applied to multimodal data (e.g., integrating transcriptomics with epigenomics or spatial imaging), their zero-shot capability in aligning and reasoning across fundamentally different modalities is limited. Challenges persist in harmonizing heterogeneous data types, from sparse scATAC-seq matrices to high-resolution microscopy images, while preserving biological relevance [46]. This represents a significant failure mode in translating model insights to holistic biological understanding.

Superficial Understanding of Masked Modeling Task

As highlighted earlier, the self-supervised pre-training objective (masked gene prediction) does not guarantee a deep understanding of gene regulatory networks. The model can excel at the training task by learning superficial statistical correlations or focusing on highly expressed genes without capturing the causal or contextual relationships that govern cellular function [6]. This results in a model that is brittle and fails to generalize in a zero-shot manner to new datasets or biological contexts.

Over-reliance on Fine-Tuning for Evaluation

Many published claims of scFM performance are based on evaluations where the model is further trained (fine-tuned) on specific downstream tasks. This setup can be misleading, as performance improvements can be driven by the model learning dataset-specific artifacts during fine-tuning, rather than demonstrating that it learned meaningful, general biology during pre-training [6]. The true test of a foundation model's knowledge is its zero-shot performance.

[Diagram: the masked gene modeling pre-training task is intended to yield deep gene relationships, but the common outcome is superficial correlations, which produce the zero-shot failure modes: failure on technical confounders (e.g., batch effects), poor cross-modal generalization, inability to predict held-out genes, and performance worse than traditional methods.]

Diagram 1: Zero-shot failure mode pathways.

Experimental Protocol for Zero-Shot Evaluation

To systematically diagnose these failure modes, researchers should adopt a standardized, zero-shot benchmarking protocol. The following provides a detailed methodology for evaluating a model's clustering performance, a critical task for biological discovery.

Protocol: Zero-Shot Cell Clustering Evaluation

Objective: To assess a model's ability to generate embeddings that group cells by biological cell type, not by technical batch effects, without any task-specific fine-tuning.

Research Reagent Solutions: Table 2: Essential Materials for Evaluation

Item Function / Specification Example / Note
Benchmark Datasets Public scRNA-seq datasets with known cell types and strong batch effects. Use ≥2 datasets, e.g., from DISCO [46] or CZ CELLxGENE [46].
Foundation Model Pre-trained single-cell foundation model. scGPT, Geneformer; ensure access to embedding extraction method [6].
Baseline Methods Traditional algorithms for comparison. scVI (generative model), Harmony (integration), HVG + PCA (simple baseline) [6].
Clustering Algorithm Method to group cell embeddings. Leiden or K-means clustering. Use consistent algorithm and parameters.
Evaluation Metrics Quantify clustering quality and batch integration. ARI (Adjusted Rand Index) for cell type agreement, LISI (Local Inverse Simpson's Index) for batch mixing.

Step-by-Step Procedure:

  • Data Curation and Preprocessing:

    • Select at least two publicly available single-cell datasets where each cell has an annotated cell type and associated batch metadata (e.g., donor, sequencing run).
    • Critical: The datasets must exhibit a known and significant batch effect. Preprocess the data according to standard practices (e.g., log-normalization). Do not perform batch correction.
  • Embedding Extraction (Zero-Shot):

    • For the scFM under test, pass the preprocessed gene expression matrix for the entire dataset through the model to extract a latent embedding for each cell.
    • Crucially, do not fine-tune the model on the target dataset. Use the model in a purely zero-shot manner.
    • For baseline methods (e.g., scVI, Harmony), generate cell embeddings following their standard, documented procedures.
  • Clustering and Evaluation:

    • Apply the chosen clustering algorithm (e.g., Leiden clustering) to the embeddings generated by each method (scFM and baselines).
    • Calculate the Adjusted Rand Index (ARI) by comparing the resulting cluster labels to the ground-truth cell type annotations. A higher ARI indicates better biological clustering.
    • Calculate a batch effect metric, such as LISI, on the embeddings. A higher LISI score indicates better mixing of batches, meaning the model's representations are less confounded by technical noise.
  • Analysis and Interpretation:

    • Compare the ARI scores of the scFM against the traditional baselines. Underperformance indicates a failure to learn robust biological representations.
    • Analyze the LISI scores. If the scFM has a low LISI score compared to methods like Harmony, it confirms a sensitivity to batch effects.
    • Visualize the embeddings using UMAP or t-SNE to qualitatively inspect whether clusters align with cell types or with experimental batches.

[Diagram: raw scRNA-seq dataset (known cell types and batches) → preprocessing (normalization, no batch correction) → zero-shot embedding extraction → clustering algorithm (e.g., Leiden) → quantitative evaluation with ARI (cell type agreement) and LISI (batch effect mixing).]

Diagram 2: Zero-shot clustering evaluation workflow.
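The clustering and evaluation steps of this protocol (Step 3) can be sketched as follows. This is a minimal, illustrative example: the embeddings and labels are synthetic stand-ins for real zero-shot model output, and the LISI implementation is a bare-bones version of the metric (in practice, a package such as scib-metrics would be used).

```python
# Minimal sketch of the clustering-evaluation step, assuming cell embeddings
# have already been extracted zero-shot. Embeddings and labels are synthetic.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)

# Synthetic embeddings: 3 "cell types", 2 "batches" (batch adds a small shift).
n_per, dim = 50, 10
cell_type = np.repeat([0, 1, 2], n_per)
batch = np.tile([0, 1], 75)
centers = rng.normal(size=(3, dim)) * 5
embeddings = centers[cell_type] + rng.normal(size=(150, dim))
embeddings += batch[:, None] * 0.5          # mild batch effect

# Cluster the embeddings and score agreement with cell-type labels (ARI).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(embeddings)
ari = adjusted_rand_score(cell_type, clusters)

def lisi(emb, labels, k=30):
    """Mean local inverse Simpson's index over k-NN neighborhoods.
    Ranges from 1 (no mixing) to the number of label categories (full mixing)."""
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
    scores = []
    for i in range(len(emb)):
        nn = np.argsort(d[i])[1:k + 1]       # k nearest neighbors (excluding self)
        _, counts = np.unique(labels[nn], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))
    return float(np.mean(scores))

batch_lisi = lisi(embeddings, batch)         # near 2 => batches well mixed
print(f"ARI={ari:.2f}, batch LISI={batch_lisi:.2f}")
```

A real evaluation would run this over each method's embeddings (scFM and baselines) on the same dataset and compare the resulting ARI and LISI values.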

Discussion and Mitigation Strategies

The consistent underperformance of zero-shot scFMs reveals a critical gap between their potential and their current utility for de novo biological discovery. The failure modes described herein—sensitivity to confounders, inadequate cross-modal understanding, and superficial pre-training—suggest that these models, in their present form, may not have learned the deep, causal principles of biology that would enable robust generalization.

Moving forward, the field must adopt more rigorous and context-aware evaluation practices. Benchmarking should prioritize zero-shot and few-shot settings to truly assess generalizability, rather than relying on fine-tuning performance which can mask fundamental shortcomings [6]. Furthermore, mitigation strategies are needed. Multi-resolution training, as proposed for scientific machine learning operators, could be adapted for scFMs, where models are explicitly trained on data with varying levels of technical noise and biological complexity to build inherent robustness [45]. Similarly, developing benchmarking frameworks specifically designed for creating challenging, scalable zero-shot tests for any biological task can drive improvement, as seen in natural language processing [47].

Zero-shot capability is the benchmark for true understanding in foundation models. For single-cell biology, the evidence indicates that current models have not yet reached this milestone. Their susceptibility to technical confounders and inability to outperform simpler, traditional methods in a zero-shot setting necessitates a cautious and critical approach to their adoption in research pipelines. Integrating them into cell atlases or bioinformatics packages without principled evaluation of their zero-shot limits risks misleading scientific conclusions [6]. Future progress depends on a community-wide shift towards more rigorous, transparent, and biologically-grounded evaluation, fostering the development of models that genuinely learn the language of life.

In single-cell biology, foundation models (FMs) pretrained on massive datasets promise to transform how we analyze cellular heterogeneity, identify novel cell types, and predict molecular responses to perturbations. The capability to perform zero-shot learning—where models execute tasks without task-specific training—is particularly valuable in discovery settings where biological labels are unknown or undefined. However, recent rigorous evaluations reveal that proposed single-cell FMs, including scGPT and Geneformer, demonstrate inconsistent zero-shot performance and are sometimes outperformed by simpler methods in critical tasks like cell type clustering and batch integration [5]. This performance gap underscores a fundamental truth: model capabilities are inextricably linked to pretraining data quality. The curation of effective pretraining corpora is not merely a preliminary step but a determinant of model success, especially for zero-shot generalization to unseen cell lines and experimental conditions [7] [5].

The zero-shot challenge manifests clearly in biological discovery contexts. When researchers explore uncharted tissue microenvironments or disease states, they lack predefined labels for fine-tuning. In these scenarios, models must rely entirely on the fundamental biological representations absorbed during pretraining. Current evidence suggests that without meticulous data curation, even models trained on millions of cells may fail to capture transferable biological principles, limiting their utility in the very discovery contexts where they promise the greatest value [5] [48]. This application note establishes protocols and frameworks to address this critical limitation through systematic, quality-focused corpus curation.

The Zero-Shot Performance Gap: A Data Quality Issue

Comprehensive evaluations of single-cell foundation models reveal troubling inconsistencies in zero-shot settings. When analyzing embeddings from scGPT and Geneformer without any fine-tuning, researchers found these models underperformed compared to established baselines like highly variable gene (HVG) selection and integration methods such as Harmony and scVI across multiple metrics, including average BIO score for cell type clustering [5]. Surprisingly, the simple approach of selecting HVGs consistently outperformed both proposed foundation models in batch integration tasks [5].

Table 1: Zero-Shot Performance Comparison Across Single-Cell Analysis Methods

Method Cell Type Clustering (AvgBIO) Batch Integration Data Requirements Zero-Shot Reliability
scGPT Variable performance; matches baselines on some datasets (e.g., PBMC 12k) but underperforms on others Moderate success on complex biological batch effects 33+ million human cells [49] Inconsistent across tasks [5]
Geneformer Consistently outperformed by simpler methods Poor performance across metrics; structure primarily driven by batch effects 30 million single-cell transcriptomes [49] Low reliability [5]
HVG Selection Competitive performance across multiple datasets Superior batch integration scores across datasets Minimal High reliability [5]
Harmony Strong performance on cell type separation Excellent for technical batch effects Minimal High for standard tasks [5]
scVI Strong performance across datasets Good integration, struggles with complex biological variation Minimal High for standard tasks [5]

The implications of this performance gap extend directly to real-world research applications. In perturbation prediction, where researchers aim to forecast cellular responses to novel drugs, the limitations of zero-shot capability present significant barriers. While newer models like scShift demonstrate remarkable zero-shot capabilities in revealing representations of cell types and biological states when trained on compendia of scRNA-seq atlases [28], the overall landscape suggests that data quality rather than model architecture alone may be the limiting factor for many existing approaches.

Core Dimensions of Pretraining Data Quality

Scale and Diversity: Establishing the Foundation

The scaling laws governing single-cell foundation models demonstrate emergent zero-shot capabilities beyond specific thresholds of data volume and diversity. Models like CellFM, trained on 100 million human cells with 800 million parameters, show significantly enhanced performance across diverse applications including cell annotation, perturbation prediction, and gene function prediction [49]. Similarly, scShift exhibits emergent zero-shot capabilities and follows a scaling law beyond a transition threshold with respect to dataset diversity [28].

Table 2: Scaling of Single-Cell Foundation Models and Zero-Shot Performance

Model Training Scale Parameters Key Zero-Shot Capabilities Performance Highlights
CellFM 100 million human cells [49] 800 million Cell annotation, perturbation prediction, gene function prediction Outperforms existing models across diverse applications [49]
scPRINT 50 million cells [50] 100 million Gene network inference, denoising, batch effect correction, cell label prediction Superior performance in gene network inference to state-of-the-art [50]
scGPT 33 million human cells [49] Not specified Cell type annotation, batch correction Inconsistent zero-shot performance in independent evaluations [5]
Geneformer 30 million single-cell transcriptomes [49] Not specified Cell embedding, generalization to unseen datasets Underperforms simpler methods in zero-shot settings [5]
scShift 1,000,000 cells from 30 studies and 2,538 donors [28] Not specified Revealing cell types and biological states, overcoming batch effects Emergent zero-shot capabilities with scaling law beyond threshold [28]

Critical to effective scaling is not merely cell count but compositional diversity. The CellFM pretraining corpus exemplifies this principle, incorporating 102 million human cells from diverse organs and sequencing technologies, including 46.3 million cells from normal donors and additional cells from diseased states [49]. This diversity enables the model to capture a more comprehensive representation of biological variation, forming the basis for robust zero-shot inference.

Quality Control and Standardization

Effective pretraining corpora require rigorous quality control (QC) protocols to remove technical artifacts while preserving biological signal. Standardized QC metrics include:

  • Cell-level QC: Identification of true cells through UMI counts (typically >500 UMIs/cell), gene detection rates, and mitochondrial ratios [51]
  • Gene-level QC: Filtering of lowly expressed genes that contribute noise rather than signal
  • Doublet Detection: Identification and removal of multiplets that misrepresent cellular states

The Seurat toolkit provides automated metadata generation for these QC metrics, including nCount_RNA (number of UMIs per cell), nFeature_RNA (number of genes detected per cell), and mitochondrial ratio (percentage of reads mapping to mitochondrial genes) [51]. These metrics must be contextualized with biological expectations, as certain cell types naturally exhibit higher mitochondrial content or lower complexity.

Data standardization presents significant challenges in single-cell corpus curation. The scPRINT team addressed this through a standardized data analysis workflow that included quality control for filtering cells and genes, gene name standardization according to HUGO Gene Nomenclature Committee guidelines, and conversion to unified sparse matrix formats [49]. Such standardization is prerequisite for effective model pretraining, as inconsistent gene identifiers or normalization approaches introduce noise that undermines zero-shot capabilities.

Annotation Quality and Metadata Richness

The utility of pretraining data extends beyond expression counts to encompass rich biological annotations. The scPRINT model demonstrates the value of comprehensive metadata, incorporating cell type, disease status, sex, organism, ethnicity, and sequencing platform information during pretraining [50]. This multi-faceted annotation enables the model to learn disentangled representations of biological variation, enhancing zero-shot transfer to new datasets and conditions.

Models trained on weakly annotated data face fundamental limitations in zero-shot settings. As noted in evaluations of existing foundation models, "The significance of zero-shot evaluation is particularly pronounced in single-cell biology, where many tasks are exploratory and lack predefined labels that limit the feasibility of fine-tuning" [5]. Comprehensive annotations during pretraining provide the semantic framework that enables models to generalize to unlabeled data in downstream applications.

Experimental Protocols for Data Curation

Protocol 1: Multi-Dimensional Quality Control

Principle: Implement tiered QC metrics to balance removal of technical noise with preservation of biological diversity.

Materials:

  • Single-cell expression matrices (raw counts)
  • Computing environment with R/Python and single-cell analysis tools (Seurat, Scanpy)
  • Metadata template capturing experimental conditions and donor characteristics

Procedure:

  • Cell-level Filtering:

    • Calculate QC metrics: nFeature_RNA, nCount_RNA, mitoRatio [51]
    • Apply thresholds tailored to biological context:
      • Minimum UMI threshold: Retain cells with >500 UMIs
      • Gene detection threshold: Retain cells with 250-3000 detected genes
      • Mitochondrial threshold: Exclude cells with >20% mitochondrial reads (adjust based on cell type)
  • Gene-level Filtering:

    • Remove genes detected in <10 cells to eliminate sparse features
    • Retain protein-coding genes while considering removal of ribosomal and mitochondrial genes depending on research focus
  • Batch Effect Assessment:

    • Visualize data distribution by sequencing batch using PCA or UMAP
    • Quantify batch effects using metrics like PC regression and kBET
    • Document batch structure for potential integration during training
  • Biological Validation:

    • Verify expected cell types are present through marker gene expression
    • Confirm biological gradients (differentiation, activation) are preserved
    • Cross-reference with public datasets to identify potential sample swaps or mislabeling

[Diagram: QC workflow. Raw count matrix → cell-level QC → gene-level QC → batch effect assessment → biological validation → curated corpus.]
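The cell- and gene-level filtering steps of this protocol can be sketched with numpy on a synthetic count matrix. The thresholds mirror the protocol above and should be tuned to the biological context; the matrix and the choice of "mitochondrial" gene indices are illustrative stand-ins.

```python
# Sketch of cell- and gene-level QC filtering on a synthetic count matrix.
import numpy as np

rng = np.random.default_rng(1)
n_cells, n_genes, n_mito = 200, 1000, 13
counts = rng.poisson(1.2, size=(n_cells, n_genes))
counts[:5] = 0                      # a few empty "cells" for illustration
mito_idx = np.arange(n_mito)        # pretend the first 13 genes are mitochondrial

n_count = counts.sum(axis=1)                       # UMIs per cell
n_feature = (counts > 0).sum(axis=1)               # genes detected per cell
mito_ratio = counts[:, mito_idx].sum(axis=1) / np.maximum(n_count, 1)

# Cell-level filter: >500 UMIs, 250-3000 detected genes, <20% mitochondrial.
cell_mask = ((n_count > 500) & (n_feature >= 250)
             & (n_feature <= 3000) & (mito_ratio < 0.20))

# Gene-level filter: keep genes detected in at least 10 retained cells.
gene_mask = (counts[cell_mask] > 0).sum(axis=0) >= 10

filtered = counts[cell_mask][:, gene_mask]
print(f"kept {cell_mask.sum()}/{n_cells} cells and {gene_mask.sum()}/{n_genes} genes")
```

In practice the same masks would be computed from Seurat or Scanpy QC metadata (nCount_RNA, nFeature_RNA, mitochondrial ratio) rather than raw numpy arrays.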

Protocol 2: Metadata Harmonization

Principle: Establish consistent annotation schema across datasets to enable cross-dataset learning.

Materials:

  • Source datasets with variable annotation formats
  • Controlled vocabularies (Cell Ontology, Uberon, Disease Ontology)
  • Computational resources for text mining and normalization

Procedure:

  • Vocabulary Mapping:

    • Map cell type terms to Cell Ontology (CL) identifiers
    • Annotate tissues with Uberon anatomy ontology terms
    • Classify diseases using MONDO or DOID disease ontologies
  • Experimental Metadata Capture:

    • Standardize sequencing platform descriptors (10x 3', Smart-seq2, etc.)
    • Normalize donor characteristics (age, sex, ethnicity)
    • Document sample preparation protocols and library preparation kits
  • Quality Tier Classification:

    • Tier 1: Complete ontology mapping + full experimental metadata
    • Tier 2: Partial ontology mapping + key experimental details
    • Tier 3: Basic cell type labels + minimal metadata
  • Metadata Integration:

    • Create unified annotation table linking all samples to standardized terms
    • Implement version control for ontology updates and corrections
    • Generate quality reports highlighting missing or inconsistent annotations
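The vocabulary-mapping and quality-tier steps above can be sketched as follows. The Cell Ontology identifiers are real CL terms, but the lookup table, the normalization rules, and the tiering logic are illustrative placeholders, not an official API.

```python
# Sketch of free-text label harmonization and quality-tier classification.
def normalize(label: str) -> str:
    """Fold trivial formatting differences before ontology lookup."""
    return label.strip().lower().replace("-", " ").rstrip("s")

CELL_ONTOLOGY = {normalize(k): v for k, v in {
    "T cell": "CL:0000084",
    "B cell": "CL:0000236",
    "natural killer cell": "CL:0000623",
}.items()}

def map_cell_type(label: str):
    """Map a free-text cell-type label to a CL identifier, or None if unmapped."""
    return CELL_ONTOLOGY.get(normalize(label))

def quality_tier(record: dict) -> int:
    """Tier 1: ontology-mapped + full metadata; Tier 2: partial; Tier 3: minimal."""
    mapped = map_cell_type(record.get("cell_type", "")) is not None
    meta_keys = {"platform", "age", "sex", "tissue"}
    have = meta_keys & record.keys()
    if mapped and have == meta_keys:
        return 1
    if mapped or len(have) >= 2:
        return 2
    return 3

sample = {"cell_type": "T-cells", "platform": "10x 3'", "sex": "female",
          "age": 34, "tissue": "blood"}
print(map_cell_type("T-cells"), quality_tier(sample))
```

A production pipeline would replace the hand-written dictionary with programmatic lookups against the Cell Ontology, Uberon, and MONDO/DOID releases, with version control as described in the procedure.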

The Scientist's Toolkit: Essential Research Reagents

Table 3: Critical Tools for Curating Single-Cell Pretraining Corpora

Tool/Resource Function Application in Corpus Curation
Seurat/Scanpy Single-cell analysis toolkits Quality control metric calculation, visualization, and basic filtering [51]
CellxGene Census Standardized single-cell data repository Source of curated datasets with consistent formatting [49] [28]
Cell Ontology Structured controlled vocabulary for cell types Standardizing cell type annotations across datasets [49]
SynEcoSys Database Data processing and standardization platform Unified processing of diverse dataset formats into analysis-ready matrices [49]
Harmony/ScVI Batch integration methods Assessing and correcting for batch effects in aggregated data [5]
ESM2 Protein Language Model Protein sequence embeddings Generating meaningful gene representations based on protein sequences [50]

The development of robust zero-shot single-cell foundation models requires a fundamental reimagining of pretraining corpus curation. Current evidence demonstrates that data quality—encompassing scale, diversity, standardization, and annotation richness—directly determines model performance in discovery settings where fine-tuning is impossible. By implementing the rigorous quality control protocols, metadata harmonization standards, and systematic evaluation frameworks outlined in this application note, researchers can create pretraining corpora that enable true biological insight rather than technical artifact recapitulation. The future of single-cell computational biology depends not merely on larger models, but on better data—curated with biological insight and computational rigor.

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by enabling the profiling of gene expression at the level of individual cells. The analysis of this data, however, is complicated by its high dimensionality, technical noise, and sparse nature, where excess zeros (dropouts) can mask true biological signals [52]. Single-cell foundation models (scFMs), pretrained on millions of cells, have emerged as powerful tools to overcome these challenges [1] [9]. A critical factor determining their performance is their architectural design, specifically how they represent genes as numerical vectors (gene embeddings) and how they model interactions between genes (attention mechanisms). These components are particularly vital for zero-shot learning, where a model must perform tasks on new data without any additional training [5] [28]. This application note details key architectural innovations in gene embeddings and attention mechanisms, provides protocols for their evaluation, and offers a toolkit for researchers aiming to advance zero-shot learning in single-cell biology.

Key Architectural Components and Their Zero-Shot Impact

Advanced Gene Embedding Strategies

Gene embeddings are dense, low-dimensional vector representations that capture the functional and contextual meaning of genes. Moving beyond simple identifier-based embeddings is crucial for model performance.

  • Context-Specific Embeddings via Biological Networks: Models like scNET integrate scRNA-seq data with protein-protein interaction (PPI) networks using a dual-view graph neural network (GNN) architecture [52]. This allows the model to learn gene representations that are refined by both expression patterns and known functional relationships. The GNN propagates gene expression information across the PPI network, smoothing technical noise and yielding embeddings that better capture biological pathways and complexes. In zero-shot settings, these embeddings have demonstrated a higher correlation with Gene Ontology (GO) semantic similarity compared to methods that use expression data alone [52].
  • Multimodal and Metadata-Enhanced Embeddings: State-of-the-art scFMs enrich their input by combining a gene's unique identifier embedding with a separate embedding for its expression value [1] [4]. Furthermore, some models incorporate additional biological context, such as gene ontology terms or chromosomal location, into the tokenization process. This creates a more informative starting representation, which is a foundational element for robust zero-shot generalization [1] [9].

Innovative Attention Mechanisms

The attention mechanism enables a model to dynamically weigh the importance of different genes when processing a cell's expression profile. Refining this mechanism is key to capturing biological relationships.

  • Overcoming Non-Sequential Data with Forced Ordering: A fundamental challenge in applying transformers to single-cell data is that gene expression is not inherently sequential. To address this, models employ strategies to impose a consistent order. Common methods include ranking genes by their expression value within each cell or binning genes based on expression levels before feeding them into the transformer [1] [4]. This creates a deterministic sequence that allows the attention mechanism to function.
  • Bias-Free and Knowledge-Guided Attention: Some architectures, such as the one used in scGPT, utilize a causal masking mechanism in their self-attention layers. This mechanism prevents the model from attending to future "tokens" (genes) in the sequence, which is well-suited for generative tasks [1]. Other innovations focus on injecting biological prior knowledge directly into the attention weights. For instance, scPlantFormer integrates phylogenetic constraints, guiding the attention to prioritize evolutionarily conserved relationships, which enhances cross-species annotation accuracy [9].
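The causal masking described above can be sketched in a few lines of numpy: attention scores toward "future" tokens in the ordered gene sequence are set to negative infinity before the softmax, so each position attends only to itself and earlier positions. The query/key matrices here are random stand-ins for real model activations.

```python
# Minimal numpy sketch of causal masking in single-head self-attention.
import numpy as np

rng = np.random.default_rng(4)
L, d = 5, 8                                  # sequence length (genes), head dim
q, k = rng.normal(size=(L, d)), rng.normal(size=(L, d))

scores = q @ k.T / np.sqrt(d)                # raw scaled attention scores
causal = np.tril(np.ones((L, L), dtype=bool))
scores = np.where(causal, scores, -np.inf)   # block attention to future tokens

# Numerically stable softmax over each row.
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)

# Row i has nonzero weight only for columns 0..i.
print(np.round(weights, 2))
```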

The following diagram illustrates the integration of these components into a model architecture designed for effective zero-shot learning.

[Diagram: raw scRNA-seq data → tokenization and embedding (gene ID + expression value, PPI network integration via GNN, biological metadata such as GO terms) → transformer with attention mechanism (gene ranking/binning for order, causal masking, biologically guided attention) → gene and cell embeddings.]

Quantitative Benchmarking of Model Performance

Evaluating the zero-shot performance of scFMs is essential to understand the real-world effectiveness of these architectural innovations. Benchmarking studies compare scFMs against established baseline methods on common biological tasks.

Table 1: Zero-Shot Performance in Cell Type Clustering (AvgBIO Score) [5]

Model / Method PBMC (12k) Pancreas Tabula Sapiens Immune Dataset
HVG (Baseline) 0.51 0.45 0.49 0.48
scVI (Baseline) 0.48 0.42 0.51 0.46
Harmony (Baseline) 0.47 0.41 0.47 0.45
scGPT 0.52 0.39 0.48 0.43
Geneformer 0.35 0.33 0.37 0.35

Table 2: Zero-Shot Performance in Batch Integration (Batch Mixing Score) [5]

Model / Method PBMC (12k) Pancreas Tabula Sapiens Immune Dataset
HVG (Baseline) 0.94 0.91 0.89 0.90
scVI (Baseline) 0.89 0.87 0.85 0.81
Harmony (Baseline) 0.85 0.83 0.76 0.84
scGPT 0.82 0.78 0.84 0.83
Geneformer 0.71 0.65 0.69 0.68

Table 3: Functional Quality of Gene Embeddings (GO Term Prediction AUROC) [52] [4]

Embedding Method AUROC (Mean) Key Feature
Original Counts 0.59 Baseline from raw data
DeepImpute 0.64 Imputation-focused
scLINE 0.66 Graph embedding with networks
scGPT 0.68 Foundation model pretraining
scNET 0.73 PPI network integration

The data reveals that while scFMs show promise, their zero-shot performance can be inconsistent and is sometimes surpassed by simpler methods like Highly Variable Gene (HVG) selection [5]. This highlights a critical area for improvement in model architecture and pretraining. However, models that integrate external biological knowledge, such as scNET, demonstrate a clear advantage in capturing functional gene relationships, which is a key aspect of biological relevance [52] [4].

Experimental Protocols

Protocol 1: Evaluating Zero-Shot Cell Embedding Quality

Objective: To assess the quality of cell embeddings generated by an scFM without any fine-tuning, for tasks like cell type clustering and batch integration.

Materials: Pretrained scFM (e.g., scGPT, Geneformer), query scRNA-seq dataset (in h5ad or similar format), computing resources (GPU recommended), and evaluation software (e.g., scib-metrics or scanpy).

Methodology:

  • Data Preprocessing: Load the query dataset. Perform basic quality control (filtering low-quality cells and genes) and normalize the data if required by the model's preprocessing protocol. Do not correct for batch effects.
  • Embedding Extraction: Feed the preprocessed gene expression matrix of the query dataset into the pretrained scFM. In a zero-shot setting, disable gradient updates and run the model in inference mode to extract the cell embeddings from the model's output layer.
  • Dimensionality Reduction & Visualization: Apply dimensionality reduction techniques like UMAP or t-SNE to the extracted cell embeddings to create 2D/3D visualizations.
  • Quantitative Evaluation:
    • Cell Clustering: Use Leiden or Louvain clustering on the cell embeddings. Compare the resulting clusters to known cell type labels using metrics like Adjusted Rand Index (ARI) or Normalized Mutual Information (NMI).
    • Batch Integration: Calculate metrics like the Average Silhouette Width (ASW) for batch labels (lower is better) and for cell type labels (higher is better). The principal component regression (PCR) score can also be used to quantify the variance explained by batch before and after integration [5].
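The silhouette-based part of the quantitative evaluation can be sketched with scikit-learn's silhouette_score. The embeddings and labels below are synthetic stand-ins for real zero-shot model output; in a well-integrated embedding, cell-type ASW should be high and batch ASW should be low.

```python
# Sketch of ASW evaluation for cell-type separation vs. batch mixing.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
n, dim = 120, 8
cell_type = np.repeat([0, 1], n // 2)        # two well-separated "cell types"
batch = np.tile([0, 1], n // 2)              # two interleaved "batches"
centers = np.stack([np.zeros(dim), np.full(dim, 4.0)])
emb = centers[cell_type] + rng.normal(size=(n, dim)) + batch[:, None] * 0.3

asw_celltype = silhouette_score(emb, cell_type)  # want high: types separate
asw_batch = silhouette_score(emb, batch)         # want low: batches mixed
print(f"cell-type ASW={asw_celltype:.2f}, batch ASW={asw_batch:.2f}")
```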

Protocol 2: Assessing Gene Embedding Functional Relevance

Objective: To determine if the gene embeddings produced by a model capture biologically meaningful relationships.

Materials: Gene embedding matrix from a pretrained scFM, Gene Ontology (GO) database, gene similarity software (e.g., GOSemSim).

Methodology:

  • Embedding Extraction: Extract the gene embedding matrix from the input layer of the pretrained scFM. Each row corresponds to a gene's vector representation.
  • Similarity Calculation: Calculate pairwise cosine similarity between all gene embedding vectors to create a model-derived gene similarity matrix.
  • Benchmarking Against Prior Knowledge: Calculate a separate gene similarity matrix based on established biological knowledge, such as GO semantic similarity, which measures the relatedness of genes based on their shared annotations in the GO hierarchy.
  • Correlation Analysis: Compute the correlation (e.g., Spearman's rank) between the model-derived gene similarity matrix and the knowledge-based similarity matrix. A higher correlation indicates that the model has learned biologically plausible gene relationships [52] [4].
  • Functional Prediction (Optional): Train a simple classifier (e.g., a multi-layer perceptron) to predict GO term annotations for genes using their embeddings as features. Evaluate the classifier using area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPR) in a cross-validation setting to quantitatively benchmark the functional content of the embeddings [52].
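Steps 2-4 of this protocol can be sketched as follows. Both similarity matrices here are synthetic stand-ins: in a real analysis the knowledge-based matrix would come from GO semantic similarity (e.g., computed with GOSemSim) rather than from noisy perturbation of the model-derived matrix.

```python
# Sketch of comparing model-derived gene similarity to knowledge-based similarity.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)
n_genes, dim = 40, 16
emb = rng.normal(size=(n_genes, dim))        # stand-in gene embedding matrix

# Model-derived similarity: pairwise cosine similarity of embedding rows.
unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
model_sim = unit @ unit.T

# Knowledge-based similarity: a noisy copy, standing in for GO semantic similarity.
knowledge_sim = model_sim + rng.normal(scale=0.3, size=model_sim.shape)

# Correlate the off-diagonal pairs (upper triangle) of the two matrices.
iu = np.triu_indices(n_genes, k=1)
rho, pval = spearmanr(model_sim[iu], knowledge_sim[iu])
print(f"Spearman rho={rho:.2f}")
```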

The Scientist's Toolkit: Research Reagent Solutions

Table 4: Essential Computational Tools for scFM Research

Tool / Resource Type Primary Function in Research Key Application
CZ CELLxGENE [1] [9] Data Platform Provides unified access to millions of curated single-cell datasets. Pretraining corpus for scFMs; source of benchmark datasets.
PPI Networks (e.g., STRING) [52] Biological Network Database of known and predicted protein-protein interactions. Integrating functional context into gene embeddings (e.g., in scNET).
BioLLM [9] Software Framework Standardized interface for benchmarking and accessing multiple scFMs. Streamlining model evaluation and comparison across different tasks.
scib-metrics Metric Suite A standardized set of metrics for evaluating single-cell data integration. Quantifying batch correction and biological conservation in embeddings.
Hugging Face Model Repository Platform for sharing and versioning pretrained machine learning models. Distributing and downloading weights of pretrained scFMs.

Architectural innovations in gene embeddings and attention mechanisms are fundamental to advancing the zero-shot capabilities of single-cell foundation models. While current models show immense promise, benchmarking indicates that achieving consistent, state-of-the-art zero-shot performance remains a challenge. The integration of structured biological knowledge—such as PPI networks and gene ontology—directly into model architecture appears to be a particularly powerful strategy for enhancing the biological relevance of the learned representations. The protocols and tools outlined in this document provide a foundation for researchers to rigorously evaluate and contribute to the next generation of scFMs, ultimately accelerating discovery in biology and drug development.

Efficient Fine-Tuning and Adaptation Strategies for Enhanced Generalization

Single-cell foundation models (scFMs), pre-trained on tens of millions of single-cell transcriptomes, have emerged as powerful tools for capturing universal representations of cellular states [53] [1]. These models, including scGPT, Geneformer, and CellFM, leverage transformer architectures to learn the complex relationships between genes and cellular contexts [53] [3]. However, their utility in real-world biological discovery—particularly in zero-shot learning settings where models must generalize to unseen data without task-specific training—faces significant challenges [5] [6]. Current evaluations reveal that scFMs often underperform simpler methods in zero-shot scenarios for tasks like cell type annotation and batch integration [5] [54]. This application note addresses these limitations by presenting structured protocols for efficient fine-tuning, enabling robust generalization in critical applications such as molecular perturbation prediction and cross-system biological discovery.

Foundational Concepts and Model Landscape

Architectures of Single-Cell Foundation Models

scFMs adapt transformer architectures, originally developed for natural language processing, to interpret gene expression data by treating individual cells as "sentences" and genes or their expression values as "tokens" or "words" [53] [1]. This conceptual framework allows models to learn the contextual relationships between genes across diverse cellular environments. The two predominant architectural paradigms are:

  • Encoder-only models (e.g., scBERT, Geneformer): Utilize bidirectional attention mechanisms to learn gene representations from full cellular contexts, making them particularly effective for classification and embedding tasks [1] [8].
  • Decoder-only models (e.g., scGPT): Employ autoregressive, masked self-attention to iteratively predict gene expression, demonstrating strengths in generative tasks and perturbation modeling [1] [9].

Tokenization Strategies for Single-Cell Data

Unlike natural language, gene expression data lacks inherent sequential ordering, necessitating careful tokenization strategies:

  • Gene ranking: Genes are ordered by expression level within each cell, creating a deterministic sequence for transformer processing [53] [1].
  • Value binning: Continuous expression values are discretized into categorical bins, converting regression problems to classification tasks [53] [3].
  • Value projection: Preserves full-resolution expression values through linear projections, maintaining continuous data representation [3].
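The ranking and binning strategies above can be illustrated with a minimal numpy sketch. The gene IDs, sequence length, and equal-width binning scheme are illustrative assumptions; real models use vocabularies of roughly 20,000 genes and model-specific binning.

```python
import numpy as np

def rank_tokenize(expr, gene_ids, max_len=8):
    """Gene ranking (Geneformer-style): order genes by descending expression,
    dropping zero-expression genes, to form a deterministic token sequence."""
    nonzero = expr > 0
    order = np.argsort(-expr[nonzero], kind="stable")
    return gene_ids[nonzero][order][:max_len]

def bin_tokenize(expr, n_bins=5):
    """Value binning (scGPT-style): discretize nonzero expression into
    equal-width bins, turning regression into classification. Bin 0 means
    the gene is not expressed."""
    tokens = np.zeros_like(expr, dtype=int)
    nz = expr > 0
    if nz.any():
        edges = np.linspace(expr[nz].min(), expr[nz].max(), n_bins + 1)
        tokens[nz] = np.clip(np.digitize(expr[nz], edges[1:-1]) + 1, 1, n_bins)
    return tokens

expr = np.array([0.0, 3.2, 1.1, 0.0, 5.7])        # toy expression vector
genes = np.array(["G0", "G1", "G2", "G3", "G4"])  # hypothetical gene IDs
ranked = rank_tokenize(expr, genes)   # highest-expressed gene first
binned = bin_tokenize(expr)           # per-gene bin indices
```

Value projection, by contrast, would skip the discretization entirely and feed `expr` through a learned linear layer.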

Table 1: Overview of Prominent Single-Cell Foundation Models

Model Parameters Training Scale Architecture Type Key Strengths
CellFM 800M 100M human cells Value projection (ERetNet) Cell annotation, perturbation prediction [3]
scGPT Not specified 33M+ cells Decoder-based transformer Multi-omic integration, zero-shot annotation [53] [9]
Geneformer Not specified 30M single-cell transcriptomes Encoder-based transformer Gene-level analyses, representation learning [8] [3]
scBERT Not specified 1.12M human cells Encoder-based transformer Cell type annotation [53] [3]
UCE 650M 36M+ cells Protein language model integration Cross-species molecular diversity [3]

Efficient Fine-Tuning Methodologies

Parameter-Efficient Transfer Learning

Full fine-tuning of scFMs with hundreds of millions of parameters is computationally prohibitive for most research settings. Parameter-efficient methods adapt pre-trained models with minimal tunable parameters:

  • Low-Rank Adaptation (LoRA): Freezes pre-trained weights and injects trainable rank decomposition matrices into transformer layers, significantly reducing trainable parameters [3].
  • Adapter modules: Inserts small, trainable bottleneck layers within transformer blocks while keeping original weights frozen [55].
  • Prefix tuning: Prepends trainable tensors to each transformer block, enabling task adaptation without modifying core parameters [55].
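To see why LoRA is parameter-efficient, consider the sketch below (plain numpy, illustrative dimensions): the pretrained weight stays frozen while a trainable rank-r update is added, and at rank 4 with a 512-wide layer the trainable fraction is about 1.6%, consistent with the <1-2% figure cited above.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, rank, alpha = 512, 4, 16.0   # illustrative sizes, not from any specific scFM

# Frozen pretrained weight: never updated during fine-tuning.
W = rng.standard_normal((d_model, d_model)) * 0.02

# Trainable low-rank factors. B starts at zero, so before any training the
# adapted layer reproduces the pretrained layer exactly.
A = rng.standard_normal((rank, d_model)) * 0.01
B = np.zeros((d_model, rank))

def lora_forward(x):
    """Forward pass with the effective weight W + (alpha / rank) * B @ A."""
    return x @ (W + (alpha / rank) * (B @ A)).T

x = rng.standard_normal((3, d_model))
baseline = x @ W.T          # output of the frozen pretrained layer
adapted = lora_forward(x)   # identical until A and B receive gradient updates

# Only A and B are trainable: ~1.6% of this layer's parameters at rank 4.
fraction_trainable = (A.size + B.size) / W.size
```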

scDCA: Drug-Conditional Adaptation for Perturbation Prediction

The single-cell Drug-Conditional Adapter (scDCA) enables prediction of transcriptional responses to novel chemical compounds by bridging single-cell omics with molecular representations:

Workflow: Molecular Structure → Drug Encoder → Drug Embedding → Drug-Conditional Adapter; Pre-trained scFM → Frozen Weights → Drug-Conditional Adapter → Perturbed Cell State Prediction

Diagram: scDCA workflow for molecular perturbation prediction

This approach conditions adapter parameters on molecular embeddings, enabling the model to predict cellular responses to unseen drugs and even generalize zero-shot to unseen cell lines [55]. The method trains less than 1% of the original foundation model parameters while preserving rich biological representations learned during pre-training [55].
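The conditioning idea can be sketched as follows. This is a schematic numpy illustration only, not the published scDCA implementation: the FiLM-style gating, the zero-initialized up-projection, and the dimensions are assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_adapter, d_drug = 256, 64, 128   # adapter width in the 64-128 range used below

# Stand-in for a frozen scFM hidden state (one cell, one transformer layer).
h = rng.standard_normal(d_model)

# Trainable adapter parameters: bottleneck down/up projections plus a
# conditioning matrix mapping the drug embedding into the bottleneck.
W_down = rng.standard_normal((d_adapter, d_model)) * 0.02
W_up = np.zeros((d_model, d_adapter))   # zero init: adapter starts as the identity
W_cond = rng.standard_normal((d_adapter, d_drug)) * 0.02

def drug_conditional_adapter(h, drug_emb):
    """Residual bottleneck adapter whose hidden activation is scaled by a
    drug-dependent gate (FiLM-style conditioning; an illustrative choice,
    not the exact scDCA mechanism)."""
    z = np.tanh(W_down @ h)
    gate = 1.0 + W_cond @ drug_emb   # the drug embedding modulates the bottleneck
    return h + W_up @ (z * gate)

drug_emb = rng.standard_normal(d_drug)
out = drug_conditional_adapter(h, drug_emb)
```

Only the three adapter matrices would be trained; with these sizes they amount to well under 1% of a typical scFM's parameters, matching the figure reported for scDCA.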

Comparative Performance of Fine-Tuning Strategies

Table 2: Efficiency and Performance of Fine-Tuning Methods

Method Tunable Parameters Key Applications Generalization Capabilities Implementation Complexity
Full Fine-Tuning 100% of original model Task-specific specialization Limited to training distribution High (computationally intensive) [54]
LoRA <1-2% of original model Cell annotation, multi-task learning Moderate Low (standard implementations) [3]
Adapter Modules 1-4% of original model Cross-modal tasks, perturbation prediction Strong cross-modal transfer Medium (architecture-specific) [55]
Prefix Tuning 0.1-0.5% of original model Few-shot learning, rapid prototyping Limited few-shot capability Low to medium [55]

Experimental Protocols

Protocol: Implementing scDCA for Drug Response Prediction

Application: Predicting transcriptional responses to novel drug compounds in unseen cell lines.

Materials and Reagents:

  • Pre-trained scFM (scGPT or comparable model)
  • Single-cell RNA-seq dataset of drug perturbations
  • Molecular compound structures (SMILES representations)
  • Computational environment with GPU acceleration

Procedure:

  • Data Preprocessing:
    • Standardize gene expression matrices using SCANPY or Seurat workflows
    • Filter low-quality cells and genes (minimum 200 genes/cell, 500 cells/gene)
    • Normalize expression values using log(CP10K+1) transformation
    • Tokenize expression data using model-specific strategy (ranking or binning)
  • Molecular Representation:

    • Encode drug compounds using extended-connectivity fingerprints (ECFP) or molecular graph neural networks
    • Project molecular embeddings to dimension compatible with adapter architecture
  • Model Configuration:

    • Load pre-trained scFM weights and freeze all parameters
    • Initialize drug-conditional adapter layers with dimension 64-128
    • Connect drug embedding to adapter conditioning mechanism
  • Training Protocol:

    • Set batch size to 32-64 depending on GPU memory
    • Use AdamW optimizer with learning rate 1e-4 for adapter layers
    • Implement gradient clipping with max norm 1.0
    • Train for 50-100 epochs with early stopping (patience=10)
  • Evaluation:

    • Assess performance on held-out drug compounds
    • Test zero-shot generalization to unseen cell lines
    • Compare predictions against additive baseline models
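The gradient clipping and early stopping logic in the training protocol can be sketched on a toy objective. The quadratic loss and the larger learning rate are assumptions so the toy converges quickly; the protocol's 1e-4 applies to real adapter training.

```python
import numpy as np

def clip_grad(grad, max_norm=1.0):
    """Global-norm gradient clipping (max norm 1.0, per the protocol)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

# Toy quadratic standing in for the adapter loss: L(w) = ||w - w_star||^2.
w_star = np.array([2.0, -3.0])
w = np.zeros(2)
lr = 1e-1          # the protocol uses 1e-4; larger here only for the toy problem
patience = 10      # early-stopping patience, per the protocol

best_loss, wait, history = np.inf, 0, []
for epoch in range(100):
    grad = 2.0 * (w - w_star)
    w -= lr * clip_grad(grad)
    loss = float(np.sum((w - w_star) ** 2))
    history.append(loss)
    if loss < best_loss - 1e-6:     # meaningful improvement: reset patience
        best_loss, wait = loss, 0
    else:                           # stagnation: count toward early stop
        wait += 1
        if wait >= patience:
            break
```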

Troubleshooting:

  • Poor convergence may indicate need for adapter dimension adjustment
  • Overfitting to training compounds can be addressed by reducing adapter capacity or increasing dropout
  • Gradient explosion may require smaller learning rate or stronger gradient clipping

Protocol: Zero-Shot Cell Type Annotation

Application: Annotating novel cell types without task-specific training.

Materials and Reagents:

  • Pre-trained scFM with diverse cellular representation
  • Query single-cell dataset with unknown cell types
  • Reference atlas for annotation transfer (e.g., CELLxGENE)

Procedure:

  • Embedding Generation:
    • Process query cells through pre-trained scFM without fine-tuning
    • Extract cell embeddings from [CLS] token or mean pooling of gene embeddings
    • Reduce dimensionality using UMAP or t-SNE for visualization
  • Reference Mapping:

    • Compute embeddings for reference cell types from annotated atlas
    • Perform k-nearest neighbor search between query and reference embeddings
    • Assign cell type labels based on majority vote of nearest neighbors
  • Validation:

    • Assess cluster purity using known marker genes
    • Calculate silhouette scores for embedding separation
    • Compare against HVG+PCA baseline method
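The reference-mapping step above reduces to a k-nearest-neighbor majority vote in embedding space. A self-contained numpy sketch on toy embeddings (real pipelines would use the scFM embeddings and an annotated atlas):

```python
import numpy as np

def knn_transfer(query, ref, ref_labels, k=5):
    """Assign each query cell the majority label among its k nearest
    reference cells in embedding space (Euclidean distance)."""
    assigned = []
    for q in query:
        dist = np.linalg.norm(ref - q, axis=1)
        neighbors = np.argsort(dist)[:k]
        values, counts = np.unique(ref_labels[neighbors], return_counts=True)
        assigned.append(values[np.argmax(counts)])
    return np.array(assigned)

# Toy 2-D "embeddings": two well-separated reference cell types.
rng = np.random.default_rng(0)
ref = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(5.0, 0.1, (20, 2))])
ref_labels = np.array(["T cell"] * 20 + ["B cell"] * 20)

query = np.array([[0.05, -0.02], [5.1, 4.9]])
pred = knn_transfer(query, ref, ref_labels)
```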

Performance Benchmarking

Quantitative Assessment of Generalization Capabilities

Table 3: Benchmarking Results Across Generalization Tasks

Model/Method Unseen Drug Prediction (MSE↓) Unseen Cell Line Prediction (MSE↓) Zero-Shot Cell Type Annotation (Accuracy↑) Batch Integration (ASW↑)
scDCA (scGPT-based) 0.142 0.156 Not reported Not reported [55]
Additive Baseline 0.152 0.183 Not applicable Not applicable [54]
No Change Baseline 0.241 0.241 Not applicable Not applicable [54]
scGPT Zero-Shot Not reported Not reported 0.384 0.412 [5]
Geneformer Zero-Shot Not reported Not reported 0.295 0.228 [5]
HVG + Harmony Not applicable Not applicable 0.572 0.634 [5]

Independent evaluations demonstrate that while zero-shot performance of scFMs remains suboptimal, efficient fine-tuning strategies enable significant improvements in generalization tasks [5] [54]. The scDCA approach shows particular promise, outperforming additive baselines in predicting responses to novel drugs and achieving state-of-the-art performance in zero-shot generalization to unseen cell lines [55].

The Scientist's Toolkit

Table 4: Essential Research Reagents and Computational Tools

Resource Type Function Access
CZ CELLxGENE [53] Data Platform Unified access to 100M+ annotated single-cell datasets Public
BioLLM [8] Software Framework Standardized APIs for multiple scFMs; benchmarking Open source
scGPT [53] [9] Foundation Model Generative pre-training for multi-omic tasks Open source
Geneformer [8] [3] Foundation Model Rank-based gene embeddings for representation learning Open source
MindSpore [3] AI Framework Distributed training of large-scale models (e.g., CellFM) Open source
DISCO [9] Data Portal Federated analysis of single-cell datasets Public
PyTorch [55] Deep Learning Library Implementation of adapter modules and fine-tuning Open source

Efficient fine-tuning strategies represent a crucial advancement for deploying single-cell foundation models in practical research settings, particularly for drug discovery applications requiring generalization to novel compounds and cellular contexts. While current scFMs show limitations in pure zero-shot scenarios, methods like drug-conditional adapters demonstrate how minimal, targeted parameter updates can unlock robust generalization capabilities. As the field progresses, standardized benchmarking frameworks and shared computational ecosystems will be essential for validating and comparing these approaches across diverse biological contexts. The protocols presented herein provide researchers with practical methodologies to enhance generalization performance while maintaining computational efficiency, accelerating the translation of single-cell foundation models from computational tools to biological discovery engines.

In the rapidly evolving field of artificial intelligence, scaling laws have emerged as fundamental principles predicting model performance based on size and data. For specialized domains like single-cell biology, where foundation models (scFMs) promise to unlock novel biological insights, understanding these scaling relationships is crucial for developing models capable of zero-shot learning—applying knowledge to new tasks without task-specific training. This application note examines the current evidence for emergent scaling laws in single-cell foundation models, providing researchers with quantitative frameworks and standardized protocols for evaluating how model size and data diversity impact zero-shot performance.

Theoretical Foundation of Scaling Laws

Scaling laws describe predictable mathematical relationships between a model's size, training data volume, computational resources, and resulting performance. Recent research has demonstrated that these principles extend beyond large language models to specialized biological domains.

The Densing Law in Model Efficiency

The recently proposed "densing law" reveals that the capability density of models—their performance per parameter unit—grows exponentially over time. Analysis of 51 open-source models shows that maximum capability density doubles approximately every 3.5 months, meaning models require exponentially fewer parameters to achieve equivalent performance over time [56].
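The arithmetic implication of this doubling rate is that the parameter count required for a fixed capability level halves every ~3.5 months. A small illustration (the 800M starting point echoes CellFM's scale; the projection is purely an extrapolation of the stated law, not a measured result):

```python
# Under the densing law, capability density doubles every ~3.5 months, so the
# parameter count needed for a fixed capability level halves on that schedule.
DOUBLING_MONTHS = 3.5

def params_needed(params_today, months_ahead):
    """Projected parameter count for equivalent performance after a delay."""
    return params_today / 2 ** (months_ahead / DOUBLING_MONTHS)

# Illustration: a capability requiring 800M parameters today (CellFM scale)
# would require ~100M after three doubling periods (10.5 months), if the
# trend extrapolates to this domain.
projected = params_needed(800e6, 10.5)
```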

Power-Law Relationships in Medical Data

Scaling law studies on medical event models have confirmed power-law relationships between compute, model size, and pretraining data similar to those observed in the text domain, though with significantly higher optimal token-to-parameter ratios. These relationships enable predictable performance improvements through scaled model architecture and training data [57].

Empirical Evidence in Single-Cell Foundation Models

Rigorous evaluation of single-cell foundation models reveals critical insights into how scaling impacts zero-shot capabilities across diverse biological tasks.

Zero-Shot Performance Benchmarks

Comprehensive benchmarking studies demonstrate variable zero-shot performance across scFMs. Evaluations of Geneformer and scGPT for cell type clustering and batch integration reveal that these models sometimes underperform simpler methods like Highly Variable Genes (HVG) selection or established baselines like Harmony and scVI [5].

Table 1: Zero-Shot Performance Comparison Across Single-Cell Analysis Methods

Method Cell Type Clustering (AvgBIO Score) Batch Integration Data Requirements
scGPT Variable performance; better on PBMC datasets Moderate success on complex biological batches 33M non-cancerous human cells
Geneformer Underperforms HVG across metrics Poor batch correction; batch effects dominate 30M cells
HVG Selection Consistently outperforms foundation models Best overall batch integration scores Minimal
scVI Strong performance on technical variation Excellent technical batch correction Task-specific training
Harmony Comparable to scVI on cell clustering Struggles with biological batch effects Task-specific training

Impact of Pretraining Data Scale and Diversity

The scShift framework demonstrates that scaling up deep identifiable models with diverse training data enables remarkable zero-shot capabilities. Systematic evaluation of over 200 scShift models revealed emergent zero-shot capabilities and a scaling law beyond a transition threshold related to dataset diversity [28].

Table 2: Impact of Pretraining Data Composition on Model Performance

Model Variant Pretraining Data Performance on Blood Data Performance on Cross-Tissue Data
scGPT (Random) No pretraining Poor Poor
scGPT (Kidney) 814,000 kidney cells Moderate Fails on non-kidney datasets
scGPT (Blood) 10.3M blood/bone marrow cells Strong Moderate
scGPT (Human) 33M non-cancerous human cells Strong but slightly underperforms blood variant Moderate
scShift 1M+ cells from 30 studies, 2,538 donors Excellent Strong cross-tissue generalization

Notably, pretraining provides clear benefits, but performance plateaus with extremely large and diverse datasets, suggesting optimal scaling regions exist [5]. Models trained on tissue-specific data show strong performance within their domain but struggle with generalization, while models trained on diverse multi-tissue datasets demonstrate improved cross-tissue capabilities [5] [28].

Experimental Protocols for Evaluating Scaling Relationships

Protocol: Zero-Shot Performance Evaluation for scFMs

Purpose: To quantitatively assess the zero-shot capabilities of single-cell foundation models across standard biological tasks.

Materials:

  • Pretrained foundation models (scGPT, Geneformer, scShift, or others)
  • Evaluation datasets (Tabula Sapiens, Pancreas, PBMC, or custom datasets)
  • Benchmarking pipelines (BioLLM framework recommended)
  • High-performance computing infrastructure

Procedure:

  • Model Acquisition: Obtain pretrained model weights from official repositories
  • Data Preparation: Curate evaluation datasets spanning diverse tissues, technologies, and biological conditions
  • Embedding Extraction: Generate cell embeddings without any model fine-tuning
  • Task Evaluation:
    • Cell Type Clustering: Apply standard clustering algorithms to embeddings and calculate metrics (AvgBIO, ASW)
    • Batch Integration: Quantify batch mixing using established metrics (PCR, LISI)
    • Biological State Prediction: Evaluate accuracy in predicting disease states or perturbations
  • Comparative Analysis: Benchmark against baseline methods (HVG, scVI, Harmony)
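One of the clustering metrics in the Task Evaluation step, normalized mutual information, can be computed from scratch as below. In practice a library routine such as scikit-learn's `normalized_mutual_info_score` would be used; this sketch uses the arithmetic-mean normalization.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information between two partitions:
    NMI = MI / ((H(A) + H(B)) / 2)."""
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    ua, ub = np.unique(a), np.unique(b)
    pa = np.array([(a == x).mean() for x in ua])
    pb = np.array([(b == y).mean() for y in ub])
    mi = 0.0
    for i, x in enumerate(ua):
        for j, y in enumerate(ub):
            pxy = ((a == x) & (b == y)).mean()
            if pxy > 0:
                mi += pxy * np.log(pxy / (pa[i] * pb[j]))
    h_a = -np.sum(pa * np.log(pa))
    h_b = -np.sum(pb * np.log(pb))
    return mi / ((h_a + h_b) / 2) if (h_a + h_b) > 0 else 1.0

perfect = nmi([0, 0, 1, 1], [1, 1, 0, 0])       # same partition, relabeled
independent = nmi([0, 1, 0, 1], [0, 0, 1, 1])   # unrelated partitions
```

A perfect (if relabeled) clustering scores 1.0 and an unrelated one 0.0, which is what makes the metric comparable across models with different cluster label conventions.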

Validation: Reproducibility requires strict adherence to zero-shot conditions without any fine-tuning. The BioLLM framework provides standardized APIs for consistent evaluation across models [8].

Protocol: Scaling Law Analysis for scFM Development

Purpose: To determine optimal model scaling parameters for maximizing zero-shot performance.

Materials:

  • Large-scale single-cell compendiums (CELLxGENE Census)
  • Model training infrastructure
  • Performance evaluation benchmarks

Procedure:

  • Data Scaling Experiments: Train model variants with increasing dataset diversity (5K to 1M+ cells) while holding architecture constant
  • Architecture Scaling Experiments: Train model variants with increasing parameters (10M to 1B+) while holding training data constant
  • Performance Measurement: Evaluate zero-shot capabilities across diverse tasks
  • Curve Fitting: Model the relationship between scale and performance using power-law equations
  • Threshold Identification: Determine critical scaling thresholds where emergent zero-shot capabilities appear
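The curve-fitting step exploits the fact that a power law is linear in log-log coordinates. A minimal sketch on synthetic data (the coefficients and dataset sizes are illustrative, not measured scFM results):

```python
import numpy as np

# Synthetic scaling data: evaluation error following E(N) = a * N^(-b).
a_true, b_true = 5.0, 0.35
n_cells = np.array([5e3, 5e4, 5e5, 5e6, 5e7])   # pretraining set sizes
error = a_true * n_cells ** (-b_true)

# A power law is linear in log-log space: log E = log a - b * log N,
# so an ordinary least-squares line fit recovers the exponent.
slope, intercept = np.polyfit(np.log(n_cells), np.log(error), 1)
b_fit, a_fit = -slope, float(np.exp(intercept))
```

With real benchmark data the fit would be run only above the transition threshold, since the power-law regime by definition does not hold below it.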

Analysis: The scShift framework demonstrated that scaling laws emerge beyond specific thresholds of data diversity and model size, enabling prediction of performance gains from increased scale [28].

Visualization of Scaling Relationships

Scaling dynamics: Model Scale → Emergent Zero-Shot Ability (power law); Data Diversity → Emergent Zero-Shot Ability (critical threshold); Emergent Zero-Shot Ability → Performance Plateau (diminishing returns)

Scaling Law Dynamics: This diagram illustrates the relationship between model scale, data diversity, and the emergence of zero-shot capabilities, highlighting the power-law improvement phase followed by performance plateau.

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Research Reagents and Computational Tools for scFM Research

Resource Type Function Application Context
CELLxGENE Census Data Resource Standardized single-cell data compendium Pretraining and evaluation
BioLLM Framework Software Tool Unified interface for diverse scFMs Model benchmarking and deployment
scGPT Foundation Model 50M parameter transformer for single-cell data Zero-shot cell type annotation
Geneformer Foundation Model 40M parameter transformer with ranked gene inputs Gene-level task performance
scShift Framework Deep identifiable model for biological states Cross-dataset biological comparisons
Harmony Algorithm Batch integration method Performance baseline
HVG Selection Method Highly variable gene selection Simple baseline for evaluation

Emergent scaling laws in single-cell foundation models demonstrate predictable relationships between model size, data diversity, and zero-shot performance. The empirical evidence reveals that while increased scale generally improves performance, critical thresholds exist where capabilities emerge, and diminishing returns eventually set in. For researchers and drug development professionals, these insights provide strategic guidance for developing and deploying scFMs. Future work should establish domain-specific scaling laws and identify optimal scaling regions for particular biological applications to maximize resource efficiency while achieving robust zero-shot performance.

Rigorous Evaluation and Benchmarking of Zero-Shot Model Performance

Establishing Robust Benchmarks for Zero-Shot Evaluation

The emergence of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, promising to unlock profound biological insights from the vast and growing corpus of single-cell RNA sequencing (scRNA-seq) data. These models, pretrained on millions of single-cell transcriptomes, aim to learn universal patterns of gene expression and cellular function [1]. A critical claimed advantage of scFMs is their potential for zero-shot deployment—applying learned representations to new, unseen data without task-specific fine-tuning [5]. This capability is particularly vital for exploratory biological discovery where predefined labels are unavailable, such as identifying novel cell types or states in unannotated datasets [5] [6].

However, recent rigorous evaluations have revealed a significant performance gap between promise and practice. When deployed zero-shot, leading scFMs like Geneformer and scGPT frequently underperform simpler, established methods in fundamental tasks like cell type clustering and batch integration [5] [58] [6]. These findings underscore an urgent need for robust, standardized benchmarking practices specifically designed for the zero-shot setting. This document provides detailed application notes and protocols to help researchers establish such benchmarks, ensuring that the development and evaluation of scFMs are grounded in biologically meaningful and methodologically sound principles.

The Critical Importance of Zero-Shot Evaluation

Evaluating scFMs in a zero-shot context is not merely one option among many; it is an essential test of whether these models have truly learned generalizable biological principles. The core premise of a foundation model is that its pretraining embeds a deep, transferable understanding of the domain—in this case, cellular biology [1].

  • Discovery Contexts: Many real-world research scenarios are exploratory. When analyzing a new dataset, researchers may not know the complete cell type composition or have labels for specific biological conditions. A model requiring fine-tuning for every new application is of limited use in these discovery settings [5].
  • Diagnosing True Learning: Fine-tuning can mask the model's fundamental understanding by allowing it to specialize narrowly to a labeled dataset. This process can be vulnerable to statistical artifacts, where performance improvements stem from exploiting dataset-specific correlations rather than capturing underlying biology [5] [6]. Zero-shot evaluation serves as a stricter test, diagnosing whether the model's pretrained representations are inherently biologically meaningful.
  • Exposing Limitations: Recent studies have demonstrated that the zero-shot performance of scFMs can be inconsistent and unexpectedly poor. For instance, in cell type clustering, the embeddings from Geneformer and scGPT often provide less separation of known cell types than embeddings from simpler methods like Highly Variable Genes (HVG) selection, Harmony, or scVI [5] [6]. This performance gap remains even when evaluating datasets that were partially included in the model's own pretraining corpus, indicating a weak connection between the pretraining objective and the downstream biological task [5].

Benchmarking Framework and Core Tasks

A robust benchmark for zero-shot scFM evaluation should encompass multiple complementary tasks that reflect common and critical analysis workflows in single-cell biology. The framework below outlines the primary tasks and their associated objectives and metrics.

Table 1: Core Tasks for Zero-Shot Benchmarking of scFMs

Task Category Biological Objective Key Evaluation Metrics What a Successful Result Indicates
Cell Type Clustering Assess whether embeddings group cells by biological function/identity rather than technical artifacts. Average BIO (AvgBIO) score, Average Silhouette Width (ASW), Normalized Mutual Information (NMI) [5] [17] [59] The model captures fundamental definitions of cell identity.
Batch Integration Evaluate the removal of technical batch effects while preserving meaningful biological variation. Principal Component Regression (PCR) score, batch mixing scores, cell-type ASW [5] [60] The model disentangles technical noise from biological signal.
Biological Conservation Quantify how well the embeddings preserve both inter- and intra-cell-type biological structures. Novel metrics like scGraph-OntoRWR and Lowest Common Ancestor Distance (LCAD) that leverage cell ontology knowledge [17] The model aligns with established biological knowledge and captures subtle cellular states.

The following workflow diagram outlines the key stages in executing a zero-shot benchmarking pipeline.

Workflow: Start (Input Unseen Single-Cell Dataset) → 1. Dataset Preparation and Preprocessing → 2. Generate Cell Embeddings using scFM (Zero-Shot) → 3. Perform Downstream Task (e.g., Clustering, Integration) → 4. Calculate Performance Metrics → 5. Compare Against Baseline Methods → Output (Benchmark Scores & Rankings)

Detailed Experimental Protocols

Protocol 1: Zero-Shot Cell Type Clustering

Objective: To evaluate the intrinsic ability of scFM embeddings to separate known cell types without any fine-tuning.

Materials:

  • A labeled scRNA-seq dataset not seen during the model's pretraining (e.g., a subset of AIDA v2 from CELLxGENE) [17].
  • Pretrained scFM (e.g., Geneformer, scGPT, scFoundation).
  • Baseline methods for comparison (e.g., HVG selection + PCA, scVI, Harmony).

Procedure:

  • Data Preprocessing: Prepare the dataset according to the scFM's input requirements. This typically includes normalization and gene filtering. Crucially, do not use the cell type labels in this step.
  • Embedding Generation: Pass the preprocessed dataset through the scFM to extract a cell embedding vector for each cell.
  • Dimensionality Reduction & Clustering: Apply a standard clustering algorithm (e.g., Leiden, K-means) directly to the cell embeddings.
  • Performance Calculation: Compare the cluster assignments to the ground-truth cell type labels using metrics from Table 1.
  • Benchmarking: Repeat steps 2-4 for all baseline methods and compare the scores.
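Step 4's silhouette-based comparison can be computed directly from the embeddings. Below is a from-scratch sketch of mean silhouette width (ASW) on toy clusters; in practice a library routine such as scikit-learn's `silhouette_score` is the sensible choice.

```python
import numpy as np

def mean_silhouette(X, labels):
    """Mean silhouette width: for each cell, (b - a) / max(a, b), where a is
    the mean distance to its own cluster and b the mean distance to the
    nearest other cluster."""
    X, labels = np.asarray(X), np.asarray(labels)
    n = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = []
    for i in range(n):
        own = (labels == labels[i]) & (np.arange(n) != i)
        a = D[i, own].mean()
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Two tight, well-separated clusters -> silhouette close to 1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, (10, 2)), rng.normal(3.0, 0.05, (10, 2))])
labels = np.array([0] * 10 + [1] * 10)
asw = mean_silhouette(X, labels)
```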

Expected Outcome: A table of clustering metrics for the scFM and all baselines. A robust scFM should perform comparably to or better than established methods.

Table 2: Example Zero-Shot Clustering Results (AvgBIO Score) on a Pancreas Dataset

Method AvgBIO Score Notes
HVG + PCA 0.75 Simple, powerful baseline [5]
scVI 0.72 Deep generative model baseline [5]
Harmony 0.70 Integration-focused baseline [5]
scGPT (Zero-Shot) 0.65 Single-cell foundation model [5]
Geneformer (Zero-Shot) 0.58 Single-cell foundation model [5]

Protocol 2: Zero-Shot Batch Integration

Objective: To assess the model's capacity to generate embeddings where cells from the same type co-localize across different experimental batches or technologies.

Materials:

  • A dataset with known, strong batch effects and cell type labels (e.g., a composite Pancreas dataset from multiple studies) [5].
  • Same scFMs and baselines as in Protocol 1.

Procedure:

  • Data Preprocessing: Similar to Protocol 1.
  • Embedding Generation: Obtain cell embeddings from the scFM and baseline methods.
  • Visualization and Quantitative Analysis:
    • Generate UMAP plots colored by batch and by cell type. Qualitatively, a good integration will show mixing by batch and separation by cell type.
    • Quantitatively, use the PCR score (lower is better) to measure the amount of variance explained by batch after integration [5]. Simultaneously, use the cell-type ASW (higher is better) to ensure biological information was preserved.
  • Interpretation: A model performing well zero-shot will have a low PCR score and a high cell-type ASW.
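A simplified, illustrative reduction of the PCR idea (not the exact scIB definition) can be computed by regressing each principal component of the embedding on one-hot batch labels and averaging R² weighted by PC variance; the synthetic data below makes the expected behavior visible.

```python
import numpy as np

def pcr_batch_variance(emb, batch, n_pcs=10):
    """Simplified PCR-style score: fraction of embedding variance explained
    by the batch covariate. Lower is better after integration."""
    emb = emb - emb.mean(axis=0)
    U, S, _ = np.linalg.svd(emb, full_matrices=False)
    pcs = U[:, :n_pcs] * S[:n_pcs]
    weights = S[:n_pcs] ** 2
    onehot = (batch[:, None] == np.unique(batch)[None, :]).astype(float)
    r2 = []
    for j in range(pcs.shape[1]):
        y = pcs[:, j]
        beta, *_ = np.linalg.lstsq(onehot, y, rcond=None)
        resid = y - onehot @ beta
        r2.append(1.0 - resid.var() / y.var())
    return float(np.average(r2, weights=weights))

rng = np.random.default_rng(0)
batch = np.repeat(np.array([0, 1]), 50)
bio = rng.standard_normal((100, 5))                  # shared biological signal
confounded = np.hstack([bio, batch[:, None] * 4.0])  # one axis dominated by batch
score_confounded = pcr_batch_variance(confounded, batch)
score_clean = pcr_batch_variance(bio, batch)         # batch-free embedding
```

An embedding still carrying batch structure scores high, while a batch-free one scores near zero; a complete evaluation would pair this with the cell-type ASW to confirm biology was preserved.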

Expected Outcome: Visualization plots and quantitative metrics that reveal whether batch effects have been removed without loss of biological signal. Studies have shown that while scGPT can show promise on complex batches, Geneformer often struggles significantly, with its embeddings sometimes being dominated by batch information [5].

The Scientist's Toolkit: Essential Reagents & Materials

Beyond conceptual frameworks, practical benchmarking requires a set of standardized computational "reagents."

Table 3: Key Research Reagent Solutions for Zero-Shot Benchmarking

Tool / Resource Function / Description Role in Benchmarking
Curated Benchmarking Datasets (e.g., AIDA v2, HLCA) [60] [17] High-quality, diverse scRNA-seq datasets with reliable annotations. Provides the ground-truth "test set" for evaluating model generalizability and preventing data leakage.
Baseline Methods (e.g., HVG, scVI, Harmony) [5] [60] Established, often simpler, computational methods for single-cell analysis. Serves as a critical performance baseline; an scFM should aim to outperform these.
Extended Benchmarking Metrics (e.g., scGraph-OntoRWR, LCAD) [17] Novel metrics that incorporate prior biological knowledge from cell ontologies. Moves beyond statistical clustering metrics to evaluate the biological plausibility of the model's outputs.
Unified Evaluation Pipelines (e.g., scIB-E [60]) Software frameworks that standardize scoring and comparison across methods. Ensures reproducibility and fair comparison by applying the same preprocessing and metric calculations to all models.

Visualizing the Benchmarking Logic

The following diagram synthesizes the logical relationships between the pretraining goals of scFMs, the requirements for biological discovery, and the corresponding benchmarking tasks that bridge the two.

Benchmarking logic: rigorous zero-shot evaluation validates the scFM pretraining goal (learn generalizable biology) and is demanded by the discovery requirement (zero-shot applicability on unseen data); it comprises three core evaluation tasks: cell type clustering, batch effect integration, and biological conservation.

Establishing robust benchmarks for the zero-shot evaluation of single-cell foundation models is a cornerstone for their responsible development and application. The protocols and frameworks outlined here provide a path toward more rigorous, biologically grounded validation. The consistent finding that simpler methods can outperform complex foundation models in a zero-shot setting is a powerful reminder that model scale and pretraining data volume are not substitutes for learning meaningful, transferable biology [5] [6].

Future progress will depend on the community's adoption of these rigorous benchmarking practices. This includes the development of more sophisticated metrics, like the ontology-aware scGraph-OntoRWR [17], and a commitment to evaluating models on challenging, clinically relevant tasks such as cancer cell identification and drug sensitivity prediction [17]. By adhering to these principles, the field can ensure that single-cell foundation models evolve from promising tools into reliable engines of biological discovery.

Single-cell foundation models (scFMs), such as scGPT and Geneformer, represent a transformative approach in computational biology, trained on millions of single-cell gene expression profiles to learn fundamental biological principles [1]. These models promise to automate critical tasks like cell type identification and gene expression prediction. However, their true utility for biological discovery hinges on effective zero-shot learning—the ability to make accurate predictions on new, unseen data without any task-specific fine-tuning [5] [6]. This capability is particularly vital in exploratory research where predefined labels are unavailable, making fine-tuning impossible [5].

Despite their theoretical promise, recent rigorous evaluations reveal that these foundation models often underperform simpler, established methods such as scVI and Harmony when applied zero-shot to common analytical tasks [5] [6]. This application note provides a detailed, evidence-based comparison of these model classes, summarizing quantitative performance benchmarks and providing standardized protocols for their evaluation. The findings underscore the importance of critical benchmarking in guiding method selection and development.

Quantitative Performance Benchmarking

Independent studies have systematically evaluated the zero-shot performance of scGPT and Geneformer against traditional methods across key single-cell analysis tasks. The tables below summarize these quantitative results.

Table 1: Zero-shot Performance in Cell Type Clustering (AvgBIO Score) [5]

Method Pancreas PBMC (12k) Immune Tabula Sapiens
HVG (Baseline) 0.65 0.61 0.59 0.63
Harmony 0.68 0.64 0.62 0.66
scVI 0.70 0.62 0.60 0.65
scGPT 0.58 0.66 0.55 0.59
Geneformer 0.51 0.53 0.50 0.52

A higher AvgBIO score indicates better cell type separation. scGPT and Geneformer are outperformed by simpler methods in most datasets.

Table 2: Performance in Batch Integration (Batch Mixing Score) [5]

Method Pancreas PBMC Immune Tabula Sapiens
HVG (Baseline) 0.85 0.88 0.82 0.84
scVI 0.80 0.82 0.75 0.79
Harmony 0.78 0.81 0.80 0.77
scGPT 0.72 0.79 0.78 0.76
Geneformer 0.45 0.48 0.42 0.44

A higher score indicates better mixing of cells from different batches while preserving biological variation. Geneformer shows significant limitations.

Table 3: Performance in Genetic Perturbation Effect Prediction (L2 Distance) [54]

Model Double Perturbation (Norman et al. data) Unseen Single Perturbation (Replogle et al. data)
Additive Baseline ~0.75 -
No-Change Baseline ~0.95 ~0.90
Linear Model - ~0.92
scGPT ~1.10 ~1.05
Geneformer* ~1.25 ~1.15
GEARS ~1.05 ~0.98

A lower L2 distance indicates more accurate prediction of gene expression changes after perturbation. Simple baselines outperform foundation models. *Geneformer was repurposed with a linear decoder for this task [54].

Experimental Protocols for Benchmarking

To ensure reproducible and objective evaluation of single-cell foundation models against traditional methods, the following detailed protocols are recommended.

Protocol 1: Zero-Shot Cell Type Clustering

Objective: To evaluate the quality of cell embeddings generated by a model for separating known cell types without any fine-tuning.

Materials:

  • Processed single-cell RNA-seq dataset (e.g., from Pancreas or PBMC studies) with held-out cell type labels.
  • Pretrained models (scGPT, Geneformer).
  • Traditional methods for comparison (scVI, Harmony).
  • Computing environment with appropriate Python libraries (e.g., scikit-learn, scanpy).

Procedure:

  • Data Preprocessing: Load the dataset. If required by the model, preprocess the data according to its specifications (e.g., gene filtering, normalization). Do not provide cell type labels to the models.
  • Generate Embeddings:
    • For scGPT and Geneformer, extract the cell embeddings from the model's output layer in a zero-shot manner, following the authors' inference code; this typically involves a single forward pass of the cell's expression profile through the frozen model.
    • For scVI, train the model on the dataset (without labels) and then obtain the latent representation.
    • For Harmony, run the algorithm on the PCA space of the dataset to obtain integrated embeddings.
  • Clustering: Apply a standard clustering algorithm (e.g., Leiden clustering) on the obtained embeddings from all methods. Use the same clustering resolution across all methods for a fair comparison.
  • Evaluation: Calculate the AvgBIO score or Average Silhouette Width (ASW) by comparing the resulting clusters to the ground-truth cell type labels. A higher score indicates better performance.
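The evaluation steps above can be sketched in a few lines of Python. This is a minimal, self-contained illustration: k-means stands in for Leiden clustering (which requires a graph/community-detection library), ARI and the label-based silhouette width approximate the AvgBIO components, and the embedding and labels are synthetic.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

def evaluate_embedding(emb, labels, n_clusters=None, seed=0):
    """Cluster an embedding and score it against ground-truth cell types.

    k-means is used here as a lightweight stand-in for Leiden clustering;
    ARI measures cluster/label agreement, ASW measures label separation.
    """
    if n_clusters is None:
        n_clusters = len(set(labels))
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(emb)
    return {
        "ARI": adjusted_rand_score(labels, pred),
        "ASW": silhouette_score(emb, labels),  # silhouette w.r.t. true labels
    }

# Toy example: two well-separated synthetic "cell types"
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.1, (50, 8)), rng.normal(3, 0.1, (50, 8))])
labels = np.array([0] * 50 + [1] * 50)
scores = evaluate_embedding(emb, labels)
```

In a real benchmark the same function would be applied, with identical settings, to the embeddings from every method under comparison.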

Protocol 2: Batch Integration Assessment

Objective: To assess a model's ability to integrate data from multiple batches (e.g., different experiments, donors, or technologies) while preserving biological variance.

Materials:

  • A single-cell dataset with known batch effects and biological ground truth (e.g., the Pancreas benchmark with 5 batches [5] [61]).
  • The same set of models and computing environment as in Protocol 1.

Procedure:

  • Data Preparation: Load a multi-batch dataset. Ensure the batch identities and biological labels (e.g., cell types) are known.
  • Generate Integrated Embeddings: Apply each model to the entire dataset to generate a joint embedding.
    • For foundation models, this is a zero-shot pass.
    • For scVI and Harmony, use their standard integration workflows.
  • Dimensionality Reduction and Visualization: Generate UMAP plots from the embeddings for qualitative assessment. Color the plots by batch and by cell type.
  • Quantitative Evaluation: Calculate two key metrics:
    • Batch Mixing Score (e.g., iLISI): Measures how well cells from different batches are mixed. A higher score is better.
    • Biological Conservation Score (e.g., cell-type ASW or cLISI): Measures how well biological cell type identity is preserved. A higher score is better.
    • Compare the scores of scGPT and Geneformer against those of scVI and Harmony.
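A simplified version of this quantitative evaluation can be sketched with silhouette-based scores. The scIB-style rescalings below are illustrative assumptions and not the exact iLISI/cLISI metrics named in the protocol; the toy data is synthetic, with batches deliberately assigned at random so they are perfectly mixed.

```python
import numpy as np
from sklearn.metrics import silhouette_score

def integration_scores(emb, batch, cell_type):
    """Simplified scIB-style scores: batch mixing = 1 - |batch ASW|
    (higher = batches better mixed), bio conservation = cell-type ASW
    rescaled to [0, 1]."""
    batch_asw = silhouette_score(emb, batch)
    bio_asw = (silhouette_score(emb, cell_type) + 1) / 2
    return {"batch_mixing": 1 - abs(batch_asw), "bio_conservation": bio_asw}

# Toy data: cell types well separated, batch labels independent of position
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(0, 0.2, (60, 5)), rng.normal(4, 0.2, (60, 5))])
cell_type = np.array([0] * 60 + [1] * 60)
batch = rng.integers(0, 2, size=120)  # random -> ideally mixed batches
scores = integration_scores(emb, batch, cell_type)
```

An embedding dominated by batch effects would instead yield a high |batch ASW| and hence a low mixing score.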

Protocol 3: Perturbation Effect Prediction

Objective: To benchmark a model's ability to predict transcriptome-wide changes resulting from genetic perturbations.

Materials:

  • Perturbation dataset (e.g., Norman et al. or Replogle et al. [54]).
  • Pretrained foundation models (scGPT, scFoundation) and perturbation models (GEARS).
  • Code for the simple additive and no-change baselines.

Procedure:

  • Data Splitting: For a dataset with single- and double-gene perturbations, split the double perturbations into training and held-out test sets.
  • Model Fine-tuning/Fitting:
    • Fine-tune the foundation models and GEARS on the training set of perturbations.
    • For the additive baseline, calculate the sum of the log-fold changes for the two single perturbations that constitute each double perturbation.
    • For the no-change baseline, simply use the control (unperturbed) expression profile as the prediction for all perturbations.
  • Prediction and Evaluation:
    • Task all models to predict the expression of all genes for the held-out double perturbations.
    • Calculate the L2 distance between the predicted and observed expression profiles for the top 1,000 highly variable genes.
    • The model with the lowest L2 distance is the most accurate.
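The two simple baselines and the L2 evaluation are easy to reproduce. The sketch below uses simulated log-expression profiles (all values hypothetical) to show why the additive baseline is hard to beat when double-perturbation effects are nearly additive.

```python
import numpy as np

def l2_distance(pred, obs):
    """Mean L2 distance between predicted and observed expression profiles."""
    return float(np.mean(np.linalg.norm(pred - obs, axis=1)))

def additive_baseline(ctrl, lfc_a, lfc_b):
    """Additive baseline: control log-expression plus the sum of the two
    single-perturbation log-fold changes."""
    return ctrl + lfc_a + lfc_b

# Simulated profiles over 1,000 "highly variable genes" (hypothetical values)
rng = np.random.default_rng(2)
n_genes = 1000
ctrl = rng.normal(1.0, 0.3, n_genes)        # control log-expression
lfc_a = rng.normal(0.0, 0.2, n_genes)       # single-perturbation A effect
lfc_b = rng.normal(0.0, 0.2, n_genes)       # single-perturbation B effect
# Double perturbation whose true effect is nearly additive plus noise
obs_double = ctrl + lfc_a + lfc_b + rng.normal(0, 0.05, n_genes)

pred_additive = additive_baseline(ctrl, lfc_a, lfc_b)
pred_no_change = ctrl  # no-change baseline: predict the control profile

err_additive = l2_distance(pred_additive[None, :], obs_double[None, :])
err_no_change = l2_distance(pred_no_change[None, :], obs_double[None, :])
```

Any model worth deploying should achieve a lower L2 distance than both of these baselines on held-out perturbations.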

Visualizing Workflows and Relationships

The following diagrams illustrate the core architectures and benchmark workflows.

Model Architectures and Zero-Shot Principle

Workflow: massive single-cell datasets are used for self-supervised model pretraining, which generates the model's embedding space; new data is then passed through the pretrained model in a zero-shot inference step, and the resulting embeddings serve as input for downstream tasks.

Single-Cell Foundation Model Workflow

Benchmarking Protocol for Cell Clustering

Workflow: a single input dataset is processed by each method under comparison (e.g., scGPT, scVI, Harmony) to produce its own set of cell embeddings; all embeddings are clustered with the same algorithm (e.g., Leiden), and the resulting clusters are scored against the ground-truth labels using a performance metric such as AvgBIO.

Benchmarking Workflow for Cell Clustering

Table 4: Key Computational Tools and Datasets for Evaluation

Item Name Type Function in Evaluation Source/Availability
scGPT Foundation Model Provides zero-shot cell embeddings; can be fine-tuned for tasks like perturbation prediction. GitHub Repository
Geneformer Foundation Model Provides zero-shot cell embeddings; repurposable for downstream tasks with a decoder. Hugging Face Hub
scVI Traditional Method (Deep Generative Model) Generates latent representations of cells for clustering and integration, correcting for batch effects. scvi-tools
Harmony Traditional Method (Integration Algorithm) Integrates single-cell data across multiple batches by correcting the PCA embedding space. CRAN R package
Pancreas Benchmark Dataset Dataset A standardized dataset with 5 batches; used for evaluating batch integration and cell type clustering. Download from GitHub
Norman et al. Perturbation Data Dataset Contains single and double gene perturbation profiles in K562 cells; used for benchmarking prediction accuracy. AddGene
scIB Metrics Software Library A standardized Python module providing metrics for benchmarking batch integration and bio-conservation. scIB GitHub
BioLLM Framework Software Framework A unified interface for integrating and evaluating different single-cell foundation models. GitHub Repository

The current generation of single-cell foundation models, scGPT and Geneformer, demonstrates clear potential but faces significant reliability challenges in zero-shot settings. Quantitative evidence shows that they are often outperformed by simpler, established methods like scVI and Harmony on tasks including cell type clustering and batch integration [5] [6]. For predicting genetic perturbation effects, they have not yet surpassed deliberately simple linear baselines [54].

These findings caution against the unprincipled adoption of scFMs for discovery tasks where fine-tuning is not feasible. Researchers should carefully evaluate their performance against traditional baselines for their specific dataset. Future development must prioritize robust zero-shot evaluation to ensure these models genuinely learn transferable biological principles, rather than relying on fine-tuning to achieve performance. Frameworks like BioLLM [8], which standardize model integration and evaluation, will be crucial in driving this progress and ultimately fulfilling the promise of foundation models in single-cell biology.

Single-cell foundation models (scFMs) have emerged as powerful tools for integrating heterogeneous datasets and exploring biological systems, trained on millions of single-cell transcriptomes to learn universal patterns in gene expression [4] [1]. Despite their promising performance on various computational tasks, a critical question remains: to what extent do these models capture biologically meaningful relationships rather than merely optimizing statistical objectives? Traditional evaluation metrics often assess computational performance like clustering accuracy or batch integration efficiency but fail to quantify whether the model's internal representations align with established biological knowledge [4]. This limitation is particularly problematic for zero-shot learning scenarios where models are applied to new data without further training, as biological discovery often involves exploring unlabeled data where ground truth is unknown [62] [5].

To address this gap, scGraph-OntoRWR has been introduced as a novel biology-driven metric that directly evaluates the biological relevance of scFM embeddings [4] [63]. This metric moves beyond purely computational assessments by measuring the consistency between the cell-type relationships learned by foundation models and the hierarchical knowledge formalized in the Cell Ontology [4] [64]. By leveraging the rich semantic structure of biological ontologies, scGraph-OntoRWR provides a rigorous framework for determining whether scFMs are learning the fundamental principles of cellular biology or merely detecting technical patterns in the data.

Background: Biological Ontologies as Ground Truth

The Cell Ontology as a Knowledge Framework

The Cell Ontology (CL) is a controlled, structured vocabulary that organizes cell types into a hierarchical graph based on the "is_a" relation and other ontological relationships [64]. This framework captures established biological knowledge about cell types, their developmental lineages, and their functional characteristics. Each cell type in the ontology is represented as a node, with edges representing relationships such as "is_a" (denoting classification) and "part_of" (denoting composition) [63] [64]. The CL currently contains over 2,300 cell types organized into a logical hierarchy, providing a comprehensive ground-truth network for evaluating biological relationships [64].

The Challenge of Evaluating Biological Relevance

Single-cell RNA sequencing data presents unique challenges for analysis, characterized by high dimensionality, high sparsity, and low signal-to-noise ratio [4] [1]. While scFMs can demonstrate strong performance on tasks like cell type annotation and batch integration, previous benchmarking studies have revealed that their zero-shot embeddings do not consistently outperform simpler methods like highly variable genes (HVG) selection or established algorithms such as Harmony and scVI [62] [5]. This discrepancy between model complexity and practical performance underscores the need for metrics that can assess whether these models are learning biologically meaningful representations versus merely exploiting statistical patterns [5] [6].

The scGraph-OntoRWR Framework: Principles and Implementation

Theoretical Foundation

The scGraph-OntoRWR metric is grounded in the hypothesis that a biologically meaningful embedding space should position cell types according to their established ontological relationships [4]. Specifically, cell types that are closely related in the Cell Ontology graph (e.g., different subtypes of T cells) should be positioned closer together in the model's latent space compared to distantly related cell types (e.g., T cells versus neurons) [4] [64]. The metric operates on the "guilt-by-association" principle, which states that biologically similar cell types should have similar gene expression profiles and therefore occupy neighboring regions in the embedding space [64].

Algorithmic Workflow

The scGraph-OntoRWR implementation comprises four key stages that transform raw model embeddings into a quantitative measure of biological consistency:

  • Cell-Cell Graph Construction: A k-nearest neighbor (k-NN) graph is constructed from the scFM's cell embeddings, where nodes represent cells and edges connect each cell to its k most similar neighbors based on cosine similarity in the embedding space.

  • Random Walk with Restart (RWR) Execution: For each cell in the graph, multiple random walks are performed with a restart probability, generating a visitation frequency distribution that captures the local graph topology around each cell.

  • Ontology Consistency Measurement: The similarity between the graph-derived RWR distributions and the Cell Ontology structure is computed, measuring how well the embedding-preserved relationships align with established biological knowledge.

  • Score Calculation: A final scGraph-OntoRWR score is computed by aggregating the node-level consistency measurements, with higher scores indicating better alignment between the model's representations and biological reality.

The complete scGraph-OntoRWR workflow proceeds in four stages:

  • Stage 1 (Graph Construction): Compute pairwise cosine similarities from the scFM cell embeddings and construct a k-NN graph (k = 15 recommended).
  • Stage 2 (Random Walk Analysis): Initialize RWR parameters (restart probability = 0.7) and execute random walks (1,000 walks per cell) to generate visitation frequency distributions.
  • Stage 3 (Ontology Comparison): Compute the similarity between the RWR distributions and the Cell Ontology graph structure.
  • Stage 4 (Score Calculation): Aggregate the node-level consistency measurements into the final scGraph-OntoRWR score.

Experimental Protocol for scGraph-OntoRWR Evaluation

Prerequisites and Input Requirements

To implement scGraph-OntoRWR evaluation for scFMs, researchers must prepare the following inputs:

  • Single-cell foundation model embeddings: A matrix of shape (n_cells, n_dimensions) containing the latent representations of cells generated by the scFM in zero-shot mode (without fine-tuning).
  • Cell type annotations: A vector of length n_cells containing the ground truth cell type labels for each cell.
  • Cell Ontology graph: A graph structure representing the Cell Ontology, with nodes corresponding to cell types and edges representing ontological relationships.

Step-by-Step Implementation

  • Embedding Extraction: Generate cell embeddings using the target scFM in zero-shot mode. For models like scGPT and Geneformer, this involves forward propagation of the gene expression matrix through the pretrained model without updating parameters [62] [5].

  • Parameter Initialization: Set the key parameters for the scGraph-OntoRWR algorithm:

    • k = 15 (number of neighbors for k-NN graph construction)
    • restart probability = 0.7 (for RWR)
    • number of walks = 1000 (per cell)
    • walk length = 40 (steps per random walk)
  • Graph Construction: Build a k-NN graph from the embeddings using cosine similarity as the distance metric. The resulting graph should have n_cells nodes with each node connected to its k nearest neighbors.

  • Random Walk Execution: Perform RWR on the k-NN graph. For each cell, initiate multiple random walks that explore the local graph neighborhood, with a probability of restarting at the original cell at each step.

  • Ontology Mapping: Map the cell type annotations to corresponding Cell Ontology terms. This may require terminology harmonization using natural language processing if the annotation labels don't exactly match ontology terms [64].

  • Similarity Computation: Calculate the similarity between the RWR visitation distributions and the ontological relationships. This involves measuring the correlation between the graph-derived similarities and the ontology-derived similarities for pairs of cell types.

  • Score Aggregation: Compute the final scGraph-OntoRWR score by averaging the consistency measurements across all cells. The score ranges from 0 to 1, with higher values indicating better biological consistency.
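Steps 3 and 4 can be condensed into a compact sketch. For self-containment, the sampled random walks are replaced here by power iteration on the RWR fixed-point equation (which converges to the same stationary visitation distribution), and the two-cluster embedding is synthetic; the ontology-comparison stage is omitted.

```python
import numpy as np

def knn_transition_matrix(emb, k=15):
    """Row-stochastic transition matrix of a cosine-similarity k-NN graph."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-edges
    n = emb.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argpartition(sim[i], -k)[-k:]  # k most similar cells
        W[i, nbrs] = 1.0
    return W / W.sum(axis=1, keepdims=True)

def rwr(W, seed, restart=0.7, n_iter=50):
    """Stationary visitation distribution of a random walk with restart,
    computed by power iteration rather than sampled walks."""
    n = W.shape[0]
    e = np.zeros(n); e[seed] = 1.0
    p = e.copy()
    for _ in range(n_iter):
        p = (1 - restart) * (W.T @ p) + restart * e
    return p

# Two synthetic cell populations with distinct expression directions
rng = np.random.default_rng(3)
mean_a = np.zeros(10); mean_a[0] = 5.0
mean_b = np.zeros(10); mean_b[1] = 5.0
emb = np.vstack([rng.normal(mean_a, 0.3, (30, 10)),
                 rng.normal(mean_b, 0.3, (30, 10))])
W = knn_transition_matrix(emb, k=5)
p = rwr(W, seed=0)
mass_same = p[:30].sum()  # visitation mass within the seed's own population
```

The visitation mass concentrating within the seed cell's own population is exactly the local structure that is subsequently compared against the Cell Ontology.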

Interpretation Guidelines

When interpreting scGraph-OntoRWR results, consider the following guidelines:

  • Scores above 0.7 indicate strong alignment with biological knowledge
  • Scores between 0.5 and 0.7 suggest moderate biological consistency
  • Scores below 0.5 indicate poor capture of established biological relationships
  • Always compare scores against baseline methods (e.g., HVG, scVI, Harmony) for context

Benchmarking Results and Comparative Performance

scFM Performance on Biological Relevance

Comprehensive benchmarking of six major scFMs (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) using scGraph-OntoRWR has revealed significant differences in their ability to capture biologically meaningful relationships [4]. The following table summarizes the quantitative performance of these models across multiple biological tasks, with scGraph-OntoRWR providing crucial insights into their biological relevance:

Table 1: Performance Comparison of Single-Cell Foundation Models Across Biological Tasks

Model Batch Integration Rank Cell Type Annotation Rank Cancer ID Rank Drug Sensitivity Rank scGraph-OntoRWR Score Overall Biological Relevance
Geneformer 2 3 1 2 0.72 High
scGPT 3 2 3 3 0.68 Medium-High
UCE 1 4 4 4 0.63 Medium
scFoundation 4 1 2 1 0.75 High
LangCell 5 5 5 5 0.61 Medium
scCello 6 6 6 6 0.58 Medium-Low
Traditional ML 7 7 7 7 0.49 Low
HVG Selection 8 8 8 8 0.45 Low

Key Findings from Biological Evaluation

The implementation of scGraph-OntoRWR in large-scale benchmarking has yielded several critical insights:

  • No single scFM dominates across all tasks: Each model exhibits strengths in different biological applications, with scFoundation showing particularly strong performance in capturing biological relationships [4].

  • Pretraining improves biological consistency: Models with larger and more diverse pretraining datasets generally achieve higher scGraph-OntoRWR scores, confirming the value of broad pretraining for biological relevance [4].

  • Zero-shot limitations are evident: Even the best-performing scFMs show room for improvement, with scGraph-OntoRWR scores typically ranging from 0.6-0.75, indicating that current models do not fully capture the complexity of biological systems [62] [5].

  • Simple baselines remain competitive: Surprisingly, traditional methods like highly variable genes (HVG) selection sometimes outperform foundation models in specific tasks, highlighting that biological relevance does not necessarily correlate with model complexity [5] [6].

The Scientist's Toolkit: Essential Research Reagents

Successful implementation of scGraph-OntoRWR evaluation requires both computational tools and biological resources. The following table details the essential components of the evaluation framework:

Table 2: Essential Research Reagents and Resources for scGraph-OntoRWR Implementation

Reagent/Resource Function Biological Significance Example Sources
Gene Embeddings Numerical representations of genes in latent space Capture functional similarities between genes based on co-expression patterns scGPT, Geneformer
Cell Ontologies Structured vocabularies defining cell types and relationships Provide biological ground truth for evaluating model relevance OBO Foundry, Cell Ontology
Benchmark Datasets Curated single-cell data with high-quality annotations Enable standardized evaluation across different models CELLxGENE, Tabula Sapiens
Attention Mechanisms Model components that identify important relationships Reveal gene-gene interactions learned from data Transformer architectures
GO Term Annotations Gene Ontology functional classifications Serve as biological prior knowledge for validation Gene Ontology Consortium

Application Notes for Zero-Shot Learning Research

Integration with Zero-Shot Evaluation Frameworks

scGraph-OntoRWR is particularly valuable in zero-shot learning scenarios, where models must generalize to new data without fine-tuning [62] [5]. When integrated into comprehensive evaluation pipelines, it helps researchers:

  • Identify models that genuinely understand biological principles versus those that merely memorize training data patterns
  • Select the most appropriate scFM for discovery-driven research where labeled data is unavailable
  • Diagnose specific limitations in model representations that can guide architectural improvements

Protocol for Cross-Dataset Validation

To ensure robust evaluation of biological relevance, implement the following cross-validation protocol:

  • Dataset Selection: Choose evaluation datasets that cover diverse tissues, species, and experimental conditions to assess generalizability.

  • Baseline Comparison: Always include established methods (Harmony, scVI, HVG selection) as benchmarks for scGraph-OntoRWR scores.

  • Ablation Studies: Systematically vary pretraining data composition and model architecture to identify factors that most significantly impact biological relevance.

  • Statistical Testing: Perform significance testing on scGraph-OntoRWR score differences to ensure observed variations in biological consistency are meaningful.
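For the statistical-testing step, a paired sign-flip permutation test avoids distributional assumptions about the scores. The scores below are hypothetical illustrations, not published results.

```python
import numpy as np

def paired_sign_flip_test(scores_a, scores_b, n_perm=10000, seed=0):
    """Paired permutation (sign-flip) test for the mean difference between
    two methods' scGraph-OntoRWR scores across datasets.

    Returns a two-sided p-value estimated from random sign flips of the
    per-dataset score differences.
    """
    rng = np.random.default_rng(seed)
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    observed = diff.mean()
    signs = rng.choice([-1, 1], size=(n_perm, diff.size))
    null = (signs * diff).mean(axis=1)
    return float((np.abs(null) >= abs(observed)).mean())

# Hypothetical scores for two models across eight evaluation datasets
model_x = [0.72, 0.70, 0.75, 0.68, 0.74, 0.71, 0.73, 0.69]
model_y = [0.61, 0.63, 0.60, 0.58, 0.65, 0.62, 0.59, 0.64]
p = paired_sign_flip_test(model_x, model_y)
```

With few datasets, this exact-style test is more trustworthy than a t-test whose normality assumption cannot be checked.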

Future Directions and Methodological Refinements

As single-cell foundation models continue to evolve, the scGraph-OntoRWR metric provides a foundation for several important methodological advances:

  • Multi-ontology integration: Future versions could incorporate additional ontological frameworks, such as the Gene Ontology and Protein Ontology, for a more comprehensive assessment of biological relevance.

  • Temporal dynamics: Extending the approach to capture developmental trajectories and temporal processes would enhance its utility for studying cellular differentiation and disease progression.

  • Spatial context integration: Incorporating spatial relationships from transcriptomic data would align the metric with the increasing importance of spatial context in biology.

  • Automated hyperparameter optimization: Developing adaptive methods for setting scGraph-OntoRWR parameters would improve its robustness across diverse datasets and applications.

The continued refinement of biology-driven metrics like scGraph-OntoRWR will be essential for ensuring that single-cell foundation models evolve from powerful pattern recognition tools to genuine instruments of biological discovery.

Single-cell foundation models (scFMs), pretrained on millions of single-cell transcriptomes, promise to revolutionize biological discovery by providing powerful, general-purpose representations for diverse downstream tasks [1]. The "zero-shot" learning paradigm, where models are applied without any task-specific fine-tuning, is particularly critical for exploratory research where predefined labels are unavailable [5]. This application note provides a structured evaluation of scFM performance on three fundamental tasks—cell clustering, batch integration, and perturbation prediction—synthesizing insights from recent benchmarking studies to guide researchers in model selection and application.

Performance Evaluation Tables

Cell Clustering Performance (Zero-Shot)

Table 1: Zero-shot clustering performance of scFMs compared to established baselines, measured by Average BIO score (higher is better).

Model PBMC (12k) Tabula Sapiens Pancreas Immune Dataset
HVG 0.78 0.75 0.72 0.74
scVI 0.75 0.77 0.75 0.76
Harmony 0.74 0.73 0.70 0.72
scGPT 0.79 0.74 0.71 0.70
Geneformer 0.65 0.62 0.60 0.58

Data derived from [5], which evaluated embeddings on known cell type separation. HVG (Highly Variable Genes) serves as a simple yet strong baseline.

Batch Integration Capabilities

Table 2: Batch integration scores across different datasets and methods (higher scores indicate better batch mixing).

Model Pancreas PBMC Tabula Sapiens Immune Dataset
HVG 0.89 0.91 0.87 0.85
scVI 0.85 0.88 0.82 0.75
Harmony 0.80 0.83 0.72 0.81
scGPT 0.75 0.78 0.84 0.83
Geneformer 0.45 0.50 0.48 0.42

Scores represent batch integration metrics evaluated in [5]. Performance varies significantly by dataset characteristics and batch effect types.

Perturbation Prediction Accuracy

Table 3: Performance comparison on predicting transcriptional responses to unseen genetic perturbations (PearsonΔ, higher is better).

Method Adamson Dataset Norman Dataset Replogle Dataset
Perturbed Mean 0.68 0.65 0.62
Matching Mean 0.65 0.67* 0.60
scGPT 0.58 0.59 0.55
GEARS 0.55 0.60 0.52
CPA 0.52 0.56 0.50

Data from [65] evaluating prediction of unseen perturbation effects. *For combinatorial perturbations in the Norman dataset, Matching Mean performs best. Simple baselines surprisingly compete with or outperform specialized models.

Experimental Protocols

Protocol 1: Zero-Shot Cell Type Clustering

Purpose: To evaluate scFM embeddings for discriminating cell types without fine-tuning.

Workflow:

  • Embedding Extraction: Generate cell embeddings using the scFM's zero-shot mode. For scGPT, use the model.encode() method; for Geneformer, extract the final layer embeddings [5] [17].
  • Dimensionality Reduction: Apply PCA (50 components) to the embeddings, followed by UMAP for 2D visualization.
  • Clustering: Perform Leiden clustering on the k-nearest neighbor graph (k=20) constructed from PCA-reduced embeddings.
  • Evaluation: Compare cluster labels to ground truth cell annotations using Adjusted Rand Index (ARI) and Normalized Mutual Information (NMI) metrics [66].

Key Controls: Ensure the evaluation dataset was not part of the model's pretraining corpus to avoid data leakage [5].
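Computing the two evaluation metrics is a one-liner each with scikit-learn; the annotations and cluster assignments below are hypothetical.

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# Hypothetical cluster assignments vs. ground-truth cell type annotations
truth    = np.array(["T cell"] * 4 + ["B cell"] * 4 + ["NK"] * 4)
clusters = np.array([0, 0, 0, 1, 1, 1, 1, 1, 2, 2, 2, 2])  # one misassigned cell

ari = adjusted_rand_score(truth, clusters)          # chance-corrected agreement
nmi = normalized_mutual_info_score(truth, clusters) # information-theoretic agreement
```

Both metrics are invariant to label permutations, so cluster IDs need not be matched to cell type names before scoring.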

Protocol 2: Batch Effect Integration

Purpose: To assess scFM capability to remove technical batch effects while preserving biological variation.

Workflow:

  • Data Preparation: Select a dataset with known batch effects (e.g., pancreatic islet data from multiple labs).
  • Embedding Generation: Process batches through the scFM to obtain integrated embeddings.
  • Batch Correction Assessment: Calculate the graph integration local inverse Simpson's Index (iLISI) to quantify batch mixing [67].
  • Biological Preservation Assessment: Compute cell-type silhouette width and normalized mutual information (NMI) to ensure biological signals remain [67].

Troubleshooting: If biological information is lost (low NMI), consider sysVI, a specialized variational autoencoder method that combines VampPrior with cycle-consistency constraints for challenging integration scenarios [67].

Protocol 3: Perturbation Response Prediction

Purpose: To predict single-cell transcriptional responses to unseen genetic perturbations.

Workflow:

  • Data Partitioning: Split perturbation data using leave-one-out cross-validation, ensuring target perturbations are absent from training [65].
  • Model Setup: For scGPT, use the perturbation prediction head fine-tuned on similar data.
  • Prediction: Generate predicted expression profiles for the held-out perturbation.
  • Evaluation: Compute the average treatment effect (difference from control cells) and calculate Pearson correlation (PearsonΔ) between predicted and actual differential expression [65].

Critical Consideration: Use the Systema framework to control for systematic variation—consistent differences between perturbed and control cells that can inflate performance metrics [65].
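A minimal PearsonΔ implementation follows, using simulated control and perturbed profiles (all values hypothetical); the Systema bias adjustment itself is not reproduced here.

```python
import numpy as np

def pearson_delta(pred_pert, obs_pert, ctrl_mean):
    """PearsonΔ: correlation between predicted and observed differential
    expression, both taken relative to the control mean profile."""
    d_pred = pred_pert - ctrl_mean
    d_obs = obs_pert - ctrl_mean
    return float(np.corrcoef(d_pred, d_obs)[0, 1])

# Simulated profiles over 500 genes (hypothetical values)
rng = np.random.default_rng(4)
n_genes = 500
ctrl_mean = rng.normal(1.0, 0.3, n_genes)
true_effect = rng.normal(0.0, 0.5, n_genes)
obs = ctrl_mean + true_effect

good_pred = ctrl_mean + true_effect + rng.normal(0, 0.1, n_genes)
unrelated_pred = ctrl_mean + rng.normal(0, 0.5, n_genes)  # random "effect"

r_good = pearson_delta(good_pred, obs, ctrl_mean)
r_unrelated = pearson_delta(unrelated_pred, obs, ctrl_mean)
```

Correlating deltas rather than raw profiles prevents a model from scoring well simply by reproducing the control expression pattern.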

Workflow Diagrams

Zero-Shot Evaluation Workflow

Workflow: input scRNA-seq data is preprocessed (QC, normalization) and passed through the foundation model to obtain zero-shot embeddings; the task is then configured (clustering, integration, or prediction) and performance is evaluated (metrics: ARI, iLISI, PearsonΔ) before biological interpretation of the output. Baseline methods (HVG, scVI, Harmony) feed into the same evaluation step for comparison.

Zero-Shot Evaluation Workflow: comparative evaluation of scFMs against established baseline methods.

Perturbation Prediction with Systematic Variation Control

Workflow: a perturbation dataset is split into control cells (reference population) and perturbed cells (treatment population). The signal in perturbed cells combines systematic variation (selection bias, confounders) with perturbation-specific effects (the biological signal of interest). Standard evaluation (metrics: PearsonΔ, RMSE) mixes both components, while bias-adjusted evaluation under the Systema framework scores the perturbation-specific effects only.

Perturbation Prediction Evaluation: this workflow highlights the critical distinction between systematic variation and perturbation-specific effects when evaluating prediction models.

The Scientist's Toolkit

Table 4: Essential research reagents and computational resources for scFM evaluation.

Resource Type Function Example/Reference
CELLxGENE Data Platform Provides standardized, annotated single-cell datasets for pretraining and evaluation [1]
BioLLM Software Framework Unified interface for integrating and benchmarking diverse scFMs [8]
Systema Evaluation Framework Controls for systematic variation in perturbation prediction tasks [65]
scICE Clustering Tool Enhances clustering reliability and efficiency for large datasets [68]
sysVI Integration Method Specialized cVAE for datasets with substantial batch effects [67]
PerturbNet Prediction Model Deep generative model for chemical and genetic perturbation prediction [69]

Performance evaluations reveal that single-cell foundation models demonstrate promising but inconsistent capabilities in zero-shot settings. While they offer substantial utility as versatile, general-purpose tools, their performance is highly task-dependent and often matched or exceeded by simpler, specialized methods [5] [17]. For cell clustering, established baselines like HVG selection remain remarkably strong; for batch integration, scFMs show variable performance across different types of batch effects; and for perturbation prediction, simple mean-based baselines surprisingly compete with sophisticated models [65] [5]. These findings emphasize that biological context, dataset characteristics, and careful evaluation design are paramount in selecting the appropriate computational approach. The emerging framework of zero-shot evaluation provides critical insights into the true generalization capabilities of scFMs beyond fine-tuning scenarios, guiding their responsible application in biological discovery and therapeutic development.

Single-cell foundation models (scFMs) represent a transformative advancement in computational biology, promising to decode the intricate language of cellular function from vast transcriptomic datasets. These models, including Geneformer and scGPT, are pretrained on millions of single-cell transcriptomes using self-supervised objectives, analogous to how large language models learn from text corpora [1]. The anticipated benefit is zero-shot capability—applying these models directly to downstream tasks like cell type annotation and batch integration without task-specific fine-tuning. This approach is particularly valuable in exploratory biological research where predefined labels are unavailable [5]. However, a growing body of evidence reveals a critical disconnect: scFMs often achieve impressive technical metrics while failing to provide novel biological insights. This application note examines this gap through rigorous evaluation frameworks and provides protocols for implementing biologically-grounded assessment of scFM performance.

The Performance Gap: Technical Metrics Versus Biological Relevance

Quantitative Performance Shortfalls

Recent benchmarking studies demonstrate that scFMs underperform simpler methods in zero-shot settings across fundamental analytical tasks. As shown in Table 1, both Geneformer and scGPT are consistently outperformed in cell type clustering by established methods such as Harmony and scVI, and even by simple highly variable gene (HVG) selection, when measured by the Average BIO (AvgBio) score and average silhouette width (ASW) [5].

Table 1: Zero-shot performance comparison in cell type clustering

| Model/Method | AvgBio Score (Pancreas) | ASW (Tabula Sapiens) | Performance on Novel Cell Types |
|---|---|---|---|
| Geneformer | 0.41 | 0.38 | Limited generalization |
| scGPT | 0.52 | 0.61 | Variable performance |
| scVI | 0.68 | 0.59 | Consistent across datasets |
| Harmony | 0.65 | 0.55 | Consistent across datasets |
| HVG Selection | 0.71 | 0.63 | Consistent across datasets |
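The silhouette-based scores above can be reproduced in spirit on any embedding matrix. The sketch below, which assumes scikit-learn and uses synthetic embeddings as stand-ins for real cell representations, computes an scIB-style rescaled average silhouette width; it illustrates the metric, not the exact benchmark implementation from [5]:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic stand-in for cell embeddings: three well-separated "cell types".
emb = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 16)) for c in (0.0, 3.0, 6.0)])
labels = np.repeat(["alpha", "beta", "delta"], 100)

# Average silhouette width, rescaled from [-1, 1] to [0, 1] as in scIB-style
# benchmarks; higher values indicate better cell-type separation.
asw = (silhouette_score(emb, labels) + 1.0) / 2.0
```

Running the same computation on scFM embeddings versus HVG-based PCA embeddings of the same dataset gives a directly comparable bio-conservation score.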

In batch integration tasks, which aim to remove technical artifacts while preserving biological variation, the limitations are even more pronounced. Geneformer's embedding space frequently fails to retain meaningful cell type information, with clustering primarily driven by batch effects rather than biological signals [5]. scGPT shows somewhat better performance but still underperforms established methods on datasets with complex technical and biological batch effects [5].

Beyond Technical Scores: The Biological Insight Deficit

The performance gap extends beyond quantitative metrics to a fundamental disconnect in biological interpretability. Foundation models often lack transparency in how they represent cellular states, making it difficult to extract mechanistically meaningful insights [1] [70]. For instance, while a model might achieve reasonable clustering accuracy, the basis for these groupings may not align with established biological knowledge or reveal novel functional relationships.

This limitation is particularly problematic in drug discovery applications, where understanding the biological mechanism is as crucial as identifying patterns. Models that excel at technical benchmarks but fail to provide interpretable insights into gene regulatory networks or signaling pathways have limited utility in translational research [70] [71].

Experimental Protocols for Biologically-Grounded Evaluation

Protocol 1: Implementing Zero-Shot Cell Type Annotation

Purpose: To evaluate scFM performance in cell type identification without task-specific fine-tuning, simulating real-world discovery settings where cell compositions are unknown.

Materials:

  • Pretrained scFM (e.g., scGPT, Geneformer, LangCell)
  • Reference dataset with ground truth annotations (e.g., Tabula Sapiens, CELLxGENE Census)
  • Baseline methods (scVI, Harmony, Seurat)
  • Evaluation metrics (Accuracy, F1-score, LCAD, scGraph-OntoRWR)

Procedure:

  • Embedding Extraction: Process the target dataset through the scFM in zero-shot mode to obtain cell embeddings without fine-tuning [5] [4].
  • Baseline Comparison: Generate embeddings using established baseline methods (scVI, Harmony) and simple HVG selection [5].
  • Clustering Analysis: Apply standard clustering algorithms (e.g., Leiden, K-means) to all embedding types.
  • Biological Alignment Assessment:
    • Calculate traditional metrics (ARI, AMI) for cluster quality against ground truth [4].
    • Compute ontology-informed metrics (LCAD) to measure semantic distance between misclassified cell types [4].
    • Apply scGraph-OntoRWR to evaluate consistency between embedding-derived relationships and established biological knowledge [4].
  • Interpretability Analysis: Extract attention weights or feature importance scores from the scFM to identify genes driving cell type classification [71].

Troubleshooting: If biological alignment is poor despite good technical metrics, prioritize methods with inherent interpretability, such as scKAN or scMKL, which provide more transparent feature importance [70] [71].
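Steps 1 through 4 of Protocol 1 can be sketched as follows. This is a minimal illustration assuming scikit-learn, with K-means standing in for Leiden and synthetic embeddings standing in for real scFM output; the `evaluate_embedding` helper is hypothetical, not part of any scFM package:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, adjusted_mutual_info_score

def evaluate_embedding(embeddings, true_labels, n_clusters):
    """Cluster a (cells x dims) embedding matrix and score the clustering
    against ground-truth annotations with ARI and AMI (Protocol 1, steps 3-4)."""
    pred = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    return {
        "ARI": adjusted_rand_score(true_labels, pred),
        "AMI": adjusted_mutual_info_score(true_labels, pred),
    }

# Toy demonstration with synthetic "embeddings" for three cell types; in
# practice these would come from the scFM (zero-shot) and from each baseline.
rng = np.random.default_rng(1)
emb = np.vstack([rng.normal(c, 0.5, size=(80, 32)) for c in (0.0, 4.0, 8.0)])
truth = np.repeat([0, 1, 2], 80)
scores = evaluate_embedding(emb, truth, n_clusters=3)
```

Applying the same function to every embedding type (scFM, scVI, Harmony, HVG) yields the comparable table of technical metrics that the biological-alignment step then complements.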

Protocol 2: Batch Integration with Biological Conservation Assessment

Purpose: To evaluate scFM capability to remove technical batch effects while preserving biologically meaningful variation.

Materials:

  • Dataset with known batch effects and biological ground truth (e.g., Pancreas benchmark with multiple experiments)
  • Batch integration methods (scGPT, Geneformer, Harmony, scVI, scMKL)
  • Evaluation framework with batch removal and bio-conservation metrics

Procedure:

  • Data Preparation: Select a dataset with significant technical variation (different sequencing technologies) but known biological states [5].
  • Embedding Generation: Apply scFMs and baseline methods to generate integrated embeddings.
  • Batch Effect Quantification:
    • Calculate batch removal scores (PCR, LISI) to measure technical artifact removal [5].
    • Visualize embeddings using UMAP/t-SNE to assess qualitative batch mixing.
  • Biological Conservation Assessment:
    • Measure conservation of known biological groups (cell types, states) using clustering metrics.
    • Evaluate preservation of trajectory structures where applicable.
    • Assess conservation of rare cell populations that might be lost in over-correction.
  • Comparative Analysis: Rank methods by both batch removal and biological conservation, noting any trade-offs [5] [4].

Troubleshooting: If batch integration removes biological signal, adjust method parameters or consider hierarchical approaches that distinguish technical and biological variation.
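The batch-effect quantification step can be approximated with a simple neighborhood statistic. The sketch below, assuming scikit-learn, is a simplified stand-in for LISI-style scores, not the published LISI implementation: it measures the mean fraction of each cell's nearest neighbors drawn from the same batch, where values near the global batch proportion indicate good mixing and values near 1 indicate unmixed batches:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_mixing_score(embeddings, batch_labels, k=30):
    """Mean fraction of each cell's k nearest neighbors that share its batch.
    Near the global batch proportion = well mixed; near 1.0 = batches separated."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbors = batch_labels[idx[:, 1:]]  # drop each cell's self-neighbor
    return float((neighbors == batch_labels[:, None]).mean())

# Two synthetic scenarios: batches sampled from one distribution (well mixed)
# versus batch 1 shifted far away in embedding space (strong batch effect).
rng = np.random.default_rng(2)
well_mixed = rng.normal(size=(200, 10))
batches = np.array([0, 1] * 100)
separated = well_mixed + batches[:, None] * 10.0

mixed_score = batch_mixing_score(well_mixed, batches)
sep_score = batch_mixing_score(separated, batches)
```

Reporting this score alongside a bio-conservation metric (e.g., the rescaled ASW on cell-type labels) makes the batch-removal versus biology-preservation trade-off explicit for each method.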

Protocol 3: Pathway-Centric Interpretation of Model Predictions

Purpose: To move beyond gene-level importance to pathway-centric interpretation of scFM outputs.

Materials:

  • scFM with attention mechanisms or interpretable architectures (scGPT, scKAN, scMKL)
  • Prior biological knowledge bases (GO, KEGG, Hallmark gene sets)
  • Functional analysis tools (GSEA, enrichment analysis)

Procedure:

  • Feature Importance Extraction:
    • For transformer models, extract attention weights across gene tokens [4] [71].
    • For interpretable architectures (scKAN, scMKL), directly obtain feature importance scores [70] [71].
  • Gene Set Enrichment: Project gene-level importance scores onto curated pathway databases (GO, KEGG, Hallmark) [70] [4].
  • Pathway Activity Scoring: Derive pathway-level importance scores from enriched gene sets.
  • Biological Validation:
    • Compare identified pathways to established biological knowledge of the system.
    • Assess novelty of predictions and generate testable hypotheses.
    • Where possible, validate predictions using orthogonal data (perturbation studies, proteomics).
  • Cross-Model Comparison: Evaluate consistency of pathway discoveries across different scFMs and baseline methods.

Troubleshooting: If pathway interpretations lack coherence, incorporate protein-protein interaction networks or gene regulatory information to contextualize findings.
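The gene set enrichment step of Protocol 3 reduces to an overlap test between the model's top-importance genes and each curated pathway. A minimal sketch using SciPy's hypergeometric test is shown below; the gene symbols and the `pathway_enrichment` helper are hypothetical, and a production analysis would also correct for multiple testing across pathways:

```python
from scipy.stats import hypergeom

def pathway_enrichment(top_genes, pathway_genes, background_genes):
    """One-sided hypergeometric p-value: how surprising is the overlap between
    a model's top-importance genes and a pathway, given the gene universe?"""
    top, path, bg = set(top_genes), set(pathway_genes), set(background_genes)
    overlap = len(top & path & bg)
    M = len(bg)          # size of the background gene universe
    n = len(path & bg)   # pathway genes present in the background
    N = len(top & bg)    # top-importance genes present in the background
    # P(X >= overlap) under hypergeometric sampling without replacement
    return hypergeom.sf(overlap - 1, M, n, N)

# Toy example with invented gene identifiers.
background = [f"G{i}" for i in range(1000)]
pathway = [f"G{i}" for i in range(20)]                       # G0..G19
top_hits = [f"G{i}" for i in range(10)] + ["G500", "G501"]   # 10 of 12 in pathway

p = pathway_enrichment(top_hits, pathway, background)
```

The resulting per-pathway p-values (or enrichment scores from GSEA) feed directly into the pathway activity scoring and biological validation steps.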

Visualizing the Evaluation Framework

[Diagram: Input single-cell data undergoes zero-shot embedding extraction, then downstream task application, which feeds two parallel evaluation arms. Technical metrics cover clustering (ARI, AMI), batch correction (PCR, LISI), and classification scores (AUROC, accuracy). Biological metrics cover ontology alignment (LCAD, scGraph-OntoRWR), pathway enrichment (GO, Hallmark), and rare-population conservation. Both arms combine into integrated performance scoring, which informs model selection and application.]

Figure 1: Zero-Shot scFM Evaluation Framework

Emerging Solutions: Toward Biologically Meaningful Models

Interpretable Architecture Designs

Novel architectures are addressing the interpretability gap by design. The scKAN framework uses Kolmogorov-Arnold Networks with learnable activation curves to model gene-cell relationships directly, providing transparent feature importance scores for cell-type-specific marker discovery [71]. Similarly, scMKL integrates multiple kernel learning with biological pathway information, enabling interpretable multimodal analysis of transcriptomic and epigenomic data [70].

Table 2: Interpretable scFM architectures and their applications

| Model | Architecture | Interpretability Features | Best Applications |
|---|---|---|---|
| scKAN | Kolmogorov-Arnold Networks | Learnable activation curves, gene importance scores | Cell type annotation, marker discovery, drug repurposing |
| scMKL | Multiple Kernel Learning | Pathway-level interpretations, multimodal integration | Cancer subtyping, regulatory mechanism identification |
| TOSICA | Transformer with biological concepts | Biologically understandable entities, one-shot annotation | Novel cell type identification, tumor microenvironment |
| scBERT | BERT-style encoder | Gene-gene interaction capture, attention visualization | Cell type annotation, pattern discovery |

Biology-Aware Evaluation Metrics

Moving beyond technical benchmarks, researchers are developing biology-grounded evaluation frameworks:

  • scGraph-OntoRWR: Measures consistency between cell type relationships in the embedding space and established biological ontologies [4].
  • Lowest Common Ancestor Distance (LCAD): Quantifies the semantic distance between misclassified cell types in biological ontologies, providing more nuanced error analysis than simple accuracy [4].
  • Roughness Index (ROGI): Evaluates the smoothness of cell-property landscapes in latent spaces, correlating with model generalization capability [4].
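To make the LCAD idea concrete, the sketch below implements one plausible reading of the metric on a toy ontology using NetworkX: the distance from the true and predicted labels up to their lowest common ancestor. This is an illustrative implementation, not the reference one from [4], and the ontology edges here are invented for the example rather than drawn from the Cell Ontology:

```python
import networkx as nx

# Toy cell-type ontology as a directed tree (parent -> child), standing in
# for a real ontology such as the Cell Ontology.
ontology = nx.DiGraph([
    ("cell", "immune cell"), ("cell", "epithelial cell"),
    ("immune cell", "T cell"), ("immune cell", "B cell"),
    ("T cell", "CD4 T cell"), ("T cell", "CD8 T cell"),
])

def lcad(graph, true_type, predicted_type):
    """Lowest-common-ancestor distance between a true and a predicted label:
    the summed path lengths from their lowest common ancestor down to each.
    Confusing sibling subtypes scores lower than crossing major lineages."""
    lca = nx.lowest_common_ancestor(graph, true_type, predicted_type)
    return (nx.shortest_path_length(graph, lca, true_type)
            + nx.shortest_path_length(graph, lca, predicted_type))

near_miss = lcad(ontology, "CD4 T cell", "CD8 T cell")  # sibling subtypes
far_miss = lcad(ontology, "CD4 T cell", "B cell")       # crosses lineages
```

Under this scoring, mislabeling a CD4 T cell as a CD8 T cell is penalized less than mislabeling it as a B cell, which is the nuance plain accuracy cannot express.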

Multimodal Integration Approaches

Models like CellWhisperer are bridging the gap between transcriptomics and biological knowledge by creating joint embedding spaces of transcriptomes and textual descriptions, enabling natural language querying of cellular states [25]. This approach connects computational representations with rich biological context, facilitating more meaningful interpretation of results.

Table 3: Key research reagents and computational resources for scFM evaluation

| Resource | Type | Function | Access |
|---|---|---|---|
| CELLxGENE Census | Data Resource | Curated single-cell data for pretraining and benchmarking | Public portal |
| CELLxGENE Explorer | Software Tool | Interactive visualization of single-cell data | Open source |
| CELLxGENE CellGuide | Reference Data | Standardized cell type definitions and markers | Public resource |
| scGPT | Foundation Model | Transformer-based scFM for multiple downstream tasks | GitHub repository |
| Geneformer | Foundation Model | Context-aware scFM for transcriptome analysis | GitHub repository |
| scVI | Baseline Method | Probabilistic modeling for scRNA-seq analysis | Python package |
| Harmony | Baseline Method | Integration method for scRNA-seq data | R/Python package |
| Seurat | Analysis Toolkit | Comprehensive scRNA-seq analysis suite | R package |
| CellWhisperer | Multimodal Tool | Natural language querying of transcriptomic data | Web interface |

The disconnect between technical metrics and biological insight represents a critical challenge in single-cell foundation model development. While scFMs show promise for zero-shot learning in biological discovery, their current limitations in providing mechanistically meaningful insights necessitate careful evaluation strategies. Through the implementation of biology-aware assessment protocols, ontology-informed metrics, and interpretable model architectures, researchers can better navigate the gap between technical performance and biological relevance. As the field progresses, prioritizing biological insight over purely technical benchmarks will be essential for realizing the full potential of foundation models in accelerating therapeutic discovery and advancing our understanding of cellular biology.

Conclusion

Zero-shot learning with single-cell foundation models represents a paradigm shift with immense potential for biological discovery, yet its current state is one of cautious optimism. The synthesis of evidence reveals that while scFMs are versatile and can capture meaningful biological relationships, their zero-shot performance often lags behind simpler, established methods for tasks like cell type clustering and batch integration. Critical challenges remain in data quality, model architecture, and the fundamental pretraining objective. However, emerging strategies—such as biology-driven benchmarking, efficient fine-tuning, and models like scShift that theoretically disentangle variation—point toward a promising future. For researchers and clinicians, this underscores the need for rigorous, zero-shot-specific validation before deploying these tools in discovery pipelines. The trajectory of the field points toward more robust, interpretable, and biologically-grounded models that will eventually fulfill the promise of accelerating drug discovery and unlocking deeper insights into cellular function and disease mechanisms.

References