Zero-Shot Learning in Single-Cell Foundation Models: A Realistic Assessment of Capabilities and Limitations for Biomedical Research

Logan Murphy Nov 27, 2025

Abstract

This article provides a comprehensive analysis of the zero-shot capabilities of single-cell foundation models (scFMs), which are large-scale AI models pre-trained on millions of single-cell transcriptomes. Aimed at researchers, scientists, and drug development professionals, it explores the foundational concepts of scFMs, their practical applications in tasks like cell type annotation and batch integration, and rigorous benchmarking that reveals their current performance gaps compared to simpler methods. Synthesizing the latest 2025 research, the article also covers strategies for optimizing model utility, introduces novel biology-driven evaluation metrics, and discusses the future trajectory of these tools in advancing drug discovery and clinical applications.

Understanding Single-Cell Foundation Models and the Critical Role of Zero-Shot Evaluation

What Are Single-Cell Foundation Models? Defining the AI Paradigm for Cell Biology

Single-cell foundation models (scFMs) represent a transformative paradigm in computational biology, leveraging large-scale deep learning architectures pre-trained on massive single-cell datasets to enable a wide range of downstream analytical tasks. These models are built on the premise that by exposing an artificial intelligence system to millions of single-cell profiles encompassing diverse tissues, species, and biological conditions, the model can learn fundamental principles of cellular biology that generalize to new datasets and applications [1] [2]. Inspired by the revolutionary success of foundation models in natural language processing and computer vision, researchers have adapted these approaches to decipher the "language of cells," where individual cells are treated analogously to sentences, and genes or genomic features serve as words or tokens [1].

The significance of scFMs lies in their potential to overcome critical challenges in single-cell biology, including the high dimensionality, sparsity, and technical noise inherent in single-cell sequencing data [2]. By capturing universal patterns across vast collections of single-cell measurements, these models aim to provide a unified framework for analyzing cellular heterogeneity, regulatory networks, and biological systems at unprecedented scale and resolution. The emergence of public data archives containing tens of millions of single-cell omics datasets has created the fertile ground needed for training these sophisticated models, enabling researchers to move from targeted analyses of individual experiments to generalized computational approaches that leverage aggregated biological knowledge [1].

Architectural Framework and Core Components

Model Architecture and Tokenization Strategies

Most single-cell foundation models are built on transformer architectures, which have demonstrated remarkable success in capturing complex relationships in sequential data. The adaptation of transformers to single-cell data requires innovative solutions to address the non-sequential nature of biological measurements. Unlike words in a sentence, genes in a cell have no inherent ordering, necessitating specialized tokenization approaches that convert gene expression profiles into structured input sequences [1].

Common tokenization strategies include ranking genes within each cell by expression levels, creating a deterministic sequence based on expression magnitude. Alternative approaches partition genes into expression value bins or use normalized counts directly [1]. The tokenization process typically generates three core components: gene embeddings (representing gene identity), value embeddings (capturing expression levels), and positional embeddings (providing sequence context). Some models incorporate special tokens for cell identity, experimental metadata, or modality indicators when handling multi-omics data [2]. These embeddings are processed through multiple transformer layers with self-attention mechanisms that learn to weight relationships between gene tokens, effectively capturing co-expression patterns and regulatory relationships [1].
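
To make the embedding assembly concrete, here is a toy NumPy sketch; every table, dimension, and ID below is a made-up stand-in for learned parameters rather than any specific model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_bins, d_model, seq_len = 20_000, 10, 64, 8

# Hypothetical lookup tables; in a real model these are learned parameters.
gene_table = rng.normal(size=(n_genes, d_model))   # gene identity embeddings
value_table = rng.normal(size=(n_bins, d_model))   # one row per expression bin
pos_table = rng.normal(size=(seq_len, d_model))    # positional embeddings

gene_ids = np.array([5, 17, 120, 4031, 998, 2, 77, 15000])  # tokenized genes
value_bins = np.array([9, 9, 8, 7, 5, 5, 3, 1])             # binned expression

# Combine the three embedding types (here by summation) into the input.
x = gene_table[gene_ids] + value_table[value_bins] + pos_table[np.arange(seq_len)]
print(x.shape)  # (8, 64): one d_model-dimensional token per selected gene
```

Summation is the combination used by BERT-style models; concatenation followed by a linear projection is an alternative design choice some architectures adopt.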

Table 1: Architectural Variations in Single-Cell Foundation Models

| Model Type | Architecture | Tokenization Approach | Primary Application |
| --- | --- | --- | --- |
| Encoder-based (BERT-like) | Bidirectional attention | Gene ranking or binning | Cell classification, embedding generation |
| Decoder-based (GPT-like) | Unidirectional masked attention | Expression-based sequencing | Gene expression prediction, generation |
| Hybrid Designs | Encoder-decoder combinations | Multi-modal integration | Cross-modal translation, complex inference |

Pretraining Objectives and Knowledge Capture

ScFMs are typically pretrained using self-supervised learning objectives that don't require manually labeled data. The most common approach is masked language modeling, where the model is trained to predict the expression of randomly masked genes given the context of other genes in the cell [3]. This training paradigm encourages the model to learn biological relationships between genes, such as co-regulation within pathways or functional modules. The underlying hypothesis is that successfully predicting masked gene expressions requires understanding the complex dependencies and interactions within cellular systems [1] [2].

During pretraining, models develop rich internal representations at both the gene and cell levels. Gene embeddings capture functional similarities, while cell embeddings encode cellular states and types [2]. The attention mechanisms in transformer layers potentially learn to identify key regulatory relationships and biological pathways. However, recent evaluations have raised questions about the depth of biological knowledge actually captured during pretraining, as models sometimes fail to outperform simpler methods on fundamental tasks [4] [3].

Critical Evaluation of Zero-Shot Capabilities

Performance Benchmarks in Zero-Shot Settings

Zero-shot evaluation, where models are applied to downstream tasks without any task-specific training, represents the most rigorous test of a foundation model's generalization capabilities and biological understanding. This assessment approach is particularly critical for discovery settings where labels are unknown or task-specific training is impractical [4]. Recent comprehensive evaluations of popular scFMs like Geneformer and scGPT have revealed significant limitations in their zero-shot performance across fundamental analytical tasks.

In cell type clustering, both Geneformer and scGPT underperform established methods such as scVI and Harmony, as well as simple approaches like selecting highly variable genes (HVG). Quantitative assessments using metrics like average BIO score demonstrate that these foundation models struggle to separate known cell types across multiple datasets, with performance inconsistencies that aren't fully explained by overlap between evaluation and pretraining datasets [4]. Similarly, in batch integration tasks, which aim to remove technical artifacts while preserving biological variation, scFMs show limited effectiveness. Geneformer's embeddings often fail to retain cell type information, with clustering primarily driven by batch effects rather than biological signals [4].

Table 2: Zero-Shot Performance Comparison Across Single-Cell Analytical Tasks

| Method | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Gene Expression Prediction (Pearson Correlation) |
| --- | --- | --- | --- |
| scGPT | 0.45-0.62 | 0.51-0.65 | 0.08-0.22 (without cell embedding) |
| Geneformer | 0.38-0.55 | 0.42-0.58 | Not comprehensively evaluated |
| scVI | 0.58-0.71 | 0.63-0.75 | N/A |
| Harmony | 0.54-0.69 | 0.59-0.72 | N/A |
| HVG Selection | 0.61-0.73 | 0.67-0.78 | N/A |

Biological Relevance and Representation Learning

Beyond quantitative metrics, researchers have developed novel approaches to assess the biological relevance of representations learned by scFMs. The scGraph-OntoRWR metric measures the consistency between cell type relationships captured by model embeddings and established biological knowledge from cell ontologies [2]. Similarly, gene embeddings can be evaluated by their ability to predict functional relationships, tissue specificity, and Gene Ontology terms [2].

These analyses reveal that while scFMs capture some biological structure, their representations don't consistently outperform simpler alternatives or directly align with known biological hierarchies. The discrepancy between the promising conceptual framework of scFMs and their practical performance limitations suggests several potential issues: the masked language modeling objective may not optimally transfer to downstream tasks, models may require different architectural approaches to effectively capture biological complexity, or current training datasets may lack the diversity or quality needed for robust generalization [4] [3].

Experimental Protocols for scFM Evaluation

Protocol 1: Zero-Shot Cell Type Annotation

Purpose: To evaluate the capability of scFMs to generate cell embeddings that separate cell types without task-specific training, simulating discovery settings where cell type labels are unknown.

Materials:

  • Single-cell RNA sequencing dataset with ground truth cell type labels
  • Pretrained scFM (e.g., scGPT, Geneformer, UCE, scFoundation)
  • Comparison methods (scVI, Harmony, HVG selection)
  • Computing environment with adequate GPU resources

Procedure:

  • Data Preprocessing:
    • Load target dataset and apply standard quality control filters
    • Normalize gene expression values if required by the specific foundation model
    • Note: Do not perform model fine-tuning or task-specific training
  • Embedding Generation:

    • Extract cell embeddings from the foundation model using its zero-shot capabilities
    • For scGPT: use the cell embedding from the special [CLS] token
    • For Geneformer: extract the cell representation from the final layer
  • Dimensionality Reduction and Clustering:

    • Apply UMAP or t-SNE to embeddings for visualization
    • Perform Leiden clustering on the embedding space
    • Compare cluster identities with ground truth cell type labels
  • Quantitative Assessment:

    • Calculate ARI (Adjusted Rand Index) between clusters and true labels
    • Compute NMI (Normalized Mutual Information) to evaluate cluster purity
    • Determine ASW (Average Silhouette Width) for cell type separation quality
    • Apply ontology-informed metrics (LCAD) to assess biological relevance of errors

Interpretation: High ARI and NMI scores indicate strong zero-shot clustering performance. Comparison with baseline methods reveals whether the foundation model provides advantages over established approaches. The LCAD metric helps determine if misclassifications are biologically reasonable (closely related cell types) or severe (distantly related types) [2].
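
The quantitative assessment step can be sketched with scikit-learn on synthetic embeddings; KMeans stands in for Leiden (which requires a graph library), and the three well-separated "cell types" below are fabricated purely to exercise the metrics:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

rng = np.random.default_rng(1)
# Stand-in for zero-shot cell embeddings: three synthetic "cell types".
emb = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 16))
                 for c in (0.0, 2.0, 4.0)])
truth = np.repeat([0, 1, 2], 50)

# Cluster the embedding space, then compare clusters to ground truth.
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

ari = adjusted_rand_score(truth, clusters)          # cluster/label agreement
nmi = normalized_mutual_info_score(truth, clusters) # cluster purity
asw = silhouette_score(emb, truth)                  # cell type separation
print(f"ARI={ari:.2f} NMI={nmi:.2f} ASW={asw:.2f}")
```

On real data, `emb` would be the matrix of foundation-model cell embeddings and `truth` the annotated cell type labels; the LCAD ontology metric has no standard scikit-learn implementation and is omitted here.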

Protocol 2: Batch Integration Assessment

Purpose: To evaluate the ability of scFMs to remove technical batch effects while preserving biological variation in zero-shot settings.

Materials:

  • Single-cell dataset with known batch effects and biological groups
  • Pretrained scFM and baseline integration methods
  • Metrics for batch mixing and biological conservation

Procedure:

  • Experimental Setup:
    • Select dataset with significant technical variation (different platforms, protocols, or laboratories)
    • Ensure the dataset contains known biological groups for conservation assessment
  • Embedding Generation and Integration:

    • Generate zero-shot cell embeddings using the foundation model
    • Compare against dedicated batch correction methods (Harmony, scVI)
    • Include simple baselines (HVG selection) for reference
  • Dual-Metric Evaluation:

    • Calculate batch mixing scores (iLISI, PCR) to quantify technical effect removal
    • Compute biological conservation scores (cLISI, ASW) to assess preservation of real variation
    • Visualize integration results using UMAP, coloring by batch and cell type
  • Comparative Analysis:

    • Rank methods by their ability to simultaneously minimize batch effects and preserve biology
    • Assess dataset-specific performance patterns across tissue types and technologies

Interpretation: Effective batch integration should achieve high batch mixing scores while maintaining high biological conservation. The critical assessment is whether foundation models provide advantages over specialized integration methods, particularly for complex batch effects involving both technical and biological sources of variation [4] [2].

Visualization of Model Architectures and Evaluation Workflows

[Figure: Single-cell foundation model architecture. A single-cell expression matrix is tokenized (gene ranking or binning) into gene, expression-value, and positional embeddings, combined with special tokens (cell ID, batch, modality) into input embeddings. These pass through transformer layers (multi-head self-attention, layer normalization, feed-forward network) to produce context-aware representations, from which the [CLS] cell embedding, gene representations, and a masked-gene prediction head are derived. Zero-shot applications include cell type annotation, batch effect integration, and biological relationship mining.]

Single-Cell Foundation Model Architecture

[Figure: Zero-shot evaluation workflow. An scRNA-seq dataset with ground-truth labels is passed to both the foundation model (zero-shot) and baseline methods (scVI, Harmony, HVG). Each is evaluated on cell type clustering, batch effect integration, and gene expression prediction using traditional metrics (ARI, NMI, ASW) and biological metrics (ontology alignment), yielding a comparative performance ranking and a method recommendation based on task requirements.]

Zero-Shot Evaluation Workflow

Essential Research Reagent Solutions

Table 3: Key Computational Tools for Single-Cell Foundation Model Research

| Tool Category | Representative Solutions | Primary Function | Application Context |
| --- | --- | --- | --- |
| Foundation Models | scGPT, Geneformer, UCE, scFoundation, LangCell | Large-scale pretrained models for single-cell data | Zero-shot inference, transfer learning, biological discovery |
| Baseline Methods | scVI, Harmony, Seurat, SC3 | Established single-cell analysis pipelines | Performance benchmarking, method comparison |
| Evaluation Metrics | ARI, NMI, ASW, LISI, scGraph-OntoRWR | Quantitative performance assessment | Model validation, biological relevance quantification |
| Data Resources | CZ CELLxGENE, Human Cell Atlas, PanglaoDB | Curated single-cell datasets | Model pretraining, benchmarking, transfer evaluation |
| Visualization Tools | SCope, UCSC Cell Browser | Interactive data exploration | Result interpretation, quality assessment, publication graphics |

Single-cell foundation models represent an ambitious paradigm shift in computational biology, aiming to create universal models that capture fundamental principles of cellular biology. While their conceptual framework is promising, current evaluations reveal significant limitations in zero-shot settings, where these models often underperform simpler, specialized methods [4] [3]. The discrepancy between the theoretical potential and practical performance highlights the need for continued research into model architectures, pretraining objectives, and evaluation methodologies.

Future advancements in scFMs will likely focus on several critical areas: developing more biologically meaningful pretraining objectives that better transfer to downstream tasks, incorporating multi-modal data to create more comprehensive cellular representations, improving model interpretability to extract actionable biological insights, and establishing rigorous standardized benchmarks that assess true biological understanding rather than just analytical performance [1] [2]. As these models continue to evolve, they hold the potential to transform our approach to cellular biology, enabling discoveries that bridge molecular mechanisms, cellular functions, and physiological systems through integrated AI-driven analysis.

The advent of single-cell RNA sequencing (scRNA-seq) has unveiled unprecedented resolution for exploring cellular heterogeneity. Concurrently, the transformer architecture, which has revolutionized natural language processing (NLP), is now being repurposed to interpret the "language of biology" encoded in gene expression data [1]. This convergence has given rise to single-cell foundation models (scFMs), large-scale deep learning models pretrained on vast atlases of single-cell data [1] [5]. A critical, yet underexplored, capability of these models is zero-shot learning, where the model makes predictions on novel tasks or datasets without any task-specific fine-tuning [4]. This is paramount in biological discovery settings where cell type compositions or states are unknown a priori [4] [6].

The performance of these models in a zero-shot setting hinges on two core architectural components: the tokenization process, which converts raw, non-sequential gene expression data into a structured sequence of discrete units, and the transformer model itself, which processes these tokens to learn complex, generalizable representations of cellular state [1]. This application note details the methodologies for these core components, framed within the context of zero-shot learning research, to provide researchers with the protocols needed to understand, evaluate, and apply these cutting-edge tools.

Tokenization: Converting Gene Expression to Model Input

Tokenization is the foundational step that standardizes raw, continuous, and non-sequential gene expression data into a structured format that transformer models can process. Unlike words in a sentence, genes in a cell have no inherent order, making the tokenization strategy for scRNA-seq data a critical design choice [1].

Tokenization Strategies and Protocols

The following protocols describe the primary methods for tokenizing gene expression data. The choice of method can significantly impact model performance and biological interpretability.

  • Protocol 2.1.1: Tokenization by Gene Expression Ranking

    • Objective: To create a deterministic input sequence by ranking genes based on their expression magnitude, providing a consistent order for the transformer.
    • Materials: A cell-by-gene count matrix (e.g., from Cell Ranger or Scanpy), computational environment (e.g., Python, R).
    • Method Details:
      • Input: For a single cell, start with a vector of normalized gene expression counts for all G genes.
      • Ranking: Sort the genes in descending order based on their expression values.
      • Selection: Select the top K genes (where K is a predefined sequence length, e.g., 1200) to form the input sequence.
      • Token Generation: Each gene in the ranked list is treated as a token. The token incorporates an embedding of the gene's identifier (e.g., its Ensembl ID) [1] [7].
      • Positional Encoding: Apply standard transformer positional encodings based on the gene's rank in the sequence (1st, 2nd, ..., Kth).
  • Protocol 2.1.2: Tokenization by Expression Value Binning

    • Objective: To incorporate quantitative expression levels directly into the tokenization scheme by binning expression values.
    • Materials: A cell-by-gene count matrix, normalized and log1p-transformed data.
    • Method Details:
      • Input: Normalized gene expression data for a single cell.
      • Binning: Discretize continuous expression values into N quantile bins (e.g., deciles, Q1 through Q10). This transforms a continuous value into a categorical token representing its expression level relative to the population [7].
      • Token Generation: Each gene is represented by a combination of its identity and its expression bin token. For instance, a token could be "CD4_Q5" representing the CD4 gene with a median expression level.
      • Model Insight: Models like ETHOS have demonstrated that this approach allows the transformer to learn the sequential relationship between quantile tokens, with embeddings for high quantiles (e.g., Q9, Q10) showing greater separation, potentially reflecting their heightened clinical significance [7].
  • Protocol 2.1.3: Integration of Special and Metadata Tokens

    • Objective: To enrich the model's context by providing information beyond gene expression, such as cell-level metadata or experimental batch.
    • Materials: Gene expression matrix accompanied by relevant metadata (e.g., donor ID, sequencing batch, tissue of origin).
    • Method Details:
      • Special Tokens: Prepend or append special tokens to the gene sequence. A common example is a [CELL] token, whose final embedding is used as a summary representation for the entire cell [1] [5] [8].
      • Batch Tokenization: Incorporate a token representing the batch or study of origin to help the model learn and potentially correct for technical artifacts [5].
      • Multi-modal Tokenization: For integrated models, include modality-specific tokens (e.g., [ATAC] or [PROTEIN]) to process multi-omics data within a single framework [1].
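
The ranking and binning protocols above can be sketched in a few lines of NumPy; `rank_tokenize` and `bin_tokenize` are hypothetical helper names, and the bin-edge handling is one plausible reading of the protocols rather than any model's exact scheme:

```python
import numpy as np

def rank_tokenize(expr, k=1200):
    """Protocol 2.1.1 sketch: order genes by descending expression and keep
    the indices of the top-k expressed genes as the token sequence."""
    order = np.argsort(expr)[::-1]          # gene indices, highest expression first
    expressed = order[expr[order] > 0]      # drop unexpressed genes
    return expressed[:k]

def bin_tokenize(expr, n_bins=10):
    """Protocol 2.1.2 sketch: map each expressed gene's log1p value to a
    per-cell quantile bin in 1..n_bins, yielding (gene_index, bin) pairs
    analogous to tokens like "CD4_Q5"."""
    expressed = np.flatnonzero(expr > 0)
    vals = np.log1p(expr[expressed])
    # interior quantile edges; searchsorted assigns each value its bin
    edges = np.quantile(vals, np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.searchsorted(edges, vals, side="right") + 1
    return list(zip(expressed.tolist(), bins.tolist()))

rng = np.random.default_rng(0)
cell = rng.poisson(0.3, size=20_000).astype(float)  # sparse toy count vector

tokens = rank_tokenize(cell)
print(tokens.shape)  # (1200,): ranked gene indices

small = np.array([0.0, 5.0, 1.0, 0.0, 12.0, 2.0, 2.0, 0.0])
print(bin_tokenize(small, n_bins=4))
```

In a full pipeline these token indices would then be looked up in learned embedding tables and prepended with special tokens such as [CELL] before entering the transformer.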

The diagram below illustrates the logical workflow for processing raw single-cell data into a tokenized sequence ready for transformer input.

[Figure: Tokenization workflow. Raw scRNA-seq data (cell-by-gene matrix) is normalized (e.g., log1p), passed through a tokenization strategy (gene ranking or value binning), augmented with special tokens such as [CELL] and [BATCH], and emitted as the tokenized input sequence.]

Comparative Analysis of Tokenization Approaches

Table 1: Comparison of primary tokenization strategies for single-cell gene expression data.

| Tokenization Strategy | Key Principle | Advantages | Limitations | Representative Models |
| --- | --- | --- | --- | --- |
| Gene Expression Ranking | Orders genes by expression level to create a sequence. | Provides a deterministic input order; simple to implement. | The arbitrary sequence may not reflect biological gene-gene relationships. | Geneformer [1] [4] |
| Expression Value Binning | Discretizes continuous expression into quantile bins. | Encodes quantitative expression levels directly into tokens. | May lose fine-grained, continuous information. | ETHOS [7] |
| Identity-Only | Uses gene identities with normalized counts, minimal structuring. | Simple; reports suggest complex ranking may offer no clear advantage [8]. | May require more data or model capacity to learn expression patterns. | scGPT (option) [1] [8] |

Transformer Architecture and Zero-Shot Workflow

The transformer architecture processes the tokenized sequences to build a contextualized understanding of cellular state. The model's pretraining objective is designed to instill this general knowledge, which is then directly accessed in a zero-shot manner.

Model Architecture and Pretraining Protocol

  • Protocol 3.1.1: Model Pretraining with Masked Language Modeling
    • Objective: To train a transformer model on a large, unlabeled corpus of single-cell data so it learns fundamental biological principles and gene-gene relationships.
    • Materials: A large-scale collection of single-cell datasets (e.g., from CZ CELLxGENE, Human Cell Atlas), high-performance computing resources with multiple GPUs.
    • Method Details:
      • Architecture Selection:
        • Encoder-based (BERT-like): Uses bidirectional attention, meaning all tokens in a sequence attend to all other tokens simultaneously. This is effective for tasks that require a comprehensive understanding of the entire cell state, such as cell type classification [1] [4]. Models: scBERT.
        • Decoder-based (GPT-like): Uses causal (unidirectional) attention, where a token can only attend to previous tokens in the sequence. This is often used for generative tasks, such as predicting masked genes or simulating future cell states [1] [5]. Models: scGPT.
      • Pretraining Task - Masked Language Modeling (MLM): Randomly mask a portion (e.g., 15-20%) of the gene tokens in the input sequence. The model is then trained to predict the identity (and sometimes the expression value) of the masked genes based on the context provided by the unmasked genes [1] [5]. This forces the model to learn the complex, co-varying relationships between genes.
      • Output - Cell Embedding: The activation state of the special [CELL] token (or the average of all output token embeddings) at the final layer is used as a fixed-dimensional vector representation (embedding) that summarizes the entire cell's state [1] [4].
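
The masking step of the MLM objective can be sketched as follows; `MASK_ID` and the 15% fraction are illustrative choices, and a real implementation would feed `corrupted` through the model and compute a loss against `targets`:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, mask_frac, MASK_ID = 1200, 0.15, -1

tokens = rng.integers(0, 20_000, size=seq_len)   # toy gene-token sequence

# Randomly choose ~15% of positions to mask, as in the MLM objective.
n_mask = int(seq_len * mask_frac)
mask_pos = rng.choice(seq_len, size=n_mask, replace=False)

corrupted = tokens.copy()
corrupted[mask_pos] = MASK_ID    # model sees these positions as [MASK]
targets = tokens[mask_pos]       # model is trained to recover these

print(n_mask, np.sum(corrupted == MASK_ID))
```

Some models predict the masked gene identity, others the masked expression value; both variants use this same corrupt-and-recover structure.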

Zero-Shot Inference and Evaluation Protocol

  • Protocol 3.2.1: Performing Zero-Shot Cell Type Clustering
    • Objective: To use the pretrained model's cell embeddings to cluster cells into types without any further training or fine-tuning on the target dataset.
    • Materials: A pretrained scFM (e.g., scGPT, Geneformer), a new target scRNA-seq dataset (processed and tokenized), clustering algorithms (e.g., Leiden, K-means).
    • Method Details:
      • Inference: Pass the tokenized target dataset through the pretrained model.
      • Embedding Extraction: For each cell, extract the cell embedding vector from the model's output.
      • Dimensionality Reduction: Apply techniques like UMAP or t-SNE to the matrix of cell embeddings for visualization.
      • Clustering: Apply a clustering algorithm to the cell embeddings to identify groups of transcriptionally similar cells.
      • Evaluation: Compare the model's clusters to known cell type labels using metrics like Average Silhouette Width (ASW) or Adjusted Rand Index (ARI). Compare performance against established baselines like highly variable genes (HVG) coupled with scVI or Harmony [4].
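
For the baseline comparison in the evaluation step, the HVG-plus-PCA reference embedding can be sketched as below; the function name, gene counts, and the variance-based HVG criterion are illustrative simplifications (Scanpy's HVG selection uses dispersion-normalized statistics):

```python
import numpy as np
from sklearn.decomposition import PCA

def hvg_pca_baseline(X, n_hvg=2000, n_pcs=50):
    """HVG + PCA baseline sketch: select the most variable genes from a
    log-normalized cell-by-gene matrix, then reduce with PCA. This is the
    simple baseline that zero-shot scFM embeddings are benchmarked against."""
    variances = X.var(axis=0)
    hvg_idx = np.argsort(variances)[::-1][:n_hvg]   # top-variance genes
    n_pcs = min(n_pcs, n_hvg, X.shape[0])
    return PCA(n_components=n_pcs).fit_transform(X[:, hvg_idx])

rng = np.random.default_rng(0)
X = rng.gamma(2.0, 1.0, size=(300, 5000))   # toy log-normalized matrix
emb = hvg_pca_baseline(X, n_hvg=1000, n_pcs=30)
print(emb.shape)  # (300, 30)
```

The resulting baseline embedding is clustered and scored with the same ARI/ASW metrics as the foundation-model embedding, making the comparison direct.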

The following diagram outlines the complete workflow from pretraining to zero-shot evaluation.

[Figure: Pretraining-to-zero-shot workflow. A large-scale scRNA-seq pretraining corpus is tokenized and used to train a transformer (encoder or decoder) via self-supervised masked language modeling, yielding a pretrained foundation model. A new, unlabeled target dataset is then tokenized and passed through this model for zero-shot inference, producing cell embeddings for downstream analysis (clustering, visualization).]

Performance of Zero-Shot Models

Recent rigorous evaluations of scFMs in zero-shot settings have revealed critical insights into their current capabilities and limitations.

Table 2: Zero-shot performance of single-cell foundation models on key tasks compared to baseline methods. Performance is summarized from Kedzierska et al. [4].

| Model / Baseline | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Key Findings and Limitations |
| --- | --- | --- | --- |
| HVG + PCA | Best | Best | A simple baseline of highly variable genes with PCA surprisingly outperformed foundation models on multiple datasets and metrics [4]. |
| scVI | Better | Better | A specialized deep learning model for scRNA-seq consistently showed strong performance in both clustering and batch integration [4]. |
| Harmony | Better | Better | A robust batch integration method performed well, particularly on technical batch effects [4]. |
| scGPT | Variable | Intermediate | Shows inconsistent performance; pretraining helps but does not consistently surpass simpler methods. Struggles with complex biological batch effects [4]. |
| Geneformer | Worse | Worse | Underperforms relative to all other methods and baselines in zero-shot evaluation; embeddings often dominated by batch effects [4]. |

The Scientist's Toolkit

Research Reagent Solutions

Table 3: Essential computational tools and resources for working with single-cell foundation models.

| Item | Function / Description | Example / Source |
| --- | --- | --- |
| Pretraining Data | Large, aggregated single-cell datasets used to train foundation models. Provides the "corpus" of cellular states. | CZ CELLxGENE [1] [4], Human Cell Atlas [1], PanglaoDB [1] |
| Model Architectures | The specific implementation of the transformer model (encoder or decoder). | scGPT (decoder) [1] [5], scBERT (encoder) [1] [4], Geneformer (encoder) [4] |
| Evaluation Benchmarks | Standardized datasets and metrics for fairly comparing model performance, especially zero-shot. | Pancreas dataset [4], Tabula Sapiens [4], Immune cell datasets [4] |
| Baseline Methods | Established, often simpler, computational methods that serve as a critical point of comparison. | Highly Variable Genes (HVG) [4], scVI [4], Harmony [4] |
| Visualization Tools | Software libraries for visualizing high-dimensional cell embeddings and model attention. | UMAP, t-SNE, Scanpy [9] |

The core architecture of transformers, fed by thoughtfully tokenized gene expression data, provides a powerful framework for building foundation models in single-cell biology. The protocols outlined here for tokenization, model pretraining, and zero-shot evaluation provide a roadmap for researchers to implement and critically assess these technologies. However, current evidence indicates that the promise of robust, out-of-the-box zero-shot inference has not yet been fully realized, with simpler methods often outperforming large, complex foundation models on tasks like cell type clustering and batch integration [4] [5]. This underscores the importance of rigorous zero-shot evaluation as a mandatory step in the development and application of scFMs. Future progress will likely depend on more biologically informed tokenization strategies [10], novel pretraining objectives that better capture hierarchical cellular relationships, and a continued focus on model interpretability and reliability for zero-shot tasks in exploratory research and drug development.

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized biology by enabling the profiling of gene expression at the resolution of individual cells, uncovering cellular heterogeneity with unprecedented precision [11] [12]. However, the analysis of scRNA-seq data is fraught with challenges stemming from its high dimensionality, technical noise, and sparsity [12] [13]. Foundation models pretrained on millions of single-cell transcriptomes have emerged as a powerful strategy to overcome these hurdles. These models aim to learn universal patterns of gene expression and cell states from large-scale data, creating a foundational knowledge that can be rapidly specialized for diverse downstream tasks with minimal additional training [4].

The significance of these models is particularly pronounced in the context of zero-shot learning, where the model's internal representation of input data is used for analysis without any task-specific fine-tuning [4]. This capability is critical for exploratory biological discovery, where predefined labels are unavailable and fine-tuning is therefore infeasible. This application note details the pretraining process, data requirements, model architectures, and evaluation protocols for building and validating single-cell foundation models, with a specific focus on their zero-shot capabilities.

Data Acquisition and Curation

The efficacy of a foundation model is fundamentally dependent on the scale and quality of its pretraining data. Assembling a massive, diverse, and well-curated corpus of single-cell data is the first and most critical step.

Large-scale single-cell datasets are aggregated from various public repositories, including:

  • National Center for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO)
  • European Nucleotide Archive (ENA)
  • Genome Sequence Archive (GSA)
  • CellxGENE database [14] [15]

These datasets are stored in multiple formats (e.g., FASTQ, h5ad, Seurat objects), requiring standardized processing pipelines for consolidation [15].

Data Processing and Standardization

A uniform workflow is essential to convert raw data into a clean, analysis-ready gene expression matrix. Key steps include:

  • Quality Control: Filtering out low-quality cells and genes based on metrics like mitochondrial read percentage and gene counts.
  • Gene Name Standardization: Standardizing gene identifiers according to the HUGO Gene Nomenclature Committee (HGNC) guidelines.
  • Format Conversion: Converting all data into a unified sparse matrix format for efficient storage and computation [15].
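To make the three processing steps above concrete, here is a minimal sketch on a toy dense count matrix. The thresholds (`min_genes`, `max_mito_frac`) and the tiny alias map are hypothetical placeholders for illustration, not values from any published pipeline; real workflows would use Scanpy's preprocessing functions and full HGNC mappings.

```python
# Toy sketch of the QC and standardization steps (hypothetical thresholds).
# Rows = cells, columns = genes; a small dense matrix is used for clarity,
# whereas real pipelines operate on sparse matrices.

counts = [
    [5, 0, 3, 2],   # cell passing QC
    [0, 0, 1, 0],   # low-complexity cell (too few genes detected)
    [8, 1, 0, 40],  # cell dominated by mitochondrial reads
]
genes = ["CD8A", "ACTB", "TP53", "MT-CO1"]

def qc_filter(counts, genes, min_genes=2, max_mito_frac=0.5):
    """Keep cells with enough detected genes and a low mitochondrial fraction."""
    mito_idx = [i for i, g in enumerate(genes) if g.startswith("MT-")]
    kept = []
    for cell in counts:
        n_genes = sum(1 for c in cell if c > 0)
        total = sum(cell)
        mito_frac = sum(cell[i] for i in mito_idx) / total if total else 1.0
        if n_genes >= min_genes and mito_frac <= max_mito_frac:
            kept.append(cell)
    return kept

# HGNC-style symbol standardization via a (hypothetical) alias map.
ALIASES = {"ACTB1": "ACTB", "P53": "TP53"}
def standardize(symbol):
    return ALIASES.get(symbol, symbol)

filtered = qc_filter(counts, genes)
print(len(filtered))        # only the first cell survives QC
print(standardize("P53"))   # canonical HGNC symbol: TP53
```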

Table 1: Exemplary Large-Scale Pretraining Datasets for Single-Cell Foundation Models

| Model | Pretraining Dataset Scale | Data Composition | Primary Source |
|---|---|---|---|
| CellFM [15] | ~100 million human cells | 46.3M normal cells, 7.1M viral infection cells, 3.5M lung cancer cells; diverse cell types (T cells, neurons, etc.) | Public repositories (GEO, ENA, GSA) |
| scPRINT [14] | >50 million cells | Multiple species, diseases, and ethnicities | CellxGENE database |
| scGPT [4] | >33 million non-cancerous human cells | Includes blood, bone marrow, and kidney cells | CELLxGENE initiative |
| Geneformer [4] | 30 million single-cell transcriptomes | Diverse human tissues | Not specified |

Model Architectures and Pretraining Strategies

Single-cell foundation models adapt architectures from natural language processing, treating genes as words and a cell's expression profile as a sentence. The choice of architecture and how gene expression is "tokenized" are pivotal design decisions.

Tokenization Strategies for Gene Expression

A key challenge is converting continuous gene expression values into discrete tokens or embeddings suitable for model input. The field has converged on three primary strategies:

Table 2: Comparison of Gene Expression Tokenization Strategies

| Tokenization Strategy | Mechanism | Representative Models | Advantages | Limitations |
|---|---|---|---|---|
| Rank-based [12] | Genes are ranked by expression level within each cell; the sequence of gene names forms the model input. | Geneformer, GeneMamba, tGPT | Robust to batch effects; captures relative expression. | Discards absolute expression magnitude. |
| Value Categorization [15] | Gene expression values are binned into discrete "buckets," transforming the task into classification. | scBERT, scGPT | Preserves some absolute expression information. | May lose fine-grained resolution; sensitive to binning parameters. |
| Value Projection [12] [15] | Continuous expression values are projected into an embedding space via a linear transformation or MLP. | scFoundation, CellFM, scPRINT | Preserves full data resolution; no information loss from binning. | Diverges from traditional NLP tokenization. |
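The three strategies can be made concrete with a toy expression vector. The gene names, values, bin count, and the two-weight linear map below are illustrative stand-ins, not parameters from any of the named models (which use learned embeddings and far larger vocabularies).

```python
# Illustrative implementations of the three tokenization strategies
# applied to a toy gene -> normalized-expression mapping.

expr = {"CD8A": 0.1, "ACTB": 5.2, "TP53": 1.4, "GAPDH": 3.3}

def rank_tokens(expr):
    """Rank-based: order gene names by descending expression (Geneformer-style)."""
    return [g for g, v in sorted(expr.items(), key=lambda kv: -kv[1])]

def bin_tokens(expr, n_bins=3, max_val=6.0):
    """Value categorization: map each expression value to a discrete bin index."""
    width = max_val / n_bins
    return {g: min(int(v / width), n_bins - 1) for g, v in expr.items()}

def project_value(v, weights=(0.5, -0.2)):
    """Value projection: a 1-D -> 2-D linear map standing in for a learned MLP."""
    return [w * v for w in weights]

print(rank_tokens(expr))   # ['ACTB', 'GAPDH', 'TP53', 'CD8A']
print(bin_tokens(expr))    # CD8A and TP53 in bin 0, GAPDH in bin 1, ACTB in bin 2
print(project_value(5.2))  # continuous embedding, no information discarded
```

Note how the rank output discards the magnitudes entirely, the binned output keeps coarse magnitude, and the projection keeps the full value, mirroring the trade-offs in Table 2.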

Model Architectures

  • Transformer-based Models: Early models like scGPT and Geneformer leveraged the Transformer architecture for its powerful self-attention mechanism, which can model complex dependencies between genes [4] [15]. A significant limitation is the quadratic computational complexity of self-attention, which constrains scalability for long gene sequences [12].
  • State Space Models (SSMs): Newer models like GeneMamba adopt SSMs to address Transformer limitations. The BiMamba (Bidirectional Mamba) module efficiently captures gene context information with linear computational complexity, enabling scalable processing of over 50 million cells at a lower cost [12].
  • Hybrid and Variant Architectures: CellFM uses a modified RetNet framework (ERetNet Layers) to balance efficiency and performance, integrating a Gated Multi-head Attention unit and a LoRA (low-rank adaptation) module for efficient fine-tuning [15]. scPRINT uses a bidirectional transformer and incorporates protein embeddings from models like ESM2 as gene representations, leveraging evolutionary and structural priors [14].

Pretraining Objectives

Models are trained using self-supervised objectives that do not require manually labeled data. Common tasks include:

  • Masked Language Modeling (MLM): A random subset of genes in a cell's profile is masked, and the model is trained to recover their original expression values or ranks [4] [15].
  • Denoising and Upsampling: scPRINT employs a denoising task where the model learns to upsample transcript counts, helping to discriminate true zeros from technical dropouts [14].
  • Multi-Task Learning: scPRINT combines a denoising task, a bottleneck learning task (reconstructing expression from a compressed embedding), and a label prediction task (predicting cell type, disease, etc.) to create disentangled embeddings that represent different facets of the cell state [14].
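The masked-modeling setup behind the first objective can be sketched in a few lines: a fraction of the gene tokens is hidden and the originals are kept as reconstruction targets. The 15% mask rate and the "<mask>" token are illustrative choices, not values from any specific model.

```python
import random

# Minimal sketch of masked gene modeling: hide a fraction of the genes in a
# cell's tokenized profile and keep the originals as reconstruction targets.
# The 15% mask rate and "<mask>" token are illustrative assumptions.

def mask_profile(tokens, mask_rate=0.15, seed=0):
    rng = random.Random(seed)
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            targets[i] = tok          # the model must recover this token
            masked.append("<mask>")
        else:
            masked.append(tok)
    return masked, targets

cell = ["ACTB", "GAPDH", "TP53", "CD8A", "MKI67", "EPCAM", "PTPRC", "VIM"]
masked, targets = mask_profile(cell)
# Every masked position has its target recorded; all others are unchanged.
assert all(masked[i] == "<mask>" for i in targets)
assert all(masked[i] == cell[i] for i in range(len(cell)) if i not in targets)
```

During pretraining, the model's loss is computed only at the positions stored in `targets`, which is what forces it to infer a gene's expression from its cellular context.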

The following diagram illustrates a generalized pretraining workflow that incorporates these common elements.

[Workflow diagram: scRNA-seq data → data preprocessing (QC, normalization, HVG) → expression tokenization (rank, bin, or project) → model input (gene tokens + embeddings) → pretraining (Transformer, SSM, etc.), guided by the pretraining objectives (MLM, denoising, multi-task) → pretrained foundation model.]

Evaluating Zero-Shot Performance

Rigorous evaluation in a zero-shot setting is crucial to determine if pretraining has endowed the model with a general, transferable understanding of biology, especially for discovery-driven research where labels are unavailable [4].

Key Evaluation Tasks and Metrics

  • Cell Type Clustering: The model generates cell embeddings without fine-tuning, and clustering algorithms are applied. Performance is measured using metrics like Average BIO (AvgBIO) score and Average Silhouette Width (ASW), which assess the separation and cohesion of known cell types [4].
  • Batch Integration: The model's ability to correct for technical batch effects while preserving biological variation is tested. Metrics evaluate both batch mixing (e.g., batch integration scores) and biological conservation (e.g., principal component regression score) [4].
  • Gene Network Inference: For models like scPRINT, zero-shot ability to infer biologically plausible gene-gene interactions is a key benchmark, often validated against literature-curated networks or orthogonal data [14].

Experimental Protocol: Zero-Shot Cell Embedding and Clustering

Purpose: To evaluate the quality of cell representations learned during pretraining by assessing their ability to separate known cell types without any further model training [4].

Procedure:

  • Input Data: Obtain a hold-out test scRNA-seq dataset not seen during pretraining. Preprocess it according to the model's requirements (e.g., normalize, select the same highly variable genes).
  • Generate Embeddings: Pass the preprocessed expression matrix through the pretrained foundation model to extract a low-dimensional embedding vector for each cell.
  • Dimensionality Reduction (Optional): Apply UMAP or t-SNE to the embeddings for visualization in 2D.
  • Clustering: Apply a clustering algorithm (e.g., Louvain, K-means) directly to the cell embeddings.
  • Evaluation:
    • Calculate the AvgBIO score and ASW using the known ground-truth cell type labels.
    • Compare the results against baseline methods like Highly Variable Genes (HVG), scVI, and Harmony [4].

Interpretation: Strong performance indicates that the pretrained model's embeddings capture biologically meaningful structure relevant to cell identity. Underperformance may suggest limitations in the pretraining task or data [4].
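The ASW metric used in step 5 can be made concrete with a toy pure-Python implementation on 2-D embeddings; a real evaluation would use `sklearn.metrics.silhouette_score` on the full embedding matrix. The points and labels below are made up for illustration.

```python
# Toy silhouette-width computation on 2-D embeddings, to make the ASW metric
# concrete; real evaluations use sklearn.metrics.silhouette_score.

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def silhouette(points, labels):
    scores = []
    for i, p in enumerate(points):
        same = [dist(p, q) for j, q in enumerate(points)
                if j != i and labels[j] == labels[i]]
        other_means = []
        for lab in set(labels) - {labels[i]}:
            ds = [dist(p, q) for j, q in enumerate(points) if labels[j] == lab]
            other_means.append(sum(ds) / len(ds))
        a = sum(same) / len(same)      # mean intra-cluster distance
        b = min(other_means)           # mean distance to nearest other cluster
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)   # average silhouette width (ASW)

emb = [(0, 0), (0, 1), (5, 5), (5, 6)]   # two well-separated toy cell types
labels = ["T cell", "T cell", "B cell", "B cell"]
asw = silhouette(emb, labels)
assert asw > 0.8   # tight, well-separated clusters score near 1
```

An embedding that mixes cell types would drive `a` up relative to `b`, pushing the score toward 0 or below, which is exactly the failure mode the zero-shot benchmarks measure.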

Critical Findings from Zero-Shot Evaluations

Recent studies reveal that the zero-shot performance of foundation models can be inconsistent:

  • Models like Geneformer and scGPT can underperform simpler baseline methods (e.g., HVG selection, scVI, Harmony) on tasks like cell type clustering and batch integration [4] [5].
  • Pretraining generally confers an advantage over randomly initialized models, but performance does not always monotonically improve with larger and more diverse datasets [4].
  • Surprisingly, models do not always perform best on datasets that were included in their pretraining corpus, indicating an unclear relationship between the pretraining objective and specific downstream tasks [4].

The following workflow outlines the process for conducting a zero-shot evaluation, highlighting the comparison to established baselines.

[Workflow diagram: a test dataset with ground-truth labels is fed both to the pretrained foundation model (producing cell embeddings) and to the baseline methods (HVG, scVI, Harmony); both sets of results pass through clustering and evaluation (AvgBIO, ASW) to yield a performance report.]

The Scientist's Toolkit: Research Reagent Solutions

This table details key computational tools and resources essential for working with single-cell foundation models.

Table 3: Essential Research Reagents and Tools for Single-Cell Foundation Model Research

| Item Name | Function / Application | Specifications / Notes |
|---|---|---|
| cellxgene Database [4] [14] | A curated source of massive-scale, annotated single-cell data for model pretraining. | Provides standardized data from diverse tissues and species; critical for assembling large corpora. |
| scGPT [4] [15] | A transformer-based foundation model for single-cell analysis. | Uses value categorization tokenization; offers capabilities for cell type annotation and batch correction. |
| GeneMamba [12] | A state space model (SSM) for efficient large-scale single-cell data processing. | Uses BiMamba module for linear-complexity processing; employs rank-based discretization. |
| scPRINT [14] | A transformer model designed for gene network inference with multi-task pretraining. | Incorporates protein embeddings (ESM2) as gene priors; features denoising and label prediction tasks. |
| CellFM [15] | A large-scale foundation model trained on 100 million human cells. | Uses value projection and ERetNet architecture; focuses on gene function and perturbation prediction. |
| Harmony & scVI [4] | Specialized, non-foundation model tools for batch integration and dimensionality reduction. | Commonly used as strong baselines for evaluating the zero-shot batch integration performance of foundation models. |
| Scanpy [11] | A scalable Python toolkit for analyzing single-cell gene expression data. | Provides standard pipelines for data preprocessing, visualization, clustering, and trajectory inference. |

Why Zero-Shot Evaluation is Crucial for Unbiased Biological Discovery

Single-cell foundation models (scFMs) represent a revolutionary advance in computational biology, trained on millions of single-cell gene expression profiles to learn fundamental biological principles. These models are typically built on transformer architectures and pretrained using self-supervised objectives, such as masked gene expression prediction, where the model learns to predict withheld genes based on contextual information from other genes [1]. The promise of scFMs lies in their potential to capture universal patterns of cellular function and organization that can generalize to diverse downstream applications without task-specific training.

Zero-shot evaluation refers to assessing model performance on new, unseen data without any further training or fine-tuning of the model parameters. This evaluation paradigm is particularly critical for biological discovery research, where researchers frequently encounter unexplored cellular states, novel disease contexts, or uncharacterized experimental conditions [4] [3]. In these scenarios, labeled data for fine-tuning is nonexistent, and models must rely entirely on knowledge acquired during pretraining. The ability to perform effectively in zero-shot settings indicates that a model has learned transferable biological concepts rather than merely memorizing patterns from its training data.

Quantitative Evidence: Current ScFMs Struggle with Zero-Shot Tasks

Recent rigorous evaluations of popular scFMs like Geneformer and scGPT have revealed significant limitations in their zero-shot capabilities across multiple biological tasks. The performance gaps between these complex foundation models and simpler baseline methods are substantial and consistent across diverse datasets.

Performance Deficits in Cell Type Clustering

Cell type clustering represents a fundamental task in single-cell analysis where models must group cells with similar biological functions while ignoring technical variations. When evaluated on this task in zero-shot settings, foundation models consistently underperform established methods:

Table 1: Zero-shot Performance in Cell Type Clustering (AvgBIO Score)

| Method | Pancreas | PBMC (12k) | Tabula Sapiens | Immune |
|---|---|---|---|---|
| scGPT | 0.41 | 0.52 | 0.38 | 0.45 |
| Geneformer | 0.32 | 0.36 | 0.29 | 0.34 |
| scVI | 0.58 | 0.49 | 0.55 | 0.62 |
| Harmony | 0.54 | 0.47 | 0.51 | 0.58 |
| HVG | 0.61 | 0.55 | 0.59 | 0.64 |

As illustrated in Table 1, both scGPT and Geneformer are outperformed by simpler methods across most datasets, with the simple Highly Variable Genes (HVG) selection approach consistently achieving superior performance [4]. This performance gap is particularly striking given that HVG represents a basic feature selection strategy rather than a sophisticated machine learning model.

Challenges in Batch Integration

Batch integration aims to remove technical artifacts from different experiments while preserving biological signal. This task is especially challenging for zero-shot evaluation because models must generalize across diverse experimental conditions:

Table 2: Batch Integration Performance (Batch Mixing Score)

| Method | Pancreas | PBMC | Tabula Sapiens | Immune |
|---|---|---|---|---|
| scGPT | 0.48 | 0.52 | 0.61 | 0.59 |
| Geneformer | 0.31 | 0.35 | 0.28 | 0.33 |
| scVI | 0.65 | 0.61 | 0.58 | 0.52 |
| Harmony | 0.62 | 0.58 | 0.45 | 0.63 |
| HVG | 0.71 | 0.66 | 0.68 | 0.69 |

Geneformer consistently ranks at the bottom across all batch integration metrics, while scGPT shows variable performance—excelling on datasets it encountered during pretraining but struggling with novel datasets [4]. Qualitative assessment reveals that Geneformer's embedding space often fails to retain meaningful cell type information, with clustering primarily driven by batch effects rather than biological signals [4].

[Diagram: in the pretraining phase, large-scale single-cell data feeds self-supervised learning (masked gene prediction) to produce a foundation model. In deployment, a biological discovery context (no labels available) requires zero-shot use, which reveals true generalization, model limitations, and biases; an applied research context (labels available) permits fine-tuning, which may mask pretraining failures and yield overestimated capabilities.]

Diagram 1: The critical role of zero-shot evaluation in revealing true model capabilities beyond fine-tuning scenarios. Zero-shot testing exposes limitations that may be masked during fine-tuning evaluations.

Experimental Protocols for Zero-Shot Evaluation

Implementing rigorous zero-shot evaluation requires standardized protocols that assess model performance across biologically meaningful tasks without any parameter updates or task-specific adaptations.

Protocol 1: Cell Type Clustering Evaluation

Purpose: To evaluate a model's ability to generate embeddings that separate known cell types without explicit training on cell type labels.

Materials:

  • Test Dataset: A fully annotated single-cell RNA-seq dataset with validated cell type labels
  • Baseline Methods: Standard approaches including HVG selection, scVI, and Harmony
  • Evaluation Metrics: AvgBIO score, Average Silhouette Width (ASW)

Procedure:

  • Data Preprocessing: Normalize the test dataset using standard scRNA-seq pipelines without applying any batch correction
  • Embedding Generation:
    • Process the dataset through the foundation model in zero-shot mode to extract cell embeddings
    • Generate comparison embeddings using baseline methods (HVG, scVI, Harmony)
  • Clustering:
    • Apply standardized clustering algorithms (e.g., Leiden, K-means) to all embedding types
    • Use consistent parameters and random seeds across all methods
  • Evaluation:
    • Calculate clustering metrics against ground truth cell type labels
    • Compare performance across methods using statistical testing

Interpretation: Superior performance in this protocol indicates that a model's embeddings capture biologically relevant information about cell identity and function [4] [16].

Protocol 2: Batch Integration Assessment

Purpose: To assess a model's capability to remove technical batch effects while preserving biological variation.

Materials:

  • Batch-Controlled Dataset: A dataset containing the same cell types profiled across multiple batches or technologies
  • Evaluation Framework: Metrics including batch mixing scores and biological conservation metrics

Procedure:

  • Dataset Selection: Identify or create a benchmark dataset with known batch effects and biological signals
  • Embedding Generation: Extract zero-shot embeddings from the foundation model and baseline methods
  • Dimensionality Reduction: Apply UMAP or t-SNE for visualization (qualitative) and retain full embeddings for quantitative analysis
  • Quantitative Assessment:
    • Calculate batch mixing metrics (e.g., LISI scores) to assess technical effect removal
    • Compute biological conservation metrics (e.g., cell type ASW) to ensure biological signal preservation
  • Visual Inspection: Examine 2D projections to identify whether batch effects dominate the embedding space

Interpretation: Effective batch integration demonstrates that a model can generalize across technical variations, a crucial capability for real-world biological discovery [4].
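The batch mixing quantification in step 4 can be sketched with a toy neighborhood score in the spirit of LISI: the inverse Simpson's index of batch labels among each cell's k nearest neighbors. The embeddings and k below are illustrative; real evaluations would use the scib or lisi packages on full-dimensional embeddings.

```python
# Toy batch-mixing score in the spirit of LISI: for each cell, compute the
# inverse Simpson's index of batch labels among its k nearest neighbors.
# A score of 1 means neighborhoods contain a single batch (poor mixing);
# a score near the number of batches means well-mixed embeddings.

def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def batch_mixing(emb, batches, k=3):
    scores = []
    for i, p in enumerate(emb):
        nn = sorted((j for j in range(len(emb)) if j != i),
                    key=lambda j: euclid(p, emb[j]))[:k]
        counts = {}
        for j in nn:
            counts[batches[j]] = counts.get(batches[j], 0) + 1
        simpson = sum((c / k) ** 2 for c in counts.values())
        scores.append(1.0 / simpson)
    return sum(scores) / len(scores)

# Well-mixed toy data: two batches interleaved along one embedding axis.
emb = [(0, 0), (0.1, 0), (0.2, 0), (0.3, 0)]
batches = ["A", "B", "A", "B"]
mixed = batch_mixing(emb, batches)
assert mixed > 1.5   # neighborhoods contain both batches
```

Pairing this score with a biological conservation metric (such as cell type ASW on the same embeddings) catches the degenerate case where a method "mixes" batches by collapsing all biological structure.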

Implementing robust zero-shot evaluation requires specific computational tools and resources. The following table outlines key components of the evaluation toolkit:

Table 3: Essential Research Reagents and Computational Tools

| Resource | Type | Function in Evaluation | Examples/Alternatives |
|---|---|---|---|
| Benchmark Datasets | Data | Provide standardized testing grounds for model comparison | Tabula Sapiens, Pancreas datasets, PBMC datasets [4] |
| Evaluation Metrics | Algorithm | Quantify model performance across multiple dimensions | AvgBIO, ASW, batch mixing scores, PCR [4] |
| Baseline Methods | Software | Establish performance baselines for meaningful comparison | HVG selection, scVI, Harmony [4] |
| Unified Frameworks | Platform | Standardize model access and evaluation procedures | BioLLM framework [17] |
| Visualization Tools | Software | Enable qualitative assessment of embedding quality | UMAP, t-SNE plotting utilities |

The BioLLM framework deserves particular attention as it provides standardized APIs for accessing diverse scFMs, eliminating architectural and coding inconsistencies that complicate rigorous comparison [17]. This framework supports both zero-shot and fine-tuning evaluation, enabling comprehensive assessment of model capabilities.

[Diagram: a single-cell expression matrix is processed by both the zero-shot foundation model and the baseline methods (HVG, scVI, Harmony); cell metadata (batch, donor) and ground-truth labels (cell type, condition) then feed three families of metrics: biological conservation (ASW, ARI, NMI), technical integration (batch scores, LISI), and knowledge-driven measures (scGraph-OntoRWR, LCAD).]

Diagram 2: Comprehensive zero-shot evaluation workflow integrating multiple data types, evaluation methods, and performance metrics to assess foundation model capabilities.

Emerging Solutions and Future Directions

While current scFMs show limitations in zero-shot settings, research is advancing toward more robust solutions. Several promising approaches are emerging:

Improved Pretraining Strategies

Recent evidence suggests that pretraining dataset composition significantly impacts zero-shot performance. Studies evaluating scGPT variants pretrained on different tissue-specific datasets (kidney, blood, and general human cells) found that performance improvements plateau despite increased dataset diversity [4]. This indicates that simply scaling up data may be insufficient, and more sophisticated pretraining objectives are needed.

Efficient Adaptation Methods

Novel fine-tuning approaches that preserve pretrained knowledge show promise for enhancing zero-shot generalization. Techniques like drug-conditional adapters that train less than 1% of original foundation model parameters enable better molecular conditioning while maintaining rich biological representations [18]. This approach has demonstrated improved zero-shot generalization to unseen cell lines while preserving core model capabilities.
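The "less than 1% of parameters" figure is easy to sanity-check with LoRA-style bookkeeping: a rank-r adapter on a d_in × d_out weight matrix adds r·(d_in + d_out) parameters on top of the frozen d_in·d_out base. The layer sizes and rank below are hypothetical, not taken from any cited model.

```python
# Back-of-the-envelope check of the "<1% of parameters" adapter claim.
# A rank-r low-rank adapter on a d_in x d_out weight adds r * (d_in + d_out)
# trainable parameters while the d_in * d_out base weight stays frozen.
# The layer sizes and rank are hypothetical illustrations.

def adapter_fraction(layers, rank):
    base = sum(d_in * d_out for d_in, d_out in layers)
    added = sum(rank * (d_in + d_out) for d_in, d_out in layers)
    return added / base

layers = [(512, 512)] * 12      # e.g., 12 square attention projections
frac = adapter_fraction(layers, rank=2)
assert frac < 0.01              # well under 1% of the base parameter count
print(f"{frac:.2%}")
```

Because the adapter fraction scales as r/d for square layers, low ranks on high-dimensional weights keep the trainable footprint tiny, which is what lets such methods condition the model without disturbing its pretrained representations.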

Biological Knowledge Integration

Incorporating biological prior knowledge through novel evaluation metrics represents another advancement. The scGraph-OntoRWR metric measures consistency between cell type relationships captured by scFMs and established biological knowledge from cell ontologies [16]. Similarly, the Lowest Common Ancestor Distance (LCAD) metric assesses the ontological proximity between misclassified cell types, providing more biologically meaningful error assessment [16].
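The LCAD idea can be sketched on a miniature cell-ontology tree: the severity of a misannotation is the number of edges from the true and predicted types up to their lowest common ancestor. The toy ontology below is hypothetical and far smaller than the Cell Ontology used by the actual metric.

```python
# Toy Lowest Common Ancestor Distance (LCAD): misannotating a cell as an
# ontologically nearby type counts as a milder error than a distant one.
# The miniature parent map below is a hypothetical stand-in for the
# Cell Ontology graph.

PARENT = {
    "CD4 T cell": "T cell", "CD8 T cell": "T cell",
    "T cell": "lymphocyte", "B cell": "lymphocyte",
    "lymphocyte": "immune cell", "monocyte": "immune cell",
}

def ancestors(node):
    """Path from a node up to the ontology root, inclusive."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcad(truth, pred):
    a, b = ancestors(truth), ancestors(pred)
    common = next(x for x in a if x in b)        # lowest common ancestor
    return a.index(common) + b.index(common)     # edges up from both sides

assert lcad("CD4 T cell", "CD8 T cell") == 2   # sibling subtypes: mild error
assert lcad("CD4 T cell", "monocyte") == 4     # distant lineages: severe error
```

Averaging this distance over all misclassified cells yields an error score that, unlike plain accuracy, distinguishes a CD4/CD8 confusion from mistaking a T cell for a monocyte.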

Zero-shot evaluation provides an essential reality check for single-cell foundation models, revealing limitations that fine-tuning-based assessments often mask. Current evidence demonstrates that even popular scFMs like Geneformer and scGPT struggle to outperform simpler methods on fundamental tasks like cell type clustering and batch integration when deployed without additional training. These findings underscore the importance of rigorous zero-shot testing as a standard practice in model development and validation.

As the field progresses, improved pretraining strategies, efficient adaptation methods, and biologically-informed evaluation metrics will likely enhance the zero-shot capabilities of future foundation models. By maintaining focus on rigorous evaluation and acknowledging current limitations, the research community can develop more robust and biologically meaningful models that truly advance discovery in single-cell biology.

Single-cell Foundation Models (scFMs) are large-scale deep learning models pretrained on vast single-cell omics datasets, designed to capture universal biological patterns that can be adapted to various downstream tasks. This overview examines the architecture, performance, and application of leading scFMs, with particular focus on their capabilities in zero-shot learning environments where models are applied to new data without further training. The evaluation reveals a critical insight: while these models show significant promise, their zero-shot performance often lags behind simpler, established methods, highlighting a substantial gap between pretraining objectives and practical biological discovery applications.

Single-cell foundation models represent a transformative approach in computational biology, leveraging self-supervised learning on massive single-cell datasets to develop a fundamental understanding of cellular biology. These models are built on the premise that by exposing an algorithm to millions of cells across diverse tissues, conditions, and species, it can learn the intrinsic "language" of cells and genes, capturing complex relationships that enable generalization to novel biological questions [1]. The emergence of scFMs parallels developments in natural language processing, where foundation models have revolutionized how machines understand and generate human language. In the biological context, individual cells are treated analogously to sentences, while genes or genomic features serve as words or tokens that collectively define cellular identity and function [1].

The significance of scFMs is particularly pronounced in zero-shot learning scenarios, which are essential for true biological discovery. In zero-shot settings, models must make predictions on new, unseen data without any further training, mimicking the exploratory nature of biological research where predefined labels are often unavailable [4]. This capability is critical for applications such as novel cell type identification, where researchers encounter unannotated data from experiments investigating previously uncharacterized biological conditions. Despite the theoretical promise, rigorous evaluation of scFMs in zero-shot contexts has revealed significant limitations, suggesting that current models may not yet fulfill their potential for transformative biological discovery without additional specialized training [4] [3].

Architectural Landscape of Key scFMs

Model Architectures and Pretraining Strategies

scFMs predominantly utilize transformer-based architectures, which employ attention mechanisms to weight the importance of different genes when making predictions about cellular states. The two primary architectural paradigms are encoder-based models (e.g., scBERT, Geneformer) and decoder-based models (e.g., scGPT), with some implementations using hybrid designs [1]. These models vary significantly in their parameter counts, pretraining datasets, and specific architectural implementations, leading to diverse performance characteristics across different biological tasks.

Table 1: Architectural Overview of Leading Single-Cell Foundation Models

| Model Name | Architecture Type | Parameters | Pretraining Dataset Size | Key Innovations |
|---|---|---|---|---|
| Geneformer | Transformer Encoder | 40 million | 30 million cells | Rank-based gene tokenization; attention regularization |
| scGPT | GPT-style Decoder | 50 million | 33 million cells | Multi-omic support; generative pretraining |
| scBERT | BERT-style Encoder | Not specified | Millions of cells | Focus on cell type annotation |
| UCE | Transformer Encoder | 650 million | 36 million cells | Protein language model embeddings for genes |
| scFoundation | Encoder-Decoder | 100 million | 50 million cells | Read-depth-aware masked gene modeling |
| GeneMamba | State Space Model | Not specified | >50 million cells | BiMamba module for long-sequence efficiency |

Input Representation and Tokenization Strategies

A fundamental challenge in adapting transformer architectures to single-cell data is the non-sequential nature of gene expression, unlike the inherent sequence in natural language. To address this, scFMs employ various tokenization strategies to convert gene expression profiles into structured model inputs:

  • Rank-based discretization: Used by Geneformer and LangCell, this approach orders genes by their expression levels within each cell, creating a deterministic sequence based on expression magnitude [1] [12].
  • Bin-based discretization: Employed by scBERT and scGPT, this method groups expression values into predefined bins, balancing resolution with computational efficiency [1] [12].
  • Value projection: Implemented in scFoundation, this technique projects continuous expression values into embedding spaces without discrete categorization [12].

These tokenization approaches are combined with specialized embeddings for gene identifiers, expression values, and positional information to create comprehensive input representations that preserve biological meaning while conforming to architectural requirements of transformer models [16].
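A minimal sketch of how the three embedding components combine into one input vector per gene token is given below. The 4-dimensional embeddings and lookup tables are toy placeholders; in real models these are learned parameters with hundreds of dimensions.

```python
# Minimal sketch of composing gene-identity, expression-value, and positional
# embeddings into one input vector per token. The 4-dim embeddings and the
# simple lookup/scaling functions are toy stand-ins for learned parameters.

DIM = 4
GENE_EMB = {"ACTB": [0.1] * DIM, "TP53": [0.2] * DIM}

def value_emb(v):
    """Stand-in for a learned expression-value embedding."""
    return [v * 0.01] * DIM

def pos_emb(i):
    """Stand-in for a learned positional embedding."""
    return [i * 0.001] * DIM

def token_input(gene, value, position):
    """Elementwise sum of the three embedding components."""
    g, v, p = GENE_EMB[gene], value_emb(value), pos_emb(position)
    return [gi + vi + pi for gi, vi, pi in zip(g, v, p)]

vec = token_input("ACTB", value=5.0, position=0)
assert len(vec) == DIM   # one fixed-width vector per gene token
```

The summed vector is what the transformer (or SSM) layers actually consume, so each token simultaneously encodes which gene it is, how strongly it is expressed, and where it sits in the tokenized sequence.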

[Diagram: a gene expression matrix can be tokenized by rank-based ordering, bin-based discretization, or continuous value projection; the resulting tokens are then combined with gene-identity, expression-value, and positional embeddings to form the model input sequence.]

Quantitative Performance Benchmarking

Zero-Shot Performance Evaluation

Rigorous evaluation of scFMs in zero-shot settings is essential for assessing their true potential in biological discovery. Recent benchmarking studies have revealed significant limitations in current models when deployed without task-specific fine-tuning. In critical tasks such as cell type clustering and batch integration, popular scFMs including Geneformer and scGPT have been consistently outperformed by simpler traditional methods [4] [3].

Table 2: Zero-Shot Performance Comparison Across Biological Tasks

| Model | Cell Type Clustering (AvgBIO Score) | Batch Integration (iLISI Score) | Perturbation Analysis | Biological Insight Capture |
|---|---|---|---|---|
| scGPT | Variable performance; outperforms baselines on PBMC dataset only | Moderate; better on complex biological batches | Limited data | Shows promise in gene network inference |
| Geneformer | Consistently outperformed by simpler methods | Poor; often increases batch effects | Limited data | Demonstrates some gene relationship capture |
| scVI | Strong performance across multiple datasets | Excellent on technical batches | Strong performance | Established reliable baseline |
| Harmony | Competitive cell type separation | Excellent batch mixing | Not specialized for perturbations | Not designed for deep biological insights |
| HVG Selection | Surprisingly effective; often outperforms scFMs | Best overall batch integration scores | Simple but effective | Limited to variance-based features |

In cell type clustering tasks, both Geneformer and scGPT underperformed compared to established methods like Harmony and scVI, as measured by the Average BIO (AvgBIO) score across multiple datasets [4]. Notably, the simple approach of selecting Highly Variable Genes (HVG) frequently outperformed both foundation models, raising questions about the effectiveness of their pretraining paradigms [4]. For batch integration, a crucial task for combining datasets from different experimental sources, Geneformer particularly struggled, with its embeddings often showing stronger batch effects than the original input data [4].

Performance Across Experimentally Relevant Tasks

Beyond standard benchmarks, scFMs have been evaluated on biologically and clinically relevant tasks including cancer cell identification, drug sensitivity prediction, and cross-tissue analysis. These evaluations reveal a nuanced landscape where no single model consistently outperforms others across all tasks [16]. The performance varies significantly based on factors such as dataset size, tissue type, and specific biological questions, emphasizing the importance of task-specific model selection.

Specialized evaluation metrics like scGraph-OntoRWR (which measures consistency between model-derived cell relationships and established biological knowledge) and Lowest Common Ancestor Distance (which quantifies the severity of cell type misannotation errors) provide deeper insights into the biological relevance of scFM embeddings [16]. These knowledge-based evaluation approaches demonstrate that pretrained scFM embeddings do capture meaningful biological information about gene and cell relationships, even when their performance on specific tasks may lag behind simpler methods [16].

Experimental Protocols for scFM Evaluation

Standardized Zero-Shot Evaluation Workflow

To ensure reproducible assessment of scFM performance, researchers should follow a standardized protocol for zero-shot evaluation. The following workflow outlines key steps for benchmarking models on novel datasets:

Protocol 1: Zero-Shot Cell Type Clustering

  • Data Preparation: Obtain a holdout dataset not included in the model's pretraining corpus. Standard quality control should be applied, including filtering low-quality cells and genes, without batch correction.
  • Embedding Generation: Process the dataset through the target scFM without any fine-tuning to extract cell embeddings.
  • Dimensionality Reduction: Apply standard techniques (UMAP, t-SNE) to the embeddings for visualization.
  • Clustering: Perform Leiden or Louvain clustering on the embeddings without using biological labels.
  • Evaluation: Calculate clustering metrics including Normalized Mutual Information (NMI), Adjusted Rand Index (ARI), and silhouette scores against known cell type labels.
  • Comparison: Benchmark against established baselines including HVG selection, scVI, and Harmony embeddings using identical evaluation metrics.
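The evaluation step of this protocol can be sketched in a few lines. The example below is illustrative rather than any benchmark's exact code: k-means stands in for Leiden clustering (which requires a kNN graph and the leidenalg package), and the embeddings are a synthetic stand-in for zero-shot scFM output.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

rng = np.random.default_rng(0)
# Toy stand-in for zero-shot scFM cell embeddings: 3 cell types, 50-D
emb = np.vstack([rng.normal(loc=i * 3.0, size=(100, 50)) for i in range(3)])
true_labels = np.repeat([0, 1, 2], 100)

# Leiden/Louvain operate on a kNN graph; k-means stands in for this sketch
pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)

nmi = normalized_mutual_info_score(true_labels, pred)
ari = adjusted_rand_score(true_labels, pred)
sil = silhouette_score(emb, true_labels)
print(f"NMI={nmi:.3f}  ARI={ari:.3f}  silhouette={sil:.3f}")
```

The same three metric calls can be applied unchanged to embeddings from any of the baseline methods, which keeps the comparison step consistent across models.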

Protocol 2: Batch Integration Assessment

  • Dataset Selection: Choose datasets with known batch effects from different experimental technologies or laboratories.
  • Embedding Extraction: Generate zero-shot embeddings using the target scFM.
  • Batch Mixing Quantification: Calculate batch integration metrics including iLISI (Integration Local Inverse Simpson's Index) and silhouette batch scores.
  • Biological Conservation Evaluation: Assess whether batch correction preserves biological variation using metrics such as bio-conservation scores and label classification accuracy.
  • Comparative Analysis: Compare against standard batch correction methods to determine relative performance.
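For readers implementing the batch-mixing step directly, the following is a simplified sketch of iLISI using unweighted kNN counts (the published metric weights neighbors by distance, and tools like scib-metrics implement it in full). The score ranges from 1 (no mixing) to the number of batches (perfect mixing).

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def ilisi(embeddings, batches, k=30):
    """Simplified iLISI: mean inverse Simpson's index of batch labels
    among each cell's k nearest neighbors. 1 = no mixing; n_batches =
    perfect mixing. Unweighted sketch, not the distance-weighted metric."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    idx = idx[:, 1:]  # drop each cell's self-neighbor
    batches = np.asarray(batches)
    scores = []
    for neigh in idx:
        _, counts = np.unique(batches[neigh], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
mixed = rng.normal(size=(200, 10))         # two batches, fully overlapping
batch = np.repeat([0, 1], 100)
separated = mixed + batch[:, None] * 10.0  # same cells with a strong batch shift
print(ilisi(mixed, batch), ilisi(separated, batch))  # well-mixed ≈ 2, separated = 1
```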

[Workflow diagram: Data Preparation (holdout dataset, QC filtering) → Embedding Generation (zero-shot model inference) → Dimensionality Reduction (UMAP, t-SNE) → Clustering (Leiden, Louvain) → Evaluation (NMI, ARI, silhouette) → Comparative Analysis (vs. HVG, scVI, Harmony)]

Interpretation and Biological Validation

Beyond quantitative metrics, biological validation is crucial for establishing the practical utility of scFMs. Researchers should incorporate:

  • Differential Expression Analysis: Verify that cluster markers derived from scFM embeddings correspond to biologically meaningful gene signatures.
  • Cell Type Annotation Accuracy: Assess whether model embeddings enable correct identification of known cell types, particularly for rare populations.
  • Functional Enrichment: Perform gene ontology enrichment on genes most influential in the model's attention patterns to identify biologically relevant pathways.
  • Stability Analysis: Evaluate consistency of results across different random seeds and dataset subsamples to ensure robustness.
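The stability check in the last point can be quantified as the mean pairwise ARI between clusterings run with different random seeds. This sketch again uses k-means for brevity; the same pattern applies to Leiden runs with different seeds.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

def clustering_stability(embeddings, n_clusters, seeds=(0, 1, 2, 3)):
    """Mean pairwise ARI between clusterings obtained with different
    random seeds; 1.0 indicates perfectly reproducible clusters."""
    labelings = [KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=s).fit_predict(embeddings)
                 for s in seeds]
    aris = [adjusted_rand_score(a, b) for a, b in combinations(labelings, 2)]
    return float(np.mean(aris))

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=i * 4.0, size=(80, 20)) for i in range(4)])
print(clustering_stability(emb, n_clusters=4))
```

A low stability score is a warning sign that clusters reported from scFM embeddings may be artifacts of a particular run rather than robust biological structure.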

Essential Research Toolkit

Implementing and evaluating scFMs requires specialized computational resources and software tools. The following toolkit outlines essential components for researchers working with single-cell foundation models:

Table 3: Essential Research Toolkit for scFM Implementation

| Tool/Resource | Function | Application in scFM Research |
| --- | --- | --- |
| CELLxGENE Census | Unified data resource | Access to standardized single-cell data for training and evaluation |
| BioLLM Framework | Unified model interface | Standardized APIs for multiple scFMs; benchmarking support |
| scib-metrics | Standardized benchmarking metrics | Computation of bio-conservation and batch correction metrics |
| Scanpy | Single-cell analysis | Preprocessing, visualization, and integration with model embeddings |
| Hugging Face Transformers | Model architecture library | Adaptation of transformer architectures for biological data |
| scGPT Implementation | Pretrained models and training code | Access to scGPT model weights and fine-tuning pipelines |
| Geneformer Model | Pretrained rank-based model | Geneformer embeddings and transfer learning capabilities |

The CELLxGENE platform provides access to over 100 million curated single cells, serving as a vital resource for both pretraining and evaluation [1] [19]. For standardized model comparison, the BioLLM framework offers unified APIs that eliminate architectural and coding inconsistencies, enabling direct performance comparisons across different scFMs [17]. Established single-cell analysis toolkits like Scanpy complement these specialized resources by providing robust preprocessing and visualization capabilities that integrate with scFM-derived embeddings.

The development of single-cell foundation models represents a promising frontier in computational biology, but significant challenges remain. Current evaluations indicate that these models have not yet consistently realized their potential for zero-shot biological discovery, with simpler methods often outperforming complex foundation models on critical tasks [4] [20]. This performance gap highlights fundamental questions about current pretraining approaches and whether masked language modeling objectives effectively capture the biological knowledge needed for generalized reasoning.

Future progress in scFMs will likely require innovations in several key areas. Architecturally, emerging approaches like GeneMamba's state space models offer promising alternatives to transformer-based architectures, potentially addressing computational efficiency limitations while maintaining performance [12]. Pretraining strategies may need fundamental rethinking to better align objectives with biological reasoning, potentially incorporating more explicit biological knowledge through gene networks, pathways, or ontological relationships. Evaluation standards must continue to evolve beyond technical metrics to assess true biological insight, possibly through carefully designed challenges that test models on novel biological predictions with experimental validation.

For researchers applying these tools, current evidence suggests a pragmatic approach: scFMs show considerable promise as components in biological discovery pipelines, but their limitations in zero-shot settings necessitate careful validation and comparison with established methods. As the field matures, the development of more robust evaluation frameworks and specialized architectures may eventually fulfill the promise of foundation models to transform our understanding of cellular biology.

Practical Applications and Methodological Advances in Zero-Shot scFM Deployment

Single-cell foundation models (scFMs) are machine learning models pretrained on massive-scale single-cell datasets, with the goal of capturing universal biological patterns. A critical assessment of these models involves zero-shot evaluation, where the model's internal representation of input data—an "embedding"—is used for downstream analysis with no further task-specific training. This is particularly vital in exploratory biological contexts where predefined labels are unavailable, making fine-tuning infeasible. The core promise of scFMs is their ability to generate robust cell embeddings that project noisy gene expression measurements into a more biologically relevant latent space, ready for immediate use in key atlas construction tasks without additional adaptation.

Recent rigorous evaluations, however, suggest that this promise remains only partially fulfilled. Kedzierska et al. (2025) report that in zero-shot settings, proposed foundation models like Geneformer and scGPT can, in some cases, be outperformed by simpler methods on standard tasks including cell type clustering and batch integration. These findings underscore the importance of robust zero-shot benchmarking as an essential step in the development and deployment of foundation models for single-cell biology, highlighting the current gap between model scale and reliable biological insight in discovery settings.

Core Zero-Shot Tasks in Atlas Construction

Task 1: Cell Type Clustering

Objective: To evaluate whether a foundation model's cell embeddings can effectively separate known cell types in an unseen dataset without any model fine-tuning. This tests the model's fundamental ability to encode biologically meaningful cell states.

Quantitative Performance Benchmark:

Performance is typically measured by the Average BIO (AvgBIO) score and Average Silhouette Width (ASW), which quantify the separation between known cell types in the embedding space. The following table summarizes the zero-shot performance of selected models against established baselines across multiple datasets, as reported by Kedzierska et al.:

Table 1: Zero-shot cell type clustering performance (AvgBIO score) across datasets

| Model / Method | Pancreas Dataset | PBMC (12k) Dataset | Tabula Sapiens Dataset | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.741 | 0.785 | 0.792 | 0.801 |
| Harmony | 0.752 | 0.791 | 0.805 | 0.812 |
| scVI | 0.768 | 0.779 | 0.798 | 0.809 |
| scGPT | 0.702 | 0.802 | 0.754 | 0.721 |
| Geneformer | 0.635 | 0.691 | 0.668 | 0.645 |

Source: Adapted from Kedzierska et al. [4]

Key Findings: The evaluation reveals that selecting Highly Variable Genes (HVG) often outperforms both scGPT and Geneformer across most metrics. While scGPT shows competitive performance on the PBMC dataset, its performance is inconsistent across other tissues. Geneformer consistently underperforms relative to all baselines. This suggests that the masked language model pretraining framework may not inherently produce cell embeddings that are optimal for cell type separation without task-specific fine-tuning.

Task 2: Batch Integration

Objective: To assess a model's capacity to eliminate non-biological technical variations (batch effects) across multiple data sources while preserving meaningful biological differences. Success in this task is crucial for building integrated atlases from multiple studies.

Quantitative Performance Benchmark:

Batch integration quality is evaluated using metrics that balance batch mixing (e.g., LISI score) and biological conservation (e.g., PCR score). The following table provides a comparative analysis:

Table 2: Batch integration performance across methods

| Model / Method | Batch Mixing Score (LISI, higher is better) | Biological Conservation (PCR, lower is better) | Overcorrection Sensitivity |
| --- | --- | --- | --- |
| HVG | 0.892 | 0.124 | Low |
| Harmony | 0.865 | 0.135 | Medium |
| scVI | 0.879 | 0.141 | Medium |
| scGPT | 0.831 | 0.152 | Not Reported |
| Geneformer | 0.745 | 0.218 | Not Reported |
| RBET Framework | 0.901* | 0.118* | High |

Note: *RBET values are illustrative based on its reported superior performance [21]. LISI: Local Inverse Simpson's Index; PCR: Principal Component Regression.

Key Findings: Geneformer's embeddings consistently show a higher proportion of variance explained by batch effects compared to the original data, indicating inadequate batch mixing. scGPT demonstrates variable performance, outperforming scVI and Harmony on complex datasets with combined technical and biological batch effects but underperforming on datasets with purely technical variation. The recently proposed RBET framework shows particular promise due to its sensitivity to overcorrection, a critical feature for preserving biological signal [21].

Experimental Protocols for Zero-Shot Evaluation

Protocol for Cell Type Clustering Evaluation

Required Inputs:

  • Preprocessed single-cell RNA-seq dataset (query dataset) with held-out cell type labels
  • Pretrained foundation model (e.g., scGPT, Geneformer) with frozen weights
  • Baseline methods (HVG, scVI, Harmony) for comparison

Procedure:

  • Embedding Generation: Pass the normalized count matrix of the query dataset through the foundation model to extract cell embeddings in a zero-shot manner (no gradient updates).
  • Dimensionality Reduction: Apply PCA to the embeddings, followed by UMAP for visualization (2D/3D).
  • Clustering: Perform Leiden clustering on the k-nearest neighbor graph constructed from the embeddings.
  • Evaluation: Compare cluster labels against ground truth cell type annotations using:
    • Average BIO score
    • Adjusted Rand Index (ARI)
    • Normalized Mutual Information (NMI)
  • Benchmarking: Repeat the embedding, dimensionality reduction, clustering, and evaluation steps for all baseline methods and compare scores.

Critical Controls:

  • Ensure no data leakage between pretraining and evaluation datasets
  • Use identical preprocessing pipelines for all methods
  • Apply multiple random seeds to assess stability

[Workflow diagram: Input Dataset + Pretrained Model → Generate Embeddings → Dimensionality Reduction → Cell Clustering → Performance Evaluation → Comparative Analysis]

Figure 1: Workflow for zero-shot cell type clustering evaluation
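The composite metric used in this protocol can be computed from its parts. The sketch below assumes the definition used in the scGPT-style benchmarks, where AvgBIO is the mean of NMI, ARI, and the cell-type silhouette width rescaled from [-1, 1] to [0, 1]; check the specific paper's definition before comparing numbers directly.

```python
import numpy as np
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score,
                             silhouette_score)

def avg_bio(embeddings, true_labels, pred_labels):
    """AvgBIO as commonly defined in scFM benchmarks: the mean of NMI,
    ARI, and cell-type ASW rescaled from [-1, 1] to [0, 1]."""
    nmi = normalized_mutual_info_score(true_labels, pred_labels)
    ari = adjusted_rand_score(true_labels, pred_labels)
    asw = (silhouette_score(embeddings, true_labels) + 1) / 2
    return (nmi + ari + asw) / 3

rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(loc=i * 3.0, size=(60, 30)) for i in range(3)])
labels = np.repeat([0, 1, 2], 60)
score = avg_bio(emb, labels, labels)  # perfect clustering on separable data
print(f"AvgBIO: {score:.3f}")
```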

Protocol for Batch Integration Evaluation

Required Inputs:

  • Multi-batch single-cell dataset with known technical and biological covariates
  • Pretrained foundation model
  • Reference genes with stable expression patterns across cell types

Procedure:

  • Embedding Extraction: Generate cell embeddings for the multi-batch dataset using the foundation model in zero-shot mode.
  • Visual Assessment: Create UMAP plots colored by batch and cell type to qualitatively assess integration.
  • Quantitative Metrics:
    • Calculate batch mixing scores (LISI, kBET)
    • Compute biological conservation metrics (PCR, cell type ASW)
    • Apply RBET framework using reference genes to detect overcorrection [21]
  • Differential Expression Analysis: Perform differential expression testing between batches post-integration to identify residual technical effects.
  • Downstream Validation: Assess impact on downstream tasks like trajectory inference and cell-cell communication.
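The PCR metric in step 3 can be approximated by regressing the batch covariate onto each principal component and averaging the R² values weighted by explained variance. This is a simplified version of the scIB PCR metric, shown here for intuition rather than as a drop-in replacement.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_batch(embeddings, batches, n_pcs=20):
    """Variance-weighted R^2 of batch labels regressed onto each PC.
    Higher values indicate a stronger residual batch effect. Simplified
    sketch of the scIB PCR metric."""
    n_pcs = min(n_pcs, min(embeddings.shape) - 1)
    pca = PCA(n_components=n_pcs).fit(embeddings)
    pcs = pca.transform(embeddings)
    onehot = np.eye(len(np.unique(batches)))[np.asarray(batches)]
    r2 = np.array([
        LinearRegression().fit(onehot, pcs[:, i]).score(onehot, pcs[:, i])
        for i in range(n_pcs)
    ])
    w = pca.explained_variance_ratio_
    return float(np.sum(w * r2) / np.sum(w))

rng = np.random.default_rng(0)
batch = np.repeat([0, 1], 100)
mixed = rng.normal(size=(200, 10))           # no batch structure
shifted = mixed + batch[:, None] * 10.0      # strong batch effect
print(pcr_batch(mixed, batch), pcr_batch(shifted, batch))
```

Applied to scFM embeddings, a PCR score that exceeds the score of the raw input data is the signature of the Geneformer failure mode described above, where embeddings amplify rather than remove batch effects.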

Advanced Consideration - Disentanglement Models: For methods like scShift and CODAL that explicitly disentangle biological and technical variations [22] [23]:

  • The batch-dependent variation (biological embedding) captures disease states and perturbations
  • The batch-independent variation (unperturbed embedding) represents core cell type information
  • Evaluate the identifiability of both components on held-out datasets

[Workflow diagram: Multi-Batch Data + Foundation Model → Embedding Generation → Visual Assessment and Metric Calculation → Overcorrection Check (using Reference Genes) → Downstream Validation]

Figure 2: Workflow for zero-shot batch integration evaluation

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key computational tools and resources for zero-shot evaluation

| Tool/Resource | Type | Primary Function | Application in Zero-Shot Tasks |
| --- | --- | --- | --- |
| CELLxGENE Census | Data Resource | Curated single-cell data repository | Source of standardized evaluation datasets; enables cross-study comparisons |
| HVG Selection | Computational Method | Feature selection based on variance | Simple yet powerful baseline for cell type clustering and batch correction |
| RBET Framework | Evaluation Metric | Reference-informed batch effect testing | Detects overcorrection with sensitivity to biological variation preservation [21] |
| scIB Metrics | Evaluation Suite | Comprehensive integration benchmarking | Standardized metrics for batch mixing and bio-conservation (ASW, ARI, NMI) |
| scShift | Disentanglement Model | Separates batch and biological variations | Enables zero-shot biological state representation without annotations [22] |
| CODAL | Integration Model | Mutual information-based disentanglement | Addresses batch-confounded cell states through variational inference [23] |
| CellWhisperer | Multimodal Model | Joint embedding of transcriptomes and text | Facilitates zero-shot cell annotation through natural language queries [24] |

Emerging Capabilities and Future Directions

The field of zero-shot evaluation is rapidly evolving beyond basic clustering and integration. Novel approaches are demonstrating emergent capabilities that may shape future atlas construction protocols:

Biological State Disentanglement: Models like scShift show that scaling deep identifiable models enables zero-shot revelation of biological states. When trained on diverse compendiums of scRNA-seq atlases, these models can disentangle batch-dependent and independent variations, allowing direct comparison of biological states across datasets without additional training [22].

Multimodal Integration: Approaches like CellWhisperer establish multimodal embeddings connecting transcriptomes with textual annotations, enabling zero-shot prediction of cell types and biological functions through natural language queries [24]. This represents a paradigm shift from predefined classification schemas to flexible, knowledge-informed cell annotation.

Scaling Laws: Systematic evaluation of over 200 scShift models reveals emergent zero-shot capabilities beyond a transition threshold with respect to dataset diversity and size [22]. This suggests that, similar to large language models, single-cell foundation models may exhibit qualitatively improved capabilities when trained at sufficient scale.

These advances point toward a future where zero-shot evaluation will encompass not just technical performance metrics, but also the ability of models to capture meaningful biological relationships, generalize to novel cell states, and integrate multimodal information for holistic cell atlas construction.

Zero-shot learning represents a paradigm shift in machine learning, enabling models to recognize or classify data from categories they have never explicitly encountered during training [25]. Within the domain of single-cell biology, this capability is being advanced by single-cell foundation models (scFMs), which are large-scale neural networks pretrained on massive, diverse datasets of single-cell transcriptomics information [26] [2]. These models learn a foundational understanding of cellular biology by identifying universal patterns in gene expression. The emergent ability to perform tasks without additional task-specific training (zero-shot) is critical for drug discovery, as it allows researchers to predict how cells will respond to novel therapeutic compounds or under new experimental conditions where pre-existing labels are unavailable [4]. This protocol details the application of scFMs for the zero-shot prediction of cellular responses to novel drugs, a process poised to accelerate therapeutic development and personalized medicine.

Key Concepts and Foundation Models

Core Principles of Zero-Shot Prediction

In the context of single-cell data, zero-shot prediction operates by leveraging the semantic knowledge that scFMs acquire during pretraining. A model learns to map high-dimensional, sparse single-cell RNA sequencing (scRNA-seq) data into a meaningful latent space where cells with similar biological functions and states are positioned proximally [2]. When presented with a novel drug—a "class" not seen during training—the model does not rely on pre-learned drug-specific patterns. Instead, it leverages its generalized understanding of cellular biology to infer the potential relationship between the cell's baseline state and the expected phenotypic outcome, such as sensitivity or resistance [4] [25].

Landscape of Single-Cell Foundation Models

Several scFMs form the backbone of current zero-shot prediction research. The table below summarizes key models and their relevance to drug response tasks.

Table 1: Foundational Models for Single-Cell Analysis

| Model Name | Key Architectural Features | Pretraining Corpus | Demonstrated Relevance to Drug Response |
| --- | --- | --- | --- |
| scGPT [26] [17] | Transformer-based; utilizes masked gene modeling. | Over 33 million non-cancerous human cells. | Robust performance across diverse tasks including perturbation prediction; can be fine-tuned for drug response. |
| Geneformer [4] [2] | Transformer-based; uses rank-based gene tokenization. | ~30 million single-cell transcriptomes from various tissues. | Used for predicting disease-associated network dynamics and perturbation effects. |
| Nicheformer [27] | Transformer-based; integrates dissociated and spatial transcriptomics. | 110 million cells (57M dissociated, 53M spatial). | Captures spatial context, enabling predictions about the tissue microenvironment's role in drug response. |
| PharmaFormer [28] | Custom Transformer; integrates gene expression and drug SMILES structures. | GDSC database (900+ cell lines, 100+ drugs). | Specifically designed for clinical drug response prediction via transfer learning from cell lines to organoids. |

Application Notes: Protocols for Zero-Shot Prediction

This section provides a detailed, step-by-step protocol for leveraging scFMs to predict cellular responses to novel drugs in a zero-shot setting.

Protocol 1: Zero-Shot Cell Embedding for Response Stratification

Objective: To identify subpopulations of cells within a tumor that may exhibit innate sensitivity or resistance to a novel drug based solely on their pre-treatment transcriptomic state.

Materials:

  • Input Data: Pre-treatment scRNA-seq count matrix from a patient-derived sample.
  • Foundation Model: A pretrained scFM (e.g., scGPT, Geneformer) with published weights.
  • Computational Environment: High-performance computing cluster with GPU acceleration and Python environment (e.g., PyTorch, JAX).
  • Software Tools: Unified frameworks like BioLLM [17] can streamline model access and standardize APIs.

Methodology:

  • Data Preprocessing: Prepare your query scRNA-seq data. This includes standard quality control (filtering low-quality cells and genes), normalization, and log-transformation. Ensure the gene identifiers align with the vocabulary used during the scFM's pretraining.
  • Zero-Shot Embedding Generation: Pass the preprocessed single-cell data through the frozen, pretrained scFM to generate cell embeddings. This step is crucial and must be performed without any fine-tuning of the model on the new data.

  • Dimensionality Reduction and Clustering: Apply techniques like UMAP or t-SNE to the high-dimensional cell embeddings for visualization. Subsequently, use clustering algorithms (e.g., Leiden, Louvain) to identify distinct cell subpopulations.
  • Interpretation and Hypothesis Generation: Analyze the resulting clusters. Cells clustering together in the embedding space share similar biological states learned by the foundation model. Correlate these states with known markers of drug sensitivity or resistance. For instance, a cluster enriched for oxidative phosphorylation may suggest sensitivity to metabolic inhibitors, while a cluster with high expression of ABC transporters may indicate potential for multidrug resistance [29] [2].
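The marker-correlation step can be made concrete by scoring each cell against a sensitivity or resistance gene set and comparing scores across clusters. The sketch below is a simplified stand-in for scanpy's `sc.tl.score_genes` (which additionally subtracts a control-set mean); the toy matrix and cluster assignments are illustrative, while ABCB1 and ABCC1 are real ABC transporter genes of the kind mentioned above.

```python
import numpy as np

def score_gene_set(expr, gene_names, gene_set):
    """Mean expression of a marker gene set per cell. Simplified stand-in
    for scanpy's sc.tl.score_genes, which also subtracts a control-set mean."""
    idx = [gene_names.index(g) for g in gene_set if g in gene_names]
    return expr[:, idx].mean(axis=1)

genes = ["ABCB1", "ABCC1", "GAPDH", "ACTB"]
rng = np.random.default_rng(0)
expr = rng.gamma(2.0, 1.0, size=(6, 4))   # toy log-normalized matrix
clusters = np.array([0, 0, 0, 1, 1, 1])   # toy cluster labels from embeddings
expr[clusters == 1, :2] += 3.0            # cluster 1 over-expresses the markers

scores = score_gene_set(expr, genes, ["ABCB1", "ABCC1"])
for c in (0, 1):
    print(f"cluster {c}: mean resistance score = {scores[clusters == c].mean():.2f}")
```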

Protocol 2: In-silico Perturbation with Novel Drug Signatures

Objective: To simulate the transcriptional effect of a novel drug on a cell population and predict the outcome.

Materials:

  • Input Data: As in Protocol 1.
  • Foundation Model: A model like scGPT or Geneformer, known for its perturbation prediction capabilities [26].
  • Drug Signature: A representative gene expression signature for the novel drug. This can be derived from public databases (e.g., LINCS L1000) or from bulk RNA-seq experiments on model systems treated with the drug.

Methodology:

  • Define the Perturbation Vector: The novel drug's effect is represented as a "perturbation vector" in the model's latent or input space. This vector encodes the directional change in gene expression that the drug typically induces.
  • In-silico Perturbation: For each cell in your query dataset, the model computationally applies the perturbation vector to its original state, generating a "predicted post-treatment" embedding.

  • Trajectory Analysis: Compare the original and predicted post-treatment embeddings for each cell. Tools like UMAP can visualize the "trajectory" a cell is predicted to take upon treatment. Cells that show a large shift in embedding space are predicted to be strongly affected by the drug.
  • Outcome Prediction: The model can be tasked with predicting a specific outcome, such as cell viability or apoptosis. The distance a cell travels in the embedding space or the direction of its trajectory can be quantified and used to score its predicted sensitivity (large change) or resistance (minimal change) [29] [27].
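The trajectory-scoring logic can be illustrated schematically. The sketch below is a toy model, not how an scFM actually applies perturbations (real models predict the post-treatment state through the network itself); here the per-cell response is assumed, for illustration, to scale with how strongly a cell's baseline embedding aligns with the drug's axis of action.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))      # baseline cell embeddings (toy)
drug_vec = rng.normal(size=32)        # toy perturbation vector in latent space
drug_vec /= np.linalg.norm(drug_vec)

# Toy assumption: responsiveness = alignment of the baseline state with
# the drug's direction of action, clipped so unresponsive cells stay put
responsiveness = np.clip(emb @ drug_vec, 0, None)
predicted = emb - responsiveness[:, None] * drug_vec  # predicted post-treatment state

# Sensitivity score = magnitude of the predicted shift in embedding space
shift = np.linalg.norm(predicted - emb, axis=1)
sensitive = shift > np.median(shift)
print(f"{sensitive.sum()} of {len(shift)} cells predicted sensitive")
```

A median split is a deliberately crude threshold; in practice the shift distribution would be compared against validated sensitivity labels or viability assays.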

The Scientist's Toolkit

Table 2: Essential Research Reagents and Computational Tools

| Item / Resource | Function / Description | Example Sources / Tools |
| --- | --- | --- |
| Pretrained Foundation Models | Provides the core AI for generating zero-shot predictions. | scGPT, Geneformer, Nicheformer, scFoundation [26] [4] [27] |
| Unified Software Framework | Standardizes access to different models, enabling fair benchmarking and streamlined workflows. | BioLLM [17] |
| Single-Cell Datasets | Provides the input data for prediction; requires high-quality, annotated pre- and post-treatment data for validation. | CCLE, GDSC, patient-derived organoid data [29] [28] |
| Batch Integration Tools | Corrects for technical variation between datasets, a critical step for robust model application. | Harmony, scVI [4] [2] |
| Gene Ontology Databases | Provides the biological context for interpreting model outputs and identified gene patterns. | Gene Ontology (GO) resources [2] |

Experimental Workflow and Validation

The following diagram illustrates the logical flow of a zero-shot prediction experiment, from data input to biological validation.

[Workflow diagram: Input pre-treatment scRNA-seq data → Data Preprocessing & Quality Control → Foundation Model (zero-shot embedding) → Cluster 1 (potentially sensitive) / Cluster 2 (potentially resistant) → Experimental Validation (e.g., in-vitro assays) → Applications: identify resistance mechanisms, guide combination therapy]

Figure 1: Zero-shot prediction workflow for novel drug response.

Performance Benchmarks and Validation

Rigorous evaluation is essential, as zero-shot performance of scFMs can be variable. Independent benchmarks reveal that while scFMs show promise, they do not always consistently outperform simpler baseline methods like Highly Variable Genes (HVG) selection or specialized models like scVI and Harmony on tasks like cell type clustering and batch correction [4] [2].

Table 3: Example Benchmarking Results for Zero-Shot Cell Embeddings (Adapted from [4] [2])

| Model / Method | AvgBIO Score (Cell Type Clustering) | Batch Integration Score (Pancreas Dataset) | Performance Notes |
| --- | --- | --- | --- |
| HVG (Baseline) | 0.79 | 0.88 | Often outperforms foundation models in zero-shot clustering and integration tasks [4]. |
| scVI (Baseline) | 0.75 | 0.85 | Robust performance on technical batch effects [4]. |
| Harmony (Baseline) | 0.73 | 0.72 | Struggles with complex biological batch effects (e.g., donor variation) [4] [2]. |
| scGPT (Zero-Shot) | 0.68 | 0.78 | Competitive on some tasks but inconsistent; benefits from large-scale pretraining [4] [17]. |
| Geneformer (Zero-Shot) | 0.62 | 0.45 | Underperforms baselines in batch integration; embeddings may be dominated by batch effects [4]. |

Validation requires correlating computational predictions with empirical data. For the ATSDP-NET model (which uses transfer learning, not pure zero-shot), high correlations were found between predicted gene scores and actual outcomes (sensitivity: R=0.888, p<0.001; resistance: R=0.788, p<0.001) [29] [30]. Similarly, PharmaFormer demonstrated clinical relevance by stratifying patients into risk groups with significantly different survival outcomes after fine-tuning on organoid data (e.g., Hazard Ratio for oxaliplatin in colon cancer: 4.49) [28]. These results underscore the potential value of these approaches, even as pure zero-shot capabilities continue to mature.

In single-cell genomics, the emergence of single-cell foundation models (scFMs) pretrained on tens of millions of cells has created new paradigms for biological discovery [1]. These models learn universal representations of cellular states by capturing complex gene-gene interactions and regulatory networks, offering immense potential for downstream tasks like drug response prediction [18] [31]. However, a significant challenge persists: adapting these massive models to specialized tasks with limited labeled data while preserving their generalizable biological knowledge.

Adapter-based fine-tuning has emerged as a powerful solution to this challenge, enabling parameter-efficient adaptation of scFMs. By inserting small, trainable modules into frozen pretrained models, adapters allow specialization for molecular perturbation prediction and other tasks while retaining the rich biological representations learned during pretraining [18] [31] [32]. This approach is particularly valuable for few-shot and zero-shot learning scenarios common in biomedical research, where experimental data for novel drugs or cell lines is extremely limited.

The Adapter Paradigm in Machine Learning

Adapter-based fine-tuning represents a parameter-efficient alternative to full model fine-tuning. Instead of updating all parameters of a pretrained foundation model, this approach inserts small, trainable adapter modules between the model's frozen layers [32]. A canonical adapter employs a bottleneck structure that first down-projects the input dimensionality, applies a non-linear activation, then up-projects back to the original dimension, with a skip connection preserving the original representations: h′ = W_up(σ(W_down h)) + h [32].

This design provides multiple advantages: it dramatically reduces the number of trainable parameters (often to 1-5% of the original model or less), minimizes catastrophic forgetting of pretrained knowledge, enables modular multi-task learning, and significantly reduces storage requirements by sharing the same backbone across tasks [31] [32]. The efficiency of adapters has been demonstrated across domains including natural language processing, computer vision, and speech recognition, where they often match or exceed full fine-tuning performance despite their minimal parameter count [32].
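The bottleneck equation above maps directly to a few lines of code. This NumPy sketch shows the forward pass and the parameter arithmetic; the zero-initialization of W_up is a common (though not universal) convention that makes the adapter an identity function at the start of training.

```python
import numpy as np

def adapter_forward(h, w_down, w_up):
    """Bottleneck adapter forward pass: h' = W_up(ReLU(W_down h)) + h.
    The skip connection means a zero-initialized W_up leaves the frozen
    model's representations untouched at the start of training."""
    z = np.maximum(w_down @ h, 0.0)   # down-project + non-linearity
    return w_up @ z + h               # up-project + residual connection

d_model, d_bottleneck = 512, 64
rng = np.random.default_rng(0)
w_down = rng.normal(scale=0.02, size=(d_bottleneck, d_model))
w_up = np.zeros((d_model, d_bottleneck))  # zero-init: adapter starts as identity

h = rng.normal(size=d_model)
assert np.allclose(adapter_forward(h, w_down, w_up), h)  # identity at init

n_adapter = w_down.size + w_up.size  # 2 * 512 * 64 = 65,536 per adapter
print(f"adapter params: {n_adapter}")
```

For a transformer with hundreds of millions of frozen parameters, a handful of such modules accounts for well under 1% of the total, which is the regime scDCA operates in.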

Adapter Architectures for Single-Cell Foundation Models

scDCA: Drug-Conditional Adapters for Perturbation Prediction

The Single-Cell Drug-Conditional Adapter (scDCA) represents a specialized architecture for molecular perturbation prediction. This approach introduces drug-conditional adapter layers that inject molecular structure information into frozen scFMs while training less than 1% of the original model parameters [18] [31]. The adapter parameters are dynamically conditioned on chemical structures, enabling the model to predict transcriptional responses to novel drugs and generalize zero-shot to unseen cell lines [31].

Table: scDCA Performance on Molecular Perturbation Prediction

| Generalization Task | Performance Improvement | Key Achievement |
| --- | --- | --- |
| Novel Drug Prediction | State-of-the-art results | Significant improvement over existing baselines |
| Unseen Cell Line Prediction | Major improvements | Successful zero-shot generalization |
| Few-shot Scenarios | Strong performance | Effective with limited training data |

Attn-Adapter: Dual Attention Mechanism

The Attn-Adapter architecture employs a dual attention mechanism to enhance few-shot learning capabilities. It consists of two key components: a Memory Attn-Adapter that refines category embeddings using support examples through cross-attention, and a Local-Global Attn-Adapter that enriches image embeddings by integrating local and global features [33]. This design enables dynamic adaptation from a few labeled samples without retraining the base model, outperforming state-of-the-art methods in cross-category and cross-dataset generalization [33].

Experimental Protocols for Adapter Implementation

Protocol: Implementing scDCA for Drug Response Prediction

Objective: Adapt a single-cell foundation model (e.g., scGPT) to predict transcriptional responses to novel drugs using drug-conditional adapters.

Materials:

  • Pretrained scFM (e.g., scGPT with 50M parameters pretrained on 33M cells)
  • Chemical perturbation dataset (e.g., with 100+ molecules across multiple cell lines)
  • Adapter framework (PyTorch or TensorFlow)
  • GPU resources (recommended: 16GB+ VRAM)

Procedure:

  • Model Preparation: Load a pretrained scFM and freeze all its parameters.
  • Adapter Insertion: Insert drug-conditional adapter layers after transformer blocks. Each adapter should implement:
    • Down-projection to reduced dimension (e.g., 64D from original 512D)
    • Non-linear activation (ReLU)
    • Up-projection to original dimension
    • Skip connection
  • Drug Conditioning: Implement a molecular structure encoder (e.g., using graph neural networks or molecular fingerprints) to generate conditional parameters for the adapter layers.
  • Training: Train only adapter parameters using:
    • Objective: Mean squared error between predicted and actual gene expression
    • Batch size: 32-128 (adjust based on GPU memory)
    • Learning rate: 1e-4 to 1e-3 with linear decay
    • Epochs: 50-100 with early stopping
  • Evaluation: Assess performance on held-out drugs and cell lines using metrics like mean squared error, Pearson correlation, and zero-shot accuracy.

Expected Outcomes: The adapted model should achieve state-of-the-art performance in predicting cellular responses to novel drugs and demonstrate zero-shot generalization to unseen cell lines, outperforming methods like ChemCPA and Biolord [31].
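The drug-conditioning step in the procedure above can be sketched as follows. This is a deliberately simplified, FiLM-style gating of the adapter bottleneck by a fingerprint-derived vector, not the published scDCA architecture; all dimensions, the mock fingerprints, and the gating scheme are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(42)
d_model, d_bottle, d_drug = 512, 64, 128      # illustrative sizes

# Trainable adapter weights (the scFM backbone itself stays frozen).
w_down = rng.normal(scale=0.02, size=(d_model, d_bottle))
w_up = rng.normal(scale=0.01, size=(d_bottle, d_model))
w_gate = rng.normal(scale=0.1, size=(d_drug, d_bottle))  # drug-conditioning head

def drug_conditional_adapter(h, drug_fp):
    """Adapter whose bottleneck is gated by a drug-fingerprint-derived vector,
    so the same frozen backbone produces drug-specific responses."""
    gate = np.tanh(drug_fp @ w_gate)          # drug-specific modulation vector
    z = np.maximum(h @ w_down, 0.0) * gate    # condition the bottleneck on the drug
    return z @ w_up + h                       # skip connection

h = rng.normal(size=(16, d_model))            # hidden states from the frozen scFM
fp_a = rng.integers(0, 2, size=d_drug).astype(float)  # mock molecular fingerprints
fp_b = rng.integers(0, 2, size=d_drug).astype(float)

out_a = drug_conditional_adapter(h, fp_a)
out_b = drug_conditional_adapter(h, fp_b)     # different drug -> different output

# Trainable parameters stay tiny relative to a ~50M-parameter backbone (<1%).
n_trainable = w_down.size + w_up.size + w_gate.size
```

In the real setting the fingerprint would come from a molecular encoder (graph neural network or fingerprint hash), and only these adapter tensors would receive gradients.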

Protocol: Few-Shot Adaptation with Attn-Adapter

Objective: Adapt a vision-language model for few-shot classification in biological imaging contexts.

Materials:

  • Pretrained VLM (e.g., CLIP)
  • Few-shot support set (typically 1-16 samples per class)
  • Attn-Adapter implementation

Procedure:

  • Feature Extraction: Extract support embeddings and category embeddings using the frozen base model.
  • Memory Attn-Adapter: Apply cross-attention to refine category embeddings using support embeddings as keys and values.
  • Local-Global Attn-Adapter: Enhance image embeddings by integrating local and global features through attention mechanisms.
  • Similarity Computation: Calculate cosine similarity between refined category and image embeddings for classification.

Validation: Test cross-category and cross-dataset generalization, comparing against Tip-Adapter and Meta-Adapter baselines [33].
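The Memory Attn-Adapter's refinement step (category embeddings as queries, support embeddings as keys and values) can be sketched with plain NumPy attention. The dimensions, the residual connection, and the single-head formulation are illustrative assumptions rather than the published architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_categories(cat_emb, support_emb, scale=None):
    """Cross-attention: categories query the few-shot support set (keys/values).
    The residual keeps the zero-shot category prior intact."""
    d = cat_emb.shape[-1]
    if scale is None:
        scale = 1.0 / np.sqrt(d)
    attn = softmax(cat_emb @ support_emb.T * scale)   # (n_classes, n_support)
    return attn @ support_emb + cat_emb               # refined category embeddings

rng = np.random.default_rng(1)
n_classes, n_support, d = 5, 16, 64
cat = rng.normal(size=(n_classes, d))                 # frozen category embeddings
support = rng.normal(size=(n_support, d))             # frozen support embeddings

refined = refine_categories(cat, support)

# Classification: cosine similarity between refined categories and a query embedding.
query = rng.normal(size=(d,))
cos = (refined @ query) / (np.linalg.norm(refined, axis=1) * np.linalg.norm(query))
pred = int(np.argmax(cos))
```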

Performance Evaluation and Benchmarking

Quantitative Performance of Adapter Methods

Table: Adapter Performance Across Domains

| Domain | Parameter Efficiency | Performance vs. Full Fine-tuning | Key Applications |
| --- | --- | --- | --- |
| Natural Language Processing | 0.6-6% of parameters | Outperforms by 0.7-2.5% in low-resource settings | Sentiment analysis, QA, NLI |
| Computer Vision | 2-5% of parameters | Exceeds by 1% AP on instance segmentation | Object detection, classification |
| Speech Translation | ~7% of parameters | BLEU improvements of +1.1 on low-resource pairs | Multi-speaker adaptation |
| Single-Cell Biology | <1% of parameters | State-of-the-art in perturbation prediction | Drug response, novel cell line generalization |

Adapter-based approaches consistently demonstrate competitive performance while maintaining parameter efficiency. In single-cell biology, scDCA enables significant improvements in few-shot and zero-shot generalization to new cell lines compared to existing baselines [18] [31]. The method establishes new state-of-the-art results across generalization tasks, particularly for the challenging scenario of predicting perturbations for unseen cell lines.

Zero-Shot Capabilities of Adapted Models

Rigorous evaluation of zero-shot performance is crucial for assessing true generalization capabilities. Studies reveal that scFMs like scGPT and Geneformer face challenges in zero-shot settings, sometimes underperforming simpler methods like highly variable gene selection on tasks like cell type clustering and batch integration [4]. However, adapter-based fine-tuning significantly enhances zero-shot capabilities by preserving the model's foundational knowledge while enabling adaptation to novel concepts [18] [31].

Benchmarking studies show that while no single scFM consistently outperforms others across all tasks, models with adapter-based tuning demonstrate more robust generalization [16]. Comprehensive evaluations across multiple cell-level tasks reveal that adapter-enhanced models capture biological relationships more effectively, as measured by ontology-informed metrics like scGraph-OntoRWR [16].

The Scientist's Toolkit

Table: Essential Research Reagents for Adapter Implementation

| Reagent / Tool | Function | Example Implementation |
| --- | --- | --- |
| Single-Cell Foundation Models | Provides pretrained biological representations | scGPT (50M params, pretrained on 33M cells) [1] |
| Adapter Modules | Enables parameter-efficient fine-tuning | Bottleneck layers with down/up-projection [32] |
| Molecular Encoders | Bridges chemical and biological modalities | Graph neural networks for molecular structures [31] |
| Few-Shot Support Sets | Provides limited labeled examples | 1-16 samples per class for adaptation [33] |
| Benchmark Datasets | Evaluates generalization capabilities | Chemical perturbation data with novel drugs/cell lines [18] |
| Unified Frameworks | Standardizes model integration and evaluation | BioLLM for consistent API access to multiple scFMs [17] |

Visualizing Experimental Workflows

Workflow: single-cell expression input → pretrained scFM (frozen weights, feature extraction) → drug-conditional adapter layers → predicted transcriptional response; the molecular structure input conditions the adapter parameters.

Diagram 1: scDCA workflow showing how drug information conditions adapter parameters to predict transcriptional responses using a frozen single-cell foundation model.

Workflow: few-shot support samples and category embeddings → Memory Attn-Adapter → refined category embeddings; Local-Global Attn-Adapter → enhanced image embeddings; both feed the few-shot prediction.

Diagram 2: Attn-Adapter architecture demonstrating how dual attention mechanisms refine both category and image embeddings for few-shot learning.

Adapter-based fine-tuning represents a transformative approach for adapting single-cell foundation models to specialized tasks with limited data. The strategic insertion of minimal trainable parameters enables remarkable efficiency while preserving valuable biological knowledge acquired during pretraining. As the field advances, innovations in dynamic routing, conditional adaptation, and hierarchical designs will further enhance the capabilities of adapter-based methods. For researchers in drug discovery and cellular biology, these techniques offer powerful tools to leverage the full potential of foundation models while accommodating the data constraints inherent in biomedical research.

The advent of single-cell genomics has revolutionized our ability to investigate biological systems at unprecedented resolution, revealing profound cellular heterogeneity in development, physiology, and disease. While single-cell RNA sequencing (scRNA-seq) has been the workhorse of this revolution, biological systems operate through complex, multilayered regulatory mechanisms that span multiple molecular modalities and are spatially organized within tissues. The emergence of single-cell multi-omics technologies now enables the simultaneous profiling of different data modalities—including transcriptomics, epigenomics, proteomics, and spatial context—within the same cell, providing a more comprehensive picture of cellular identity and function.

Concurrently, single-cell foundation models (scFMs) have emerged as powerful computational frameworks capable of learning universal representations from massive-scale single-cell data. These models, typically built on transformer architectures and pretrained on millions of cells through self-supervised objectives, have demonstrated remarkable capabilities in adapting to various downstream tasks with minimal fine-tuning. However, a significant challenge remains: most existing scFMs have primarily focused on transcriptomic data alone, limiting their ability to capture the full complexity of biological systems.

This application note explores cutting-edge computational strategies for integrating multi-omic and spatial data modalities within the framework of zero-shot learning single-cell foundation models. We provide detailed protocols and analytical frameworks that enable researchers to move beyond transcriptomics and leverage the full potential of multimodal single-cell data, with particular emphasis on clinical and drug development applications.

Foundations of Single-Cell Foundation Models

Core Architectural Principles

Single-cell foundation models are large-scale deep learning models pretrained on vast datasets that can be adapted to a wide range of downstream tasks through self-supervised learning [1]. These models share three key components that enable their generalization capabilities:

  • Large-scale pretraining: scFMs are trained on extremely large and diverse datasets to capture universal biological patterns. Public archives such as CZ CELLxGENE provide unified access to annotated single-cell datasets with over 100 million unique cells standardized for analysis [1].

  • Transformer architectures: Most scFMs utilize transformer architectures with attention mechanisms that allow the model to learn and weight relationships between any pair of input tokens (typically genes or genomic features) [1]. These architectures can be encoder-based (e.g., BERT-like), decoder-based (e.g., GPT-like), or hybrid designs.

  • Adaptation mechanisms: scFMs can be fine-tuned or prompted for new tasks, transferring learned knowledge to improve performance on target tasks with relatively few additional labeled examples [1].

Tokenization Strategies for Multimodal Data

A critical innovation in extending scFMs beyond transcriptomics lies in developing effective tokenization strategies for representing diverse data types. Unlike natural language, omics data lacks inherent sequential ordering, requiring specialized approaches:

Table 1: Tokenization Strategies for Multi-omic Data

| Data Modality | Tokenization Approach | Special Considerations | Example Models |
| --- | --- | --- | --- |
| scRNA-seq | Genes as tokens ordered by expression level; value embeddings for expression | Non-sequential nature of genes; high sparsity | scGPT, Geneformer |
| scATAC-seq | Chromatin accessibility peaks as tokens; accessibility scores as values | High dimensionality; binary nature | scGPT, MultiVI |
| Spatial Transcriptomics | Spatial coordinates as positional encodings; gene expression tokens | Spatial neighborhood relationships | Nicheformer, stClinic |
| Protein Abundance | Surface proteins as tokens; abundance levels as values | Limited feature space (typically <200 proteins) | CITE-seq models |
| Multiome | Modality-specific tokens with modality indicators | Integration of simultaneous measurements | scPairing, scGPT |

For multimodal integration, researchers have introduced special tokens indicating modality, species, technology, and batch information, enabling the model to learn both shared and modality-specific representations [1] [27]. Positional encoding schemes are adapted to represent the relative order or rank of each feature within a cell.
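A minimal sketch of rank-based tokenization in the spirit of the expression-ordered schemes in Table 1, with a special modality token prepended. The gene symbols, the `<RNA>` token, and the sequence length are illustrative assumptions:

```python
import numpy as np

def tokenize_cell(expr, gene_ids, modality_token, max_len=8):
    """Rank-value tokenization: order genes by descending expression, keep the
    top max_len, and prepend a special modality token."""
    nonzero = np.flatnonzero(expr)                  # sparsity: skip zero counts
    order = nonzero[np.argsort(-expr[nonzero])]     # rank genes by expression
    return [modality_token] + [gene_ids[i] for i in order[:max_len]]

genes = ["CD3D", "MS4A1", "NKG7", "LYZ", "GNLY"]
expr = np.array([0.0, 5.2, 1.1, 9.8, 0.0])

tokens = tokenize_cell(expr, genes, "<RNA>", max_len=3)
# Highest-expressed genes come first: ['<RNA>', 'LYZ', 'MS4A1', 'NKG7']
```

The same pattern extends to other modalities by swapping the feature vocabulary (peaks, proteins) and the modality token.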

Protocols for Multi-omic Data Integration

Protocol 1: Cross-modal Alignment with scPairing

Principle: scPairing integrates separate unimodal datasets to generate artificial multiomics data through contrastive learning in a shared embedding space, addressing the scarcity of true multiomics data [34].

Experimental Workflow:

  • Input Data Preparation:

    • Collect unimodal datasets (e.g., scRNA-seq and scATAC-seq) from the same biological system
    • Perform standard preprocessing: quality control, normalization, and feature selection for each modality
    • Identify anchor features (e.g., genes linked to chromatin accessibility peaks) for cross-modal alignment
  • Model Configuration:

    • Initialize scPairing architecture with modality-specific encoders
    • Configure contrastive learning objective to maximize similarity between embeddings of matched cellular states across modalities
    • Set hyperparameters: embedding dimension (typically 512-1024), batch size, and temperature parameter for contrastive loss
  • Training Procedure:

    • Train model using alternating optimization between modalities
    • Monitor alignment metrics: canonical correlation analysis (CCA) and mean squared error (MSE) between projected embeddings
    • Apply early stopping based on validation set performance
  • Multi-omics Generation:

    • Project separate unimodal datasets into the shared embedding space
    • Generate paired multiomics profiles by matching cells across modalities based on embedding similarity
    • Validate generated data by comparing with held-out true multiomics data

Applications: scPairing has been successfully applied to generate multiomics data for retina, immune, and renal cells, and can be extended to generate trimodal data [34].

Workflow: scRNA-seq data → RNA encoder; scATAC-seq data → ATAC encoder; contrastive learning aligns both in a shared embedding space, from which artificial multiomics data are generated.

Figure 1: scPairing Cross-modal Alignment Workflow
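The contrastive objective at the center of this workflow can be sketched as a symmetric InfoNCE loss, where matched RNA/ATAC embeddings in a batch are positives and all other pairs are negatives. This is a generic formulation with synthetic embeddings; scPairing's exact loss and temperature may differ:

```python
import numpy as np

def info_nce(z_rna, z_atac, temperature=0.1):
    """Symmetric InfoNCE: matched rows across modalities are positives,
    all other in-batch pairs are negatives."""
    # L2-normalize so dot products are cosine similarities
    z_rna = z_rna / np.linalg.norm(z_rna, axis=1, keepdims=True)
    z_atac = z_atac / np.linalg.norm(z_atac, axis=1, keepdims=True)
    logits = z_rna @ z_atac.T / temperature               # (N, N) similarities
    lp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    lp_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    # Diagonal entries are the matched (positive) pairs in both directions
    return -(np.mean(np.diag(lp)) + np.mean(np.diag(lp_t))) / 2

rng = np.random.default_rng(7)
z = rng.normal(size=(32, 128))
aligned = info_nce(z, z)                  # perfectly matched embeddings
shuffled = info_nce(z, z[::-1].copy())    # mismatched pairings
# Aligned pairs yield a much lower loss than mismatched ones
```

Minimizing this loss pulls matched cellular states together across modalities while pushing apart mismatched ones, which is what makes nearest-neighbor pairing in the shared space meaningful.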

Protocol 2: Zero-shot Multimodal Cell Typing with scGPT

Principle: scGPT leverages large-scale pretraining on over 33 million cells to enable zero-shot cell type annotation across multiple modalities without task-specific fine-tuning [26].

Experimental Workflow:

  • Data Preprocessing:

    • For each modality, format data as gene-protein feature matrices
    • Apply scGPT's standardized normalization: log(1+CP10K) for RNA, arcsinh(5×) for ADT data
    • Handle missing features through scGPT's built-in imputation or zero-padding
  • Model Initialization:

    • Load pretrained scGPT model weights (available through BioLLM framework)
    • Configure model for multimodal input using modality-specific tokenization
    • Set context length to accommodate combined feature set (typically 1200-1500 tokens)
  • Embedding Extraction:

    • Forward pass multimodal data through frozen pretrained model
    • Extract cell embeddings from the [CLS] token or mean pooling of last hidden layer
    • Reduce dimensionality using UMAP or t-SNE for visualization
  • Zero-shot Classification:

    • Compute cosine similarity between query cell embeddings and reference cell type centroids
    • Assign cell types based on nearest neighbors in embedding space
    • Apply confidence thresholds based on distance to nearest centroid

Validation Metrics: Report accuracy, F1-score, and confusion matrix for cell type annotation, and use Local Inverse Simpson's Index (LISI) to assess integration quality [2].
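Steps 1 and 4 of this protocol, log(1+CP10K) normalization and nearest-centroid zero-shot assignment, can be sketched as follows. The embeddings are mocked stand-ins for frozen scGPT outputs, and the confidence threshold is an illustrative assumption:

```python
import numpy as np

def log_cp10k(counts):
    """log(1 + CP10K): scale each cell to 10,000 total counts, then log-transform."""
    cp10k = counts / counts.sum(axis=1, keepdims=True) * 1e4
    return np.log1p(cp10k)

def nearest_centroid(query_emb, centroids, labels, min_cos=0.0):
    """Assign each query cell to the reference cell-type centroid with the
    highest cosine similarity; low-confidence cells become 'unassigned'."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    c = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    cos = q @ c.T
    best = cos.argmax(axis=1)
    return [labels[b] if cos[i, b] >= min_cos else "unassigned"
            for i, b in enumerate(best)]

rng = np.random.default_rng(3)
counts = rng.poisson(2.0, size=(4, 100)).astype(float) + 1.0  # mock count matrix
norm = log_cp10k(counts)
assert np.allclose(np.expm1(norm).sum(axis=1), 1e4)  # each cell sums to 10,000

# Mock embeddings standing in for frozen scGPT [CLS] outputs
centroids = np.eye(3, 16)                                      # 3 reference types
query = np.vstack([centroids[1] + 0.05, centroids[2] + 0.05])  # near types 1 and 2
calls = nearest_centroid(query, centroids, ["T cell", "B cell", "NK cell"])
# → ['B cell', 'NK cell']
```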

Protocols for Spatial Data Integration

Protocol 3: Spatial Context Transfer with Nicheformer

Principle: Nicheformer is a transformer-based foundation model pretrained on both dissociated single-cell and spatial transcriptomics data (SpatialCorpus-110M) that captures spatial context and enables spatial information transfer to dissociated data [27].

Experimental Workflow:

  • Data Curation:

    • Collect spatial transcriptomics data (MERFISH, Xenium, CosMx, or ISS technologies)
    • Process dissociated scRNA-seq data from comparable biological systems
    • Map orthologous genes across species (human and mouse) for cross-species applications
  • Model Pretraining (Optional):

    • Initialize transformer architecture with 12 encoder layers, 16 attention heads
    • Implement rank-based tokenization with technology-specific normalization
    • Train with masked gene modeling objective on spatial and dissociated data jointly
    • Incorporate contextual tokens for species, modality, and technology
  • Spatial Tasks:

    • Spatial composition prediction: Predict local cellular density and cell-type composition around each cell
    • Spatial label prediction: Transfer spatially-defined annotations (e.g., niche labels) to dissociated cells
    • Linear probing: Train simple classifiers on frozen Nicheformer embeddings for spatial tasks
  • Validation:

    • Compare spatial context predictions with ground truth manual annotations
    • Assess model uncertainty using confidence calibration metrics
    • Validate biological insights through comparison with known spatial patterns

Key Innovation: Nicheformer demonstrates that models trained only on dissociated data fail to recover the complexity of spatial microenvironments, underscoring the necessity of multiscale integration [27].

Table 2: Performance Comparison of Spatial Foundation Models

| Model | Training Data | Spatial Composition Prediction (Accuracy) | Spatial Label Transfer (F1) | Compute Requirements |
| --- | --- | --- | --- | --- |
| Nicheformer | 57M dissociated + 53M spatial cells | 0.89 | 0.85 | High (49.3M parameters) |
| CellPLM | 9M dissociated + 2M spatial cells | 0.76 | 0.72 | Medium |
| Geneformer | Dissociated only | 0.62 | 0.58 | Medium |
| scGPT | Dissociated only | 0.65 | 0.61 | High |

Protocol 4: Clinically Relevant Niche Analysis with stClinic

Principle: stClinic integrates spatial multi-slice multi-omics (SMSMO) and clinical data through dynamic graph modeling to identify clinically relevant cellular niches and their association with patient outcomes [35].

Experimental Workflow:

  • Data Integration:

    • Collect SMSMO data from multiple tissue slices (transcriptomics, epigenomics, proteomics)
    • Incorporate clinical metadata: survival time, treatment response, disease stage
    • Preprocess using MultiVI or Seurat for initial feature extraction
  • Graph Construction:

    • Build spatial neighborhood graphs within each slice using k-nearest neighbors (k=15)
    • Construct cross-slice similarity graphs based on feature profiles
    • Create unified graph combining spatial and feature similarities
  • Model Training:

    • Initialize stClinic with variational graph attention encoder (VGAE)
    • Train with Mixture-of-Gaussian (MOG) prior on latent features
    • Implement iterative graph refinement removing links between dissimilar nodes
    • Incorporate attention mechanisms to weight important niches
  • Clinical Association:

    • Represent each slice using niche vectors with six geometric statistical measures
    • Train supervised models to predict clinical outcomes from niche representations
    • Identify significant niches enriched in specific clinical groups
  • Zero-shot Transfer:

    • Use trained encoder to map new samples into shared feature space
    • Transfer niche labels from reference to query datasets without retraining
    • Validate transferred labels using spatial context and marker expression

Applications: stClinic has identified aggressive niches enriched with tumor-associated macrophages and favorable prognostic niches abundant in B and plasma cells across breast cancer, colorectal cancer, and liver metastasis datasets [35].

Workflow: SMSMO data → dynamic graph construction → VGAE encoder with MOG prior → latent features → niche vectors; niche vectors combine with clinical data for clinical outcome prediction, while the latent features also support zero-shot transfer.

Figure 2: stClinic Dynamic Graph Analysis Workflow
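The spatial graph construction step (k-nearest neighbors within a slice, k=15) can be sketched with a brute-force NumPy implementation, adequate for a few thousand cells per slice; for larger slices a KD-tree would be the practical choice. The mock coordinates are illustrative:

```python
import numpy as np

def spatial_knn_graph(coords, k=15):
    """Boolean adjacency matrix connecting each cell to its k nearest
    spatial neighbors (brute-force pairwise distances)."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)                 # exclude self-edges
    n = coords.shape[0]
    nearest = np.argpartition(dist, k, axis=1)[:, :k]  # k smallest per row
    adj = np.zeros((n, n), dtype=bool)
    adj[np.repeat(np.arange(n), k), nearest.ravel()] = True
    return adj

rng = np.random.default_rng(5)
coords = rng.uniform(0, 100, size=(200, 2))        # mock cell positions on a slice
adj = spatial_knn_graph(coords, k=15)
assert adj.sum(axis=1).min() == 15 and adj.sum(axis=1).max() == 15
```

Note the resulting graph is directed (kNN is not symmetric); symmetrizing with `adj | adj.T` is a common follow-up before feeding a graph encoder.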

The Scientist's Toolkit: Essential Research Reagents

Table 3: Key Research Reagent Solutions for Multi-omic Spatial Analysis

| Resource | Type | Function | Access |
| --- | --- | --- | --- |
| CZ CELLxGENE | Data Platform | Provides unified access to >100 million annotated single cells | Public portal |
| SpatialCorpus-110M | Training Data | Curated collection of 57M dissociated + 53M spatial cells for pretraining | Research use |
| BioLLM | Benchmarking Framework | Standardized interface for evaluating >15 foundation models | Open source |
| DISCO | Data Resource | Federated database aggregating single-cell data | Public portal |
| Pathway Tools | Visualization Software | Enables simultaneous visualization of up to 4 omics data types on metabolic charts | Academic license |
| scGPT Weights | Pretrained Model | Foundation model parameters pretrained on 33M+ cells | Research use |
| Nicheformer Code | Model Implementation | Transformer for spatial and dissociated data integration | GitHub repository |
| stClinic Package | Clinical Analysis | Dynamic graph model for SMSMO and clinical data integration | Upon request |

Discussion and Future Perspectives

The integration of multi-omic and spatial data modalities within zero-shot learning foundation models represents a paradigm shift in single-cell computational biology. The protocols outlined in this application note provide actionable frameworks for researchers to leverage these advanced methodologies in their investigations.

Critical challenges remain in several areas. Technical variability across platforms continues to complicate integration, with different technologies exhibiting distinct bias profiles that models must account for [27]. Interpretability of foundation model predictions requires further development, particularly for clinical translation where understanding model reasoning is essential. Computational scalability presents ongoing challenges as dataset sizes continue to grow exponentially.

Future directions should focus on several key areas. First, developing standardized benchmarking frameworks specifically designed for multimodal foundation models will enable more rigorous comparison and selection of appropriate methods for specific applications. Second, creating multimodal knowledge graphs that incorporate prior biological knowledge can enhance model interpretability and biological relevance. Finally, establishing federated learning frameworks will enable model training across distributed datasets while preserving data privacy, particularly important for clinical applications.

The convergence of multimodal single-cell technologies with advanced foundation model architectures promises to unlock new insights into cellular biology and disease mechanisms. By providing detailed protocols and analytical frameworks, this application note aims to equip researchers with the tools necessary to advance beyond transcriptomics and leverage the full potential of integrated multi-omic and spatial data in the era of single-cell foundation models.

The advent of single-cell RNA sequencing (scRNA-seq) has revolutionized our understanding of cellular heterogeneity in cancer, presenting new opportunities for precision medicine. However, translating these complex, high-dimensional datasets into actionable therapeutic insights remains a significant challenge. Single-cell foundation models (scFMs), pretrained on millions of cells using self-supervised learning, have emerged as powerful tools for decoding this complexity. These models learn universal biological representations that enable zero-shot learning and transfer across diverse downstream tasks without task-specific retraining [26]. This case study explores the application of scFMs to one of oncology's most pressing challenges: predicting individual patient drug sensitivity from single-cell transcriptomic profiles. By leveraging the emergent properties of foundation models, researchers can now interrogate cellular response mechanisms at unprecedented resolution, potentially accelerating the development of personalized cancer therapies.

Background

The Drug Sensitivity Prediction Challenge

Cancer treatment continues to evolve toward precision medicine, yet effective treatment selection remains hampered by tumor heterogeneity and limited predictive biomarkers. Traditional bulk RNA sequencing masks cellular subpopulations that may drive treatment resistance, while functional drug screening using patient-derived cells faces practical limitations in cost, scalability, and clinical translation [36]. Machine learning approaches have shown promise but often struggle with the high dimensionality, technical noise, and batch effects inherent in single-cell data [2]. The field requires methods that can generalize across datasets, capture subtle biological signals, and provide interpretable predictions for clinical decision-making.

Single-Cell Foundation Models

Foundation models represent a paradigm shift in single-cell data analysis. Originally developed for natural language processing, these models employ transformer-based architectures to learn fundamental biological principles from massive, diverse collections of single-cell data. Through pretraining objectives like masked gene modeling and contrastive learning, scFMs capture hierarchical patterns of gene regulation, cellular states, and biological processes [26]. Notable examples include scGPT (pretrained on over 33 million cells) and Geneformer, which demonstrate remarkable cross-task generalization capabilities including zero-shot cell type annotation and perturbation response prediction [26] [2]. Unlike traditional single-task models, scFMs create a universal representation space that encodes biological knowledge transferable to novel prediction tasks with minimal fine-tuning.

Key scFMs for Drug Sensitivity Prediction

Table 1: Foundation Models for Single-Cell Drug Response Prediction

| Model | Architecture | Pretraining Scale | Key Strengths | Reported Performance |
| --- | --- | --- | --- | --- |
| scGPT | Transformer | 33+ million cells [26] | Zero-shot annotation, multi-omic integration, perturbation modeling [26] | Superior cross-task generalization; robust benchmark performance [26] [2] |
| Geneformer | Transformer | Millions of cells [2] | Contextual gene embeddings, mechanism of action analysis [2] | Captures biologically meaningful relationships; transferable representations [2] |
| scPlantFormer | Phylogenetic transformer | 1 million plant cells [26] | Cross-species integration, lightweight architecture | 92% cross-species annotation accuracy [26] |
| Nicheformer | Graph transformer | 53 million spatial cells [26] | Spatial context modeling, niche environment effects | Spatial context prediction and integration [26] |

Experimental Protocols

Zero-Shot Drug Sensitivity Prediction Workflow

Workflow: single-cell transcriptomics input → scRNA-seq data matrix → scFM embedding (e.g., scGPT) → zero-shot prediction head → drug sensitivity scores.

Protocol 1: Zero-Shot Prediction Using Pretrained scFM Embeddings

  • Input Data Preparation: Process single-cell transcriptomics data (raw or normalized counts) for patient-derived cells or tumor samples. Data should be formatted to match the pretraining corpus gene space of the target scFM [2].

  • Embedding Generation: Extract cell embeddings from the final layer of the pretrained scFM without fine-tuning. For scGPT, this involves forward propagation of the expression matrix through the transformer architecture to obtain contextual cell representations [26] [2].

  • Drug Response Prediction: Apply a zero-shot prediction head to map embeddings to drug sensitivity scores. This can be implemented as:

    • A similarity-based approach comparing query cell embeddings to reference drug response profiles
    • A linear probe trained on limited labeled data while keeping the scFM backbone frozen
    • Direct inference using the model's inherent perturbation modeling capabilities [26]
  • Validation: Evaluate predictions against experimental drug screening data using correlation metrics (Pearson/Spearman R) and classification metrics (AUC-ROC) for binarized sensitivity thresholds [2].
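The linear-probe variant of the prediction head can be sketched as a closed-form ridge regression on frozen embeddings. The embeddings and IC50 values below are synthetic, so the held-out correlation only demonstrates the mechanics, not real predictive power:

```python
import numpy as np

def fit_linear_probe(embeddings, ic50, l2=1.0):
    """Ridge-regression probe on frozen scFM embeddings:
    w = (X^T X + l2*I)^-1 X^T y, with a bias column appended."""
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ ic50)

def predict(embeddings, w):
    X = np.hstack([embeddings, np.ones((len(embeddings), 1))])
    return X @ w

rng = np.random.default_rng(11)
emb = rng.normal(size=(300, 32))                       # mock frozen scFM embeddings
true_w = rng.normal(size=32)
ic50 = emb @ true_w + rng.normal(scale=0.1, size=300)  # synthetic drug responses

w = fit_linear_probe(emb[:200], ic50[:200])            # train on 200 cells
pred = predict(emb[200:], w)                           # predict 100 held-out cells
r = np.corrcoef(pred, ic50[200:])[0, 1]                # held-out Pearson R
```

Because only the probe is trained, the scFM backbone stays frozen, which keeps this workflow viable even with small labeled drug-response datasets.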

Interpretable Mechanism-of-Action Analysis

Workflow: scFM prediction model → interpretability methods (SHAP analysis, attention mechanisms) → important genes → MOA pathway enrichment.

Protocol 2: Interpretable MOA Analysis with scFMs

  • Feature Importance Calculation: Apply model interpretability techniques to identify genes driving predictions:

    • SHAP Analysis: Compute Shapley values to quantify each gene's contribution to predicted IC50 values [37].
    • Attention Analysis: Extract attention weights from transformer layers to identify biologically relevant gene-gene interactions [26].
  • MOA Pathway Validation: Test whether identified important genes are enriched in known drug mechanism-of-action pathways:

    • Retrieve drug-MOA pathways from Reactome, KEGG, or using LLM-curated annotations [37].
    • Perform gene set enrichment analysis (GSEA) on important genes.
    • Statistically evaluate target recovery rates against background distributions [37].
  • Biological Validation: Correlate model-derived important genes with CRISPR screening data (DepMap) to confirm functional relevance in specific cancer contexts [37].
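The target-recovery test in step 2 can be sketched as a one-sided hypergeometric enrichment test using only the Python standard library; the gene counts are illustrative:

```python
from math import comb

def hypergeom_pval(overlap, set_size, hits_total, universe):
    """One-sided hypergeometric test: P(X >= overlap) that a random gene set of
    size set_size drawn from `universe` genes contains at least `overlap` of the
    hits_total pathway genes."""
    p = 0.0
    for k in range(overlap, min(set_size, hits_total) + 1):
        p += comb(hits_total, k) * comb(universe - hits_total, set_size - k)
    return p / comb(universe, set_size)

# 20 model-important genes, 10 of which fall in a 50-gene MOA pathway,
# against a background of 2,000 genes: strong enrichment.
p_enriched = hypergeom_pval(overlap=10, set_size=20, hits_total=50, universe=2000)
# The expected chance overlap for these sizes is ~0.5 genes, so observing
# just one overlapping gene is unremarkable.
p_chance = hypergeom_pval(overlap=1, set_size=20, hits_total=50, universe=2000)
assert p_enriched < 1e-6 < p_chance
```

In practice one would apply multiple-testing correction across pathways; libraries such as SciPy (`scipy.stats.hypergeom`) provide the same test with better numerics for large universes.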

Performance Benchmarking

Table 2: Benchmarking scFM Performance Across Drug Prediction Tasks

| Task | Dataset | Best Performing scFM | Performance Metrics | Traditional ML Baseline |
| --- | --- | --- | --- | --- |
| Batch Integration | 5 datasets with inter-patient, platform, tissue variations [2] | scGPT (zero-shot) | Improved biological structure preservation | Seurat, Harmony, scVI [2] |
| Cell Type Annotation | Cross-tissue, novel cell types [2] | scPlantFormer | 92% cross-species accuracy [26] | HVG selection + clustering |
| Cancer Cell Identification | 7 cancer types [2] | Ensemble scFMs | High accuracy in tumor microenvironment | Tissue-specific classifiers |
| Drug Sensitivity Prediction | GDSC, PRISM datasets [37] | XGBoost on scFM embeddings | ρ = 0.88-0.89 Pearson correlation [37] | All-genes models (ρ = 0.40 median) [37] |
| Selective Drug Prediction | GDSC subset (active in <20% cell lines) [36] | scFM with random forest | 3.6/10 accurate in top-10 predictions [36] | Simple recommender systems |

Research Reagent Solutions

Table 3: Essential Research Resources for scFM Drug Sensitivity Studies

| Resource Category | Specific Tools/Datasets | Function and Application | Key Features |
| --- | --- | --- | --- |
| Computational Frameworks | scGPT [26], BioLLM [26] | Universal interfaces for benchmarking scFMs | Standardized access to 15+ foundation models |
| Data Repositories | DISCO [26], CZ CELLxGENE [26], GDSC [37], PRISM [37] | Provide pretraining corpora and drug response validation data | 100M+ cells aggregated for federated analysis |
| Alignment Tools | Celligner [37] | Matches cell line to patient transcriptomics | Enables clinical translation of models |
| Interpretability Packages | SHAP [37], integrated attention visualizers [26] | Model interpretation and MOA discovery | Quantifies gene contribution to predictions |
| Clinical Translation Platforms | CellHit pipeline [37] | End-to-end drug prediction framework | Combines scFMs with clinical data alignment |

Implementation Framework

Integrated Clinical Translation Pipeline

Protocol 3: End-to-End Clinical Drug Prediction Using scFMs

  • Data Acquisition and Processing:

    • Obtain single-cell RNA-seq data from patient tumor samples
    • Align with cancer cell line data using tools like Celligner to enable knowledge transfer [37]
    • Perform quality control and normalization compatible with target scFM
  • Model Selection and Inference:

    • Select appropriate scFM based on task complexity and dataset size [2]
    • Generate cell embeddings using zero-shot protocol to preserve biological variation
    • For limited data scenarios, apply lightweight fine-tuning of prediction heads
  • Clinical Validation and Translation:

    • Validate predictions against patient-derived cell culture drug screens where available [36]
    • Apply interpretability analysis to build confidence in predictions
    • Generate ranked therapeutic recommendations with confidence estimates
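The final step of Protocol 3, ranked therapeutic recommendations with confidence estimates, can be sketched as ranking drugs by mean predicted sensitivity across a sample's cells, with bootstrap confidence intervals. The prediction matrix, drug names, and the convention that higher scores mean greater sensitivity are all assumptions of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rank_drugs_with_confidence(pred_matrix, drug_names, n_boot=1000):
    """Rank drugs by mean predicted sensitivity across the cells of a
    sample, with bootstrap 95% confidence intervals over cells.
    pred_matrix: (n_cells, n_drugs); higher = more sensitive (assumed)."""
    n_cells, n_drugs = pred_matrix.shape
    means = pred_matrix.mean(axis=0)
    boots = np.empty((n_boot, n_drugs))
    for b in range(n_boot):
        idx = rng.integers(0, n_cells, n_cells)   # resample cells
        boots[b] = pred_matrix[idx].mean(axis=0)
    lo, hi = np.percentile(boots, [2.5, 97.5], axis=0)
    order = np.argsort(means)[::-1]               # most sensitive first
    return [(drug_names[i], means[i], lo[i], hi[i]) for i in order]

# Toy example: 200 cells, 3 hypothetical drugs with different mean scores
preds = rng.normal(loc=[0.2, 0.8, 0.5], scale=0.1, size=(200, 3))
ranking = rank_drugs_with_confidence(preds, ["drugA", "drugB", "drugC"])
print(ranking[0][0])  # drugB ranks first
```

Resampling cells (rather than drugs) captures how much a recommendation depends on tumor heterogeneity within the sample, which is one reasonable choice among several.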

Single-cell foundation models represent a transformative approach for predicting drug sensitivity in cancer research. By leveraging large-scale pretraining and zero-shot learning capabilities, scFMs overcome critical limitations of traditional methods in handling cellular heterogeneity, technical noise, and dataset integration. The protocols and frameworks presented herein provide researchers with practical guidance for implementing these advanced computational methods. As the field evolves, increasing model interpretability, standardization of benchmarks, and tighter integration with functional validation will be essential for translating scFM-based predictions into clinically actionable insights. The emerging paradigm of foundation models in single-cell analysis promises to accelerate personalized oncology by bridging high-resolution molecular profiling with effective therapeutic selection.

Navigating Challenges and Optimizing Single-Cell Foundation Model Performance

Single-cell foundation models (scFMs) represent a paradigm shift in computational biology, leveraging large-scale pretraining on massive single-cell transcriptomic datasets to learn universal representations of cellular biology [1]. These models, built on transformer architectures, are designed to be adaptable to a wide range of downstream tasks with minimal task-specific training, including zero-shot learning where models are applied without any fine-tuning [1] [38]. The promise of scFMs lies in their potential to capture fundamental biological principles that generalize across tissues, species, and experimental conditions.

However, as scFMs move from development to practical application, a growing body of evidence suggests their performance in zero-shot settings frequently fails to exceed that of simpler, established computational methods [5] [39] [38]. This application note synthesizes recent benchmarking studies to identify specific scenarios where this performance gap occurs, analyzes the underlying causes, and provides standardized protocols for evaluating scFMs against appropriate baselines. Understanding these limitations is crucial for researchers, scientists, and drug development professionals seeking to incorporate scFMs into their analytical workflows while avoiding potential pitfalls.

Quantitative Performance Landscape

Recent comprehensive benchmarking studies reveal that scFMs show inconsistent performance across standard single-cell analysis tasks when compared to traditional computational methods. The table below summarizes key findings from multiple evaluations comparing scFMs against established baselines.

Table 1: Performance Comparison of scFMs vs. Baselines Across Key Tasks

Task Domain Evaluation Metric Top-Performing Methods scFM Performance Key Findings
Cell Type Clustering Average BIO (AvgBIO) score, Average Silhouette Width (ASW) HVG selection, scVI, Harmony [38] Geneformer and scGPT underperform HVG and established methods across most datasets [38] HVG selection consistently outperforms both Geneformer and scGPT across all metrics [38]
Batch Integration Batch mixing scores, Principal Component Regression (PCR) HVG selection, scVI, Harmony [38] Geneformer consistently ranks last; scGPT shows variable performance [38] Best batch integration scores for all datasets achieved by selecting HVGs [38]
Perturbation Effect Prediction Multiple accuracy metrics Simple baseline models [39] scFM embeddings do not provide consistent improvements over baselines, especially under distribution shift [39] All models struggle with predicting strong or atypical perturbation effects [39]
Gene-Level Tasks Tissue specificity, GO term prediction Geneformer, scFoundation [17] scGPT shows robust performance across tasks; scBERT lags due to smaller size and limited training data [17] Performance varies significantly across models and tasks with no single scFM consistently dominating [2] [17]

Benchmarking analysis indicates that the relationship between pretraining dataset size and model performance is not straightforward. While pretraining generally provides benefits over randomly initialized models, extremely large and diverse pretraining datasets do not necessarily confer additional advantages for specific downstream tasks [38]. In some cases, models pretrained on tissue-specific data (e.g., scGPT-blood) outperform models trained on more diverse datasets (e.g., scGPT-human) even for tasks involving other tissue types [38].

Experimental Protocols for scFM Evaluation

Protocol 1: Zero-Shot Cell Type Clustering Benchmark

Purpose: To evaluate the quality of scFM-derived cell embeddings for distinguishing known cell types without task-specific fine-tuning.

Materials:

  • Test Datasets: Curated scRNA-seq datasets with high-quality cell type annotations (e.g., Tabula Sapiens, Pancreas, PBMC datasets) [38]
  • Benchmarking Models: scFMs (Geneformer, scGPT, scFoundation, etc.) and baseline methods (HVG selection, scVI, Harmony)
  • Evaluation Metrics: Average BIO (AvgBIO) score, Average Silhouette Width (ASW) [38]

Procedure:

  • Data Preparation: Standardize test datasets using consistent quality control, normalization, and filtering procedures
  • Embedding Generation: Extract zero-shot cell embeddings from each scFM using the authors' recommended protocols
  • Baseline Generation: Apply traditional methods (HVG selection, scVI, Harmony) to the same test datasets
  • Dimensionality Reduction: Apply UMAP or t-SNE to all embedding types for visualization
  • Cluster Validation: Calculate evaluation metrics by comparing cluster assignments with ground-truth cell type labels
  • Statistical Analysis: Perform multiple comparative tests across datasets and methods

Expected Outcomes: Simpler methods like HVG selection are expected to outperform or match scFMs in most cell type clustering tasks, providing a critical baseline for evaluating the added value of scFM embeddings [38].
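The cluster validation step of Protocol 1 can be illustrated with the Average Silhouette Width on ground-truth labels. The rescaling to [0, 1] follows the scIB-style convention; the synthetic embeddings below are stand-ins, not real scFM or HVG outputs.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)

def cell_type_asw(embeddings, labels):
    """Average Silhouette Width of cell-type labels in an embedding,
    rescaled from [-1, 1] to [0, 1] (higher = better separation)."""
    return (silhouette_score(embeddings, labels) + 1) / 2

# Synthetic comparison: a well-separated embedding (stand-in for a strong
# baseline such as HVG+PCA) vs. an uninformative random embedding
labels = np.repeat([0, 1, 2], 100)
separated = rng.normal(size=(300, 16)) + labels[:, None] * 5.0
random_emb = rng.normal(size=(300, 16))
print(cell_type_asw(separated, labels) > cell_type_asw(random_emb, labels))  # True
```

In a real benchmark the same function would be applied to each method's embedding of the same cells, with the ground-truth annotations as `labels`.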

Protocol 2: Batch Integration Assessment

Purpose: To assess scFM capability to remove technical batch effects while preserving biological variation in zero-shot settings.

Materials:

  • Test Datasets: Datasets with known batch effects from multiple sources (e.g., Pancreas benchmark with data from five different sources) [38]
  • Evaluation Metrics: Batch integration scores, PCR, proportion of variance explained by batch effects [38]

Procedure:

  • Dataset Selection: Curate datasets with mixed technical (protocol, platform) and biological (donor, condition) batch effects
  • Embedding Extraction: Generate zero-shot cell embeddings using target scFMs
  • Visualization: Create 2D embeddings colored by batch and cell type identity
  • Quantitative Assessment: Calculate batch mixing metrics comparing within-batch versus between-batch cell distances
  • Biological Preservation: Evaluate whether biological variation remains detectable after batch effect removal
  • Comparative Analysis: Rank methods by their ability to simultaneously minimize batch effects and preserve biological signals

Expected Outcomes: Traditional methods like Harmony and scVI typically outperform scFMs in batch correction, with Geneformer often increasing batch effects compared to raw data [38].
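The quantitative assessment step of Protocol 2 can be sketched with a batch-silhouette mixing score: a silhouette computed on *batch* labels, inverted so that well-mixed batches score near 1. This is one simple metric among the several (kBET, PCR, etc.) used in published benchmarks; the embeddings below are synthetic.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)

def batch_mixing_score(embeddings, batch_labels):
    """1 minus the [0,1]-scaled silhouette on batch labels: near 1 means
    batches are well mixed, near 0 means embeddings separate by batch."""
    asw_batch = (silhouette_score(embeddings, batch_labels) + 1) / 2
    return 1 - asw_batch

batches = np.repeat([0, 1], 150)
mixed = rng.normal(size=(300, 8))            # batches fully overlap
shifted = mixed + batches[:, None] * 6.0     # strong residual batch effect
print(batch_mixing_score(mixed, batches) > batch_mixing_score(shifted, batches))  # True
```

A complete evaluation pairs such a mixing score with a biological-conservation score (as in Protocol 1), since trivially collapsing all cells would also "mix" batches perfectly.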

Protocol 3: Perturbation Prediction Evaluation

Purpose: To evaluate scFM performance in predicting transcriptional responses to genetic perturbations.

Materials:

  • Benchmark Framework: PertEval-scFM standardized evaluation framework [39]
  • Test Data: Large-scale perturbation datasets with transcriptome-wide profiles [40]
  • Evaluation Metrics: Prediction accuracy for direction and magnitude of expression changes

Procedure:

  • Data Splitting: Implement non-standard data splits where no perturbation condition occurs in both training and test sets
  • Model Evaluation: Assess zero-shot scFM embeddings against simpler baseline models
  • Distribution Shift Testing: Evaluate performance under conditions that differ from pretraining data distributions
  • Effect Strength Analysis: Stratify results by perturbation strength and type
  • Comparative Analysis: Rank methods by prediction accuracy across different perturbation classes

Expected Outcomes: scFMs generally fail to consistently outperform simpler baselines for perturbation prediction, particularly for strong or atypical perturbations and under distribution shift [39].
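The data-splitting step of Protocol 3, in which no perturbation condition occurs in both training and test sets, can be implemented with a grouped split, using the perturbation identity as the group key. The toy data below are placeholders for real perturbation profiles.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(3)

# Toy data: 500 cells, each tagged with one of 20 perturbation conditions
perturbations = rng.integers(0, 20, size=500)
X = rng.normal(size=(500, 32))   # stand-in for expression profiles

# Grouped split: every perturbation lands entirely in train OR test,
# so the test set contains only conditions unseen during training
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=perturbations))

train_perts = set(perturbations[train_idx])
test_perts = set(perturbations[test_idx])
print(train_perts & test_perts)  # set() — no overlap
```

A random row-level split would leak each perturbation into both partitions and overstate generalization; the grouped split is what makes the evaluation genuinely zero-shot with respect to perturbation identity.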

Visualizing Evaluation Workflows

[Figure 1 diagram: input data (raw scRNA-seq data, cell type annotations, batch metadata, perturbation datasets) feeds two method families — scFMs (Geneformer, scGPT, etc.) and traditional methods (HVG, scVI, Harmony). Both are evaluated on four tasks: cell type clustering (biological metrics: ASW, AvgBIO), batch effect integration (batch mixing scores, PCR), perturbation prediction (prediction accuracy: MAE, correlation), and gene-level tasks (ontology-based metrics: scGraph-OntoRWR, LCAD). All metrics converge on a performance gap analysis.]

Figure 1: Comprehensive scFM Evaluation Workflow. This workflow outlines the standardized approach for benchmarking single-cell foundation models against traditional methods across key analytical tasks.

Critical Factors Contributing to scFM Underperformance

Architectural and Training Limitations

The transformer architecture, while powerful for sequential data like text, faces fundamental challenges when applied to single-cell data where gene-gene interactions are non-sequential and dynamic [2] [1]. Current scFMs rely on various strategies to impose order on inherently unordered gene expression data, including ranking genes by expression levels or binning expression values [1]. These arbitrary orderings may not capture true biological relationships and can introduce artifacts that limit model generalization.

The masked language model pretraining objective used by most scFMs (Geneformer, scGPT) may not optimally capture the biological information needed for diverse downstream tasks [38]. This pretraining approach focuses on predicting masked genes based on their context, which does not necessarily translate to effective learning of cell-type discriminative features or batch-effect-invariant representations.

Data Quality and Compatibility Issues

Substantial technical variability across single-cell sequencing platforms presents significant challenges for scFMs [26]. Batch effects, technical noise, and platform-specific artifacts in pretraining data can propagate through to model embeddings, reducing their utility for zero-shot applications [1] [26]. Furthermore, the relationship between pretraining data composition and downstream task performance appears complex, with tissue-specific pretraining sometimes outperforming more diverse pretraining even for cross-tissue applications [38].

Data leakage concerns complicate model evaluation, as some test datasets may have been included in scFM pretraining corpora [38]. Surprisingly, even when evaluated on datasets seen during pretraining, scFMs do not consistently outperform simpler methods, indicating potential limitations in how effectively these models extract and retain biologically relevant information during pretraining [38].

Task-Specific Limitations

Current scFMs demonstrate particular weaknesses in batch integration tasks, where Geneformer embeddings sometimes amplify rather than reduce batch effects compared to raw data [38]. This suggests that the pretraining process may not adequately teach models to distinguish technical artifacts from biological signals.

For perturbation prediction, scFMs struggle with strong or atypical perturbation effects and show limited generalization under distribution shift [39]. This indicates that the models may be learning to predict average cellular behaviors rather than capturing the full spectrum of possible cellular responses to perturbations.

Table 2: Key Research Reagents and Computational Resources for scFM Evaluation

Resource Type Primary Function Access Information
BioLLM Framework Software Framework Unified interface for integrating and evaluating diverse scFMs [17] Standardized APIs for model switching and benchmarking
PertEval-scFM Benchmarking Framework Standardized evaluation of perturbation prediction capabilities [39] Specialized framework for perturbation effect prediction
CELLxGENE Census Data Resource Curated single-cell data for pretraining and evaluation [26] [24] >100 million standardized cells for model development
scGPT Foundation Model Generative pretrained transformer for single-cell analysis [26] 33M+ cell pretraining; strong multi-task performance [17]
Geneformer Foundation Model Transformer model pretrained on single-cell transcriptomes [38] Emphasis on gene-level tasks and network inference
Harmony Baseline Method Batch integration and data harmonization [38] Established baseline for integration tasks
scVI Baseline Method Generative model for scRNA-seq analysis [38] Probabilistic modeling of single-cell data
HVG Selection Baseline Method Feature selection based on high variability [38] Surprisingly competitive baseline for many tasks

Conceptual Framework for scFM Limitations

[Figure 2 diagram: four groups of performance gap contributors map to mitigation strategies. Architectural mismatch (non-sequential gene interactions; suboptimal gene tokenization) → biology-inspired architectures, improved gene representation. Data quality issues (batch effect propagation in pretraining data; data leakage in evaluation) → enhanced data curation, rigorous zero-shot evaluation. Pretraining limitations (masked language modeling objective; poor generalization under distribution shift) → task-aware pretraining objectives, rigorous zero-shot evaluation. Task-specific weaknesses (batch integration underperformance; perturbation prediction limitations) → rigorous zero-shot evaluation.]

Figure 2: scFM Performance Gap Analysis Framework. This diagram illustrates the key factors contributing to scFM underperformance and potential strategies for addressing these limitations.

The performance gaps between scFMs and simpler baseline methods in zero-shot settings highlight the ongoing challenges in developing truly robust and generalizable foundation models for single-cell biology. Rather than justifying wholesale dismissal of scFMs, these findings should guide more targeted development efforts focused on specific limitations.

Future work should prioritize developing biologically meaningful evaluation metrics like scGraph-OntoRWR, which measures the consistency of cell type relationships captured by scFMs with prior biological knowledge [2]. Additionally, standardized benchmarking frameworks like BioLLM [17] and PertEval-scFM [39] will enable more rigorous and comparable evaluations across the field.

For researchers currently applying these tools, we recommend a cautious approach: always compare scFM performance against simpler baselines such as HVG selection, scVI, and Harmony, particularly for critical analyses where accuracy is essential. As the field evolves, addressing the fundamental architectural and training limitations identified in this application note will be essential for realizing the full potential of foundation models in single-cell genomics and translational research.

In single-cell RNA sequencing (scRNA-seq) research, technical artifacts introduced through variations in experiments, sequencing platforms, or sample preparation processes can generate batch effects that mask true biological signals [41] [42]. These technical confounders represent a significant hurdle for all analytical approaches, including emerging zero-shot learning foundation models that promise to accelerate biological discovery without task-specific training [4] [5]. The fundamental challenge lies in distinguishing biologically irrelevant technical noise from meaningful biological variation, particularly when analyzing data from multiple sources or experimental conditions.

The critical importance of this challenge is underscored by recent evaluations of single-cell foundation models such as scGPT and Geneformer, which have demonstrated limited zero-shot performance in batch integration tasks [4] [3]. In some cases, these sophisticated models are outperformed by traditional computational methods and even simple feature selection approaches like selecting highly variable genes [4] [3]. This reveals a crucial gap in our current analytical capabilities and highlights the necessity of robust preprocessing and quality control protocols to ensure data quality before applying foundation models.

Understanding Technical Noise and Batch Effects

Technical noise in scRNA-seq data arises from multiple sources throughout the experimental workflow. Ambient RNA contamination occurs when transcripts from damaged or apoptotic cells leak out during single-cell isolation and become encapsulated in droplets along with other cells [42]. Additional artifacts include barcode swapping (incorrect binding between barcodes during sequencing) and multiplets (where more than one cell is captured within a single droplet or microwell) [42]. The multiplet rate is influenced by the scRNA-seq platform and the number of loaded cells; for example, 10x Genomics reports a 5.4% multiplet rate when 7,000 target cells are loaded, escalating to 7.6% with 10,000 cells [42].
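Assuming the multiplet rate scales roughly linearly with the number of loaded cells, the two 10x Genomics figures quoted above (5.4% at 7,000 cells; 7.6% at 10,000) give a simple interpolation rule of thumb. This is not a vendor formula, just a linear fit through those two published data points.

```python
def expected_multiplet_rate(loaded_cells):
    """Linear interpolation of the two quoted 10x figures:
    5.4% at 7,000 loaded cells and 7.6% at 10,000."""
    slope = (7.6 - 5.4) / (10_000 - 7_000)   # percentage points per cell
    return 5.4 + slope * (loaded_cells - 7_000)

print(round(expected_multiplet_rate(8_500), 2))  # 6.5 (% multiplets)
```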

Batch effects represent another significant category of technical variation, stemming from differences in experimental conditions, tissue storage, dissociation processes, and sequencing library preparation [42]. These effects can cause clusters to appear as distinct cell types even when they are actually the same, potentially leading to erroneous biological interpretations if not properly addressed.

Impact on Foundation Model Performance

The presence of technical noise and batch effects poses particular challenges for single-cell foundation models. Recent zero-shot evaluations of Geneformer and scGPT revealed that these models often fail to correct for batch effects between different experimental techniques [4]. In some cases, Geneformer's embedding space failed to retain information about cell type, with clustering primarily driven by batch effects rather than biological reality [4]. While scGPT's embeddings offered some separation between cell types, the primary structure in dimensionality reduction was still dominated by technical variation [4].

Quantitative evaluation with batch integration metrics demonstrated that both Geneformer and scGPT underperformed relative to established methods like Harmony and scVI across most datasets [4]. Surprisingly, the best batch integration scores for all datasets were achieved by simply selecting highly variable genes, highlighting the continued importance of fundamental preprocessing steps [4].

Quantitative Evaluation of Batch Correction Methods

Table 1: Performance Comparison of Batch Correction Methods Across Multiple Metrics

Method Cell Type Clustering (AvgBIO Score) Batch Integration (Pancreas Dataset) Computational Efficiency Preservation of Rare Cell Types
Harmony Moderate to High Excellent for technical variation High Moderate
scVI High Excellent for technical variation Moderate Moderate
HVG Selection Variable Excellent across datasets Very High Limited
scGPT (zero-shot) Inconsistent Poor to Moderate Low Unknown
Geneformer (zero-shot) Poor Poor Low Unknown
BDACL High Not reported Not reported Excellent

Table 2: Performance of Foundation Models in Zero-Shot Cell Type Clustering

Model Performance Relative to Baselines Consistency Across Datasets Effect of Pretraining Data Batch Integration Capability
scGPT Underperforms scVI and Harmony on most datasets Variable; better on PBMC (12k) dataset Improves with pretraining, but larger datasets not always beneficial Fails to correct for batch effects between techniques
Geneformer Consistently underperforms baselines Poor across datasets Limited improvement even with pretraining data overlap Fails to retain cell type information; clustering driven by batch

Experimental Protocols for Quality Control

Comprehensive Quality Control Workflow

The following protocol outlines a standardized workflow for quality control in scRNA-seq data analysis, adapted from established best practices [42] [43] [44]:

Step 1: Initial Data Assessment

  • Import count matrices from preprocessing tools (CellRanger, STARsolo, etc.)
  • Distinguish between "Droplet" matrices (containing empty droplets), "Cell" matrices (empty droplets excluded), and "FilteredCell" matrices (poor-quality cells excluded) [44]
  • Generate preliminary quality metrics including total counts, genes detected per cell, and percentage of mitochondrial genes

Step 2: Empty Droplet Detection

  • Apply algorithms such as barcodeRanks and EmptyDrops from the dropletUtils package [44]
  • Identify the knee and inflection points in the log-log plot of barcode ranks against total counts
  • Flag barcodes with total counts below these thresholds as empty droplets
  • Remove empty droplets from subsequent analysis
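The knee-finding idea in Step 2 can be sketched with a simple geometric heuristic: on the log-log barcode-rank curve, take the point farthest from the straight line joining the curve's endpoints. This is a stand-in for the `barcodeRanks` knee estimate from dropletUtils, not its actual algorithm, and the count distributions below are synthetic.

```python
import numpy as np

rng = np.random.default_rng(4)

def knee_point(total_counts):
    """Estimate the barcode-rank knee as the point farthest from the
    chord joining the curve's endpoints in log-log space."""
    counts = np.sort(np.asarray(total_counts, dtype=float))[::-1]
    counts = counts[counts > 0]
    x = np.log10(np.arange(1, counts.size + 1))
    y = np.log10(counts)
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    # perpendicular distance of each point to the endpoint chord
    d = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return counts[int(np.argmax(d))]  # count threshold at the knee

# Synthetic mixture: 1,000 real cells vs. 20,000 ambient-level droplets
cells = rng.lognormal(mean=9.0, sigma=0.3, size=1_000)
empties = rng.lognormal(mean=2.0, sigma=0.5, size=20_000)
threshold = knee_point(np.concatenate([cells, empties]))
print(threshold > empties.max())  # True: knee sits above the ambient plateau
```

Barcodes with total counts below the returned threshold would be flagged as candidate empty droplets; EmptyDrops then refines this with a statistical test against the ambient profile.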

Step 3: Transcript-Level Quality Control

  • Remove artifact transcripts including ambient RNA using tools like SoupX or CellBender [42]
  • Filter out overabundant genes that may induce batch effects: ribosomal genes, immunoglobulin genes, HLA genes, and specific long non-coding RNAs [42]
  • Approach stress-related gene removal cautiously, as these may reflect biological response rather than technical artifacts

Step 4: Cell-Level Quality Control

  • Detect and remove doublets/multiplets using tools like Scrublet, DoubletFinder, or doubletCells [42] [44]
  • Filter cells based on quality thresholds:
    • Remove cells with excessively high or low gene/UMI counts [42] [43]
    • Exclude cells with mitochondrial percentage exceeding 5-15% (tissue-dependent) [42] [43]
    • Apply median absolute deviation (MAD) filtering to identify outliers across multiple QC metrics [43]
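The MAD filtering in Step 4 can be sketched directly: a cell is flagged when a QC metric deviates from the cohort median by more than a fixed number of median absolute deviations. The metric values and the `nmads=5` cutoff below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def mad_outliers(metric, nmads=5.0):
    """Flag cells whose QC metric lies more than `nmads` median absolute
    deviations from the median, per the MAD filtering step above."""
    metric = np.asarray(metric, dtype=float)
    med = np.median(metric)
    mad = np.median(np.abs(metric - med))
    return np.abs(metric - med) > nmads * mad

# Toy QC metric: log10 total counts per cell, plus two extreme cells
log_counts = np.concatenate([rng.normal(3.5, 0.2, size=500), [0.1, 9.0]])
mask = mad_outliers(log_counts, nmads=5)
print(mask[-2:].all())  # True: both extreme cells are flagged
```

In practice the same filter is applied to several metrics (counts, detected genes, mitochondrial fraction) and a cell is removed if it is an outlier on any of them.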

Step 5: Data Normalization and Scaling

  • Regress out technical covariates including total UMIs per cell, mitochondrial gene percentage, and stress signatures [42]
  • Account for cell cycle heterogeneity by regressing out cell cycle scores [42]
  • Apply appropriate normalization methods to address differences in sequencing depth
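The depth-normalization part of Step 5 can be sketched as follows: scale each cell's counts to a common target depth, then apply a log1p transform, analogous to scanpy's `normalize_total` followed by `log1p`. Covariate regression (UMIs, mitochondrial fraction, cell cycle) is a separate modeling step not shown here.

```python
import numpy as np

rng = np.random.default_rng(6)

def normalize_log1p(counts, target_sum=1e4):
    """Depth-normalize each cell (row) to `target_sum` total counts,
    then log1p-transform, as in the normalization step above."""
    counts = np.asarray(counts, dtype=float)
    depth = counts.sum(axis=1, keepdims=True)
    return np.log1p(counts / depth * target_sum)

# Toy counts: 5 cells x 100 genes with cell-specific depth differences
raw = rng.poisson(2.0, size=(5, 100)) * rng.integers(1, 4, size=(5, 1))
norm = normalize_log1p(raw)
# After normalization, every cell's back-transformed counts sum to target
print(np.allclose(np.expm1(norm).sum(axis=1), 1e4))  # True
```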

Step 6: Batch Effect Correction

  • Select appropriate integration methods based on data complexity:
    • Use Harmony for simple integration tasks with distinct batch and biological structures [42]
    • Apply scVI for complex integration tasks such as tissue or organ atlases [42]
    • Consider BBKNN when runtime and memory efficiency are the primary constraints [42]
  • Exercise caution when correcting heterogeneous samples (e.g., tumors) to avoid removing biologically meaningful variation

Quality Control Visualization Workflow

[Diagram 1 flow: Raw Count Matrix → Empty Droplet Detection (Droplet matrix) → Transcript QC (Cell matrix; ambient RNA removed) → Cell-level QC (FilteredCell matrix) → Normalization & Scaling → Batch Effect Correction (normalized counts → integrated data) → Foundation Model Application (cell embeddings) → Downstream Analysis.]

Diagram 1: Comprehensive Quality Control Workflow for Single-Cell RNA Sequencing Data. This workflow outlines the sequential steps for processing scRNA-seq data before application of foundation models, highlighting critical stages for addressing technical noise and batch effects.

The Scientist's Toolkit: Essential Research Reagents and Computational Tools

Table 3: Key Research Reagent Solutions for scRNA-seq Quality Control

Tool/Reagent Function Application Context
SoupX Ambient RNA removal Effective for single-nucleus data; requires some manual input of marker genes
CellBender Background noise reduction Superior for cleaning noisy datasets and extracting biological signals
Scrublet Doublet detection Scalable for large datasets; identifies multiplets in droplet-based platforms
DoubletFinder Doublet detection High detection accuracy with strong statistical stability in downstream analyses
Harmony Batch effect correction Ideal for simple integration tasks with distinct batch and biological structures
scVI Batch effect correction Suitable for complex integration tasks like tissue or organ atlases
BBKNN Batch effect correction Scalable option when runtime and memory efficiency are constraints
DecontX Ambient RNA estimation Estimates contamination levels and deconvolutes native vs. contaminating RNA

Strategies for Optimizing Foundation Model Performance

Preprocessing Strategies for Enhanced Model Utility

Given the current limitations of single-cell foundation models in zero-shot settings, researchers should adopt specific preprocessing strategies to optimize performance:

Data Quality Assessment

  • Implement rigorous quality control metrics before applying foundation models
  • Utilize comprehensive pipelines like SCTK-QC that integrate multiple QC tools and generate standardized reports [44]
  • Carefully document quality thresholds and filtering parameters for reproducibility

Batch Effect Management

  • Apply appropriate batch correction methods based on data complexity and structure [42]
  • Avoid overcorrection that might remove biologically meaningful variation, particularly in heterogeneous samples like tumors [42]
  • Validate integration success using multiple metrics and visualization approaches

Feature Selection Considerations

  • Recognize that simple approaches like highly variable gene selection may outperform foundation model embeddings for some tasks [4] [3]
  • Experiment with different feature selection strategies when using foundation models in zero-shot settings
  • Document feature selection methods thoroughly to enable replication

Method Selection Framework

[Diagram 2 decision tree: starting from a scRNA-seq dataset, assess dataset size and complexity, then batch effect complexity — technical variation only → simple batch correction (Harmony); mixed technical and biological variation → complex integration (scVI, BBKNN). Then consider the primary biological question: exploratory analysis with unknown cell types → apply a foundation model with caution; well-defined cell types and known biology → traditional methods (Harmony, scVI, HVG).]

Diagram 2: Method Selection Framework for Batch Effect Correction. This decision tree guides researchers in selecting appropriate computational methods based on dataset characteristics and research objectives, highlighting scenarios where foundation models may be appropriate versus cases where traditional methods are preferable.

The effective handling of batch effects and technical noise remains a fundamental challenge in single-cell genomics, particularly with the emergence of foundation models that promise zero-shot biological discovery. Current evidence suggests that even sophisticated foundation models like scGPT and Geneformer struggle with batch effect correction in zero-shot settings and may be outperformed by traditional methods [4] [3] [5]. This reality underscores the continued importance of rigorous quality control protocols and appropriate method selection based on specific dataset characteristics and research questions.

As the field advances, researchers must maintain a critical perspective on methodological claims, particularly regarding the zero-shot capabilities of foundation models. The development of standardized evaluation practices—including comprehensive zero-shot assessment—will be crucial for accurately measuring progress in this rapidly evolving domain [4] [3]. By implementing robust quality control workflows, selecting appropriate batch correction methods, and understanding the current limitations of foundation models, researchers can more effectively navigate the data quality hurdle and advance our understanding of cellular biology.

Single-cell foundation models (scFMs) represent a paradigm shift in computational biology, leveraging large-scale deep learning to interpret complex single-cell omics data. These models are pretrained on vast datasets through self-supervised learning, enabling adaptation to various downstream tasks such as cell type annotation, batch integration, and perturbation prediction without task-specific labels [1] [45]. The performance of scFMs in zero-shot learning settings—where models are applied without further training—is critically dependent on the quality, scale, and diversity of their pretraining data [38] [16]. This protocol examines the quantitative relationships between dataset characteristics and model efficacy, providing actionable guidelines for constructing optimized pretraining corpora for scFMs.

The Critical Role of Pretraining Data in scFMs

The foundational premise of scFMs mirrors that of large language models: exposure to massive, diverse datasets enables the learning of fundamental biological principles that generalize across tasks. In single-cell biology, individual cells are treated analogously to sentences, with genes or genomic features serving as tokens or words [1] [45]. The transformer architectures underpinning most scFMs utilize attention mechanisms to learn relationships between genes across millions of cellular contexts, forming a universal representation of cellular states and functions [1] [26].

The self-supervised pretraining process typically employs objectives like masked gene modeling, where the model learns to predict randomly masked genes based on the context of other genes in the cell [1] [15]. This process allows the model to internalize complex gene regulatory relationships, cellular functions, and expression patterns without manual annotation. The resulting model embeddings—both at the gene and cell level—encode biological knowledge that can be leveraged for diverse analytical tasks through zero-shot application or minimal fine-tuning [16] [2].
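The "cell as sentence" framing above can be made concrete with a few lines of code. The sketch below uses a simplified rank-based encoding in the spirit of Geneformer; the function name is illustrative, and real pipelines additionally normalize by gene-wise medians, truncate to a fixed context length, and add special tokens.

```python
import numpy as np

def rank_tokenize(expression, gene_ids):
    """Turn one cell's expression vector into a 'cell sentence':
    gene IDs ordered from highest to lowest expression, with
    unexpressed genes dropped. A simplified rank-based tokenization;
    not any published model's exact implementation."""
    expression = np.asarray(expression, dtype=float)
    expressed = expression > 0
    order = np.argsort(-expression[expressed], kind="stable")
    return np.asarray(gene_ids)[expressed][order]

# One toy cell: gene g3 is the most expressed, g0 is silent.
sentence = rank_tokenize([0, 5, 2, 9], ["g0", "g1", "g2", "g3"])
```

The resulting token sequence is what a transformer consumes; masked gene modeling then hides a subset of these tokens and asks the model to reconstruct them from the remaining context.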

Quantitative Impact of Dataset Characteristics

Dataset Scale and Model Performance

Extensive benchmarking reveals a complex relationship between pretraining dataset size and downstream task performance. The following table summarizes empirical findings from leading scFM implementations:

Table 1: Impact of Pretraining Dataset Scale on Model Performance

| Model | Pretraining Dataset Size | Key Performance Findings | Primary Limitations |
| --- | --- | --- | --- |
| CellFM [15] | 100 million human cells | Outperforms existing models in cell annotation, perturbation prediction, and gene function prediction; demonstrates benefits of extreme scale for single-species modeling. | Computational intensity; requires specialized infrastructure (e.g., Ascend910 NPUs). |
| scGPT [1] [15] | 33 million human cells | Strong performance in multi-omic integration and zero-shot annotation; robust across diverse tasks. | Inconsistent zero-shot performance on some datasets compared to simpler methods [38]. |
| Geneformer [1] [16] | 30 million cells | Effective for gene-level tasks and transfer learning; captures biologically meaningful relationships. | Underperforms in zero-shot batch integration and cell type clustering [38]. |
| scFoundation [16] [15] | ~50 million cells | Directly predicts raw gene expression values; preserves full data resolution. | Performance varies across tasks; no consistent superiority across all benchmarks. |
| UCE [16] | 36 million cells | Integrates cross-species data using protein language models; captures molecular diversity. | Large parameter count (650M) increases computational demands. |
| LangCell [16] | 27.5 million scRNA-text pairs | Incorporates cell type labels during pretraining; enables novel text-cell integration capabilities. | Performance depends on quality and consistency of text annotations. |

The relationship between scale and performance exhibits diminishing returns. Evaluations of scGPT variants pretrained on datasets of different sizes (from 814,000 kidney cells to 33 million diverse human cells) demonstrated that while pretraining provides clear benefits over random initialization, larger and more diverse datasets do not always confer proportional improvements [38]. In some cases, smaller tissue-specific models (e.g., scGPT blood trained on 10.3 million blood and bone marrow cells) performed comparably to or even better than the larger general model on specific tissue types [38].

Dataset Diversity and Composition

Beyond sheer volume, the diversity of cell types, tissues, and experimental conditions within pretraining data significantly impacts model robustness and generalizability:

Table 2: Impact of Dataset Diversity on Model Generalization

| Diversity Dimension | Impact on Model Performance | Evidence from Benchmarking |
| --- | --- | --- |
| Cell Type Diversity | Enables recognition of rare cell types and improves cross-tissue generalization. | Models trained on diverse atlases (e.g., Human Cell Atlas) outperform tissue-specific models on novel cell types [1] [16]. |
| Species Representation | Facilitates cross-species learning and evolutionary insights. | UCE demonstrates effectiveness in capturing molecular diversity across species [16] [15]. |
| Experimental Conditions | Improves robustness to technical variations and batch effects. | Models trained on data from multiple technologies (10x Genomics, Smart-seq2, etc.) show better integration capabilities [16] [2]. |
| Disease States | Enhances clinical relevance and disease-specific insights. | Inclusion of diseased cells (e.g., 7.1M viral infection cells, 3.5M lung cancer cells) improves pathological characterization [15]. |

The composition balance of pretraining datasets emerges as a critical factor. Models trained on data from specific tissues (e.g., blood and bone marrow) may outperform more general models on tasks involving those same tissues, even when the general model was trained on significantly more data [38]. This suggests that strategic balancing of tissue representation, rather than simply maximizing total cell count, may optimize pretraining efficiency.

Dataset Curation Protocols and Quality Control

Standardized Curation Workflow

Implementing rigorous data curation protocols is essential for constructing high-quality pretraining datasets. The following workflow, implemented successfully for CellFM, provides a template for systematic dataset assembly:

Data Acquisition from Multiple Repositories → Raw Data Processing & Expression Matrix Generation → Quality Control: Cell & Gene Filtering → Gene Name Standardization (HGNC Guidelines) → Format Conversion to Unified Sparse Matrix → Metadata Annotation & Dataset Balancing → Curated Dataset Ready for Model Pretraining

Diagram 1: Dataset Curation and Quality Control Workflow

Critical Quality Control Measures

  • Multi-Source Data Acquisition: Collect data from diverse repositories including NCBI GEO, ENA, GSA, ImmPort, and CELLxGENE [1] [15]. CELLxGENE alone provides unified access to over 100 million standardized single-cell profiles, making it an invaluable resource [1] [26].

  • Quality Control and Filtering:

    • Filter cells based on quality metrics: mitochondrial read percentage, unique gene counts, and total read counts [15].
    • Remove lowly expressed genes that appear in only a small fraction of cells [1].
    • Implement sample-level filtering to exclude datasets with evident technical artifacts or poor sequencing quality [1].
  • Gene Name Standardization: Apply HUGO Gene Nomenclature Committee (HGNC) guidelines consistently across all datasets to ensure uniform gene identifiers [15]. This critical step resolves discrepancies in gene symbol usage across different source datasets.

  • Metadata Annotation and Balancing:

    • Annotate cells with standardized metadata including tissue origin, disease status, donor characteristics, and experimental platform [16] [15].
    • Balance dataset composition to prevent overrepresentation of common cell types (e.g., T cells) and underrepresentation of rare populations [1] [16].
  • Batch Effect Documentation: Document technical batch effects (platform, laboratory, processing protocol) but avoid aggressive batch correction during pretraining dataset construction to preserve biological variance [1] [16].
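The cell-level filtering step above can be sketched directly on a cells x genes count matrix. The thresholds and the function name below are illustrative assumptions, not fixed standards: appropriate cutoffs depend on tissue, species, and sequencing platform.

```python
import numpy as np

def qc_filter_cells(counts, gene_names, max_mito_frac=0.2,
                    min_genes=200, min_counts=500):
    """Flag cells passing basic QC on a cells x genes count matrix.
    Criteria: mitochondrial read fraction, number of detected genes,
    and total read counts. Thresholds are illustrative only."""
    counts = np.asarray(counts, dtype=float)
    # Mitochondrial genes identified by the conventional "MT-" prefix.
    is_mito = np.char.startswith(np.asarray(gene_names, dtype=str), "MT-")
    total = counts.sum(axis=1)
    mito_frac = np.divide(counts[:, is_mito].sum(axis=1), total,
                          out=np.zeros_like(total), where=total > 0)
    n_genes = (counts > 0).sum(axis=1)
    return (mito_frac <= max_mito_frac) & (n_genes >= min_genes) & (total >= min_counts)
```

In practice the same metrics are computed by established toolkits (e.g., scanpy's QC utilities); the sketch simply makes the filtering logic explicit.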

Experimental Protocols for Evaluating Data Impact

Zero-Shot Performance Benchmarking

To quantitatively assess how dataset characteristics influence model capabilities, implement the following evaluation protocol:

  • Embedding Extraction: Generate zero-shot cell embeddings from the pretrained model without any fine-tuning [38] [16].

  • Cell Type Clustering Evaluation:

    • Apply standard clustering algorithms (e.g., Louvain, Leiden) to model embeddings.
    • Calculate Average BIO (AvgBio) score and Average Silhouette Width (ASW) to quantify cluster separation and cohesion [38].
    • Compare against baseline methods including Highly Variable Genes (HVG) selection, Harmony, and scVI [38] [16].
  • Batch Integration Assessment:

    • Apply models to datasets with known batch effects (e.g., Pancreas benchmark with five different sources) [38].
    • Quantify batch mixing using metrics such as principal component regression (PCR) score while preserving biological variation [38].
    • Visualize embeddings to confirm integration of technical replicates while maintaining separation of distinct cell types [38].
  • Biological Relevance Validation:

    • Implement ontology-informed metrics including scGraph-OntoRWR to measure consistency of captured cell type relationships with established biological knowledge [16] [2].
    • Apply Lowest Common Ancestor Distance (LCAD) to assess the severity of cell type misclassifications based on ontological proximity [16] [2].
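The idea behind LCAD is that confusing two sibling cell types is a milder error than confusing distant lineages. A minimal pure-Python sketch on an ontology given as child-to-parent links follows; it is a simplified stand-in for the published metric, and the toy ontology is invented for illustration.

```python
def lcad(a, b, parent):
    """Lowest Common Ancestor Distance: edges from each node up to
    their lowest common ancestor, summed. Tree given as child -> parent."""
    def ancestors(n):
        path = [n]
        while n in parent:
            n = parent[n]
            path.append(n)
        return path
    pa, pb = ancestors(a), ancestors(b)
    sb = set(pb)
    for da, node in enumerate(pa):
        if node in sb:
            return da + pb.index(node)
    raise ValueError("nodes share no common ancestor")

# Toy ontology: CD4/CD8 T cells are siblings; B cells are cousins.
parent = {"CD4 T": "T cell", "CD8 T": "T cell",
          "T cell": "lymphocyte", "B cell": "lymphocyte"}
```

Under this tree, mislabeling a CD4 T cell as CD8 T scores distance 2, while calling it a B cell scores 3, so averaging LCAD over a test set grades annotation errors by ontological severity rather than treating all mistakes equally.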

Cross-Architecture Comparison Protocol

To isolate data impacts from architectural effects, implement cross-model benchmarking:

  • Model Selection: Include diverse architectures (Geneformer, scGPT, UCE, scFoundation, LangCell, scCello) representing different pretraining strategies [16].

  • Task Diversity: Evaluate across gene-level (gene function prediction, gene-gene relationships) and cell-level (batch integration, cell type annotation, drug sensitivity prediction) tasks [16] [2].

  • Performance Aggregation: Use non-dominated sorting algorithms to aggregate multiple evaluation metrics into holistic model rankings [16].
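Non-dominated sorting avoids collapsing incomparable metrics into one weighted score. A minimal sketch of extracting the first (non-dominated) front, assuming higher is better on every metric; the O(n²) scan is fine at benchmark scale:

```python
def pareto_front(scores):
    """Indices of models not dominated on any metric (higher is better).
    Model i is dominated if some j is >= on every metric and > on at
    least one; peeling fronts repeatedly yields a holistic ranking."""
    front = []
    for i, si in enumerate(scores):
        dominated = any(
            all(b >= a for a, b in zip(si, sj)) and
            any(b > a for a, b in zip(si, sj))
            for j, sj in enumerate(scores) if j != i
        )
        if not dominated:
            front.append(i)
    return front

# Three hypothetical models scored on (clustering, integration):
models = [(0.80, 0.60), (0.70, 0.75), (0.65, 0.55)]
```

Here the first two models trade off against each other and both survive, while the third is strictly worse than the first and is ranked behind them.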

Table 3: Essential Research Reagents and Computational Resources for scFM Pretraining

| Resource Category | Specific Tools & Platforms | Primary Function | Access Considerations |
| --- | --- | --- | --- |
| Data Repositories | CZ CELLxGENE Discover [1], DISCO [26], NCBI GEO [1], Human Cell Atlas [1] | Provide standardized, annotated single-cell datasets for pretraining | CELLxGENE offers >100 million cells; DISCO supports federated analysis |
| Computational Frameworks | BioLLM [17], MindSpore (CellFM) [15], PyTorch (scGPT) [1] | Unified interfaces for model training and evaluation; specialized AI frameworks | BioLLM standardizes APIs across models; MindSpore optimized for Ascend chips |
| Pretraining Corpora | Curated compendia from PanglaoDB [1], Human Ensemble Cell Atlas [1] | Provide pre-integrated datasets from multiple sources | Reduce curation overhead but require validation for specific use cases |
| Hardware Infrastructure | Ascend910 NPUs [15], GPU clusters | Accelerate training of large models (100M-800M parameters) | CellFM required 4x Atlas800 servers with 8x Ascend910 NPUs each |
| Evaluation Platforms | scGNN+ [26], specialized benchmarking frameworks [16] [2] | Automate optimization and provide biologically informed evaluation | Incorporate novel metrics like scGraph-OntoRWR for biological relevance |

Optimizing pretraining datasets for single-cell foundation models requires balanced consideration of scale, diversity, and curation quality. While increasing dataset size generally improves performance, evidence suggests diminishing returns beyond certain thresholds, emphasizing the importance of strategic dataset composition and rigorous quality control [38] [16]. Future work should focus on developing standardized curation protocols, optimizing dataset balancing algorithms, and establishing rigorous benchmarks for evaluating the biological fidelity of learned representations, particularly in zero-shot settings where scFMs face their most significant challenges and opportunities [38] [16].

Single-cell foundation models (scFMs), pretrained on vast datasets using self-supervised objectives like Masked Language Modeling (MLM), promise to transform biological discovery. A critical evaluation of their zero-shot capabilities, however, reveals significant limitations. This Application Note demonstrates that in zero-shot settings—essential for exploratory biology where labels are unknown—current scFMs can be outperformed by simpler, established methods in tasks such as cell type clustering and batch integration. We present structured quantitative evaluations and detailed experimental protocols to guide researchers in benchmarking model performance, emphasizing that the choice of pretraining objective is paramount for developing robust, reliable, and biologically insightful scFMs.

The advent of single-cell foundation models (scFMs) represents a paradigm shift, aiming to leverage large-scale, unlabeled data to build foundational knowledge of cellular biology. These models, often based on transformer architectures, are typically pretrained using self-supervised objectives, with Masked Language Modeling (MLM) being a predominant choice [1]. In this framework, portions of a cell's gene expression profile are masked, and the model is trained to reconstruct them, analogous to how language models predict missing words [1].

A model's true generalizability, however, is most rigorously tested in a zero-shot setting, where its pretrained internal representations (embeddings) are used for downstream tasks without any task-specific fine-tuning [4]. This is not merely a technical benchmark; it is a fundamental requirement for discovery-driven science. In many research contexts, such as identifying novel cell states or characterizing heterogeneous tumor microenvironments, predefined labels do not exist, precluding the possibility of fine-tuning [4]. The performance of a model in this setting is a direct reflection of the quality and transferability of the biological knowledge acquired during pretraining. Recent evidence suggests that the current generation of scFMs, including Geneformer and scGPT, may face reliability challenges in this critical regime, sometimes being outperformed by simpler methods like highly variable gene (HVG) selection or established integration tools like Harmony and scVI [4]. This underscores the urgent need for systematic evaluation of how different pretraining objectives contribute to robust zero-shot performance.

Quantitative Evaluation of Model Performance

A rigorous, quantitative benchmark is essential for comparing the effectiveness of different models and pretraining strategies. The following tables summarize key performance metrics across critical single-cell analysis tasks.

Table 1: Zero-Shot Cell Type Clustering Performance (AvgBIO Score) This table evaluates the ability of model-generated cell embeddings to separate known cell types without further training. A higher AvgBIO score indicates better performance [4].

| Model / Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.75 | 0.68 | 0.71 | 0.73 |
| scVI | 0.72 | 0.70 | 0.75 | 0.70 |
| Harmony | 0.70 | 0.65 | 0.72 | 0.69 |
| scGPT | 0.78 | 0.62 | 0.68 | 0.65 |
| Geneformer | 0.65 | 0.58 | 0.60 | 0.61 |

Table 2: Batch Integration Performance (Batch Mixing Score) This table assesses the model's capacity to integrate data from multiple sources, removing technical batch effects while preserving biological variation. A higher score indicates better batch correction [4].

| Model / Method | Pancreas | PBMC | Tabula Sapiens | Immune Dataset |
| --- | --- | --- | --- | --- |
| HVG (Baseline) | 0.89 | 0.91 | 0.85 | 0.88 |
| scVI | 0.85 | 0.88 | 0.80 | 0.75 |
| Harmony | 0.82 | 0.85 | 0.72 | 0.83 |
| scGPT | 0.78 | 0.80 | 0.81 | 0.82 |
| Geneformer | 0.65 | 0.68 | 0.62 | 0.64 |

Table 3: Comparing Pretraining Objectives in NLP Insights from natural language processing on how objectives affect representation learning. MLM excels in representation tasks, while Causal Language Modeling (CLM) shows data efficiency. A combined strategy can be optimal [46].

| Pretraining Objective | Model Architecture | Key Strengths | Key Weaknesses |
| --- | --- | --- | --- |
| Masked Language Modeling (MLM) | Encoder (e.g., BERT) | Robust performance across various representation tasks; bidirectional context. | Less data-efficient than CLM; can be less stable during fine-tuning. |
| Causal Language Modeling (CLM) | Decoder (e.g., GPT) | High data efficiency; improved fine-tuning stability. | Underperforms MLM on some text representation tasks. |
| Sequential (CLM then MLM) | Encoder-Decoder | Combines data efficiency of CLM with robust performance of MLM; optimal under fixed compute. | Requires a two-stage training process. |

Experimental Protocols for Benchmarking scFMs

To ensure reproducible and comparable evaluations of scFMs, researchers should adhere to the following detailed experimental protocols.

Protocol for Zero-Shot Cell Type Clustering

Objective: To evaluate the quality of a foundation model's cell embeddings in separating known cell types without any fine-tuning.

Materials:

  • A held-out test scRNA-seq dataset with ground-truth cell type labels (e.g., from Tabula Sapiens).
  • A pretrained foundation model (e.g., scGPT, Geneformer).
  • Baseline methods for comparison (e.g., HVGs, scVI, Harmony).

Procedure:

  • Data Preprocessing: Apply standard preprocessing to the test dataset, including quality control, normalization, and log-transformation of gene expression counts. Do not train or fine-tune the foundation model on this data.
  • Embedding Generation:
    • For the foundation model, input the preprocessed expression matrix and extract the cell embeddings from the model's output layer.
    • For HVGs, select the top 2,000-5,000 highly variable genes and use this reduced matrix as the embedding.
    • Generate cell embeddings using scVI and Harmony according to their standard documentation.
  • Dimensionality Reduction & Clustering: Apply principal component analysis (PCA) to all embedding matrices, followed by Leiden or Louvain clustering on a shared k-nearest neighbor (k-NN) graph built from the first 50 principal components.
  • Metric Calculation: Compute clustering metrics such as the Average BIO (AvgBIO) score and Average Silhouette Width (ASW) by comparing the clusters to the ground-truth cell type labels. The BIO score balances the completeness and homogeneity of the clustering.

Interpretation: A model whose embeddings produce higher AvgBIO and ASW scores is better at capturing biologically meaningful variation related to cell identity in a zero-shot manner.
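The ASW part of the metric step can be written directly from its definition. A minimal numpy sketch follows; the rescaling of silhouette values from [-1, 1] to [0, 1] follows common single-cell benchmarking practice, and the O(n²) distance matrix means large datasets should be subsampled (or an optimized implementation such as scikit-learn's used instead).

```python
import numpy as np

def average_silhouette_width(X, labels):
    """Average Silhouette Width of embeddings X (n_cells x n_dims)
    against cell-type labels, rescaled to [0, 1]. Pure numpy;
    quadratic in the number of cells."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(-1))          # pairwise Euclidean distances
    uniq = np.unique(labels)
    s = np.empty(len(X))
    for i in range(len(X)):
        own = labels[i]
        same = labels == own
        same[i] = False
        a = D[i, same].mean() if same.any() else 0.0          # intra-cluster
        b = min(D[i, labels == c].mean()                       # nearest other cluster
                for c in uniq if c != own)
        s[i] = 0.0 if max(a, b) == 0 else (b - a) / max(a, b)
    return float((s.mean() + 1) / 2)
```

Well-separated cell types score near 1; labels uncorrelated with embedding structure score near 0.5.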

Protocol for Zero-Shot Batch Integration

Objective: To assess a model's ability to generate embeddings that mix cells from different batches (e.g., experiments, technologies) while preserving biological cell type separations.

Materials:

  • A benchmark dataset with known batch effects and cell type labels (e.g., the Pancreas dataset from [4]).
  • The pretrained foundation model and baseline methods.

Procedure:

  • Data Preprocessing: Prepare the dataset as in Protocol 3.1, ensuring batch information is retained.
  • Embedding Generation: Generate cell embeddings for the entire dataset using the foundation model and baseline methods in a zero-shot fashion.
  • Qualitative Visualization: Project the embeddings into two dimensions using UMAP. Create two UMAP plots for each method: one colored by cell type and another colored by batch.
  • Quantitative Evaluation: Calculate two complementary metrics:
    • Batch Mixing Score: Measures the degree of intermingling between batches within cell type clusters. A higher score indicates better batch correction.
    • Principal Component Regression (PCR) Score: Quantifies the proportion of variance in the embeddings explained by batch after regressing out biological covariates. A lower PCR score indicates that less technical variation remains.

Interpretation: Successful batch integration is indicated by a high batch mixing score, a low PCR score, and UMAP plots where cells cluster primarily by cell type rather than by batch.
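The PCR score in step 4 can be sketched as a variance-weighted R² of top principal components regressed on one-hot batch labels. This is a simplified form of the metric (real benchmarks regress out biological covariates first), so treat it as illustrative:

```python
import numpy as np

def pcr_score(embeddings, batch_labels, n_pcs=20):
    """Fraction of embedding variance (in top PCs) explained by batch.
    Lower is better: less residual technical variation. Simplified PCR."""
    X = np.asarray(embeddings, dtype=float)
    X = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(X, full_matrices=False)   # PCA via SVD
    k = min(n_pcs, S.size)
    pcs, var = U[:, :k] * S[:k], S[:k] ** 2
    # One-hot batch design with intercept (rank deficiency is fine for lstsq).
    cats = {b: i for i, b in enumerate(dict.fromkeys(batch_labels))}
    D = np.zeros((X.shape[0], len(cats) + 1))
    D[:, 0] = 1.0
    for row, b in enumerate(batch_labels):
        D[row, cats[b] + 1] = 1.0
    r2 = np.empty(k)
    for j in range(k):
        y = pcs[:, j]
        beta, *_ = np.linalg.lstsq(D, y, rcond=None)
        resid = y - D @ beta
        tot = (y ** 2).sum()                           # PCs are centered
        r2[j] = 0.0 if tot == 0 else 1 - resid @ resid / tot
    return float((r2 * var).sum() / var.sum())
```

An embedding with a strong batch shift along one axis scores high; the same embedding with the shift removed scores near zero.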

Protocol for Pretraining Objective Ablation

Objective: To isolate and evaluate the impact of different self-supervised pretraining objectives on downstream zero-shot performance.

Materials:

  • A large, diverse scRNA-seq corpus for pretraining (e.g., from CELLxGENE).
  • A suite of held-out benchmark tasks for evaluation (cell clustering, batch integration, perturbation prediction).

Procedure:

  • Model Architecture: Fix a single transformer architecture (e.g., a standard 6-layer encoder).
  • Objective Variation: Pretrain multiple instances of this model from scratch on the same pretraining corpus, but vary the pretraining objective:
    • MLM: A bidirectional objective where a random subset (e.g., 15-30%) of gene tokens are masked and the model must reconstruct them [1] [46].
    • CLM: A unidirectional (autoregressive) objective where the model predicts the next gene token in a sequence, given all previous tokens [46].
    • CLM+MLM: A biphasic strategy where the model is first pretrained with CLM for a portion of the steps, then the training is continued with the MLM objective [46].
  • Controlled Pretraining: Ensure all models are trained for an equal number of steps, with the same computational budget and hyperparameter tuning effort.
  • Zero-Shot Evaluation: Evaluate all pretrained models on the benchmark tasks using Protocols 3.1 and 3.2, ensuring no fine-tuning is performed.

Interpretation: This controlled ablation study directly reveals which pretraining objective leads to the most transferable and robust biological representations, separating the effect of the objective from other architectural and data-scale factors.
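The objective variation in step 2 reduces to how (input, target) pairs are built from a tokenized cell. A minimal numpy sketch, with illustrative function names (real pipelines batch, pad, and handle special tokens):

```python
import numpy as np

def clm_pairs(tokens):
    """Causal LM: predict each token from all preceding tokens."""
    tokens = np.asarray(tokens)
    return tokens[:-1], tokens[1:]       # (inputs, targets), shifted by one

def mlm_pairs(tokens, mask_rate=0.15, mask_id=0, seed=0):
    """Masked LM: corrupt random positions; loss is computed only there."""
    rng = np.random.default_rng(seed)
    tokens = np.asarray(tokens)
    mask = rng.random(tokens.shape) < mask_rate
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, tokens, mask

# CLM sees only a prefix and predicts the next gene token; MLM sees the
# whole corrupted cell and reconstructs masked genes bidirectionally.
# The biphasic strategy simply switches from the first pair-builder to
# the second partway through the step budget.
```

Holding architecture, data, and compute fixed while swapping only these pair-builders is exactly the controlled comparison the ablation requires.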

The Scientist's Toolkit: Essential Research Reagents

The following table details key computational tools and resources essential for conducting research in single-cell foundation models.

Table 4: Key Research Reagent Solutions for scFM Development

| Reagent / Resource | Type | Function & Application |
| --- | --- | --- |
| CELLxGENE | Data Platform | Provides unified access to millions of standardized, annotated single-cell datasets, serving as a primary data source for pretraining scFMs [1]. |
| scGPT / Geneformer | Foundation Model | Pretrained transformer-based models for single-cell biology; used as benchmark models or for transfer learning on downstream tasks [4]. |
| scVI | Software Tool | A probabilistic framework for scRNA-seq data analysis; used as a strong baseline for dimensionality reduction, clustering, and batch correction [4]. |
| Harmony | Software Tool | An integration algorithm that projects cells into a shared embedding space, effectively removing batch effects; used as a baseline for integration benchmarks [4]. |
| ONNX Format | Model Format | An open format for representing machine learning models; used to export and visualize PyTorch models with tools like Netron for architectural inspection [47]. |

Visualizing Experimental Workflows

The following diagrams illustrate the logical relationships and experimental workflows described in this note.

scFM Zero-Shot Benchmarking

Input: Raw scRNA-seq Test Dataset → Data Preprocessing (QC, Normalization) → Generate Cell Embeddings (Zero-Shot) → Evaluation, branching into Cell Type Clustering (metrics: AvgBIO, ASW) and Batch Integration (metrics: Batch Mixing, PCR)

Pretraining Strategy Comparison

Large scRNA-seq Corpus → three pretraining arms: MLM Pretraining (Bidirectional), CLM Pretraining (Autoregressive), and Sequential CLM then MLM (Biphasic) → Zero-Shot Evaluation on Benchmark Tasks

The journey toward truly foundational models in single-cell biology requires moving beyond the assumption that scaling masked modeling is sufficient. As the quantitative evidence and protocols outlined here demonstrate, rigorous zero-shot evaluation is a critical litmus test. The performance gaps revealed in tasks like clustering and batch integration highlight that the current pretraining objectives may not be fully capturing the universal patterns of biology. Future development must prioritize the design of novel, biologically-grounded pretraining tasks and be validated through the systematic, zero-shot benchmarking methodologies described in this note. Only then can scFMs reliably fulfill their promise as indispensable tools for exploratory discovery in biomedicine and drug development.

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological research by allowing scientists to probe transcriptomic profiles at the cellular level, revealing complex and rare cell populations that are obscured in bulk sequencing approaches [48] [49]. The analysis of this high-dimensional, sparse, and noisy data presents significant computational challenges [16]. In response, two distinct computational paradigms have emerged: traditional analysis methods and single-cell foundation models (scFMs). Traditional methods, such as those based on highly variable genes (HVG) selection, Harmony, and scVI, are well-established, computationally efficient tools designed for specific analytical tasks [4] [50]. In contrast, scFMs are large-scale deep learning models pretrained on millions of cells using self-supervised objectives, with the goal of learning universal biological principles that can be adapted to various downstream applications [16] [1].

The choice between these approaches is not straightforward, as no single scFM consistently outperforms others across all tasks, and simpler models often remain competitive, particularly in zero-shot settings where models are used without further training [16] [4]. This guide provides a structured framework for researchers to navigate this complex model selection landscape, emphasizing practical considerations related to task requirements, computational resources, and biological interpretability.

Understanding the Technologies

Traditional Single-Cell Analysis Methods

Traditional computational approaches for scRNA-seq analysis typically consist of specialized tools organized into analytical pipelines. These include methods for quality control, normalization, feature selection (e.g., Highly Variable Genes), dimensionality reduction (PCA, UMAP), clustering, and differential expression [48] [49]. Established integration algorithms like Harmony and scVI effectively correct for batch effects while preserving biological variation [4]. These methods are characterized by their focused functionality, relatively low computational demands, and well-understood statistical properties [50] [51]. They excel in well-defined analytical scenarios and remain the go-to choice for standard analyses with limited computational resources.

Single-Cell Foundation Models (scFMs)

Single-cell foundation models represent a paradigm shift from task-specific tools to general-purpose models. Inspired by large language models in natural language processing, scFMs treat individual cells as "sentences" and genes or genomic features as "words" or "tokens" [1]. These models, including Geneformer, scGPT, UCE, and scFoundation, are typically built on transformer architectures and pretrained on massive, diverse collections of single-cell data from sources like the CELLxGENE atlas, which contains over 100 million unique cells [16] [1]. Through self-supervised pretraining tasks such as masked gene modeling, scFMs learn latent representations of genes and cells that capture fundamental biological relationships [16]. These representations can then be utilized in zero-shot settings or efficiently fine-tuned for specific downstream applications, potentially uncovering insights that might be missed by traditional approaches [16].

Comparative Performance Analysis

Task-Specific Performance Evaluation

Comprehensive benchmarking studies reveal that the performance of scFMs versus traditional methods varies significantly across different analytical tasks. The table below summarizes their relative performance in key applications:

Table 1: Performance comparison across common single-cell analysis tasks

| Analysis Task | Superior Approach | Key Findings | Performance Metrics |
| --- | --- | --- | --- |
| Cell Type Clustering | Traditional Methods (HVG, scVI, Harmony) | scFMs (Geneformer, scGPT) underperform in zero-shot settings; pretraining provides limited benefit [4] | AvgBIO score, Average Silhouette Width (ASW) [4] |
| Batch Integration | Traditional Methods (HVG, scVI, Harmony) | Geneformer consistently ranks last; scGPT shows mixed results, outperforming baselines only on specific datasets [4] | Principal Component Regression (PCR), batch mixing scores [4] |
| Cell Type Annotation | Context-Dependent | scFMs show promise but require careful evaluation; errors can be measured by ontological proximity (LCAD metric) [16] | Lowest Common Ancestor Distance (LCAD) [16] |
| Drug Sensitivity Prediction | scFMs | Foundation models demonstrate stronger performance in clinically relevant prediction tasks [16] | Task-specific accuracy metrics [16] |
| Knowledge Capture | scFMs | scFMs better capture biological relationships aligned with prior knowledge (e.g., cell ontology) [16] | scGraph-OntoRWR metric [16] |

Zero-Shot Capabilities of scFMs

A critical consideration for researchers is the zero-shot performance of scFMs, where models are applied without any task-specific fine-tuning. This is particularly important in discovery settings where labels are unknown and fine-tuning is not feasible [4]. Current evaluations indicate that scFMs often face reliability challenges in zero-shot configurations and can be outperformed by simpler methods [4] [6]. For instance, in both cell type clustering and batch integration tasks, selecting highly variable genes (HVG) frequently outperforms both Geneformer and scGPT in zero-shot settings [4]. This suggests that the masked language model pretraining framework may not inherently produce high-quality cell embeddings without additional fine-tuning, highlighting a significant limitation for exploratory research [4].
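The HVG baseline that these scFMs fail to beat is itself only a few lines of code. The sketch below is a bare-bones dispersion-based selection on a log-normalized cells x genes matrix; it is a stand-in for production implementations (e.g., scanpy's `highly_variable_genes`), which use binned, normalized dispersions for robustness.

```python
import numpy as np

def select_hvg(log_counts, n_top=2000):
    """Pick highly variable genes by dispersion (variance / mean) on a
    cells x genes log-normalized matrix. Minimal illustrative version."""
    X = np.asarray(log_counts, dtype=float)
    mean = X.mean(axis=0)
    var = X.var(axis=0)
    dispersion = np.divide(var, mean, out=np.zeros_like(var), where=mean > 0)
    n_top = min(n_top, X.shape[1])
    return np.argsort(-dispersion)[:n_top]   # indices of top-dispersion genes
```

The selected gene subset, used directly as the embedding, is the simple baseline that zero-shot scFM embeddings are benchmarked against.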

Decision Framework for Model Selection

Key Selection Criteria

Choosing between scFMs and traditional methods requires careful consideration of multiple factors. The following diagram illustrates the decision workflow:

Start Model Selection → Define Analysis Task → Assess Data & Resources → Identify Primary Goal → either Choose Traditional Methods (standard analysis, limited resources, known cell types) or Choose scFM Approach (novel discovery, complex predictions, adequate resources) → Evaluate & Iterate

Task-Based Recommendations

Different analytical tasks warrant distinct approaches based on empirical performance evidence:

Table 2: Task-specific model recommendations

| Task Category | Recommended Approach | Rationale | Use Case Examples |
| --- | --- | --- | --- |
| Standard Clustering & Annotation | Traditional Methods (HVG + Harmony/scVI) | Established reliability, lower computational cost, interpretable results [4] | Initial cell type identification, standard atlas construction |
| Complex Biological Predictions | scFMs with Fine-tuning | Superior capture of biological relationships, transfer learning capabilities [16] | Drug response prediction, cancer cell identification, developmental trajectories |
| Exploratory Analysis (Unknown Cell Types) | Traditional Methods (Zero-shot) | More reliable zero-shot performance when ground truth is unavailable [4] | Novel cell type discovery, rare cell population identification |
| Batch Integration | Harmony or scVI | Consistent performance across diverse datasets and batch effects [4] | Multi-dataset integration, cross-study comparisons |
| Knowledge-Driven Discovery | scFMs | Better alignment with established biological hierarchies and ontologies [16] | Cell lineage relationships, regulatory network inference |

Resource Considerations

Implementation complexity varies significantly between approaches, impacting their practical feasibility:

  • Computational Resources: scFMs require substantial GPU memory and processing power for both training and inference, whereas traditional methods can typically run on standard workstations or high-performance CPU clusters [16] [51].
  • Expertise Requirements: Traditional methods have more established best practices and interpretable parameters, while scFMs require specialized knowledge in deep learning and transformer architectures [1] [51].
  • Time Constraints: For rapid prototyping and analysis, traditional methods offer faster turnaround times, while scFMs, particularly those requiring fine-tuning, involve more extensive experimental pipelines [16].

Experimental Protocols

Standardized Evaluation Protocol

To ensure fair comparison between approaches, implement this standardized evaluation protocol:

  • Data Partitioning: Split data into training (70%), validation (15%), and test (15%) sets, ensuring balanced representation of biological conditions and batches [16].
  • Baseline Establishment: Implement traditional methods (HVG selection + Harmony/scVI) as performance baselines using standardized parameters [4].
  • scFM Configuration: For scFMs, extract zero-shot embeddings first, then evaluate fine-tuned performance with limited epoch training (3-5 epochs) [16] [4].
  • Metric Calculation: Apply multiple evaluation metrics including clustering quality (AvgBIO, ASW), batch correction (PCR), and biological consistency (scGraph-OntoRWR) [16] [4].
  • Resource Monitoring: Track computational time, memory usage, and hardware requirements for each approach [51].
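The data-partitioning step above can be sketched in Python. This is a minimal illustration, not part of the cited protocols: the function name is ours, and it stratifies on cell-type label only (stratifying jointly on label and batch would be a further refinement).

```python
import numpy as np
from sklearn.model_selection import train_test_split

def partition_indices(labels, seed=0):
    """70/15/15 train/validation/test split, stratified by cell-type label.
    Stratifying jointly on label and batch is a further refinement."""
    labels = np.asarray(labels)
    idx = np.arange(len(labels))
    # hold out 30% first, then split it evenly into validation and test
    train_idx, hold_idx = train_test_split(
        idx, test_size=0.30, stratify=labels, random_state=seed)
    val_idx, test_idx = train_test_split(
        hold_idx, test_size=0.50, stratify=labels[hold_idx], random_state=seed)
    return train_idx, val_idx, test_idx
```

Stratification keeps the proportions of each cell type roughly constant across the three partitions, which matters when rare populations are present.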

Implementation Workflow for scFMs

The following diagram outlines a standardized workflow for implementing and evaluating scFMs:

scFM implementation workflow: Input scRNA-seq Data → Data Preprocessing & Quality Control → Tokenization (Gene Ranking & Value Embedding) → Select scFM Architecture (Geneformer, scGPT, etc.) → Zero-Shot Evaluation → Task-Specific Fine-Tuning → Biological Interpretation & Validation.

The Scientist's Toolkit

Successful implementation of single-cell analysis requires both wet-lab reagents and computational resources:

Table 3: Essential resources for single-cell analysis workflows

| Resource Category | Specific Tools/Reagents | Function/Purpose | Implementation Notes |
| --- | --- | --- | --- |
| Wet-Lab Reagents | 10x Genomics Chromium System | High-throughput single-cell capture and barcoding [48] | Enables processing of thousands to millions of cells |
| Wet-Lab Reagents | Smart-seq2/Smart-seq3 Reagents | Full-length transcript coverage for alternative splicing analysis [48] [49] | Lower throughput but superior transcript characterization |
| Wet-Lab Reagents | Unique Molecular Identifiers (UMIs) | Molecular counting and PCR bias correction [48] [49] | Critical for accurate quantification; typically 4-8 bp sequences |
| Computational Tools | Scanpy, Seurat | Standard pipelines for traditional single-cell analysis [4] [49] | Python/R environments respectively |
| Computational Tools | Harmony, scVI | Batch effect correction and data integration [4] | Essential for multi-dataset analyses |
| Computational Tools | Geneformer, scGPT | Foundation model architectures for transfer learning [16] [4] | Pretrained models available with specific tokenization schemes |
| Data Resources | CELLxGENE, Human Cell Atlas | Curated single-cell data for pretraining and benchmarking [1] | Contains >100 million cells across tissues and conditions |

The choice between single-cell foundation models and traditional methods represents a strategic decision that should be guided by specific research questions, available resources, and task requirements. Traditional methods remain robust, efficient solutions for standard analytical tasks, particularly in zero-shot scenarios and resource-constrained environments. In contrast, scFMs offer exciting potential for uncovering novel biological insights, especially in complex prediction tasks where their transfer learning capabilities and knowledge capture provide distinct advantages. As the field evolves, the most effective approach will likely involve thoughtful integration of both paradigms, leveraging their complementary strengths to advance single-cell research and therapeutic development.

Rigorous Benchmarking and Comparative Analysis of Model Capabilities

Zero-shot evaluation represents a critical testing ground for single-cell foundation models (scFMs). Unlike fine-tuning, where models are further trained on specific tasks, zero-shot assessment requires models to perform tasks immediately after pretraining, using their learned representations without any additional task-specific training [4]. This approach is vital for biological discovery settings where predefined labels are unavailable, and it provides a rigorous test of whether a model has genuinely learned fundamental biological principles [4] [3]. Recent evaluations have revealed that scFMs often underperform compared to simpler traditional methods in zero-shot settings, highlighting an urgent need for standardized, robust benchmarking practices [4] [5] [3]. This document establishes comprehensive application notes and protocols for zero-shot evaluation of scFMs, providing the research community with standardized datasets, metrics, and experimental frameworks.

Critical Datasets for Benchmarking

A robust zero-shot benchmark requires diverse datasets that represent various biological conditions, technologies, and challenges. The table below summarizes essential characteristics of key benchmarking datasets identified from recent evaluations.

Table 1: Essential Datasets for Zero-Shot scFM Benchmarking

| Dataset Name | Tissue/Origin | Key Characteristics | Cell Count (Approx.) | Notable Features for Evaluation |
| --- | --- | --- | --- | --- |
| Pancreas [4] [16] | Pancreas | Multiple experimental techniques | Varies | Significant batch effects between techniques |
| PBMC (12k) [4] | Peripheral blood mononuclear cells | Standardized immune cell profiling | ~12,000 | Technical variation across experiments |
| Tabula Sapiens [4] [16] | Multiple tissues | Multiple organ systems | ~600,000 | Cross-tissue heterogeneity |
| Immune Cell Atlas [4] | Immune cells | Diverse immune populations | Varies | Biological and technical variation |
| AIDA v2 [16] | Multiple tissues | Asian immune diversity | Varies | Independent, unbiased validation |
| Cancer datasets [16] | Multiple cancer types | Clinical relevance | Varies | Intra-tumor heterogeneity |

These datasets collectively provide the variation necessary to stress-test scFMs. The Pancreas dataset is particularly valuable for evaluating batch integration capabilities, as it contains data generated using different experimental techniques [4]. Tabula Sapiens offers cross-tissue complexity, while immune cell datasets capture diverse cell states. The inclusion of cancer datasets enables assessment of clinical relevance, and AIDA v2 serves as a completely independent validation set to mitigate risks of data leakage from pretraining corpora [16].

When constructing benchmarks, researchers should consider the potential overlap between evaluation datasets and those used in model pretraining. Some studies have found that scFMs do not consistently outperform baselines even on datasets seen during pretraining, suggesting limitations in how well the pretraining objective aligns with downstream zero-shot tasks [4].

Key Evaluation Metrics and Their Interpretation

Comprehensive zero-shot evaluation requires multiple metrics that capture different aspects of model performance. The following table organizes the essential metrics for scFM evaluation.

Table 2: Key Metrics for Zero-Shot scFM Evaluation

| Metric Category | Specific Metrics | Interpretation and Biological Relevance |
| --- | --- | --- |
| Cell Type Clustering | Average BIO (AvgBIO) Score [4], Average Silhouette Width (ASW) [4] | Measures separation of known cell types in embedding space; higher values indicate better biological relevance |
| Batch Integration | Principal Component Regression (PCR) Score [4], Batch Mixing Scores [4] | Quantifies removal of technical artifacts while preserving biological variation; lower PCR indicates better integration |
| Biological Plausibility | scGraph-OntoRWR [16], Lowest Common Ancestor Distance (LCAD) [16] | Measures consistency with established biological knowledge from cell ontologies |
| Perturbation Prediction | Perturbation Effect Scores [52] | Assesses prediction accuracy of cellular responses to genetic or chemical perturbations |
| Landscape Analysis | Roughness Index (ROGI) [16] | Quantifies smoothness of cell-property landscape in latent space; smoother landscapes facilitate downstream task learning |

The scGraph-OntoRWR metric represents a significant advancement in evaluating biological relevance. It measures the consistency between cell-type relationships captured by scFM embeddings and established biological knowledge in cell ontologies, providing a knowledge-aware assessment beyond purely statistical measures [16]. Similarly, LCAD evaluates the severity of cell type misannotation by measuring the ontological proximity between misclassified cell types, recognizing that not all annotation errors are equally serious [16].

For perturbation prediction, specialized benchmarks like PertEval-scFM provide standardized frameworks for assessing how well zero-shot embeddings capture information about cellular responses to genetic and chemical perturbations [52]. Performance in this area is particularly important for drug discovery applications.

Experimental Protocols for Zero-Shot Evaluation

Core Zero-Shot Evaluation Workflow

The following diagram illustrates the standardized workflow for zero-shot evaluation of single-cell foundation models:

Zero-shot evaluation workflow: Start Evaluation → Load Pretrained scFM → Input Benchmark Dataset → Generate Cell Embeddings (Zero-Shot) → evaluate, in parallel, Cell Type Clustering, Batch Integration, and Biological Plausibility → Compare Against Baseline Methods → Generate Benchmark Report.

Zero-Shot scFM Evaluation Workflow

Protocol 1: Cell Type Clustering Evaluation

Purpose: To assess the ability of scFM embeddings to separate known cell types without additional training.

Materials:

  • Pretrained scFM (e.g., scGPT, Geneformer, UCE, scFoundation)
  • Benchmark dataset with ground truth cell type labels
  • Baseline methods (HVG selection, Harmony, scVI)
  • Computing environment with adequate GPU resources

Procedure:

  1. Data Preparation: Standardize the input dataset using the scFM's predefined preprocessing pipeline. Ensure no dataset-specific normalization is applied that could constitute implicit fine-tuning.
  2. Embedding Generation: Pass each cell through the scFM in inference mode to extract the cell embeddings. For transformer models, this is typically the [CLS] token embedding or the mean of all token embeddings.
  3. Dimensionality Reduction: Apply Uniform Manifold Approximation and Projection (UMAP) to reduce embeddings to two dimensions for visualization.
  4. Clustering Analysis: Perform Leiden clustering on the embeddings without using ground truth labels.
  5. Metric Calculation: Compute clustering metrics including:
     • Average BIO score (AvgBIO) to measure cell type separation
     • Average silhouette width (ASW) for cluster compactness
     • Adjusted Rand Index (ARI) for similarity to ground truth
  6. Baseline Comparison: Repeat steps 2-5 with baseline methods including highly variable gene (HVG) selection, Harmony, and scVI.
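The clustering and metric-calculation steps can be approximated with a dependency-light sketch. The function name is ours, and k-means stands in for Leiden (which would normally be run via scanpy) purely to keep the example self-contained:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score,
                             normalized_mutual_info_score, silhouette_score)

def clustering_metrics(embeddings, true_labels, n_clusters):
    """Cluster zero-shot embeddings and score them against ground-truth labels.
    k-means stands in for Leiden (normally run via scanpy) in this sketch."""
    pred = KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=0).fit_predict(embeddings)
    return {
        "ARI": adjusted_rand_score(true_labels, pred),
        "NMI": normalized_mutual_info_score(true_labels, pred),
        "ASW": silhouette_score(embeddings, true_labels),  # cell-type silhouette
    }
```

Running the same function on scFM embeddings and on each baseline's embeddings yields directly comparable scores.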

Interpretation: Superior scFM performance should demonstrate consistently high scores across multiple datasets and metrics. Current evidence suggests that HVG selection often outperforms scFMs in zero-shot settings, providing a critical baseline for comparison [4].

Protocol 2: Batch Integration Assessment

Purpose: To evaluate how well scFM embeddings remove technical batch effects while preserving biological variation.

Materials:

  • Benchmark dataset with significant batch effects (e.g., Pancreas dataset with multiple experimental techniques)
  • Same baseline methods as Protocol 1
  • Batch integration metrics suite

Procedure:

  • Dataset Selection: Choose a dataset with pronounced batch effects from multiple sources or technologies.
  • Embedding Generation: Generate cell embeddings using the scFM as in Protocol 1.
  • Visual Assessment: Create UMAP plots colored by batch and cell type to qualitatively assess integration.
  • Quantitative Metrics: Calculate:
    • Principal Component Regression (PCR) score: proportion of variance explained by batch
    • Batch mixing scores: how well batches are intermixed within cell types
    • Biological conservation scores: preservation of cell type separation after integration
  • Comparative Analysis: Compare results against Harmony and scVI, which represent state-of-the-art batch integration methods.
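The PCR score described above can be sketched minimally as the variance-weighted R² of batch identity regressed on principal components. This is a simplified reading of the metric, not the reference implementation from the cited benchmarks, and the function name is ours:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_batch_score(embeddings, batch_labels, n_pcs=20):
    """Variance-weighted R^2 of batch identity regressed on principal components.
    Higher values indicate stronger residual batch effects in the embedding."""
    n_pcs = min(n_pcs, embeddings.shape[1])
    pca = PCA(n_components=n_pcs).fit(embeddings)
    pcs = pca.transform(embeddings)
    _, inv = np.unique(batch_labels, return_inverse=True)
    onehot = np.eye(inv.max() + 1)[inv]          # one-hot batch design matrix
    r2 = np.array([LinearRegression().fit(onehot, pcs[:, i])
                   .score(onehot, pcs[:, i]) for i in range(n_pcs)])
    w = pca.explained_variance_ratio_
    return float(np.sum(r2 * w) / np.sum(w))
```

A well-integrated embedding should score near zero; an embedding dominated by batch structure will score close to one.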

Interpretation: Effective batch integration should show low PCR scores (minimal batch effect) while maintaining clear separation of biologically distinct cell types. Current evaluations indicate that scFMs often struggle with batch integration, sometimes showing higher batch effects than the original data [4].

Protocol 3: Biological Plausibility Evaluation

Purpose: To assess whether scFM embeddings capture biologically meaningful relationships consistent with established knowledge.

Materials:

  • Cell ontology resources (e.g., Cell Ontology)
  • scGraph-OntoRWR implementation [16]
  • Gene ontology databases

Procedure:

  • Embedding Generation: Generate cell embeddings as in previous protocols.
  • Cell-Type Relationship Mapping: Calculate pairwise distances between cell-type centroids in the embedding space.
  • Ontology-Based Random Walk: Implement the scGraph-OntoRWR algorithm:
    • Construct a graph from cell ontology relationships
    • Perform random walks with restarts from each cell type
    • Measure correlation between ontology-based and embedding-based similarity
  • LCAD Calculation: For cell type annotation tasks, compute the Lowest Common Ancestor Distance between misclassified cell types to assess semantic severity of errors.
  • Gene Function Analysis: Evaluate gene embeddings for functional coherence using gene ontology enrichment.
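A heavily simplified stand-in for the scGraph-OntoRWR comparison might look like the following. It assumes both the embedding-derived graph and the ontology are supplied as dense adjacency matrices over the same set of cell-type nodes; the actual metric in [16] differs in its construction details:

```python
import numpy as np
from scipy.stats import spearmanr

def rwr(adj, restart=0.3, n_iter=200):
    """Random walk with restart from every node of a connected graph;
    returns an n x n matrix of visitation probabilities."""
    W = adj / adj.sum(axis=0, keepdims=True)   # column-stochastic transitions
    n = adj.shape[0]
    P = np.eye(n)
    for _ in range(n_iter):
        P = (1 - restart) * W @ P + restart * np.eye(n)
    return P

def ontology_consistency(emb_adj, onto_adj):
    """Spearman correlation between RWR profiles of the embedding-derived
    graph and the ontology graph (same cell-type nodes in both)."""
    rho, _ = spearmanr(rwr(emb_adj).ravel(), rwr(onto_adj).ravel())
    return rho
```

A correlation near 1 means the embedding's implicit cell-type neighborhood structure mirrors the ontology's; lower values flag biologically implausible geometry.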

Interpretation: High scGraph-OntoRWR scores indicate that the embedding space reflects established biological knowledge. The LCAD metric provides nuanced evaluation of annotation errors, recognizing that confusing closely related cell types is less severe than confusing distantly related ones [16].

The Scientist's Toolkit: Essential Research Reagents

Table 3: Essential Research Reagents and Computational Tools for scFM Benchmarking

| Tool/Resource | Type | Function in Evaluation | Access Information |
| --- | --- | --- | --- |
| CZ CELLxGENE [1] | Data Platform | Provides standardized access to millions of single-cell datasets | Publicly available at cellxgene.cziscience.com |
| Geneformer [4] [16] | scFM | Transformer-based model for single-cell analysis | Available through Hugging Face |
| scGPT [4] [16] | scFM | Generative pretrained transformer for single-cell data | GitHub repository |
| Harmony [4] [16] | Integration Method | Baseline for batch integration evaluation | R/Python packages |
| scVI [4] [16] | Generative Model | Baseline for probabilistic modeling of scRNA-seq data | Python package |
| PertEval-scFM [52] | Benchmark Framework | Specialized evaluation of perturbation prediction | GitHub repository |
| AIDA v2 [16] | Benchmark Dataset | Independent validation dataset for unbiased evaluation | Available through CELLxGENE |

Analysis and Future Directions

Current zero-shot evaluations reveal significant limitations in scFMs. Multiple studies have demonstrated that these models often fail to outperform simpler baselines across various tasks, including cell type clustering and batch integration [4] [5] [3]. The masked language model pretraining objective, while successful in NLP, may not be optimally aligned with biological learning for single-cell data [3]. Furthermore, models show inconsistent performance even on datasets included in their pretraining corpora, suggesting fundamental limitations in how they capture and retain biological information [4].

The relationship between pretraining dataset scale and model performance appears complex. While some evidence suggests that increased pretraining data confers benefits, there may be diminishing returns, with larger datasets not necessarily translating to better zero-shot capabilities [4]. This highlights the need for improved pretraining strategies rather than simply scaling dataset size.

Future benchmark development should prioritize several key areas: First, creating more challenging evaluation tasks that require deeper biological reasoning, such as predicting cellular responses to novel perturbations [52] [31]. Second, developing better metrics that directly measure biological insight rather than just statistical patterns. Third, establishing rigorous standards to prevent data leakage between pretraining and evaluation sets. Finally, creating more nuanced evaluations that consider the practical contexts in which scFMs will be deployed, particularly for clinical and drug discovery applications [16] [31].

As the field matures, benchmarks must evolve beyond simple performance comparisons to provide diagnostic insights into why models succeed or fail. This will require closer integration of biological expertise in benchmark design and interpretation, ensuring that evaluations measure not just statistical patterns but meaningful biological understanding.

Single-cell RNA sequencing (scRNA-seq) has fundamentally transformed biological research by enabling the characterization of gene expression at the level of individual cells. A cornerstone of scRNA-seq analysis is cell type clustering, the process of grouping cells based on transcriptional similarity to identify distinct cellular populations. The emergence of single-cell foundation models (scFMs)—large-scale deep learning models pretrained on millions of cells—promises a new paradigm for this task. These models, including Geneformer and scGPT, are designed to learn universal biological principles from vast data corpora, which can then be applied to various downstream analyses, ideally without additional task-specific training (a "zero-shot" setting) [1].

This application note provides a structured, evidence-based comparison of these novel scFMs against established traditional methods for cell type clustering. We focus on a zero-shot evaluation framework, which is critical for discovery-driven research where cell type labels are unknown and fine-tuning is impractical [4]. We synthesize findings from recent, rigorous benchmarks to guide researchers and drug development professionals in selecting the most effective and reliable methods for their specific experimental contexts.

Quantitative Performance Comparison

Recent comprehensive benchmarking studies have evaluated the performance of scFMs against traditional methods on multiple datasets with known cell type labels. Performance is typically measured using clustering metrics like the Average BIO score (AvgBio) and Average Silhouette Width (ASW), which assess how well the clusters match the true biological labels.

Table 1: Zero-shot Cell Type Clustering Performance (AvgBio Score) [4]

| Method Category | Specific Method | PBMC (12k) | Tabula Sapiens | Pancreas | Immune Dataset |
| --- | --- | --- | --- | --- | --- |
| Single-cell Foundation Models (scFMs) | Geneformer | Underperforms baselines | Underperforms baselines | Underperforms baselines | Underperforms baselines |
| Single-cell Foundation Models (scFMs) | scGPT | Comparable to scVI | Underperforms HVG/scVI | Underperforms HVG/scVI | Underperforms HVG/scVI |
| Traditional Methods | HVG (Selection) | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |
| Traditional Methods | Harmony | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |
| Traditional Methods | scVI | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs | Outperforms scFMs |

A key finding across multiple studies is that in a zero-shot setting, traditional methods consistently match or surpass the performance of scFMs on cell type clustering. Notably, a simple baseline method like selecting Highly Variable Genes (HVG) often outperforms both Geneformer and scGPT [4] [3]. More advanced traditional methods, such as the deep learning-based scVI and the linear transformation-based Harmony, also demonstrate superior and more reliable clustering accuracy across diverse tissues and technologies [4].

Table 2: Overall Method Characteristics for Cell Type Clustering [16] [4] [53]

| Method | Clustering Accuracy (Zero-shot) | Batch Integration | Computational Efficiency | Interpretability | Ideal Use Case |
| --- | --- | --- | --- | --- | --- |
| Geneformer | Limited | Poor | Moderate | Low | Tasks requiring fine-tuning |
| scGPT | Variable | Moderate | High resource demands | Low | Exploratory analysis on similar data |
| HVG Selection | Good | Limited | Very high | High | Fast initial analysis on well-standardized data |
| Harmony | Good | Excellent | High | Medium | Integrating multiple datasets with strong batch effects |
| scVI | Good | Excellent | Moderate (requires GPU) | Medium | Large-scale data integration; downstream generative tasks |

Experimental Protocols for Performance Benchmarking

To ensure reproducible and fair comparisons between methods, researchers should adhere to standardized benchmarking protocols. The following section outlines the experimental workflow and detailed methodologies used in the cited studies.

The following diagram illustrates the standard workflow for benchmarking single-cell clustering methods, from data input to performance evaluation.

Benchmarking workflow: Input scRNA-seq Dataset (Count Matrix & Metadata) → Data Preprocessing (QC, Normalization, HVG Selection) → Method Application & Embedding Generation, using either scFMs in zero-shot mode (Geneformer, scGPT) or traditional methods (HVG Selection, Harmony, scVI) → Clustering Algorithm (e.g., Leiden, k-means) → Performance Evaluation (metrics: ARI, NMI, ASW, Bio Score).

Protocol 1: Zero-shot Clustering with Precomputed Embeddings

This protocol evaluates the intrinsic quality of cell representations generated by models without any task-specific training [4].

  • Input Data: A processed scRNA-seq dataset with ground truth cell type labels.
  • Feature Extraction:
    • For scFMs (Geneformer/scGPT): Generate cell embeddings in a zero-shot manner using the publicly available pretrained models without any fine-tuning on the target dataset.
    • For Traditional Methods:
      • HVG Selection: Reduce the dataset to the top 2,000 highly variable genes.
      • Harmony: Apply Harmony to the principal components (PCs) of the gene expression matrix to obtain integrated PCs.
      • scVI: Train a scVI model on the raw count data and extract the latent representation.
  • Clustering: Apply a standard clustering algorithm (e.g., Leiden, k-means) to the embeddings from each method. Use fixed hyperparameters (like resolution) across all methods for a fair comparison.
  • Evaluation: Compare the clustering results to the ground truth labels using metrics like Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Average BIO score (AvgBio) [4] [54].
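The traditional-method branch of the feature-extraction step can be sketched without scanpy. The function name is illustrative; the normalization target (10,000 counts per cell) and variance-based HVG ranking are common defaults rather than the exact pipeline of the cited studies, whose HVG selection is more refined:

```python
import numpy as np
from sklearn.decomposition import PCA

def hvg_pca_embedding(counts, n_hvg=2000, n_pcs=50):
    """Baseline embedding: library-size normalise, log1p, keep the most
    variable genes, then PCA. A dependency-light stand-in for the
    scanpy HVG + PCA pipeline."""
    lib = np.maximum(counts.sum(axis=1, keepdims=True), 1.0)
    logged = np.log1p(counts / lib * 1e4)      # normalise to 10k counts per cell
    order = np.argsort(logged.var(axis=0))[::-1]
    reduced = logged[:, order[:min(n_hvg, counts.shape[1])]]
    n_pcs = min(n_pcs, reduced.shape[0], reduced.shape[1])
    return PCA(n_components=n_pcs).fit_transform(reduced)
```

The resulting PC matrix is what Harmony would subsequently adjust for batch, and it feeds the same clustering step used for scFM embeddings.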

Protocol 2: Benchmarking Batch Integration Capability

This protocol assesses a model's ability to mix cells from different batches while preserving biological distinctness, a key challenge in single-cell analysis [4] [55].

  • Input Data: Select a benchmark dataset with known, strong batch effects (e.g., the Pancreas dataset with five different technologies).
  • Embedding Generation: Generate cell embeddings using all methods (as in Protocol 1).
  • Visualization & Quantitative Metrics:
    • Generate UMAP plots colored by batch and by cell type.
    • Calculate quantitative integration metrics:
      • iLISI (Integration LISI): Measures the effective number of datasets/batches in a local neighborhood. A higher score indicates better batch mixing [56] [55].
      • cLISI (Cell-type LISI): Measures the effective number of cell types in a local neighborhood. A score close to 1 indicates good separation of cell types [56].
      • Principal Component Regression (PCR): Quantifies the proportion of variance in the embeddings explained by batch after correcting for cell type.
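A simplified, unweighted version of the LISI computation can be sketched as follows; the published metric additionally uses perplexity-calibrated Gaussian neighborhood weights, which this sketch omits, and the function name is ours:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def lisi(embeddings, labels, k=30):
    """Mean inverse Simpson index over k-NN neighbourhoods. With batch labels
    this behaves like iLISI (1 = unmixed, n_batches = fully mixed); with
    cell-type labels, like cLISI (closer to 1 = better separation)."""
    _, idx = NearestNeighbors(n_neighbors=k).fit(embeddings).kneighbors(embeddings)
    labels = np.asarray(labels)
    scores = []
    for neigh in idx:
        _, counts = np.unique(labels[neigh], return_counts=True)
        p = counts / counts.sum()
        scores.append(1.0 / np.sum(p ** 2))  # inverse Simpson index
    return float(np.mean(scores))
```

Calling the same function with batch labels and with cell-type labels yields the iLISI-style and cLISI-style scores described above.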

Successful execution of the benchmarking protocols requires a suite of computational tools and data resources. The table below details key solutions used in the featured studies.

Table 3: Key Research Reagent Solutions for Single-Cell Clustering Benchmarking

| Category | Item / Software | Function / Description | Key Features |
| --- | --- | --- | --- |
| Foundation Models | Geneformer [16] [4] | Transformer model pretrained on 30M cells; uses gene ranking for tokenization. | Emergent network insights; fine-tuning for target tasks. |
| Foundation Models | scGPT [16] [4] [1] | Transformer model pretrained on 33M cells; supports multi-omics. | Generative capabilities; cell-centric pretraining. |
| Traditional Methods | Harmony [4] [56] [55] | Fast, iterative integration algorithm for removing batch effects. | High speed and low memory use; operates on PCs. |
| Traditional Methods | scVI [4] [53] [55] | Deep generative model for scRNA-seq data based on variational autoencoders. | Probabilistic modeling; handles raw counts. |
| Traditional Methods | HVG Selection [4] [53] | Basic feature selection to retain most variable genes. | Simple, fast, and highly effective baseline. |
| Data Resources | CELLxGENE [16] [1] | Curated atlas of single-cell data. | Source of standardized datasets for training/evaluation. |
| Data Resources | AIDA v2 [16] | Asian Immune Diversity Atlas; used for unbiased validation. | Independent dataset to mitigate data leakage risks. |
| Evaluation Metrics | LISI (iLISI/cLISI) [56] [55] | Metrics for evaluating batch mixing and cell type separation. | Local assessment of integration quality. |
| Evaluation Metrics | ARI / NMI [54] | Metrics comparing clustering result to ground truth labels. | Standard measures for clustering accuracy. |

The evidence demonstrates that there is no single "best" method universally superior for all clustering scenarios. The choice depends on the specific research context, goals, and constraints. The following decision diagram synthesizes the benchmark findings into a practical guide for method selection.

Decision guide for cell type clustering method selection:

  • Q1: Is this an exploratory analysis with no labeled data for fine-tuning? If yes (zero-shot), use traditional methods (HVG, Harmony, scVI) and proceed to Q2; if no (fine-tuning is possible), go to Q4.
  • Q2: Is the primary challenge integrating multiple datasets with strong batch effects? If yes, use Harmony or scVI, which excel at batch integration while preserving biological variance; if no, go to Q3.
  • Q3: Are computational speed and simplicity top priorities? If yes, use HVG selection, which is fast, interpretable, and often outperforms complex scFMs; if no, use Harmony or scVI.
  • Q4: Is the goal to discover novel biology beyond predefined cell types? If yes, consider scFMs with caution: they offer potential for novel insights, but validate findings against other methods.

Current evidence indicates that for the critical task of zero-shot cell type clustering, traditional methods like Harmony, scVI, and even simple HVG selection provide more robust, accurate, and computationally efficient results than the current generation of single-cell foundation models [4] [3]. While scFMs represent a promising architectural advance and may excel in other tasks like perturbation prediction [18] or when fine-tuned, their zero-shot embeddings do not yet consistently capture biological reality for clustering as effectively as established techniques.

Therefore, for researchers and drug development professionals, the recommended practice is to use traditional methods as the primary tool for cell type discovery and atlas construction. scFMs should be approached as emerging technologies; their results should be rigorously validated against traditional method outputs and biological priors. Future developments in model architecture, pretraining objectives, and data curation are needed to close this performance gap and realize the full potential of foundation models in single-cell biology [16] [1].

The emergence of single-cell foundation models (scFMs) represents a paradigm shift in computational biology, offering unprecedented capacity to analyze cellular heterogeneity and function. However, a critical challenge persists: how to rigorously evaluate whether these models capture biologically meaningful patterns beyond mere technical performance on computational tasks. Traditional metrics for clustering accuracy or batch integration often fail to assess the biological relevance of learned representations. This gap has prompted the development of novel ontology-informed evaluation metrics, particularly scGraph-OntoRWR, which quantifies the alignment between computational model outputs and established biological knowledge [16] [57]. These metrics introduce a crucial biological ground truth into model assessment, enabling researchers to determine whether scFMs truly understand cellular biology or merely excel at pattern recognition without semantic understanding.

The integration of biological ontologies provides the formal scaffolding necessary for this evaluation approach. Biological ontologies are structured, controlled vocabularies that capture hierarchical relationships between biological entities—from genes and proteins to cell types and physiological processes [57]. By leveraging these comprehensive knowledge structures, researchers can now quantitatively measure how well the relational patterns discovered by scFMs correspond to biologically verified relationships. This approach is particularly valuable for evaluating zero-shot learning capabilities in scFMs, where models must generalize to novel datasets without task-specific fine-tuning [4].

Biological Ontologies: The Framework for Knowledge Representation

Foundations of Biological Knowledge Representation

Biological ontologies provide a formal, explicit specification of shared conceptualizations within the biological domain, capturing not just definitions but the intricate logical relationships between biological concepts [57]. Unlike simple databases or glossaries, ontologies structure knowledge through standardized relationship types such as "is_a" (denoting classification hierarchies), "part_of" (representing mereological relationships), and "participates_in" (connecting entities to processes). The Open Biological and Biomedical Ontology (OBO) Foundry represents a major community effort to coordinate ontology development across biological sciences, establishing best practices and standardized relationship definitions to ensure interoperability and logical consistency [57].

Two fundamental concept types form the bedrock of most biological ontologies. Continuants are entities that persist through time while maintaining their identity, such as molecules, cells, tissues, and organs. Occurrents are time-dependent entities including processes, actions, and states—for example, biochemical reactions, cell division, or disease progression [57]. This distinction is crucial for proper knowledge representation, as it helps avoid common modeling errors, such as confusing a physical structure with the processes it participates in.

Ontologies in Single-Cell Biology

In single-cell biology, ontologies provide essential organization for the extremely complex and high-dimensional data generated by technologies like scRNA-seq. Cell ontologies specifically define cell types and their relationships in a standardized hierarchy, capturing developmental lineages and functional classifications [57]. For example, a cell ontology would specify that a "cardiac muscle cell" is a subtype of "muscle cell," which in turn is a subtype of "animal cell," while also representing that it is "part_of" the heart and "participates_in" muscle contraction processes.

These structured relationships provide the biological ground truth against which computational models can be evaluated. When a model represents two cell types as similar, ontology-based metrics can determine whether this computational similarity reflects established biological relationships—such as developmental lineage or functional similarity—or represents biologically nonsensical associations [16].

Novel Metrics for Evaluating Biological Insight

The scGraph-OntoRWR Metric

The scGraph-OntoRWR (Single-Cell Graph-Ontology Random Walk with Restart) metric represents a significant advancement in evaluating the biological relevance of scFM embeddings [16]. This innovative metric operates by comparing the relational structure between cell types learned by computational models against the known hierarchical structure encoded in biological ontologies.

The metric employs a random walk with restart algorithm on a cell-cell similarity graph constructed from model embeddings. This algorithm simulates a random traverser that moves between similar cells in the computational embedding space, with occasional restarts to maintain locality. The resulting visitation probabilities capture the implicit relational structure that the model has learned between different cell types [16].

Simultaneously, the same random walk process is applied to the formal cell ontology, where relationships are biologically validated and semantically meaningful. By comparing the probability distributions generated from the computational embeddings against those from the formal ontology, scGraph-OntoRWR quantifies the consistency between model-derived cell relationships and established biological knowledge [16]. A high scGraph-OntoRWR score indicates that the computational model has learned to represent cell types in a manner that respects their known biological relationships, suggesting genuine biological insight rather than merely technical pattern recognition.
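The core random-walk step can be sketched in a few lines of NumPy. This is a minimal illustration on a generic weighted graph, not the authors' implementation; the default restart probability r = 0.3 and convergence threshold match the values given in the protocol later in this section, and the code assumes every node has at least one edge:

```python
import numpy as np

def rwr(adjacency, seed, restart=0.3, tol=1e-6, max_iter=10000):
    """Random walk with restart on a weighted graph.

    adjacency : (n, n) nonnegative edge-weight matrix (every node needs an edge)
    seed      : (n,) nonnegative restart distribution (e.g., 1 on seed cells)
    Returns the stationary visitation probabilities.
    """
    # Column-normalize edge weights into a column-stochastic transition matrix.
    W = adjacency / adjacency.sum(axis=0, keepdims=True)
    p0 = seed / seed.sum()
    p = p0.copy()
    for _ in range(max_iter):
        # With probability (1 - r) step along an edge; with probability r restart.
        p_next = (1 - restart) * (W @ p) + restart * p0
        if np.abs(p_next - p).sum() < tol:
            return p_next
        p = p_next
    return p
```

Seeding the walk from a cell type's representative cells and averaging the resulting distributions yields the per-type visitation profiles that the metric then compares across the two graphs.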

Lowest Common Ancestor Distance (LCAD)

The Lowest Common Ancestor Distance (LCAD) metric provides a complementary approach to evaluating model errors in biologically meaningful terms [16]. Rather than treating all misclassifications equally, LCAD assesses the severity of cell type annotation errors by measuring their distance within the ontological hierarchy.

When a model misclassifies a cell type, LCAD calculates how closely related the predicted and actual cell types are within the ontology by identifying their lowest common ancestor and measuring the ontological proximity between them [16]. For example, misclassifying a "T helper cell" as a "cytotoxic T cell" represents a less severe error than misclassifying it as a "neuron," as T cell subtypes share a more recent common ancestor in the cell ontology. The former error might reflect incomplete learning of fine-grained distinctions, while the latter suggests a fundamental failure to capture major cell lineage differences.

This ontology-informed error assessment provides crucial context for model evaluation, helping researchers distinguish between biologically reasonable mistakes and nonsensical predictions [16]. By incorporating LCAD alongside traditional accuracy metrics, researchers gain a more nuanced understanding of model performance that respects biological reality.
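The error-severity calculation can be illustrated with a toy child-to-parent map. This is a hypothetical miniature ontology built for illustration; a real implementation would parse the full Cell Ontology (e.g., with pronto) and handle terms with multiple parents:

```python
def ancestors(term, parent):
    """Chain from a term up to the root in a child -> parent map."""
    chain = [term]
    while term in parent:
        term = parent[term]
        chain.append(term)
    return chain

def lcad(true_type, pred_type, parent):
    """Sum of hop distances from both terms to their lowest common ancestor."""
    depth_in_pred = {t: d for d, t in enumerate(ancestors(pred_type, parent))}
    for d_true, t in enumerate(ancestors(true_type, parent)):
        if t in depth_in_pred:
            return d_true + depth_in_pred[t]
    raise ValueError("terms share no common ancestor")

# Hypothetical toy hierarchy mirroring the example above.
parent = {
    "T helper cell": "T cell", "cytotoxic T cell": "T cell",
    "cardiac muscle cell": "muscle cell", "skeletal muscle cell": "muscle cell",
    "T cell": "animal cell", "muscle cell": "animal cell", "neuron": "animal cell",
    "animal cell": "cell",
}
```

On this toy hierarchy, `lcad("T helper cell", "cytotoxic T cell", parent)` returns 2 (LCA "T cell"), while `lcad("T helper cell", "neuron", parent)` returns 3, reflecting the more distant LCA "animal cell".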

Performance Benchmarks of Single-Cell Foundation Models

Recent comprehensive benchmarking studies have applied these novel metrics to evaluate six prominent single-cell foundation models (Geneformer, scGPT, UCE, scFoundation, LangCell, and scCello) across diverse tasks including batch integration, cell type annotation, cancer cell identification, and drug sensitivity prediction [16]. The results reveal that no single scFM consistently outperforms others across all tasks, emphasizing that model selection must be tailored to specific applications and data characteristics.

Table 1: Performance Ranking of Single-Cell Foundation Models Across Biological Tasks [16]

Model Batch Integration Cell Type Annotation Cancer ID Drug Sensitivity Overall Ranking
Geneformer 2 3 1 2 2
scGPT 3 2 3 3 3
UCE 1 4 4 4 4
scFoundation 4 1 2 1 1
Traditional ML 5 5 5 5 6
HVG Selection 6 6 6 6 5

The benchmarking demonstrated that foundation models generally show remarkable robustness and versatility across diverse applications, while simpler machine learning models sometimes adapt more efficiently to specific datasets, particularly under resource constraints [16]. Notably, the pretrained zero-shot scFM embeddings captured meaningful biological insights into the relational structure of genes and cells, which proved beneficial for downstream tasks. Performance improvements correlated with what researchers termed "a smoother landscape" in the pretrained latent space, reducing the difficulty of training task-specific models [16].

Table 2: The Scientist's Toolkit: Essential Research Reagents and Resources [57]

Reagent/Resource Function Biological Significance
Gene Embeddings Numerical representations of genes in latent space Capture functional similarities between genes based on co-expression patterns across diverse cellular contexts
Cell Ontologies Structured vocabularies defining cell types and relationships Provide ground truth for evaluating biological relevance of model outputs
Attention Mechanisms Model components that identify important relationships between inputs Reveal gene-gene interactions and regulatory relationships learned from data
Benchmark Datasets Curated single-cell data with high-quality annotations Enable standardized evaluation and comparison of different modeling approaches
GO Term Annotations Gene Ontology functional classifications Serve as biological prior knowledge for validating gene embeddings

Experimental Protocols for Knowledge Alignment Assessment

Protocol: Implementing scGraph-OntoRWR Evaluation

Objective: Quantify the alignment between cell-type relationships learned by a single-cell foundation model and established biological knowledge encoded in cell ontologies.

Materials and Reagents:

  • Pre-trained single-cell foundation model (e.g., Geneformer, scGPT, scFoundation)
  • Reference single-cell dataset with high-quality cell type annotations (e.g., from CELLxGENE)
  • Cell ontology (e.g., Cell Ontology from OBO Foundry)
  • Computational environment with Python and libraries including scanpy, scikit-learn, and ontology processing packages

Procedure:

  • Embedding Generation:
    • Process the reference dataset through the scFM in zero-shot mode to generate cell embeddings without any fine-tuning.
    • Normalize embeddings using L2 normalization to ensure comparable distance metrics.
  • Cell-Cell Graph Construction:

    • Construct a k-nearest neighbor graph (k=15) from the normalized embeddings using cosine similarity.
    • Convert the kNN graph to an adjacency matrix with edge weights representing similarity scores.
  • Ontology Graph Processing:

    • Download the current Cell Ontology in OWL format.
    • Extract the "is_a" and "part_of" relationships to construct a hierarchical graph structure.
    • Convert the ontology hierarchy to an adjacency matrix where connections represent ontological relationships.
  • Random Walk with Restart Execution:

    • Implement RWR algorithm with restart probability r=0.3 on both the embedding-derived graph and ontology graph.
    • For each cell type, initiate RWR from 10 representative seed cells.
    • Run until convergence (Δ < 1e-6 between iterations) to obtain stable probability distributions.
  • Similarity Calculation:

    • Compute Jensen-Shannon divergence between the RWR probability distributions from the model and ontology for each cell type.
    • Convert divergences to similarity scores using exponential transformation.
    • Calculate final scGraph-OntoRWR score as mean similarity across all cell types.
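The similarity calculation above can be sketched in pure NumPy. The exponential transform from divergence to similarity is one reasonable choice assumed here, not taken verbatim from the cited work:

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2, so bounded in [0, 1])."""
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def scgraph_ontorwr_score(model_dists, onto_dists):
    """Mean exp(-JSD) similarity across cell types; 1.0 = perfect agreement.

    model_dists, onto_dists : matched lists of RWR probability distributions,
    one pair per cell type (embedding-derived graph vs. ontology graph).
    """
    sims = [np.exp(-js_divergence(p, q)) for p, q in zip(model_dists, onto_dists)]
    return float(np.mean(sims))
```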

Validation:

  • Compare scGraph-OntoRWR scores across multiple models and datasets.
  • Perform statistical significance testing using paired t-tests across cell types.
  • Correlate scGraph-OntoRWR scores with traditional biological metrics like marker gene expression.

Protocol: Assessing Misclassification Severity with LCAD

Objective: Evaluate cell type annotation errors in ontologically meaningful terms rather than treating all errors equally.

Materials and Reagents:

  • Cell type predictions from scFM on benchmark dataset
  • Ground truth cell type annotations
  • Cell ontology (e.g., Cell Ontology)
  • Ontology processing toolkit (e.g., pronto)

Procedure:

  • Error Identification:
    • Compare model predictions against ground truth annotations to identify misclassified cells.
    • For each misclassification, record both the true cell type and predicted cell type.
  • Ontological Distance Calculation:

    • For each misclassification pair, find the lowest common ancestor (LCA) in the cell ontology.
    • Calculate the shortest path distance from both the true and predicted cell types to the LCA.
    • Sum these distances to obtain the LCAD value for each error.
  • Score Aggregation:

    • Compute mean LCAD across all misclassifications for a model.
    • Calculate distribution statistics (median, standard deviation) to understand error patterns.
    • Compare LCAD values against random baseline expectation.
  • Biological Interpretation:

    • Categorize errors by severity: low LCAD (ontologically related cell types) vs. high LCAD (distantly related cell types).
    • Identify systematic error patterns that might indicate specific model limitations.

Cell
└── Animal Cell
    ├── Muscle Cell
    │   ├── Cardiac Muscle Cell
    │   └── Skeletal Muscle Cell
    ├── Neuron
    └── T Cell
        ├── Cytotoxic T Cell
        └── T Helper Cell

Diagram Title: Cell Ontology Hierarchy for LCAD Calculation

Application Notes for Drug Discovery and Development

Enhancing Target Identification and Validation

In pharmaceutical research, scGraph-OntoRWR provides a crucial framework for evaluating whether scFMs can correctly identify and represent disease-relevant cell states. When applied to tumor microenvironment data, this metric can verify that models maintain proper distinctions between immune cell subtypes while recognizing their functional relationships [16]. This capability is particularly valuable for identifying novel therapeutic targets within complex tissues, where understanding cellular relationships is essential for predicting on-target effects and potential toxicities.

For example, when analyzing scRNA-seq data from cancer biopsies, researchers can use scGraph-OntoRWR to ensure that models correctly cluster tumor-infiltrating lymphocytes by subtype while maintaining their ontological relationship to broader immune cell classes. A model with high scGraph-OntoRWR scores would be more trustworthy for identifying rare but therapeutically relevant cell populations, such as exhausted T cells or tumor-associated macrophages in specific functional states [16].

Accelerating Drug Repurposing Through Cross-Domain Alignment

Knowledge graphs have emerged as powerful tools for drug repurposing, organizing complex relationships between drugs, targets, diseases, and side effects [58] [59]. The principles underlying scGraph-OntoRWR can be extended to evaluate how well scFMs align with these pharmacological knowledge structures, creating opportunities for drug repurposing through cross-domain knowledge alignment.

By treating drug-disease relationships as a form of ontology, researchers can adapt the scGraph-OntoRWR methodology to assess how well model representations of drug-treated cells reflect known therapeutic mechanisms. For instance, a model that correctly represents that cardiac muscle cells and neurons share distant ontological relationships would be less likely to suggest cardiotoxic compounds for neurological disorders, potentially flagging safety issues earlier in the drug discovery process [59].

Addressing Limitations and Future Directions

While scGraph-OntoRWR represents a significant advance in biological evaluation of scFMs, several limitations remain. The metric depends heavily on the completeness and accuracy of the underlying ontologies, which may have gaps for rare cell types or newly discovered biological relationships [57]. Additionally, current implementations focus primarily on cell type relationships, with less emphasis on functional states or spatial contexts.

Future developments may extend these approaches to incorporate dynamic biological processes, multi-omics integrations, and causal relationship modeling. As noted in expert opinion, "Many popular link prediction algorithms fail to address strong biases in biomedical data, and only highlight biological associations, failing to model causal relationships in complex dynamic biological systems" [58]. Addressing these limitations will further enhance the utility of ontology-informed metrics for evaluating biological insight in computational models.

Single-Cell Data → Foundation Model → Cell Embeddings → Cell-Cell Graph → RWR on Model Graph
Cell Ontology → Ontology Graph → RWR on Ontology
RWR on Model Graph + RWR on Ontology → Similarity Comparison → scGraph-OntoRWR Score

Diagram Title: scGraph-OntoRWR Calculation Workflow

The deployment of single-cell Foundation Models (scFMs) in a zero-shot setting—where models make predictions on novel data without any further task-specific training—is a critical test for their use in biological discovery. This application note synthesizes recent benchmarking studies to evaluate the zero-shot capabilities of leading scFMs, including Geneformer, scGPT, scFoundation, and LangCell. The analysis reveals that while these models hold immense promise for tasks like cell type annotation and batch integration, their zero-shot performance often fails to exceed that of simpler, established baseline methods. Performance is context-dependent, influenced by factors such as pretraining data composition and architectural choices. The following sections provide a detailed comparative analysis, standardized evaluation protocols, and actionable guidance for researchers aiming to incorporate these tools into their workflows.

Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to study cellular heterogeneity at an unprecedented resolution. The analysis of this high-dimensional data presents significant computational challenges, spurring the development of specialized single-cell Foundation Models (scFMs). These models are pre-trained on millions of single-cell gene expression profiles with the goal of learning universal patterns in transcriptional regulation [60] [3]. A key claimed advantage of scFMs is their potential for zero-shot learning—the ability to generalize to new datasets and tasks without requiring additional training data or fine-tuning. This capability is particularly valuable in exploratory biology, where predefined labels for cell types or states may be unavailable [4]. This document assesses the zero-shot performance of several prominent scFMs, providing a framework for their practical application and evaluation.

Model Architectures and Pre-training Strategies

The foundational knowledge of an scFM is largely determined by its architecture and the data and objectives used during pre-training. The table below summarizes the key characteristics of the evaluated models.

Table 1: Architectural and Pre-training Specifications of Leading scFMs

Model Pre-training Data Input Size (Genes) Key Architectural Features Pre-training Objective(s)
Geneformer [61] 29.9M human cells (v1); 95M human cells (v2) 2,048 (v1); 4,096 (v2) Transformer; Rank-value gene encoding; Cell embedding (v1) or CLS token embedding (v2) Masked gene prediction
scGPT [60] [62] 33M non-cancerous human cells (scGPT-human) Full gene set Transformer; Employs batch and condition tokens Masked gene prediction
scFoundation [63] Information missing Information missing Transformer-based Information missing
LangCell [64] Information missing Information missing Language-Cell pre-training framework; Unified representation of single-cell data and natural language Incorporates text descriptions with discriminative and generative objectives
scMMGPT [60] 27M human cells + textual data Full gene set Multimodal (scRNA-seq + text); Bidirectional projectors; Two-stage pre-training Discriminative (cell-text alignment) and Generative (text reconstruction)

A notable trend is the move towards multimodal integration. While earlier models like Geneformer and scGPT rely solely on transcriptomic data, newer approaches like LangCell and scMMGPT explicitly incorporate textual knowledge (e.g., cell type definitions from Wikipedia and OBO Foundry) during pre-training. This aims to ground the model's representations in rich, human-curated biological semantics [60] [64]. Another key differentiator is how models handle the input data; some, like Geneformer, use a fixed subset of genes ranked by expression, whereas others, like scGPT and scMMGPT, are designed to process the full quantitative expression profile to minimize information loss [60].
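Geneformer-style rank-value encoding can be illustrated with a short sketch. This is simplified for illustration: the actual pipeline also normalizes each gene by its median expression across the pretraining corpus before ranking, a step omitted here:

```python
import numpy as np

def rank_value_encode(expression, gene_names, max_len=2048):
    """Order a cell's genes by descending expression and truncate to max_len.

    expression : sequence of counts for one cell
    gene_names : matching list of gene identifiers
    Returns the token sequence of expressed genes, highest expression first.
    """
    expression = np.asarray(expression, float)
    # Stable descending sort so ties keep their original gene order.
    order = np.argsort(-expression, kind="stable")
    tokens = [gene_names[i] for i in order if expression[i] > 0]
    return tokens[:max_len]
```

A cell is thus represented by which genes it expresses most, relative to the rest of its transcriptome, rather than by absolute counts.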

Quantitative Performance Benchmarking

Rigorous benchmarking is essential to understand the real-world utility of these models. The following tables consolidate quantitative results from recent independent evaluations, focusing on zero-shot performance for core tasks in single-cell analysis.

Zero-Shot Cell Type Clustering

Cell type clustering in a zero-shot setting involves using a model's embedding to group cells of the same type without any fine-tuning on the target dataset. Performance is measured by how well the embeddings separate known cell types.

Table 2: Zero-shot Cell Type Clustering Performance (Average BIO Score) [4] [3]

Model / Baseline Pancreas Dataset Tabula Sapiens Dataset Immune Dataset PBMC (12k) Dataset
Highly Variable Genes (HVG) 0.78 0.75 0.72 0.69
Harmony 0.75 0.71 0.70 0.67
scVI 0.74 0.73 0.68 0.68
scGPT 0.65 0.66 0.63 0.71
Geneformer 0.58 0.55 0.52 0.56
Random scGPT 0.51 0.50 0.49 0.52

Key Insights:

  • Simpler methods often outperform scFMs. The heuristic baseline of selecting Highly Variable Genes (HVG) consistently outperformed or matched all foundation models across most datasets [4] [3].
  • Performance is dataset-dependent. scGPT showed competitive performance on the PBMC dataset but lagged on others. The benchmarking also indicated that models do not consistently perform better on datasets that were part of their pre-training corpus [4].
  • Pre-training provides a marginal benefit. While pre-trained scGPT models performed better than a randomly initialized version, this improvement was not sufficient to surpass established baselines like scVI and Harmony in most cases [4].

Zero-Shot Batch Integration

Batch integration evaluates a model's ability to produce embeddings where cells of the same type cluster together, regardless of technical artifacts from different experiments or donors.
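One common formulation of a batch mixing score is the normalized entropy of batch labels within each cell's kNN neighborhood. The sketch below illustrates this family of metrics; the cited benchmark's exact score may differ:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def batch_mixing_entropy(embeddings, batch_labels, k=30):
    """Mean normalized entropy of batch composition in each cell's kNN
    neighborhood: 1.0 = batches perfectly mixed, 0.0 = fully separated."""
    batches, inv = np.unique(batch_labels, return_inverse=True)
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(embeddings).kneighbors(embeddings)
    entropies = []
    for row in idx[:, 1:]:  # drop the query cell itself
        p = np.bincount(inv[row], minlength=len(batches)) / k
        p = p[p > 0]
        entropies.append(-(p * np.log(p)).sum() / np.log(len(batches)))
    return float(np.mean(entropies))
```

An embedding that removes technical batch effects places cells from different batches into shared neighborhoods, driving this score toward 1.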

Table 3: Batch Integration Performance (Batch Mixing Score) [4]

Model / Baseline Pancreas Dataset PBMC Dataset Tabula Sapiens Dataset Immune Dataset
Highly Variable Genes (HVG) 0.85 0.88 0.82 0.80
scVI 0.81 0.85 0.75 0.72
Harmony 0.78 0.80 0.70 0.78
scGPT 0.72 0.75 0.78 0.77
Geneformer 0.45 0.48 0.41 0.43

Key Insights:

  • HVG remains a strong baseline. Once again, the simple HVG method achieved the highest scores, indicating that current scFMs do not inherently learn to remove batch effects more effectively than basic feature selection in a zero-shot setting [4].
  • Geneformer struggles with batch effects. Geneformer's embeddings consistently showed a high proportion of variance explained by batch effects, performing worse than other methods [4].
  • scGPT shows variable performance. scGPT's performance was more competitive, sometimes outperforming scVI or Harmony on datasets with complex biological batch effects (e.g., different donors), though these were also datasets potentially seen during its pre-training [4].

Fine-Tuned Cell Type Annotation

While zero-shot performance is critical for discovery, fine-tuning on labeled data is a common application. The table below shows the performance of Geneformer models after fine-tuning a classifier on their embeddings.

Table 4: Fine-tuned Cell Type Annotation Performance (F1 Score) [61]

Model Cross-Tissue Immune Atlas (LVL1) CITE-seq Yolk Sac (LVL1) CITE-seq Yolk Sac (LVL3 - High Resolution)
Geneformer v1 0.72 0.81 0.21
Geneformer v2 (Base) 0.85 0.89 0.42
Geneformer v2 (Cancer-tuned) 0.86 0.90 0.43

Key Insights:

  • Architectural improvements matter. Geneformer v2, with its larger pre-training corpus, increased input size, and use of a CLS token, significantly outperforms v1, especially in fine-tuned settings [61].
  • High-resolution annotation is challenging. All models perform worse on Level 3 (finer) cell type annotations, but v2 shows a 2x improvement in F1 score over v1, demonstrating its ability to capture more nuanced biological information [61].

Detailed Experimental Protocols

This section outlines standardized protocols for reproducing key benchmarking experiments, enabling researchers to validate model performance on their own datasets.

Protocol 1: Zero-Shot Cell Type Clustering Evaluation

Objective: To assess the quality of a model's cell embeddings for separating known cell types without any fine-tuning.

Materials:

  • Processed single-cell dataset (e.g., from CellxGene) with annotated cell types.
  • Python environment with installed dependencies (see Research Reagent Solutions).

Methodology:

  • Data Preprocessing: Prepare an AnnData object containing the raw or normalized gene expression matrix, with cell type annotations stored in adata.obs.
  • Embedding Generation: Pass the preprocessed data through the target scFM in evaluation mode to extract cell embeddings.
    • For Geneformer: Use the geneformer.get_embeddings() method with emb_mode="cell" (v1) or emb_mode="cls" (v2) [61].
    • For scGPT: Use the model's forward pass to generate the cell embeddings as described in the official documentation [62].
  • Dimensionality Reduction: Apply Principal Component Analysis (PCA) to the embeddings, retaining the top 50 principal components.
  • Clustering and Evaluation: Perform k-nearest neighbors (K-NN) clustering on the PCA-reduced embeddings. Calculate the Average BIO score and Average Silhouette Width (ASW) to quantify cluster purity and separation [4].

Analysis:

  • Compare the scores against established baselines (HVG, scVI, Harmony) run on the same dataset.
  • A higher BIO score (closer to 1) indicates better alignment between clusters and ground-truth cell types.
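A minimal version of this evaluation can be assembled from scikit-learn components. Here k-means stands in for the protocol's K-NN clustering step, and ASW plus the adjusted Rand index stand in for the composite Average BIO score:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

def zero_shot_clustering_eval(embeddings, cell_types, n_pcs=50, seed=0):
    """Score zero-shot cell embeddings against ground-truth cell type labels."""
    n_pcs = min(n_pcs, embeddings.shape[1], embeddings.shape[0] - 1)
    reduced = PCA(n_components=n_pcs, random_state=seed).fit_transform(embeddings)
    k = len(set(cell_types))
    pred = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(reduced)
    return {
        "asw": silhouette_score(reduced, cell_types),  # separation of known types
        "ari": adjusted_rand_score(cell_types, pred),  # cluster/label agreement
    }
```

Running the same function on HVG-, scVI-, or Harmony-derived representations of the same dataset gives the baseline comparison the protocol calls for.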

Annotated Dataset → Data Preprocessing → Generate Cell Embeddings (Zero-shot) → Dimensionality Reduction (PCA) → K-NN Clustering → Calculate Metrics (BIO, ASW) → Compare vs. Baselines

Figure 1: Workflow for zero-shot clustering evaluation.

Protocol 2: Benchmarking Gene Expression Prediction

Objective: To evaluate a model's core understanding of gene-gene relationships by testing its ability to predict the expression of held-out genes.

Materials:

  • As in Protocol 1.

Methodology:

  • Data Splitting: For a given cell, mask the expression values of a random 10% of its genes.
  • Model Inference: Input the cell with masked genes into the model and collect its predictions for the masked values.
  • Performance Calculation: For each masked gene, calculate the Mean Absolute Error (MAE) or Mean Squared Error (MSE) between the predicted and actual expression values. Aggregate these scores across all cells and masked genes in the test set [40] [3].

Analysis:

  • A perfect predictor would have an MAE/MSE of zero. Compare the model's error against a simple baseline, such as predicting the median expression value of each gene across the training set.
  • As noted in benchmarking, some models may struggle, performing only slightly better than this median baseline, particularly for low-to-medium expressed genes [3].
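The median-baseline comparison can be sketched as follows. Here `predict_fn` is a hypothetical interface introduced for illustration; in practice it would wrap the scFM's masked-gene forward pass:

```python
import numpy as np

def masked_gene_error(expr, predict_fn, mask_frac=0.1, seed=0):
    """Mean absolute error of predictions for randomly masked genes.

    expr       : (n_cells, n_genes) expression matrix
    predict_fn : callable(cell_vector, masked_indices) -> predicted values
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = expr.shape
    n_mask = max(1, int(mask_frac * n_genes))
    errs = []
    for i in range(n_cells):
        masked = rng.choice(n_genes, size=n_mask, replace=False)
        preds = np.asarray(predict_fn(expr[i], masked), float)
        errs.append(np.abs(preds - expr[i, masked]).mean())
    return float(np.mean(errs))

def median_baseline(train_expr):
    """Predict each masked gene's median expression over the training set."""
    med = np.median(train_expr, axis=0)
    return lambda cell, masked_idx: med[masked_idx]
```

A model whose error only marginally undercuts `median_baseline` has learned little about gene-gene dependencies beyond per-gene expression levels.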

Input Cell Expression Vector → Mask 10% of Genes → Model Predicts Masked Values → Compute Error (MAE, MSE) → Compare vs. Median Baseline

Figure 2: Workflow for expression prediction benchmarking.

The Scientist's Toolkit: Research Reagent Solutions

The following table details key software and data resources required for working with scFMs.

Table 5: Essential Research Reagents for scFM Evaluation

Reagent / Resource Type Function / Application Source / Reference
CellxGene Database Data Primary source of large-scale, publicly available single-cell data for pre-training and benchmarking. https://cellxgene.cziscience.com/ [60]
scGPT Repository Software Provides code for loading pre-trained weights, generating embeddings, and fine-tuning. https://github.com/bowang-lab/scGPT [62]
Geneformer Repository Software Official implementation of Geneformer. Requires Git LFS to download model weights. https://huggingface.co/ctheodoris/Geneformer [62]
Zero-shot Evaluation Code Software Benchmarking code from Microsoft Research for reproducing cell clustering and batch integration tests. https://github.com/microsoft/zero-shot-scfoundation [62]
Helical Package Software A unified package facilitating easy access and evaluation of various bio-foundation models, including Geneformer. https://github.com/helicalAI/helical [61]
OBO Foundry / Wikipedia Data Sources of structured and free-text biological knowledge for multimodal pre-training (e.g., cell type descriptions). https://obofoundry.org/ [60]

Based on the consolidated findings from recent benchmarks, the following recommendations are provided for researchers and drug development professionals:

  • Temper Expectations for Zero-Shot Discovery: Practitioners should be cautious about using current scFMs for pure zero-shot discovery on critical tasks. Simpler methods like HVG selection, scVI, or Harmony may provide more reliable and interpretable results for tasks like initial clustering and batch correction [4] [3].
  • Validate with Baselines: Always include established baselines in any evaluation pipeline. The superior performance of simple methods in recent benchmarks underscores that model complexity and scale do not automatically translate to better zero-shot performance [63] [4].
  • Consider Fine-Tuning for Specific Tasks: If labeled data is available, fine-tuning can significantly improve performance, as evidenced by the gains seen in Geneformer v2 [61]. scFMs should currently be viewed as powerful base models for transfer learning rather than out-of-the-box discovery engines.
  • Evaluate on Multiple Metrics and Datasets: Model performance is highly dataset-dependent. A comprehensive evaluation should use multiple metrics (e.g., BIO, ASW, batch mixing scores) across diverse biological contexts to build confidence in a model's utility [40] [4].
  • Monitor Multimodal Advances: Emerging models like LangCell and scMMGPT, which integrate textual knowledge, represent a promising direction. They have shown improved performance in cell annotation and better generalization, suggesting that multimodal learning may be key to unlocking more robust zero-shot capabilities [60] [64].

In conclusion, while single-cell foundation models are a rapidly evolving and powerful new class of tools, their application in zero-shot settings requires careful validation. By adhering to standardized benchmarking protocols and maintaining a critical perspective relative to simpler methods, the research community can best leverage these models to drive meaningful biological discovery.

In the rapidly evolving field of single-cell genomics, foundation models (scFMs) pretrained on millions of cells have emerged as powerful tools for extracting biological insights from complex data. These models, including scGPT, Geneformer, and scBERT, leverage transformer architectures to learn universal representations of cellular states [1]. However, their practical application, particularly in zero-shot learning settings where models are applied without task-specific fine-tuning, requires careful consideration of the inherent trade-offs between performance, interpretability, and computational cost. This framework is essential for researchers and drug development professionals who must select appropriate models for discovery-driven research where predefined labels are often unavailable [4].

The evaluation of these trade-offs is critical because, as recent studies indicate, scFMs do not consistently outperform simpler baseline methods in zero-shot settings. In some cases, selecting highly variable genes (HVG) can surpass foundation models in tasks like cell type clustering and batch integration [4]. This application note provides a structured approach to interpreting evaluation results, enabling informed decision-making for biological discovery and therapeutic development.

Quantitative Performance Benchmarking in Zero-Shot Settings

Rigorous evaluation of scFMs against established baselines is crucial for assessing their practical utility. Performance benchmarks should encompass multiple biological and technical contexts to reveal model strengths and limitations. The following metrics and comparisons provide a standardized framework for model assessment.

Performance Metrics and Evaluation Criteria

The table below outlines key metrics for evaluating scFMs across common single-cell analysis tasks:

Task Category Specific Task Key Evaluation Metrics Interpretation Guide
Cell-level Tasks Cell Type Clustering Average BIO (AvgBIO) score, Average Silhouette Width (ASW) Higher scores indicate better separation of known cell types [4].
Batch Integration Principal Component Regression (PCR) score, Batch mixing scores Lower PCR indicates better batch effect removal; higher batch mixing scores indicate better integration [4].
Gene-level Tasks Gene Function Prediction Gene ontology enrichment, Prior knowledge alignment Measures biological relevance of gene embeddings [16].
Clinical Applications Drug Sensitivity Prediction Accuracy, AUC-ROC Model performance in predicting therapeutic responses [16].
Cancer Cell Identification F1 score, Precision-Recall Accuracy in distinguishing malignant from benign cells [16].

Comparative Performance of scFMs and Baselines

Recent benchmarking studies reveal that no single scFM consistently outperforms all others across diverse tasks. The following table summarizes the zero-shot performance of leading scFMs compared to established baseline methods:

| Model / Method | Cell Type Clustering | Batch Integration | Biological Relevance | Key Strengths and Limitations |
|---|---|---|---|---|
| scGPT | Variable performance; outperforms baselines on some datasets (e.g., PBMC 12k) but underperforms on others [4] | Robust on complex datasets with biological batch effects; outperforms Harmony and scVI on the Immune and Tabula Sapiens datasets [4] | Captures meaningful biological insights into the relational structure of genes and cells [16] | Strength: strong across diverse tasks. Limitation: inconsistent zero-shot clustering performance [17] |
| Geneformer | Underperforms HVG, scVI, and Harmony across most datasets and metrics [4] | Consistently ranks last across batch integration metrics; embeddings often retain batch effects [4] | Benefits from effective pretraining strategies for gene-level tasks [17] | Strength: effective pretraining for gene-level tasks. Limitation: poor zero-shot batch integration and cell type clustering [4] |
| scFoundation | Not specifically evaluated for clustering | Not specifically evaluated for batch integration | Demonstrates strong capabilities in gene-level tasks [17] | Strength: gene-level task performance. Limitation: limited evaluation on cell-level tasks |
| scBERT | Limited zero-shot evaluation available | Limited zero-shot evaluation available | Lags behind larger models, likely due to smaller size and limited training data [17] | Strength: architecture design. Limitation: model scale constraints |
| HVG (baseline) | Outperforms Geneformer and scGPT across all metrics [4] | Achieves the best batch integration scores across all datasets [4] | Provides fundamental biological signal | Strength: simple, effective, computationally efficient. Limitation: limited capacity for complex pattern recognition |
| scVI (baseline) | Outperforms the proposed foundation models in cell type clustering [4] | Excellent technical batch-effect correction; challenged by biological variation in the Immune datasets [4] | Captures biologically meaningful variation | Strength: robust probabilistic modeling. Limitation: may overcorrect biological variation |
| Harmony (baseline) | Competitive performance with scFMs [4] | Effective technical integration; challenged by Tabula Sapiens complexity [4] | Preserves biological structure while removing technical artifacts | Strength: fast, efficient integration. Limitation: struggles with highly diverse datasets |

Interpretability Frameworks and Methods

Model interpretability is essential for debugging, establishing trust, and deriving biological insights from scFMs. Various techniques can be applied to understand model decisions and the biological relevance of learned representations.

Interpretability Techniques for Foundation Models

The following table outlines key interpretability methods applicable to scFMs:

| Interpretability Technique | Mechanism | Applicable Tasks | Biological Insights Generated |
|---|---|---|---|
| SHAP (SHapley Additive exPlanations) | Computes feature importance by measuring marginal contribution across feature combinations [65] | Cell type prediction, gene expression prediction, drug response | Identifies genes most influential to specific model predictions [65] |
| Attention mechanism analysis | Analyzes patterns in transformer self-attention weights to identify gene-gene relationships [1] | Gene regulatory network inference, cell state transitions | Reveals potential regulatory relationships and coordinated gene expression patterns [1] |
| Embedding dimensionality reduction | Projects high-dimensional cell embeddings to 2D/3D space using UMAP or t-SNE [4] | Cell type clustering, batch integration assessment | Visualizes cellular heterogeneity and model representation quality [4] |
| Global surrogate models | Trains interpretable models to approximate complex foundation model predictions [65] | Model debugging, feature importance analysis | Provides simplified, interpretable approximations of complex model behavior [65] |
| scGraph-OntoRWR (novel metric) | Measures consistency between cell type relationships in embeddings and prior biological knowledge [16] | Evaluation of biological relevance in embeddings | Quantifies how well the model captures established biological hierarchies [16] |
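Of the techniques above, the global-surrogate approach is the simplest to prototype. The sketch below trains a shallow decision tree to mimic a "black-box" classifier's predictions; both models and the data are synthetic stand-ins for an scFM-based predictor, and the fidelity check is illustrative rather than a prescribed threshold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Global-surrogate sketch: approximate a complex model with an
# interpretable one. Data and labels here are synthetic.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 20))                    # stand-in cell embeddings
y = (X[:, 0] + X[:, 1] > 0).astype(int)           # synthetic cell-type labels

black_box = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
bb_pred = black_box.predict(X)

# Key point: the surrogate is trained on the black box's *predictions*,
# not on the ground-truth labels.
surrogate = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, bb_pred)
fidelity = accuracy_score(bb_pred, surrogate.predict(X))
print(f"surrogate fidelity to black box: {fidelity:.2f}")
```

A high-fidelity surrogate can then be inspected directly (e.g., its split features indicate which genes drive the black-box decisions); a low-fidelity one signals that the simple model cannot capture the foundation model's behavior.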

Interpreting Model Limitations

Interpretability analyses reveal why scFMs may underperform in zero-shot settings. For example, analysis of Geneformer's embeddings shows they often fail to retain sufficient cell type information, with clustering primarily driven by batch effects rather than biological signals [4]. Similarly, investigating attention patterns can reveal whether models focus on biologically plausible gene relationships or spurious technical correlations.

The Lowest Common Ancestor Distance (LCAD) metric provides a biologically-grounded approach to evaluating cell type annotation errors by measuring the ontological proximity between misclassified cell types, with smaller distances indicating more biologically reasonable errors [16].
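The idea behind LCAD can be sketched on a toy ontology fragment: represent the hierarchy as a child-to-parent map and count the edges between two cell types through their lowest common ancestor. The mini-ontology and the exact distance definition here are illustrative assumptions; the cited metric [16] may be formulated differently.

```python
# Hypothetical mini-ontology: child -> parent.
PARENT = {
    "CD4 T cell": "T cell",
    "CD8 T cell": "T cell",
    "T cell": "lymphocyte",
    "B cell": "lymphocyte",
    "lymphocyte": "leukocyte",
    "monocyte": "leukocyte",
    "leukocyte": "cell",
}

def ancestors(node):
    """Path from a node up to the ontology root (inclusive)."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path

def lcad(a, b):
    """Number of edges from a to b through their lowest common ancestor."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    common = set(anc_a) & set(anc_b)
    lca = next(n for n in anc_a if n in common)   # first shared node going up
    return anc_a.index(lca) + anc_b.index(lca)

# Confusing CD4 with CD8 T cells (distance 2) is a more biologically
# reasonable error than confusing CD4 T cells with monocytes (distance 4).
print(lcad("CD4 T cell", "CD8 T cell"))
print(lcad("CD4 T cell", "monocyte"))
```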

Computational Resource Requirements

The scale of scFMs creates significant computational demands throughout the model lifecycle, from pretraining to deployment. Understanding these requirements is essential for practical implementation.

Computational Cost Analysis

| Model | Parameter Count | Pretraining Dataset Size | Inference Memory Requirements | Fine-tuning Efficiency |
|---|---|---|---|---|
| scGPT | ~50 million [16] | 33 million non-cancerous human cells [16] | High (512-dimensional embeddings) [16] | Parameter-efficient methods available (adapters, prefix tuning) [31] |
| Geneformer | ~40 million [16] | 30 million cells [16] | Moderate (256–512-dimensional embeddings) [16] | Requires full fine-tuning in the standard approach |
| UCE | ~650 million [16] | 36 million cells [16] | Very high (1280-dimensional embeddings) [16] | Limited information on efficient fine-tuning |
| scFoundation | ~100 million [16] | 50 million cells [16] | High (3072-dimensional embeddings) [16] | Architecture supports various fine-tuning approaches |
| scBERT | ~6 million [1] | 1.12 million human cells [31] | Lower than larger models | Less computationally intensive fine-tuning |

Efficient Fine-Tuning Strategies

Recent advances in parameter-efficient fine-tuning enable adaptation of scFMs with minimal computational overhead:

  • Adapter-based Approaches: Insert small trainable layers within transformer blocks, training less than 1% of original parameters while maintaining performance [31]
  • Prefix Tuning: Prepends trainable tensors to each transformer block, achieving comparable results to full fine-tuning with 0.1% of parameters [31]
  • Drug-Conditional Adapters: Enable conditioning on unseen modalities (e.g., molecular structures) while preserving biological knowledge from pretraining [31]
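The parameter savings of the adapter approach are easy to see in a minimal numpy sketch of a bottleneck adapter: a small down-project/up-project residual block inserted into a frozen transformer layer. The dimensions below are illustrative assumptions, not taken from any specific scFM implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_bottleneck = 512, 16       # hypothetical hidden and adapter sizes

W_down = rng.normal(0, 0.02, (d_model, d_bottleneck))
W_up = np.zeros((d_bottleneck, d_model))  # zero-init: adapter starts as identity

def adapter(h):
    """Residual bottleneck adapter applied to hidden states h of shape (n, d_model)."""
    return h + np.maximum(h @ W_down, 0.0) @ W_up

h = rng.normal(size=(4, d_model))
out = adapter(h)

# With W_up zero-initialized the adapter is initially a no-op, so inserting
# it does not perturb the pretrained model's behavior.
assert np.allclose(out, h)

# Trainable adapter parameters vs. one frozen feed-forward layer (d -> 4d -> d):
adapter_params = W_down.size + W_up.size   # 2 * 512 * 16 = 16,384
ffn_params = 2 * d_model * 4 * d_model     # 2,097,152
print(f"adapter/ffn parameter ratio: {adapter_params / ffn_params:.3%}")
```

Only `W_down` and `W_up` would be trained, which for these sizes is well under 1% of the feed-forward parameters alone, consistent with the figures reported for adapter-based approaches [31].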

Experimental Protocols for Trade-off Evaluation

Standardized protocols enable consistent evaluation of the trade-offs between performance, interpretability, and computational cost.

Protocol 1: Zero-Shot Cell Type Clustering Assessment

Purpose: Evaluate model performance in discriminating cell types without task-specific training.

Materials:

  • Pretrained model (scGPT, Geneformer, or alternatives)
  • Benchmark dataset with ground truth cell labels (e.g., Tabula Sapiens, Pancreas)
  • Baseline methods (HVG, scVI, Harmony)
  • Computing environment with adequate GPU memory

Procedure:

  • Data Preprocessing:
    • Load target dataset and apply standard normalization
    • Generate cell embeddings using foundation model's zero-shot capability
    • Apply same preprocessing for baseline methods
  • Embedding Generation:

    • For transformer models: forward pass through pretrained network without fine-tuning
    • Extract cell embeddings from model's representation layer
    • Reduce dimensionality using PCA (50 components) for baseline comparisons
  • Clustering and Evaluation:

    • Apply Leiden clustering to embeddings across all methods
    • Calculate AvgBIO and ASW scores against ground truth labels
    • Compare results across methods and datasets
  • Interpretability Analysis:

    • Apply UMAP visualization to embeddings from each method
    • Use SHAP analysis to identify genes driving cluster formation
    • Calculate scGraph-OntoRWR score to assess biological consistency

Interpretation: Models with higher AvgBIO/ASW scores and scGraph-OntoRWR values provide better separation of biologically meaningful cell types. Superior performance of simple baselines may indicate limitations in foundation model pretraining.
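The clustering-and-evaluation step of this protocol can be sketched on synthetic embeddings. In practice the embeddings would come from a pretrained scFM (or PCA on HVGs) and clustering would use Leiden (e.g., via scanpy); here synthetic data and k-means stand in to keep the example dependency-free, and the cluster count and dimensions are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

rng = np.random.default_rng(42)
n_types, n_per, dim = 3, 100, 50

# Synthetic "cell embeddings": three well-separated cell types.
centers = rng.normal(0, 5, (n_types, dim))
emb = np.vstack([c + rng.normal(0, 1, (n_per, dim)) for c in centers])
truth = np.repeat(np.arange(n_types), n_per)

# Cluster the embeddings (Leiden would be used in practice).
labels = KMeans(n_clusters=n_types, n_init=10, random_state=0).fit_predict(emb)

# ASW against ground-truth labels measures how well the embedding separates
# known cell types; ARI measures agreement between clustering and truth.
asw = silhouette_score(emb, truth)
ari = adjusted_rand_score(truth, labels)
print(f"ASW (ground truth): {asw:.2f}, ARI: {ari:.2f}")
```

Running the same evaluation on embeddings from each method (scFM, HVG+PCA, scVI, Harmony) with identical clustering settings makes the scores directly comparable.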

Protocol 2: Batch Integration Capability Assessment

Purpose: Evaluate model ability to remove technical artifacts while preserving biological variation.

Materials:

  • Dataset with known batch effects (e.g., Pancreas dataset with 5 sources)
  • Pretrained foundation models and baseline methods
  • Evaluation metrics (PCR, batch mixing scores)

Procedure:

  • Embedding Generation:
    • Generate cell embeddings using foundation models and baseline methods
    • Ensure consistent gene space alignment across datasets
  • Quantitative Evaluation:

    • Calculate PCR score measuring proportion of variance explained by batch
    • Compute batch mixing scores assessing neighborhood purity
    • Compare metrics across methods
  • Biological Preservation Assessment:

    • Assess whether integrated embeddings maintain separation of known biological groups
    • Compare cell type clustering performance before and after integration

Interpretation: Effective batch correction shows low PCR scores (effective batch removal) while maintaining biological structure. Models that over-correct by removing biological variation should be identified and potentially avoided.
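The PCR calculation in this protocol can be sketched as the variance-weighted R² of regressing each principal component on the batch covariate. This follows the common scIB-style formulation; the benchmark's exact definition may differ, and the data below are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def pcr_score(emb, batch, n_pcs=10):
    """Variance-weighted R^2 of batch regressed onto principal components."""
    pca = PCA(n_components=n_pcs).fit(emb)
    pcs = pca.transform(emb)
    onehot = np.eye(batch.max() + 1)[batch]   # one-hot batch design matrix
    r2 = np.array([
        LinearRegression().fit(onehot, pcs[:, i]).score(onehot, pcs[:, i])
        for i in range(n_pcs)
    ])
    w = pca.explained_variance_ratio_
    return float((w * r2).sum() / w.sum())    # in [0, 1]; lower = less batch signal

rng = np.random.default_rng(0)
n, dim = 200, 30
batch = np.repeat([0, 1], n // 2)
base = rng.normal(size=(n, dim))              # no batch effect
shifted = base + 3.0 * batch[:, None]         # strong additive batch effect

s_base = pcr_score(base, batch)
s_shift = pcr_score(shifted, batch)
print(f"no batch effect: {s_base:.3f}, strong batch effect: {s_shift:.3f}")
```

A well-integrated embedding should score close to the batch-free baseline, but the biological-preservation check in step 3 is still required to catch over-correction.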

Protocol 3: Computational Efficiency Benchmarking

Purpose: Quantify computational resources required for training and inference.

Materials:

  • Standard benchmarking environment (specified GPU, CPU, memory)
  • Model implementations with consistent deep learning framework
  • Timing and memory profiling tools

Procedure:

  • Inference Speed Assessment:
    • Measure time to generate embeddings for standardized dataset (e.g., 10,000 cells)
    • Profile GPU memory usage during inference
    • Compare across model architectures
  • Fine-tuning Efficiency:

    • Implement parameter-efficient fine-tuning methods (adapters, prefix tuning)
    • Measure training time and memory requirements compared to full fine-tuning
    • Assess performance retention with reduced parameter updates
  • Scaling Analysis:

    • Evaluate how inference time scales with increasing dataset size
    • Measure memory requirements for different batch sizes

Interpretation: Models with favorable performance-compute trade-offs enable broader application, particularly in resource-constrained environments. Performance gains of large models must be justified by their computational costs.
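The inference-speed step of this protocol can be sketched with a dummy model. A random projection stands in for a real foundation model's forward pass; in practice you would time the model's own embedding call on GPU and profile memory with the framework's tools (for PyTorch, `torch.cuda.max_memory_allocated`). Sizes here are illustrative.

```python
import time
import numpy as np

rng = np.random.default_rng(0)
n_genes, d_embed = 2000, 512
W = rng.normal(size=(n_genes, d_embed)).astype(np.float32)

def embed(cells):
    """Dummy 'model': project expression profiles to an embedding space."""
    return cells @ W

# Time embedding generation at two dataset sizes to observe scaling.
for n_cells in (1_000, 10_000):
    X = rng.normal(size=(n_cells, n_genes)).astype(np.float32)
    t0 = time.perf_counter()
    emb = embed(X)
    dt = time.perf_counter() - t0
    print(f"{n_cells:>6} cells: {dt * 1e3:.1f} ms, "
          f"embedding size {emb.nbytes / 1e6:.1f} MB")
```

Repeating this loop for each model under identical hardware and batch settings yields the comparable timing and memory figures the protocol calls for.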

Integrated Decision Framework

Selecting the appropriate scFM requires balancing multiple factors based on specific research goals and constraints. The following workflow visualizes the decision process:

  • What is the primary task? Cell-level (clustering, annotation), gene-level (networks, function), or perturbation prediction. For gene-level tasks, Geneformer is recommended for its strong gene-level task performance.
  • Is zero-shot use required (no labels available)?
    • Yes: assess dataset size and diversity. For large, diverse datasets, scGPT is recommended for its strong zero-shot performance across diverse tasks; for small, specific datasets, prefer baseline methods (HVG, scVI, Harmony) with simpler ML models.
    • No: assess computational resources. With high compute, scGPT with parameter-efficient fine-tuning is recommended; with limited compute, prefer the baseline methods.

Essential Research Reagents and Computational Tools

Successful implementation of scFMs requires both computational tools and biological resources. The following table details key components of the research toolkit:

| Tool/Resource | Type | Function | Application Context |
|---|---|---|---|
| BioLLM Framework | Software tool | Unified interface for integrating and evaluating diverse scFMs [17] | Standardized benchmarking and model switching across different architectures |
| CELLxGENE Dataset | Data resource | Curated single-cell datasets with standardized annotations [1] | Pretraining and evaluation of foundation models |
| SHAP (SHapley Additive exPlanations) | Interpretability library | Explains model predictions by quantifying feature importance [65] | Identifying genes driving model decisions and detecting potential biases |
| Parameter-efficient fine-tuning methods | Algorithmic approach | Adapters and prefix tuning for model adaptation with minimal parameters [31] | Adapting foundation models to new tasks with limited data and compute |
| scGraph-OntoRWR | Evaluation metric | Quantifies consistency between embedding relationships and biological knowledge [16] | Assessing biological relevance of learned representations |
| WebAIM Contrast Checker | Accessibility tool | Verifies color contrast ratios for data visualizations [66] | Creating accessible figures that meet WCAG guidelines |

Interpreting the trade-offs between performance, interpretability, and computational cost in single-cell foundation models requires a multifaceted approach. Current evidence suggests that while scFMs show promise in capturing complex biological relationships, their zero-shot performance does not consistently surpass simpler methods across all tasks. Researchers should select models based on specific task requirements, dataset characteristics, and computational constraints, using the structured evaluation framework presented here. As the field evolves, continued benchmarking and development of interpretability methods will be essential for realizing the full potential of foundation models in biological discovery and therapeutic development.

Conclusion

The current generation of single-cell foundation models represents a promising yet maturing technology. While they offer the potential for versatile, generalizable biological insights and have demonstrated success in specific applications like efficient fine-tuning for drug response prediction, rigorous zero-shot evaluations reveal they do not consistently outperform established, simpler methods on core tasks like cell type clustering and batch integration. Their true value appears to be task-dependent, excelling where their learned representations of biological relationships can be leveraged. Future progress hinges on developing more biologically meaningful pretraining objectives, creating standardized and rigorous evaluation frameworks that prioritize zero-shot capability, and improving model interpretability. For researchers and clinicians, this means a pragmatic approach is essential: scFMs are powerful new tools for the arsenal, but their application should be guided by specific task requirements and validated against traditional baselines. Their continued evolution holds the key to unlocking deeper insights into cellular function, disease mechanisms, and accelerating personalized therapeutic discovery.

References